This topic describes a distributed gene analysis solution based on AnalyticDB Spark. The solution uses distributed computing and GPU acceleration to significantly improve the efficiency of large-scale genomic data analysis. It is suitable for scenarios such as gene screening and disease prediction, and helps advance life science research and precision medicine.
Background
Life sciences is a rapidly growing field. Applications based on DNA analysis, from bacterial culture identification in the food industry to rapid cancer diagnosis, are constantly emerging. However, gene analysis also faces many challenges: data volumes are growing rapidly, and traditional single-node tools are no longer sufficient. Many new technologies and methods have been developed and applied to gene sequence analysis, such as Spark, field-programmable gate arrays (FPGAs), and GPU coprocessor acceleration. These technologies enable most life science applications to achieve efficient parallel processing without complex Message Passing Interface (MPI) programming. In addition, Spark's in-memory computing significantly improves analysis efficiency, streamlines workflows, and shortens analysis time, which facilitates new discoveries in scientific research.
The volume of global genomics data is growing at an astonishing rate, doubling roughly every seven months. However, most traditional tools for processing genomics data still run on a single node. These tools lack scalability and cannot keep up with the exponential growth in data volume.
This topic describes how to use the distributed computing capabilities of AnalyticDB Spark to accelerate gene analysis tasks, including gene screening and disease prediction.
Solution introduction
Traditional single-node processing solution
Traditional gene analysis workflows typically rely on command line interface (CLI) toolchains and single-node computing. This approach is suitable for processing small- to medium-scale data, such as data from the 1000 Genomes Project. A typical workflow is as follows:
Tool preparation
Before you start the analysis, install the required tools and dependency libraries. These include basic tools, Python libraries, and more than ten R packages. The following commands install a representative subset:
# Install basic tools (Linux)
# Note: on some Debian/Ubuntu releases, the genetics PLINK package is named plink1.9
sudo apt-get install plink bcftools r-base python3-pip
# Install Python libraries
pip3 install pandas numpy matplotlib pysam
# Install R packages
Rscript -e "install.packages(c('qqman', 'data.table'), repos='https://mirrors.tuna.tsinghua.edu.cn/CRAN/')"
Data preparation
Use BCFtools and PLINK to convert data formats and perform quality control. First, convert the raw data from Variant Call Format (VCF) to the PLINK binary format used for analysis (.bed, .bim, and .fam files). Then, filter out low-quality sites, for example, sites with a minor allele frequency (MAF) below 1% or a missing rate above 5%. Also remove abnormal samples, such as those with sex inconsistencies or close kinship. Finally, split the data by chromosome to enable subsequent parallel processing and improve computing efficiency.
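The commands below sketch this step with PLINK 1.9 and are illustrative only: input.vcf.gz and the output prefixes are placeholder names, and the thresholds mirror the values described above.
# Convert the raw VCF to PLINK binary format (.bed/.bim/.fam)
plink --vcf input.vcf.gz --make-bed --out study
# Quality control: drop variants with MAF < 1% (--maf) or a site missing rate > 5% (--geno)
plink --bfile study --maf 0.01 --geno 0.05 --make-bed --out study_qc
# Flag samples whose reported sex conflicts with the genotype data
plink --bfile study_qc --check-sex
# Split by chromosome so downstream steps can run in parallel
for chr in $(seq 1 22); do
  plink --bfile study_qc --chr "${chr}" --make-bed --out study_qc_chr"${chr}"
done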
Association analysis
Use the PLINK and SAIGE tools to perform single-node association analysis. Run logistic regression models or chi-square tests to calculate the association between each single nucleotide polymorphism (SNP) and the target phenotype, such as stroke risk. Use methods such as Bonferroni correction or false discovery rate (FDR) to correct for multiple testing and reduce the false positive rate.
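As an illustration, a single-node PLINK association run might look like the following. pheno.txt and covar.txt are hypothetical phenotype and covariate files, and study_qc is the placeholder output prefix from the previous step.
# Logistic regression per SNP against a binary phenotype (for example, stroke risk)
# --adjust additionally reports Bonferroni- and FDR-corrected p-values
plink --bfile study_qc --pheno pheno.txt --covar covar.txt --logistic --adjust --out assoc_stroke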
AnalyticDB processing solution
GATK
GATK is a widely used toolkit for genomic data analysis. AnalyticDB Spark supports distributed parallel execution and GPU-accelerated execution of GATK, which significantly improves runtime efficiency.
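GATK ships Spark-enabled versions of several tools, such as MarkDuplicatesSpark. The following is a minimal sketch of a distributed invocation; the input and output paths and the master URL are placeholders, and on AnalyticDB Spark the job is submitted through the service's own job submission interface rather than a self-managed Spark master.
# Run the Spark version of MarkDuplicates on a cluster
gatk MarkDuplicatesSpark \
    -I sample.bam \
    -O sample.markdup.bam \
    -- \
    --spark-runner SPARK \
    --spark-master spark://<master-host>:7077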
DeepVariant
DeepVariant is a deep learning-based tool for detecting genomic variants. Research shows that it outperforms GATK HaplotypeCaller in accuracy and performance.
AnalyticDB Spark supports running DeepVariant on both CPUs and GPUs. This fully utilizes hardware resources to accelerate variant detection tasks.
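For reference, the standalone DeepVariant distribution is typically run through its Docker images. The sketch below is illustrative: the version tag and all paths are placeholders, and the -gpu image variant runs the call_variants step on the GPU.
# CPU run of the one-step wrapper
docker run -v "$PWD":/data google/deepvariant:1.6.1 \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=/data/reference.fasta \
  --reads=/data/sample.bam \
  --output_vcf=/data/sample.vcf.gz \
  --num_shards=16
# GPU run: use the -gpu image and expose a GPU to the container
docker run --gpus 1 -v "$PWD":/data google/deepvariant:1.6.1-gpu \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=/data/reference.fasta \
  --reads=/data/sample.bam \
  --output_vcf=/data/sample.vcf.gz \
  --num_shards=16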
Solution comparison
| Dimension | Traditional solution | AnalyticDB solution |
| --- | --- | --- |
| Data volume | Small to medium data (sample size < 100,000; whole-exome sequencing (WES) or whole-genome sequencing (WGS) data for a single chromosome). | Very large data (sample size ≥ 1,000,000; whole genome). |
| Computing architecture | Single-node or simple multi-core parallel processing. | Distributed cluster (CPU/GPU resource pooling). |
| Development efficiency | Requires manually writing shell or Python scripts to chain tools, which makes debugging complex. | Provides a unified API (Scala/Python) for structured code that is easy to maintain. |
| Performance bottleneck | Single-node I/O and memory limitations. | Network communication and sharding strategy optimization. |
| Feature extensibility | Depends on the tool ecosystem; extending functionality requires developing new tools. | Native support for UDFs, SQL, and machine learning pipelines. |
| Typical scenarios | Single research projects and rapid prototype validation. | Enterprise-level multi-project and production-grade pipelines. |
Performance comparison
Dataset
Dataset: The 1000 Genomes Project is an international research collaboration that built the most detailed catalog of human genetic variation to date, including SNPs, structural variants, and their haplotype contexts. The final phase of the project sequenced more than 2,500 individuals from 26 populations around the world and established a complete set of phased haplotypes covering more than 80 million variants.
Data volume: 20 GB.
GATK algorithm time consumption comparison
Time consumption results
| Step | Traditional single-node solution | AnalyticDB distributed solution (CPU) | AnalyticDB distributed solution (GPU, single card) | AnalyticDB distributed solution (GPU, dual cards) |
| --- | --- | --- | --- | --- |
| MarkDuplicates | 14.37 min | 7.84 min | 1.61 min | 1.60 min |
| BaseRecalibrator | 41.17 min | 4.25 min | 1.53 min | 1.25 min |
| ApplyBQSR | 18.80 min | 8.56 min | 1.31 min | 1.26 min |
| HaplotypeCaller | 200+ min | 11.53 min | 4.31 min | 2.48 min |
Conclusion
With the same CPU resources, the GATK tools run more efficiently under Spark distributed execution: the time required for HaplotypeCaller is reduced by more than 90%.
GPUs further improve GATK performance. At a lower unit cost, variant detection (HaplotypeCaller) runs more than 40 times faster, and the pre-processing steps (MarkDuplicates, BaseRecalibrator, and ApplyBQSR) run 9 to 27 times faster.
DeepVariant algorithm time consumption comparison
Instance types
On CPU: ecs.g7.4xlarge (16 vCPUs, 64 GB of memory).
On GPU: ecs.gn7i-2x.4xlarge (16 vCPUs, 64 GB of memory, and 1 × NVIDIA A10 GPU).
Execution time
DeepVariant variant detection is a three-step process: make_examples, call_variants, and postprocess_variants. The core step, call_variants, can run on a GPU; a sketch of the three stages follows the table.
| Step | On CPU | On GPU |
| --- | --- | --- |
| make_examples | 41.88 min | 41.80 min |
| call_variants | 22.31 min | 2.84 min |
| postprocess_variants | 23.203 s | 23.255 s |
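For context, the sketch below shows the three stages as they appear in the standalone DeepVariant distribution. All paths, the shard count, and the model checkpoint location are placeholders, and make_examples is normally launched once per shard in parallel.
# Stage 1: turn aligned reads into pileup-image examples (CPU-bound; run one process per shard)
/opt/deepvariant/bin/make_examples --mode calling \
  --ref /data/reference.fasta --reads /data/sample.bam \
  --examples /data/examples.tfrecord@16.gz
# Stage 2: score the examples with the deep learning model; this step benefits from a GPU
/opt/deepvariant/bin/call_variants \
  --examples /data/examples.tfrecord@16.gz \
  --outfile /data/call_variants_output.tfrecord.gz \
  --checkpoint /opt/models/wgs/model.ckpt
# Stage 3: convert the model output into a final VCF
/opt/deepvariant/bin/postprocess_variants \
  --ref /data/reference.fasta \
  --infile /data/call_variants_output.tfrecord.gz \
  --outfile /data/sample.vcf.gz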
Conclusion
The call_variants step of the DeepVariant algorithm runs more efficiently on a GPU. The time required is reduced by 87.2%.