This topic uses BWA, GATK, and SAMtools software as an example to show how to perform high-performance computing by using an Elastic High Performance Computing (E-HPC) cluster.
Background information
People have been trying to discover the origin of human life for hundreds of years. Stimulated by that, life sciences thrive. Thanks to gene sequencing, the gene sequences familiar to people are increasing exponentially. However, sequence similarity searches, alignments, variant discoveries on such a large number of genes require complicated data processing and parallel computing. High-performance computing is equipped with powerful computing capabilities. A variety of schedulers are used to improve concurrency and graphics processing units (GPUs) are used to accelerate computing.
This topic takes whole genome sequencing (WGS) and GATK as an example to show how to perform gene sequencing. You can prepare a test based on your business requirements, such as adding quality control and filtering processes for variant discoveries.
When you perform DNA sequencing, you can use the Burrows-Wheeler Alignment (BWA) tool to build indexes and generate alignments, use SAMtools to sort the alignments, and then use the Genome Analysis Toolkit (GATK) to remove duplicates, recalibrate base quality scores, and discover variants.
Burrows-Wheeler Aligner (BWA) is a software package that maps low-divergent sequences against a large reference genome, such as the human genome.
Genome Analysis Toolkit (GATK) is a software package for analyzing next-generation DNA sequencing data. It is used to remove duplicates, recalibrate base quality scores, and discover variants.
SAMtools is a set of utilities that interact with and post-process short DNA sequence read alignments in the SAM, BAM, and CRAM formats. It supports complex tasks like the viewing, sorting, indexing, data extraction, and format conversion of alignments. It sorts alignments based on SAM flags and tags.
Procedure
Create and log on to a cluster.
Log on to the E-HPC console.
Create a cluster named Genome.
For more information, see Create a cluster. Set the following parameters:
Compute Node: Select ecs.ebmc5s.24xlarge.
VNC: Turn on the VNC switch. Then, you can remotely log on to the cloud desktop or app of E-HPC by using the E-HPC console.
Create an ordinary user named Genome.
For more information, see Create a user.
On the Cluster page of the E-HPC console, find Genome, and click Connect.
In the Connect panel, specify a username, password, and port number for Genome. Then, click Connect via SSH.
Install BWA, GATK, and SAMtools.
For more information, visit the BWA, GATK, and SAMtools pages of the GitHub official website.
Download and decompress the files of the reference genome, gene sequencing sample 1, and gene sequencing sample 2.
wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_human_g1k_v37.fasta wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/gatk-examples_example1_NA20318_ERR250971_1.filt.fastq.gz gunzip gatk-examples_example1_NA20318_ERR250971_1.filt.fastq.gz wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/gatk-examples_example1_NA20318_ERR250971_2.filt.fastq.gz gunzip gatk-examples_example1_NA20318_ERR250971_2.filt.fastq.gz
Build an index.
Build an index that is required during sequence alignment.
bwa index b37_human_g1k_v37.fasta
Create a FASTA index file.
samtools faidx b37_human_g1k_v37.fasta
Align the reads to the reference genome.
Align the reads to the reference genome. Then, generate alignments in a BAM file to save disk space.
time bwa mem -t 52 \ -R '@RG\tID:ehpc\tPL:illumina\tLB:library\tSM:b37' b37_human_g1k_v37.fasta \ NA20318_ERR250971_1.filt.fastq NA20318_ERR250971_2.filt.fastq \ | samtools view -S -b - > ERR250971.bam
Sort the alignments in ascending order.
time samtools sort -@ 52 -O bam -o ERR250971.sorted.bam ERR250971.bam
Remove duplicates.
Use MarkDuplicates to remove the reads that are likely artifacts from the PCR amplification.
time gatk MarkDuplicates \ -I ERR250971.sorted.bam -O ERR250971.markdup.bam \ -M ERR250971.sorted.markdup_metrics.txt
Create an index for the sorted BAM file.
samtools index ERR250971.markdup.bam
Recalibrate base quality scores.
Build the recalibration model and create a recalibration table based on existing SNP and Indels databases.
time gatk BaseRecalibrator \ -R b37_human_g1k_v37.fasta \ -I ERR250971.markdup.bam \ --known-sites b37_1000G_phase1.indels.b37.vcf.gz \ --known-sites b37_Mills_and_1000G_gold_standard.indels.b37.vcf.gz \ --known-sites b37_dbsnp_138.b37.vcf.gz -O ERR250971.BQSR.table
Adjust base quality scores.
time gatk ApplyBQSR \ --bqsr-recal-file ERR250971.BQSR.table \ -R b37_human_g1k_v37.fasta \ -I ERR250971.markdup.bam \ -O ERR250971.BQSR.bam
Create an index for the sorted BAM file.
time samtools index ERR250971.BQSR.bam
Discover variants.
Generate a DICT file for the reference genome.
time gatk CreateSequenceDictionary \ -R b37_human_g1k_v37.fasta \ -O b37_human_g1k_v37.dict
Generate a VCF file for the variant discovery output.
time gatk HaplotypeCaller \ -R b37_human_g1k_v37.fasta \ -I ERR250971.BQSR.bam \ -O ERR250971.HC.vcf.gz
Time consumed for the test
Step | Operation | Consumed time (minutes) |
---|---|---|
bwa index | Indexing | 50.60 |
bwa mem | Aligning | 21.98 |
sort | Sorting | 1.62 |
MarkDuplicates | Duplicate removal | 21.60 |
BaseRecalibrator | Recalibration table creation | 38.47 |
ApplyBQSR | Recalibration | 17.22 |
CreateSequenceDictionary | Dictionary creation | 0.18 |
HaplotypeCaller | Variant discovery | 175.93 |