All Products
Search
Document Center

Use BWA, GATK, and SAMtools to perform high-performance computing

Last Updated: Sep 03, 2021

This topic uses BWA, GATK, and SAMtools software as an example to show how to perform high-performance computing by using an Elastic High Performance Computing (E-HPC) cluster.

Background information

People have been trying to discover the origin of human life for hundreds of years. Stimulated by that, life sciences thrive. Life sciences cover a wide range of fields, including next-generation sequencing (NGS), third-generation sequencing, biopharmaceuticals, and molecular dynamics. Thanks to gene sequencing, the gene sequences familiar to people are increasing exponentially. However, sequence similarity searches, alignments, variant discoveries on such a large number of genes require complicated data processing and parallel computing. High-performance computing is equipped with powerful computing capabilities. A variety of schedulers are used to improve concurrency and graphics processing units (GPUs) are used to accelerate computing.

This topic takes whole genome sequencing (WGS) and GATK as an example to show how to perform gene sequencing. You can prepare a test based on your business requirements, such as adding quality control and filtering processes for variant discoveries.

When you perform gene sequencing, you can use BWA to build indexes and generate alignments, use SAMtools to sort the alignments, and then use GATK to remove duplicates, recalibrate base quality scores, and discover variants.

  • Burrows-Wheeler Aligner (BWA) is a software package that maps low-divergent sequences against a large reference genome, such as the human genome.

  • Genome Analysis Toolkit (GATK) is a software package for analyzing next-generation DNA sequencing data. It is used to remove duplicates, recalibrate base quality scores, and discover variants.

  • SAMtools is a set of utilities that interact with and post-process short DNA sequence read alignments in the SAM, BAM, and CRAM formats. It supports complex tasks like the viewing, sorting, indexing, data extraction, and format conversion of alignments. It sorts alignments based on SAM flags and tags.

Procedure

  1. Create and log on to a cluster.

    1. Log on to the E-HPC console.

    2. Create a cluster named Genome.

      For more information, see Create a cluster. Set the following parameters:

      • Compute Node: Select ecs.ebmc5s.24xlarge.

      • VNC: Turn on the VNC switch. Then, you can remotely log on to the cloud desktop or app of E-HPC by using the E-HPC console.

    3. Create an ordinary user named Genome.

      For more information, see Create a user.

    4. On the Cluster page of the E-HPC console, find Genome, and click Connect.

    5. In the Connect panel, specify a username, password, and port number for Genome. Then, click Connect via SSH.

  2. Install BWA, GATK, and SAMtools.

    For more information, visit the BWA, GATK, and SAMtools pages of the GitHub official website.

  3. Download and decompress the files of the reference genome, gene sequencing sample 1, and gene sequencing sample 2.

    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_human_g1k_v37.fasta
    
    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/gatk-examples_example1_NA20318_ERR250971_1.filt.fastq.gz
    gunzip gatk-examples_example1_NA20318_ERR250971_1.filt.fastq.gz
    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/gatk-examples_example1_NA20318_ERR250971_2.filt.fastq.gz
    gunzip gatk-examples_example1_NA20318_ERR250971_2.filt.fastq.gz
  4. Build an index.

    1. Build an index that is required during sequence alignment.

      bwa index b37_human_g1k_v37.fasta
    2. Create a FASTA index file.

      samtools faidx b37_human_g1k_v37.fasta 
  5. Align the reads to the reference genome.

    Align the reads to the reference genome. Then, generate alignments in a BAM file to save disk space.

    time bwa mem -t 52 \ -R '@RG\tID:ehpc\tPL:illumina\tLB:library\tSM:b37' b37_human_g1k_v37.fasta \
    NA20318_ERR250971_1.filt.fastq NA20318_ERR250971_2.filt.fastq \
    | samtools view -S -b - > ERR250971.bam
  6. Sort the alignments in ascending order.

    time samtools sort -@ 52 -O bam -o ERR250971.sorted.bam ERR250971.bam
  7. Remove duplicates.

    1. Use MarkDuplicates to remove the reads that are likely artifacts from the PCR amplification.

      time gatk MarkDuplicates \
      -I ERR250971.sorted.bam -O ERR250971.markdup.bam \
      -M ERR250971.sorted.markdup_metrics.txt
    2. Create an index for the sorted BAM file.

      samtools index ERR250971.markdup.bam
  8. Recalibrate base quality scores.

    1. Build the recalibration model and create a recalibration table based on existing SNP and Indels databases.

      time gatk BaseRecalibrator \
      -R b37_human_g1k_v37.fasta \
      -I ERR250971.markdup.bam  \
      --known-sites b37_1000G_phase1.indels.b37.vcf.gz \
      --known-sites b37_Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
      --known-sites b37_dbsnp_138.b37.vcf.gz -O ERR250971.BQSR.table
    2. Adjust base quality scores.

      time gatk ApplyBQSR \
      --bqsr-recal-file ERR250971.BQSR.table \
      -R b37_human_g1k_v37.fasta \
      -I ERR250971.markdup.bam \
      -O ERR250971.BQSR.bam
    3. Create an index for the sorted BAM file.

      time samtools index ERR250971.BQSR.bam
  9. Discover variants.

    1. Generate a DICT file for the reference genome.

      time gatk CreateSequenceDictionary  \
      -R b37_human_g1k_v37.fasta \
      -O b37_human_g1k_v37.dict
    2. Generate a VCF file for the variant discovery output.

      time gatk HaplotypeCaller \
      -R b37_human_g1k_v37.fasta  \
      -I ERR250971.BQSR.bam \
      -O ERR250971.HC.vcf.gz

Time consumed for the test

In this topic, ecs.ebmc5s.24xlarge is selected as the instance type of the compute nodes to test the process of WGS. The following table lists the time consumed in each step, not the minimum time consumed for WGS.
You can optimize the sequencing process based on your needs to improve the sequencing performance. For example, you can adjust the scheduler to improve concurrency during chromosome splitting, use GATK to improve sequencing efficiency, or use FPGA to accelerate sequencing. For technical support, submit a ticket.

Step

Operation

Consumed time (minutes)

bwa index

Indexing

50.60

bwa mem

Aligning

21.98

sort

Sorting

1.62

MarkDuplicates

Duplicate removal

21.60

BaseRecalibrator

Recalibration table creation

38.47

ApplyBQSR

Recalibration

17.22

CreateSequenceDictionary

Dictionary creation

0.18

HaplotypeCaller

Variant discovery

175.93