All Products
Search
Document Center

Elastic High Performance Computing:Use BWA, GATK, and SAMtools to perform gene sequencing

Last Updated:Apr 23, 2025

This topic describes how to run Burrows-Wheeler Alignment (BWA), Genome Analysis Toolkit (GATK), and SAMtools software in an Elastic High Performance Computing (E-HPC) cluster to perform gene sequencing.

Background information

Fast-developing gene sequencing technology is helping people discover more gene sequences. However, operations that involve a large number of genes, such as sequence similarity search, alignment, and variant discovery, require heavy-load data processing and parallel computing. High-performance computing provides powerful computing capabilities. This process uses various types of schedulers to improve concurrency and graphics processing units (GPUs) to accelerate computing.

This topic describes how to perform gene sequencing by using the widely used whole genome sequencing (WGS) process in combination with GATK. You can prepare a test based on your business requirements. For example, you can add quality control and filtering processes for variant discoveries.

When you perform gene sequencing, you can use the BWA tool to build indexes and generate alignments, use SAMtools to sort the alignments, and then use GATK to remove duplicates, recalibrate base quality scores, and discover variants.

  • BWA is a software package that maps low-divergent sequences against a large reference genome, such as the human genome.

  • GATK is a next-generation software package that is used to analyze DNA sequencing data. It is used to remove duplicates, recalibrate base quality scores, and discover variants.

  • SAMtools is a set of utilities that interact with and post-process short DNA sequence read alignments in the SAM, BAM, and CRAM formats. SAMtools supports complex tasks such as the viewing, sorting, indexing, data extraction, and format conversion of alignments. SAMtools also sorts alignments based on SAM flags and tags.

Preparations

  1. Create an E-HPC cluster. For more information, see Create a standard cluster.

    In this example, the following configurations are used for the cluster:

    Item

    Configuration

    Series

    Standard Edition

    Deployment Mode

    Public cloud cluster

    Cluster Type

    SLURM

    Node configurations

    One management node, one logon node, and one compute node with the following specifications:

    • Management node: Elastic Compute Service (ECS) instance of the ecs.c7.large type and with 2 vCPUs and 4 GiB memory

    • Logon node: ECS instance of the ecs.c7.large type and with 2 vCPUs and 4 GiB memory

    • Compute node: ECS instance of the ecs.c7.16xlarge type and with 64 vCPUs and 256 GiB memory

    Image

    CentOS 7.6 public image

  2. Create a cluster user. For more information, see Manage users.

    The user is used to log on to the cluster, compile software, and submit jobs. The following settings are used in this example:

    • Username: testuser

    • User group: sudo permissions group

  3. Install software.

    The following software versions are used in this example:

    Software

    Version

    Download URL

    BWA

    4.2.0.0

    BWA

    GATK

    0.7.17

    GATK

    Samtools

    1.14

    Samtools

Step 1: Connect to the E-HPC cluster

  1. Connect to the cluster. For more information, see Connect to a cluster.

  2. Run the following command to create a directory for job execution. In this example, /home/data/human is used.

    Note

    To ensure sufficient storage space for data downloading and processing, we recommend that use a directory under the mounted shared directory /home.

    sudo mkdir -p /home/data/human

Step 2: Download and pre-process data

  1. Run the following command to create a job script named data_pre.slurm and open the script:

    sudo vim data_pre.slurm

    Sample script:

    #!/bin/bash -l
    #SBATCH -J data_pre
    #SBATCH --partition comp
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH --output=data_pre.log
    
    cd /home/data/human/
    
    # Download and decompress the files of the reference genome, gene sequencing sample 1, and gene sequencing sample 2.
    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_human_g1k_v37.fasta
    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_1000G_phase1.indels.b37.vcf
    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_Mills_and_1000G_gold_standard.indels.b37.vcf
    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_dbsnp_138.b37.vcf.zip
    unzip b37_dbsnp_138.b37.vcf.zip
    
    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/gatk-examples_example1_NA20318_ERR250971_1.filt.fastq.gz
    gunzip gatk-examples_example1_NA20318_ERR250971_1.filt.fastq.gz
    
    wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/gatk-examples_example1_NA20318_ERR250971_2.filt.fastq.gz
    gunzip gatk-examples_example1_NA20318_ERR250971_2.filt.fastq.gz
    
    # Compress the files with BGZip.
    bgzip -f b37_1000G_phase1.indels.b37.vcf
    bgzip -f b37_Mills_and_1000G_gold_standard.indels.b37.vcf
    bgzip -f b37_dbsnp_138.b37.vcf
    
    # Build an index with Tabix.
    tabix -p vcf  b37_1000G_phase1.indels.b37.vcf.gz
    tabix -p vcf  b37_Mills_and_1000G_gold_standard.indels.b37.vcf.gz
    tabix -p vcf  b37_dbsnp_138.b37.vcf.gz
  2. Run the following command to submit the job:

    sbatch data_pre.slurm

    Sample output for successful submission:

    image

  3. Run the following command to view the job status:

    squeue
  4. After the job is executed, run the following command to view the data_pre.log file:

    cat data_pre.log

Step 3: Build a reference genome index

  1. Run the following command to create a job script named bwa_index.slurm and open the script:

    sudo vim bwa_index.slurm

    Sample script:

    #!/bin/bash -l
    #SBATCH -J bwa_index
    #SBATCH --partition comp
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH --output=bwa_index.log
    
    cd /home/data/human/
    
    # Build an index that is required during sequence alignment.
    time bwa index b37_human_g1k_v37.fasta
    
    # Create a FASTA index file.
    samtools faidx b37_human_g1k_v37.fasta
  2. Run the following command to submit the job:

    sbatch --dependency=afterok:jobid1 bwa_index.slurm
  3. Run the following command to view the job status:

    squeue
  4. After the job is executed, run the following command to view the bwa_index.log file:

    cat bwa_index.log

Step 4: Perform gene sequencing

  1. Run the following command to create a job script named wgs_main_process.slurm and open the script:

    sudo vim wgs_main_process.slurm

    Sample script:

    Important

    Replace compute000 with the actual running node of the cluster for the #SBATCH -w compute000 field in the sixth line.

    #!/bin/bash -l
    #SBATCH -J wgs_main_process
    #SBATCH --partition comp
    #SBATCH -N 1
    #SBATCH -n 64
    #SBATCH -w compute000
    #SBATCH --output=wgs_main_process.log
    
    cd /home/data/human/
    
    # Align the reads to the reference genome. Then, generate alignments in a BAM file to minimize disk usage.
    time bwa mem -t 64 \
    -R '@RG\tID:ehpc\tPL:illumina\tLB:library\tSM:b37' \
    b37_human_g1k_v37.fasta \
    gatk-examples_example1_NA20318_ERR250971_1.filt.fastq \
    gatk-examples_example1_NA20318_ERR250971_2.filt.fastq | samtools view -S -b - > ERR250971.bam
    
    # Sort the alignments in ascending order.
    time samtools sort -@ 64 -O bam -o ERR250971.sorted.bam ERR250971.bam
    
    # Use MarkDuplicates to remove the reads that are likely artifacts from the PCR amplification.
    time gatk MarkDuplicates \
    -I ERR250971.sorted.bam -O ERR250971.markdup.bam \
    -M ERR250971.sorted.markdup_metrics.txt
    
    # Create an index for the sorted BAM file.
    samtools index ERR250971.markdup.bam
    
    # Generate a DICT file for the reference genome.
    time gatk CreateSequenceDictionary  \
    -R b37_human_g1k_v37.fasta \
    -O b37_human_g1k_v37.dict
    
    # Build the recalibration model and create a recalibration table based on existing SNP and Indels databases.
    time gatk BaseRecalibrator \
    -R b37_human_g1k_v37.fasta \
    -I ERR250971.markdup.bam  \
    --known-sites b37_1000G_phase1.indels.b37.vcf.gz \
    --known-sites b37_Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
    --known-sites b37_dbsnp_138.b37.vcf.gz -O ERR250971.BQSR.table
    
    # Adjust base quality scores.
    time gatk ApplyBQSR \
    --bqsr-recal-file ERR250971.BQSR.table \
    -R b37_human_g1k_v37.fasta \
    -I ERR250971.markdup.bam \
    -O ERR250971.BQSR.bam
    
    # Create an index for the sorted BAM file.
    time samtools index ERR250971.BQSR.bam
    
    # Generate a VCF file for the variant discovery output.
    time gatk HaplotypeCaller \
    -R b37_human_g1k_v37.fasta  \
    -I ERR250971.BQSR.bam \
    -O ERR250971.HC.vcf.gz
  2. Run the following command to submit the job:

    sbatch --dependency=afterok:jobid2 wgs_main_process.slurm
  3. Run the following command to view the job status:

    squeue
  4. After the job is executed, run the following command to view the wgs_main_process.log file:

    cat wgs_main_process.log

Duration of gene sequencing

In this topic, ecs.g7.16xlarge is selected as the instance type of the compute node to test the process of WGS. The entire process may take six to seven hours. The following table describes the estimated time that is required for each step:

Process

Description

Duration (minutes)

bwa index

Indexing

54

bwa mem

Alignment

25

samtools sort

Sorting

2

gatk MarkDuplicates

Duplicate removal

13

gatk CreateSequenceDictionary

Dictionary creation

0.15

gatk BaseRecalibrator

Recalibration table creation

40

gatk ApplyBQSR

Recalibration

21

gatk HaplotypeCaller

Variant discovery

211