All Products
Search
Document Center

Elastic High Performance Computing:Use BWA, GATK, and SAMtools to perform gene sequencing

Last Updated:Jul 27, 2023

This topic describes how to run Burrows-Wheeler Alignment (BWA), Genome Analysis Toolkit (GATK), and SAMtools software in an Elastic High Performance Computing (E-HPC) cluster to perform gene sequencing.

Background information

Fast-developing gene sequencing technology is helping people discover more gene sequences. However, several operations that involve a large number of genes, such as sequence similarity searches, alignments, and variant discoveries, require complicated data processing techniques and parallel computing. High-performance computing provides powerful computing capabilities. This process uses various types of schedulers to improve concurrency and graphics processing units (GPUs) to accelerate computing.

This topic describes how to perform gene sequencing by using whole genome sequencing (WGS) and GATK. You can prepare a test based on your business requirements. For example, you can add quality control and filtering processes for variant discoveries.

When you perform gene sequencing, you can use the BWA tool to build indexes and generate alignments, use SAMtools to sort the alignments, and then use GATK to remove duplicates, recalibrate base quality scores, and discover variants.

  • BWA is a software package that maps low-divergent sequences against a large reference genome, such as the human genome.

  • GATK is a software package that is used to analyze next-generation DNA sequencing data. It is used to remove duplicates, recalibrate base quality scores, and discover variants.

  • SAMtools is a set of utilities that interact with and post-process short DNA sequence read alignments in the SAM, BAM, and CRAM formats. SAMtools supports complex tasks such as the viewing, sorting, indexing, data extraction, and format conversion of alignments. SAMtools also sorts alignments based on SAM flags and tags.

Preparations

  1. Create an E-HPC cluster. For more information, see Create a cluster by using the wizard.

    Configure the parameters of the E-HPC cluster. The following table describes the parameters.

    Parameter

    Description

    Hardware settings

    Deploy a cluster in Tiny mode that contains one management node and one compute node. The management node and compute node must have the following specifications:

    • Management node: The management node must be an ecs.c7.large Elastic Compute Service (ECS) instance that has 2 vCPUs and 4 GiB of memory.

    • Compute node: The compute node must be an ecs.ebmc5s.24xlarge ECS instance that has 96 vCPUs and 192 GiB of memory.

    Software settings

    Deploy a CentOS 7.6 public image and the PBS scheduler. Turn on VNC.

  2. Create a cluster user. For more information, see Create a user.

    The user is used to log on to the cluster, compile LAMMPS, and submit jobs. In this example, the following configurations are used to create the user:

    • Username: testuser.

    • User group: sudo permission group.

  3. Install software.

    • Install BWA. For more information, visit BWA.

    • Install GATK. For more information, visit GATK.

    • Install SAMtools. For more information, visit SAMtools.

Step 1: Connect to the cluster

Use one of the following methods to connect to the cluster:

  • Use an E-HPC client to log on to a cluster

    The scheduler of the cluster must be PBS. Make sure that you have downloaded and installed an E-HPC client and deployed the environment required for the client. For more information, see Deploy an environment for an E-HPC client.

    1. Start and log on to your E-HPC client.

    2. In the left-side navigation pane, click Session Management.

    3. In the upper-right corner of the Session Management page, click terminal to open the Terminal window.

  • Log on to a cluster in the E-HPC console

    1. Log on to the E-HPC console.

    2. In the upper-left corner of the top navigation bar, select a region.

    3. In the left-side navigation pane, click Cluster.

    4. On the Cluster page, find the cluster and click Connect.

    5. In the Connect panel, enter a username and a password, and click Connect via SSH.

Step 2: Perform gene sequencing

  1. Log on to the compute node.

    In this example, the compute000 node is used:

    ssh compute000
  2. (Optional) Download tmux and create a tmux session.

    Note

    The gene sequencing process includes indexing, alignment, sorting, duplicate removal, dictionary creation, recalibration table creation, recalibration, and variant discovery. This process requires 6 to 7 hours. We recommend that you perform the process in a tmux session to avoid interruptions due to ECS disconnection. For more information about the duration of each step, see the "Duration of gene sequencing" section in this topic.Duration of gene sequencing

    sudo yum install tmux
    tmux
  3. Prepare the gene files that are required for gene sequencing.

    1. Download and decompress the GZip files of the reference genome, gene sequencing sample 1, and gene sequencing sample 2.

      wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_human_g1k_v37.fasta
      wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_1000G_phase1.indels.b37.vcf
      wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_Mills_and_1000G_gold_standard.indels.b37.vcf
      wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/b37_dbsnp_138.b37.vcf.zip
      unzip b37_dbsnp_138.b37.vcf.zip
      
      wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/gatk-examples_example1_NA20318_ERR250971_1.filt.fastq.gz
      gunzip gatk-examples_example1_NA20318_ERR250971_1.filt.fastq.gz
      
      wget https://public-ehpc-package.oss-cn-hangzhou.aliyuncs.com/lifescience/gatk-examples_example1_NA20318_ERR250971_2.filt.fastq.gz
      gunzip gatk-examples_example1_NA20318_ERR250971_2.filt.fastq.gz
    2. Compress and index the gene files.

      1. Compress the files with BGZip.

        bgzip -f b37_1000G_phase1.indels.b37.vcf
        bgzip -f b37_Mills_and_1000G_gold_standard.indels.b37.vcf
        bgzip -f b37_dbsnp_138.b37.vcf
      2. Build an index with Tabix.

        tabix -p vcf  b37_1000G_phase1.indels.b37.vcf.gz
        tabix -p vcf  b37_Mills_and_1000G_gold_standard.indels.b37.vcf.gz
        tabix -p vcf  b37_dbsnp_138.b37.vcf.gz
  4. Build an index.

    1. Build an index that is required for sequence alignment.

      time bwa index b37_human_g1k_v37.fasta

      The following code shows an example of the returned output:

      bwa index..png
    2. Create a FASTA index file.

      samtools faidx b37_human_g1k_v37.fasta 
  5. Align the reads to the reference genome.

    Align the reads to the reference genome. Then, generate alignments in a BAM file to maximize disk space.

    time bwa mem -t 52 \
    -R '@RG\tID:ehpc\tPL:illumina\tLB:library\tSM:b37' \
    b37_human_g1k_v37.fasta \
    gatk-examples_example1_NA20318_ERR250971_1.filt.fastq \
    gatk-examples_example1_NA20318_ERR250971_2.filt.fastq | samtools view -S -b - > ERR250971.bam

    The following code shows an example of the returned output:

    bwa men..png
  6. Sort the alignments in ascending order.

    time samtools sort -@ 52 -O bam -o ERR250971.sorted.bam ERR250971.bam
  7. Remove duplicates.

    1. Use MarkDuplicates to remove the reads that are artifacts generated during PCR amplification.

      time gatk MarkDuplicates \
      -I ERR250971.sorted.bam -O ERR250971.markdup.bam \
      -M ERR250971.sorted.markdup_metrics.txt

      The following code shows an example of the returned output:

      BWA-4.png
    2. Create an index for the sorted BAM file.

      samtools index ERR250971.markdup.bam
  8. Generate a DICT file for the reference genome.

    time gatk CreateSequenceDictionary  \
    -R b37_human_g1k_v37.fasta \
    -O b37_human_g1k_v37.dict
  9. Recalibrate base quality scores.

    1. Build the recalibration model and create a recalibration table based on existing SNP and Indels databases.

      time gatk BaseRecalibrator \
      -R b37_human_g1k_v37.fasta \
      -I ERR250971.markdup.bam  \
      --known-sites b37_1000G_phase1.indels.b37.vcf.gz \
      --known-sites b37_Mills_and_1000G_gold_standard.indels.b37.vcf.gz \
      --known-sites b37_dbsnp_138.b37.vcf.gz -O ERR250971.BQSR.table

      The following code shows an example of the returned output:

      BWA-6.png
    2. Adjust base quality scores.

      time gatk ApplyBQSR \
      --bqsr-recal-file ERR250971.BQSR.table \
      -R b37_human_g1k_v37.fasta \
      -I ERR250971.markdup.bam \
      -O ERR250971.BQSR.bam

      The following code shows an example of the returned output:

      BWA-7.png
    3. Create an index for the sorted BAM file.

      time samtools index ERR250971.BQSR.bam
  10. Generate a VCF file for the variant discovery output.

    time gatk HaplotypeCaller \
    -R b37_human_g1k_v37.fasta  \
    -I ERR250971.BQSR.bam \
    -O ERR250971.HC.vcf.gz

    The following code shows an example of the returned output:

    BWA-8.png

Duration of gene sequencing

In this topic, ecs.ebmc5s.24xlarge is selected as the instance type of the compute nodes to test the process of WGS. The entire process may take 6 to 7 hours. The following table describes the estimated time that is required for each step.

Step

Operation

Duration (minutes)

bwa index

Indexing

54

bwa mem

Alignment

25

samtools sort

Sorting

2

gatk MarkDuplicates

Duplicate removal

13

gatk CreateSequenceDictionary

Dictionary creation

0.15

gatk BaseRecalibrator

Recalibration table creation

40

gatk ApplyBQSR

Recalibration

21

gatk HaplotypeCaller

Variant discovery

211