Batch Compute: Best practices for GTX-FPGA

Last updated: Feb 20, 2024

Overview

Developed by GTX-Laboratory, GTX-FPGA is a heterogeneous acceleration tool that combines CPUs and field-programmable gate arrays (FPGAs) to speed up the analysis of whole genome sequencing data, using the strengths of each processor type for high-performance computing on genetic data. GTX-FPGA shortens the analysis of 30X whole genome sequencing data from 30 hours to 30 minutes, and of 100X whole exome sequencing data from 6 hours to 5 minutes.

GTX-FPGA provides four analysis commands: index building (index), genome alignment (align), variant calling (vc), and whole genome sequencing (wgs), which combines genome alignment and variant calling in a single process. The GTX one process mentioned later in this topic refers to the whole genome sequencing (wgs) process.

This topic describes how to use GTX-FPGA in Alibaba Cloud Batch Compute to run analysis jobs of whole genome sequencing data and whole exome sequencing data with a few clicks.

Constraints

  • GTX-FPGA supports only instances of the f3 instance family in Alibaba Cloud Elastic Compute Service (ECS). Each instance must be equipped with an SSD, and the required SSD capacity depends on the size of the input FASTQ files. For genome alignment (align), the required capacity is the sum of the two FASTQ file sizes multiplied by 2. For example, if FASTQ1 is 40 GB and FASTQ2 is 42 GB, the required SSD capacity is (40 + 42) × 2 = 164 GB. For whole genome sequencing (wgs), the required capacity is the sum of the original data size and the size of the calculation result. For example, if the original 30X whole genome sequencing data is 100 GB and the calculation result is 150 GB, the required SSD capacity is 100 + 150 = 250 GB. For human genomes, you can use the default data disk size from the demo in this topic.

  • GTX-FPGA is available for testing only in the China (Beijing) region.

  • GTX-FPGA is in public preview. During the public preview, GTX-FPGA is free of charge. You are charged only for the instances that are required for jobs and resource storage.
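The SSD sizing rules in the first constraint can be written out as a short calculation. The following is a minimal sketch that only restates the two formulas given above; the function names are illustrative and are not part of GTX-FPGA or the Batch Compute SDK.

```python
def align_ssd_gb(fastq1_gb, fastq2_gb):
    """SSD capacity for align: (FASTQ1 size + FASTQ2 size) * 2."""
    return (fastq1_gb + fastq2_gb) * 2


def wgs_ssd_gb(original_gb, result_gb):
    """SSD capacity for wgs: original data size + calculation result size."""
    return original_gb + result_gb


if __name__ == "__main__":
    # The two worked examples from the constraint above.
    print(align_ssd_gb(40, 42))   # 164
    print(wgs_ssd_gb(100, 150))   # 250
```

Sizes are in GB. In practice you may want to round the result up and leave some headroom when you choose the data disk size.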

Prerequisites

  • You are logged on to the Alibaba Cloud Management Console, and the account balance is sufficient.

  • The Batch Compute service is activated to analyze data.

  • Object Storage Service (OSS) is activated so that you can upload your sequencing data and save the analysis results, and a bucket is created. In this example, the bucket is named gtx-wgs-demo.

  • An AccessKey pair is created for your Alibaba Cloud account and can be viewed. If you use a RAM user, make sure that the RAM user has permissions on Batch Compute and OSS. For more information, see Quick start. Copy the AccessKey ID and AccessKey secret for subsequent use. In this example, the AccessKey ID LTAI8xxxxx and the AccessKey secret vVGZVE8qUNjxxxxxxxx are used.

Procedure

GTX-FPGA can run jobs in workflow description language (WDL) mode and in directed acyclic graph (DAG) mode. The following table describes the parameters of each GTX command.

1 GTX command format

Command   Parameter                  Description

index     -f                         Forcibly overwrite an existing index file.
          -h                         Print the help information.
          -m                         Path for intermediate temporary files. Default: /ssd-cache.
          --disable-gtx-index        Disable the index for gtx.
          --disable-bwa-index        Disable the index for bwa.
          --enable-bwa2-index        Enable the index for mem2.

align     -o                         Output BAM file.
          -R                         Read group header. Default: "'@RG\\tID:foo\\tSM:bar'\n".
          -A                         Match score. Default: 1.
          -B                         Mismatch penalty. Default: 4.
          -E                         Gap extension penalty. Default: 1.
          -t                         Number of threads. Default: 32 (best performance in all-in-one).
          --bwa                      With this parameter, the accuracy of the alignment results is comparable to that of BWA-mem.
          --disable-mark-duplicate   Disable duplicate marking.

wgs       -o                         Output VCF file.
          -b                         Output BAM file.
          -R                         Read group header. Default: "'@RG\\tID:foo\\tSM:bar'\n".
          -A                         Match score. Default: 1.
          -B                         Mismatch penalty. Default: 4.
          -E                         Gap extension penalty. Default: 1.
          -t                         Number of threads. Default: 32 (best performance in all-in-one).
          -L                         Region to process: one chromosome (for example, chr1:1-200) or multiple chromosomes (BED file).
          -g                         Output a file in gVCF format.
          --bwa                      With this parameter, the accuracy of the alignment results is comparable to that of BWA-mem.
          --disable-mark-duplicate   Disable duplicate marking.
          --metrics                  Output the metrics of the deduplication process.

vc        -o                         Output VCF file.
          -r                         Reference FASTA file.
          -i                         Input BAM file, sorted and deduplicated.
          -t                         Number of threads. Default: 32 (best performance in all-in-one).
          -L                         Region to process: one chromosome (for example, chr1:1-200) or multiple chromosomes (BED file).
          -g                         Output a file in gVCF format.
          --gtz-rbin1                When the input file fastq1 is a gtz file, the rbin file used to decompress it.
          --gtz-rbin2                When the input file fastq2 is a gtz file, the rbin file used to decompress it. For more information about rbin files, see the official gtz documentation: https://github.com/Genetalks/gtz
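To show how the options in the table combine into one command line, the following is a minimal sketch of a helper that assembles an align command. Only the option flags (-o, -t, -R, --bwa, --disable-mark-duplicate) come from the table. The executable name gtx, the helper name gen_align_cmd, and the position and order of the reference and FASTQ arguments are assumptions for illustration; check the help output of your GTX installation and the sample code for the actual syntax.

```python
import shlex


def gen_align_cmd(reference, fastq1, fastq2, output_bam, threads=32,
                  read_group=None, bwa_mode=False, mark_duplicate=True,
                  executable="gtx"):
    """Assemble a hypothetical `gtx align` argument list.

    Flags are taken from the parameter table; the executable name and the
    placement of positional inputs are assumptions, not confirmed syntax.
    """
    cmd = [executable, "align", "-o", output_bam, "-t", str(threads)]
    if read_group is not None:
        cmd += ["-R", read_group]            # read group header
    if bwa_mode:
        cmd.append("--bwa")                  # accuracy comparable to BWA-mem
    if not mark_duplicate:
        cmd.append("--disable-mark-duplicate")
    # ASSUMPTION: the reference and FASTQ inputs follow the options.
    cmd += [reference, fastq1, fastq2]
    return cmd


if __name__ == "__main__":
    cmd = gen_align_cmd("hg19.fa", "sample_1.fastq", "sample_2.fastq",
                        "sample.bam", read_group="@RG\\tID:foo\\tSM:bar")
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Building the command as a list and quoting it only for display avoids shell-escaping problems with the tab sequences in the read group header.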

2 WDL mode

For more information about the WDL mode, see related documents.

3 DAG mode

3.1 Sample scripts

Download the sample code of a DAG job.

When you use the sample code, take note of the following items:

Note

  • genGtxIndexCmd generates the command that builds an index.

  • genGtxWgsCmd generates the command for the GTX one process.

  • genGtxAlignCmd generates the command that aligns genomes.

  • genGtxVcCmd generates the command that calls variants.

For more information about how to run each command, see the help information in the code.

  • You can set custom values for the GTX parameters of each command or use the default values.

  • Building an index is optional in this topic because the demo uses an index that is already built by default. If you want to build an index, specify the isNeedIndex parameter when you run the script.

  • You can pass the value of the read_group_header parameter on the command line or use the default value.

  • By default, the sample code runs the GTX one process (alignment and variant calling). To run the steps separately, configure the related parameters.

  • You can run the pip install --upgrade batchcompute command to update the Batch Compute SDK for Python to the latest version.

3.2 Commands

python test.py --reference oss://xxx/ref/hg19.fa --fastq1 oss://xxx/input/human30x_10m_1.fastq --fastq2 oss://xxxx/_input/human30x_10m_2.fastq --output oss://xxx/testoutput/
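When you submit several samples, the invocation above can be assembled in a script instead of typed by hand. The following is a minimal sketch: the four options (--reference, --fastq1, --fastq2, --output) and the script name test.py come from the command above, while the helper name build_test_cmd, the requirement that every path starts with oss://, and the object paths in the usage example are assumptions for illustration.

```python
import shlex


def build_test_cmd(reference, fastq1, fastq2, output, script="test.py"):
    """Build the argument list for the sample DAG script.

    ASSUMPTION: all four paths are OSS paths, as in the documented example,
    so anything that does not start with oss:// is rejected early.
    """
    options = {
        "--reference": reference,
        "--fastq1": fastq1,
        "--fastq2": fastq2,
        "--output": output,
    }
    cmd = ["python", script]
    for flag, path in options.items():
        if not path.startswith("oss://"):
            raise ValueError(f"{flag} must be an OSS path, got: {path}")
        cmd += [flag, path]
    return cmd


if __name__ == "__main__":
    # Hypothetical object paths in the gtx-wgs-demo bucket.
    cmd = build_test_cmd(
        "oss://gtx-wgs-demo/ref/hg19.fa",
        "oss://gtx-wgs-demo/input/sample_1.fastq",
        "oss://gtx-wgs-demo/input/sample_2.fastq",
        "oss://gtx-wgs-demo/output/",
    )
    print(shlex.join(cmd))
    # To submit the job, pass the list to subprocess.run(cmd, check=True).
```

Validating the paths before submission catches a mistyped bucket prefix locally rather than after the job has started.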

3.3 Results

[Figure: GTX output directory]

[Figure: GTX output file]