GATK

Last Updated: Nov 02, 2018

The GATK analysis workflow is provided jointly by Alibaba Cloud and the Broad Institute. For the GATK workflows published by the Broad Institute, we recommend writing the workflow in Workflow Definition Language (WDL) and running it with the Cromwell workflow engine integrated into BatchCompute. You are billed only for the computing and storage resources that your jobs actually consume; there are no additional fees.

The Broad Institute GATK website and forum provide more background information, documentation, and support for GATK tools and WDL.

To write a general-purpose workflow in WDL, see "Cromwell workflow engine and WDL support" (section 3 of "In-app use").

* GATK and WDL support is currently open for testing. To try these features, open a ticket.

1. Preparation

(1) Use OSS for storage

To run GATK on BatchCompute, the input and output files must be stored in OSS, so you must first activate OSS and create a bucket.

NOTE: Create the bucket in the same region where you plan to run GATK on BatchCompute.

(2) Install the batchcompute-cli command line interface

    pip install batchcompute-cli

After installing the command line interface, you must configure it before use.

For configuration instructions, see the batchcompute-cli documentation.
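Before configuring the tool, it can help to confirm that the installation actually put the `bcs` executable on your PATH. The following is a minimal sketch; the function name `check_bcs` is a hypothetical helper for illustration and is not part of batchcompute-cli.

```shell
# Report whether the bcs executable (installed by batchcompute-cli) is on PATH.
# check_bcs is a hypothetical helper name used only in this sketch.
check_bcs() {
  if command -v bcs >/dev/null 2>&1; then
    echo "bcs found"
  else
    echo "bcs not found; check that pip's script directory is on PATH" >&2
    return 1
  fi
}
```

Calling `check_bcs` after the `pip install` step prints a short status line and returns a non-zero exit code if the command cannot be found.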

2. GATK demo

Run this command to generate the demo code:

    bcs gen ./demo -t gatk

This command generates the following directory structure:

    demo
    |____main.sh
    |____Readme.md
    |____src
    | |____PublicPairedSingleSampleWf.inputs.json
    | |____PublicPairedSingleSampleWf.md
    | |____PublicPairedSingleSampleWf.options.json
    | |____PublicPairedSingleSampleWf.wdl

Run the GATK demo

The GATK demo processes whole-genome sequencing data against human reference genome build 38. The input file is in unmapped BAM (uBAM) format. This example uses the public NA12878 data, which Alibaba Cloud stores free of charge.

Run the following command in your terminal from the demo directory. The OSS paths shown are those of the original example; replace them with paths in your own bucket.

    bcs asub cromwell gatk-job \
    --input_from_file_WDL src/PublicPairedSingleSampleWf.wdl \
    --input_from_file_WORKFLOW_INPUTS src/PublicPairedSingleSampleWf.inputs.json \
    --input_from_file_WORKFLOW_OPTIONS src/PublicPairedSingleSampleWf.options.json \
    --input_WORKING_DIR oss://luogc-shenzhen/gatkdemo/worker_dir/ \
    --output_OUTPUTS_DIR oss://luogc-shenzhen/gatkdemo/output

This command is already written in main.sh, so alternatively, you can simply run:

    sh main.sh
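Because the submission command embeds OSS paths, it may be convenient to assemble it from your own bucket name and print it for review before running it. The sketch below only prints the command and does not call `bcs`. The variable names `OSS_BUCKET` and `OSS_PREFIX`, their placeholder values, and the function name `print_submit_cmd` are assumptions made for illustration; the flags are the ones used in the command above. The output directory is named `outputs/` here to match the listing commands later on this page.

```shell
# Placeholder values; replace them with your own bucket and key prefix.
OSS_BUCKET="my_bucket"
OSS_PREFIX="my_key"

# Print (not run) the submission command for a given bucket and prefix.
# print_submit_cmd is a hypothetical helper, not part of batchcompute-cli.
print_submit_cmd() {
  base="oss://$1/$2"
  echo "bcs asub cromwell gatk-job" \
    "--input_from_file_WDL src/PublicPairedSingleSampleWf.wdl" \
    "--input_from_file_WORKFLOW_INPUTS src/PublicPairedSingleSampleWf.inputs.json" \
    "--input_from_file_WORKFLOW_OPTIONS src/PublicPairedSingleSampleWf.options.json" \
    "--input_WORKING_DIR $base/worker_dir/" \
    "--output_OUTPUTS_DIR $base/outputs/"
}

print_submit_cmd "$OSS_BUCKET" "$OSS_PREFIX"
```

Once the printed command looks right, you can paste it into the terminal or copy the paths into main.sh.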

The following message indicates submission was successful:

    Job created: job-0000000059DC658400006822000001E3

job-0000000059DC658400006822000001E3 is the ID of the submitted job.
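If you script the submission, you may want to capture the job ID rather than copy it by hand. The sketch below assumes the submission output contains a line of exactly the form shown above ("Job created: " followed by the ID); the actual output may include other lines or formatting, so treat this as a starting point. The function name `extract_job_id` is hypothetical.

```shell
# Read text on stdin and print the ID from a "Job created: <id>" line, if any.
# extract_job_id is a hypothetical helper, not part of batchcompute-cli.
extract_job_id() {
  sed -n 's/^Job created: \(job-[0-9A-Za-z]*\).*$/\1/p'
}

# Demonstration with the message shown above; in practice you would pipe
# the output of the submission command into extract_job_id instead.
JOB_ID=$(echo "Job created: job-0000000059DC658400006822000001E3" | extract_job_id)
echo "$JOB_ID"
```

The captured `JOB_ID` can then be passed to the status and log commands in place of the literal ID.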

Check the job status:

    bcs j # Get the job list
    bcs j job-0000000059DC658400006822000001E3 # View job details

View the job log:

    bcs log job-0000000059DC658400006822000001E3

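Verify the results:

In the commands below, replace the oss://my_bucket/my_key/ paths with the values you passed to --input_WORKING_DIR and --output_OUTPUTS_DIR when you submitted the job.

To view the intermediate data and information in the working directory:

    bcs o ls oss://my_bucket/my_key/worker_dir/

To view all output files:

    bcs o ls oss://my_bucket/my_key/outputs/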

You have now successfully run Broad Institute GATK on BatchCompute.