Alibaba Genomic Service (AGS) enables rapid processing of Whole Genome Sequencing (WGS) tasks, including read alignment, sorting, deduplication, and variant detection. This topic describes how to manage WGS workflows through AGS.

Preparations

  • You have requested a preview of AGS and the request is authorized.
  • Configure permission settings and prepare data.
    1. Set up AGS.

      For more information about AGS download and installation, see Introduction to AGS CLI.

      ags config init
    2. Specify your OSS bucket and grant AGS read and write permissions on the bucket and the GetBucketInfo permission.
      Usage:
      ags config oss <your bucket name>
      
      e.g.
      ags config oss my-test-shenzhen
    3. Upload FASTQ data to your OSS bucket through ossutil.
      For more information about ossutil download and installation, see Download and installation.
      Note: AGS currently supports only WGS and WES of human genome data. Alignment of methylation data and of plant or animal genome data is not supported.
      Usage:
      
      ossutil cp -r <local dir of fastq> <path of oss bucket>
      
      e.g.
      ossutil cp -r ./MGISEQ oss://my-test-shenzhen/MGISEQ
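
Before uploading, it can help to confirm that every first-read file has its second-read mate, because a workflow needs both files of a pair. The following is a minimal sketch, not part of AGS or ossutil: check_fastq_pairs is a hypothetical helper, and it assumes the _1.fq.gz and _2.fq.gz suffix convention used by the file names in this topic.

```shell
#!/bin/sh
# Hypothetical helper (not an AGS or ossutil command).
# Reports every *_1.fq.gz file in a directory that lacks its *_2.fq.gz mate.
# Assumes the _1/_2 suffix convention used by the file names in this topic.
check_fastq_pairs() {
  dir=$1
  missing=0
  for r1 in "$dir"/*_1.fq.gz; do
    [ -e "$r1" ] || continue            # the glob matched no file
    r2="${r1%_1.fq.gz}_2.fq.gz"         # expected name of the mate file
    if [ ! -e "$r2" ]; then
      echo "missing mate for: $r1" >&2
      missing=1
    fi
  done
  return "$missing"                     # 0 if all pairs are complete, 1 otherwise
}

# Example: upload only when all pairs are complete.
# check_fastq_pairs ./MGISEQ && ossutil cp -r ./MGISEQ oss://my-test-shenzhen/MGISEQ
```

The check looks only at the top level of the directory. For a sample directory with one subdirectory per lane, run it once per lane directory.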

Start a WGS workflow

We recommend that you use the hs37d5 version of human reference genome hg19 as the reference. This is also the default reference genome.

Note: Version hs37d5 of human reference genome hg19 has the following features:
  • Excludes ALT contigs
  • Hard masks PARs on chrY
  • Includes decoy contigs

AGS is ALT-aware, so it can identify and process ALT contigs. The UCSC hg19 genome includes ALT contigs but lacks the other two features (hard-masked PARs on chrY and decoy contigs), which lowers the accuracy of variant detection. For more information, visit this link.

Usage:

ags remote run wgs \
--region cn-shenzhen \
--fastq1 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz \
--fastq2 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz \
--bucket my-test-shenzhen \
--output-bam bam/MGISEQ_NA12878_hs37d5.bam \
--output-vcf vcf/MGISEQ_NA12878_hs37d5_5.vcf \
--service "g" \
--reference [hg19|hg38|<reference path on OSS>]

# --region              Region of the OSS bucket, for example cn-shenzhen or cn-beijing.
# --fastq1, --fastq2    Paths in the bucket (<fastq path>/<filename>) of the first and second file of the FASTQ pair.
# --bucket              Bucket name.
# --output-bam          Path of the output BAM file in the bucket. Empty by default, in which case no BAM file is output.
# --output-vcf          Path of the output VCF file in the bucket.
# --service             SLA level: n (normal), s (silver), g (gold), or p (platinum).
# --reference           hg19: the hs37d5 version of GRCh37/hg19, which includes decoy contigs; UCSC hg19 is not supported. hg38: GRCh38/hg38, which includes decoy contigs.

e.g.
ags remote run wgs \
--region cn-shenzhen \
--fastq1 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz  \
--fastq2 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz  \
--bucket my-test-shenzhen \
--output-vcf vcf/MGISEQ_NA12878_hs37d5_5.vcf \
--output-bam bam/MGISEQ_NA12878_hs37d5_5.bam \
--service "s" \
--reference hg19

### Batch process FASTQ files including multiple lanes and samples
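MGISAMPLE001 is a directory that contains the WGS reads of one sample sequenced on multiple lanes. To merge the lanes and process them as one sample, set both --fastq1 and --fastq2 to the sample directory, for example --fastq1 MGISAMPLE001 and --fastq2 MGISAMPLE001. In this example, the directory contains the following files: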
oss://my-test-shenzhen/MGISAMPLE001/L1/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz
oss://my-test-shenzhen/MGISAMPLE001/L2/MGISEQ2000_PCR-free_NA12878_1_V100003043_L02_1.fq.gz
oss://my-test-shenzhen/MGISAMPLE001/L1/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz
oss://my-test-shenzhen/MGISAMPLE001/L2/MGISEQ2000_PCR-free_NA12878_1_V100003043_L02_2.fq.gz

ags remote run wgs \
--region cn-shenzhen \
--fastq1 MGISAMPLE001 \
--fastq2 MGISAMPLE001  \
--bucket my-test-shenzhen \
--output-vcf vcf/MGISEQ_NA12878_hs37d5_6.vcf \
--output-bam bam/MGISEQ_NA12878_hs37d5_6.bam \
--service "g" \
--reference hg19

ags remote run wgs \
--region cn-shenzhen \
--fastq1 MGISAMPLE002 \
--fastq2 MGISAMPLE002  \
--bucket my-test-shenzhen \
--output-vcf vcf/MGISEQ_NA12878_hs37d5_7.vcf \
--output-bam bam/MGISEQ_NA12878_hs37d5_7.bam \
--service "g" \
--reference hg19
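
When many samples are involved, the two commands above can be generated in a loop. The following is a minimal sketch under stated assumptions: run and submit_samples are hypothetical helper names, the output file names are derived from the sample directory name purely for illustration, and the DRY_RUN variable is an addition of this sketch, not an AGS option. Only flags that appear in this topic are used.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the AGS CLI).

# Print the command when DRY_RUN is 1 (the default in this sketch); run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Submit one WGS workflow per sample directory.
# Usage: submit_samples <bucket> <region> <sample dir>...
submit_samples() {
  bucket=$1
  region=$2
  shift 2
  for sample in "$@"; do
    # Output names based on the sample directory are an assumption of this sketch.
    run ags remote run wgs \
      --region "$region" \
      --fastq1 "$sample" \
      --fastq2 "$sample" \
      --bucket "$bucket" \
      --output-vcf "vcf/$sample.vcf" \
      --output-bam "bam/$sample.bam" \
      --service g \
      --reference hg19
  done
}

# Example (prints two commands without submitting anything):
# submit_samples my-test-shenzhen cn-shenzhen MGISAMPLE001 MGISAMPLE002
```

Review the printed commands first, then set DRY_RUN=0 to submit the workflows.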

Start a Mapping workflow

Use --fastq1 and --fastq2 to specify the FASTQ files, and use --output-bam to specify the output path of the BAM file.

Usage:

ags remote run mapping \
--region cn-shenzhen \
--fastq1 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz \
--fastq2 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz \
--bucket my-test-shenzhen \
--output-bam bam/MGISEQ_NA12878_hs37d5.bam \
--service "g" \
--markdup [true|false] \
--reference [hg19|hg38|<reference path on OSS>]

# --region              Region of the OSS bucket, for example cn-shenzhen or cn-beijing.
# --fastq1, --fastq2    Paths in the bucket (<fastq path>/<filename>) of the first and second file of the FASTQ pair.
# --bucket              Bucket name.
# --output-bam          Path of the output BAM file in the bucket.
# --service             SLA level: n (normal), s (silver), g (gold), or p (platinum).
# --markdup             Whether to mark duplicates. Default: true.

e.g.

ags remote run mapping \
--region cn-shenzhen \
--fastq1 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_1.fq.gz  \
--fastq2 MGISEQ/MGISEQ2000_PCR-free_NA12878_1_V100003043_L01_2.fq.gz  \
--bucket my-test-shenzhen \
--output-bam bam/MGISEQ_NA12878_hs37d5.bam \
--service "g" \
--markdup "true" \
--reference hg19
			

List remote workflows

Usage:
ags remote list

e.g.
ags remote list
+---------------+-------------------------------+
|   JOB NAME    |          CREATE TIME          |
+---------------+-------------------------------+
| wgs-gpu-ckw96 | 2020-01-07 19:08:32 +0000 UTC |
| wgs-gpu-djzws | 2020-01-07 18:31:22 +0000 UTC |
| wgs-gpu-pd659 | 2020-01-03 20:34:09 +0000 UTC |
+---------------+-------------------------------+
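
For scripting, the job names can be pulled out of this table. The following is a minimal sketch that assumes the output keeps the two-column layout shown above; job_names is a hypothetical helper, not an AGS command.

```shell
#!/bin/sh
# Hypothetical helper (not part of the AGS CLI).
# Reads `ags remote list` output on stdin and prints one job name per line.
# Assumes the table layout shown in this topic: "| JOB NAME | CREATE TIME |".
job_names() {
  # Border rows contain no "|" and are skipped by the NF test;
  # the header row is skipped by name.
  awk -F'|' 'NF >= 3 && $2 !~ /JOB NAME/ { gsub(/ /, "", $2); print $2 }'
}

# Example:
# ags remote list | job_names
```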

Obtain workflow details

Usage:
ags remote get <workflow id> [--show]
--show: also displays the input parameters of the workflow

e.g.
ags remote get wgs-gpu-sjtlw
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+
|   JOB NAME    |  JOB NAMESPACE   |  STATUS   |          CREATE TIME          | DURATION |          FINISH TIME          |
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+
| wgs-gpu-sjtlw | XXXXXXXXXXXXXXXX | Succeeded | 2020-01-07 21:38:05 +0800 CST | 12m25s   | 2020-01-07 21:50:30 +0800 CST |
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+

ags remote get wgs-gpu-sjtlw --show

+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+
|   JOB NAME    |  JOB NAMESPACE   |  STATUS   |          CREATE TIME          | DURATION |          FINISH TIME          |
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+
| wgs-gpu-sjtlw | XXXXXXXXXXXXXXXX | Succeeded | 2020-01-07 21:38:05 +0800 CST | 12m25s   | 2020-01-07 21:50:30 +0800 CST |
+---------------+------------------+-----------+-------------------------------+----------+-------------------------------+


+-----------------------+---------------------------------+
|      JOB DETAIL       |                                 |
+-----------------------+---------------------------------+
| wgs_reference_file    | hg19                            |
| wgs_service           | g                               |
| wgs_oss_region        | cn-shenzhen                     |
| wgs_fastq_first_name  | MGISAMPLE001                    |
| wgs_fastq_second_name | MGISAMPLE001                    |
| wgs_bucket_name       | my-test-shenzhen                |
| wgs_vcf_file_name     | vcf/MGISEQ_NA12878_hs37d5_6.vcf |
| wgs_bam_file_name     | bam/MGISEQ_NA12878_hs37d5_6.bam |
+-----------------------+---------------------------------+
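
To wait for a workflow in a script, you can read the STATUS column from this output and poll until it changes. The following is a minimal sketch, not an AGS feature: job_status and wait_for_job are hypothetical helpers, the parsing assumes the table layout shown above, and "Running" is an assumed status name, because only "Succeeded" appears in this topic.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the AGS CLI).

# Reads `ags remote get` output on stdin and prints the STATUS of the job row.
# The two-column JOB DETAIL table printed by --show is ignored by the NF test.
job_status() {
  awk -F'|' 'NF >= 7 && $2 !~ /JOB NAME/ { gsub(/ /, "", $4); print $4; exit }'
}

# Polls a workflow once a minute until its status is no longer the running status.
# Usage: wait_for_job <workflow id> [running status]
# The default "Running" is an assumption; adjust it to what your AGS version prints.
wait_for_job() {
  job=$1
  running=${2:-Running}
  while :; do
    status=$(ags remote get "$job" | job_status)
    [ "$status" = "$running" ] || break
    sleep 60
  done
  printf '%s\n' "$status"
}

# Example:
# wait_for_job wgs-gpu-sjtlw
```

The loop also stops when no status can be read, for example if the command fails, so check the printed value before acting on it.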

			

Cancel a running workflow

Usage:

ags remote cancel  <workflow id>

e.g.

ags remote cancel wgs-gpu-zls6r
INFO[0000] Successed to cancel wgs-gpu-zls6r

Remove a finished workflow

You can remove successful and failed workflows, but cannot remove running workflows.

Usage: 

ags remote remove <workflow id>

e.g.

ags remote remove wgs-gpu-zls6r
INFO[0000] Successed to remove wgs-gpu-zls6r