Workflow Description Language (WDL) is a language developed by the Broad Institute for specifying data processing workflows with a human-readable and writable syntax. You can use WDL to create bioinformatics workflows efficiently. This topic describes how to use Alibaba Cloud Genomics Service (AGS) to create and run a WDL workflow in a Container Service for Kubernetes (ACK) cluster.

Prerequisites

  • An ACK cluster is created. For more information, see Create an ACK managed cluster.
  • One or more storage services, such as Apsara File Storage NAS (NAS), Object Storage Service (OSS), or file systems that support the Network File System (NFS) protocol, are deployed. These storage services are used to store input and output data. For more information, see Mount a dynamically provisioned NAS volume.

Benefits of running WDL workflows in ACK clusters

  • The solution is compatible with open source Cromwell, which executes WDL scripts. You can run existing WDL workflows in ACK clusters without modifying them. For more information about WDL, see WDL.
  • WDL allows you to assign the Guaranteed quality of service (QoS) class to pods to optimize resource allocation for tasks. This prevents load increases and performance degradation caused by resource contention on nodes.
  • WDL is seamlessly integrated with Alibaba Cloud storage services. WDL can directly access OSS and NAS, and allows you to ingest data from multiple data sources.

Differences between WDL workflows and AGS workflows

  • Compared with workflows run by a Cromwell server, cloud-native AGS workflows have the following benefits:
    • Resource control based on parameters such as CPU, Mem min, and Mem max.
    • Scheduling optimization, automatic retry, and dynamic adjustments of resource quotas.
    • Monitoring and logging.
  • WDL workflows use resources less efficiently than AGS workflows. As a result, the success rate of submitting multiple samples at the same time is also lower than that of AGS workflows. To create a large number of recurring workflows, we recommend that you use AGS workflows. For more information, see Create a workflow.
  • To reduce the cost and accelerate CPU-intensive tasks such as mapping, HaplotypeCaller (HC), and Mutect2, we recommend that you call the AGS API. For more information, see Use AGS to process WGS tasks.

Step 1: Deploy an application

All components that are required to create a WDL workflow are packaged into a Helm chart and published to the Marketplace module in the ACK console. This way, you can deploy these components as an application with a few clicks.

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, choose Marketplace > App Catalog.
  3. On the Marketplace page, click the App Catalog tab. Then, find and click ack-ags-wdl.
  4. On the ack-ags-wdl page, click Deploy.
  5. In the Deploy wizard, select a cluster and namespace, and then click Next.
  6. On the Parameters wizard page, configure the parameters and click OK.
    # PVCs to be mounted
    naspvcs:
      - naspvc1
      - naspvc2
    osspvcs:
      - osspvc1
      - osspvc2
      
    # The following values are used to complete the transfer-pvc.yaml and storageclass.yaml files.
    naspvc1:
      # Change this to the actual URL of your NAS/NFS server.
      server: "XXXXXX-fbi71.cn-beijing.nas.aliyuncs.com"
      # The absolute path of your data root on the NAS/NFS server. Change this to your actual path.
      path: "/tarTest" # e.g. "/tarTest"
      # The storage driver: csi or flexVolume. Default: csi. If your cluster uses FlexVolume (Kubernetes 1.14.x or earlier), change this to flexVolume.
      driver: "csi"
      nasbasepath: "/ags-wdl-nas"
      # The NFS protocol version.
      mountVers: "3"
      # Mount options. For a self-managed NFS server, modify these options (for example, remove noresvport).
      mountOptions: "nolock,tcp,noresvport"
    
    naspvc2:
      # Change this to the actual URL of your NAS/NFS server.
      server: "XXXXXXX-fbi71.cn-beijing.nas.aliyuncs.com"
      # The absolute path of your data root on the NAS/NFS server. Change this to your actual path.
      path: "/tarTest/bwatest" # e.g. "/tarTest"
      # The storage driver: csi or flexVolume. Default: csi. If your cluster uses FlexVolume (Kubernetes 1.14.x or earlier), change this to flexVolume.
      driver: "csi"
      nasbasepath: "/ags-wdl-nas2"
      # The NFS protocol version.
      mountVers: "3"
      # Mount options. For a self-managed NFS server, modify these options (for example, remove noresvport).
      mountOptions: "nolock,tcp,noresvport"
    
    osspvc1:
      # Change this to the actual name of your bucket.
      bucket: "oss-test-tsk"
      # Change this to the actual endpoint of your bucket.
      url: "oss-cn-beijing.aliyuncs.com"
      # Mount options. You do not need to modify the value in most cases.
      options: "-o max_stat_cache_size=0 -o allow_other"
      # The AccessKey pair that is used to mount the bucket.
      akid: "XXXXXXX"
      aksecret: "XXXXXXX"
      # The absolute path of your data in the bucket.
      path: "/"
      ossbasepath: "/ags-wdl-oss"
    
    osspvc2:
      bucket: "oss-test-tsk"
      url: "oss-cn-beijing.aliyuncs.com"
      options: "-o max_stat_cache_size=0 -o allow_other"
      akid: "XXXXXXX"
      aksecret: "XXXXXXX"
      path: "/input"
      ossbasepath: "/ags-wdl-oss-input"
    
    # Paths of input and output data in your wdl.json are relative to the mount:
    # prepend the basepath (nasbasepath or ossbasepath) to the relative path of the data inside the mount.
    # e.g.
    # {
    #   "wf.bwa_mem_tool.reference": "/ags-wdl/reference/subset_assembly.fa.gz", # The name of the reference file.
    #   "wf.bwa_mem_tool.reads_fq1": "/ags-wdl/fastq_sample/SRR1976948_1.fastq.gz", # The name of the fastq1 file.
    #   "wf.bwa_mem_tool.reads_fq2": "/ags-wdl/fastq_sample/SRR1976948_2.fastq.gz", # The name of the fastq2 file.
    #   "wf.bwa_mem_tool.outputdir": "/ags-wdl/bwatest/output", # The output path. Results are written to xxxxxxxx.cn-beijing.nas.aliyuncs.com:/mydata_root/bwatest/output. Make sure that the path exists.
    #   "wf.bwa_mem_tool.fastqFolder": "fastqfolder" # The working directory of the task.
    # }
    
    # The working directory. You can use nasbasepath or ossbasepath as the workdir.
    workdir: "/ags-wdl-oss"
    # The provisioning directory. You can use nasbasepath or ossbasepath as the provisiondir. Tasks dynamically create volumes in this directory for input and output data.
    provisiondir: "/ags-wdl-oss-input"
    # Schedule tasks to nodes with the specified label. e.g. nodeselector: "node-type=wdl"
    nodeselector: "node-type=wdl"
    # Configurations in the cromwellserver-svc.yaml file.
    cromwellserversvc:
      # The NodePort of the Cromwell server for internal access. For external access, you can use a load balancer instead.
      nodeport: "32567"
    
    # Configurations in the config.yaml file.
    config:
      # The project namespace. Default value: wdl. You do not need to modify the value in most scenarios.
      namespace: "wdl"
      # The domain name of the TESK backend. Default value: tesk-api. You do not need to modify the value in most scenarios.
      teskserver: "tesk-api"
      # The domain name of the Cromwell server. Default value: cromwellserver. You do not need to modify the value in most scenarios.
      cromwellserver: "cromwellserver"
      # The port of the Cromwell server. Default value: 8000. You do not need to modify the value in most scenarios.
      cromwellport: "8000"
    • Mount volumes to store application data.
      You can use NAS and OSS volumes to store application data.
      • naspvcs (required): The NAS file systems that you want to mount to the ACK cluster as volumes. For example, to mount two persistent volume claims (PVCs) named naspvc1 and naspvc2, add the following configuration:
        naspvcs:
          - naspvc1
          - naspvc2
      • osspvcs (required): The OSS volumes that you want to mount to your cluster. For example, to mount two PVCs named osspvc1 and osspvc2, add the following configuration:
        osspvcs:
          - osspvc1
          - osspvc2
    • Configure the NAS volume.
      Each NAS volume is mounted to a container path that is specified by the nasbasepath parameter. To set the naspvcs parameter, you must configure each NAS volume that you want to mount.
      Note Use the default settings for other parameters.
      • server (required): The mount target of the NAS file system that is used to store input and output data. Change the value to the URL of your NAS file system.
      • path (required): The subdirectory of the NAS file system that you want to mount. Change the value to the actual path of your data root.
      • driver (required): The volume plug-in that is used by your cluster. The default plug-in is Container Storage Interface (CSI). If your cluster uses FlexVolume, set the value to flexVolume.
      • nasbasepath (required): The application directory to which you want to mount the NAS subdirectory. To write data to and read data from the NAS file system, replace server:path in the absolute path with nasbasepath. For example, if you map xxx-xxx.cn-beijing.nas.aliyuncs.com:/wdl to /ags-wdl, the NAS path of the input file test.fq.gz is xxx-xxx.cn-beijing.nas.aliyuncs.com:/wdl/test/test.fq.gz, and the application accesses it as /ags-wdl/test/test.fq.gz.
      • mountOptions (required): The options that are used to mount the NAS file system. If you use a self-managed NAS file system, modify this parameter.
      • mountVers (required): The version of the NFS protocol.
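      The server:path to nasbasepath translation described above can be sketched in shell. The values below are hypothetical placeholders for illustration; substitute your own mount settings (the same translation applies to OSS volumes with ossbasepath):

```shell
# Hypothetical mount settings for illustration only.
nas_root="/wdl"                  # the "path" subdirectory mounted from the NAS file system
nasbasepath="/ags-wdl"           # the container directory configured in nasbasepath

nas_file="/wdl/test/test.fq.gz"  # absolute path of an input file on the NAS file system
# Strip the NAS subdirectory prefix and prepend nasbasepath to get the in-container path.
container_path="${nasbasepath}${nas_file#${nas_root}}"
echo "${container_path}"         # prints /ags-wdl/test/test.fq.gz
```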
    • Configure the OSS volume.
      Each OSS volume is mounted to a container path that is specified by the ossbasepath parameter. To set the osspvcs parameter, you must configure each OSS volume that you want to mount.
      • bucket (required): The name of your OSS bucket.
      • url (required): The endpoint of your OSS bucket.
      • options (required): The mount options. Default value: -o max_stat_cache_size=0 -o allow_other. You do not need to modify the value.
      • akid (required): The AccessKey ID of your account. The AccessKey pair is stored in a Secret in the cluster.
      • aksecret (required): The AccessKey secret of your account.
      • path (required): The subdirectory of the OSS bucket that you want to mount.
      • ossbasepath (required): The application directory to which you want to mount the OSS subdirectory. To write data to and read data from the OSS bucket, replace bucket:path in the absolute path with ossbasepath. For example, if bucket is set to shenzhen-test, path is set to /wdl, and ossbasepath is set to /ags-wdl, the OSS path of the input file test.fq.gz is shenzhen-test:/wdl/test/test.fq.gz, and the application accesses it as /ags-wdl/test/test.fq.gz.
    • Set other parameters.
      • workdir (required): The working directory. Specify nasbasepath or ossbasepath as the directory. All files that are generated by tasks, including scripts and error logs, are stored in this directory.
      • provisiondir (required): The directory to which dynamically provisioned persistent volumes (PVs) are mounted. Specify nasbasepath or ossbasepath as the directory. All input and output data generated by tasks is stored in the PVs that are mounted to this directory.
      • nodeselector (optional): The label that is used to schedule tasks. If you set node-type=wdl, all tasks are scheduled to nodes with the label node-type=wdl.
      • nodeport (required): The port that is opened on the Cromwell server. If you want to use the Widdler command-line tool to submit tasks, you must set this parameter. Default value: 32567.
      • namespace (required): The namespace where the application is deployed. Default value: wdl.
      • teskserver (required): The domain name of the TESK service. You can use the default value.
      • cromwellserver (required): The domain name of the Cromwell server. You can use the default value.
      • cromwellport (required): The port of the Cromwell server. You can use the default value.
    Run the following command to check the deployment. If the output shows that the cromwellcli, cromwellserver, and tesk-api pods are in the Running state, the application is deployed:
    kubectl get pods -n wdl

    Expected output:

    NAME                                READY   STATUS      RESTARTS   AGE
    cromwellcli-85cb66b98c-bv4kt        1/1     Running     0          5d5h
    cromwellserver-858cc5cc8-np2mc      1/1     Running     0          5d5h
    tesk-api-5d8676d597-wtmhc           1/1     Running     0          5d5h

Step 2: Submit tasks

You can use AGS or a CLI to submit tasks to a cluster from your on-premises machine. We recommend that you use AGS to submit tasks.

Use AGS to submit tasks

The latest version of the AGS CLI allows you to submit WDL tasks. To use AGS to submit tasks, download the AGS CLI and specify the address of the Cromwell server. For more information about how to download the AGS CLI, see Introduction to AGS CLI.

  1. Create a bwa.wdl file and a bwa.json file.
    • Sample bwa.wdl file:
      task bwa_mem_tool {
        Int threads
        Int min_seed_length
        Int min_std_max_min
        String reference
        String reads_fq1
        String reads_fq2
        String outputdir
        String fastqFolder
        command {
              mkdir -p /bwa/${fastqFolder}
              cd /bwa/${fastqFolder}
              rm -rf SRR1976948*
              wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/subset_assembly.fa.gz
              wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_1.fastq.gz
              wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_2.fastq.gz
              gunzip -c ${reference} > subset_assembly.fa
              gunzip -c ${reads_fq1} | head -800000 > SRR1976948.1
              gunzip -c ${reads_fq2} | head -800000 > SRR1976948.2
              bwa index subset_assembly.fa
              bwa aln subset_assembly.fa SRR1976948.1 > ${outputdir}/SRR1976948.1.untrimmed.sai
              bwa aln subset_assembly.fa SRR1976948.2 > ${outputdir}/SRR1976948.2.untrimmed.sai
        }
        output {
          File sam1 = "${outputdir}/SRR1976948.1.untrimmed.sai"
          File sam2 = "${outputdir}/SRR1976948.2.untrimmed.sai"
        }
        runtime {
          docker: "registry.cn-hangzhou.aliyuncs.com/plugins/wes-tools:v3"
          memory: "2GB"
          cpu: 1
        }
      }
      workflow wf {
        call bwa_mem_tool
      }
    • Sample bwa.json file:
      {
        "wf.bwa_mem_tool.reference": "subset_assembly.fa.gz", # The name of the reference file.
        "wf.bwa_mem_tool.reads_fq1": "SRR1976948_1.fastq.gz", # The name of the fastq1 file.
        "wf.bwa_mem_tool.reads_fq2": "SRR1976948_2.fastq.gz", # The name of the fastq2 file.
        "wf.bwa_mem_tool.outputdir": "/ags-wdl/bwatest/output", # The output path. Results are written to xxx-xxx.cn-beijing.nas.aliyuncs.com:/wdl/bwatest/output. Make sure that the path exists.
        "wf.bwa_mem_tool.fastqFolder": "fastqfolder" # The working directory of the task.
      }
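    Note that the # annotations in the sample above are for explanation only and are not valid JSON; remove them before submitting. A quick way to confirm that the stripped inputs file parses, assuming python3 is available on your machine:

```shell
# Write a comment-free inputs file and confirm that it is valid JSON.
cat > bwa.json <<'EOF'
{
  "wf.bwa_mem_tool.reference": "subset_assembly.fa.gz",
  "wf.bwa_mem_tool.reads_fq1": "SRR1976948_1.fastq.gz",
  "wf.bwa_mem_tool.reads_fq2": "SRR1976948_2.fastq.gz",
  "wf.bwa_mem_tool.outputdir": "/ags-wdl/bwatest/output",
  "wf.bwa_mem_tool.fastqFolder": "fastqfolder"
}
EOF
python3 -m json.tool bwa.json > /dev/null && echo "bwa.json is valid JSON"
```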
  2. Run the following command to forward requests that are destined for local port 32567 to the Cromwell server:
    kubectl port-forward svc/cromwellserver 32567:32567
  3. Run the following command to configure the endpoint of the Cromwell server.
    ags config init

    Expected output:

    Please input your AccessKeyID
    xxxxx
    Please input your AccessKeySecret
    xxxxx
    Please input your cromwellserver url
    xxx-xxx.cn-beijing.nas.aliyuncs.com:32567
  4. Run the following command to submit a WDL task:
    ags wdl run resource/bwa.wdl resource/bwa.json --watch # The --watch flag runs the command synchronously: the next task does not start until the current one succeeds or fails.

    Expected output:

    INFO[0000] bd747360-f82c-4cd2-94e0-b549d775f1c7 Submitted
  5. Optional: Query or delete WDL tasks from your on-premises machine.
    • Query WDL tasks.
      • Run the following explain command to query a WDL task:
        ags wdl explain bd747360-f82c-4cd2-94e0-b549d775f1c7

        Expected output:

        INFO[0000] bd747360-f82c-4cd2-94e0-b549d775f1c7 Running
      • Run the following query command to query a WDL task:
        ags wdl query bd747360-f82c-4cd2-94e0-b549d775f1c7

        Expected output:

        INFO[0000] bd747360-f82c-4cd2-94e0-b549d775f1c7 {"calls":{"end":"0001-01-01T00:00:00.000Z","executionStatus":null,"inputs":null,"start":"0001-01-01T00:00:00.000Z"},"end":"0001-01-01T00:00:00.000Z","id":"b3aa1563-6278-4b2e-b525-a2ccddcbb785","inputs":{"wf_WGS.Reads":"/ags-wdl-nas/c.tar.gz"},"outputs":{},"start":"2020-10-10T09:34:56.022Z","status":"Running","submission":"2020-10-10T09:34:49.989Z"}
    • Run the following abort command to delete a WDL task:
      ags wdl abort bd747360-f82c-4cd2-94e0-b549d775f1c7

      Expected output:

      INFO[0000] bd747360-f82c-4cd2-94e0-b549d775f1c7 Aborting

Use the CLI to submit tasks

To use the CLI to submit tasks, you must create a WDL file and a JSON file. Then, use an image provided by AGS to submit tasks. Specify the endpoint of the Cromwell server in the following format: cluster IP:node port.

  1. Create a bwa.wdl file and a bwa.json file.
    • Sample bwa.wdl file:
      task bwa_mem_tool {
        Int threads
        Int min_seed_length
        Int min_std_max_min
        String reference
        String reads_fq1
        String reads_fq2
        String outputdir
        String fastqFolder
        command {
              mkdir -p /bwa/${fastqFolder}
              cd /bwa/${fastqFolder}
              rm -rf SRR1976948*
              wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/subset_assembly.fa.gz
              wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_1.fastq.gz
              wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_2.fastq.gz
              gunzip -c ${reference} > subset_assembly.fa
              gunzip -c ${reads_fq1} | head -800000 > SRR1976948.1
              gunzip -c ${reads_fq2} | head -800000 > SRR1976948.2
              bwa index subset_assembly.fa
              bwa aln subset_assembly.fa SRR1976948.1 > ${outputdir}/SRR1976948.1.untrimmed.sai
              bwa aln subset_assembly.fa SRR1976948.2 > ${outputdir}/SRR1976948.2.untrimmed.sai
        }
        output {
          File sam1 = "${outputdir}/SRR1976948.1.untrimmed.sai"
          File sam2 = "${outputdir}/SRR1976948.2.untrimmed.sai"
        }
        runtime {
          docker: "registry.cn-hangzhou.aliyuncs.com/plugins/wes-tools:v3"
          memory: "2GB"
          cpu: 1
        }
      }
      workflow wf {
        call bwa_mem_tool
      }
    • Sample bwa.json file:
      {
        "wf.bwa_mem_tool.reference": "subset_assembly.fa.gz", # The name of the reference file.
        "wf.bwa_mem_tool.reads_fq1": "SRR1976948_1.fastq.gz", # The name of the fastq1 file.
        "wf.bwa_mem_tool.reads_fq2": "SRR1976948_2.fastq.gz", # The name of the fastq2 file.
        "wf.bwa_mem_tool.outputdir": "/ags-wdl/bwatest/output", # The output path. Results are written to xxx-xxx.cn-beijing.nas.aliyuncs.com:/wdl/bwatest/output. Make sure that the path exists.
        "wf.bwa_mem_tool.fastqFolder": "fastqfolder" # The working directory of the task.
      }
  2. Run the following command to submit a WDL task:
    docker run -e CROMWELL_SERVER=192.16*.*.** -e CROMWELL_PORT=30384 registry.cn-beijing.aliyuncs.com/tes-wes/cromwellcli:v1 run resources/bwa.wdl resources/bwa.json

    Expected output:

    -------------Cromwell Links-------------
    http://192.16*.*.**:30384/api/workflows/v1/5d7ffc57-6883-4658-adab-3f508826322a/metadata
    http://192.16*.*.**:30384/api/workflows/v1/5d7ffc57-6883-4658-adab-3f508826322a/timing
    {
        "status": "Submitted", 
        "id": "5d7ffc57-6883-4658-adab-3f508826322a"
    }
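    The JSON object at the end of the output contains the workflow ID, which you need for later queries against the Cromwell links shown above. A minimal sketch of extracting it from the sample response, assuming python3 is available:

```shell
# Sample submission response copied from the output above.
response='{"status": "Submitted", "id": "5d7ffc57-6883-4658-adab-3f508826322a"}'
# Extract the workflow ID for use in later /api/workflows/v1/<id>/metadata queries.
workflow_id=$(printf '%s' "${response}" | python3 -c 'import json,sys; print(json.load(sys.stdin)["id"])')
echo "${workflow_id}"   # prints 5d7ffc57-6883-4658-adab-3f508826322a
```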