Workflow Description Language (WDL) is a language developed by the Broad Institute. WDL
specifies data processing workflows with a human-readable and writable syntax, which
makes it efficient for creating bioinformatics workflows. This topic describes how to
use Alibaba Cloud Genomics Service (AGS) to create and run a WDL workflow in a
Container Service for Kubernetes (ACK) cluster.
Prerequisites
- An ACK cluster is created. For more information, see Create an ACK managed cluster.
- One or more storage services, such as Apsara File Storage NAS (NAS), Object Storage
Service (OSS), or file systems that support the Network File System (NFS) protocol,
are deployed. These storage services are used to store input and output data. For
more information, see Mount a dynamically provisioned NAS volume.
Benefits of running WDL workflows in ACK clusters
- WDL workflows are compatible with open source Cromwell, which supports WDL scripts.
You can directly run legacy WDL workflows in ACK clusters without the need to modify
them. For more information about WDL, see WDL.
- WDL allows you to assign the Guaranteed quality of service (QoS) class to pods to
optimize resource allocation for tasks. This prevents load increases and performance
degradation caused by resource contention on nodes.
- WDL is seamlessly integrated with Alibaba Cloud storage services. WDL can directly
access OSS and NAS, and allows you to ingest data from multiple data sources.
Differences between WDL workflows and AGS workflows
- Compared with workflows run by a Cromwell server, cloud-native AGS workflows have
the following benefits:
- Resource control based on parameters such as CPU, Mem min, and Mem max.
- Scheduling optimization, automatic retry, and dynamic adjustments of resource quotas.
- Monitoring and logging.
- WDL workflows use cluster resources less efficiently than AGS workflows. As a result,
the success rate of submitting multiple samples at the same time is also lower than
that of AGS workflows. To create a large number of recurring workflows, we recommend
that you use AGS workflows.
For more information, see Create a workflow.
- To reduce the cost and accelerate CPU-intensive tasks such as mapping, HaplotypeCaller
(HC), and Mutect2, we recommend that you call the AGS API. For more information, see
Use AGS to process WGS tasks.
Step 1: Deploy an application
All components that are required to create a WDL workflow are packaged into a Helm
chart and published in the Marketplace module of the ACK console. This way, you can
deploy these components as an application with a few clicks by using the chart from
Marketplace.
- Log on to the ACK console.
- In the left-side navigation pane of the ACK console, choose Marketplace.
- On the Marketplace page, click the App Catalog tab. Then, find and click ack-ags-wdl.
- On the ack-ags-wdl page, click Deploy.
- In the Deploy wizard, select a cluster and namespace, and then click Next.
- On the Parameters wizard page, configure the parameters and click OK.
# PVCs to be mounted
naspvcs:
  - naspvc1
  - naspvc2
osspvcs:
  - osspvc1
  - osspvc2
# The following settings are applied to the transfer-pvc.yaml and storageclass.yaml files.
naspvc1:
  # Change to the actual mount target (URL) of your NAS/NFS server.
  server: "XXXXXX-fbi71.cn-beijing.nas.aliyuncs.com"
  # The absolute path of your data root on the NAS/NFS server, for example, "/tarTest".
  path: "/tarTest"
  # The storage driver: csi (default) or flexVolume. For early Kubernetes versions (<= 1.14.x), change it to flexVolume.
  driver: "csi"
  nasbasepath: "/ags-wdl-nas"
  # The NFS protocol version.
  mountVers: "3"
  # Mount options. For a self-managed NFS server, you can remove the noresvport option.
  mountOptions: "nolock,tcp,noresvport"
naspvc2:
  # Change to the actual mount target (URL) of your NAS/NFS server.
  server: "XXXXXXX-fbi71.cn-beijing.nas.aliyuncs.com"
  # The absolute path of your data root on the NAS/NFS server, for example, "/tarTest".
  path: "/tarTest/bwatest"
  # The storage driver: csi (default) or flexVolume. For early Kubernetes versions (<= 1.14.x), change it to flexVolume.
  driver: "csi"
  nasbasepath: "/ags-wdl-nas2"
  # The NFS protocol version.
  mountVers: "3"
  # Mount options. For a self-managed NFS server, you can remove the noresvport option.
  mountOptions: "nolock,tcp,noresvport"
osspvc1:
  # Change to the actual bucket name.
  bucket: "oss-test-tsk"
  # Change to the actual bucket endpoint.
  url: "oss-cn-beijing.aliyuncs.com"
  # Mount options. You do not need to change the default value.
  options: "-o max_stat_cache_size=0 -o allow_other"
  akid: "XXXXXXX"
  aksecret: "XXXXXXX"
  # The absolute path of your data in the OSS bucket.
  path: "/"
  ossbasepath: "/ags-wdl-oss"
osspvc2:
  bucket: "oss-test-tsk"
  url: "oss-cn-beijing.aliyuncs.com"
  options: "-o max_stat_cache_size=0 -o allow_other"
  akid: "XXXXXXX"
  aksecret: "XXXXXXX"
  path: "/input"
  ossbasepath: "/ags-wdl-oss-input"
# In your wdl.json, prefix the relative path of input and output data with the base path of the mount.
# For example:
# {
#   "wf.bwa_mem_tool.reference": "/ags-wdl/reference/subset_assembly.fa.gz",    # The name of the reference file.
#   "wf.bwa_mem_tool.reads_fq1": "/ags-wdl/fastq_sample/SRR1976948_1.fastq.gz", # The name of the fastq1 file.
#   "wf.bwa_mem_tool.reads_fq2": "/ags-wdl/fastq_sample/SRR1976948_2.fastq.gz", # The name of the fastq2 file.
#   "wf.bwa_mem_tool.outputdir": "/ags-wdl/bwatest/output",  # The output path. The results are written to xxxxxxxx.cn-beijing.nas.aliyuncs.com:/mydata_root/bwatest/output. Make sure that the path exists.
#   "wf.bwa_mem_tool.fastqFolder": "fastqfolder"             # The working directory of the task.
# }
# The working directory. You can choose nasbasepath or ossbasepath as your workdir.
workdir: "/ags-wdl-oss"
# The provisioning directory. You can choose nasbasepath or ossbasepath as your provisiondir. Tasks dynamically create volumes under it for input and output data.
provisiondir: "/ags-wdl-oss-input"
# Schedules tasks to the specified nodes, for example, nodeselector: "node-type=wdl".
nodeselector: "node-type=wdl"
# Settings in the cromwellserver-svc.yaml file.
cromwellserversvc:
  # The node port of the Cromwell server for internal access. For external access, you can use a load balancer instead.
  nodeport: "32567"
# Settings in the config.yaml file.
config:
  # The project namespace. Default value: wdl. You do not need to modify the value in most scenarios.
  namespace: "wdl"
  # The TESK backend domain name. Default value: tesk-api. You do not need to modify the value in most scenarios.
  teskserver: "tesk-api"
  # The domain name of the Cromwell server. Default value: cromwellserver. You do not need to modify the value in most scenarios.
  cromwellserver: "cromwellserver"
  # The port of the Cromwell server. Default value: 8000. You do not need to modify the value in most scenarios.
  cromwellport: "8000"
- Mount volumes to store application data.
You can use NAS and OSS volumes to store application data.
| Parameter | Description | Required | Configuration method |
| --- | --- | --- | --- |
| naspvcs | The NAS volumes that you want to mount to the ACK cluster. | Yes | To mount two persistent volume claims (PVCs) named naspvc1 and naspvc2, set `naspvcs` to the list `[naspvc1, naspvc2]`. |
| osspvcs | The OSS volumes that you want to mount to your cluster. | Yes | To mount two PVCs named osspvc1 and osspvc2, set `osspvcs` to the list `[osspvc1, osspvc2]`. |
- Configure the NAS volume.
Each NAS volume is mounted to the container path that is specified by the nasbasepath
parameter. To set the naspvcs parameter, you must configure each NAS volume that you
want to mount.
Note: Use the default settings for other parameters.
| Parameter | Description | Required | Configuration method |
| --- | --- | --- | --- |
| server | The mount target of the NAS file system. | Yes | The mount target of the NAS file system that stores input and output data. Change the value to the URL of your NAS file system. |
| path | The subdirectory of the NAS file system that you want to mount. | Yes | Change the value to the path of your data on the NAS file system. |
| driver | The volume plug-in: CSI or FlexVolume. | Yes | The volume plug-in that is used by your cluster. The default plug-in is Container Storage Interface (CSI). If your cluster uses FlexVolume, set the value to flexVolume. |
| nasbasepath | The application directory to which you want to mount the NAS subdirectory. | Yes | To write data to and read data from the NAS file system, replace server:path in the absolute path with nasbasepath. For example, if server is xxx-xxx.cn-beijing.nas.aliyuncs.com, path is /wdl, and nasbasepath is /ags-wdl, the NAS path of the input file test.fq.gz is xxx-xxx.cn-beijing.nas.aliyuncs.com:/wdl/test/test.fq.gz, and the application accesses it through the /ags-wdl/test/test.fq.gz path. |
| mountOptions | The mount options. | Yes | The options that are required to mount the NAS file system. If you use a self-managed NFS server, you must modify this parameter. |
| mountVers | The version of the NFS protocol. | Yes | The NFS protocol version that is used for the mount. If you use a self-managed NFS server, you must modify this parameter. |
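The server:path to nasbasepath substitution described above is plain string prefixing. The following sketch illustrates the rule with the example values from the table; the helper name is hypothetical and not part of the product:

```shell
# Hypothetical helper: map a path relative to the NAS export to the
# path that the application uses inside the container (nasbasepath rule).
nas_to_container_path() {
  local rel_path="$1"      # path relative to the NAS export, e.g. /test/test.fq.gz
  local nasbasepath="$2"   # e.g. /ags-wdl
  printf '%s%s\n' "$nasbasepath" "$rel_path"
}

# NAS side:       xxx-xxx.cn-beijing.nas.aliyuncs.com:/wdl/test/test.fq.gz
# Container side:
nas_to_container_path /test/test.fq.gz /ags-wdl   # prints /ags-wdl/test/test.fq.gz
```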
- Configure the OSS volume.
Each OSS volume is mounted to the container path that is specified by the ossbasepath
parameter. To set the osspvcs parameter, you must configure each OSS volume that you
want to mount.
| Parameter | Description | Required | Configuration method |
| --- | --- | --- | --- |
| bucket | The name of the OSS bucket. | Yes | Change the value to the name of your OSS bucket. |
| url | The endpoint of the OSS bucket. | Yes | Change the value to the endpoint of your OSS bucket. |
| options | The mount options. | Yes | Default value: -o max_stat_cache_size=0 -o allow_other. You do not need to modify the value. |
| akid | The AccessKey ID. | Yes | The AccessKey information of your account, which is stored in a Secret in the cluster. |
| aksecret | The AccessKey secret. | Yes | The AccessKey information of your account, which is stored in a Secret in the cluster. |
| path | The subdirectory of the OSS bucket that you want to mount. | Yes | Specify the subdirectory of the OSS bucket that you want to mount. |
| ossbasepath | The application directory to which you want to mount the OSS subdirectory. | Yes | To write data to and read data from the OSS bucket, replace bucket:path in the absolute path with ossbasepath. For example, if bucket is shenzhen-test, path is /wdl, and ossbasepath is /ags-wdl, the OSS path of the input file test.fq.gz is shenzhen-test:/wdl/test/test.fq.gz, and you enter /ags-wdl/test/test.fq.gz as the path of the input file. |
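The same substitution applies in reverse: stripping the ossbasepath prefix from a container path and prepending bucket:path recovers the object location. A minimal sketch with the example values above; the helper name is hypothetical:

```shell
# Hypothetical helper: map an in-container path back to the OSS object location.
container_to_oss_path() {
  local container_path="$1"  # e.g. /ags-wdl/test/test.fq.gz
  local ossbasepath="$2"     # e.g. /ags-wdl
  local bucket="$3"          # e.g. shenzhen-test
  local oss_subdir="$4"      # mounted subdirectory of the bucket, e.g. /wdl
  # Strip the ossbasepath prefix, then prepend bucket:path.
  printf '%s:%s%s\n' "$bucket" "$oss_subdir" "${container_path#"$ossbasepath"}"
}

container_to_oss_path /ags-wdl/test/test.fq.gz /ags-wdl shenzhen-test /wdl
# prints shenzhen-test:/wdl/test/test.fq.gz
```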
- Set other parameters.
| Parameter | Description | Required | Configuration method |
| --- | --- | --- | --- |
| workdir | The working directory. | Yes | Specify nasbasepath or ossbasepath as the directory. All files that are generated by tasks, including scripts and error log files, are stored in this directory. |
| provisiondir | The directory to which dynamically provisioned persistent volumes (PVs) are mounted. | Yes | Specify nasbasepath or ossbasepath as the directory. All input and output data generated by tasks is stored in the PVs that are mounted to this directory. |
| nodeselector | The labels that are used to schedule tasks. | No | Set this parameter to schedule tasks to nodes with the specified labels. If you set node-type=wdl, all tasks are scheduled to nodes with the node-type=wdl label. |
| nodeport | The port that you want to open on the Cromwell server. | Yes | If you want to use the Widdler command-line tool to submit tasks, you must set this parameter. Default value: 32567. |
| namespace | The namespace that is used by the project. | Yes | Enter the namespace where the application is deployed. Default value: wdl. |
| teskserver | The domain name of the TESK service. | Yes | Enter the domain name of the TESK service. You can use the default value. |
| cromwellserver | The domain name of the Cromwell server. | Yes | Enter the domain name of the Cromwell server. You can use the default value. |
| cromwellport | The port of the Cromwell server. | Yes | Enter the port of the Cromwell server. You can use the default value: 8000. |
Run the following command. If the output shows that the cromwellcli, cromwellserver,
and tesk-api components run as expected, the application is deployed.
kubectl get pods -n wdl
Expected output:
NAME READY STATUS RESTARTS AGE
cromwellcli-85cb66b98c-bv4kt 1/1 Running 0 5d5h
cromwellserver-858cc5cc8-np2mc 1/1 Running 0 5d5h
tesk-api-5d8676d597-wtmhc 1/1 Running 0 5d5h
Step 2: Submit tasks
You can use AGS or a CLI to submit tasks to a cluster from your on-premises machine.
We recommend that you use AGS to submit tasks.
Use AGS to submit tasks
The latest version of the AGS CLI allows you to submit WDL tasks. To use AGS to submit
tasks, download the AGS CLI and specify the address of the Cromwell server. For more
information about how to download the AGS CLI, see Introduction to AGS CLI.
- Create a bwa.wdl file and a bwa.json file.
- Sample bwa.wdl file:
task bwa_mem_tool {
  Int threads
  Int min_seed_length
  Int min_std_max_min
  String reference
  String reads_fq1
  String reads_fq2
  String outputdir
  String fastqFolder
  command {
    mkdir -p /bwa/${fastqFolder}
    cd /bwa/${fastqFolder}
    rm -rf SRR1976948*
    wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/subset_assembly.fa.gz
    wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_1.fastq.gz
    wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_2.fastq.gz
    gunzip -c ${reference} > subset_assembly.fa
    gunzip -c ${reads_fq1} | head -800000 > SRR1976948.1
    gunzip -c ${reads_fq2} | head -800000 > SRR1976948.2
    bwa index subset_assembly.fa
    bwa aln subset_assembly.fa SRR1976948.1 > ${outputdir}/SRR1976948.1.untrimmed.sai
    bwa aln subset_assembly.fa SRR1976948.2 > ${outputdir}/SRR1976948.2.untrimmed.sai
  }
  output {
    File sam1 = "${outputdir}/SRR1976948.1.untrimmed.sai"
    File sam2 = "${outputdir}/SRR1976948.2.untrimmed.sai"
  }
  runtime {
    docker: "registry.cn-hangzhou.aliyuncs.com/plugins/wes-tools:v3"
    memory: "2GB"
    cpu: 1
  }
}
workflow wf {
  call bwa_mem_tool
}
- Sample bwa.json file:
{
  "wf.bwa_mem_tool.reference": "subset_assembly.fa.gz",    # The name of the reference file.
  "wf.bwa_mem_tool.reads_fq1": "SRR1976948_1.fastq.gz",    # The name of the fastq1 file.
  "wf.bwa_mem_tool.reads_fq2": "SRR1976948_2.fastq.gz",    # The name of the fastq2 file.
  "wf.bwa_mem_tool.outputdir": "/ags-wdl/bwatest/output",  # The output path. The results are written to xxx-xxx.cn-beijing.nas.aliyuncs.com:/wdl/bwatest/output. Make sure that the path exists.
  "wf.bwa_mem_tool.fastqFolder": "fastqfolder"             # The working directory of the task.
}
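The `#` annotations in the sample above are explanatory only. Cromwell expects plain JSON, so the bwa.json file that you actually submit must not contain comments, for example:

```json
{
  "wf.bwa_mem_tool.reference": "subset_assembly.fa.gz",
  "wf.bwa_mem_tool.reads_fq1": "SRR1976948_1.fastq.gz",
  "wf.bwa_mem_tool.reads_fq2": "SRR1976948_2.fastq.gz",
  "wf.bwa_mem_tool.outputdir": "/ags-wdl/bwatest/output",
  "wf.bwa_mem_tool.fastqFolder": "fastqfolder"
}
```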
- Redirect requests that are destined for port 32567 to the Cromwell server.
kubectl port-forward svc/cromwellserver 32567:32567
- Run the following command to configure the endpoint of the Cromwell server.
ags config init
Expected output:
Please input your AccessKeyID
xxxxx
Please input your AccessKeySecret
xxxxx
Please input your cromwellserver url
xxx-xxx.cn-beijing.nas.aliyuncs.com:32567
- Run the following command to submit a WDL task:
ags wdl run resource/bwa.wdl resource/bwa.json --watch # The --watch flag makes the command synchronous: it returns only after the current task succeeds or fails.
Expected output:
INFO[0000] bd747360-f82c-4cd2-94e0-b549d775f1c7 Submitted
- Optional: You can query or delete WDL tasks from your on-premises machine.
Use the CLI to submit tasks
To use the CLI to submit tasks, you must create a WDL file and a JSON file. Then,
use an image provided by AGS to submit tasks. Specify the endpoint of the Cromwell
server in the following format: cluster IP:node port.
- Create a bwa.wdl file and a bwa.json file.
- Sample bwa.wdl file:
task bwa_mem_tool {
  Int threads
  Int min_seed_length
  Int min_std_max_min
  String reference
  String reads_fq1
  String reads_fq2
  String outputdir
  String fastqFolder
  command {
    mkdir -p /bwa/${fastqFolder}
    cd /bwa/${fastqFolder}
    rm -rf SRR1976948*
    wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/subset_assembly.fa.gz
    wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_1.fastq.gz
    wget https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_2.fastq.gz
    gunzip -c ${reference} > subset_assembly.fa
    gunzip -c ${reads_fq1} | head -800000 > SRR1976948.1
    gunzip -c ${reads_fq2} | head -800000 > SRR1976948.2
    bwa index subset_assembly.fa
    bwa aln subset_assembly.fa SRR1976948.1 > ${outputdir}/SRR1976948.1.untrimmed.sai
    bwa aln subset_assembly.fa SRR1976948.2 > ${outputdir}/SRR1976948.2.untrimmed.sai
  }
  output {
    File sam1 = "${outputdir}/SRR1976948.1.untrimmed.sai"
    File sam2 = "${outputdir}/SRR1976948.2.untrimmed.sai"
  }
  runtime {
    docker: "registry.cn-hangzhou.aliyuncs.com/plugins/wes-tools:v3"
    memory: "2GB"
    cpu: 1
  }
}
workflow wf {
  call bwa_mem_tool
}
- Sample bwa.json file:
{
  "wf.bwa_mem_tool.reference": "subset_assembly.fa.gz",    # The name of the reference file.
  "wf.bwa_mem_tool.reads_fq1": "SRR1976948_1.fastq.gz",    # The name of the fastq1 file.
  "wf.bwa_mem_tool.reads_fq2": "SRR1976948_2.fastq.gz",    # The name of the fastq2 file.
  "wf.bwa_mem_tool.outputdir": "/ags-wdl/bwatest/output",  # The output path. The results are written to xxx-xxx.cn-beijing.nas.aliyuncs.com:/wdl/bwatest/output. Make sure that the path exists.
  "wf.bwa_mem_tool.fastqFolder": "fastqfolder"             # The working directory of the task.
}
- Run the following command to submit a WDL task:
docker run -e CROMWELL_SERVER=192.16*.*.** -e CROMWELL_PORT=30384 registry.cn-beijing.aliyuncs.com/tes-wes/cromwellcli:v1 run resources/bwa.wdl resources/bwa.json
Expected output:
-------------Cromwell Links-------------
http://192.16*.*.**:30384/api/workflows/v1/5d7ffc57-6883-4658-adab-3f508826322a/metadata
http://192.16*.*.**:30384/api/workflows/v1/5d7ffc57-6883-4658-adab-3f508826322a/timing
{
"status": "Submitted",
"id": "5d7ffc57-6883-4658-adab-3f508826322a"
}
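The Cromwell links in the output follow the standard Cromwell REST API path scheme. The following sketch rebuilds them from the server address, port, and workflow ID; the server address and port below are placeholders, not values from your cluster:

```shell
# Build the metadata and timing URLs for a submitted workflow,
# matching the standard Cromwell REST API paths shown in the output above.
cromwell_links() {
  local server="$1" port="$2" workflow_id="$3"
  printf 'http://%s:%s/api/workflows/v1/%s/metadata\n' "$server" "$port" "$workflow_id"
  printf 'http://%s:%s/api/workflows/v1/%s/timing\n' "$server" "$port" "$workflow_id"
}

cromwell_links 192.168.0.10 30384 5d7ffc57-6883-4658-adab-3f508826322a
# prints:
# http://192.168.0.10:30384/api/workflows/v1/5d7ffc57-6883-4658-adab-3f508826322a/metadata
# http://192.168.0.10:30384/api/workflows/v1/5d7ffc57-6883-4658-adab-3f508826322a/timing
```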