Genomics computing is complex and data-intensive: scientists and life science data analysts must aggregate and analyze rapidly growing volumes of data with efficiency and accuracy. An effective way to tackle these challenges is to automate workflow orchestration. Argo Workflows is an outstanding workflow engine that features containerization, flexibility, and usability. This topic describes how to use Argo Workflows to orchestrate genomics computing workflows.
Background Information
Genomics computing workflows
Genomics computing workflows are used for data analytics in genomics research. Such a workflow consists of a collection of interrelated computing tasks and data processing steps that run in a specific order, covering complex steps such as data preprocessing, sequence alignment, mutation detection, gene expression analysis, and phylogenetic tree construction.
Argo Workflows for genomics computing workflow orchestration
Argo Workflows is an open source Kubernetes-native workflow engine that facilitates flexible and efficient orchestration of workflows in containerized environments. Argo Workflows is particularly suitable for genomics computing due to the following advantages:
Containerization and environment consistency: Genomic analysis relies on a variety of software tools and dependency libraries. Argo Workflows encapsulates each analysis step in a Docker container that can be deployed across platforms and environments, which ensures cross-platform consistency and reusability of genomic analysis tasks.
Flexible orchestration: Most genomics workflows involve multiple steps, conditional branches, and parallel jobs. Argo Workflows supports complex execution logic and conditions, allowing you to customize workflows in a simplified and explicit manner.
However, Argo Workflows also faces the following challenges:
Large-scale O&M: When genomics computing workflows consist of a large number of tasks, cluster optimization and maintenance policies are difficult to implement efficiently, especially for users with limited experience in cluster maintenance.
Complex workflow orchestration: Scientific experiments involve large parameter spaces and numerous steps. A single experiment may need to run tens of thousands of jobs. At this scale, open source workflow engines cannot efficiently orchestrate workflows for scientific experiments.
Cost optimization and resource elasticity: Genomic analysis consumes large amounts of compute resources. Users want intelligent resource scheduling that maximizes resource utilization, and automatic resource scaling based on workload demand. Open source solutions may fail to meet these requirements.
To tackle the preceding challenges in large-scale O&M, complex workflow orchestration, cost optimization, and resource elasticity, Distributed Cloud Container Platform for Kubernetes (ACK One) provides Kubernetes clusters for distributed Argo workflows.
Kubernetes clusters for distributed Argo workflows
Kubernetes clusters for distributed Argo workflows (workflow clusters) are deployed on top of a serverless architecture. This type of cluster runs Argo workflows on elastic container instances and optimizes cluster parameters to schedule large-scale workflows with efficiency, elasticity, and cost-effectiveness. Workflow clusters can run workflows concurrently, in loops, or with automatic retries. These execution policies are suitable for genomics computing and complex workflow orchestration.
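For example, loops and retries are declared directly in the workflow spec. The following is a minimal sketch, not a production manifest: the image and the sample names in withItems are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loop-retry-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: process                 # fan out: one pod per item, run concurrently
        template: process-sample
        arguments:
          parameters:
          - name: sample
            value: "{{item}}"
        withItems: ["sample-a", "sample-b", "sample-c"]  # placeholder sample names
  - name: process-sample
    retryStrategy:
      limit: "3"                      # retry a failed pod up to three times
    inputs:
      parameters:
      - name: sample
    container:
      image: alpine:3.19              # placeholder image
      command: [sh, -c]
      args: ["echo processing {{inputs.parameters.sample}}"]
```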
Argo Workflows is suitable for genomics computing and data-intensive scientific tasks due to its advantages in containerization, flexibility, and usability. Argo Workflows can greatly improve the automation level, resource utilization, and data analysis efficiency of workflows. The Alibaba Cloud ACK One team is one of the first teams to apply Argo Workflows to large-scale workflow orchestration. The ACK One team has rich experience in best practices for Argo Workflows in scenarios including genomics computing, autonomous driving, and financial simulation. You can join the DingTalk group (35688562) to contact the ACK One team.
Use Argo Workflows to orchestrate genomics computing workflows
This section provides an example on how to use Argo Workflows to configure and run a sequence alignment workflow based on the Burrows-Wheeler Aligner (BWA).
Mount an Object Storage Service (OSS) volume to the workflow. This way, the workflow can access files in the OSS volume in the same way the workflow accesses local files. For more information, see Use volumes.
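As a reference, the following is a minimal sketch of a statically provisioned OSS volume that uses the ACK CSI driver. The bucket name, endpoint, and Secret name are placeholders; see Use volumes for the authoritative configuration.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: bwa-oss-pv
  labels:
    alicloud-pvname: bwa-oss-pv
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: bwa-oss-pv
    nodePublishSecretRef:                  # Secret that holds the AccessKey pair
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: "my-genomics-bucket"         # placeholder bucket name
      url: "oss-cn-hangzhou.aliyuncs.com"  # placeholder OSS endpoint
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bwa-oss-pvc
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: bwa-oss-pv
```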
Create a workflow based on the following YAML content. For more information, see Create a workflow.
The workflow consists of three steps:
bwaprepare: the data preparation step. This step downloads and decompresses the FASTQ files and the reference file, and then creates an index of the reference genome.
bwamap: the sequence alignment step. This step aligns the sequencing data to the reference genome and processes multiple files in parallel.
bwaindex: the result generation step. This step generates a Binary Alignment Map (BAM) file to record the alignments, sorts the alignments, and then creates an index of the BAM file. You can also view the alignments.
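The following YAML is a minimal sketch of such a workflow. The container images, file names, and the PVC name (bwa-oss-pvc, matching the volume sketched above) are illustrative assumptions, not the exact manifest.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: bwa-
spec:
  entrypoint: bwa-pipeline
  volumes:
  - name: workdir
    persistentVolumeClaim:
      claimName: bwa-oss-pvc              # OSS-backed PVC from the previous step
  templates:
  - name: bwa-pipeline
    dag:
      tasks:
      - name: bwaprepare
        template: bwaprepare
      - name: bwamap
        template: bwamap
        dependencies: [bwaprepare]
        arguments:
          parameters:
          - name: fastq
            value: "{{item}}"
        withItems: ["sample_1.fastq", "sample_2.fastq"]  # placeholder FASTQ files
      - name: bwaindex
        template: bwaindex
        dependencies: [bwamap]
  # Data preparation: decompress the inputs and index the reference genome.
  - name: bwaprepare
    container:
      image: biocontainers/bwa:v0.7.17_cv1               # placeholder BWA image
      command: [sh, -c]
      args: ["cd /data && gunzip -kf *.gz && bwa index reference.fasta"]
      volumeMounts:
      - name: workdir
        mountPath: /data
  # Sequence alignment: one parallel pod per FASTQ file.
  - name: bwamap
    inputs:
      parameters:
      - name: fastq
    container:
      image: biocontainers/bwa:v0.7.17_cv1               # placeholder BWA image
      command: [sh, -c]
      args: ["cd /data && bwa mem reference.fasta {{inputs.parameters.fastq}} > {{inputs.parameters.fastq}}.sam"]
      volumeMounts:
      - name: workdir
        mountPath: /data
  # Result generation: sort the alignments into BAM files and index them.
  - name: bwaindex
    container:
      image: biocontainers/samtools:v1.9-4-deb_cv1       # placeholder samtools image
      command: [sh, -c]
      args: ["cd /data && for f in *.sam; do samtools sort -o ${f%.sam}.bam $f && samtools index ${f%.sam}.bam; done"]
      volumeMounts:
      - name: workdir
        mountPath: /data
```

You can submit a manifest like this with kubectl or the Argo CLI, as described in Create a workflow.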
After the workflow starts running, you can go to the Workflow Console (Argo) to view the directed acyclic graph (DAG) of the workflow and check the result.

You can also log on to the OSS console to check whether a file that stores the alignments is generated.
References
For more information about Kubernetes clusters for distributed Argo workflows, see Overview of Kubernetes clusters for distributed Argo workflows.
For more information about Argo Workflows, see Open source Argo Workflows.