All Products
Search
Document Center

Container Service for Kubernetes:Orchestrate gene computing workflows with Argo Workflows

Last Updated:Apr 03, 2026

Argo Workflows is an open source, Kubernetes-native workflow engine for automating multi-step gene computing pipelines. This guide shows you how to run a complete Burrows-Wheeler Aligner (BWA) sequence alignment pipeline on an Alibaba Cloud ACK One distributed workflow cluster.

Background

Gene computing workflows

A gene computing workflow is a sequence of computational tasks designed to achieve a specific analysis goal in genomics research. Typical workflows include data preprocessing, sequence alignment, variant calling, gene expression analysis, and phylogenetic tree construction—each step producing output that feeds the next.

Why Argo Workflows for gene computing

Argo Workflows handles the complexity of multi-step genomic pipelines in two ways:

  • Containerized execution: Each analysis step runs in its own Docker container, ensuring consistent results across environments and eliminating tool dependency conflicts.

  • Flexible orchestration: Workflows can branch conditionally, run steps in parallel, and loop over large datasets—all defined in a single YAML file.

Challenges with open-source Argo Workflows

Running Argo Workflows at scale in genomics introduces three operational challenges:

  • Cluster operations: Optimizing and maintaining a large Kubernetes cluster requires deep infrastructure expertise that most research teams lack.

  • Scale: Scientific experiments often involve vast parameter spaces and thousands of concurrent jobs, pushing beyond what standard open-source configurations can handle.

  • Resource efficiency: Gene data analysis is resource-intensive. Intelligently scheduling workloads and scaling compute on demand requires additional tooling on top of open-source Argo.

The Alibaba Cloud ACK One team built the distributed workflow Argo-based cluster to address these challenges directly.

Distributed workflow Argo-based cluster

The distributed workflow Argo-based cluster is built on open-source Argo Workflows with a serverless model: workflows run on Elastic Container Instance (ECI) rather than long-lived nodes. This design provides:

  • Elastic scaling: Kubernetes cluster parameters are optimized for large-scale job scheduling. Compute capacity scales up and down automatically with your workload.

  • Cost efficiency: Workflows use preemptible ECI instances to reduce compute costs.

  • Rich execution strategies: Concurrency, loops, retries, and DAG dependencies are all supported—covering the full range of gene computing orchestration needs.

image
Argo Workflows is widely used in gene computing, autonomous driving, and financial simulation. To discuss best practices with the ACK One team, join their DingTalk group (ID: 35688562).

Run a BWA sequence alignment workflow

This example implements a classic BWA sequence alignment pipeline with three stages. BWA sequence alignment maps raw sequencing reads back to a reference genome—the first step in identifying genetic variants.

Stage What it does
bwaprepare Downloads and decompresses the FASTQ sequencing files and reference genome, then builds the BWA index
bwamap Aligns the sequencing reads to the reference genome; runs in parallel for each FASTQ file
bwaindex Generates paired-end alignments, then creates, sorts, and indexes the final BAM file

The three stages run as a Directed Acyclic Graph (DAG): bwaprepare must complete before bwamap starts, and bwamap must complete before bwaindex starts. Within bwamap, the two FASTQ files are processed in parallel.

Prerequisites

Before you begin, ensure that you have:

Create and run the workflow

Create the workflow using the following YAML. For details on submitting a workflow, see Create a workflow.

To use this workflow with your own data, update the four parameters at the top of the arguments block:

Parameter Description Example value
fastqFolder Directory on the mounted OSS volume where files are stored /gene
reference URL of your reference genome (.fa.gz) https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/subset_assembly.fa.gz
fastq1 URL of the first FASTQ sequencing file https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_1.fastq.gz
fastq2 URL of the second FASTQ sequencing file https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_2.fastq.gz

The rest of the YAML defines the pipeline logic and stays the same for any dataset.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: bwa-oss-
spec:
  entrypoint: bwa-oss
  arguments:
    parameters:
    - name: fastqFolder     # Directory on the OSS volume where files are saved.
      value: /gene
    - name: reference       # Reference genome file URL.
      value: https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/subset_assembly.fa.gz
    - name: fastq1          # First raw sequencing data file URL.
      value: https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_1.fastq.gz
    - name: fastq2          # Second raw sequencing data file URL.
      value: https://ags-public.oss-cn-beijing.aliyuncs.com/alignment/SRR1976948_2.fastq.gz

  volumes:                  # Mount the OSS volume so all stages share the same storage.
  - name: ossdir
    persistentVolumeClaim:
      claimName: pvc-oss

  templates:
  - name: bwaprepare        # Stage 1: Download files and build the reference genome index.
    container:
      image: registry.cn-beijing.aliyuncs.com/geno/alltools:v0.2
      imagePullPolicy: Always
      command: [sh,-c]
      args:
      - mkdir -p /bwa{{workflow.parameters.fastqFolder}}; cd /bwa{{workflow.parameters.fastqFolder}}; rm -rf SRR1976948*;
        wget {{workflow.parameters.reference}};
        wget {{workflow.parameters.fastq1}};
        wget {{workflow.parameters.fastq2}};
        gzip -d subset_assembly.fa.gz;
        gunzip -c SRR1976948_1.fastq.gz | head -800000 > SRR1976948.1;
        gunzip -c SRR1976948_2.fastq.gz | head -800000 > SRR1976948.2;
        bwa index subset_assembly.fa;
      volumeMounts:
      - name: ossdir
        mountPath: /bwa
    retryStrategy:          # Retry up to 3 times to handle transient network failures during file downloads.
      limit: 3

  - name: bwamap            # Stage 2: Align one FASTQ file to the reference genome.
    inputs:
      parameters:
      - name: object
    container:
      image: registry.cn-beijing.aliyuncs.com/geno/alltools:v0.2
      imagePullPolicy: Always
      command:
      - sh
      - -c
      args:
      - cd /bwa{{workflow.parameters.fastqFolder}};
        bwa aln subset_assembly.fa {{inputs.parameters.object}} > {{inputs.parameters.object}}.untrimmed.sai;
      volumeMounts:
      - name: ossdir
        mountPath: /bwa
    retryStrategy:
      limit: 3

  - name: bwaindex          # Stage 3: Generate paired-end alignments and produce the final sorted BAM file.
    container:
      command:
      - sh
      - -c
      args:
      - cd /bwa{{workflow.parameters.fastqFolder}};
        bwa sampe subset_assembly.fa SRR1976948.1.untrimmed.sai SRR1976948.2.untrimmed.sai SRR1976948.1 SRR1976948.2 > SRR1976948.untrimmed.sam;
        samtools import subset_assembly.fa SRR1976948.untrimmed.sam SRR1976948.untrimmed.sam.bam;
        samtools sort SRR1976948.untrimmed.sam.bam -o SRR1976948.untrimmed.sam.bam.sorted.bam;
        samtools index SRR1976948.untrimmed.sam.bam.sorted.bam;
        samtools tview SRR1976948.untrimmed.sam.bam.sorted.bam subset_assembly.fa -p k99_13588:1000 -d T;
      image: registry.cn-beijing.aliyuncs.com/geno/alltools:v0.2
      imagePullPolicy: Always
      volumeMounts:
      - mountPath: /bwa/
        name: ossdir
    retryStrategy:
      limit: 3

  - name: bwa-oss           # DAG orchestrator: defines stage dependencies and parallel execution.
    dag:
      tasks:
      - name: bwaprepare
        template: bwaprepare

      - name: bwamap
        template: bwamap
        dependencies: [bwaprepare]          # Runs after bwaprepare completes.
        arguments:
          parameters:
          - name: object
            value: "{{item}}"
        withItems: ["SRR1976948.1","SRR1976948.2"]   # Processes both files in parallel.

      - name: bwaindex
        template: bwaindex
        dependencies: [bwamap]              # Runs after both bwamap tasks complete.

Verify the results

After the workflow completes, check the results in two places:

  1. Open the Workflow Console (Argo) to view the DAG and confirm all stages succeeded.

    Workflow Console showing the DAG execution

  2. Log in to the OSS console and find the alignment result files in your OSS folder. The final output is SRR1976948.untrimmed.sam.bam.sorted.bam and its index.

    image

What's next