Container Service for Kubernetes: Migrate batch jobs to a Kubernetes cluster for distributed Argo workflows

Last Updated: Jan 16, 2025

Batch jobs are commonly used in data processing, simulation, and scientific computing, and they usually require large amounts of computing resources. Kubernetes clusters for distributed Argo workflows are built on the open source Argo Workflows project and comply with open source workflow standards. In workflow clusters, you can easily orchestrate workflows, run each step in a container, and complete compute-intensive jobs such as large-scale machine learning and simulation within a short period of time. You can also quickly run continuous integration and continuous delivery (CI/CD) pipeline jobs. Migrating scheduled jobs and batch jobs to workflow clusters reduces O&M complexity and costs.

Background

Workflow clusters are built on Kubernetes clusters, host open source Argo Workflows, and use a serverless workflow engine.

Terms used in batch computing

Job

You submit task units, such as shell scripts, Linux executable files, or Docker container images, to the batch computing system. The system then allocates compute resources and starts a job.

Array job

An array job is a collection of similar jobs that are submitted and run as a batch. All jobs in an array job use the same job definition and are distinguished by index. The dataset or parameters processed by each job instance may differ.

Job definition

A job definition specifies how a job runs. You need to create a job definition before you can run a job.

A job definition usually consists of the image used to run the job, commands and parameters, required amounts of CPU and memory resources, environment variables, and disk space.

Job queue

Jobs that you submit to the batch computing system are delivered to a job queue. A job leaves the queue after the job is scheduled. You can specify the priorities of jobs in a job queue and associate a job queue with a compute environment.

Compute environment

A compute environment consists of the compute resources that are used to run jobs. For each compute environment, you need to specify settings such as the instance type, the vSwitch, the maximum and minimum numbers of vCPUs, and the unit price of preemptible instances.

Terms used in distributed Argo workflows

Template

A template defines a task (or job) and is the basic building block of a workflow. Each workflow must contain at least one template. A template also specifies the Kubernetes container configuration and the input and output parameters.

Workflow

A workflow consists of one or more tasks (or templates). You can orchestrate these tasks in a variety of ways: run them sequentially, run them in parallel, or run only the tasks that meet specified conditions. After a workflow is created, its tasks run in pods of a Kubernetes cluster.

Workflow template

Workflow templates are reusable static workflow definitions, which are similar to functions. A workflow template can be referenced and run in different workflows. You can reuse existing workflow templates when defining complex workflows.
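
As an illustration, the following is a minimal sketch of a reusable WorkflowTemplate and a Workflow that calls it through templateRef. The resource names, parameter, and image are examples only, not values from this guide.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate          # reusable, static workflow definition
metadata:
  name: echo-template           # example name
spec:
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo {{inputs.parameters.message}}" ]
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: use-template-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: call-echo
        templateRef:
          name: echo-template   # the WorkflowTemplate resource above
          template: echo        # the template defined inside it
        arguments:
          parameters:
          - name: message
            value: "hello from a reusable template"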

ACK Serverless cluster

A Kubernetes cluster for distributed Argo workflows comes with a built-in compute environment, so you do not need to manually create or manage one. After a workflow is submitted, the cluster runs the tasks in the workflow on serverless elastic container instances, which also saves you the need to maintain Kubernetes nodes. With elastic container instances, you can run large-scale workflows that use tens of thousands of pods and hundreds of thousands of vCPUs, and compute resources are automatically released after the workflows are complete. As a result, Kubernetes clusters for distributed Argo workflows accelerate workflows and reduce costs.

Compare batch computing and Argo workflows

Batch computing

  • You need to learn the specifications and usage notes of job definitions. You may also need to purchase devices or software from designated vendors.

  • You need to manage compute environments, including specifying instance types and vSwitches. The overall O&M cost is high because batch computing is not serverless.

  • Because compute environments are limited, you must specify the priorities of jobs in the job queue, which makes the configuration even more complex.

Argo workflows

  • Argo workflows are cloud-native workflows built on Kubernetes clusters and open source Argo Workflows, so you do not need to purchase products or software from designated vendors.

  • Argo workflows support complex task orchestration to meet requirements in data processing, simulation, and scientific computing scenarios.

  • Argo workflows run on nodeless elastic container instances provided by Alibaba Cloud.

  • You can deploy compute resources on a large scale based on business requirements and pay for the resources on a pay-as-you-go basis. Workflows run on demand and no job queue is needed, which greatly improves efficiency and reduces costs.

Feature mappings

| Category | Batch computing | Argo Workflows |
| --- | --- | --- |
| User experience | Batch computing CLI | Argo Workflows CLI |
|  | JSON-defined jobs | YAML-defined jobs |
|  | SDK | SDK |
| Key features | Jobs | Workflows |
|  | Array jobs | Argo Workflows - Loops |
|  | Job dependencies | Argo Workflows - DAG |
|  | Job environment variables | Argo Workflows - Parameters |
|  | Automated job retries | Argo Workflows - Retrying |
|  | Job timeouts | Argo Workflows - Timeouts |
|  | N/A | Argo Workflows - Artifacts |
|  | N/A | Argo Workflows - Conditions |
|  | N/A | Argo Workflows - Recursion |
|  | N/A | Argo Workflows - Suspending/Resuming |
|  | GPU jobs | Run workflows on a specified type of ECS instances |
|  | Volumes | Volumes |
|  | Job priority | Argo Workflows - Priority |
|  | Job definitions | Workflow templates |
| Compute environment | Job queues | Serverless and elastic. No job queue is needed. |
|  | Compute environments | ACK Serverless clusters |
| Ecosystem integration | Eventing | Eventing |
|  | Observability | Observability |
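
For example, the retry and timeout rows in the table map to the retryStrategy and activeDeadlineSeconds fields of an Argo workflow. The following is a minimal sketch; the retry limit and deadline values are arbitrary examples.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-timeout-
spec:
  entrypoint: main
  activeDeadlineSeconds: 300    # fail the workflow if it runs longer than 5 minutes
  templates:
  - name: main
    retryStrategy:
      limit: "3"                # retry the task up to 3 times if it fails
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo retry and timeout example" ]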

Examples of Argo workflows

Simple workflows

The following workflow creates a pod that uses the alpine image to run the shell command echo helloworld.

You can modify this workflow to run your own shell commands or to run commands in a custom image.

cat > helloworld.yaml << EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow                  # new type of k8s spec
metadata:
  generateName: hello-world-    # name of the workflow spec
spec:
  entrypoint: main         # invoke the main template
  templates:
    - name: main              # name of the template
      container:
        image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
        command: [ "sh", "-c" ]
        args: [ "echo helloworld" ]
EOF
argo submit helloworld.yaml
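
After the workflow is submitted, you can track it with standard Argo CLI commands, as sketched below; the generated workflow name will vary.

# Submit the workflow and watch it until it completes.
argo submit --watch helloworld.yaml

# List recent workflows and print the logs of the latest one.
argo list
argo logs @latest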

Loops

In the following loop, a text file named pets.input and a script named print-pet.sh are packaged into the image named print-pet. The print-pet.sh script takes job-index as its input parameter and prints the pet on the job-index row of the pets.input file. For more information about these files, see the GitHub repository.

The loop creates five pods at a time and passes an input parameter (job-index 1 to 5) to each pod. Each pod prints the pet on its job-index row.

Loops can be used to quickly process large amounts of data in sharding and parallel computing scenarios. For more information about sample loops, see Argo Workflows - Loops.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loops-
spec:
  entrypoint: loop-example
  templates:
  - name: loop-example
    steps:
    - - name: print-pet
        template: print-pet
        arguments:
          parameters:
          - name: job-index
            value: "{{item}}"
        withSequence:  # loop to run print-pet template with parameter job-index 1 ~ 5 respectively.
          start: "1"
          end: "5"
  - name: print-pet
    inputs:
      parameters:
      - name: job-index
    container:
      image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/print-pet
      command: [/tmp/print-pet.sh]
      args: ["{{inputs.parameters.job-index}}"] # input parameter job-index as args of container

DAGs (MapReduce)

Multiple jobs may need to collaborate in batch computing scenarios. In this case, you can create a DAG to specify the dependencies of each job.

Mainstream batch computing systems require you to specify a job's dependencies by job ID. However, a job ID is returned only after the job is submitted. To work around this, you need to write a script that submits the jobs and wires up their dependencies, as shown in the following sample code. As the number of jobs grows, the dependencies in the script become complex and the script becomes increasingly expensive to maintain.

//The dependencies of each job in a batch computing system. Job B depends on Job A. Job B is started only after Job A is complete. 
batch submit JobA | get job-id
batch submit JobB --dependency job-id (JobA)

Argo workflows allow you to create a DAG to specify the dependencies of each task. In the following diamond-shaped example:

  • Task B and Task C depend on Task A.

  • Task D depends on Task B and Task C.

# The following workflow executes a diamond workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: echo
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]

In the Git repository, we also provide a sample MapReduce workflow which can be used to create shards and aggregate computing results. For more information, see map-reduce.
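
The feature mapping table above also lists Argo Workflows - Conditions, which can replace job logic that runs only under certain conditions. The following is a minimal sketch; the parameter name and value are examples, and the step runs only when the when expression evaluates to true.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: conditional-
spec:
  entrypoint: main
  arguments:
    parameters:
    - name: should-print    # example parameter that controls the conditional step
      value: "yes"
  templates:
  - name: main
    steps:
    - - name: maybe-print
        template: echo
        when: "{{workflow.parameters.should-print}} == yes"   # run only if the condition holds
  - name: echo
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo condition met" ]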

Migrate from a batch computing system to Argo workflows

  1. Assessment and planning

    Assess the existing batch jobs, including their dependencies, resource requests, and parameters. Learn the features and best practices of Argo workflows, and choose the appropriate Argo workflow features to replace those used in the batch computing system. You can skip the steps of designing compute environments and configuring job priorities because Kubernetes clusters for distributed Argo workflows use serverless elastic container instances.

  2. Create a Kubernetes cluster for distributed Argo workflows

  3. Convert job definitions

    Convert batch jobs to Argo workflows based on the feature mappings between batch computing and Argo workflows. You can also use the Argo Workflows SDK to automate workflow creation and integration.

  4. Prepare storage services

    Make sure that the Kubernetes cluster for distributed Argo workflows can access the data required for running workflows. You can mount Object Storage Service (OSS) buckets, File Storage NAS (NAS) file systems, CPFS file systems, or disks to the cluster. For more information, see Use volumes. A minimal mount example is sketched after this list.

  5. Verify workflows

    Verify workflows, data access, output data, and resource usage.

  1. O&M: monitoring and logging

    Enable observability for the Kubernetes cluster for distributed Argo workflows and check the status and logs of the workflows.
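
For step 4, the following is a minimal sketch of mounting an existing persistent volume claim into a workflow so that tasks can read and write shared data. The claim name my-data-pvc and the mount path are assumptions; replace them with the volume that you prepared.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volume-example-
spec:
  entrypoint: main
  volumes:
  - name: workdir
    persistentVolumeClaim:
      claimName: my-data-pvc    # assumed PVC bound to an OSS, NAS, CPFS, or disk volume
  templates:
  - name: main
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "ls /mnt/data" ]
      volumeMounts:
      - name: workdir
        mountPath: /mnt/data    # the task reads and writes data under this path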

Usage notes

  • Argo workflows can replace mainstream batch computing systems in terms of user experience, core features, compute environments, and ecosystem integration. In addition, Argo workflows outperform batch computing in terms of complex workflow orchestration and compute environment management.

  • Workflow clusters are built on Kubernetes. Workflow definitions comply with the Kubernetes YAML specification, and task definitions comply with the Kubernetes container specification. If you already use Kubernetes to host applications in your staging and production environments, you can quickly get started with workflow clusters.

  • The compute environment of workflow clusters uses elastic container instances, which are nodeless. You can deploy compute resources on a large scale based on business requirements and pay for the resources on a pay-as-you-go basis. Workflows run on demand and no workflow queue is needed, which greatly improves efficiency and reduces costs.

  • Using preemptible instances also helps reduce expenses on computing resources.

  • Distributed workflows are suitable for CI/CD, data processing, simulation, and scientific computing.

References