Batch jobs are commonly used in data processing, simulation, and scientific computing, and they usually require large amounts of computing resources. Kubernetes clusters for distributed Argo workflows are built on the open source Argo Workflows project and are compatible with the open source workflow specifications. In workflow clusters, you can easily orchestrate workflows, run each step in a container, and complete compute-intensive jobs such as large-scale machine learning and simulation within a short period of time. You can also quickly run Continuous Integration and Continuous Delivery (CI/CD) pipeline jobs. Migrating scheduled jobs and batch jobs to workflow clusters helps you reduce O&M complexity and costs.
Background
Workflow clusters are built on Kubernetes, host open source Argo Workflows, and use a serverless workflow engine.
Terms used in batch computing
Job
After you submit task units, such as Shell scripts, Linux executable files, or Docker container images, to the batch computing system, the system allocates compute resources and then starts a job.
Array job
An array job is a collection of similar jobs that are submitted and run as a batch. All jobs in an array job share the same job definition, and you can use indexes to distinguish them. Each job instance may process a different dataset or perform a different task.
Job definition
A job definition specifies how a job runs. You need to create a job definition before you can run a job.
A job definition usually consists of the image used to run the job, commands and parameters, required amounts of CPU and memory resources, environment variables, and disk space.
Job queue
Jobs that you submit to the batch computing system are delivered to a job queue. A job leaves the queue after the job is scheduled. You can specify the priorities of jobs in a job queue and associate a job queue with a compute environment.
Compute environment
A compute environment consists of compute resources that are used to run jobs. For each compute environment, you need to specify the vSwitches and instance types, the maximum and minimum numbers of vCPUs, and the unit price of preemptible instances.
Terms used in distributed Argo workflows
Template
A template defines a task (or job). Templates are a part of workflows. Each workflow must contain at least one template. A template also contains the configuration of Kubernetes containers and the input and output parameters.
Workflow
A workflow consists of one or more tasks (or templates). You can orchestrate tasks in a variety of ways. For example, you can serialize tasks, run tasks in parallel, or run only tasks that meet the specified conditions. After a workflow is created, the tasks in the workflow run in the pods of a Kubernetes cluster.
Workflow template
Workflow templates are reusable static workflow definitions, which are similar to functions. A workflow template can be referenced and run in different workflows. You can reuse existing workflow templates when defining complex workflows.
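For example, the following minimal sketch defines a WorkflowTemplate and a workflow that runs it by reference through workflowTemplateRef. The template name common-steps and the image are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate          # reusable, static workflow definition
metadata:
  name: common-steps            # illustrative name
spec:
  entrypoint: main
  templates:
  - name: main
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo running a reusable step" ]
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: from-template-
spec:
  workflowTemplateRef:
    name: common-steps          # reference the workflow template defined above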
ACK Serverless cluster
A Kubernetes cluster for distributed Argo workflows comes with a built-in compute environment, so you do not need to manually create or manage compute environments. After a workflow is submitted, the cluster runs the tasks in the workflow on serverless elastic container instances, which means you do not need to maintain Kubernetes nodes. With elastic container instances, you can run large-scale workflows with tens of thousands of pods and use hundreds of thousands of vCPUs. Compute resources are automatically released after the workflows are complete. As a result, Kubernetes clusters for distributed Argo workflows both accelerate workflows and reduce costs.
Compare batch computing and Argo workflows
Batch computing
You need to learn the specifications and usage notes of job definitions, and you may need to purchase devices or software from specific vendors.
You also need to manage compute environments, including specifying vSwitches and the type and number of instances. The overall O&M cost is high because batch computing is not serverless.
Because compute environments provide limited resources, you also need to specify the priorities of jobs in the job queue, which makes the configuration even more complex.
Argo workflows
Argo workflows are cloud-native workflows built on Kubernetes clusters and open source Argo Workflows. Therefore, you do not need to purchase products or software from specific vendors.
Argo workflows support complex task orchestration to meet the requirements of data processing, simulation, and scientific computing scenarios.
Argo workflows run on nodeless elastic container instances provided by Alibaba Cloud.
You can provision compute resources at scale based on your business requirements and pay for the resources on a pay-as-you-go basis. Workflows run on demand and no workflow queue is needed. This greatly improves efficiency and reduces costs.
Feature mappings
Category | Batch computing | Argo Workflows
------- | ------- | -------
User experience | Batch computing CLI | Argo CLI
 | JSON-defined jobs | YAML-defined jobs
 | SDK | SDK
Key features | Jobs | Workflows
 | Array jobs | Loops
 | Job dependencies | DAGs
 | Job environment variables | Supported
 | Automated job retries | Supported
 | Job timeouts | Supported
 | N/A | Conditions
 | N/A | Recursion
 | N/A | Suspend and resume
 | N/A | Cron workflows
 | GPU jobs | Supported
 | Volumes | Supported
 | Job priority | Supported
 | Job definitions | Workflow templates
Compute environment | Job queues | Serverless and elastic. No job queue is needed.
 | Compute environments | Serverless. You do not need to manage compute environments.
Ecosystem integration | Eventing | Supported
 | Observability | Supported
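For example, automated job retries and job timeouts map to the retryStrategy and activeDeadlineSeconds fields of an Argo template. The following is a minimal sketch; the image, command, and limits are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-timeout-
spec:
  entrypoint: main
  templates:
  - name: main
    retryStrategy:
      limit: 3                   # retry the step up to three times if it fails
    activeDeadlineSeconds: 300   # fail the step if it runs longer than 5 minutes
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo simulating a failing job; exit 1" ]  # always fails so that retries can be observed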
Examples of Argo workflows
Simple workflows
The following workflow creates a pod that uses the alpine image and runs the Shell command echo helloworld.
You can modify this workflow to run your own Shell commands or run commands from a custom image in Argo.
cat > helloworld.yaml << EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow                  # new type of k8s spec
metadata:
  generateName: hello-world-    # name of the workflow spec
spec:
  entrypoint: main              # invoke the main template
  templates:
  - name: main                  # name of the template
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo helloworld" ]
EOF
argo submit helloworld.yaml
Loops
In the following loop, a text file named pets.input and a script named print-pet.sh are packaged in an image named print-pet. The print-pet.sh script takes job-index as an input parameter and prints the pet on the corresponding line of the pets.input file. For more information about the files, see the GitHub repository.
The loop creates five pods in parallel and passes a different job-index value (1 to 5) to each pod. Each pod prints the pet on the line that corresponds to its job-index.
Loops can be used to quickly process large amounts of data in sharding and parallel computing scenarios. For more information about sample loops, see Argo Workflows - Loops.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loops-
spec:
  entrypoint: loop-example
  templates:
  - name: loop-example
    steps:
    - - name: print-pet
        template: print-pet
        arguments:
          parameters:
          - name: job-index
            value: "{{item}}"
        withSequence: # loop to run print-pet template with parameter job-index 1 ~ 5 respectively.
          start: "1"
          end: "5"
  - name: print-pet
    inputs:
      parameters:
      - name: job-index
    container:
      image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/print-pet
      command: [/tmp/print-pet.sh]
      args: ["{{inputs.parameters.job-index}}"] # input parameter job-index as args of container
DAGs (MapReduce)
In batch computing scenarios, multiple jobs often need to work together. In this case, you can create a directed acyclic graph (DAG) to specify the dependencies between jobs.
Mainstream batch computing systems require you to specify a job's dependencies by job ID. However, a job ID is returned only after the job is submitted. As a result, you need to write a script to wire up the dependencies between jobs, as shown in the following sample code. As the number of jobs grows, the dependencies in the script become complex and the script becomes increasingly costly to maintain.
//The dependencies of each job in a batch computing system. Job B depends on Job A. Job B is started only after Job A is complete.
batch submit JobA | get job-id
batch submit JobB --dependency job-id (JobA)
Argo workflows allow you to create a DAG to specify the dependencies between tasks, as shown in the following example:
Task B and Task C depend on Task A.
Task D depends on Task B and Task C.
# The following workflow executes a diamond workflow
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamond-
spec:
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: echo
        arguments:
          parameters: [{name: message, value: D}]
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
The GitHub repository also provides a sample MapReduce workflow that shards data and aggregates the computing results. For more information, see map-reduce.
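A common map-reduce pattern in Argo Workflows is to let a split task print a JSON list, fan out one map task per list element by using withParam, and then aggregate the results in a reduce task. The following minimal sketch shows only the pattern; the task names, shard values, and commands are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: map-reduce-sketch-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: split
        template: split              # prints a JSON list of shards to stdout
      - name: map
        depends: "split"
        template: map
        arguments:
          parameters:
          - name: shard
            value: "{{item}}"
        withParam: "{{tasks.split.outputs.result}}"  # one map task per shard
      - name: reduce
        depends: "map"
        template: reduce             # runs after all map tasks are complete
  - name: split
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo '[\"shard-0\",\"shard-1\",\"shard-2\"]'" ]
  - name: map
    inputs:
      parameters:
      - name: shard
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo processing {{inputs.parameters.shard}}" ]
  - name: reduce
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo aggregating results" ]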
Migrate from a batch computing system to Argo workflows
Assessment and planning
Assess the existing batch jobs, including their dependencies, resource requests, and parameters. Learn the features and best practices of Argo workflows, and choose the appropriate Argo workflow features to replace those used in the batch computing system. Because Kubernetes clusters for distributed Argo workflows use serverless elastic container instances, you can skip designing compute environments and configuring job priorities.
Create a Kubernetes cluster for distributed Argo workflows
Convert job definitions
Convert batch jobs to Argo workflows based on the feature mappings between batch computing and Argo workflows. You can also call the Argo workflow SDK to automate workflow creation and integration.
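For example, a job definition that specifies an image, a command, vCPU and memory requirements, and environment variables maps naturally to an Argo container template. The following minimal sketch shows such a conversion; the environment variable and resource values are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: converted-job-
spec:
  entrypoint: main
  templates:
  - name: main
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "echo processing $INPUT_PREFIX" ]
      env:
      - name: INPUT_PREFIX            # job environment variable (illustrative)
        value: "oss://my-bucket/input"
      resources:
        requests:
          cpu: "2"                    # required vCPUs
          memory: 4Gi                 # required memory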
Prepare storage services
Make sure that the Kubernetes cluster for distributed Argo workflows can access the data required for running workflows. You can mount Object Storage Service (OSS) buckets, File Storage NAS (NAS) file systems, CPFS file systems, or disks to the cluster. For more information, see Use volumes.
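For example, a workflow can mount an existing persistent volume claim (PVC) that is backed by OSS or NAS. The following minimal sketch assumes that a PVC named pvc-oss already exists in the cluster; the PVC name and mount path are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volume-example-
spec:
  entrypoint: main
  volumes:
  - name: workdir
    persistentVolumeClaim:
      claimName: pvc-oss              # illustrative PVC name; must already exist
  templates:
  - name: main
    container:
      image: registry.cn-hangzhou.aliyuncs.com/acs/alpine:3.18-update
      command: [ "sh", "-c" ]
      args: [ "ls /mnt/data" ]
      volumeMounts:
      - name: workdir
        mountPath: /mnt/data          # the data in the PVC is available here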
Verify workflows
Verify workflows, data access, output data, and resource usage.
O&M: monitoring and logging
Enable observability for the Kubernetes cluster for distributed Argo workflows and check the status and logs of the workflows.
Usage notes
Argo workflows can replace mainstream batch computing systems in terms of user experience, core features, compute environments, and ecosystem integration. In addition, Argo workflows outperform batch computing in terms of complex workflow orchestration and compute environment management.
Workflow clusters are built on Kubernetes. Workflow definitions comply with the Kubernetes YAML specifications, and task definitions comply with the Kubernetes container specifications. If you already use Kubernetes to host applications in your staging and production environments, you can quickly get started with workflow clusters.
The compute environment of workflow clusters uses elastic container instances, which are nodeless. You can provision compute resources at scale based on your business requirements and pay for the resources on a pay-as-you-go basis. Workflows run on demand and no workflow queue is needed. This greatly improves efficiency and reduces costs.
Using preemptible instances also helps reduce expenses on computing resources.
Distributed workflows are suitable for CI/CD, data processing, simulation, and scientific computing.
References
For more information about open source Argo Workflows, see Open source Argo Workflows.
For more information about how workflow clusters work and the relevant operations, see Overview of Kubernetes clusters for distributed Argo workflows.
For more information about how to create a workflow cluster, see Create a workflow cluster.
For more information about how to create a workflow, see Create a workflow.
For more information about how to mount volumes to a workflow cluster, see Use volumes.