By Yu Zhuang
In workflow orchestration, fan-out/fan-in task orchestration is commonly used to speed up large tasks: a large task is split into smaller sub-tasks, the sub-tasks run in parallel, and their results are finally aggregated.
As shown in the preceding figure, a Directed Acyclic Graph (DAG) can be used to orchestrate fan-out/fan-in tasks. Sub-tasks can be split statically or dynamically, corresponding to a static DAG and a dynamic DAG respectively. Dynamic DAG fan-out/fan-in can also be understood as MapReduce: each sub-task is a map step, and the final aggregation of results is the reduce step.
Static DAG: the set of sub-tasks is fixed in advance. For example, data is gathered from Database 1 and Database 2 at the same time, and the results are aggregated afterward.
Dynamic DAG: the set of sub-tasks is determined dynamically by the output of a previous task. For example, in data processing, Task A can scan the datasets to be processed and start a sub-task Bn for each sub-dataset (for example, a sub-directory). After all sub-tasks Bn have run, the results are aggregated in sub-task C. The number of sub-tasks B depends on the output of Task A, and you can customize the splitting rules in Task A based on your business scenario. A minimal local sketch of this pattern follows.
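The following Python sketch illustrates the fan-out/fan-in idea in-process, assuming a simple keyword-count job and a thread pool in place of Argo and Kubernetes; the split, count, and merge functions here are hypothetical stand-ins for the real sub-tasks.

# Minimal local sketch of dynamic fan-out/fan-in (MapReduce-style).
# This is only an in-process analogue; in this article the same roles are
# played by Argo Workflow tasks running as Kubernetes pods.
from concurrent.futures import ThreadPoolExecutor

def split(lines, num_parts):
    # "Task A": decide the sub-tasks dynamically from the input.
    return [lines[i::num_parts] for i in range(num_parts)]

def count(part, keyword="error"):
    # "Sub-task Bn": process one partition independently.
    return sum(keyword in line for line in part)

def merge(partial_counts):
    # "Sub-task C": aggregate all partial results.
    return sum(partial_counts)

if __name__ == "__main__":
    log_lines = ["error: disk full", "ok", "error: timeout", "ok", "ok"]
    parts = split(log_lines, num_parts=2)        # fan-out decided at run time
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count, parts))  # sub-tasks run in parallel
    print(merge(partials))                       # fan-in: prints 2

The number of parts, and therefore the number of parallel count calls, is decided at run time from the input; this is the dynamic fan-out behavior that the workflow below reproduces with Argo tasks and pods.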
In practical business scenarios, to speed up large tasks and improve efficiency, a large task often needs to be divided into thousands of sub-tasks. Running thousands of sub-tasks at the same time requires scheduling tens of thousands of CPU cores, and that level of concurrency causes resource contention that a typical offline task cluster in an IDC cannot handle. For example, in autonomous driving simulation, the regression test after an algorithm change must simulate all driving scenarios, and each small driving scenario can be simulated by one sub-task. To speed up iteration, the development team requires that all sub-scenario tests run in parallel.
In scenarios such as data processing, simulation computing, and scientific computing, you can use Alibaba Cloud's ACK One distributed Argo workflow clusters [1] to orchestrate tasks with dynamic DAGs and to schedule tens of thousands of CPU cores to accelerate the tasks.
ACK One distributed Argo workflow clusters provide fully managed Argo Workflows [2] as a product, along with after-sales support. They support dynamic DAG fan-out/fan-in task orchestration and on-demand scheduling of cloud computing power. With cloud elasticity, they can schedule tens of thousands of CPU cores to run large-scale sub-tasks in parallel, which reduces running time, and they quickly release resources after the run to save costs. They support business scenarios such as data processing, machine learning, simulation computing, scientific computing, and CI/CD.
Argo Workflows is an open-source CNCF project that focuses on workflow orchestration in the cloud-native field. It uses Kubernetes CRDs to orchestrate offline tasks and DAG workflows, and runs them as Kubernetes pods scheduled in clusters.
This article explains how to use Argo Workflows to orchestrate dynamic DAG fan-out/fan-in tasks.
We will build a dynamic DAG fan-out/fan-in workflow that reads a large log file from Alibaba Cloud Object Storage Service (OSS), splits it into multiple small files (split), starts multiple sub-tasks to count the number of keywords in each small file (count), and finally aggregates the results (merge).
1. Create a workflow cluster [3].
2. Mount an Alibaba Cloud OSS volume so that the workflow can read and write files on OSS as if they were local files. For more information, see Use volumes [4].
3. Create a workflow by using the following workflow YAML. For more information, see Create a workflow [5]. For details, see the comments in the YAML.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dynamic-dag-map-reduce-
spec:
  entrypoint: main
  # Claim an OSS PVC so that the workflow can read/write files in OSS through the volume.
  volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: pvc-oss
  # How many parts to split the file into; the default is 5.
  arguments:
    parameters:
      - name: numParts
        value: "5"
  templates:
    - name: main
      # DAG definition.
      dag:
        tasks:
          # Split the log file into several small files, based on numParts.
          - name: split
            template: split
            arguments:
              parameters:
                - name: numParts
                  value: "{{workflow.parameters.numParts}}"
          # Start multiple map tasks to count keywords in each small file.
          - name: map
            template: map
            arguments:
              parameters:
                - name: partId
                  value: '{{item}}'
            depends: "split"
            # Run as a loop; partId values come from the JSON output of the split task.
            withParam: '{{tasks.split.outputs.result}}'
          - name: reduce
            template: reduce
            arguments:
              parameters:
                - name: numParts
                  value: "{{workflow.parameters.numParts}}"
            depends: "map"
    # The `split` task splits the big log file into several small files. Each file has a unique ID (partId).
    # Finally, it prints the list of partIds to stdout as the output parameter.
    - name: split
      inputs:
        parameters:
          - name: numParts
      container:
        image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/python-log-count
        command: [python]
        args: ["split.py"]
        env:
          - name: NUM_PARTS
            value: "{{inputs.parameters.numParts}}"
        volumeMounts:
          - name: workdir
            mountPath: /mnt/vol
    # One `map` task per partId is started. Each finds its own part file and processes it.
    - name: map
      inputs:
        parameters:
          - name: partId
      container:
        image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/python-log-count
        command: [python]
        args: ["count.py"]
        env:
          - name: PART_ID
            value: "{{inputs.parameters.partId}}"
        volumeMounts:
          - name: workdir
            mountPath: /mnt/vol
    # The `reduce` task reads the results directory and produces a single aggregated result.
    - name: reduce
      inputs:
        parameters:
          - name: numParts
      container:
        image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/python-log-count
        command: [python]
        args: ["merge.py"]
        env:
          - name: NUM_PARTS
            value: "{{inputs.parameters.numParts}}"
        volumeMounts:
          - name: workdir
            mountPath: /mnt/vol
      outputs:
        artifacts:
          - name: result
            path: /mnt/vol/result.json
4. Implement the Dynamic DAG
(1) The split task splits the large file into small files and prints a JSON string to standard output that lists the partId of each small file to be processed by a sub-task, for example:
["0", "1", "2", "3", "4"]
(2) The map task uses withParam to reference the output of the split task, parses the JSON string to obtain the list of items, and starts one map task for each {{item}}, passing it as the partId input parameter (a sketch of what each map pod might run follows the reference below).
- name: map
  template: map
  arguments:
    parameters:
      - name: partId
        value: '{{item}}'
  depends: "split"
  withParam: '{{tasks.split.outputs.result}}'
For more information, see the Open source Argo workflow documentation[6].
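Each fanned-out map pod then only needs its own partId. A hypothetical count.py for the map template could look like the following sketch; the keyword, the part file names, and the count-output directory are assumptions for illustration, while the PART_ID environment variable and the /mnt/vol mount come from the YAML above.

# Hypothetical sketch of count.py: count keyword occurrences in one part file
# and write the partial result to /mnt/vol/count-output/.
import os

part_id = os.environ["PART_ID"]   # injected per map task from {{item}}
keyword = "error"                 # assumed keyword for illustration

in_path = f"/mnt/vol/split-output/part-{part_id}.txt"
out_dir = "/mnt/vol/count-output"
os.makedirs(out_dir, exist_ok=True)

with open(in_path) as f:
    count = sum(keyword in line for line in f)

with open(os.path.join(out_dir, f"part-{part_id}.txt"), "w") as out:
    out.write(str(count))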
5. After the workflow has run, view the DAG and the running results of the tasks in the Argo workflow cluster console [7].
6. In the Alibaba Cloud OSS file list, log-count-data.txt is the input log file, split-output and count-output are directories for intermediate results, and result.json is the final result file.
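To complete the picture, a hypothetical merge.py might aggregate the partial counts into result.json roughly as follows; the directory and file layout are assumptions that match the sketches above, not the actual example code (see [8] for the real source).

# Hypothetical sketch of merge.py: sum all partial counts from count-output
# and write the final result to /mnt/vol/result.json (the reduce step).
import json
import os

num_parts = int(os.environ.get("NUM_PARTS", "5"))
count_dir = "/mnt/vol/count-output"

total = 0
for i in range(num_parts):
    with open(os.path.join(count_dir, f"part-{i}.txt")) as f:
        total += int(f.read().strip())

with open("/mnt/vol/result.json", "w") as out:
    json.dump({"total": total}, out)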
7. For the source code of this example, see the AliyunContainerService GitHub repository argo-workflow-examples [8].
Argo Workflows focuses on workflow orchestration in the cloud-native field. It uses Kubernetes CRDs to orchestrate offline tasks and DAG workflows, and runs them as Kubernetes pods scheduled in clusters.
Alibaba Cloud's ACK One distributed Argo workflow clusters provide fully managed Argo Workflows as a product, along with after-sales support. The hardened control plane schedules tens of thousands of sub-tasks (pods) stably and efficiently. The data plane supports serverless scheduling of large-scale cloud computing power without cluster or node O&M. With cloud elasticity, the data plane can schedule cloud computing power on demand, including tens of thousands of CPU cores, to run large-scale sub-tasks in parallel and reduce running time. Business scenarios such as data processing, machine learning, simulation computing, scientific computing, and CI/CD are all supported.
[1] Alibaba Cloud Kubernetes clusters for distributed Argo workflows
https://www.alibabacloud.com/help/en/ack/overview-12
[2] Argo Workflow
https://argo-workflows.readthedocs.io/en/latest/
[3] Create a workflow cluster
https://www.alibabacloud.com/help/en/ack/create-a-workflow-cluster
[4] Use volumes
https://www.alibabacloud.com/help/en/ack/use-volumes
[5] Create a workflow
https://www.alibabacloud.com/help/en/ack/create-a-workflow
[6] Open source Argo workflow documentation
https://argo-workflows.readthedocs.io/en/latest/walk-through/loops/
[7] Argo workflow cluster console
https://account.aliyun.com/login/login.htm?oauth_callback=https%3A%2F%2Fcs.console.aliyun.com%2Fone%3Fspm%3Da2c4g.11186623.0.0.7e2f1428OwzMip#/argowf/cluster/detail
[8] AliyunContainerService GitHub argo-workflow-examples
https://github.com/AliyunContainerService/argo-workflow-examples/tree/main/log-count