By Yu Zhuang
In workflow orchestration, fan-out/fan-in task orchestration is commonly used to speed up large tasks: a large task is split into smaller sub-tasks, the sub-tasks run in parallel, and their results are finally aggregated.
As shown in the preceding figure, a Directed Acyclic Graph (DAG) can be used to orchestrate fan-out/fan-in tasks. Sub-tasks can be split statically or dynamically, corresponding to a static DAG and a dynamic DAG respectively. Dynamic DAG fan-out/fan-in can also be understood as MapReduce: each sub-task is a map step, and the final aggregation of results is the reduce step.
Static DAG: the set of sub-tasks is fixed in advance. For example, data is gathered from Database 1 and Database 2 at the same time, and the results are aggregated afterward.
Dynamic DAG: the set of sub-tasks is determined dynamically by the output of a previous task. For example, in data processing, Task A can scan the datasets to be processed and start a sub-task Bn for each sub-dataset (for example, a sub-directory). After all sub-tasks Bn have run, the results are aggregated in sub-task C. The number of sub-tasks B depends on the output of Task A, and you can customize the splitting rules in Task A based on your business scenario. A minimal local sketch of this pattern follows.
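The following Python sketch illustrates the fan-out/fan-in idea in-process, assuming a simple keyword-count job and a thread pool in place of Argo and Kubernetes; the split, count, and merge functions here are hypothetical stand-ins for the real sub-tasks.

# Minimal local sketch of dynamic fan-out/fan-in (MapReduce-style).
# This is only an in-process analogue; in this article the same roles are
# played by Argo Workflow tasks running as Kubernetes pods.
from concurrent.futures import ThreadPoolExecutor

def split(lines, num_parts):
    # "Task A": decide the sub-tasks dynamically from the input.
    return [lines[i::num_parts] for i in range(num_parts)]

def count(part, keyword="error"):
    # "Sub-task Bn": process one partition independently.
    return sum(keyword in line for line in part)

def merge(partial_counts):
    # "Sub-task C": aggregate all partial results.
    return sum(partial_counts)

if __name__ == "__main__":
    log_lines = ["error: disk full", "ok", "error: timeout", "ok", "ok"]
    parts = split(log_lines, num_parts=2)        # fan-out decided at run time
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count, parts))  # sub-tasks run in parallel
    print(merge(partials))                       # fan-in: prints 2

The number of parts, and therefore the number of parallel count calls, is decided at run time from the input; this is the dynamic fan-out behavior that the workflow below reproduces with Argo tasks and pods.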
In practical business scenarios, to speed up large tasks and improve efficiency, a large task often needs to be divided into thousands of sub-tasks. Running thousands of sub-tasks at the same time requires scheduling tens of thousands of CPU cores, and that level of concurrency causes resource contention that a typical offline task cluster in an IDC cannot handle. For example, in autonomous driving simulation, the regression test after an algorithm change must simulate all driving scenarios, and each small driving scenario can be simulated by one sub-task. To speed up iteration, the development team requires that all sub-scenario tests run in parallel.
In scenarios such as data processing, simulation computing, and scientific computing, you can use Alibaba Cloud's ACK One distributed Argo workflow clusters [1] to orchestrate tasks with dynamic DAGs and to schedule tens of thousands of CPU cores to accelerate the tasks.
ACK One distributed Argo workflow clusters provide fully managed Argo Workflows [2] as a product, along with after-sales support. They support dynamic DAG fan-out/fan-in task orchestration and on-demand scheduling of cloud computing power. With cloud elasticity, they can schedule tens of thousands of CPU cores to run large-scale sub-tasks in parallel, which reduces running time, and they quickly release resources after the run to save costs. They support business scenarios such as data processing, machine learning, simulation computing, scientific computing, and CI/CD.
Argo Workflows is an open-source CNCF project that focuses on workflow orchestration in the cloud-native field. It uses Kubernetes CRDs to orchestrate offline tasks and DAG workflows, and runs them as Kubernetes pods scheduled in clusters.
This article explains how to use Argo Workflows to orchestrate dynamic DAG fan-out/fan-in tasks.
We will build a dynamic DAG fan-out/fan-in workflow that reads a large log file from Alibaba Cloud Object Storage Service (OSS), splits it into multiple small files (split), starts multiple sub-tasks to count the number of keywords in each small file (count), and finally aggregates the results (merge).
1. Create a workflow cluster [3].
2. Mount an Alibaba Cloud OSS volume so that the workflow can read and write files on OSS as if they were local files. For more information, see Use volumes [4].
3. Create a workflow by using the following workflow YAML. For more information, see Create a workflow [5]. For details, see the comments in the YAML.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dynamic-dag-map-reduce-
spec:
  entrypoint: main
  # Claim an OSS PVC so that the workflow can read/write files in OSS through the volume.
  volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: pvc-oss
  # How many parts to split the file into; the default is 5.
  arguments:
    parameters:
      - name: numParts
        value: "5"
  templates:
    - name: main
      # DAG definition.
      dag:
        tasks:
          # Split the log file into several small files, based on numParts.
          - name: split
            template: split
            arguments:
              parameters:
                - name: numParts
                  value: "{{workflow.parameters.numParts}}"
          # Start multiple map tasks to count keywords in each small file.
          - name: map
            template: map
            arguments:
              parameters:
                - name: partId
                  value: '{{item}}'
            depends: "split"
            # Run as a loop; partId values come from the JSON output of the split task.
            withParam: '{{tasks.split.outputs.result}}'
          - name: reduce
            template: reduce
            arguments:
              parameters:
                - name: numParts
                  value: "{{workflow.parameters.numParts}}"
            depends: "map"
    # The `split` task splits the big log file into several small files. Each file has a unique ID (partId).
    # Finally, it prints the list of partIds to stdout as the output parameter.
    - name: split
      inputs:
        parameters:
          - name: numParts
      container:
        image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/python-log-count
        command: [python]
        args: ["split.py"]
        env:
          - name: NUM_PARTS
            value: "{{inputs.parameters.numParts}}"
        volumeMounts:
          - name: workdir
            mountPath: /mnt/vol
    # One `map` task per partId is started. Each finds its own part file and processes it.
    - name: map
      inputs:
        parameters:
          - name: partId
      container:
        image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/python-log-count
        command: [python]
        args: ["count.py"]
        env:
          - name: PART_ID
            value: "{{inputs.parameters.partId}}"
        volumeMounts:
          - name: workdir
            mountPath: /mnt/vol
    # The `reduce` task reads the results directory and produces a single aggregated result.
    - name: reduce
      inputs:
        parameters:
          - name: numParts
      container:
        image: acr-multiple-clusters-registry.cn-hangzhou.cr.aliyuncs.com/ack-multiple-clusters/python-log-count
        command: [python]
        args: ["merge.py"]
        env:
          - name: NUM_PARTS
            value: "{{inputs.parameters.numParts}}"
        volumeMounts:
          - name: workdir
            mountPath: /mnt/vol
      outputs:
        artifacts:
          - name: result
            path: /mnt/vol/result.json
4. Implement the Dynamic DAG
(1) The split task splits the large file into small files and prints a JSON string to standard output that lists the partId of each small file to be processed by a sub-task, for example:
["0", "1", "2", "3", "4"]
(2) The map task uses withParam to reference the output of the split task, parses the JSON string to obtain the list of items, and starts one map task for each {{item}}, passing it as the partId input parameter (a sketch of what each map pod might run follows the reference below).
- name: map
  template: map
  arguments:
    parameters:
      - name: partId
        value: '{{item}}'
  depends: "split"
  withParam: '{{tasks.split.outputs.result}}'
For more information, see the Open source Argo workflow documentation[6].
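Each fanned-out map pod then only needs its own partId. A hypothetical count.py for the map template could look like the following sketch; the keyword, the part file names, and the count-output directory are assumptions for illustration, while the PART_ID environment variable and the /mnt/vol mount come from the YAML above.

# Hypothetical sketch of count.py: count keyword occurrences in one part file
# and write the partial result to /mnt/vol/count-output/.
import os

part_id = os.environ["PART_ID"]   # injected per map task from {{item}}
keyword = "error"                 # assumed keyword for illustration

in_path = f"/mnt/vol/split-output/part-{part_id}.txt"
out_dir = "/mnt/vol/count-output"
os.makedirs(out_dir, exist_ok=True)

with open(in_path) as f:
    count = sum(keyword in line for line in f)

with open(os.path.join(out_dir, f"part-{part_id}.txt"), "w") as out:
    out.write(str(count))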
5. After the workflow has run, view the DAG and the running results of the tasks in the Argo workflow cluster console [7].
6. In the Alibaba Cloud OSS file list, log-count-data.txt is the input log file, split-output and count-output are directories for intermediate results, and result.json is the final result file.
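To complete the picture, a hypothetical merge.py might aggregate the partial counts into result.json roughly as follows; the directory and file layout are assumptions that match the sketches above, not the actual example code (see [8] for the real source).

# Hypothetical sketch of merge.py: sum all partial counts from count-output
# and write the final result to /mnt/vol/result.json (the reduce step).
import json
import os

num_parts = int(os.environ.get("NUM_PARTS", "5"))
count_dir = "/mnt/vol/count-output"

total = 0
for i in range(num_parts):
    with open(os.path.join(count_dir, f"part-{i}.txt")) as f:
        total += int(f.read().strip())

with open("/mnt/vol/result.json", "w") as out:
    json.dump({"total": total}, out)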
7. For the source code of this example, see the AliyunContainerService GitHub repository argo-workflow-examples [8].
Argo Workflows focuses on workflow orchestration in the cloud-native field. It uses Kubernetes CRDs to orchestrate offline tasks and DAG workflows, and runs them as Kubernetes pods scheduled in clusters.
Alibaba Cloud's ACK One distributed Argo workflow clusters provide fully managed Argo Workflows as a product, along with after-sales support. The hardened control plane schedules tens of thousands of sub-tasks (pods) stably and efficiently. The data plane supports serverless scheduling of large-scale cloud computing power without cluster or node O&M. With cloud elasticity, the data plane can schedule cloud computing power on demand, including tens of thousands of CPU cores, to run large-scale sub-tasks in parallel and reduce running time. Business scenarios such as data processing, machine learning, simulation computing, scientific computing, and CI/CD are all supported.
[1] Alibaba Cloud Kubernetes clusters for distributed Argo workflows
https://www.alibabacloud.com/help/en/ack/overview-12
[2] Argo Workflow
https://argo-workflows.readthedocs.io/en/latest/
[3] Create a workflow cluster
https://www.alibabacloud.com/help/en/ack/create-a-workflow-cluster
[4] Use volumes
https://www.alibabacloud.com/help/en/ack/use-volumes
[5] Create a workflow
https://www.alibabacloud.com/help/en/ack/create-a-workflow
[6] Open source Argo workflow documentation
https://argo-workflows.readthedocs.io/en/latest/walk-through/loops/
[7] Argo workflow cluster console
https://account.aliyun.com/login/login.htm?oauth_callback=https%3A%2F%2Fcs.console.aliyun.com%2Fone%3Fspm%3Da2c4g.11186623.0.0.7e2f1428OwzMip#/argowf/cluster/detail
[8] AliyunContainerService GitHub argo-workflow-examples
https://github.com/AliyunContainerService/argo-workflow-examples/tree/main/log-count