Container Service for Kubernetes:Use ack-kube-queue to manage job queues

Last Updated: Mar 13, 2024

ack-kube-queue is designed to manage AI, machine learning, and batch workloads in Kubernetes. It allows system administrators to customize job queue management and improve the flexibility of queues. Integrated with a quota system, ack-kube-queue automates and optimizes the management of workloads and resource quotas to maximize resource utilization in Kubernetes clusters. This topic describes how to install and use ack-kube-queue.

Prerequisites

The cloud-native AI suite is activated.

Limits

Only Container Service for Kubernetes (ACK) Pro clusters whose Kubernetes versions are 1.18.aliyun.1 or later are supported.

Install ack-kube-queue

This section describes how to install ack-kube-queue in two scenarios.

Scenario 1: The cloud-native AI suite is not installed

  1. Log on to the ACK console and click Clusters in the left-side navigation pane.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Cloud-native AI Suite in the left-side navigation pane.

  3. In the lower part of the Cloud-native AI Suite page, click Deploy.

  4. In the Scheduling section, select Kube Queue. In the Interactive Mode section, select Arena. In the lower part of the page, click Deploy Cloud-native AI Suite.

Scenario 2: The cloud-native AI suite is installed

  1. Log on to the ACK console and click Clusters in the left-side navigation pane.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Applications > Cloud-native AI Suite in the left-side navigation pane.

  3. Install ack-arena and ack-kube-queue.

    • On the Cloud-native AI Suite page, find ack-arena and click Deploy in the Actions column. In the Parameters panel, click OK.

    • On the Cloud-native AI Suite page, find ack-kube-queue and click Deploy in the Actions column. In the message that appears, click Confirm.

    After ack-arena and ack-kube-queue are installed, Deployed is displayed in the Status column of the Components section.

Supported job types

ack-kube-queue supports TensorFlow jobs, PyTorch jobs, MPI jobs, Argo workflows, Ray jobs, Spark applications, and Kubernetes-native jobs.

Limits

  • You must use the Operator provided by ack-arena to submit TensorFlow jobs, PyTorch jobs, and MPI jobs to a queue.

  • To submit Kubernetes-native jobs to a queue, make sure that the Kubernetes version of the cluster is 1.22 or later.

  • You can submit MPI jobs to a queue only by using Arena.

  • You can submit only an entire Argo workflow to a queue, but not the steps in the Argo workflow. You can add the following annotation to declare the resources requested by an Argo workflow.

    ```
     annotations:
       kube-queue/min-resources: |
         cpu: 5
         memory: 5G
    ```

Submit different types of jobs to a queue

By default, ack-kube-queue supports TensorFlow jobs and PyTorch jobs. You can change the job types supported by ack-kube-queue as needed.

Versions before v0.4.0

Queuing for each job type is controlled by an individual extension Deployment in the kube-queue namespace. To disable queuing for a job type, scale the corresponding Deployment to 0 replicas, as shown in the sketch below.
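
A minimal sketch, assuming the extension Deployments follow a per-job-type naming scheme (check the actual Deployment names in your cluster first):

# List the per-job-type extension Deployments in the kube-queue namespace.
kubectl get deploy -n kube-queue

# Scale the Deployment for the job type that you want to disable to 0 replicas.
# Replace <job-type-extension> with the actual Deployment name from the previous command.
kubectl scale deploy <job-type-extension> --replicas=0 -n kube-queue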

Versions v0.4.0 and later

Except for Argo workflows, queuing for the other job types is controlled by the Job-Extensions component. You can enable or disable queuing for a job type by modifying the value of the --enabled-extensions flag in the component's startup command. Separate multiple job types with commas (,). The following table lists the job types and the names used in the command, and a configuration sketch follows the table.

Job type            Name used in the command
TFJob               tfjob
PyTorchJob          pytorchjob
Job                 job
SparkApplication    sparkapp
RayJob              rayjob
RayJob (v1alpha1)   rayjobv1alpha1
MPIJob              mpiv1
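
The following is a minimal sketch of how the flag might be changed, assuming the Job-Extensions workload is a Deployment in the kube-queue namespace. The Deployment name used here is illustrative, so check the actual name in your cluster first.

# Open the Job-Extensions Deployment for editing. The name below is an assumption;
# list the Deployments with "kubectl get deploy -n kube-queue" to find the real one.
kubectl edit deploy kube-queue-job-extensions -n kube-queue

# In the container args, set the flag to the comma-separated job types that you want to queue, for example:
#   --enabled-extensions=tfjob,pytorchjob,job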

Submit TensorFlow jobs, PyTorch jobs, and MPI jobs to a queue

You must add the scheduling.x-k8s.io/suspend="true" annotation to a job. The following sample code submits a TensorFlow job to a queue.

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "job1"
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
  ...

Submit Kubernetes-native jobs to a queue

You must set the suspend field of the job to true. The following sample code submits a Kubernetes-native job to a queue.

apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true
  completions: 1
  parallelism: 1
  template:
    spec:
      schedulerName: default-scheduler
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["sleep",  "3s"]
        resources:
          requests:
            cpu: 100m
          limits:
            cpu: 100m
      restartPolicy: Never

In the preceding example, the job requests 100m of CPU and is queued. When the job is dequeued, the suspend field of the job is changed to false and the cluster starts to run the job.
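
To check whether a queued job has been dequeued, you can inspect its suspend field. A minimal sketch, assuming the generated Job name is pi-xxxxx (substitute the name returned by kubectl get jobs):

# Prints "true" while the job is still queued and "false" after it is dequeued.
kubectl get job pi-xxxxx -o jsonpath='{.spec.suspend}'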

Submit Argo workflows to a queue

Note

Install ack-workflow from the marketplace in the console.

You must add a custom suspend template named kube-queue-suspend to the Argo workflow and set the suspend field to true when you submit the workflow. Example:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: $example-name
spec:
  suspend: true # Set this field to true.
  entrypoint: $example-entrypoint
  templates:
  # Add a suspend template named kube-queue-suspend.
  - name: kube-queue-suspend
    suspend: {}
  - name: $example-entrypoint
  ...

Submit Spark applications to a queue

Note

Install ack-spark-operator from the marketplace in the console.

You must add the scheduling.x-k8s.io/suspend="true" annotation to a Spark application.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  generateName: spark-pi-suspend-
  namespace: spark-operator
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
  type: Scala
  mode: cluster
  image: registry-cn-beijing.ack.aliyuncs.com/acs/spark:v3.1.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    serviceAccount: ack-spark-operator3.0-spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"

Submit Ray jobs to a queue

Note

Install ack-kuberay-operator from the marketplace in the console.

You must set the spec.suspend field of a Ray job to true when you submit the Ray job to a queue.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  suspend: true

  rayClusterSpec:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: small-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: rayproject/ray:2.9.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"

Change the type of quota system

If a cluster is used by multiple users, you must allocate a fixed amount of resources to each user to prevent the users from competing for resources. The traditional method is to use Kubernetes resource quotas to allocate a fixed amount of resources to each user. However, resource requirements may differ among users. By default, ack-kube-queue uses elastic quotas to improve overall resource utilization. For more information about elastic quotas, see ElasticQuota. If you want to use Kubernetes resource quotas instead, perform the following steps:

  1. Run the following command to switch from elastic quotas to Kubernetes resource quotas:

    kubectl edit deploy kube-queue-controller -n kube-queue
  2. Change the value of the QueueGroupPlugin environment variable from elasticquota to resourcequota.

    env:
    - name: QueueGroupPlugin
      value: resourcequota
  3. Save the file after you modify the configuration. Wait for kube-queue-controller to start up. Then, use Kubernetes resource quotas to allocate resources.
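
Alternatively, you can make the same change without opening an editor. A minimal sketch that uses kubectl set env on the same Deployment:

# Sets QueueGroupPlugin=resourcequota on the controller and triggers a rollout.
kubectl set env deploy/kube-queue-controller QueueGroupPlugin=resourcequota -n kube-queue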

Enable blocking queues

Like kube-scheduler, ack-kube-queue processes jobs in a round-robin manner by default: the jobs in a queue request resources one after another. Jobs that fail to obtain resources are moved to the Unschedulable queue and wait for the next round of scheduling. When a cluster contains a large number of jobs with low resource demand, processing them in a round-robin manner is time-consuming, and jobs with high resource demand are more likely to remain pending because they compete for resources with the low-demand jobs. To avoid this problem, ack-kube-queue provides blocking queues. After you enable this feature, only the job at the head of a queue is scheduled, which reserves more resources for jobs with high resource demand.

Procedure

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Deployments in the left-side navigation pane.

  3. Set the Namespace parameter to kube-queue. Then, click Edit in the Actions column of kube-queue-controller.

  4. Click Add to the right of Environment Variable and add the following environment variable.

    Parameter         Value
    Type              Custom
    Variable Key      StrictPriority
    Value/ValueFrom   true

  5. Click Update on the right side of the page. In the OK message, click Confirm.
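
If you prefer the command line, the same environment variable can be added with kubectl set env. A minimal sketch, assuming the default component name kube-queue-controller:

# Adds StrictPriority=true to the controller, which enables blocking queues.
kubectl set env deploy/kube-queue-controller StrictPriority=true -n kube-queue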

Enable strict priority scheduling

By default, jobs that fail to obtain resources are moved to the Unschedulable queue and wait for the next round of scheduling. After a job is completed, the resources that it occupied are released, but they are not scheduled to high-priority jobs that are still in the Unschedulable queue. Consequently, idle resources may be scheduled to jobs with low priorities. To preferentially schedule idle resources to high-priority jobs, ack-kube-queue provides the strict priority scheduling feature. After a job releases resources, the system attempts to schedule the resources to the job with the highest priority, which is the first job in the queue. This ensures that idle resources are preferentially allocated to high-priority jobs.

Note

Jobs with low priorities can compete for idle resources when the idle resources are insufficient to fulfill jobs with high priorities.

Procedure

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Deployments in the left-side navigation pane.

  3. Set the Namespace parameter to kube-queue. Then, click Edit in the Actions column of kube-queue-controller.

  4. Click Add to the right of Environment Variable and add the following environment variable.

    Parameter         Value
    Type              Custom
    Variable Key      StrictConsistency
    Value/ValueFrom   true

  5. Click Update on the right side of the page. In the OK message, click Confirm.

Use case of resource quotas

ElasticQuota

  1. Use the following YAML template to create an ElasticQuotaTree:

    View YAML content

    apiVersion: v1
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""
    items:
      - apiVersion: scheduling.sigs.k8s.io/v1beta1
        kind: ElasticQuotaTree
        metadata:
          name: elasticquotatree
          namespace: kube-system
        spec:
          root: # The resource quota of the root is the total resource quota of the quota group. The maximum amount of resources of the root must be equal to or larger than the sum of the maximum resources of all leaves below the root. 
            name: root
            namespaces: []
            max:
              cpu: "4"
              memory: 4Gi
              nvidia.com/gpu: "64"
              aliyun.com/gpu-mem: "32"
            min:
              cpu: "0"
              memory: 0M
              nvidia.com/gpu: "0"
              aliyun.com/gpu-mem: "0"
            children: # You can specify multiple leaves. In most cases, each leaf corresponds to one namespace. 
              - name: root.defaultQuotaGroup
                namespaces:
                  - default
                max:
                  cpu: "4"
                  memory: 4Gi
                  nvidia.com/gpu: "64"
                  aliyun.com/gpu-mem: "16"
                min:
                  cpu: "0"
                  memory: 0M
                  nvidia.com/gpu: "0"
                  aliyun.com/gpu-mem: "0"
                children: null
  2. Run the following command to check whether the ElasticQuotaTree is created:

    kubectl get elasticquotatree -A

    Expected output:

    NAMESPACE     NAME               AGE
    kube-system   elasticquotatree   7s

    The ElasticQuotaTree is created.

  3. Create jobs.

    Note
    • To test the job queue management feature of ack-kube-queue, the resource quota for the jobs that you create must be less than the total amount of resources requested by the jobs.

    • To simplify the test, the TensorFlow image used by the TensorFlow jobs is replaced by the BusyBox image. Each container sleeps for 30 seconds to simulate the training process.

    1. Use the following YAML template to create two TensorFlow jobs:

      View YAML content

      apiVersion: "kubeflow.org/v1"
      kind: "TFJob"
      metadata:
        name: "job1"
        annotations:
          scheduling.x-k8s.io/suspend: "true"
      spec:
        tfReplicaSpecs:
          PS:
            replicas: 1
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: busybox
                    command:
                      - /bin/sh
                      - -c
                      - --
                    args:
                      - "sleep 30s"
                    resources:
                      requests:
                        cpu: 1
                        memory: 1Gi
                      limits:
                        cpu: 1
                        memory: 1Gi
      
          Worker:
            replicas: 2
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: busybox
                    command:
                      - /bin/sh
                      - -c
                      - --
                    args:
                      - "sleep 30s"
                    resources:
                      requests:
                        cpu: 1
                        memory: 1Gi
                      limits:
                        cpu: 1
                        memory: 1Gi
      ---
      apiVersion: "kubeflow.org/v1"
      kind: "TFJob"
      metadata:
        name: "job2"
        annotations:
          scheduling.x-k8s.io/suspend: "true"
      spec:
        tfReplicaSpecs:
          PS:
            replicas: 1
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: busybox
                    command:
                      - /bin/sh
                      - -c
                      - --
                    args:
                      - "sleep 30s"
                    resources:
                      requests:
                        cpu: 1
                        memory: 1Gi
                      limits:
                        cpu: 1
                        memory: 1Gi
      
          Worker:
            replicas: 2
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: busybox
                    command:
                      - /bin/sh
                      - -c
                      - --
                    args:
                      - "sleep 30s"
                    resources:
                      requests:
                        cpu: 1
                        memory: 1Gi
                      limits:
                        cpu: 1
                        memory: 1Gi
    2. Query the status of the jobs after you submit the jobs.

      1. Run the following command to query the status of the jobs:

        kubectl get tfjob

        Expected output:

        NAME   STATE     AGE
        job1   Running   3s
        job2   Queuing   2s

        The output indicates that job1 is in the Running state and job2 is in the Queuing state. This is because each TensorFlow job requests three vCores (one for the PS pod and one for each of the two worker pods), but the ElasticQuotaTree that you created allocates only four vCores to the default namespace. Consequently, the two TensorFlow jobs cannot run at the same time.

      2. Wait a period of time and run the following command again:

        kubectl get tfjob

        Expected output:

        NAME   STATE       AGE
        job1   Succeeded   77s
        job2   Running     77s

        The output indicates that job1 is completed and that job2 starts to run after job1 releases its resources. This shows that ack-kube-queue manages job queues as expected.

ResourceQuota

  1. Use the following YAML template to create a ResourceQuota:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: default
    spec:
      hard:
        cpu: "4"
        memory: 4Gi
  2. Run the following command to check whether the ResourceQuota is created:

    kubectl get resourcequota default -o wide

    Expected output:

    NAME      AGE   REQUEST                   LIMIT
    default   76s   cpu: 0/4, memory: 0/4Gi

    The ResourceQuota is created.

  3. Use the following YAML template to create two TensorFlow jobs:

    View YAML content

    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "job1"
      annotations:
        scheduling.x-k8s.io/suspend: "true"
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 1
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: busybox:stable
                  command:
                    - /bin/sh
                    - -c
                    - --
                  args:
                    - "sleep 30s"
                  resources:
                    requests:
                      cpu: 1
                      memory: 1Gi
                    limits:
                      cpu: 1
                      memory: 1Gi
    
        Worker:
          replicas: 2
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: busybox:stable
                  command:
                    - /bin/sh
                    - -c
                    - --
                  args:
                    - "sleep 30s"
                  resources:
                    requests:
                      cpu: 1
                      memory: 1Gi
                    limits:
                      cpu: 1
                      memory: 1Gi
    ---
    apiVersion: "kubeflow.org/v1"
    kind: "TFJob"
    metadata:
      name: "job2"
      annotations:
        scheduling.x-k8s.io/suspend: "true"
    spec:
      tfReplicaSpecs:
        PS:
          replicas: 1
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: busybox:stable
                  command:
                    - /bin/sh
                    - -c
                    - --
                  args:
                    - "sleep 30s"
                  resources:
                    requests:
                      cpu: 1
                      memory: 1Gi
                    limits:
                      cpu: 1
                      memory: 1Gi
    
        Worker:
          replicas: 2
          restartPolicy: Never
          template:
            spec:
              containers:
                - name: tensorflow
                  image: busybox:stable
                  command:
                    - /bin/sh
                    - -c
                    - --
                  args:
                    - "sleep 30s"
                  resources:
                    requests:
                      cpu: 1
                      memory: 1Gi
                    limits:
                      cpu: 1
                      memory: 1Gi
  4. After the two jobs are submitted, run the following command to query the status of the jobs:

    kubectl get tfjob
    NAME   STATE     AGE
    job1   Running   5s
    job2   Queuing   5s
    
    kubectl get pods
    NAME            READY   STATUS    RESTARTS   AGE
    job1-ps-0       1/1     Running   0          8s
    job1-worker-0   1/1     Running   0          8s
    job1-worker-1   1/1     Running   0          8s

    job1 is in the Running state and job2 is in the Queuing state. The result indicates that ack-kube-queue manages job queues as expected. Each TensorFlow job requests three vCores (one for the PS pod and one for each of the two worker pods), but the ResourceQuota that you created allows a maximum of four vCores in the default namespace. Therefore, the two TensorFlow jobs cannot run at the same time.

  5. Wait a period of time and then run the following command:

    kubectl get tfjob
    NAME   STATE       AGE
    job1   Succeeded   77s
    job2   Running     77s
    
    kubectl get pods
    NAME            READY   STATUS      RESTARTS   AGE
    job1-worker-0   0/1     Completed   0          54s
    job1-worker-1   0/1     Completed   0          54s
    job2-ps-0       1/1     Running     0          22s
    job2-worker-0   1/1     Running     0          22s
    job2-worker-1   1/1     Running     0          21s

    job1 is completed, and job2 starts to run after job1 releases its resources. The result indicates that ack-kube-queue manages job queues as expected.

Limit the number of jobs that can be concurrently dequeued

In scenarios where an application can be automatically scaled, the amount of resources required by the application may be unpredictable. In this case, you can limit the number of jobs that can be dequeued at the same time by defining a kube-queue/max-jobs resource in the ElasticQuotaTree. After the limit is set, the number of queue units that can be dequeued below the quota cannot exceed the maximum number of jobs in the queue multiplied by the overcommitment ratio. Example:

View YAML content

    apiVersion: v1
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""
    items:
      - apiVersion: scheduling.sigs.k8s.io/v1beta1
        kind: ElasticQuotaTree
        metadata:
          name: elasticquotatree
          namespace: kube-system
        spec:
          root: # The resource quota of the root is the total resource quota of the quota group. The maximum amount of resources of the root must be equal to or larger than the sum of the maximum resources of all leaves below the root. 
            name: root
            namespaces: []
            max:
              kube-queue/max-jobs: 10
              cpu: "4"
              memory: 4Gi
              nvidia.com/gpu: "64"
              aliyun.com/gpu-mem: "32"
            min:
              cpu: "0"
              memory: 0M
              nvidia.com/gpu: "0"
              aliyun.com/gpu-mem: "0"
            children: # You can specify multiple leaves. In most cases, each leaf corresponds to one namespace. 
              - name: root.defaultQuotaGroup
                namespaces:
                  - default
                max:
                  kube-queue/max-jobs: 10
                  cpu: "4"
                  memory: 4Gi
                  nvidia.com/gpu: "64"
                  aliyun.com/gpu-mem: "16"
                min:
                  cpu: "0"
                  memory: 0M
                  nvidia.com/gpu: "0"
                  aliyun.com/gpu-mem: "0"
                children: null