
Container Service for Kubernetes: Use ack-kube-queue to manage AI and machine learning workloads

Last Updated: Feb 24, 2025

ack-kube-queue is a job queue manager that improves resource utilization for AI, machine learning, and batch processing workloads in Container Service for Kubernetes (ACK) clusters. It provides flexible job queue management, automates workload allocation and resource quota management, and helps administrators improve resource utilization and job execution efficiency in ACK clusters. This topic describes how to install and configure ack-kube-queue and how to submit jobs to a queue after the installation.

Limits

Only ACK managed clusters, ACK Edge clusters, and ACK Lingjun clusters that run Kubernetes 1.18 or later are supported.

Install ack-kube-queue

ACK managed clusters and ACK Edge clusters

The cloud-native AI suite is not installed

  1. Activate the cloud-native AI suite.

  2. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  3. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Cloud-native AI Suite.

  4. In the lower part of the Cloud-native AI Suite page, click Deploy.

  5. In the Scheduling section, select Kube Queue. In the Interactive Mode section, select Arena. In the lower part of the Deploy Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.

The cloud-native AI suite is installed

  1. Activate the cloud-native AI suite.

  2. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  3. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Cloud-native AI Suite.

  4. Install ack-arena and ack-kube-queue.

    • On the Cloud-native AI Suite page, find ack-arena and click Deploy in the Actions column. In the Parameters panel, click OK.

    • On the Cloud-native AI Suite page, find ack-kube-queue and click Deploy in the Actions column. In the panel that appears, click OK.

    After ack-arena and ack-kube-queue are installed, Deployed is displayed in the Status column of the Components section.

ACK Lingjun cluster

  1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

  2. On the Marketplace page, enter ack-kube-queue in the search box and click the search icon. After ack-kube-queue is displayed, click its name.

  3. In the upper-right corner of the application details page, click Deploy. In the Basic Information step, set the Cluster, Name, and Release Name parameters. Then, click Next.

  4. In the Parameters step, set Chart Version to the latest version. Then, click OK.

Configure ack-kube-queue

You can use ack-kube-queue to enable queuing for various types of jobs, including TensorFlow jobs, PyTorch jobs, Message Passing Interface (MPI) jobs, Argo workflows, Ray jobs, Spark applications, and Kubernetes-native jobs. By default, ack-kube-queue supports only TensorFlow jobs and PyTorch jobs. You can enable support for other types of jobs based on your business requirements.

Limits

  • You must use the Operator provided by ack-arena to submit TensorFlow jobs, PyTorch jobs, and MPI jobs to a queue.

  • If you want to enable queuing for Kubernetes-native jobs, the Kubernetes version of the cluster must be 1.22 or later.

  • You can submit MPI jobs to a queue only by using Arena.

  • You can submit only entire Argo workflows to a queue; you cannot submit individual steps of an Argo workflow to a queue. You can add the following annotation to specify the resources requested by an Argo workflow, as shown in the sketch that follows:

    ...
     annotations:
       kube-queue/min-resources: |
         cpu: 5
         memory: 5G
    ...
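
    For reference, the following sketch shows where the annotation sits, assuming that it is added to the metadata of the Workflow itself. The $example-name placeholder is illustrative.

     apiVersion: argoproj.io/v1alpha1
     kind: Workflow
     metadata:
       generateName: $example-name
       annotations:
         # Total resources that the workflow requests while it waits in the queue.
         kube-queue/min-resources: |
           cpu: 5
           memory: 5G
     spec:
     ...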

Enable support for specific types of jobs

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Helm.

  3. Find ack-kube-queue and click Update in the Actions column.

  4. Modify the YAML template based on the following descriptions to enable support for specific types of jobs.

    • Set extension.argo.enable to true to enable support for Argo workflows.

    • Set extension.mpi.enable to true to enable support for MPI jobs.

    • Set extension.ray.enable to true to enable support for Ray jobs.

    • Set extension.spark.enable to true to enable support for Spark applications.

    • Set extension.tf.enable to true to enable support for TensorFlow jobs.

    • Set extension.pytorch.enable to true to enable support for PyTorch jobs.
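
    For example, if you want to enable queuing for Argo workflows and Ray jobs, the relevant part of the YAML template might look similar to the following sketch. This assumes that the dotted parameter names map to nested YAML keys; keep the other values in the template unchanged.

     extension:
       argo:
         enable: true   # Enable queuing for Argo workflows.
       ray:
         enable: true   # Enable queuing for Ray jobs.

    The other switches, such as extension.spark.enable and extension.mpi.enable, follow the same pattern.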

Submit jobs

Submit TensorFlow jobs, PyTorch jobs, and MPI jobs to a queue

You must add the scheduling.x-k8s.io/suspend="true" annotation to a job. The following sample code submits a TensorFlow job to a queue:

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "job1"
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
...
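
The scheduling.x-k8s.io/suspend="true" annotation also applies to PyTorch jobs and MPI jobs that are managed by the Operator provided by ack-arena. For example, a minimal sketch of a queued PyTorch job (the job name is illustrative) looks similar to the following:

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "job2"
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
...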

Submit Kubernetes-native jobs to a queue

You must set the suspend field of the Job to true. In the following example, the Job whose CPU request is 100m is queued. When the Job is dequeued, the suspend field is set to false and the Job is run by the relevant components in the cluster.

apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true
...
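
For reference, the following sketch expands the example into a complete Job. The pi container and its image are illustrative; the key parts are the suspend field and the 100m CPU request mentioned above.

apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        resources:
          requests:
            cpu: 100m   # The CPU request referenced in the description above.
      restartPolicy: Never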

Submit Argo workflows to a queue

Note

You must first install the Argo Workflows component from the Marketplace page in the ACK console.

When you use Argo Workflows, add a custom template named kube-queue-suspend of the suspend type to the workflow. When you submit the workflow, set spec.suspend to true. Example:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: $example-name
spec:
  suspend: true # Set this field to true. 
  entrypoint: $example-entrypoint
  templates:
  # Add a template of the suspend type named kube-queue-suspend. 
  - name: kube-queue-suspend
    suspend: {}
  - name: $example-entrypoint
...

Submit Spark applications to a queue

Note

You must first install ack-spark-operator from the Marketplace page in the ACK console.

When you submit a Spark application, add the scheduling.x-k8s.io/suspend="true" annotation to the SparkApplication configuration.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  generateName: spark-pi-suspend-
  namespace: spark-operator
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
...

Submit Ray jobs to a queue

Note

You must first install Kuberay-Operator from the Add-ons page in the ACK console. For more information, see Manage components.

When you submit a Ray job, set spec.suspend to true.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster is not created and the controller waits for the field to transition to false.
  # If the RayCluster is already created, it is deleted. When the field transitions to false, a new RayCluster is created.
  suspend: true
...