Container Service for Kubernetes: Manage AI/ML workloads with the ack-kube-queue job queue

Last Updated: Mar 25, 2026

ack-kube-queue is a job queue manager for Container Service for Kubernetes (ACK). It controls when jobs start by holding them in a managed queue until the cluster has sufficient resources. This prevents resource contention and reduces idle GPU and CPU time for AI, machine learning (ML), and batch processing workloads.

This topic covers how to install ack-kube-queue, enable support for additional job types, and submit jobs to a queue.

How it works

ack-kube-queue uses Kubernetes' native suspend mechanism. When you submit a job with suspend set to true (via a field or annotation depending on job type), the job enters the queue. ack-kube-queue monitors cluster resource availability and sets suspend to false when enough resources are free, allowing the job to start.

Supported cluster types

ack-kube-queue supports:

  • ACK managed clusters running Kubernetes 1.18 or later

  • ACK Edge clusters running Kubernetes 1.18 or later

  • ACK Lingjun clusters running Kubernetes 1.18 or later

Install ack-kube-queue

Installation steps vary by cluster type.

ACK managed clusters and ACK Edge clusters

Choose the procedure that matches your cluster's current state.

If the cloud-native AI suite is not yet installed

  1. Activate the cloud-native AI suite.

  2. Log on to the ACK console. In the left navigation pane, click Clusters.

  3. On the Clusters page, find your cluster and click its name. In the left navigation pane, choose Applications > Cloud-native AI Suite.

  4. At the bottom of the Cloud-native AI Suite page, click Deploy. In the Scheduling section, select Kube-Queue. In the Ecosystem Tools section, select Kubeflow and Arena. Then click Deploy Cloud-native AI Suite.

If the cloud-native AI suite is already installed

  1. Activate the cloud-native AI suite.

  2. Log on to the ACK console. In the left navigation pane, click Clusters.

  3. On the Clusters page, find your cluster and click its name. In the left navigation pane, choose Applications > Cloud-native AI Suite.

  4. Install ack-arena and ack-kube-queue separately:

    • Find ack-arena and click Deploy in the Actions column. In the Parameters panel, click OK.

    • Find ack-kube-queue and click Deploy in the Actions column. In the panel that appears, click OK.

     After both components are installed, the Status column in the Components section shows Deployed.

ACK Lingjun clusters

  1. Log on to the ACK console. In the left navigation pane, choose Marketplace > Marketplace.

  2. On the Marketplace page, search for ack-kube-queue and click its name in the results.

  3. In the upper-right corner of the application details page, click Deploy. In the Basic Information step, set Cluster, Namespace, and Release Name, then click Next.

  4. In the Parameters step, set Chart Version to the latest version, then click OK.

Enable support for additional job types

By default, ack-kube-queue supports TensorFlow jobs and PyTorch jobs. To queue other job types—Message Passing Interface (MPI) jobs, Argo workflows, Ray jobs, Spark applications, or Kubernetes-native jobs—enable each type individually.

Note

To queue Kubernetes-native jobs, the cluster must run Kubernetes 1.22 or later.

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find your cluster and click its name. In the left navigation pane, choose Applications > Helm.

  3. Find ack-kube-queue and click Update in the Actions column.

  4. In the YAML template, set the parameters for the job types you want to enable:

    Parameter                         Effect
    extension.argo.enable: true       Enable Argo workflows
    extension.mpi.enable: true        Enable MPI jobs
    extension.ray.enable: true        Enable Ray jobs
    extension.spark.enable: true      Enable Spark applications
    extension.tf.enable: true         Enable TensorFlow jobs
    extension.pytorch.enable: true    Enable PyTorch jobs
  5. Click OK to apply the changes.
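
Dotted parameter names such as extension.mpi.enable usually correspond to nested keys in a Helm values file. The fragment below is a sketch under that assumption, enabling MPI jobs and Argo workflows while keeping the two default job types on. The actual layout of the ack-kube-queue template may differ, so edit the keys that already exist in the template rather than pasting this fragment verbatim.

```yaml
# Sketch only: assumes the chart nests these keys the way the dotted names suggest.
extension:
  argo:
    enable: true      # queue Argo workflows
  mpi:
    enable: true      # queue MPI jobs
  tf:
    enable: true      # TensorFlow jobs (supported by default)
  pytorch:
    enable: true      # PyTorch jobs (supported by default)
```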

Submit jobs

The following sections explain how to submit each supported job type to a queue and verify that the job is queued.

Constraints

Job type                    Constraint
TensorFlow, PyTorch, MPI    Must use the operator provided by ack-arena
MPI                         Can only be submitted via Arena
Argo workflows              Only full workflows can be queued; individual workflow steps cannot. Declare the workflow's resource requirements using the kube-queue/min-resources annotation (see the Argo workflows section).
Kubernetes-native jobs      Cluster must run Kubernetes 1.22 or later

TensorFlow jobs, PyTorch jobs, and MPI jobs

Add the annotation scheduling.x-k8s.io/suspend: "true" to the job manifest.

The following example submits a TensorFlow job to a queue:

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "job1"
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
...

To verify that the job is queued, run:

kubectl describe tfjob job1

While the job is queued, the status output includes the Suspended condition.
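
The same annotation applies to PyTorch jobs. The manifest below is a minimal sketch of a queued PyTorchJob. The job name, image, command, and GPU figures are placeholders chosen for illustration and do not come from this topic.

```yaml
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-job1"                      # hypothetical name
  annotations:
    scheduling.x-k8s.io/suspend: "true"     # hold the job in the queue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch                   # container name the training operator looks for
            image: <your-training-image>    # placeholder
            command: ["python", "/workspace/train.py"]   # placeholder entry point
            resources:
              limits:
                nvidia.com/gpu: 1           # illustrative figure
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: <your-training-image>    # placeholder
            command: ["python", "/workspace/train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
```

By analogy with the TensorFlow example, kubectl describe pytorchjob pytorch-job1 should show the queued state, although this topic only documents the check for TFJob.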

Kubernetes-native jobs

Set spec.suspend to true. When the job is dequeued, ack-kube-queue changes this field to false and the job starts running.

apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true
...

To verify that the job is queued, run:

kubectl get job <job-name>

A queued job shows SUSPENDED in the output. When ack-kube-queue admits the job, SUSPENDED clears and the job starts running.
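
For reference, a complete manifest might look like the following sketch. It reuses the pi example from the upstream Kubernetes Job documentation; the image, command, and resource requests are illustrative assumptions and do not come from this topic.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true                # submit suspended so that the job enters the queue
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pi
        image: perl:5.34.0     # example image, an assumption for this sketch
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        resources:
          requests:            # illustrative figures
            cpu: "1"
            memory: 1Gi
```

Because the manifest uses generateName rather than name, submit it with kubectl create -f instead of kubectl apply -f, and pass the generated name to kubectl get job.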

Argo workflows

Prerequisite: Install the Argo Workflows component from the Marketplace page in the ACK console. For steps, see Install Argo Workflows.

Add a template of the suspend type named kube-queue-suspend to the workflow, and set spec.suspend to true when you submit it. To declare the workflow's resource requirements, add the kube-queue/min-resources annotation:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: <workflow-name>-
  annotations:
    kube-queue/min-resources: |
      cpu: 5
      memory: 5G
spec:
  suspend: true
  entrypoint: <entrypoint-template>
  templates:
  # Required: add a suspend template named kube-queue-suspend
  - name: kube-queue-suspend
    suspend: {}
  - name: <entrypoint-template>
    # ... your workflow steps

Spark applications

Prerequisite: Install ack-spark-operator from the Marketplace page in the ACK console. For steps, see Install ack-spark-operator.

Add the annotation scheduling.x-k8s.io/suspend: "true" to the SparkApplication manifest:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  generateName: spark-pi-suspend-
  namespace: spark-operator
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
...

Ray jobs

Prerequisite: Install Kuberay-Operator from the Add-ons page in the ACK console. For more information, see Manage components.

Set spec.suspend to true:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # suspend controls whether the RayJob controller creates a RayCluster instance.
  # If the job is applied with suspend set to true, no RayCluster is created until the field changes to false.
  # If a RayCluster already exists when suspend is set to true, it is deleted. A new RayCluster is created when the field changes back to false.
  suspend: true
...
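
A fuller RayJob manifest might look like the sketch below. The entrypoint, Ray version, image tags, group name, and resource requests are illustrative assumptions; check them against the KubeRay version installed in your cluster.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  suspend: true                        # submit suspended so that the job is queued
  # Hypothetical entrypoint: prints the resources that the Ray cluster reports
  entrypoint: 'python -c "import ray; ray.init(); print(ray.cluster_resources())"'
  rayClusterSpec:
    rayVersion: "2.9.0"                # assumed version; match it to the image tag
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              requests:                # illustrative figures
                cpu: "1"
                memory: 2Gi
    workerGroupSpecs:
    - groupName: small-group           # hypothetical group name
      replicas: 1
      minReplicas: 1
      maxReplicas: 1
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray:2.9.0
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
```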
