Container Service for Kubernetes: ack-kube-queue

Last Updated: Mar 26, 2026

ack-kube-queue is the job queue management component in the cloud-native AI suite. It integrates with the Kubernetes scheduler and quota system to queue AI/machine learning (ML) and batch workloads, schedule jobs by priority, and enforce elastic quotas, which helps you maximize resource utilization across the cluster.

Overview

AI/ML and batch jobs in Kubernetes spawn large numbers of pods, which increases scheduler load. When multiple teams submit jobs simultaneously, they compete for resources and interfere with each other.

ack-kube-queue addresses this by giving system administrators fine-grained control over job queues. Combined with the quota system, it automates workload and resource quota management so that the cluster's capacity is fully utilized rather than fragmented across competing jobs.

Key capabilities:

  • Job queue management: Queue AI/ML and batch jobs, control submission order, and retrieve job sequence information from queues.

  • Priority scheduling: Schedule jobs based on configurable priorities, including strict priority scheduling that enforces ordering across queues.

  • Blocking queues: Hold jobs in the queue until resources are available, preventing resource fragmentation.

  • Elastic quotas: Manage resource quotas dynamically using ElasticQuotaTree, including a limit on the number of concurrently dequeued jobs, set with kube-queue/max-jobs.

  • Multi-queue fair queuing: Distribute jobs across multiple queues with fair queuing to prevent congestion.

  • Re-queuing: Automatically reclaim and resubmit jobs whose pods stay pending for a long time because of topology-aware scheduling, node affinity constraints, or resource fragmentation. Reclaiming a job releases the quota it occupied and improves overall utilization.

  • Broad job type support: Works with MPI jobs (submitted via Arena) and Argo Workflows.

  • ARM node support: The kube-queue-controller, tf-operator-extension, and pytorch-operator-extension components run on ARM-based nodes.
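
The difference between a blocking and a non-blocking queue is easier to see in code. The following Python sketch is only a conceptual model, not ack-kube-queue's implementation: the `Job` class, the `dequeue` function, the single `cpu` number standing in for resource requests, and the rule that a higher `priority` value is served first are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # higher value is served first (assumption for this sketch)
    cpu: int       # requested CPU cores, a stand-in for real resource requests


def dequeue(jobs, free_cpu, blocking):
    """Return the names of jobs admitted from one queue.

    Jobs are visited in descending priority; ties keep submission order.
    In blocking mode, the first job that does not fit stops the scan, so
    smaller jobs behind it cannot take the resources it is waiting for.
    """
    admitted = []
    # sorted() is stable, so equal priorities stay in submission order.
    for job in sorted(jobs, key=lambda j: -j.priority):
        if job.cpu <= free_cpu:
            free_cpu -= job.cpu
            admitted.append(job.name)
        elif blocking:
            break  # the head of the queue waits; nothing behind it is admitted
    return admitted


queue = [
    Job("train-large", priority=10, cpu=8),
    Job("eval-small", priority=5, cpu=2),
    Job("etl-small", priority=5, cpu=2),
]

non_blocking = dequeue(queue, free_cpu=4, blocking=False)
blocking_mode = dequeue(queue, free_cpu=4, blocking=True)
print(non_blocking)   # ['eval-small', 'etl-small']
print(blocking_mode)  # []
```

With 4 free cores, the non-blocking scan admits both small jobs while the large job keeps waiting. In blocking mode nothing is admitted until 8 cores are free, so the small jobs cannot repeatedly consume the capacity that the large job needs.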

Supported environments

ack-kube-queue requires one of the following cluster types running Kubernetes 1.18 or later:

  • ACK Pro clusters

  • ACK Serverless Pro clusters

  • ACK Edge Pro clusters

Install ack-kube-queue when you deploy the cloud-native AI suite, or add it after deployment. For installation steps and feature configuration, see Use ack-kube-queue to manage job queues.

Release notes

January 2024

Version | Change | Type | Workload impact
v0.3.4 | Fixed head-of-line blocking that occasionally occurs in block mode when the first task in the queue is deleted. | Bug fix | None

December 2023

Version | Change | Type | Workload impact
v0.3.3 | Setting blocking queues globally via environment variables now refreshes the blocking queue mode for all queues. | Enhancement | None

September 2023

Version | Change | Type | Workload impact
v0.3.1 | Fixed queue errors that occasionally occur during QueueUnit deletion. | Bug fix | None
v0.3.0 | Job sequence information can now be retrieved from queues. | Feature | None

August 2023

Version | Change | Type | Workload impact
v0.2.1 | Fixed an issue where NodeSelector in the template prevents scheduling on worker nodes. | Bug fix | None
v0.2.0 | Added support for submitting Message Passing Interface (MPI) jobs via Arena, queuing Argo Workflows, and limiting concurrently dequeued jobs using kube-queue/max-jobs in ElasticQuotaTree. Improved logs for job dequeuing failures. | Feature | None
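
The effect of a cap on concurrently dequeued jobs can be illustrated with a short Python sketch. In ack-kube-queue the cap is configured with kube-queue/max-jobs in ElasticQuotaTree; the code below models only the counting behavior, and the `admit` function and the `pending` and `running` containers are hypothetical names introduced for this example.

```python
from collections import deque


def admit(pending, running, max_jobs):
    """Move jobs from `pending` to `running` until `max_jobs` are running.

    `pending` is a deque of job names in queue order; `running` is a set.
    Returns the list of job names dequeued by this call.
    """
    dequeued = []
    while pending and len(running) < max_jobs:
        name = pending.popleft()
        running.add(name)
        dequeued.append(name)
    return dequeued


pending = deque(["job-a", "job-b", "job-c"])
running = set()

first = admit(pending, running, max_jobs=2)
running.discard("job-a")  # job-a finishes and frees one slot
second = admit(pending, running, max_jobs=2)
print(first, second)  # ['job-a', 'job-b'] ['job-c']
```

The third job stays in the queue until one of the first two finishes, regardless of how much quota is still free.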

July 2023

Version | Change | Type | Workload impact
v0.1.13 | Fixed a functional issue caused by a missing LastUpdateTime field. | Bug fix | None
v0.1.12 | Added a per-queue switch to enable or disable the blocking queue feature. You can disable the re-queuing feature by setting the timeout parameter in the extension to 0. | Feature | None

June 2023

Version | Change | Type | Workload impact
v0.1.11 | QueueUnit status is now synchronized when tasks are updated. | Enhancement | None
v0.1.10 | ARM-based nodes are now supported for kube-queue-controller, tf-operator-extension, and pytorch-operator-extension. | Feature | None

May 2023

Version | Change | Type | Workload impact
v0.1.9 | Added re-queuing for long-pending jobs and multi-queue fair queuing. If pods remain pending due to topology-aware scheduling, node affinity, or resource fragments, ack-kube-queue reclaims the job and resubmits it to the queue, which releases the occupied quota and improves overall utilization. | Feature | None
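
As a rough illustration of re-queuing, the Python sketch below moves jobs back to the queue once they have been dequeued but pending for longer than a timeout. It is a simplified model under stated assumptions, not the component's actual logic: the `requeue_pending` function, the `dequeued` mapping, and the use of plain seconds are hypothetical. The only behavior taken from these release notes is that a timeout of 0 disables re-queuing (v0.1.12).

```python
def requeue_pending(queue, dequeued, now, timeout):
    """Return jobs that were moved back to `queue` because they timed out.

    `dequeued` maps a job name to the time (in seconds) it left the queue
    while its pods are still pending. A `timeout` of 0 disables re-queuing.
    """
    if timeout == 0:
        return []
    expired = [name for name, since in dequeued.items() if now - since >= timeout]
    for name in expired:
        del dequeued[name]   # releases the quota the job was holding
        queue.append(name)   # the job waits for its next turn
    return expired


queue = ["job-c"]
dequeued = {"job-a": 0, "job-b": 50}  # job name -> dequeue time in seconds

moved = requeue_pending(queue, dequeued, now=60, timeout=60)
print(moved, queue, dequeued)  # ['job-a'] ['job-c', 'job-a'] {'job-b': 50}
```

Only job-a has been pending for the full 60 seconds, so it alone returns to the back of the queue and stops counting against the quota.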

April 2023

Version | Change | Type | Workload impact
v0.1.8 | Added blocking queues and strict priority scheduling. See Enable blocking queues and Enable strict priority scheduling. | Feature | None

March 2023

Version | Change | Type | Workload impact
v0.1.6 | Fixed an issue where TensorFlow job status is not displayed. | Bug fix | None

February 2023

Version | Change | Type | Workload impact
v0.1.5 | Fixed an issue where ack-kube-queue occasionally fails to delete jobs. | Bug fix | None
v0.1.4 | Fixed an issue where Used information is occasionally lost after a job queue unit is dequeued. | Bug fix | None

January 2023

Version | Change | Type | Workload impact
v0.1.3 | Fixed an issue where job queue units are occasionally lost. | Bug fix | None
v0.1.2 | Fixed an issue where jobs cannot be dequeued for extended periods. | Bug fix | None
v0.1.1 | Added multi-queue support. Jobs with different resource quotas are submitted to separate queues to prevent congestion. | Feature | None
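
The idea of one queue per quota can be sketched in Python as follows. This is an illustrative model only: `build_queues` and `round_robin` are hypothetical helpers, the quota names are invented, and round-robin is used as the simplest stand-in for a fair policy because these notes do not specify how ack-kube-queue orders dequeuing across queues.

```python
from collections import deque


def build_queues(jobs):
    """Group (job_name, quota_name) pairs into one FIFO queue per quota."""
    queues = {}
    for job_name, quota_name in jobs:
        queues.setdefault(quota_name, deque()).append(job_name)
    return queues


def round_robin(queues):
    """Yield jobs by taking one from each non-empty queue in turn."""
    while any(queues.values()):
        for q in queues.values():
            if q:
                yield q.popleft()


jobs = [
    ("a1", "team-a"), ("a2", "team-a"), ("a3", "team-a"),
    ("b1", "team-b"),
]
order = list(round_robin(build_queues(jobs)))
print(order)  # ['a1', 'b1', 'a2', 'a3']
```

Because team-b has its own queue, its single job is served second instead of waiting behind all three team-a jobs, which is the congestion that separate queues are meant to avoid.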

October 2022

Version | Change | Type | Workload impact
v0.1.0 | Initial release. |  | None