ack-kube-queue is the job queue management component in the cloud-native AI suite. It integrates with the Kubernetes scheduler and quota system to queue AI/machine learning (ML) and batch workloads, schedule jobs by priority, and enforce elastic quotas — helping you maximize resource utilization across the cluster.
Overview
AI/ML and batch jobs in Kubernetes spawn large numbers of pods, which increases scheduler load. When multiple teams submit jobs simultaneously, they compete for resources and interfere with each other.
ack-kube-queue addresses this by giving system administrators fine-grained control over job queues. Combined with the quota system, it automates workload and resource quota management so that the cluster's capacity is fully utilized rather than fragmented across competing jobs.
Key capabilities:
Job queue management — Queue AI/ML and batch jobs, control submission order, and retrieve job sequence information from queues.
Priority scheduling — Schedule jobs based on configurable priorities, including strict priority scheduling that enforces ordering across queues.
Blocking queues — Hold jobs in queue until resources are available, preventing resource fragmentation.
Elastic quotas — Manage resource quotas dynamically using ElasticQuotaTree, including limits on concurrently dequeued jobs via kube-queue/max-jobs.
Multi-queue fair queuing — Distribute jobs across multiple queues with fair queuing to prevent congestion.
Re-queuing — Automatically reclaim and resubmit long-pending jobs caused by topology-aware scheduling, node affinity constraints, or resource fragments, releasing occupied quotas and improving overall utilization.
Broad job type support — Works with MPI jobs (submitted via Arena) and Argo Workflows.
ARM node support — The kube-queue-controller, tf-operator-extension, and pytorch-operator-extension components run on ARM-based nodes.
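To make the interaction between priority scheduling, blocking queues, and the concurrent-job limit more concrete, the following is a minimal conceptual sketch in Python. It is not the ack-kube-queue implementation. The `Job` class, the `dequeue` function, and the `free_cpus`, `max_jobs`, and `blocking` parameters are hypothetical names chosen for illustration, and a single CPU count stands in for a real resource quota.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # higher value is dequeued first (assumption for this sketch)
    cpus: int      # simplified stand-in for the job's resource request


def dequeue(pending, free_cpus, max_jobs, blocking):
    """Return the names of jobs admitted in one pass over the queue.

    pending   -- jobs in submission order
    free_cpus -- hypothetical amount of free quota
    max_jobs  -- cap on concurrently dequeued jobs, mirroring the idea
                 behind kube-queue/max-jobs
    blocking  -- if True, stop at the first job that does not fit
    """
    # sorted() is stable, so jobs with equal priority keep submission order.
    ordered = sorted(pending, key=lambda job: -job.priority)
    admitted = []
    for job in ordered:
        if len(admitted) >= max_jobs:
            break
        if job.cpus <= free_cpus:
            free_cpus -= job.cpus
            admitted.append(job.name)
        elif blocking:
            # The job at the head waits for resources; jobs behind it
            # are not allowed to jump ahead and consume the free quota.
            break
    return admitted


jobs = [
    Job("train-large", priority=10, cpus=8),
    Job("eval-small", priority=5, cpus=2),
    Job("etl", priority=5, cpus=2),
    Job("notebook", priority=1, cpus=1),
]

print(dequeue(jobs, free_cpus=4, max_jobs=10, blocking=True))   # []
print(dequeue(jobs, free_cpus=4, max_jobs=10, blocking=False))  # ['eval-small', 'etl']
print(dequeue(jobs, free_cpus=4, max_jobs=1, blocking=False))   # ['eval-small']
```

In this sketch, blocking mode keeps the free capacity reserved for the high-priority job at the head of the queue, while non-blocking mode lets smaller jobs fill the gap at the cost of delaying the large job further. The third call shows how a job-count cap limits admission even when quota remains.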
Supported environments
ack-kube-queue requires one of the following cluster types running Kubernetes 1.18 or later:
ACK Pro clusters
ACK Serverless Pro clusters
ACK Edge Pro clusters
Install ack-kube-queue when you deploy the cloud-native AI suite, or add it after deployment. For installation steps and feature configuration, see Use ack-kube-queue to manage job queues.
Release notes
January 2024
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.3.4 | Fixed head-of-line blocking that occasionally occurred in blocking mode when the first task in the queue was deleted. | Bug fix | None |
December 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.3.3 | Setting blocking queues globally via environment variables now refreshes the blocking queue mode for all queues. | Enhancement | None |
September 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.3.1 | Fixed queue errors that occasionally occur during QueueUnit deletion. | Bug fix | None |
| v0.3.0 | Job sequence information can now be retrieved from queues. | Feature | None |
August 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.2.1 | Fixed an issue where NodeSelector in the template prevents scheduling on worker nodes. | Bug fix | None |
| v0.2.0 | Added support for submitting Message Passing Interface (MPI) jobs via Arena, queuing Argo Workflows, and limiting concurrently dequeued jobs using kube-queue/max-jobs in ElasticQuotaTree. Improved logs for job dequeuing failures. | Feature | None |
July 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.1.13 | Fixed a functional issue caused by a missing LastUpdateTime field. | Bug fix | None |
| v0.1.12 | Added a per-queue switch to enable or disable the blocking queue feature. You can disable the re-queuing feature by setting the timeout parameter in the extension to 0. | Feature | None |
June 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.1.11 | QueueUnit status is now synchronized when tasks are updated. | Enhancement | None |
| v0.1.10 | ARM-based nodes are now supported for kube-queue-controller, tf-operator-extension, and pytorch-operator-extension. | Feature | None |
May 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.1.9 | Added re-queuing for long-pending jobs and multi-queue fair queuing. If pods remain pending due to topology-aware scheduling, node affinity, or resource fragments, ack-kube-queue reclaims the job and resubmits it to the queue — releasing the occupied quota and improving overall utilization. | Feature | None |
April 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.1.8 | Added blocking queues and strict priority scheduling. See Enable blocking queues and Enable strict priority scheduling. | Feature | None |
March 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.1.6 | Fixed an issue where TensorFlow job status is not displayed. | Bug fix | None |
February 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.1.5 | Fixed an issue where ack-kube-queue occasionally fails to delete jobs. | Bug fix | None |
| v0.1.4 | Fixed an issue where Used information is occasionally lost after a job queue unit is dequeued. | Bug fix | None |
January 2023
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.1.3 | Fixed an issue where job queue units are occasionally lost. | Bug fix | None |
| v0.1.2 | Fixed an issue where jobs cannot be dequeued for extended periods. | Bug fix | None |
| v0.1.1 | Added multi-queue support. Jobs with different resource quotas are submitted to separate queues to prevent congestion. | Feature | None |
October 2022
| Version | Change | Type | Workload impact |
|---|---|---|---|
| v0.1.0 | Initial release. | — | None |