ack-kube-queue

ack-kube-queue is a kube-queue component provided by the cloud-native AI suite. It works with the scheduler and quota system to allow you to manage job queues, schedule jobs based on priorities, and use elastic quotas. ack-kube-queue can optimize the management and scheduling of AI/machine learning (ML) workloads and batch workloads in Kubernetes. This topic introduces ack-kube-queue and describes the usage notes and release notes for ack-kube-queue.

Introduction

AI/ML jobs or batch jobs in Kubernetes usually create a large number of pods, which increases the load on the scheduler. In addition, jobs submitted by different users may interfere with each other. ack-kube-queue provides all features of kube-queue to manage AI/ML workloads and batch workloads in Kubernetes. This component allows system administrators to customize job queue management to improve the flexibility of queues. Combined with a quota system, ack-kube-queue can automate and optimize the management of workloads and resource quotas to maximize resource utilization in Kubernetes clusters.
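The idea of queuing jobs by priority and admitting them only while quota is available can be illustrated with a small model. The following is a minimal sketch, not the ack-kube-queue implementation: the `ToyJobQueue` class, its methods, and the CPU-only quota are all hypothetical names and simplifications chosen for illustration.

```python
import heapq
from itertools import count


class ToyJobQueue:
    """Hypothetical model: a priority-ordered job queue gated by a CPU quota."""

    def __init__(self, cpu_quota):
        self.cpu_quota = cpu_quota
        self.cpu_used = 0
        self._heap = []
        self._seq = count()  # tie-breaker: earlier submissions come first

    def submit(self, name, priority, cpu):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), name, cpu))

    def dequeue_ready(self):
        """Admit jobs in priority order while the head of the queue fits the quota."""
        admitted = []
        while self._heap:
            _, _, name, cpu = self._heap[0]
            if self.cpu_used + cpu > self.cpu_quota:
                break  # the head does not fit, so it stays queued and no pods are created
            heapq.heappop(self._heap)
            self.cpu_used += cpu
            admitted.append(name)
        return admitted

    def finish(self, cpu):
        """Release the quota held by a completed job."""
        self.cpu_used -= cpu


q = ToyJobQueue(cpu_quota=8)
q.submit("train-a", priority=1, cpu=4)
q.submit("train-b", priority=5, cpu=6)
q.submit("batch-c", priority=1, cpu=2)

first = q.dequeue_ready()   # only the highest-priority job fits the quota
q.finish(cpu=6)             # train-b completes and releases its quota
second = q.dequeue_ready()  # the remaining jobs are admitted in submission order
```

In this sketch a job that does not fit the quota holds back the jobs behind it. That is a design choice of the toy model; a real queue may apply a different policy.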

Usage notes

Only Container Service for Kubernetes (ACK) Pro clusters, ACK Serverless Pro clusters, and ACK Edge Pro clusters that run Kubernetes 1.18 or later support ack-kube-queue.

You can install ack-kube-queue when you deploy the cloud-native AI suite or install it after the cloud-native AI suite is deployed. After you install ack-kube-queue, you can use features such as blocking queues and strict priority scheduling. For more information about how to install and use ack-kube-queue, see Use ack-kube-queue to manage job queues.

Release notes

June 2023

Version	Description	Release date	Impact
v0.1.10	ARM-based nodes are supported by components such as kube-queue-controller, tf-operator-extension, and pytorch-operator-extension.	2023-06-14	No impact on workloads

May 2023

Version	Description	Release date	Impact
v0.1.9	Jobs that remain pending for a long period of time can be resubmitted to the job queue, and multi-queue fair queuing is supported. If the pods created by a job remain pending for a long period of time due to topology-aware scheduling, node affinity, or resource fragmentation, ack-kube-queue reclaims the job and resubmits the job to the queue. This helps release the resource quota occupied by the job and improves the overall resource quota utilization.	2023-05-16	No impact on workloads
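The reclaim-and-resubmit behavior described for v0.1.9 can be sketched as follows. This is a simplified model under stated assumptions, not the component's actual logic: the `requeue_stale_jobs` function, the `pending_since` field, the timeout parameter, and resubmission to the tail of the queue are all hypothetical.

```python
from collections import deque


def requeue_stale_jobs(running, queue, quota_used, now, max_pending_seconds):
    """Move jobs whose pods have been pending too long back to the queue.

    running: dict of job name -> {"cpu": int, "pending_since": float or None}.
    pending_since is None once the pods of the job are scheduled (hypothetical field).
    Returns the updated quota usage.
    """
    for name in list(running):
        job = running[name]
        since = job["pending_since"]
        if since is not None and now - since > max_pending_seconds:
            del running[name]
            quota_used -= job["cpu"]  # release the quota held by the stuck job
            queue.append(name)        # assumption: resubmit to the tail of the queue
    return quota_used


running = {
    "job-a": {"cpu": 4, "pending_since": None},   # pods are running
    "job-b": {"cpu": 8, "pending_since": 100.0},  # pods pending since t=100
}
queue = deque(["job-c"])
quota_used = requeue_stale_jobs(
    running, queue, quota_used=12, now=1000.0, max_pending_seconds=600
)
```

After the call, `job-b` no longer holds quota, so a queued job that can actually be scheduled has a chance to use it.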

April 2023

Version	Description	Release date	Impact
v0.1.8	Blocking queues and strict priority scheduling are supported. For more information, see Enable blocking queues and Enable strict priority scheduling.	2023-04-25	No impact on workloads

March 2023

Version	Description	Release date	Impact
v0.1.6	The issue that the status of TensorFlow jobs is not displayed is fixed.	2023-03-15	No impact on workloads

February 2023

Version	Description	Release date	Impact
v0.1.5	The issue that ack-kube-queue occasionally fails to delete jobs is fixed.	2023-02-28	No impact on workloads
v0.1.4	The issue that the Used information is occasionally lost after a job queue unit is dequeued is fixed.	2023-02-14	No impact on workloads

January 2023

Version	Description	Release date	Impact
v0.1.3	The issue that job queue units are occasionally lost is fixed.	2023-01-12	No impact on workloads
v0.1.2	The issue that jobs occasionally cannot be dequeued for a long period of time is fixed.	2023-01-12	No impact on workloads
v0.1.1	Multi-queue is supported. Jobs with different resource quotas are submitted to different queues to avoid congestion.	2023-01-10	No impact on workloads
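The multi-queue idea in v0.1.1 can be pictured as one queue per resource quota, so that a backlog under one quota does not hold up jobs under another. The sketch below is illustrative only. The `submit` and `next_job` helpers and the quota names `team-a` and `team-b` are hypothetical and do not reflect the component's API.

```python
from collections import defaultdict, deque

# One FIFO queue per quota name (hypothetical model).
queues = defaultdict(deque)


def submit(job_name, quota_name):
    """Place the job in the queue that belongs to its resource quota."""
    queues[quota_name].append(job_name)


def next_job(quota_name):
    """Return the next job for a quota, or None if its queue is empty."""
    q = queues[quota_name]
    return q.popleft() if q else None


submit("big-train-1", "team-a")
submit("big-train-2", "team-a")
submit("small-etl", "team-b")

# The team-b job is not stuck behind the two team-a jobs.
head_b = next_job("team-b")
head_a = next_job("team-a")
```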

October 2022

Version	Description	Release date	Impact
v0.1.0	This is the first release.	2022-10-15	No impact on workloads