ack-kube-queue is a job queue manager that optimizes job management and improves resource utilization for AI, machine learning, and batch processing workloads in Container Service for Kubernetes (ACK) clusters. ack-kube-queue provides flexible job queue management, automates workload allocation and resource quota management, and helps system administrators improve resource utilization and job execution efficiency in ACK clusters. This topic describes how to install and configure ack-kube-queue and how to submit jobs after the installation.
Limits
Only ACK managed clusters, ACK Edge clusters, and ACK Lingjun clusters that run Kubernetes 1.18 or later are supported.
Install ack-kube-queue
ACK managed clusters and ACK Edge clusters
The cloud-native AI suite is not installed
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.
In the lower part of the Cloud-native AI Suite page, click Deploy.
In the Scheduling section, select Kube Queue. In the Interactive Mode section, select Arena. In the lower part of the Deploy Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.
The cloud-native AI suite is installed
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.
Install ack-arena and ack-kube-queue.
On the Cloud-native AI Suite page, find ack-arena and click Deploy in the Actions column. In the Parameters panel, click OK.
On the Cloud-native AI Suite page, find ack-kube-queue and click Deploy in the Actions column. In the panel that appears, click OK.
After ack-arena and ack-kube-queue are installed, Deployed is displayed in the Status column of the Components section.
ACK Lingjun cluster
Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.
On the Marketplace page, enter ack-kube-queue in the search box and click the search icon. After ack-kube-queue is displayed, click its name.
In the upper-right corner of the application details page, click Deploy. In the Basic Information step, set the Cluster, Name, and Release Name parameters. Then, click Next.
In the Parameters step, set Chart Version to the latest version. Then, click OK.
Configure ack-kube-queue
You can use ack-kube-queue to enable queuing for various types of jobs, including TensorFlow jobs, PyTorch jobs, Message Passing Interface (MPI) jobs, Argo workflows, Ray jobs, Spark applications, and Kubernetes-native jobs. By default, ack-kube-queue supports only TensorFlow jobs and PyTorch jobs. You can enable support for other types of jobs based on your business requirements.
Limits
You must use the Operator provided by ack-arena to submit TensorFlow jobs, PyTorch jobs, and MPI jobs to a queue.
If you want to enable queuing for Kubernetes-native jobs, the Kubernetes version of the cluster must be 1.22 or later.
You can submit MPI jobs to a queue only by using Arena.
You can submit only Argo workflows to a queue. You cannot submit steps in Argo workflows to a queue. You can add the following annotation to specify the resources requested by an Argo workflow:
...
  annotations:
    kube-queue/min-resources: |
      cpu: 5
      memory: 5G
...
Enable support for specific types of jobs
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.
Find ack-kube-queue and click Update in the Actions column.
Modify the YAML template based on the following descriptions to enable support for specific types of jobs:
To enable support for Argo workflows, set extension.argo.enable to true.
To enable support for MPI jobs, set extension.mpi.enable to true.
To enable support for Ray jobs, set extension.ray.enable to true.
To enable support for Spark applications, set extension.spark.enable to true.
To enable support for TensorFlow jobs, set extension.tf.enable to true.
To enable support for PyTorch jobs, set extension.pytorch.enable to true.
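Taken together, these switches live under the extension key of the component's YAML template. The following is a minimal sketch of what the edited template might look like; the extension.<type>.enable keys come from the descriptions above, while the surrounding structure is an assumption for illustration:

```yaml
# Hypothetical excerpt of the ack-kube-queue YAML template.
# Only the extension.<type>.enable keys are documented in this topic;
# set each one to true to enable queuing for that job type.
extension:
  tf:
    enable: true      # TensorFlow jobs (supported by default)
  pytorch:
    enable: true      # PyTorch jobs (supported by default)
  mpi:
    enable: true      # MPI jobs
  argo:
    enable: true      # Argo workflows
  ray:
    enable: true      # Ray jobs
  spark:
    enable: true      # Spark applications
```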
Submit jobs
Submit TensorFlow jobs, PyTorch jobs, and MPI jobs to a queue
You must add the scheduling.x-k8s.io/suspend="true" annotation to a job. The following sample code submits a TensorFlow job to a queue:
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "job1"
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
  ...
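The elided spec declares the replica topology of the training job. The following is a fuller sketch, assuming a single-worker TFJob; the image, command, and resource values are illustrative placeholders, not from this topic:

```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "job1"
  annotations:
    scheduling.x-k8s.io/suspend: "true"   # queue the job instead of running it immediately
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          # The TFJob API expects the training container to be named "tensorflow".
          - name: tensorflow
            image: registry.example.com/tf-mnist:latest   # placeholder image
            resources:
              requests:
                cpu: "1"
                memory: 1Gi
```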
Submit Kubernetes-native jobs to a queue
You must set the suspend field of the job to true. In the following example, the job whose CPU request is set to 100m is queued. When the job is dequeued, the suspend field of the job is changed to false, and the job is run by the relevant components in the cluster.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true
  ...
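For reference, a complete hedged version of the suspended Job above, with the 100m CPU request mentioned earlier filled in; the image and command are illustrative placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  suspend: true                  # queued until ack-kube-queue sets this to false
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: pi
        image: perl:5.34         # illustrative image
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
        resources:
          requests:
            cpu: 100m            # the request that the queue accounts for
```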
Submit Argo workflows to a queue
You must first install the Argo Workflows component from the Marketplace page in the ACK console.
Add a custom template named kube-queue-suspend of the suspend type to the workflow. When you submit the workflow, set suspend to true. Example:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: $example-name
spec:
  suspend: true # Set this field to true.
  entrypoint: $example-entrypoint
  templates:
  # Add a template of the suspend type named kube-queue-suspend.
  - name: kube-queue-suspend
    suspend: {}
  - name: $example-entrypoint
    ...
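The following is a fuller sketch of such a workflow with the placeholders filled in; the entrypoint template and its container are illustrative, and only the suspend field and the kube-queue-suspend template are documented in this topic:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: queued-workflow-     # illustrative name
spec:
  suspend: true                      # queued until the workflow is dequeued
  entrypoint: main
  templates:
  # The suspend-type template that ack-kube-queue expects.
  - name: kube-queue-suspend
    suspend: {}
  # Illustrative entrypoint step.
  - name: main
    container:
      image: alpine:3.19
      command: [echo, "hello from the queue"]
```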
Submit Spark applications to a queue
You must first install ack-spark-operator from the Marketplace page in the ACK console.
When you submit a Spark application, add the scheduling.x-k8s.io/suspend="true" annotation to the configurations of the Spark application.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  generateName: spark-pi-suspend-
  namespace: spark-operator
  annotations:
    scheduling.x-k8s.io/suspend: "true"
spec:
  ...
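The elided spec carries the usual Spark operator fields. The following is a hedged sketch modeled on the common SparkPi example; the image tag, jar path, resource values, and service account are assumptions, not from this topic:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  generateName: spark-pi-suspend-
  namespace: spark-operator
  annotations:
    scheduling.x-k8s.io/suspend: "true"     # queue the application
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.0                        # illustrative image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark    # assumed service account name
  executor:
    instances: 1
    cores: 1
    memory: 512m
```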
Submit Ray jobs to a queue
You must first install Kuberay-Operator from the Add-ons page in the ACK console. For more information, see Manage components.
When you submit a Ray job, set spec.suspend to true.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and the controller will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. When suspend transitions to false, a new RayCluster will be created.
  suspend: true
  ...
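The elided spec defines the job entrypoint and the RayCluster to create once the job is dequeued. The following is a hedged sketch, assuming a minimal single-head cluster; the entrypoint, image, and resource values are illustrative placeholders:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  suspend: true                          # the RayCluster is not created until this becomes false
  entrypoint: python -c "import ray; ray.init(); print('queued job ran')"   # illustrative entrypoint
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0  # illustrative image
            resources:
              requests:
                cpu: "1"
                memory: 2Gi
```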