When multiple teams share a cluster for AI, machine learning (ML), and batch workloads, resource contention and underutilization become common problems. Use ElasticQuotaTree, ack-kube-queue, and ack-scheduler together to assign hierarchical resource quotas to organizational units, queue jobs automatically, and reclaim resources when guaranteed minimums cannot be met.
Prerequisites
Before you begin, ensure that you have:
- The ack-koordinator component installed. See Install ack-koordinator.
How it works
Three components work together to manage resource allocation and job scheduling:
- ElasticQuotaTree: Defines hierarchical resource quotas that map to organizational units (teams, departments). Each leaf node in the tree corresponds to one or more namespaces, and jobs submitted to those namespaces are bound by the node's quota.
- ack-kube-queue: Monitors the cluster for new jobs, creates a QueueUnit for each job, and routes it to the correct queue based on the job's namespace. Jobs wait in the queue until their resource request fits within the available quota, then get released to the scheduler.
- ack-scheduler: Selects nodes to run jobs released from the queue.
When a job's minimum resource requirement cannot be satisfied, the scheduling system automatically reclaims resources from quota nodes that are currently using more than their min allocation.
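The reclaim rule above can be illustrated with a small sketch. It is not the scheduler's actual algorithm: the `QuotaUsage` type, the `reclaim_candidates` helper, the usage figures, and the largest-borrower-first ordering are all assumptions made for illustration. The only behavior taken from the text is that resources are reclaimed from quota nodes using more than their min.

```python
from dataclasses import dataclass


@dataclass
class QuotaUsage:
    name: str
    min_cpu: int   # guaranteed floor for this quota node
    used_cpu: int  # current consumption (hypothetical figures)


def reclaim_candidates(quotas, needed_cpu):
    """Return (name, amount) pairs to reclaim until needed_cpu is covered."""
    plan = []
    remaining = needed_cpu
    # Visit the largest borrowers first (an arbitrary choice for this sketch).
    for q in sorted(quotas, key=lambda q: q.used_cpu - q.min_cpu, reverse=True):
        surplus = q.used_cpu - q.min_cpu
        if surplus <= 0 or remaining <= 0:
            continue  # nodes at or below their min are never touched
        take = min(surplus, remaining)
        plan.append((q.name, take))
        remaining -= take
    return plan


quotas = [
    QuotaUsage("devops", min_cpu=20, used_cpu=35),
    QuotaUsage("video", min_cpu=12, used_cpu=10),
    QuotaUsage("test", min_cpu=30, used_cpu=40),
]
plan = reclaim_candidates(quotas, needed_cpu=20)
print(plan)
```

In this sketch the video quota is below its min, so it is skipped, and the 20 CPUs come from the two quotas that are borrowing above their floor.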
Set up resource quotas with ElasticQuotaTree
ElasticQuotaTree uses a tree structure to assign CPU, memory, and GPU quotas to teams. The following example shows an enterprise with three departments, each mapped to dedicated namespaces: devops, algorithm (with text and video sub-teams), and infrastructure (with a test sub-team).
Quota rules
Before creating an ElasticQuotaTree, verify that your configuration satisfies the following constraints:
| Rule | Requirement |
|---|---|
| Namespace placement | Mount namespaces only to leaf nodes. Parent nodes cannot have namespaces. |
| Node min/max | On any node, min must be less than or equal to max. |
| Parent min rule | A parent node's min must be greater than or equal to the sum of its children's min values. |
| Parent max rule | A parent node's max must be greater than or equal to the max of each of its child nodes. |
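As a pre-flight check, the two node-local rows of the table (namespace placement and min/max ordering) can be tested before applying a manifest. The sketch below is a hypothetical helper, not part of any ACK tooling: `check_node` and the dict layout are assumptions, quantities are plain integers rather than Kubernetes quantity strings, and a resource missing from max is treated as unlimited. The parent/child rows could be checked in the same traversal.

```python
def check_node(node, path=""):
    """Collect violations of the node-local rules for `node` and its subtree."""
    here = f"{path}/{node['name']}"
    errors = []
    children = node.get("children", [])
    # Namespace placement: only leaf nodes may list namespaces.
    if children and node.get("namespaces"):
        errors.append(f"{here}: parent node lists namespaces")
    # Node min/max: each resource's min must not exceed its max.
    for res, lo in node.get("min", {}).items():
        hi = node.get("max", {}).get(res)  # None means no ceiling was set
        if hi is not None and lo > hi:
            errors.append(f"{here}: min {res}={lo} exceeds max {hi}")
    for child in children:
        errors.extend(check_node(child, here))
    return errors


# Hypothetical tree with one deliberately broken node.
tree = {
    "name": "root", "min": {"cpu": 100}, "max": {"cpu": 100},
    "children": [
        {"name": "devops", "min": {"cpu": 20}, "max": {"cpu": 40},
         "namespaces": ["devops"]},
        {"name": "broken", "min": {"cpu": 50}, "max": {"cpu": 30},
         "namespaces": ["x"],
         "children": [{"name": "leaf", "namespaces": ["y"]}]},
    ],
}
errors = check_node(tree)
for line in errors:
    print(line)
```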
Parameter semantics:
| Parameter | Default | Behavior |
|---|---|---|
| min | 0 | The guaranteed resource floor. Jobs can still be submitted when min is 0, but the system does not guarantee those resources. If a quota's min cannot be satisfied, the scheduler reclaims resources from quota nodes that are using more than their min. |
| max | NA (unlimited) | The resource ceiling. Jobs cannot use more resources than this value. |
Create the namespaces and ElasticQuotaTree
Apply the following manifest to create the namespaces and define the quota tree. Comments in the YAML show how each node maps to the diagram above.
---
apiVersion: v1
kind: Namespace
metadata:
  name: devops
---
apiVersion: v1
kind: Namespace
metadata:
  name: text1
---
apiVersion: v1
kind: Namespace
metadata:
  name: text2
---
apiVersion: v1
kind: Namespace
metadata:
  name: video
---
apiVersion: v1
kind: Namespace
metadata:
  name: test1
---
apiVersion: v1
kind: Namespace
metadata:
  name: test2
---
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree # Only one ElasticQuotaTree is supported.
  namespace: kube-system # Must be created in the kube-system namespace to take effect.
spec:
  root:
    name: root # Root node: total cluster quota
    min:
      cpu: 100
      memory: 50Gi
      nvidia.com/gpu: 16
    max:
      cpu: 100
      memory: 50Gi
      nvidia.com/gpu: 16
    children:
    - name: devops # Child of root
      min:
        cpu: 20
        memory: 10Gi
        nvidia.com/gpu: 4
      max:
        cpu: 40
        memory: 20Gi
        nvidia.com/gpu: 8
      namespaces:
      - devops
    - name: algorithm # Child of root; parent of text and video
      min:
        cpu: 50
        memory: 25Gi
        nvidia.com/gpu: 10
      max:
        cpu: 80
        memory: 50Gi
        nvidia.com/gpu: 14
      children:
      - name: text # Child of algorithm
        min:
          cpu: 40
          memory: 15Gi
          nvidia.com/gpu: 8
        max:
          cpu: 40
          memory: 30Gi
          nvidia.com/gpu: 10
        namespaces:
        - text1
        - text2
      - name: video # Child of algorithm
        min:
          cpu: 12
          memory: 12Gi
          nvidia.com/gpu: 2
        max:
          cpu: 14
          memory: 14Gi
          nvidia.com/gpu: 4
        namespaces:
        - video
    - name: infrastructure # Child of root; parent of test
      min:
        cpu: 30
        memory: 15Gi
        nvidia.com/gpu: 2
      max:
        cpu: 50
        memory: 30Gi
        nvidia.com/gpu: 4
      children:
      - name: test # Child of infrastructure
        min:
          cpu: 30
          memory: 15Gi
          nvidia.com/gpu: 2
        max:
          cpu: 50
          memory: 30Gi
          nvidia.com/gpu: 4
        namespaces:
        - test1
        - test2
Manage job queues with ack-kube-queue
After you apply the ElasticQuotaTree, ack-kube-queue automatically creates one queue for each leaf node of the quota tree. Jobs submitted to a namespace are assigned to the queue that corresponds to that namespace's quota node.
Queue association
A controller in ack-kube-queue automatically manages the queue resources within the cluster. The controller derives the queues from the ElasticQuotaTree: each association between a quota node and its namespaces defined in the tree is mapped to the corresponding queue.
For example, the video namespace is mounted on the video leaf node, which sits under the algorithm node, a child of root. ack-kube-queue creates a queue named root-algorithm-video for this leaf node. When you submit a RayJob in the video namespace, ack-kube-queue creates a QueueUnit and routes it to the root-algorithm-video queue.
If the total resources requested by the RayJob fit within the available quota of root-algorithm-video, the job is dequeued and passed to ack-scheduler for node assignment.
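The routing and admission steps in this example can be sketched as follows. The text confirms only the name root-algorithm-video; the sketch assumes that every queue name is the path from root to the leaf joined with hyphens, and it takes the available quota as a given input because the text does not specify how ack-kube-queue computes it. `build_queue_index` and `fits` are hypothetical helpers.

```python
def build_queue_index(node, path=()):
    """Map each namespace to the queue name of its leaf quota node."""
    path = path + (node["name"],)
    index = {}
    for ns in node.get("namespaces", []):
        # Assumed naming: node names from root to leaf joined with "-".
        index[ns] = "-".join(path)
    for child in node.get("children", []):
        index.update(build_queue_index(child, path))
    return index


def fits(request, available):
    """True when every requested resource fits in the available quota."""
    return all(amount <= available.get(res, 0) for res, amount in request.items())


tree = {"name": "root", "children": [
    {"name": "devops", "namespaces": ["devops"]},
    {"name": "algorithm", "children": [
        {"name": "text", "namespaces": ["text1", "text2"]},
        {"name": "video", "namespaces": ["video"]},
    ]},
]}
queues = build_queue_index(tree)
print(queues["video"])

# Hypothetical available quota for the video queue and two job requests.
available = {"cpu": 6, "nvidia.com/gpu": 2}
print(fits({"cpu": 4, "nvidia.com/gpu": 1}, available))  # small job
print(fits({"cpu": 8}, available))                       # oversized job
```

A job whose request does not fit stays queued; in this sketch that corresponds to `fits` returning False.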
Queue operation and the suspend handoff
ack-kube-queue controls job execution through the spec.suspend field of the RayJob:
- Submit a RayJob with spec.suspend: true. This prevents the KubeRay operator from creating pods immediately.
- ack-kube-queue detects the job, creates a QueueUnit, and places it in the corresponding queue.
- When the queuing policy allows the job to proceed, ack-kube-queue sets spec.suspend to false.
- The KubeRay operator picks up the change and creates the pods. ack-scheduler then assigns the pods to nodes.
If a job appears stuck in the queue, check whether spec.suspend has been set to false. If it has not, verify that the job's resource request fits within the available quota of its assigned queue.
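The handoff can be illustrated with a trimmed manifest built as a Python dict. This is a sketch under assumptions: the job name, namespace, and entrypoint are placeholders, the `ray.io/v1` apiVersion depends on the installed KubeRay version, and the required rayClusterSpec is left out for brevity. The `queue_state` helper is hypothetical and looks only at spec.suspend, which is the field the steps above describe.

```python
import json


def rayjob_manifest(name, namespace, entrypoint):
    """Build a trimmed RayJob manifest that starts suspended."""
    return {
        "apiVersion": "ray.io/v1",  # assumption: adjust to your KubeRay version
        "kind": "RayJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "suspend": True,  # hold pod creation until the job is dequeued
            "entrypoint": entrypoint,
            # rayClusterSpec (head and worker pod templates) is omitted here;
            # a real RayJob needs it.
        },
    }


def queue_state(job):
    """Describe where a job sits in the handoff, based only on spec.suspend."""
    if job["spec"].get("suspend", False):
        return "queued: waiting for ack-kube-queue to set spec.suspend to false"
    return "released: KubeRay operator may create pods"


job = rayjob_manifest("demo-job", "video", "python train.py")
print(queue_state(job))
job["spec"]["suspend"] = False  # what ack-kube-queue does when it dequeues the job
print(queue_state(job))
print(json.dumps(job, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be completed with a rayClusterSpec and applied directly.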