Optimize Cluster Quotas via ElasticQuotaTree and ack-kube-queue - Container Service for Kubernetes

When multiple teams share a cluster for AI, machine learning (ML), and batch workloads, resource contention and underutilization become common problems. Use ElasticQuotaTree, ack-kube-queue, and ack-scheduler together to assign hierarchical resource quotas to organizational units, queue jobs automatically, and reclaim resources when guaranteed minimums cannot be met.

Prerequisites

Before you begin, ensure that you have:

The ack-koordinator component installed. See Install ack-koordinator

How it works

Three components work together to manage resource allocation and job scheduling:

ElasticQuotaTree: Defines hierarchical resource quotas that map to organizational units (teams, departments). Each leaf node in the tree corresponds to one or more namespaces, and jobs submitted to those namespaces are bound by the node's quota.
ack-kube-queue: Monitors the cluster for new jobs, creates a QueueUnit for each job, and routes it to the correct queue based on the job's namespace. Jobs wait in the queue until their resource request fits within the available quota, then get released to the scheduler.
ack-scheduler: Selects nodes to run jobs released from the queue.

When a quota node's guaranteed minimum (`min`) cannot be satisfied, the scheduling system automatically reclaims resources from quota nodes that are currently using more than their own `min` allocation.
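To make the reclaim rule concrete, the following minimal sketch lists the quota nodes that are using more than their `min`, and by how much. The function name, the flat dictionaries, and the usage figures are illustrative assumptions. How the scheduler actually picks which pods to reclaim is not described here.

```python
def over_min(usage, guaranteed):
    """Return {quota_name: amount used above min} for quotas that exceed their min."""
    return {
        name: used - guaranteed.get(name, 0)  # a missing min is treated as 0
        for name, used in usage.items()
        if used > guaranteed.get(name, 0)
    }


# Hypothetical CPU figures for three quota nodes.
cpu_min = {"devops": 20, "text": 40, "video": 12}
cpu_used = {"devops": 35, "text": 40, "video": 5}

print(over_min(cpu_used, cpu_min))  # {'devops': 15}
```

In this made-up snapshot only devops is a reclaim candidate: text is exactly at its `min` and video is below it.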

Set up resource quotas with ElasticQuotaTree

ElasticQuotaTree uses a tree structure to assign CPU, memory, and GPU quotas to teams. The following example shows an enterprise with three departments—devops, algorithm (with text and video sub-teams), and infrastructure (with a test sub-team)—each mapped to dedicated namespaces.

Quota rules

Before creating an ElasticQuotaTree, verify that your configuration satisfies the following constraints:

Rule	Requirement
Namespace placement	Mount namespaces only to leaf nodes. Parent nodes cannot have namespaces.
Node min/max	On any node, `min` must be less than or equal to `max`.
Parent min rule	A parent node's `min` must be less than or equal to the sum of its children's `min` values.
Parent max rule	A parent node's `max` must be greater than or equal to the `max` of each of its child nodes.

Parameter semantics:

Parameter	Default	Behavior
`min`	`0`	The guaranteed resource floor. Jobs can still be submitted when `min` is 0, but the system does not guarantee those resources. If a quota's `min` cannot be satisfied, the scheduler reclaims resources from nodes that are using more than their `min`.
`max`	`NA` (unlimited)	The resource ceiling. Jobs cannot use more resources than this value.
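A quota tree can be checked against these constraints before it is applied. The sketch below is one possible pre-flight check, not part of any ACK tooling: `to_number` and `validate` are hypothetical helpers, the quantity parser only handles plain numbers and binary suffixes such as `Gi`, and the parent max rule is read as each child's `max` staying within its parent's `max`, which is what the example manifest in this topic does. The sample data is the algorithm branch of that manifest.

```python
import math

SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_number(value):
    """Convert 100 or "50Gi" to a float. Simplified: binary suffixes only."""
    if isinstance(value, (int, float)):
        return float(value)
    for suffix, factor in SUFFIXES.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)


def amounts(node, key):
    """Read the node's min or max block as {resource: number}."""
    return {res: to_number(v) for res, v in node.get(key, {}).items()}


def validate(node, path=""):
    """Return a list of rule violations for this node and its descendants."""
    path = f"{path}/{node['name']}"
    errors = []
    children = node.get("children", [])
    lo, hi = amounts(node, "min"), amounts(node, "max")

    # Namespace placement: only leaf nodes may carry namespaces.
    if children and node.get("namespaces"):
        errors.append(f"{path}: namespaces set on a non-leaf node")
    # Node min/max: an unset max is treated as unlimited.
    for res, value in lo.items():
        if value > hi.get(res, math.inf):
            errors.append(f"{path}: min {res} exceeds max")
    if children:
        # Parent min rule: an unset child min counts as 0.
        for res, value in lo.items():
            child_sum = sum(amounts(c, "min").get(res, 0.0) for c in children)
            if value > child_sum:
                errors.append(f"{path}: min {res} exceeds sum of children's min")
        # Parent max rule: only compared where the child sets a max.
        for child in children:
            for res, value in amounts(child, "max").items():
                if value > hi.get(res, math.inf):
                    errors.append(
                        f"{path}: child {child['name']} max {res} exceeds parent max"
                    )
    for child in children:
        errors.extend(validate(child, path))
    return errors


algorithm = {
    "name": "algorithm",
    "min": {"cpu": 50, "memory": "25Gi", "nvidia.com/gpu": 10},
    "max": {"cpu": 80, "memory": "50Gi", "nvidia.com/gpu": 14},
    "children": [
        {"name": "text", "namespaces": ["text1", "text2"],
         "min": {"cpu": 40, "memory": "15Gi", "nvidia.com/gpu": 8},
         "max": {"cpu": 40, "memory": "30Gi", "nvidia.com/gpu": 10}},
        {"name": "video", "namespaces": ["video"],
         "min": {"cpu": 12, "memory": "12Gi", "nvidia.com/gpu": 2},
         "max": {"cpu": 14, "memory": "14Gi", "nvidia.com/gpu": 4}},
    ],
}

print(validate(algorithm))  # []
```

An empty list means no violation was found. Raising the video node's CPU `max` above 80, for instance, would produce a parent max violation.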

Create the namespaces and ElasticQuotaTree

Apply the following manifest to create the namespaces and define the quota tree. Comments in the YAML show how each node maps to the organizational structure described above.

---
apiVersion: v1
kind: Namespace
metadata:
  name: devops
---
apiVersion: v1
kind: Namespace
metadata:
  name: text1
---
apiVersion: v1
kind: Namespace
metadata:
  name: text2
---
apiVersion: v1
kind: Namespace
metadata:
  name: video
---
apiVersion: v1
kind: Namespace
metadata:
  name: test1
---
apiVersion: v1
kind: Namespace
metadata:
  name: test2
---
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree  # Only one ElasticQuotaTree is supported.
  namespace: kube-system  # Must be created in the kube-system namespace to take effect.
spec:
  root:
    name: root  # Root node: total cluster quota
    min:
      cpu: 100
      memory: 50Gi
      nvidia.com/gpu: 16
    max:
      cpu: 100
      memory: 50Gi
      nvidia.com/gpu: 16
    children:
    - name: devops  # Child of root
      min:
        cpu: 20
        memory: 10Gi
        nvidia.com/gpu: 4
      max:
        cpu: 40
        memory: 20Gi
        nvidia.com/gpu: 8
      namespaces:
      - devops
    - name: algorithm  # Child of root; parent of text and video
      min:
        cpu: 50
        memory: 25Gi
        nvidia.com/gpu: 10
      max:
        cpu: 80
        memory: 50Gi
        nvidia.com/gpu: 14
      children:
      - name: text  # Child of algorithm
        min:
          cpu: 40
          memory: 15Gi
          nvidia.com/gpu: 8
        max:
          cpu: 40
          memory: 30Gi
          nvidia.com/gpu: 10
        namespaces:
        - text1
        - text2
      - name: video  # Child of algorithm
        min:
          cpu: 12
          memory: 12Gi
          nvidia.com/gpu: 2
        max:
          cpu: 14
          memory: 14Gi
          nvidia.com/gpu: 4
        namespaces:
        - video
    - name: infrastructure  # Child of root; parent of test
      min:
        cpu: 30
        memory: 15Gi
        nvidia.com/gpu: 2
      max:
        cpu: 50
        memory: 30Gi
        nvidia.com/gpu: 4
      children:
      - name: test  # Child of infrastructure
        min:
          cpu: 30
          memory: 15Gi
          nvidia.com/gpu: 2
        max:
          cpu: 50
          memory: 30Gi
          nvidia.com/gpu: 4
        namespaces:
        - test1
        - test2

Manage job queues with ack-kube-queue

After you apply the ElasticQuotaTree, ack-kube-queue automatically creates a queue for each leaf node of the tree, so each leaf quota maps to exactly one queue. Jobs submitted to a namespace are assigned to the queue that corresponds to that namespace's quota node.

Queue association

A controller in ack-kube-queue manages the queue resources in the cluster automatically. It keeps the queues in sync with the ElasticQuotaTree, mapping the quota-to-namespace associations defined there to the corresponding queues.

For example, the video namespace is mounted on the video leaf node, which sits under the algorithm node, itself a child of root. ack-kube-queue creates a queue named root-algorithm-video for this leaf node. When you submit a RayJob in the video namespace, ack-kube-queue creates a QueueUnit and routes it to the root-algorithm-video queue.
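The namespace-to-queue mapping can be sketched as a walk over the tree. This example assumes that queue names are always the node names on the path from root joined with hyphens. That pattern is inferred from the single root-algorithm-video example, and `queue_names` is a hypothetical helper rather than an ack-kube-queue API.

```python
def queue_names(node, prefix=""):
    """Map each namespace to a queue name built from the path of quota node names."""
    name = f"{prefix}-{node['name']}" if prefix else node["name"]
    mapping = {ns: name for ns in node.get("namespaces", [])}
    for child in node.get("children", []):
        mapping.update(queue_names(child, name))
    return mapping


# Structure of the example ElasticQuotaTree, without the min/max values.
tree = {
    "name": "root",
    "children": [
        {"name": "devops", "namespaces": ["devops"]},
        {"name": "algorithm", "children": [
            {"name": "text", "namespaces": ["text1", "text2"]},
            {"name": "video", "namespaces": ["video"]},
        ]},
        {"name": "infrastructure", "children": [
            {"name": "test", "namespaces": ["test1", "test2"]},
        ]},
    ],
}

print(queue_names(tree)["video"])  # root-algorithm-video
```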

If the total resources requested by the RayJob fit within the available quota of root-algorithm-video, the job is dequeued and passed to ack-scheduler for node assignment.
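As a rough illustration of the fit check, the sketch below treats the available quota as `max` minus current usage. That is a simplifying assumption: the real admission logic may also account for `min`, borrowed resources, and ancestor quotas. The `fits` helper and the usage figures are hypothetical. The `max` values are those of the video node.

```python
def fits(request, quota_max, used):
    """True if every requested resource fits in the remaining quota (max - used)."""
    return all(
        used.get(res, 0) + amount <= quota_max.get(res, float("inf"))
        for res, amount in request.items()
    )


video_max = {"cpu": 14, "nvidia.com/gpu": 4}
video_used = {"cpu": 8, "nvidia.com/gpu": 2}  # made-up current usage

print(fits({"cpu": 4, "nvidia.com/gpu": 2}, video_max, video_used))  # True
print(fits({"cpu": 8}, video_max, video_used))  # False
```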

Queue operation and the suspend handoff

ack-kube-queue controls job execution through the spec.suspend field of the RayJob:

Submit a RayJob with spec.suspend: true. This prevents the KubeRay operator from creating pods immediately.
ack-kube-queue detects the job, creates a QueueUnit, and places it in the corresponding queue.
When the queuing policy allows the job to proceed, ack-kube-queue sets spec.suspend to false.
The KubeRay operator picks up the change and creates the pods. ack-scheduler then assigns the pods to nodes.
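The first step of the handoff can be sketched by building a suspended RayJob manifest in Python and printing it as JSON. Everything specific here is an assumption for illustration: the `suspended_rayjob` helper, the job name, the entrypoint, the image tag, the resource requests, and the `ray.io/v1` API version, which depends on the KubeRay release installed in the cluster.

```python
import json


def suspended_rayjob(name, namespace, entrypoint, image):
    """Build a minimal RayJob manifest that starts suspended so it can be queued."""
    return {
        "apiVersion": "ray.io/v1",  # assumption: older KubeRay releases use v1alpha1
        "kind": "RayJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "suspend": True,  # no pods are created until this becomes false
            "entrypoint": entrypoint,
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": "ray-head",
                                    "image": image,
                                    "resources": {
                                        "requests": {"cpu": "1", "memory": "2Gi"}
                                    },
                                }
                            ]
                        }
                    },
                }
            },
        },
    }


job = suspended_rayjob(
    name="demo-job",                       # hypothetical job name
    namespace="video",                     # namespace from the example tree
    entrypoint="python -c 'print(1)'",     # placeholder entrypoint
    image="rayproject/ray:2.9.0",          # example image tag
)
print(json.dumps(job, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be saved to a file and applied like any other manifest.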

If a job appears stuck in the queue, check whether spec.suspend has been set to false. If it has not, verify that the job's resource request fits within the available quota of its assigned queue.
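The first check can be scripted. In the sketch below, `is_suspended` is a plain helper that inspects a RayJob object, and `fetch_rayjob` reads the object through the official Kubernetes Python client. The helper names are hypothetical, and the `ray.io/v1` version and the `rayjobs` plural are assumptions about the KubeRay CRD installed in your cluster.

```python
def is_suspended(rayjob):
    """Return True if the RayJob object (a dict) still has spec.suspend set to true."""
    return bool(rayjob.get("spec", {}).get("suspend", False))


def fetch_rayjob(name, namespace):
    """Read a RayJob from the cluster. Needs the `kubernetes` package and a kubeconfig."""
    # Imported here so that is_suspended stays usable without the client installed.
    from kubernetes import client, config

    config.load_kube_config()
    return client.CustomObjectsApi().get_namespaced_custom_object(
        group="ray.io",
        version="v1",  # assumption: matches the installed KubeRay CRD version
        namespace=namespace,
        plural="rayjobs",
        name=name,
    )


# Offline example with a hand-written object.
print(is_suspended({"spec": {"suspend": True}}))  # True
```

Against a live cluster, `is_suspended(fetch_rayjob("demo-job", "video"))` returning `True` would indicate that the hypothetical demo-job is still held in its queue.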