
Container Service for Kubernetes: Deploy a job scheduling system for multiple tenants in an ACK cluster

Last Updated: Jul 02, 2025

In Container Service for Kubernetes (ACK) clusters, you can use the job management tool Arena, the queue scheduling management system Kube Queue, the quota management tool ElasticQuotaTree, and Managed Service for Prometheus to create an enterprise-level job scheduling system. This topic describes how to build a job scheduling system in an ACK cluster.

Background information

Batch jobs are commonly used in fields such as data processing, simulation computing, scientific computing, and AI. Batch jobs are mainly used to process data or train models. In most cases, batch jobs consume a large amount of computing resources. Therefore, batch jobs must be queued based on the job priority and the resources available to the job submitter to maximize the utilization of resources in a cluster. To schedule batch jobs, you must resolve issues such as job management, quota management, job scheduling, user isolation, log collection, cluster monitoring, and resource supply in your enterprise.

To resolve the preceding issues, you can use ACK to manage large-scale clusters. ACK provides a comprehensive ecosystem to help your enterprise build a large-scale job scheduling system and simplify the process of building an enterprise-level cluster management system.

User roles

In the job scheduling system of an enterprise, the following two roles are required: developers who submit jobs and O&M engineers who maintain the job scheduling system. Developers and O&M engineers have different requirements for the job scheduling system.

  • Developer: Developers focus on their business fields and may not be familiar with the Kubernetes system on which the job scheduling system runs. They only submit jobs in the job scheduling system. Therefore, they require the job scheduling system to simplify the process of submitting jobs and to let them view the run logs of submitted jobs so that they can fix bugs.

  • O&M engineer: O&M engineers are familiar with the Kubernetes system but not with specific business fields. They maintain the job scheduling system and make sure that it runs efficiently and reliably. Therefore, they require the job scheduling system to provide comprehensive monitoring of the Kubernetes clusters.

In real scenarios, developers may belong to different departments of an enterprise. Different departments define job priorities based on different rules, and the job priorities defined by different departments are independent of each other. Therefore, the jobs submitted by different departments must be separately queued.

In addition, the resources that are available within an enterprise are limited. To allocate more resources to more important projects, the resource quotas allocated to different departments vary in most cases. Guaranteed resources are allocated to the departments to which more important projects belong. If the resources allocated to these departments are idle, other departments are allowed to temporarily use the guaranteed resources to run their jobs. If the overall resources in the cluster are insufficient, the departments that own the guaranteed resources have the right to evict the jobs of the departments that temporarily use the resources.

How it works

The following figure shows an example. In this example, an enterprise contains two departments: Department-a and Department-b. Department-a has two employees named User-a and User-b, and Department-b has one employee named User-c. The enterprise can build a job scheduling system based on the following process:

  1. O&M engineers create an ACK cluster and install the Kube Queue, Managed Service for Prometheus, and Arena components in the cluster.

    • Kube Queue can help the enterprise effectively manage a large number of concurrent jobs and ensure that resources are properly allocated. Kube Queue automatically generates job queues for resource quotas by listening to the ElasticQuotaTree. Jobs that share the same resource quota are distributed to the same queue.

    • Managed Service for Prometheus provides real-time monitoring capabilities. This allows O&M engineers to quickly identify and diagnose potential issues in the cluster, ensures the stability of the job scheduling system, and maximizes resource utilization.

    • Arena can simplify the processes of applying for GPU resources, submitting jobs, and monitoring the cluster.

  2. O&M engineers configure an ElasticQuotaTree and a ResourcePolicy.

    • The ElasticQuotaTree is a mechanism that allows O&M engineers to flexibly manage resource quotas. After O&M engineers configure proper resource quotas for Department-a and Department-b, the departments can share resources in the cluster without affecting each other. This way, the departments do not need to compete for resources and resource utilization is maximized.

    • The ResourcePolicy defines the rules and priorities based on which resources are used. O&M engineers can configure the ResourcePolicy to schedule pods to elastic container instances that are deployed as virtual nodes when Elastic Compute Service (ECS) resources are exhausted. This balances the overall resource usage and ensures that resources are efficiently used.

  3. Developers submit jobs on the job submission platform of the enterprise or by using AI Developer Console provided by the cloud-native AI suite. For reference, a sample Arena command is shown after the following figure.

(Figure: workflow for building the job scheduling system for Department-a and Department-b)
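The following commands show how a developer might submit and inspect a training job with Arena after the system is built. This is a minimal sketch: the job name, image, GPU count, and training command are placeholders that you must replace with the values of your own workload.

    # Submit a TensorFlow training job in the user-a namespace (placeholder values).
    arena submit tfjob \
      --name=mnist-demo \
      --namespace=user-a \
      --gpus=1 \
      --image=tensorflow/tensorflow:2.13.0-gpu \
      "python /app/train.py"

    # Check the status of submitted jobs and view the run logs for debugging.
    arena list --namespace=user-a
    arena logs mnist-demo --namespace=user-a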

Step 1: Build an environment

  1. Create an ACK cluster that runs Kubernetes 1.20 or later. For more information, see Create an ACK managed cluster.

    For more information about the GPU-accelerated node types supported by ACK clusters, see GPU instance types supported by ACK.

  2. Deploy the cloud-native AI suite for the ACK cluster and install the Kube Queue and Arena components when you deploy the cloud-native AI suite. For more information, see Deploy the cloud-native AI suite.

    Kube Queue can manage TensorFlow jobs, PyTorch jobs, and MPI jobs. If you want to run Spark applications, Argo workflows, or Ray jobs, install the ack-spark-operator, ack-workflow, or ack-kuberay component on the Marketplace page of the ACK console. These components deploy the operators for the corresponding job types, and the operators run the jobs of those types that are submitted in the cluster.

    For more information about how to install components on the Marketplace page, see App Marketplace. For more information about Kube Queue, see Use ack-kube-queue to manage AI and machine learning workloads.

  3. Install the ack-virtual-node component for the ACK cluster. For more information, see the Step 1: Deploy the ack-virtual-node component section of the "Schedule pods to elastic container instances that are deployed as virtual nodes" topic.

    After the ack-virtual-node component is installed for the ACK cluster, elastic container instances can be deployed as virtual nodes in the ACK cluster. If pods need to be scheduled to an elastic container instance, the component creates an elastic container instance and associates the elastic container instance with the pods. When the pods are deleted, the component automatically deletes the elastic container instance.

    By default, the scheduler does not schedule pods to virtual nodes. In this case, no elastic container instance is created. If you want to quickly scale out resources and save costs, you can configure a ResourcePolicy to schedule pods to elastic container instances that are deployed as virtual nodes when ECS resources are exhausted. You can also configure the ResourcePolicy to limit the number of pods that can be scheduled to elastic container instances.

    For more information about how to use a ResourcePolicy, see Configure priority-based resource scheduling. A sample ResourcePolicy is shown after the last step in this section.

  4. Optional. Enable Managed Service for Prometheus for the ACK cluster and install the log component based on your business requirements. For more information, see the Enable Managed Service for Prometheus section of the "Managed Service for Prometheus" topic.
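
The following YAML shows what the ResourcePolicy described in the preceding step 3 might look like. This is a minimal sketch: the policy name, namespace, and label selector are assumptions for illustration. Verify the fields against Configure priority-based resource scheduling before you apply the ResourcePolicy.

    apiVersion: scheduling.alibabacloud.com/v1alpha1
    kind: ResourcePolicy
    metadata:
      name: job-resource-policy # A placeholder name.
      namespace: user-a # The ResourcePolicy applies to pods in this namespace.
    spec:
      selector:
        job-type: batch # A placeholder label. The ResourcePolicy matches pods that have this label.
      strategy: prefer # Try the units in order and fall back to the next unit when resources are insufficient.
      units:
      - resource: ecs # Schedule pods to ECS nodes first.
      - resource: eci # Fall back to elastic container instances that are deployed as virtual nodes when ECS resources are exhausted.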

Step 2: Configure resource quotas for the cluster

ElasticQuotaTree overview

The following figure shows an ElasticQuotaTree that uses the tree structure to display the allocation and management of enterprise resources at different levels, such as department and user. Each user is assigned an independent namespace. Namespaces are associated with child nodes that indicate resource quotas in the tree structure. This way, resources are isolated among users and can be shared if users need to use more resources.

(Figure: ElasticQuotaTree with department-level and user-level resource quotas)

The root node in the ElasticQuotaTree represents the total resource quota of the enterprise, and each child node represents the resource quota of a department such as Department-a or Department-b. If a child node is associated with the namespaces of multiple users, these users share the resources of the department. Each user in the ElasticQuotaTree is assigned an independent namespace, and each child node is associated with the namespaces of one or more users.

Procedure

  1. Create three namespaces named user-a, user-b, and user-c.

    For more information, see Manage namespaces and resource quotas.

  2. Configure an ElasticQuotaTree. The following sample code provides a sample YAML file:

    In this example, the ElasticQuotaTree is used to configure resource quotas for two departments in an enterprise. The ElasticQuotaTree allows departments to dynamically share and adjust their resources. This ensures that resource quotas meet the basic requirements of departments when resources are insufficient and also maximizes the utilization of resources in the cluster when resources are sufficient.

    View the YAML file

    # The ElasticQuotaTree configures 100 CPU cores, 100 GiB of memory, and four GPUs for the enterprise. 
    # The max and min parameters are specified for each node. The max parameter specifies the maximum amount of resources that are available to the node, and the min parameter specifies the minimum amount of resources that are available to the node. A department can dynamically adjust and use resources in the range of the minimum resource quota to the maximum resource quota. 
    apiVersion: scheduling.sigs.k8s.io/v1beta1
    kind: ElasticQuotaTree
    metadata:
      name: elasticquotatree
      namespace: kube-system # The ElasticQuotaTree takes effect only in the kube-system namespace of the cluster. 
    spec:
      root:
        name: root # The root node specifies the total resource quota. The value of the max parameter of the root node must be equal to the value of the min parameter of the root node, and the min value of the root node cannot be smaller than the sum of the min values of its child nodes. 
        max:
          cpu: 100
          memory: 100Gi
          nvidia.com/gpu: 4
        min:
          cpu: 100
          memory: 100Gi
          nvidia.com/gpu: 4 
        children:
        # Configure two child nodes named Department-a and Department-b. 
        # Associate Department-a with the user-a and user-b namespaces, and associate Department-b with the user-c namespace. The number of pods that can be created in each namespace is subject to the resource quota of the department with which the namespace is associated. 
          - name: Department-a
            max:
              cpu: 100
              memory: 100Gi
              nvidia.com/gpu: 4
            min:
              cpu: 60
              memory: 60Gi
              nvidia.com/gpu: 3
            namespaces: # The namespaces to be associated with Department-a. 
              - user-a
              - user-b
          - name: Department-b
            max:
              cpu: 100
              memory: 100Gi
              nvidia.com/gpu: 4
            min:
              cpu: 40
              memory: 40Gi
              nvidia.com/gpu: 1
            namespaces: # The namespace to be associated with Department-b. 
              - user-c
    • When resources are sufficient in the cluster, each department can use resources up to its maximum resource quota to handle its workloads.

    • When resources are insufficient in the cluster, a department whose resource usage is below its minimum resource quota has the right to preempt resources that the other department uses in excess of that other department's minimum resource quota. For example, if Department-a uses all 100 CPU cores while Department-b is idle and Department-b then submits jobs, Department-b can reclaim up to 40 CPU cores because Department-a is guaranteed only its minimum of 60 CPU cores.

  3. After you configure the ElasticQuotaTree, run the following command. The kube-queue-controller component automatically creates a job queue for each child node in the kube-queue namespace of the cluster based on the ElasticQuotaTree.

    kubectl get queue -n kube-queue

    Expected output:

    NAME                       AGE
    root-department-a-user-a   58s
    root-department-b-user-c   58s
  4. Specify the oversell ratio for job queues.

    To distribute jobs based on the resource quota, you must set the oversellrate parameter to 1 in the startup parameters of the kube-queue-controller component. You can set the parameter in the console as described in the following sub-steps, or with kubectl as shown in the sketch after these sub-steps.

    1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

    2. On the Clusters page, find the cluster you want to manage and click its name. In the left-side pane, choose Workloads > Deployments.

    3. On the Deployments page, select the kube-queue namespace from the Namespace drop-down list. Find the kube-queue-controller component and click Edit in the Actions column. On the Edit page, set the oversellrate parameter to 1.


    4. Click Update below the Overview section of the Edit page. In the message that appears, click Confirm to save the settings.
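
    If you prefer the command line, the following sketch makes the same change with kubectl. It assumes that the component runs as the kube-queue-controller Deployment in the kube-queue namespace, as described in the preceding sub-steps. Check the existing container arguments of the Deployment before you modify them.

      # Open the kube-queue-controller Deployment for editing and set the oversellrate
      # startup parameter to 1 in the container args, for example:
      #   args:
      #     - --oversellrate=1
      kubectl -n kube-queue edit deployment kube-queue-controller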

Step 3: Submit jobs and analyze the results

Scenario 1: Run jobs that share the same resource quota in round robin mode

  1. Use the following sample YAML file to configure a job:

    View the YAML file

    apiVersion: batch/v1
    kind: Job
    metadata:
      generateName: pi-
      namespace: user-a # The job belongs to the user-a namespace. 
    spec:
      suspend: true # The job is created in the suspended state and does not start automatically. Kube Queue resumes the job when the job is scheduled to run. 
      completions: 6 # The job runs a total of six pods. 
      parallelism: 6 # The job can run a maximum of six pods in parallel. 
      template:
        spec:
          schedulerName: default-scheduler # The default scheduler is used to schedule pods. 
          containers:
          - name: pi
            image: perl:5.34.0
            command: ["sleep",  "1d"]
            resources:
              requests:
                cpu: 10
                memory: 10Gi
              limits:
                cpu: 10
                memory: 10Gi
          restartPolicy: Never
  2. After the job is configured, run the following command twice to submit the job twice. The submitted jobs share the same resource quota.

    kubectl create -f pi.yaml # pi.yaml is the name of the file that contains the preceding job configuration. You can use a different file name.

    Expected output:

    job.batch/pi-8k5pn created # The output generated when the job is submitted for the first time. 
    job.batch/pi-2xtcj created # The output generated when the job is submitted for the second time.

    The expected output indicates that both jobs are submitted.

  3. Run the following command to query the status of the jobs:

    kubectl get pod -n user-a

    Expected output:

    NAME             READY   STATUS    RESTARTS   AGE
    pi-8k5pn-8dn4k   1/1     Running   0          25s
    pi-8k5pn-lkdn5   1/1     Running   0          25s
    pi-8k5pn-s9cvm   1/1     Running   0          25s
    pi-8k5pn-tvw6c   1/1     Running   0          25s
    pi-8k5pn-wh9zv   1/1     Running   0          25s
    pi-8k5pn-zsdqs   1/1     Running   0          25s

    The expected output shows that only the six pods of the first job are running, not 12 pods. This indicates that only the first job is running and that the second job remains queued, because the first job uses all the resources of the shared resource quota.

  4. Run the following command to query the queued jobs in the user-a namespace:

    kubectl get queue -n kube-queue root-department-a-user-a  -o yaml

    View expected output

    apiVersion: scheduling.x-k8s.io/v1alpha1
    kind: Queue
    metadata:
      annotations:
        kube-queue/queue-args: |
          {}
        kube-queue/quota-fullname: root/Department-a
      creationTimestamp: "2024-04-08T02:04:17Z"
      generation: 2
      labels:
        create-by-kubequeue: "true"
      name: root-department-a-user-a
      namespace: kube-queue
      resourceVersion: "16544808"
      uid: bb1c4edf-f***-4***-a50a-45****62ce
    spec:
      queuePolicy: Round
    status:
      queueItemDetails:
        active:
        - name: pi-2xtcj
          namespace: user-a
          position: 1
        backoff: []
    

    The expected output indicates that the job queue associated with the resource quota of Department-a uses the Round (round robin) policy and that one job (pi-2xtcj) is waiting in the queue.

  5. Run the following command to query the status of the ElasticQuotaTree and view the usage of each resource quota in the cluster:

    kubectl -n kube-system get eqtree elasticquotatree -o json | jq '.status'

    The output shows that all the resources of the root node are allocated to the user-a namespace. However, no resources are allocated to the user-b namespace that shares the same resource quota.

Scenario 2: Run jobs that share the same resource quota by priority

This scenario is configured based on Scenario 1.

By default, the ack-kube-queue component runs jobs in round robin mode. If you want the jobs in the cluster to be run by priority, you can set the StrictPriority environment variable to true for the kube-queue-controller component. In this case, the queue policy of all job queues changes to Block.
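
For example, you can set the environment variable with kubectl. The following sketch assumes that the component runs as the kube-queue-controller Deployment in the kube-queue namespace:

    # Set the StrictPriority environment variable on the kube-queue-controller Deployment.
    # The Deployment automatically restarts its pods after the environment variable is updated.
    kubectl -n kube-queue set env deployment/kube-queue-controller StrictPriority=true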

  1. Enable blocking queues for jobs. For more information, see Use ack-kube-queue to manage AI and machine learning workloads.

  2. Submit a job whose priority is higher than that of the job submitted in Scenario 1. Use the following sample YAML file to submit the job:

    View the sample YAML file

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      annotations:
        meta.helm.sh/release-name: ack-kube-queue
        meta.helm.sh/release-namespace: kube-queue
      labels:
        app.kubernetes.io/managed-by: Helm
      name: priority-2
    value: 2
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      generateName: pi-
      namespace: user-a
    spec:
      suspend: true
      completions: 6
      parallelism: 6
      template:
        spec:
          schedulerName: default-scheduler
          priorityClassName: priority-2
          priority: 2
          containers:
          - name: pi
            image: perl:5.34.0
            command: ["sleep",  "1d"]
            resources:
              requests:
                cpu: 10
                memory: 10Gi
              limits:
                cpu: 10
                memory: 10Gi
          restartPolicy: Never
  3. Run the following command to query the status of the job queue:

    kubectl get queue -n kube-queue root-department-a-user-a  -o yaml

    View expected output

    apiVersion: scheduling.x-k8s.io/v1alpha1
    kind: Queue
    metadata:
      annotations:
        kube-queue/queue-args: |
          {}
        kube-queue/quota-fullname: root/Department-a
      creationTimestamp: "2024-04-08T02:04:17Z"
      generation: 2
      labels:
        create-by-kubequeue: "true"
      name: root-department-a-user-a
      namespace: kube-queue
      resourceVersion: "16549536"
      uid: bb1c4edf-f***-4***-a***-45d1******
    spec:
      queuePolicy: Block
    status:
      queueItemDetails:
        active:
        - name: pi-6nsc7
          namespace: user-a
          position: 1
          priority: 2
        - name: pi-2xtcj
          namespace: user-a
          position: 2
        backoff: []

    The expected output indicates that the high-priority job is placed at the head of the queue, ahead of the job that was submitted in Scenario 1.

Scenario 3: Run jobs that use different resource quotas by priority

This scenario is configured based on Scenario 2.

  1. In the user-c namespace, submit a job whose priority is higher than that of the job submitted in Scenario 2. At this point, all the resources in the cluster are already allocated to the jobs that were submitted in the previous scenarios.

    View the sample YAML file

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      annotations:
        meta.helm.sh/release-name: ack-kube-queue
        meta.helm.sh/release-namespace: kube-queue
      labels:
        app.kubernetes.io/managed-by: Helm
      name: priority-3
    value: 3
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      generateName: pi-
      namespace: user-c
    spec:
      suspend: true
      completions: 1
      parallelism: 1
      template:
        spec:
          schedulerName: default-scheduler
          priorityClassName: priority-3
          priority: 3
          containers:
          - name: pi
            image: perl:5.34.0
            command: ["sleep",  "1d"]
            resources:
              requests:
                cpu: 10
                memory: 10Gi
              limits:
                cpu: 10
                memory: 10Gi
          restartPolicy: Never
  2. Run the following command to query the resource usage in the user-a namespace:

    kubectl get pod -o wide -n user-a

    Expected output:

    NAME             READY   STATUS    RESTARTS   AGE   IP              NODE                        NOMINATED NODE   READINESS GATES
    pi-6nsc7-psz9k   1/1     Running   0          45s   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>
    pi-6nsc7-qtkcc   1/1     Running   0          45s   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>
    pi-6nsc7-rvklt   1/1     Running   0          45s   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>
    pi-6nsc7-z4hhc   1/1     Running   0          45s   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>
    pi-8k5pn-lkdn5   1/1     Running   0          26m   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>
    pi-8k5pn-ql4fx   1/1     Running   0          25s   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>
    pi-8k5pn-tcdlb   1/1     Running   0          25s   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>
    pi-8k5pn-tvw6c   1/1     Running   0          26m   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>
    pi-8k5pn-wh9zv   1/1     Running   0          26m   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>
    pi-8k5pn-zsdqs   1/1     Running   0          26m   192.XX.XX.XX    cn-hangzhou.192.XX.XX.XX   <none>           <none>

    The expected output indicates that the resources used by the department of User-c do not reach the preset minimum resource quota. Therefore, the system allows the job submitted by User-c to preempt the resources allocated to other jobs to meet the basic requirements for resources.
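
    To view the result from the perspective of User-c, you can also list the pods in the user-c namespace. The output is omitted here because it depends on the state of your cluster.

      kubectl get pod -o wide -n user-c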

References

For more information about job queues, see Use ack-kube-queue to manage AI and machine learning workloads.