The static resource allocation mechanism of the native Kubernetes ResourceQuota can lead to low resource utilization in clusters. To resolve this issue, Container Service for Kubernetes (ACK) provides the capacity scheduling feature based on the scheduling framework extension mechanism. This feature uses elastic quota groups to support resource sharing while ensuring resource quotas for users, which effectively improves cluster resource utilization.
Prerequisites
An ACK managed Pro cluster that runs Kubernetes 1.20 or later is created. For more information, see Create an ACK managed cluster.
Key features
In a multi-user cluster environment, administrators allocate fixed resources to ensure sufficient resources for different users. The traditional mode uses the native Kubernetes ResourceQuota for static resource allocation. However, due to differences in time and patterns of resource usage among users, some users may experience resource constraints while others have idle quotas. This results in lower overall resource utilization.
To resolve this issue, ACK provides the capacity scheduling feature based on the scheduling framework extension mechanism. This feature uses resource sharing to improve overall resource utilization while still guaranteeing the resource quotas of users. Capacity scheduling provides the following features.
Support for defining resource quotas at different levels: Configure multiple levels of elastic quotas according to business needs, such as company organization charts. The leaf nodes of an elastic quota group can correspond to multiple namespaces, but each namespace can only belong to one leaf node.
Support for resource borrowing and reclaiming between different elastic quotas.
Min: Defines the guaranteed amount of resources that can be used. To ensure that this guarantee can be met even when cluster resources are insufficient, the total Min of all users must not exceed the total amount of resources in the cluster.
Max: Defines the maximum amount of resources that can be used.
Workloads can borrow idle resource quotas from other users, but the total amount of resources that can be used after borrowing still does not exceed the Max value. Unused Min resource quotas can be borrowed, but can be reclaimed when the original user needs to use them.
Support for configuring various resources: In addition to CPU and memory resources, it also supports configuring extended resources such as GPU and any other resources supported by Kubernetes.
Support for attaching quotas to nodes: Use ResourceFlavor to select nodes and associate ResourceFlavor with a quota in ElasticQuotaTree. After the association, pods in the elastic quota can only be scheduled to nodes selected by ResourceFlavor.
Configuration example of capacity scheduling
In this example, the cluster contains a single ecs.sn2.13xlarge node with 56 vCPUs and 224 GiB of memory.
Create the following namespaces.
kubectl create ns namespace1
kubectl create ns namespace2
kubectl create ns namespace3
kubectl create ns namespace4
Create the corresponding elastic quota group according to the following YAML file:
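The original YAML is not reproduced on this page. The following is a minimal sketch that is consistent with the quota values referenced later in this example (root.max.cpu=40, min.cpu=10 and max.cpu=20 for root.a.1 and root.a.2, and min.cpu=10 for root.b.1 and root.b.2). The min and max values of the intermediate quotas root.a and root.b, and the max values of root.b.1 and root.b.2, are assumptions; memory quotas are omitted for brevity.
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
  name: elasticquotatree
  namespace: kube-system        # The ElasticQuotaTree is created in the kube-system namespace.
spec:
  root:
    name: root
    max:
      cpu: 40                   # Min of the root equals Max and must not exceed the total cluster resources.
    min:
      cpu: 40
    children:
    - name: root.a              # Min and Max of the intermediate quotas are assumptions.
      max:
        cpu: 40
      min:
        cpu: 20
      children:
      - name: root.a.1
        max:
          cpu: 20
        min:
          cpu: 10
        namespaces:             # Each namespace belongs to only one leaf quota.
        - namespace1
      - name: root.a.2
        max:
          cpu: 20
        min:
          cpu: 10
        namespaces:
        - namespace2
    - name: root.b
      max:
        cpu: 40
      min:
        cpu: 20
      children:
      - name: root.b.1
        max:
          cpu: 20               # Max values under root.b are assumptions.
        min:
          cpu: 10
        namespaces:
        - namespace3
      - name: root.b.2
        max:
          cpu: 20
        min:
          cpu: 10
        namespaces:
        - namespace4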
According to the above YAML file, configure the corresponding namespaces in the namespaces field and the corresponding child elastic quotas in the children field. The configuration must meet the following requirements:
In the same elastic quota, Min ≤ Max.
The sum of the Min values of the child elastic quotas must be less than or equal to the Min value of the parent quota.
The Min of the root node equals its Max and is less than or equal to the total resources of the cluster.
Each namespace belongs to only one leaf elastic quota. A leaf can contain multiple namespaces.
Check whether the elastic quota group is created successfully.
kubectl get ElasticQuotaTree -n kube-system
Expected output:
NAME               AGE
elasticquotatree   68s
Borrow idle resources
Deploy a service in namespace1 according to the following YAML file. The number of pod replicas is 5, and each pod requests 5 vCPUs.
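The original Deployment manifest is not reproduced on this page. The following is a minimal sketch that matches the description above and the pod names in the expected output; the app labels and the nginx image are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx1
  namespace: namespace1
  labels:
    app: nginx1
spec:
  replicas: 5                  # 5 pod replicas, as described above.
  selector:
    matchLabels:
      app: nginx1
  template:
    metadata:
      labels:
        app: nginx1
    spec:
      containers:
      - name: nginx1
        image: nginx           # The image is an assumption.
        resources:
          requests:
            cpu: 5             # Each pod requests 5 vCPUs.
          limits:
            cpu: 5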
Check the deployment status of pods in the cluster.
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          70s
nginx1-744b889544-6l4s9   1/1     Running   0          70s
nginx1-744b889544-cgzlr   1/1     Running   0          70s
nginx1-744b889544-w2gr7   1/1     Running   0          70s
nginx1-744b889544-zr5xz   0/1     Pending   0          70s
Because the cluster still has idle resources (root.max.cpu=40), when the CPU resources requested by the pods in namespace1 exceed 10 vCPUs (the min.cpu=10 configured for root.a.1), the pods can continue to borrow idle resources from other quotas. The total amount of CPU resources that they can use is capped at 20 vCPUs (the max.cpu=20 configured for root.a.1).
When the CPU resources requested by the pods exceed 20 vCPUs (max.cpu=20), any additional pods remain in the Pending state. Therefore, of the 5 requested pods, 4 are Running and 1 is Pending.
Deploy a service in namespace2 according to the following YAML file. The number of pod replicas is 5, and each pod requests 5 vCPUs.
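As before, the original manifest is not reproduced here. The sketch below mirrors the one for nginx1, with only the name and namespace changed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx2
  namespace: namespace2
  labels:
    app: nginx2
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx2
  template:
    metadata:
      labels:
        app: nginx2
    spec:
      containers:
      - name: nginx2
        image: nginx           # The image is an assumption.
        resources:
          requests:
            cpu: 5             # Each pod requests 5 vCPUs.
          limits:
            cpu: 5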
Check the deployment status of pods in the cluster.
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          111s
nginx1-744b889544-6l4s9   1/1     Running   0          111s
nginx1-744b889544-cgzlr   1/1     Running   0          111s
nginx1-744b889544-w2gr7   1/1     Running   0          111s
nginx1-744b889544-zr5xz   0/1     Pending   0          111s
kubectl get pods -n namespace2
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-4gl8s   1/1     Running   0          111s
nginx2-556f95449f-crwk4   1/1     Running   0          111s
nginx2-556f95449f-gg6q2   0/1     Pending   0          111s
nginx2-556f95449f-pnz5k   1/1     Running   0          111s
nginx2-556f95449f-vjpmq   1/1     Running   0          111s
Similar to nginx1, because the cluster still has idle resources (root.max.cpu=40), when the CPU resources requested by the pods in namespace2 exceed 10 vCPUs (the min.cpu=10 configured for root.a.2), the pods can continue to borrow idle resources from other quotas. The total amount of CPU resources that they can use is capped at 20 vCPUs (the max.cpu=20 configured for root.a.2).
When the CPU resources requested by the pods exceed 20 vCPUs (max.cpu=20), any additional pods remain in the Pending state. Therefore, of the 5 requested pods, 4 are Running and 1 is Pending.
At this point, the pods in namespace1 and namespace2 together occupy 40 vCPUs, which is the limit configured for root (root.max.cpu=40).
Return borrowed resources
Deploy a service in namespace3 according to the following YAML file. The number of pod replicas is 5, and each pod requests 5 vCPUs.
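As before, the original manifest is not reproduced here. The sketch below mirrors the previous ones, with only the name and namespace changed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx3
  namespace: namespace3
  labels:
    app: nginx3
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx3
  template:
    metadata:
      labels:
        app: nginx3
    spec:
      containers:
      - name: nginx3
        image: nginx           # The image is an assumption.
        resources:
          requests:
            cpu: 5             # Each pod requests 5 vCPUs.
          limits:
            cpu: 5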
Run the following command to check the deployment status of pods in the cluster:
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-52dbg   1/1     Running   0          6m17s
nginx1-744b889544-cgzlr   1/1     Running   0          6m17s
nginx1-744b889544-nknns   0/1     Pending   0          3m45s
nginx1-744b889544-w2gr7   1/1     Running   0          6m17s
nginx1-744b889544-zr5xz   0/1     Pending   0          6m17s
kubectl get pods -n namespace2
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-crwk4   1/1     Running   0          4m22s
nginx2-556f95449f-ft42z   1/1     Running   0          4m22s
nginx2-556f95449f-gg6q2   0/1     Pending   0          4m22s
nginx2-556f95449f-hfr2g   1/1     Running   0          3m29s
nginx2-556f95449f-pvgrl   0/1     Pending   0          3m29s
kubectl get pods -n namespace3
Expected output:
NAME                     READY   STATUS    RESTARTS   AGE
nginx3-578877666-msd7f   1/1     Running   0          4m
nginx3-578877666-nfdwv   0/1     Pending   0          4m10s
nginx3-578877666-psszr   0/1     Pending   0          4m11s
nginx3-578877666-xfsss   1/1     Running   0          4m22s
nginx3-578877666-xpl2p   0/1     Pending   0          4m10s
The min parameter of the elastic quota root.b.1, which nginx3 belongs to, is set to 10. To guarantee these configured min resources, the scheduler reclaims the resources under root.a that were previously borrowed from root.b. This allows nginx3 to obtain at least 10 vCPUs (min.cpu=10) to run.
The scheduler comprehensively considers factors such as the priority, availability, and creation time of the jobs under root.a, and selects pods from them to return the previously borrowed resources (10 vCPUs). Therefore, after nginx3 obtains the 10 vCPUs (min.cpu=10), 2 of its pods are Running and the other 3 remain Pending.
Deploy a service nginx4 in namespace4 according to the following YAML file. The number of pod replicas is 5, and each pod requests 5 vCPUs.
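As before, the original manifest is not reproduced here. The sketch below mirrors the previous ones, with only the name and namespace changed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx4
  namespace: namespace4
  labels:
    app: nginx4
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx4
  template:
    metadata:
      labels:
        app: nginx4
    spec:
      containers:
      - name: nginx4
        image: nginx           # The image is an assumption.
        resources:
          requests:
            cpu: 5             # Each pod requests 5 vCPUs.
          limits:
            cpu: 5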
Run the following command to check the deployment status of pods in the cluster:
kubectl get pods -n namespace1
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx1-744b889544-cgzlr   1/1     Running   0          8m20s
nginx1-744b889544-cwx8l   0/1     Pending   0          55s
nginx1-744b889544-gjkx2   0/1     Pending   0          55s
nginx1-744b889544-nknns   0/1     Pending   0          5m48s
nginx1-744b889544-zr5xz   1/1     Running   0          8m20s
kubectl get pods -n namespace2
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx2-556f95449f-cglpv   0/1     Pending   0          3m45s
nginx2-556f95449f-crwk4   1/1     Running   0          9m31s
nginx2-556f95449f-gg6q2   1/1     Running   0          9m31s
nginx2-556f95449f-pvgrl   0/1     Pending   0          8m38s
nginx2-556f95449f-zv8wn   0/1     Pending   0          3m45s
kubectl get pods -n namespace3
Expected output:
NAME                     READY   STATUS    RESTARTS   AGE
nginx3-578877666-msd7f   1/1     Running   0          8m46s
nginx3-578877666-nfdwv   0/1     Pending   0          8m56s
nginx3-578877666-psszr   0/1     Pending   0          8m57s
nginx3-578877666-xfsss   1/1     Running   0          9m8s
nginx3-578877666-xpl2p   0/1     Pending   0          8m56s
kubectl get pods -n namespace4
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE
nginx4-754b767f45-g9954   1/1     Running   0          4m32s
nginx4-754b767f45-j4v7v   0/1     Pending   0          4m32s
nginx4-754b767f45-jk2t7   0/1     Pending   0          4m32s
nginx4-754b767f45-nhzpf   0/1     Pending   0          4m32s
nginx4-754b767f45-tv5jj   1/1     Running   0          4m32s
The min parameter of the elastic quota root.b.2, which nginx4 belongs to, is set to 10. To guarantee these configured min resources, the scheduler reclaims the resources under root.a that were previously borrowed from root.b. This allows nginx4 to obtain at least 10 vCPUs (min.cpu=10) to run.
The scheduler comprehensively considers factors such as the priority, availability, and creation time of the jobs under root.a, and selects pods from them to return the previously borrowed resources (10 vCPUs). Therefore, after nginx4 obtains the 10 vCPUs (min.cpu=10), 2 of its pods are Running and the other 3 remain Pending.
At this point, all elastic quotas in the cluster are using the guaranteed resources specified by their min settings.
ResourceFlavor configuration example
Prerequisites
The ResourceFlavor CRD is installed. For more information, see ResourceFlavorCRD. The ACK scheduler does not install this CRD by default.
Only the nodeLabels field is effective in the ResourceFlavor resource.
The scheduler version is later than 6.9.0. For more information about the release notes of the scheduler, see kube-scheduler. For more information about how to upgrade components, see Components.
ResourceFlavor is a Kubernetes CustomResourceDefinition (CRD) that binds elastic quotas to nodes by defining node labels (nodeLabels). When a ResourceFlavor is associated with an elastic quota, pods under that quota are not only limited by the total amount of quota resources, but are also scheduled only to nodes that match the nodeLabels.
ResourceFlavor example
An example of a ResourceFlavor is as follows.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
name: "spot"
spec:
nodeLabels:
instance-type: spot
Example of associating an elastic quota
To associate an elastic quota with a ResourceFlavor, you must declare it in the ElasticQuotaTree by using the attributes field. The following code shows an example.
apiVersion: scheduling.sigs.k8s.io/v1beta1
kind: ElasticQuotaTree
metadata:
name: elasticquotatree
namespace: kube-system
spec:
root:
children:
- attributes:
resourceflavors: spot
max:
cpu: 99
memory: 40Gi
nvidia.com/gpu: 10
min:
cpu: 99
memory: 40Gi
nvidia.com/gpu: 10
name: child
namespaces:
- default
max:
cpu: 999900
memory: 400000Gi
nvidia.com/gpu: 100000
min:
cpu: 999900
memory: 400000Gi
nvidia.com/gpu: 100000
name: root
After the ElasticQuotaTree is submitted, pods that belong to the quota child are scheduled only to nodes with the instance-type: spot label.
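For example, you can add the matching label to a node so that pods in this quota can be scheduled to it. The node name below is a placeholder.
kubectl label node <NODE_NAME> instance-type=spot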
References
For more information about release records of kube-scheduler, see kube-scheduler.
kube-scheduler supports gang scheduling, which requires that an associated group of pods be scheduled all at once; otherwise, none of them are scheduled. Gang scheduling is suitable for big data processing scenarios, such as Spark and Hadoop workloads. For more information, see Work with gang scheduling.