Priority-based resource scheduling is provided by Alibaba Cloud to meet elasticity requirements in pod scheduling. A ResourcePolicy specifies the priorities of nodes in descending order for pod scheduling. When the system deploys or scales out pods for an application, pods are scheduled to nodes based on the priorities of the nodes that are listed in the ResourcePolicy. When the system scales in pods for an application, pods are removed from nodes based on the priorities of the nodes in ascending order.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster that runs Kubernetes 1.20.11 or later is created. For more information about how to update the Kubernetes version of an ACK cluster, see Update the Kubernetes version of an ACK cluster.
The scheduler version that is required varies based on the Kubernetes version of the cluster. The following table describes the scheduler versions that are required for different Kubernetes versions. For more information about the features of different scheduler versions, see kube-scheduler.
| Kubernetes version | Scheduler version |
| --- | --- |
| 1.20 | 1.20.4-ack-7.0 and later |
| 1.22 | 1.22.15-ack-2.0 and later |
| 1.24 and later | All versions are supported |
ack-virtual-node is deployed if you want to use elastic container instances. For more information, see Use Elastic Container Instance in ACK clusters.
Limits
Priority-based resource scheduling is mutually exclusive with the pod deletion cost feature. For more information about the pod deletion cost feature, see Pod deletion cost.
You cannot use priority-based resource scheduling and Elastic Container Instance-based scheduling at the same time. For more information about Elastic Container Instance-based scheduling, see Use Elastic Container Instance-based scheduling.
This feature uses the BestEffort policy and does not ensure that pods are removed from nodes based on the priorities of the nodes in ascending order when the system scales in pods for an application.
To use new features, we recommend that you update the scheduler to the latest version. For more information about how to update the scheduler, see kube-scheduler.
The max parameter is available only when your cluster runs Kubernetes 1.22 or later and the version of the scheduler installed in your cluster is 5.0 or later.
If you use this feature together with elastic node pools, invalid nodes may be added to the elastic node pools. Make sure that the elastic node pools are included in units. Do not specify the max field for the units.
If the version of the scheduler installed in your cluster is earlier than 5.0 or your cluster runs Kubernetes 1.20 or earlier, pods that are created before a ResourcePolicy is created are prioritized for deletion during scale-in activities.
If your scheduler version is earlier than 6.1 or the Kubernetes version of your cluster is 1.20 or earlier, do not modify the ResourcePolicy before you delete all pods that are associated with the ResourcePolicy.
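To illustrate the elastic node pool limit above, the following is a minimal sketch of a ResourcePolicy in which the elastic node pool is included in a unit and the `max` field is omitted for that unit. The node pool IDs and the pod label are hypothetical placeholders, not values from a real cluster.

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: elastic-example      # hypothetical name
  namespace: default
spec:
  selector:
    app: my-app              # hypothetical pod label
  strategy: prefer
  units:
  - resource: ecs
    max: 10                  # max may be set on a regular node pool
    nodeSelector:
      alibabacloud.com/nodepool-id: np-regular-pool-id****   # placeholder ID
  - resource: ecs            # unit for the elastic node pool: do not set max here
    nodeSelector:
      alibabacloud.com/nodepool-id: np-elastic-pool-id****   # placeholder ID
```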
How to configure priority-based resource scheduling
Create a ResourcePolicy with the following template:
```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  ignorePreviousPod: true
  ignoreTerminatingPod: false
  matchLabelKeys:
  - pod-template-hash
  preemptPolicy: AfterAllUnits
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
```
- `selector`: the selector that is used to select pods in a namespace. The ResourcePolicy is applied to the selected pods. In this example, pods that have the `key1=value1` label are selected. If `selector` is left empty, the ResourcePolicy takes effect on all pods in the namespace.
- `strategy`: the scheduling policy. Set the value to `prefer`.
- `preemptPolicy` (available when the scheduler version is 6.1 or later): the preemption policy that is used when the ResourcePolicy contains multiple `units`. The policy specifies whether the scheduler preempts nodes each time it fails to schedule pods to a unit. If you set the preemption policy to `BeforeNextUnit`, the scheduler attempts to preempt nodes each time it fails to schedule pods to a unit. If you set the preemption policy to `AfterAllUnits`, the scheduler attempts to preempt nodes only after it fails to schedule pods to all units. Default value: `AfterAllUnits`.
- `ignorePreviousPod` (available when the scheduler version is 6.1 or later): This parameter must be used together with the `max` parameter in the `units` section. If this parameter is set to `true`, pods that are scheduled before the ResourcePolicy is created are not counted toward the value of the `max` parameter.
- `ignoreTerminatingPod` (available when the scheduler version is 6.1 or later): This parameter must be used together with the `max` parameter in the `units` section. If this parameter is set to `true`, pods in the Terminating state are not counted toward the value of the `max` parameter.
- `matchLabelKeys` (available when the scheduler version is 6.2 or later): This parameter must be used together with the `max` parameter in the `units` section. It specifies pod label keys; pods are counted toward the `max` limit separately for each combination of values of these label keys. After you configure `matchLabelKeys`, pods that do not have the specified label keys cannot be scheduled.
- `units`: the schedulable units. In a scale-out activity, pods are scheduled to nodes based on the priorities of the units in descending order. In a scale-in activity, pods are deleted from nodes based on the priorities of the units in ascending order.
  - `resource`: the type of resource for pod scheduling. Valid values: `eci` and `ecs`.
  - `nodeSelector`: the selector that is used to select nodes with the specified node label. This parameter takes effect only if the `resource` parameter is set to `ecs`.
  - `max` (available when the scheduler version is 5.0 or later): the maximum number of replicated pods that can be scheduled to the unit.
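As an illustration of how the `max`-related parameters described above work together, the following is a minimal sketch. The policy name, pod label, and node label are hypothetical and chosen for illustration only.

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: web-policy              # hypothetical name
  namespace: default
spec:
  selector:
    app: web                    # hypothetical pod label
  strategy: prefer
  ignorePreviousPod: true       # pods scheduled before this policy existed do not count toward max
  ignoreTerminatingPod: true    # pods in the Terminating state do not count toward max
  matchLabelKeys:
  - pod-template-hash           # count max separately for each ReplicaSet revision
  units:
  - resource: ecs
    max: 4                      # at most 4 replicas per pod-template-hash in this unit
    nodeSelector:
      paidtype: subscription    # hypothetical node label
  - resource: eci               # additional replicas overflow to elastic container instances
```

With this sketch, a rolling update that creates a new ReplicaSet gets its own `max` budget of 4 on the subscription nodes because the two revisions have different `pod-template-hash` values.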
Examples
Example 1: Schedule pods to specified node pools
You want to schedule the pods of a Deployment to specific node pools, for example, Node Pool A and Node Pool B. You want to prioritize the use of Node Pool A and schedule pods to Node Pool B only if the computing resources of Node Pool A are insufficient. In scale-in activities, you want to delete pods from the nodes in Node Pool B first. In this example, Node Pool A contains the nodes `cn-beijing.10.0.3.137` and `cn-beijing.10.0.3.138`, and Node Pool B contains the nodes `cn-beijing.10.0.6.47` and `cn-beijing.10.0.6.46`. Each node has 2 vCores and 4 GB of memory. Perform the following steps to configure priority-based resource scheduling:
Create a ResourcePolicy with the following template:
```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # You must specify the label of the pods to which you want to apply the ResourcePolicy.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
```
Note: To view the ID of a node pool, choose Nodes > Node Pools on the details page of a cluster in the ACK console. For more information, see Create a node pool.
Use the following template to create a Deployment that provisions two pods:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # The pod label must be the same as the one that you specified for the selector in the ResourcePolicy.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
```
Deploy an NGINX application and query the pods.
Run the following command to deploy an NGINX application:
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to query the pods:
kubectl get pods -o wide
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          17s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running   0          17s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
```
The output shows that the two pods are scheduled to Node Pool A.
Scale out pods for the NGINX application.
Run the following command to increase the number of pods to four:
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to query the status of the pods:
kubectl get pods -o wide
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE    IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          101s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running   0          101s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
nginx-9cdf7bbf9-m****   1/1     Running   0          18s    172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-x****   1/1     Running   0          18s    172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
```
The output shows that new pods are scheduled to Node Pool B because the computing resources in Node Pool A are insufficient.
Scale in pods for the NGINX application.
Run the following command to reduce the number of pods to two:
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to query the status of the pods:
kubectl get pods -o wide
Expected output:
```
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          2m41s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          2m41s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
nginx-9cdf7bbf9-m****   0/1     Terminating   0          78s     172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-x****   0/1     Terminating   0          78s     172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
```
The output shows that pods on the nodes in Node Pool B are deleted.
Example 2: Schedule pods to ECS instances and elastic container instances
You want to schedule the pods of a Deployment to multiple types of resources, such as subscription Elastic Compute Service (ECS) instances, pay-as-you-go ECS instances, and elastic container instances. To reduce the resource cost, you want to schedule pods to resources based on the following priorities: subscription ECS instances > pay-as-you-go ECS instances > elastic container instances. In scale-in activities, you want to delete pods from these resources based on the following sequence: elastic container instances, pay-as-you-go ECS instances, and subscription ECS instances. In this example, each node has 2 vCores and 4 GB of memory. Perform the following steps to configure priority-based resource scheduling:
Run the following command to add labels that indicate different billing methods to the nodes. You can also use node pools to automatically add labels to the nodes.

```shell
kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
```
Create a ResourcePolicy with the following template:
```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # You must specify the label of the pods to which you want to apply the ResourcePolicy.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      paidtype: subscription
  - resource: ecs
    nodeSelector:
      paidtype: pay-as-you-go
  - resource: eci
```
Use the following template to create a Deployment that provisions two pods:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # The pod label must be the same as the one that you specified for the selector in the ResourcePolicy.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
```
Deploy an NGINX application and query the pods.
Run the following command to deploy an NGINX application:
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to query the pods:
kubectl get pods -o wide
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          66s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          66s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
```
The output shows that the two pods are scheduled to nodes with the `paidtype=subscription` label.
Scale out pods for the NGINX application.
Run the following command to increase the number of pods to four:
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to query the status of the pods:
kubectl get pods -o wide
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running   0          16s     172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running   0          3m48s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running   0          16s     172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          3m48s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
```
The output shows that new pods are scheduled to nodes with the `paidtype=pay-as-you-go` label because the resources on nodes with the `paidtype=subscription` label are insufficient.
Run the following command to increase the number of pods to six:
kubectl scale deployment nginx --replicas 6
Expected output:
deployment.apps/nginx scaled
Run the following command to query the status of the pods:
kubectl get pods -o wide
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running   0          3m10s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running   0          6m42s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running   0          3m10s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          6m42s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Running   0          36s     10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Running   0          36s     10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
```
The output shows that new pods are scheduled to elastic container instances because the ECS nodes are insufficient.
Scale in pods for the NGINX application.
Run the following command to reduce the number of pods to four:
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to query the status of the pods:
kubectl get pods -o wide
Expected output:
```
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          4m59s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          8m31s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          4m59s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          8m31s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Terminating   0          2m25s   10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Terminating   0          2m25s   10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
```
The output shows that the pods on elastic container instances are deleted.
Run the following command to reduce the number of pods to two:
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to query the status of the pods:
kubectl get pods -o wide
Expected output:
```
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   0/1     Terminating   0          6m43s   172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          10m     172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-f****   0/1     Terminating   0          6m43s   172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          10m     172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
```
The output shows that the pods on the nodes with the `paidtype=pay-as-you-go` label are deleted.
Run the following command to query the status of the pods:
kubectl get pods -o wide
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          11m   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          11m   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
```
The output shows that pods run only on the nodes with the `paidtype=subscription` label.
References
When you deploy Services in an ACK cluster, you can configure tolerations and node affinity to enable the scheduler to use only Elastic Compute Service (ECS) instances or elastic container instances, or allow the scheduler to automatically apply for elastic container instances when ECS instances are insufficient. You can configure different scheduling policies to scale resources in different scenarios. For more information, see Configure Elastic Container Instance-based scheduling.
High availability and high performance are essential to distributed jobs. In an ACK Pro cluster, you can use the Kubernetes-native scheduling semantics to spread distributed jobs across zones for high availability. You can also use the Kubernetes-native scheduling semantics to deploy distributed jobs in specific zones based on affinity settings for high performance. For more information, see Spread Elastic Container Instance-based pods across zones and configure affinities.