Custom elastic resource priority scheduling is an elastic scheduling policy from Alibaba Cloud. This policy lets you define a custom resource policy (ResourcePolicy) during application deployment or scale-out. The ResourcePolicy specifies the order for scheduling application instance pods to different types of node resources. During a scale-in, pods are removed in the reverse of the scheduling order.
Do not use system-reserved labels, such as alibabacloud.com/compute-class and alibabacloud.com/compute-qos, in the spec.selector.matchLabels field of a workload, such as a deployment. Custom priority scheduling may modify these labels, which can cause the workload controller to repeatedly recreate pods and affect application stability.
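For example, keep the selector on your own application labels. The following Deployment fragment is a minimal sketch (the app: nginx label is an assumed placeholder) that shows a safe selector:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx # Use your own application label here.
      # Do not add system-reserved labels such as alibabacloud.com/compute-class
      # or alibabacloud.com/compute-qos to matchLabels.
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx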
Prerequisites
An ACK managed cluster (Pro version) that runs v1.20.11 or later has been created. To upgrade a cluster, see Manually upgrade a cluster.
The scheduler version must meet the requirements for your ACK cluster version. For more information about the features supported by different scheduler versions, see kube-scheduler.
ACK version          Scheduler version
1.20                 v1.20.4-ack-7.0 or later
1.22                 v1.22.15-ack-2.0 or later
1.24 or later        All versions are supported
To use ECI resources, ensure that the ack-virtual-node component is deployed. For more information, see Use ECI in ACK.
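If you are not sure whether the component is ready, one quick check is to list the nodes and look for virtual nodes; their names typically start with virtual-kubelet, as in the outputs later in this topic. This is a sketch, not an official verification procedure:
kubectl get nodes | grep virtual-kubelet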
Notes
Starting from scheduler version v1.x.x-aliyun-6.4, the default value of the ignorePreviousPod field for custom elastic resource priorities is changed to False, and the default value of the ignoreTerminatingPod field is changed to True. This change does not affect existing ResourcePolicy configurations or subsequent updates to them.
This feature conflicts with pod-deletion-cost and cannot be used at the same time.
This feature cannot be used with ECI elastic scheduling through ElasticResource.
This feature uses a BestEffort policy and does not guarantee that scale-in operations strictly follow the reverse order.
The `max` field is available only in clusters that run v1.22 or later with a scheduler of v5.0 or later.
When used with an elastic node pool, this feature may cause the node pool to eject nodes incorrectly. To use this feature with an elastic node pool, include the elastic node pool in a Unit and do not set the `max` field for that Unit.
If your scheduler version is earlier than 5.0 or your cluster version is 1.20 or earlier, pods that exist before the ResourcePolicy is created are the first to be scaled in.
If your scheduler version is earlier than 6.1 or your cluster version is 1.20 or earlier, do not modify the ResourcePolicy until all associated pods are completely deleted.
Usage
You can create a ResourcePolicy to define elastic resource priorities:
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    podLabels:
      key1: value1
    podAnnotations:
      key1: value1
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
  # Optional, advanced configurations
  preemptPolicy: AfterAllUnits
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  matchLabelKeys:
  - pod-template-hash
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 1m
The following list describes the fields in the ResourcePolicy:
selector: Specifies that the ResourcePolicy applies to pods that are in the same namespace and have the label key1=value1. If the selector is empty, the policy applies to all pods in the namespace.
strategy: The scheduling strategy. Currently, only prefer is supported.
units: User-defined scheduling units. During a scale-out, resources are created in the order defined in units. During a scale-in, resources are removed in the reverse order.
resource: The type of elastic resource. The supported values are eci, ecs, elastic, and acs. The elastic type is available in clusters of v1.24 or later with a scheduler of v6.4.3 or later. The acs type is available in clusters of v1.26 or later with a scheduler of v6.7.1 or later.
Note: The elastic type will be deprecated. We recommend that you use auto-scaling node pools by setting k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" in the pod labels.
Note: The acs type adds the alibabacloud.com/compute-class: general-purpose and alibabacloud.com/compute-qos: default labels to pods by default. You can overwrite the default values by declaring different values in the pod labels. If alpha.alibabacloud.com/compute-qos-strategy is declared in the pod annotations, the alibabacloud.com/compute-qos: default label is not added by default.
Note: The acs and eci types add tolerations for virtual node taints to pods by default. Pods can be scheduled to virtual nodes without additional taint toleration configurations.
Important: In scheduler versions earlier than 6.8.3, you cannot use multiple acs Units at the same time.
nodeSelector: Specifies node labels that select nodes for the scheduling unit. This parameter applies only to ECS resources.
max (available for scheduler v5.0 and later): The maximum number of pod replicas that can be scheduled in this scheduling unit.
maxResources (available for scheduler v6.9.5 and later): The maximum amount of resources that can be scheduled for pods in this scheduling unit.
podAnnotations: The type is map[string]string{}. The key-value pairs configured in podAnnotations are added to the pod by the scheduler. When counting the number of pods in this Unit, only pods with these key-value pairs are counted.
podLabels: The type is map[string]string{}. The key-value pairs configured in podLabels are added to the pod by the scheduler. When counting the number of pods in this Unit, only pods with these key-value pairs are counted.
Note: If k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" is included in the podLabels of a Unit, and the number of pods in the current Unit is less than the specified max value, the scheduler makes the pod wait in the current Unit. You can set the waiting time in whenTryNextUnits. The label k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" is not added to the pod and is not required on the pod when counting the number of pods.
Note: When you use ResourcePolicy with auto scaling, you must also use instant elasticity. Otherwise, cluster-autoscaler might trigger incorrect node pool scaling.
preemptPolicy (available for scheduler v6.1 and later; this parameter does not take effect for ACS): When a ResourcePolicy contains multiple units, this field specifies whether the scheduler can attempt preemption when scheduling fails for a Unit. BeforeNextUnit indicates that the scheduler attempts preemption if scheduling fails for any Unit. AfterAllUnits indicates that the scheduler attempts preemption only if scheduling fails for the last Unit. The default value is AfterAllUnits. You can configure the ACK scheduler parameters to enable preemption.
ignorePreviousPod (available for scheduler v6.1 and later): This field must be used with max in units. If this field is set to true, pods that were scheduled before the ResourcePolicy was created are ignored when the number of pods is counted.
ignoreTerminatingPod (available for scheduler v6.1 and later): This field must be used with max in units. If this field is set to true, pods in the Terminating state are ignored when the number of pods is counted.
matchLabelKeys (available for scheduler v6.2 and later): This field must be used with max in units. Pods are grouped based on the values of the specified labels, and each group is subject to its own max count. If a pod is missing a label that is declared in matchLabelKeys, the pod cannot be scheduled.
whenTryNextUnits (available for cluster v1.24 and later with scheduler v6.4 and later): Describes the conditions under which a pod is allowed to use resources from the next Unit.
policy: The policy that determines when a pod can be scheduled to the next Unit. Valid values are ExceedMax, LackResourceAndNoTerminating, TimeoutOrExceedMax, and LackResourceOrExceedMax (default).
ExceedMax: The pod is allowed to use resources from the next Unit if the max and maxResources fields for the current Unit are not set, or if the number of pods in the current Unit is greater than or equal to the specified max value, or if the amount of used resources in the current Unit plus the resources of the current pod exceeds maxResources. This policy can be used with auto scaling and ECI to prioritize auto scaling for node pools.
Important: If the auto-scaling node pool cannot create nodes for a long time, this policy may cause pods to remain in the Pending state. Because Cluster Autoscaler is not aware of the max limit in ResourcePolicy, the actual number of created instances may be greater than the specified max value. This issue will be fixed in a future release.
TimeoutOrExceedMax: The pod is allowed to wait in the current Unit if one of the following conditions is met:
The max value for the current Unit is set, and the number of pods in the Unit is less than the max value; or maxResources is set, and the amount of scheduled resources plus the resources of the current pod is less than maxResources.
The max value for the current Unit is not set, and the podLabels of the current Unit contain k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true".
If the current Unit has insufficient resources to schedule the pod, the pod remains pending in that Unit for at most the duration specified by timeout. When used with auto scaling and ECI, this policy prioritizes auto scaling for node pools and automatically uses ECI after the timeout.
Important: If a node is created during the timeout period but does not reach the Ready state, and the pod does not tolerate the NotReady taint, the pod is still scheduled to an ECI instance.
LackResourceOrExceedMax: The pod is allowed to use resources from the next Unit if the number of pods in the current Unit is greater than or equal to the specified max value, or if there are no more available resources in the current Unit. This is the default policy and is suitable for most basic scenarios.
LackResourceAndNoTerminating: The pod is allowed to use resources from the next Unit if the number of pods in the current Unit is greater than or equal to the specified max value or there are no more available resources in the current Unit, and there are no pods in the Terminating state in the current Unit. This policy is suitable for use with rolling update strategies to prevent new pods from being rolled out to subsequent Units because of terminating pods.
timeout: When the policy is set to TimeoutOrExceedMax, this field specifies the timeout duration. If this field is empty, the default value is 15 minutes. The timeout parameter is not supported in an acs Unit, which is limited only by max.
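For example, the following ResourcePolicy is a minimal sketch that prioritizes an auto-scaling node pool and falls back to ECI after a timeout. The node pool ID np-auto-scaling-example****, the app: demo selector, and the 10-minute timeout are illustrative assumptions, not values from this topic:
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: prefer-auto-scaling # Hypothetical name used for illustration.
  namespace: default
spec:
  selector:
    app: demo # Must match the labels of the pods that this policy applies to.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np-auto-scaling-example**** # Hypothetical auto-scaling node pool ID.
    podLabels:
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" # Wait for ECS scale-out in this Unit.
  - resource: eci # Fall back to ECI after the timeout.
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 10m # Assumed value. Defaults to 15 minutes when omitted.
Because no max is set on the first Unit and its podLabels contain k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true", the TimeoutOrExceedMax policy makes pending pods wait in that Unit for up to 10 minutes before they are allowed to use ECI.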
Example scenarios
Scenario 1: Schedule pods based on node pool priority
You want to create a deployment in a cluster that has two node pools: Node Pool A and Node Pool B. You want to prioritize scheduling pods to Node Pool A. If Node Pool A has insufficient resources, pods are scheduled to Node Pool B. When you scale in the deployment, pods in Node Pool B are removed first, followed by pods in Node Pool A. In this example, the nodes cn-beijing.10.0.3.137 and cn-beijing.10.0.3.138 belong to Node Pool A, and the nodes cn-beijing.10.0.6.47 and cn-beijing.10.0.6.46 belong to Node Pool B. Each node has 2 cores and 4 GB of memory. The following steps describe the procedure:
You can use the following YAML content to create a ResourcePolicy that customizes the node pool scheduling order.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # This must be associated with the label of the pod that you will create later.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
Note: You can obtain the node pool ID on the Node Management > Node Pools page of the cluster. For more information, see Create and manage node pools.
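Alternatively, you can read the node pool ID from the node labels. The following command assumes that your nodes carry the alibabacloud.com/nodepool-id label used in the ResourcePolicy above; the -L flag prints the label value as an extra column:
kubectl get nodes -L alibabacloud.com/nodepool-id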
You can use the following YAML content to create a deployment and deploy two pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # This must be associated with the selector of the ResourcePolicy created in the previous step.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Create the Nginx application and view the deployment result.
Run the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to view the deployment result.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          17s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running   0          17s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
The output shows that the first two pods are scheduled to the nodes in Node Pool A.
Scale out the pods.
Run the following command to scale out the pods to four replicas.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE    IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          101s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running   0          101s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
nginx-9cdf7bbf9-m****   1/1     Running   0          18s    172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-x****   1/1     Running   0          18s    172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
The output shows that when the nodes in Node Pool A have insufficient resources, the new pods are scheduled to the nodes in Node Pool B.
Scale in the pods.
Run the following command to scale in the pods from four replicas to two.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          2m41s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          2m41s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
nginx-9cdf7bbf9-m****   0/1     Terminating   0          78s     172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-x****   0/1     Terminating   0          78s     172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
The output shows that pods in Node Pool B are removed first, which is the reverse of the scheduling order.
Scenario 2: Use hybrid scheduling for ECS and ECI
You want to create a deployment in a cluster that has three types of resources: subscription ECS instances, pay-as-you-go ECS instances, and ECI instances. To reduce resource costs, you want to set the scheduling priority in the following order: subscription ECS instances, pay-as-you-go ECS instances, and then ECI instances. When you scale in the deployment, you want to remove pods from ECI instances first, then from pay-as-you-go ECS instances, and finally from subscription ECS instances. In this example, each node has 2 cores and 4 GB of memory. The following steps describe the procedure for hybrid scheduling of ECS and ECI:
Run the following commands to add different labels to nodes that use different billing methods. You can also use the node pool feature to automatically add labels.
kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
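You can confirm that the labels are applied before you create the ResourcePolicy; the -L flag prints the value of the paidtype label of each node as an extra column:
kubectl get nodes -L paidtype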
You can use the following YAML content to create a ResourcePolicy that customizes the scheduling order.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # This must be associated with the label of the pod that you will create later.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      paidtype: subscription
  - resource: ecs
    nodeSelector:
      paidtype: pay-as-you-go
  - resource: eci
You can use the following YAML content to create a deployment and deploy two pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # This must be associated with the selector of the ResourcePolicy created in the previous step.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
Create the Nginx application and view the deployment result.
Run the following command to create the Nginx application.
kubectl apply -f nginx.yaml
Expected output:
deployment.apps/nginx created
Run the following command to view the deployment result.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          66s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          66s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
The output shows that the first two pods are scheduled to nodes with the paidtype=subscription label.
Scale out the pods.
Run the following command to scale out the pods to four replicas.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running   0          16s     172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running   0          3m48s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running   0          16s     172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          3m48s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
The output shows that when the nodes with the paidtype=subscription label have insufficient resources, the new pods are scheduled to the nodes with the paidtype=pay-as-you-go label.
Run the following command to scale out the pods to six replicas.
kubectl scale deployment nginx --replicas 6
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running   0          3m10s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running   0          6m42s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running   0          3m10s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          6m42s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Running   0          36s     10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Running   0          36s     10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
The output shows that when ECS resources are insufficient, the new pods are scheduled to ECI resources.
Scale in the pods.
Run the following command to scale in the pods from six replicas to four.
kubectl scale deployment nginx --replicas 4
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          4m59s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          8m31s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          4m59s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          8m31s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Terminating   0          2m25s   10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Terminating   0          2m25s   10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
The output shows that pods on ECI instances are removed first, which is the reverse of the scheduling order.
Run the following command to scale in the pods from four replicas to two.
kubectl scale deployment nginx --replicas 2
Expected output:
deployment.apps/nginx scaled
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   0/1     Terminating   0          6m43s   172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          10m     172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-f****   0/1     Terminating   0          6m43s   172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          10m     172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
The output shows that pods on nodes with the paidtype=pay-as-you-go label are removed first, which is the reverse of the scheduling order.
Run the following command to check the pod status.
kubectl get pods -o wide
Expected output:
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          11m   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          11m   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
The output shows that only pods on nodes with the paidtype=subscription label remain.
References
When you deploy services in an ACK cluster, you can use tolerations and node affinity to specify that only ECS or ECI elastic resources are used, or to automatically request ECI resources when ECS resources are insufficient. By configuring scheduling policies, you can meet various requirements for elastic resources in different workload scenarios. For more information, see Specify resource allocation for ECS and ECI.
High availability (HA) and high performance are important for running distributed tasks. In an ACK managed cluster (Pro version), you can use native Kubernetes scheduling semantics to spread distributed tasks across zones to meet HA deployment requirements. You can also use native Kubernetes scheduling semantics to implement affinity-based deployment of distributed tasks in a specified zone to meet high-performance deployment requirements. For more information, see Implement zone-based anti-affinity and affinity scheduling for ECI pods.