Custom elastic resource priority scheduling lets you define the order in which pods are scheduled across different resource types and node pools. Create a ResourcePolicy to set this order: during scale-out, pods are scheduled to resource units in the order you define; during scale-in, pods are removed in reverse order.
Do not use system-reserved labels such as alibabacloud.com/compute-class or alibabacloud.com/compute-qos in workload label selectors (for example, the spec.selector.matchLabels field of a Deployment). The system may modify these labels during custom priority scheduling, causing the controller to frequently rebuild pods and affecting application stability.
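For example, a safe workload selector matches only labels your application owns. The following sketch uses a hypothetical `app: web` Deployment to illustrate the rule:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web   # Application-owned label: safe to select on.
      # Do NOT add system-reserved labels such as
      # alibabacloud.com/compute-class or alibabacloud.com/compute-qos here.
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
```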
Prerequisites
Before you begin, ensure that you have:
- An ACK managed cluster Pro edition, version 1.20.11 or later. For upgrade steps, see Manually upgrade a cluster.
- A kube-scheduler version that meets the requirements for your ACK cluster version. See kube-scheduler for a full list of supported features per version.

  | ACK version | Scheduler version |
  |---|---|
  | 1.20 | v1.20.4-ack-7.0 or later |
  | 1.22 | v1.22.15-ack-2.0 or later |
  | 1.24 or later | All versions supported |

- (Required for ECI resources) The ack-virtual-node component deployed in your cluster. See Use ECI in ACK.
Usage notes
- Best-effort ordering: This feature uses a BestEffort policy. Pod scale-in does not strictly follow the reverse of the scheduling order in all cases.
- Starting from scheduler version v1.x.x-aliyun-6.4, the default value of `ignorePreviousPod` changed to `false`, and `ignoreTerminatingPod` changed to `true`. Existing ResourcePolicy objects and subsequent updates are not affected.
- This feature conflicts with pod-deletion-cost, and the two cannot be used together.
- This feature cannot be used with Elastic Container Instance (ECI) elastic scheduling implemented through ElasticResource. See Use ElasticResource for elastic scheduling of ECI pods.
- The `max` field is available only in clusters of version 1.22 or later with scheduler version 5.0 or later.
- When used with elastic node pools, this feature may cause node pools to create invalid nodes. To prevent this, include the elastic node pool in a unit and do not set the `max` field for that unit.
- If your scheduler version is earlier than 5.0 or your cluster version is 1.20 or earlier, pods that exist before the ResourcePolicy is created are the first to be scaled in.
- If your scheduler version is earlier than 6.1 or your cluster version is 1.20 or earlier, do not modify a ResourcePolicy while its associated pods are not yet completely deleted.
- When used with auto-scaling, this feature must be used with instant elasticity. Otherwise, the Cluster Autoscaler may trigger incorrect node pool scaling.
Create a ResourcePolicy
Define a ResourcePolicy with the following YAML structure:
```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    podLabels:
      key1: value1
    podAnnotations:
      key1: value1
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
  # Optional advanced configuration
  preemptPolicy: AfterAllUnits
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  matchLabelKeys:
  - pod-template-hash
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 1m
```
spec fields
| Field | Description |
|---|---|
| `selector` | Selects pods that have matching labels in the same namespace. If empty, applies to all pods in the namespace. |
| `strategy` | Scheduling strategy. Only `prefer` is supported. |
| `units` | Ordered list of scheduling units. Pods are scheduled in list order during scale-out and removed in reverse order during scale-in. |
units fields
| Field | Description |
|---|---|
| `resource` | Resource type for this unit. Supported values: `ecs`, `eci`, `elastic` (clusters 1.24+ with scheduler 6.4.3+), `acs` (clusters 1.26+ with scheduler 6.7.1+). |
| `nodeSelector` | Selects nodes in this unit by their labels. |
| `max` | Maximum number of pod replicas schedulable to this unit. Available in scheduler version 5.0 or later. |
| `maxResources` | Maximum amount of resources schedulable to pods in this unit. Available in scheduler version 6.9.5 or later. |
| `podLabels` | Labels added to pods scheduled to this unit. Only pods with these labels are counted for this unit. |
| `podAnnotations` | Annotations added to pods scheduled to this unit. Only pods with these annotations are counted for this unit. |
The `elastic` resource type is being deprecated. Instead, use auto-scaling node pools by setting `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"` in `podLabels`.
The `acs` type adds the `alibabacloud.com/compute-qos: default` and `alibabacloud.com/compute-class: general-purpose` labels to pods by default. Override these by specifying different values in `podLabels`. If `alpha.alibabacloud.com/compute-qos-strategy` is specified in `podAnnotations`, the `alibabacloud.com/compute-qos: default` label is not added.
The `acs` and `eci` types add tolerations for virtual node taints to pods by default. The scheduler adds these tolerations internally: they do not appear in the pod spec, and pods can be scheduled to virtual nodes without additional taint toleration configuration.
In scheduler versions earlier than 6.8.3, you cannot use multiple units of the `acs` type at the same time.
If a unit's `podLabels` include `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`, or if the number of pods in the unit is less than the `max` value, the scheduler holds the pod in the current unit until a condition is met. Set the maximum wait duration in `whenTryNextUnits`. The `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"` label is not applied to the pod and is not required for pod counting.
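The hold-then-fall-back behavior described above can be sketched as follows. This is a minimal example, not a definitive configuration: the node pool ID and the `app: web` selector are placeholders, and `whenTryNextUnits` is described in detail in the next section.

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: wait-for-scaling
  namespace: default
spec:
  selector:
    app: web   # Hypothetical pod label.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np-example****   # Placeholder node pool ID.
    podLabels:
      # Hold pods in this unit while the auto-scaling node pool creates nodes.
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"
  - resource: eci   # Fallback unit used after the wait expires.
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 5m     # Maximum hold duration before trying the next unit.
```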
Advanced configuration fields
| Field | Available from | Description |
|---|---|---|
| `preemptPolicy` | Scheduler v6.1 | Controls when preemption is attempted across units. `BeforeNextUnit`: attempt preemption each time a unit fails. `AfterAllUnits` (default): attempt preemption only after all units fail. Not applicable to ACS. See Enable preemption. |
| `ignorePreviousPod` | Scheduler v6.1 | When `true`, pods created before the ResourcePolicy are excluded from pod counting. Must be used with `max`. |
| `ignoreTerminatingPod` | Scheduler v6.1 | When `true`, pods in the Terminating state are excluded from pod counting. Must be used with `max`. |
| `matchLabelKeys` | Scheduler v6.2 | Groups pods by label values and applies `max` per group. Pods missing a declared label are rejected by the scheduler. Must be used with `max`. |
| `whenTryNextUnits` | Cluster 1.24+, scheduler 6.4+ | Defines when a pod is allowed to move to the next unit. See whenTryNextUnits policies below. |
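As an illustration of `matchLabelKeys`, the following sketch applies `max` per Deployment revision by grouping on `pod-template-hash`, so old and new ReplicaSets are counted separately during a rolling update. The `app: web` selector and unit labels are hypothetical:

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: per-revision-max
  namespace: default
spec:
  selector:
    app: web            # Hypothetical pod label.
  strategy: prefer
  matchLabelKeys:
  - pod-template-hash   # Count pods separately for each ReplicaSet revision.
  units:
  - resource: ecs
    max: 5              # At most 5 pods of each revision in this unit.
    nodeSelector:
      unit: first       # Hypothetical node label.
  - resource: eci
```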
whenTryNextUnits policies
| Policy | Moves to next unit when... | Best for |
|---|---|---|
| `LackResourceOrExceedMax` (default) | The current unit runs out of resources, or its pod count reaches `max`. | Most general use cases |
| `ExceedMax` | `max` and `maxResources` are not set, or the pod count reaches `max`, or the resources used in the current unit plus the resources of the current pod exceed `maxResources`. | Prioritizing auto-scaling of node pools over ECI |
| `TimeoutOrExceedMax` | (1) `max` is set and the pod count is below `max`, or `maxResources` is set and current usage plus the current pod's resources are below `maxResources`; or (2) `max` is not set and `podLabels` contain `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`. In either case, if the unit has insufficient resources, the pod waits up to `timeout` before moving on. | Node pool scale-out with ECI fallback after timeout |
| `LackResourceAndNoTerminating` | Resources are insufficient (or `max` is reached) and no pods in the current unit are Terminating. | Rolling updates: prevents new pods from spilling to the next unit while old pods terminate |
The `timeout` field applies only when `policy` is `TimeoutOrExceedMax`. The default is 15 minutes. The timeout is not supported for ACS units, which are limited only by `max`.
If the auto-scaling node pool cannot create nodes for a long time, `ExceedMax` may leave pods in Pending indefinitely. The Cluster Autoscaler does not currently respect the `max` limit in ResourcePolicy, so the actual number of created instances may exceed `max`. This will be addressed in a future release.
With `TimeoutOrExceedMax`, if a node is created during the timeout period but is not yet Ready, and the pod does not tolerate the NotReady taint, the pod is still scheduled to ECI.
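A minimal sketch of the `ExceedMax` pattern from the table: up to `max` pods stay bound to the node pool unit while it scales out, and only the overflow falls back to ECI. The node pool ID and the `app: web` selector are placeholders:

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: pool-before-eci
  namespace: default
spec:
  selector:
    app: web   # Hypothetical pod label.
  strategy: prefer
  units:
  - resource: ecs
    max: 10    # Keep up to 10 pods waiting for node pool capacity.
    nodeSelector:
      alibabacloud.com/nodepool-id: np-example****   # Placeholder node pool ID.
  - resource: eci   # Only pods beyond max are scheduled here.
  whenTryNextUnits:
    policy: ExceedMax
```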
Scenario examples
These scenarios produce best-effort results. Pod removal order during scale-in may not strictly follow the reverse scheduling order in all circumstances.
Prioritize one node pool over another
Goal: Deploy a Deployment across two node pools — Pool A first, Pool B as overflow. During scale-in, remove pods from Pool B first.
In this example, nodes cn-beijing.10.0.3.137 and cn-beijing.10.0.3.138 belong to Pool A, and cn-beijing.10.0.6.47 and cn-beijing.10.0.6.46 belong to Pool B. All nodes have 2 vCPUs and 4 GB of memory.
1. Create a ResourcePolicy that sets the node pool scheduling order. Replace the `nodepool-id` values with your actual node pool IDs, which you can find on the Node Management > Node Pools page. See Create and manage a node pool.

   ```yaml
   apiVersion: scheduling.alibabacloud.com/v1alpha1
   kind: ResourcePolicy
   metadata:
     name: nginx
     namespace: default
   spec:
     selector:
       app: nginx # Must match the pod label in the Deployment below
     strategy: prefer
     units:
     - resource: ecs
       nodeSelector:
         alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
     - resource: ecs
       nodeSelector:
         alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
   ```

2. Create a Deployment. The pod label `app: nginx` must match the `selector` in the ResourcePolicy.

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx
     labels:
       app: nginx
   spec:
     replicas: 2
     selector:
       matchLabels:
         app: nginx
     template:
       metadata:
         name: nginx
         labels:
           app: nginx # Must match the ResourcePolicy selector
       spec:
         containers:
         - name: nginx
           image: nginx
           resources:
             limits:
               cpu: 2
             requests:
               cpu: 2
   ```

3. Apply the Deployment and verify pod placement.

   1. Apply the YAML files.

      ```shell
      kubectl apply -f nginx.yaml
      ```

      Expected output:

      ```
      deployment.apps/nginx created
      ```

   2. Check which nodes the pods are scheduled to.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          17s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-k****   1/1     Running   0          17s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
      ```

      Both pods are on Pool A nodes, as expected.

4. Scale out to four replicas and verify overflow to Pool B.

   1. Scale the Deployment.

      ```shell
      kubectl scale deployment nginx --replicas 4
      ```

   2. Check pod placement.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE    IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          101s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-k****   1/1     Running   0          101s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
      nginx-9cdf7bbf9-m****   1/1     Running   0          18s    172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-x****   1/1     Running   0          18s    172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
      ```

      The two new pods overflow to Pool B nodes, because Pool A is at capacity.

5. Scale in to two replicas and verify that Pool B pods are removed first.

   1. Scale the Deployment.

      ```shell
      kubectl scale deployment nginx --replicas 2
      ```

   2. Check pod status.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running       0          2m41s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-k****   1/1     Running       0          2m41s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
      nginx-9cdf7bbf9-m****   0/1     Terminating   0          78s     172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-x****   0/1     Terminating   0          78s     172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
      ```

      Pool B pods are removed first, which is the reverse of the scheduling order.
Use subscription ECS first, then pay-as-you-go ECS, then fall back to ECI
Goal: Minimize costs by filling subscription ECS capacity first, then pay-as-you-go ECS, and finally ECI. During scale-in, remove pods in reverse order: ECI first, then pay-as-you-go ECS, then subscription ECS.
In this example, all nodes have 2 vCPUs and 4 GB of memory.
1. Label the nodes by billing type. If you use node pools, configure labels at the node pool level instead.

   ```shell
   kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
   kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
   kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
   kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
   ```

2. Create a ResourcePolicy that orders units by billing type.

   ```yaml
   apiVersion: scheduling.alibabacloud.com/v1alpha1
   kind: ResourcePolicy
   metadata:
     name: nginx
     namespace: default
   spec:
     selector:
       app: nginx # Must match the pod label in the Deployment below
     strategy: prefer
     units:
     - resource: ecs
       nodeSelector:
         paidtype: subscription
     - resource: ecs
       nodeSelector:
         paidtype: pay-as-you-go
     - resource: eci
   ```

3. Create a Deployment with two replicas.

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx
     labels:
       app: nginx
   spec:
     replicas: 2
     selector:
       matchLabels:
         app: nginx
     template:
       metadata:
         name: nginx
         labels:
           app: nginx # Must match the ResourcePolicy selector
       spec:
         containers:
         - name: nginx
           image: nginx
           resources:
             limits:
               cpu: 2
             requests:
               cpu: 2
   ```

4. Apply and verify initial placement on subscription nodes.

   1. Apply the YAML files.

      ```shell
      kubectl apply -f nginx.yaml
      ```

   2. Check pod placement.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          66s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          66s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
      ```

      Both pods are on subscription nodes.

5. Scale out to verify overflow to pay-as-you-go ECS and then ECI.

   1. Scale to four replicas and check pod placement.

      ```shell
      kubectl scale deployment nginx --replicas 4
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   1/1     Running   0          16s     172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running   0          3m48s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-f****   1/1     Running   0          16s     172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          3m48s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
      ```

      Overflow pods are scheduled to pay-as-you-go nodes.

   2. Scale to six replicas and check pod placement.

      ```shell
      kubectl scale deployment nginx --replicas 6
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   1/1     Running   0          3m10s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running   0          6m42s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
      nginx-9cdf7bbf9-f****   1/1     Running   0          3m10s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          6m42s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
      nginx-9cdf7bbf9-s****   1/1     Running   0          36s     10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
      nginx-9cdf7bbf9-v****   1/1     Running   0          36s     10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
      ```

      When all ECS capacity is exhausted, the remaining pods are scheduled to ECI (virtual-kubelet nodes).

6. Scale in to verify reverse removal order.

   1. Scale to four replicas. ECI pods are removed first.

      ```shell
      kubectl scale deployment nginx --replicas 4
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   1/1     Running       0          4m59s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running       0          8m31s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
      nginx-9cdf7bbf9-f****   1/1     Running       0          4m59s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running       0          8m31s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
      nginx-9cdf7bbf9-s****   1/1     Terminating   0          2m25s   10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
      nginx-9cdf7bbf9-v****   1/1     Terminating   0          2m25s   10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
      ```

      ECI pods are the first to be removed.

   2. Scale to two replicas. Pay-as-you-go ECS pods are removed next.

      ```shell
      kubectl scale deployment nginx --replicas 2
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   0/1     Terminating   0          6m43s   172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running       0          10m     172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-f****   0/1     Terminating   0          6m43s   172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running       0          10m     172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
      ```

   3. After termination completes, only the subscription ECS pods remain.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          11m   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          11m   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
      ```
Troubleshooting
Pods are stuck in Pending after applying a ResourcePolicy
The scheduler may not be associating the ResourcePolicy with the correct pods. Check that the `selector` in the ResourcePolicy exactly matches the pod labels in your workload. If the selector uses a label that the system reserves (such as `alibabacloud.com/compute-class`), the system may modify it, breaking the association.
Also confirm your kube-scheduler version meets the minimum requirement for your cluster version (see Prerequisites).
Scale-in does not follow the expected reverse order
This feature is best-effort. The scheduler does not guarantee strict reverse-order removal in all cases — for example, when preemption is active or when multiple pods become eligible for removal simultaneously.
If you require stricter ordering, check the `whenTryNextUnits.policy` setting and consider `LackResourceAndNoTerminating` for rolling update scenarios.
ResourcePolicy conflicts with pod-deletion-cost
If you have configured pod-deletion-cost annotations on pods in the same workload, the two features conflict and cannot be used together. Remove pod-deletion-cost annotations before applying a ResourcePolicy.
Node pool creates unexpected nodes when used with elastic node pools
When an auto-scaling node pool is included in a unit with the `max` field set, the Cluster Autoscaler may create more nodes than the `max` value, because it does not currently read the `max` limit from ResourcePolicy. To avoid this, include the elastic node pool in a unit and do not set `max` for that unit.
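Following that guidance, a minimal sketch (node pool IDs and the `app: web` selector are placeholders) that keeps the elastic node pool in its own unit without a `max`:

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: no-max-on-elastic
  namespace: default
spec:
  selector:
    app: web   # Hypothetical pod label.
  strategy: prefer
  units:
  - resource: ecs
    max: 6                        # max is safe on a static node pool.
    nodeSelector:
      alibabacloud.com/nodepool-id: np-static****    # Placeholder ID.
  - resource: ecs                 # Elastic (auto-scaling) node pool: no max here.
    nodeSelector:
      alibabacloud.com/nodepool-id: np-elastic****   # Placeholder ID.
```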
What's next
- To declare that only ECS or ECI resources are used, or to automatically request ECI when ECS is insufficient, configure tolerations and node affinity. See Specify resource allocation for ECS and ECI.
- For zone-based discretization and affinity scheduling of ECI pods in ACK managed cluster Pro edition, see Implement zone-based discretization and affinity scheduling for ECI pods.