Custom elastic resource priority scheduling lets you define the order in which pods are scheduled across different resource types and node pools. Create a ResourcePolicy to set this order: during scale-out, pods are scheduled to resource units in the order you define; during scale-in, pods are removed in reverse order.
Do not use system-reserved labels such as alibabacloud.com/compute-class or alibabacloud.com/compute-qos in workload label selectors (for example, the spec.selector.matchLabels field of a Deployment). The system may modify these labels during custom priority scheduling, causing the controller to frequently rebuild pods and affecting application stability.
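For example, a safe workload selector matches only labels your application owns. The following sketch uses a hypothetical `app: web` Deployment to illustrate the rule:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web   # Application-owned label: safe to select on.
      # Do NOT add system-reserved labels such as
      # alibabacloud.com/compute-class or alibabacloud.com/compute-qos here.
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
```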
Prerequisites
Before you begin, ensure that you have:
- An ACK managed cluster Pro edition, version 1.20.11 or later. For upgrade steps, see Manually upgrade a cluster.
- A kube-scheduler version that meets the requirements for your ACK cluster version. See kube-scheduler for a full list of supported features per version.

  | ACK version | Scheduler version |
  |---|---|
  | 1.20 | v1.20.4-ack-7.0 or later |
  | 1.22 | v1.22.15-ack-2.0 or later |
  | 1.24 or later | All versions supported |

- (Required for ECI resources) The ack-virtual-node component deployed in your cluster. See Use ECI in ACK.
Usage notes
- Best-effort ordering: This feature uses a BestEffort policy. Pod scale-in does not strictly follow the reverse of the scheduling order in all cases.
- Starting from scheduler version v1.x.x-aliyun-6.4, the default value of `ignorePreviousPod` changed to `false`, and `ignoreTerminatingPod` changed to `true`. Existing ResourcePolicy objects and subsequent updates are not affected.
- This feature conflicts with pod-deletion-cost, and the two cannot be used together.
- This feature cannot be used with Elastic Container Instance (ECI) elastic scheduling implemented through ElasticResource. See Use ElasticResource for elastic scheduling of ECI pods.
- The `max` field is available only in clusters of version 1.22 or later with scheduler version 5.0 or later.
- When used with elastic node pools, this feature may cause node pools to create invalid nodes. To prevent this, include the elastic node pool in a unit and do not set the `max` field for that unit.
- If your scheduler version is earlier than 5.0 or your cluster version is 1.20 or earlier, pods that exist before the ResourcePolicy is created are the first to be scaled in.
- If your scheduler version is earlier than 6.1 or your cluster version is 1.20 or earlier, do not modify a ResourcePolicy while its associated pods are not yet completely deleted.
- When used with auto-scaling, this feature must be used with instant elasticity. Otherwise, the Cluster Autoscaler may trigger incorrect node pool scaling.
Create a ResourcePolicy
Define a ResourcePolicy with the following YAML structure:
```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    podLabels:
      key1: value1
    podAnnotations:
      key1: value1
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
  # Optional advanced configuration
  preemptPolicy: AfterAllUnits
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  matchLabelKeys:
  - pod-template-hash
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 1m
```
spec fields
| Field | Description |
|---|---|
| `selector` | Selects pods that have matching labels in the same namespace. If empty, applies to all pods in the namespace. |
| `strategy` | Scheduling strategy. Only `prefer` is supported. |
| `units` | Ordered list of scheduling units. Pods are scheduled in list order during scale-out and removed in reverse order during scale-in. |
units fields
| Field | Description |
|---|---|
| `resource` | Resource type for this unit. Supported values: `ecs`, `eci`, `elastic` (clusters 1.24+ with scheduler 6.4.3+), `acs` (clusters 1.26+ with scheduler 6.7.1+). |
| `nodeSelector` | Selects nodes in this unit by their labels. |
| `max` | Maximum number of pod replicas schedulable to this unit. Available in scheduler version 5.0 or later. |
| `maxResources` | Maximum amount of resources schedulable to pods in this unit. Available in scheduler version 6.9.5 or later. |
| `podLabels` | Labels added to pods scheduled to this unit. Only pods with these labels are counted for this unit. |
| `podAnnotations` | Annotations added to pods scheduled to this unit. Only pods with these annotations are counted for this unit. |
The `elastic` resource type is being deprecated. Instead, use auto-scaling node pools by setting `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"` in `podLabels`.
The `acs` type adds the `alibabacloud.com/compute-qos: default` and `alibabacloud.com/compute-class: general-purpose` labels to pods by default. Override these by specifying different values in `podLabels`. If `alpha.alibabacloud.com/compute-qos-strategy` is specified in `podAnnotations`, the `alibabacloud.com/compute-qos: default` label is not added.
The `acs` and `eci` types add tolerations for virtual node taints to pods by default. The scheduler adds these tolerations internally: they do not appear in the pod spec, and pods can be scheduled to virtual nodes without additional taint toleration configuration.
In scheduler versions earlier than 6.8.3, you cannot use multiple units of the `acs` type at the same time.
If a unit's `podLabels` include `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`, or if the number of pods in the unit is less than the `max` value, the scheduler holds the pod in the current unit until a condition is met. Set the maximum wait duration in `whenTryNextUnits`. The `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"` label is not applied to the pod and is not required for pod counting.
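The hold-then-fall-back behavior described above can be sketched as follows. This is a minimal example, not a definitive configuration: the node pool ID and the `app: web` selector are placeholders, and `whenTryNextUnits` is described in detail in the next section.

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: wait-for-scaling
  namespace: default
spec:
  selector:
    app: web   # Hypothetical pod label.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np-example****   # Placeholder node pool ID.
    podLabels:
      # Hold pods in this unit while the auto-scaling node pool creates nodes.
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"
  - resource: eci   # Fallback unit used after the wait expires.
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 5m     # Maximum hold duration before trying the next unit.
```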
Advanced configuration fields
| Field | Available from | Description |
|---|---|---|
| `preemptPolicy` | Scheduler v6.1 | Controls when preemption is attempted across units. `BeforeNextUnit`: attempt preemption each time a unit fails. `AfterAllUnits` (default): attempt preemption only after all units fail. Not applicable to ACS. See Enable preemption. |
| `ignorePreviousPod` | Scheduler v6.1 | When `true`, pods created before the ResourcePolicy are excluded from pod counting. Must be used with `max`. |
| `ignoreTerminatingPod` | Scheduler v6.1 | When `true`, pods in the Terminating state are excluded from pod counting. Must be used with `max`. |
| `matchLabelKeys` | Scheduler v6.2 | Groups pods by label values and applies `max` per group. Pods missing a declared label are rejected by the scheduler. Must be used with `max`. |
| `whenTryNextUnits` | Cluster 1.24+, scheduler 6.4+ | Defines when a pod is allowed to move to the next unit. See whenTryNextUnits policies below. |
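As an illustration of `matchLabelKeys`, the following sketch applies `max` per Deployment revision by grouping on `pod-template-hash`, so old and new ReplicaSets are counted separately during a rolling update. The `app: web` selector and unit labels are hypothetical:

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: per-revision-max
  namespace: default
spec:
  selector:
    app: web            # Hypothetical pod label.
  strategy: prefer
  matchLabelKeys:
  - pod-template-hash   # Count pods separately for each ReplicaSet revision.
  units:
  - resource: ecs
    max: 5              # At most 5 pods of each revision in this unit.
    nodeSelector:
      unit: first       # Hypothetical node label.
  - resource: eci
```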
whenTryNextUnits policies
| Policy | Moves to next unit when... | Best for |
|---|---|---|
| `LackResourceOrExceedMax` (default) | The current unit runs out of resources, or its pod count reaches `max`. | Most general use cases |
| `ExceedMax` | `max` and `maxResources` are not set, or the pod count reaches `max`, or the resources used in the current unit plus the resources of the current pod exceed `maxResources`. | Prioritizing auto-scaling of node pools over ECI |
| `TimeoutOrExceedMax` | (1) `max` is set and the pod count is below `max`, or `maxResources` is set and current usage plus the current pod's resources are below `maxResources`; or (2) `max` is not set and `podLabels` contain `k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"`. In either case, if the unit has insufficient resources, the pod waits up to `timeout` before moving on. | Node pool scale-out with ECI fallback after timeout |
| `LackResourceAndNoTerminating` | Resources are insufficient (or `max` is reached) and no pods in the current unit are Terminating. | Rolling updates: prevents new pods from spilling to the next unit while old pods terminate |
The `timeout` field applies only when `policy` is `TimeoutOrExceedMax`. The default is 15 minutes. The timeout is not supported for ACS units, which are limited only by `max`.
If the auto-scaling node pool cannot create nodes for a long time, `ExceedMax` may leave pods in Pending indefinitely. The Cluster Autoscaler does not currently respect the `max` limit in ResourcePolicy, so the actual number of created instances may exceed `max`. This will be addressed in a future release.
With `TimeoutOrExceedMax`, if a node is created during the timeout period but is not yet Ready, and the pod does not tolerate the NotReady taint, the pod is still scheduled to ECI.
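A minimal sketch of the `ExceedMax` pattern from the table: up to `max` pods stay bound to the node pool unit while it scales out, and only the overflow falls back to ECI. The node pool ID and the `app: web` selector are placeholders:

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: pool-before-eci
  namespace: default
spec:
  selector:
    app: web   # Hypothetical pod label.
  strategy: prefer
  units:
  - resource: ecs
    max: 10    # Keep up to 10 pods waiting for node pool capacity.
    nodeSelector:
      alibabacloud.com/nodepool-id: np-example****   # Placeholder node pool ID.
  - resource: eci   # Only pods beyond max are scheduled here.
  whenTryNextUnits:
    policy: ExceedMax
```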
Scenario examples
These scenarios produce best-effort results. Pod removal order during scale-in may not strictly follow the reverse scheduling order in all circumstances.
Prioritize one node pool over another
Goal: Deploy a Deployment across two node pools — Pool A first, Pool B as overflow. During scale-in, remove pods from Pool B first.
In this example, nodes cn-beijing.10.0.3.137 and cn-beijing.10.0.3.138 belong to Pool A, and cn-beijing.10.0.6.47 and cn-beijing.10.0.6.46 belong to Pool B. All nodes have 2 vCPUs and 4 GB of memory.
1. Create a ResourcePolicy that sets the node pool scheduling order. Replace the `nodepool-id` values with your actual node pool IDs, which you can find on the Node Management > Node Pools page. See Create and manage a node pool.

   ```yaml
   apiVersion: scheduling.alibabacloud.com/v1alpha1
   kind: ResourcePolicy
   metadata:
     name: nginx
     namespace: default
   spec:
     selector:
       app: nginx # Must match the pod label in the Deployment below
     strategy: prefer
     units:
     - resource: ecs
       nodeSelector:
         alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
     - resource: ecs
       nodeSelector:
         alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
   ```

2. Create a Deployment. The pod label `app: nginx` must match the `selector` in the ResourcePolicy.

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx
     labels:
       app: nginx
   spec:
     replicas: 2
     selector:
       matchLabels:
         app: nginx
     template:
       metadata:
         name: nginx
         labels:
           app: nginx # Must match the ResourcePolicy selector
       spec:
         containers:
         - name: nginx
           image: nginx
           resources:
             limits:
               cpu: 2
             requests:
               cpu: 2
   ```

3. Apply the Deployment and verify pod placement.

   1. Apply the YAML files.

      ```shell
      kubectl apply -f nginx.yaml
      ```

      Expected output:

      ```
      deployment.apps/nginx created
      ```

   2. Check which nodes the pods are scheduled to.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          17s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-k****   1/1     Running   0          17s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
      ```

      Both pods are on Pool A nodes, as expected.

4. Scale out to four replicas and verify overflow to Pool B.

   1. Scale the Deployment.

      ```shell
      kubectl scale deployment nginx --replicas 4
      ```

   2. Check pod placement.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE    IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          101s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-k****   1/1     Running   0          101s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
      nginx-9cdf7bbf9-m****   1/1     Running   0          18s    172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-x****   1/1     Running   0          18s    172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
      ```

      The two new pods overflow to Pool B nodes, because Pool A is at capacity.

5. Scale in to two replicas and verify that Pool B pods are removed first.

   1. Scale the Deployment.

      ```shell
      kubectl scale deployment nginx --replicas 2
      ```

   2. Check pod status.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running       0          2m41s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-k****   1/1     Running       0          2m41s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
      nginx-9cdf7bbf9-m****   0/1     Terminating   0          78s     172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-x****   0/1     Terminating   0          78s     172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
      ```

      Pool B pods are removed first, which is the reverse of the scheduling order.
Use subscription ECS first, then pay-as-you-go ECS, then fall back to ECI
Goal: Minimize costs by filling subscription ECS capacity first, then pay-as-you-go ECS, and finally ECI. During scale-in, remove pods in reverse order: ECI first, then pay-as-you-go ECS, then subscription ECS.
In this example, all nodes have 2 vCPUs and 4 GB of memory.
1. Label the nodes by billing type. If you use node pools, configure labels at the node pool level instead.

   ```shell
   kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
   kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
   kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
   kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
   ```

2. Create a ResourcePolicy that orders units by billing type.

   ```yaml
   apiVersion: scheduling.alibabacloud.com/v1alpha1
   kind: ResourcePolicy
   metadata:
     name: nginx
     namespace: default
   spec:
     selector:
       app: nginx # Must match the pod label in the Deployment below
     strategy: prefer
     units:
     - resource: ecs
       nodeSelector:
         paidtype: subscription
     - resource: ecs
       nodeSelector:
         paidtype: pay-as-you-go
     - resource: eci
   ```

3. Create a Deployment with two replicas.

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: nginx
     labels:
       app: nginx
   spec:
     replicas: 2
     selector:
       matchLabels:
         app: nginx
     template:
       metadata:
         name: nginx
         labels:
           app: nginx # Must match the ResourcePolicy selector
       spec:
         containers:
         - name: nginx
           image: nginx
           resources:
             limits:
               cpu: 2
             requests:
               cpu: 2
   ```

4. Apply and verify initial placement on subscription nodes.

   1. Apply the YAML files.

      ```shell
      kubectl apply -f nginx.yaml
      ```

   2. Check pod placement.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          66s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          66s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
      ```

      Both pods are on subscription nodes.

5. Scale out to verify overflow to pay-as-you-go ECS and then ECI.

   1. Scale to four replicas and check pod placement.

      ```shell
      kubectl scale deployment nginx --replicas 4
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   1/1     Running   0          16s     172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running   0          3m48s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-f****   1/1     Running   0          16s     172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          3m48s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
      ```

      Overflow pods are scheduled to pay-as-you-go nodes.

   2. Scale to six replicas and check pod placement.

      ```shell
      kubectl scale deployment nginx --replicas 6
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   1/1     Running   0          3m10s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running   0          6m42s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
      nginx-9cdf7bbf9-f****   1/1     Running   0          3m10s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          6m42s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
      nginx-9cdf7bbf9-s****   1/1     Running   0          36s     10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
      nginx-9cdf7bbf9-v****   1/1     Running   0          36s     10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
      ```

      When all ECS capacity is exhausted, the remaining pods are scheduled to ECI (virtual-kubelet nodes).

6. Scale in to verify reverse removal order.

   1. Scale to four replicas. ECI pods are removed first.

      ```shell
      kubectl scale deployment nginx --replicas 4
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   1/1     Running       0          4m59s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running       0          8m31s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
      nginx-9cdf7bbf9-f****   1/1     Running       0          4m59s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running       0          8m31s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
      nginx-9cdf7bbf9-s****   1/1     Terminating   0          2m25s   10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
      nginx-9cdf7bbf9-v****   1/1     Terminating   0          2m25s   10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
      ```

      ECI pods are the first to be removed.

   2. Scale to two replicas. Pay-as-you-go ECS pods are removed next.

      ```shell
      kubectl scale deployment nginx --replicas 2
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   0/1     Terminating   0          6m43s   172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running       0          10m     172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-f****   0/1     Terminating   0          6m43s   172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running       0          10m     172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
      ```

   3. After termination completes, only the subscription ECS pods remain.

      ```shell
      kubectl get pods -o wide
      ```

      Expected output:

      ```
      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          11m   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          11m   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
      ```
Troubleshooting
Pods are stuck in Pending after applying a ResourcePolicy
The scheduler may not be associating the ResourcePolicy with the correct pods. Check that the `selector` in the ResourcePolicy exactly matches the pod labels in your workload. If the selector uses a label that the system reserves (such as `alibabacloud.com/compute-class`), the system may modify it, breaking the association.
Also confirm your kube-scheduler version meets the minimum requirement for your cluster version (see Prerequisites).
Scale-in does not follow the expected reverse order
This feature is best-effort. The scheduler does not guarantee strict reverse-order removal in all cases — for example, when preemption is active or when multiple pods become eligible for removal simultaneously.
If you require stricter ordering, check the `whenTryNextUnits.policy` setting and consider `LackResourceAndNoTerminating` for rolling update scenarios.
ResourcePolicy conflicts with pod-deletion-cost
If you have configured pod-deletion-cost annotations on pods in the same workload, the two features conflict and cannot be used together. Remove pod-deletion-cost annotations before applying a ResourcePolicy.
Node pool creates unexpected nodes when used with elastic node pools
When an auto-scaling node pool is included in a unit with the `max` field set, the Cluster Autoscaler may create more nodes than the `max` value, because it does not currently read the `max` limit from ResourcePolicy. To avoid this, include the elastic node pool in a unit and do not set `max` for that unit.
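Following that guidance, a minimal sketch (node pool IDs and the `app: web` selector are placeholders) that keeps the elastic node pool in its own unit without a `max`:

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: no-max-on-elastic
  namespace: default
spec:
  selector:
    app: web   # Hypothetical pod label.
  strategy: prefer
  units:
  - resource: ecs
    max: 6                        # max is safe on a static node pool.
    nodeSelector:
      alibabacloud.com/nodepool-id: np-static****    # Placeholder ID.
  - resource: ecs                 # Elastic (auto-scaling) node pool: no max here.
    nodeSelector:
      alibabacloud.com/nodepool-id: np-elastic****   # Placeholder ID.
```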
What's next
- To declare that only ECS or ECI resources are used, or to automatically request ECI when ECS is insufficient, configure tolerations and node affinity. See Specify resource allocation for ECS and ECI.
- For zone-based discretization and affinity scheduling of ECI pods in ACK managed cluster Pro edition, see Implement zone-based discretization and affinity scheduling for ECI pods.