Container Service for Kubernetes: Customize elastic resource scheduling priorities

Last Updated: Nov 07, 2025

Custom elastic resource priority scheduling is an elastic scheduling policy from Alibaba Cloud. This policy lets you define a custom resource policy (ResourcePolicy) during application deployment or scale-out. The ResourcePolicy specifies the order for scheduling application instance pods to different types of node resources. During a scale-in, pods are removed in the reverse of the scheduling order.

Warning

Do not use system-reserved labels, such as alibabacloud.com/compute-class and alibabacloud.com/compute-qos, in the spec.selector.matchLabels field of a workload, such as a deployment. The scheduler may modify these labels during custom elastic resource priority scheduling, which can cause the workload controller to repeatedly recreate pods and affect application stability.

Prerequisites

  • An ACK managed cluster (Pro version) that runs v1.20.11 or later has been created. To upgrade a cluster, see Manually upgrade a cluster.

  • The scheduler version must meet the requirements for your ACK cluster version. For more information about the features supported by different scheduler versions, see kube-scheduler.

    | ACK version   | Scheduler version          |
    | ------------- | -------------------------- |
    | 1.20          | v1.20.4-ack-7.0 or later   |
    | 1.22          | v1.22.15-ack-2.0 or later  |
    | 1.24 or later | All versions are supported |

  • To use ECI resources, ensure that the ack-virtual-node component is deployed. For more information, see Use ECI in ACK.
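
    If ack-virtual-node is deployed, the cluster contains virtual nodes. In the example outputs later in this topic, these nodes are named virtual-kubelet-*, so a quick, informal check is to filter the node list by that name pattern:

      kubectl get nodes | grep virtual-kubelet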

Notes

  • Starting from scheduler version v1.x.x-aliyun-6.4, the default value of the ignorePreviousPod field for custom elastic resource priorities is changed to False, and the default value of the ignoreTerminatingPod field is changed to True. This change does not affect existing ResourcePolicy configurations or subsequent updates to them.

  • This feature conflicts with pod-deletion-cost and cannot be used at the same time.

  • This feature cannot be used with ECI elastic scheduling through ElasticResource.

  • This feature uses a BestEffort policy and does not guarantee that scale-in operations strictly follow the reverse order.

  • The `max` field is available only in clusters that run v1.22 or later with a scheduler of v5.0 or later.

  • When used with an elastic node pool, this feature may cause the node pool to scale out nodes that do not meet expectations. To use this feature with an elastic node pool, include the elastic node pool in a Unit and do not set the `max` field for that Unit, as shown in the sketch after this list.

  • If your scheduler version is earlier than 5.0 or your cluster version is 1.20 or earlier, pods that exist before the ResourcePolicy is created are the first to be scaled in.

  • If your scheduler version is earlier than 6.1 or your cluster version is 1.20 or earlier, do not modify the ResourcePolicy until all associated pods are completely deleted.
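
For the elastic node pool case mentioned in the notes above, a minimal ResourcePolicy sketch might look like the following. The node pool IDs and the selector label are placeholders, and the fields are described in the Usage section below. The point is that the elastic node pool gets its own Unit and that no `max` is set for that Unit:

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: elastic-pool-example        # hypothetical name
  namespace: default
spec:
  selector:
    app: demo                       # placeholder: must match the pod label
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np-regular-****      # regular node pool
    max: 10
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np-elastic-****      # elastic node pool, no max set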

Usage

You can create a ResourcePolicy to define elastic resource priorities:

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    podLabels:
      key1: value1
    podAnnotations:
      key1: value1
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
  # Optional, Advanced Configurations
  preemptPolicy: AfterAllUnits
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  matchLabelKeys:
  - pod-template-hash
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 1m
  • selector: Specifies that the ResourcePolicy applies to pods that are in the same namespace and have the label key1=value1. If the selector is empty, the policy applies to all pods in the namespace.

  • strategy: The scheduling strategy. Currently, only prefer is supported.

  • units: User-defined scheduling units. During a scale-out, resources are created in the order defined in units. During a scale-in, resources are removed in the reverse order.

    • resource: The type of elastic resource. The supported values are eci, ecs, elastic, and acs. The elastic type is available in clusters of v1.24 or later with a scheduler of v6.4.3 or later. The acs type is available in clusters of v1.26 or later with a scheduler of v6.7.1 or later.

      Note

      The elastic type will be deprecated. We recommend that you use auto-scaling node pools by setting k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" in the pod labels.

      Note

      The acs type adds the alibabacloud.com/compute-class: general-purpose and alibabacloud.com/compute-qos: default labels to pods by default. You can overwrite the default values by declaring different values in the pod labels. If alpha.alibabacloud.com/compute-qos-strategy is declared in the pod annotations, the alibabacloud.com/compute-qos: default label is not added by default.

      Note

      The acs and eci types add tolerations for virtual node taints to pods by default. Pods can be scheduled to virtual nodes without requiring additional taint toleration configurations.

      Important

      In scheduler versions earlier than 6.8.3, you cannot use multiple acs Units at the same time.

    • nodeSelector: Specifies the label of a node to select nodes for the scheduling unit. This parameter applies only to ECS resources.

    • max (available for scheduler v5.0 and later): The maximum number of pod replicas that can be scheduled in this scheduling unit.

    • maxResources (available for scheduler v6.9.5 and later): The maximum amount of resources that can be scheduled for pods in this scheduling unit.

    • podAnnotations: The type is map[string]string{}. The key-value pairs configured in podAnnotations are added to the pod by the scheduler. When counting the number of pods in this Unit, only pods with these key-value pairs are counted.

    • podLabels: The type is map[string]string{}. The key-value pairs configured in podLabels are added to the pod by the scheduler. When counting the number of pods in this Unit, only pods with these key-value pairs are counted.

      Note

      If k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" is included in the `podLabels` of a Unit and the number of pods in the current Unit is less than the specified `max` value, the scheduler makes the pod wait in the current Unit. You can set the waiting time in whenTryNextUnits. The k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true" label itself is not added to the pod, and pods do not need to carry this label to be counted toward the Unit.

      Note

      When you use ResourcePolicy with auto scaling, you must also use instant elasticity. Otherwise, the cluster-autoscaler might trigger incorrect node pool scaling.

  • preemptPolicy (available for scheduler v6.1 and later; this parameter does not take effect for ACS): When a ResourcePolicy contains multiple units, this field specifies whether the scheduler can attempt preemption when scheduling fails for a Unit. `BeforeNextUnit` indicates that the scheduler attempts preemption if scheduling fails for any Unit. `AfterAllUnits` indicates that the scheduler attempts preemption only if scheduling fails for the last Unit. The default value is AfterAllUnits.

    You can configure the ACK Scheduler parameters to enable preemption.
  • ignorePreviousPod (available for scheduler v6.1 and later): This field must be used with max in units. If this field is set to true, pods that were scheduled before the ResourcePolicy was created are ignored when the number of pods is counted.

  • ignoreTerminatingPod (available for scheduler v6.1 and later): This field must be used with max in units. If this field is set to true, pods in the Terminating state are ignored when the number of pods is counted.

  • matchLabelKeys (available for scheduler v6.2 and later): This field must be used with max in units. Pods are grouped based on the values of the specified labels. Different groups of pods are subject to different max counts. If this feature is used and a pod is missing a label that is declared in matchLabelKeys, the pod cannot be scheduled.

  • whenTryNextUnits (available for cluster v1.24 and later with scheduler v6.4 and later): Describes the conditions under which a pod is allowed to use resources from the next Unit.

    • policy: The policy that determines when a pod can be scheduled to the next Unit. Valid values are ExceedMax, LackResourceAndNoTerminating, TimeoutOrExceedMax, and LackResourceOrExceedMax (default).

      • ExceedMax: The pod is allowed to use resources from the next Unit if the `max` and `maxResources` fields for the current Unit are not set, or if the number of pods in the current Unit is greater than or equal to the specified `max` value, or if the amount of used resources in the current Unit plus the resources of the current pod exceeds `maxResources`. This policy can be used with auto scaling and ECI to prioritize auto scaling for node pools.

        Important
        • If the auto-scaling node pool cannot create nodes for a long time, this policy may cause pods to remain in the Pending state.

        • Because Cluster Autoscaler is not aware of the `max` limit in ResourcePolicy, the actual number of created instances may be greater than the specified `max` value. This issue will be fixed in a future release.

      • TimeoutOrExceedMax: This policy takes effect if one of the following conditions is met:

        • The `max` value for the current Unit is set and the number of pods in the Unit is less than the `max` value. Alternatively, `maxResources` is set and the amount of scheduled resources plus the resources of the current pod is less than `maxResources`.

        • The `max` value for the current Unit is not set, and the `podLabels` of the current Unit contain k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true".

        If the current Unit has insufficient resources to schedule the pod, the pod remains pending in that Unit for a maximum duration specified by the timeout value. When used with auto scaling and ECI, this policy prioritizes auto scaling for node pools and automatically uses ECI after the timeout.

        Important

        If a node is created during the timeout period but does not reach the Ready state, and the pod does not tolerate the NotReady taint, the pod is still scheduled to an ECI instance.

      • LackResourceOrExceedMax: If the number of pods in the current Unit is greater than or equal to the specified `max` value, or if there are no more available resources in the current Unit, the pod is allowed to use resources from the next Unit. This is the default policy and is suitable for most basic scenarios.

      • LackResourceAndNoTerminating: The pod is allowed to use resources from the next Unit if the number of pods in the current Unit is greater than or equal to the specified `max` value or the current Unit has no more available resources, and the current Unit contains no pods in the Terminating state. This policy is suitable for use with rolling update strategies to prevent new pods from being rolled out to subsequent Units because of terminating pods.

    • timeout: When `policy` is set to TimeoutOrExceedMax, this field specifies the timeout duration. If this field is empty, the default value is 15 minutes. The `timeout` parameter is not supported in an `acs` Unit, which is limited only by `max`.
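
As an illustration of how these advanced fields fit together, the following sketch keeps pods waiting in an auto-scaling ECS node pool Unit for up to two minutes before falling through to ECI, caps that Unit at five pods per pod-template-hash group, and ignores terminating pods when counting. The names and node labels are placeholders; only fields described above are used:

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: advanced-example            # hypothetical name
  namespace: default
spec:
  selector:
    app: demo                       # placeholder: must match the pod label
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      unit: first                   # placeholder node label
    max: 5
    podLabels:
      # Keep pods waiting in this Unit (up to the timeout below) while the
      # auto-scaling node pool provisions nodes, instead of moving to the
      # next Unit immediately.
      k8s.aliyun.com/resource-policy-wait-for-ecs-scaling: "true"
  - resource: eci
  # Count pods separately for each pod-template-hash value so that the max of
  # 5 applies per ReplicaSet revision rather than across all revisions.
  matchLabelKeys:
  - pod-template-hash
  ignoreTerminatingPod: true
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 2m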

Example scenarios

Scenario 1: Schedule pods based on node pool priority

You want to deploy a Deployment in a cluster that has two node pools: Node Pool A and Node Pool B. You want to prioritize scheduling pods to Node Pool A. If Node Pool A has insufficient resources, pods are scheduled to Node Pool B. When you scale in the Deployment, pods in Node Pool B are removed first, followed by pods in Node Pool A. In this example, the nodes cn-beijing.10.0.3.137 and cn-beijing.10.0.3.138 belong to Node Pool A, and the nodes cn-beijing.10.0.6.47 and cn-beijing.10.0.6.46 belong to Node Pool B. Each node has 2 vCPUs and 4 GB of memory. The following steps describe the procedure:

  1. You can use the following YAML content to create a ResourcePolicy that customizes the node pool scheduling order.

    apiVersion: scheduling.alibabacloud.com/v1alpha1
    kind: ResourcePolicy
    metadata:
      name: nginx
      namespace: default
    spec:
      selector:
        app: nginx # This must match the label of the pods that you will create later.
      strategy: prefer
      units:
      - resource: ecs
        nodeSelector:
          alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
      - resource: ecs
        nodeSelector:
          alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
    Note

    You can obtain the node pool ID on the Node Management > Node Pools page of the cluster. For more information, see Create and manage node pools.
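
    Alternatively, because the ResourcePolicy above selects nodes by the alibabacloud.com/nodepool-id label, you can read the node pool IDs directly from the node labels:

      kubectl get nodes -L alibabacloud.com/nodepool-id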

  2. You can use the following YAML content to create a deployment and deploy two pods.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          name: nginx
          labels:
            app: nginx # This must match the selector of the ResourcePolicy created in the previous step.
        spec:
          containers:
          - name: nginx
            image: nginx
            resources:
              limits:
                cpu: 2
              requests:
                cpu: 2
  3. Create the Nginx application and view the deployment result.

    1. Run the following command to create the Nginx application.

      kubectl apply -f nginx.yaml

      Expected output:

      deployment.apps/nginx created
    2. Run the following command to view the deployment result.

      kubectl get pods -o wide

      Expected output:

      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          17s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-k****   1/1     Running   0          17s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>

      The output shows that the first two pods are scheduled to the nodes in Node Pool A.

  4. Scale out the pods.

    1. Run the following command to scale out the pods to four replicas.

      kubectl scale deployment nginx --replicas 4                      

      Expected output:

      deployment.apps/nginx scaled
    2. Run the following command to check the pod status.

      kubectl get pods -o wide

      Expected output:

      NAME                    READY   STATUS    RESTARTS   AGE    IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          101s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-k****   1/1     Running   0          101s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
      nginx-9cdf7bbf9-m****   1/1     Running   0          18s    172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-x****   1/1     Running   0          18s    172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>

      The output shows that when the nodes in Node Pool A have insufficient resources, the new pods are scheduled to the nodes in Node Pool B.

  5. Scale in the pods.

    1. Run the following command to scale in the pods from four replicas to two.

      kubectl scale deployment nginx --replicas 2

      Expected output:

      deployment.apps/nginx scaled
    2. Run the following command to check the pod status.

      kubectl get pods -o wide

      Expected output:

      NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running       0          2m41s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-k****   1/1     Running       0          2m41s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
      nginx-9cdf7bbf9-m****   0/1     Terminating   0          78s     172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-x****   0/1     Terminating   0          78s     172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>

      The output shows that pods in Node Pool B are removed first, which is the reverse of the scheduling order.

Scenario 2: Use hybrid scheduling for ECS and ECI

You want to deploy a Deployment in a cluster that has three types of resources: subscription ECS instances, pay-as-you-go ECS instances, and ECI instances. To reduce resource costs, you want to set the scheduling priority in the following order: subscription ECS instances, pay-as-you-go ECS instances, and then ECI instances. When you scale in the Deployment, you want to remove pods from ECI instances first, then from pay-as-you-go ECS instances, and finally from subscription ECS instances. In this example, each node has 2 vCPUs and 4 GB of memory. The following steps describe the procedure for hybrid ECS and ECI scheduling:

  1. Run the following commands to add different labels to nodes that use different billing methods. You can also use the node pool feature to automatically add labels.

    kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
    kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
    kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
    kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
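
    To confirm the labels before you create the ResourcePolicy, you can list the nodes with the paidtype label shown as a column:

      kubectl get nodes -L paidtype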
  2. You can use the following YAML content to create a ResourcePolicy that customizes the scheduling order.

    apiVersion: scheduling.alibabacloud.com/v1alpha1
    kind: ResourcePolicy
    metadata:
      name: nginx
      namespace: default
    spec:
      selector:
        app: nginx # This must match the label of the pods that you will create later.
      strategy: prefer
      units:
      - resource: ecs
        nodeSelector:
          paidtype: subscription
      - resource: ecs
        nodeSelector:
          paidtype: pay-as-you-go
      - resource: eci
  3. You can use the following YAML content to create a deployment and deploy two pods.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          name: nginx
          labels:
            app: nginx # This must match the selector of the ResourcePolicy created in the previous step.
        spec:
          containers:
          - name: nginx
            image: nginx
            resources:
              limits:
                cpu: 2
              requests:
                cpu: 2
  4. Create the Nginx application and view the deployment result.

    1. Run the following command to create the Nginx application.

      kubectl apply -f nginx.yaml

      Expected output:

      deployment.apps/nginx created
    2. Run the following command to view the deployment result.

      kubectl get pods -o wide

      Expected output:

      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          66s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          66s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>

      The output shows that the first two pods are scheduled to nodes that have the paidtype=subscription label.

  5. Scale out the pods.

    1. Run the following command to scale out the pods to four replicas.

      kubectl scale deployment nginx --replicas 4

      Expected output:

      deployment.apps/nginx scaled
    2. Run the following command to check the pod status.

      kubectl get pods -o wide

      Expected output:

      NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   1/1     Running   0          16s     172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running   0          3m48s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-f****   1/1     Running   0          16s     172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          3m48s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>

      The output shows that when the nodes with the paidtype=subscription label have insufficient resources, the new pods are scheduled to the nodes with the paidtype=pay-as-you-go label.

    3. Run the following command to scale out the pods to six replicas.

      kubectl scale deployment nginx --replicas 6

      Expected output:

      deployment.apps/nginx scaled
    4. Run the following command to check the pod status.

      kubectl get pods -o wide

      Expected output:

      NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   1/1     Running   0          3m10s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running   0          6m42s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
      nginx-9cdf7bbf9-f****   1/1     Running   0          3m10s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          6m42s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
      nginx-9cdf7bbf9-s****   1/1     Running   0          36s     10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
      nginx-9cdf7bbf9-v****   1/1     Running   0          36s     10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>

      The output shows that when ECS resources are insufficient, the new pods are scheduled to ECI resources.

  6. Scale in the pods.

    1. Run the following command to scale in the pods from six replicas to four.

      kubectl scale deployment nginx --replicas 4

      Expected output:

      deployment.apps/nginx scaled
    2. Run the following command to check the pod status.

      kubectl get pods -o wide

      Expected output:

      NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   1/1     Running       0          4m59s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running       0          8m31s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
      nginx-9cdf7bbf9-f****   1/1     Running       0          4m59s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running       0          8m31s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
      nginx-9cdf7bbf9-s****   1/1     Terminating   0          2m25s   10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
      nginx-9cdf7bbf9-v****   1/1     Terminating   0          2m25s   10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>

      The output shows that pods on ECI instances are removed first, which is the reverse of the scheduling order.

    3. Run the following command to scale in the pods from four replicas to two.

      kubectl scale deployment nginx --replicas 2

      Expected output:

      deployment.apps/nginx scaled
    4. Run the following command to check the pod status.

      kubectl get pods -o wide

      Expected output:

      NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-4****   0/1     Terminating   0          6m43s   172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
      nginx-9cdf7bbf9-b****   1/1     Running       0          10m     172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-f****   0/1     Terminating   0          6m43s   172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running       0          10m     172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>

      The output shows that pods on nodes with the paidtype=pay-as-you-go label are removed first, which is the reverse of the scheduling order.

    5. Run the following command to check the pod status.

      kubectl get pods -o wide

      Expected output:

      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
      nginx-9cdf7bbf9-b****   1/1     Running   0          11m   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
      nginx-9cdf7bbf9-r****   1/1     Running   0          11m   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>

      The output shows that only pods on nodes with the paidtype=subscription label remain.
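
If you followed these examples in a test cluster, you can optionally remove the example resources afterward. The commands below assume that both the deployment and the ResourcePolicy are named nginx in the default namespace, as in the examples above:

kubectl delete deployment nginx
kubectl delete resourcepolicy nginx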

References

  • When you deploy services in an ACK cluster, you can use tolerations and node affinity to specify that only ECS or ECI elastic resources are used, or to automatically request ECI resources when ECS resources are insufficient. By configuring scheduling policies, you can meet various requirements for elastic resources in different workload scenarios. For more information, see Specify resource allocation for ECS and ECI.

  • High availability (HA) and high performance are important for running distributed tasks. In an ACK managed cluster (Pro version), you can use native Kubernetes scheduling semantics to spread distributed tasks across zones to meet HA deployment requirements. You can also use native Kubernetes scheduling semantics to implement affinity-based deployment of distributed tasks in a specified zone to meet high-performance deployment requirements. For more information, see Implement zone-based anti-affinity and affinity scheduling for ECI pods.