How to customize pod scheduling - Container Service for Kubernetes

Priority-based resource scheduling is provided by Alibaba Cloud to meet elasticity requirements in pod scheduling. A ResourcePolicy specifies the priorities of nodes in descending order for pod scheduling. When the system deploys or scales out pods for an application, pods are scheduled to nodes based on the priorities of the nodes that are listed in the ResourcePolicy. When the system scales in pods for an application, pods are removed from nodes based on the priorities of the nodes in ascending order.

Important

In kube-scheduler v1.x.x-aliyun-6.4 and later, the ignorePreviousPod parameter of a ResourcePolicy is set to false and the ignoreTerminatingPod parameter is set to true by default. Existing ResourcePolicies that use the preceding parameters are not affected by this change or further updates.
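
To avoid relying on version-dependent defaults, both parameters can be set explicitly. The following fragment is a minimal sketch, not an official template. It assumes a scheduler version that supports both parameters (6.1 or later) and a unit that sets max, which both parameters require. The name explicit-defaults, the app=demo pod label, and the unit=first node label are hypothetical placeholders.

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: explicit-defaults      # hypothetical name
  namespace: default
spec:
  ignorePreviousPod: false     # pods scheduled before the ResourcePolicy was created count toward max
  ignoreTerminatingPod: true   # pods in the Terminating state do not count toward max
  selector:
    app: demo                  # hypothetical pod label
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      unit: first              # hypothetical node label
    max: 10                    # arbitrary value for illustration
  - resource: eci
```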

Prerequisites

A Container Service for Kubernetes (ACK) Pro cluster that runs Kubernetes 1.20.11 or later is created. For more information about how to update the Kubernetes version of an ACK cluster, see Update the Kubernetes version of an ACK cluster.
The scheduler version that is required varies based on the Kubernetes version of the cluster. The following table describes the scheduler versions that are required for different Kubernetes versions. For more information about the features of different scheduler versions, see kube-scheduler.
Kubernetes version	Scheduler version
1.20	1.20.4-ack-7.0 and later
1.22	1.22.15-ack-2.0 and later
1.24 and later	All versions are supported
ack-virtual-node is deployed if you want to use elastic container instances. For more information, see Use Elastic Container Instance in ACK clusters.

Limits

Priority-based resource scheduling is mutually exclusive with the pod deletion cost feature. For more information about the pod deletion cost feature, see Pod deletion cost.
You cannot use priority-based resource scheduling and Elastic Container Instance-based scheduling at the same time. For more information about Elastic Container Instance-based scheduling, see Use Elastic Container Instance-based scheduling.
This feature uses the BestEffort policy and does not ensure that pods are removed from nodes based on the priorities of the nodes in ascending order when the system scales in pods for an application.
The max parameter is available only when your cluster runs Kubernetes 1.22 or later and the version of the scheduler installed in your cluster is 5.0 or later.
If you use this feature together with elastic node pools, invalid nodes may be added to the elastic node pools. Make sure that the elastic node pools are included in units. Do not specify the max parameter for the units.
If you use a scheduler version earlier than 5.0 or the Kubernetes version of your cluster is 1.20 or earlier, pods that already exist before the ResourcePolicy is created are removed first during a scale-in activity.
If you use a scheduler version earlier than 6.1 or the Kubernetes version of your cluster is 1.20 or earlier, do not modify the ResourcePolicy before all pods selected by the ResourcePolicy are deleted.
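
The limit on elastic node pools can be illustrated with a short sketch. It assumes one regular node pool and one elastic node pool, both selected by the alibabacloud.com/nodepool-id node label. The policy name, the app=demo pod label, and the node pool ID placeholders are hypothetical. The unit that selects the elastic node pool deliberately omits max.

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: elastic-pool-example   # hypothetical name
  namespace: default
spec:
  selector:
    app: demo                  # hypothetical pod label
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: <regular-node-pool-id>   # placeholder
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: <elastic-node-pool-id>   # placeholder
    # max is intentionally not set for the unit that contains the elastic node pool
```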

How to configure priority-based resource scheduling

Create a ResourcePolicy with the following template:

```
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: test
  namespace: default
spec:
  ignorePreviousPod: false
  ignoreTerminatingPod: true
  matchLabelKeys:
  - pod-template-hash
  preemptPolicy: AfterAllUnits
  selector:
    key1: value1
  strategy: prefer
  units:
  - nodeSelector:
      unit: first
    resource: ecs
  - nodeSelector:
      unit: second
    max: 10
    resource: ecs
  - resource: eci
  whenTryNextUnits:
    policy: TimeoutOrExceedMax
    timeout: 1m
```

selector: the selector that is used to select pods in a namespace. The ResourcePolicy is applied to the selected pods. In this example, pods that have the key1=value1 label are selected. When selector is left empty, the ResourcePolicy takes effect on all pods in the namespace.
strategy: the scheduling policy. Set the value to prefer.
units: the schedulable units. In a scale-out activity, pods are scheduled to the units in the order in which the units are listed, from first to last. In a scale-in activity, pods are deleted from the units in the reverse order.
- resource: the type of elastic resource. Valid values: eci, ecs, and elastic. elastic is available when your cluster runs a Kubernetes version later than 1.24 and the scheduler version is 6.4.3 or later.
- nodeSelector: the selector that is used to select nodes with the specified node label. This parameter takes effect only if the resource parameter is set to ecs.
- max (available when the scheduler version is 5.0 and later): Specifies the maximum number of replicated pods that can be scheduled to the current unit.
preemptPolicy (available when the scheduler version is 6.1 and later): the preemption policy that is used when the ResourcePolicy contains multiple units. The policy specifies whether to preempt nodes each time the scheduler fails to schedule pods to a unit. If you set the preemption policy to BeforeNextUnit, the scheduler attempts to preempt nodes each time it fails to schedule pods to a unit. If you set the preemption policy to AfterAllUnits, the scheduler attempts to preempt nodes only after it fails to schedule pods to all units. The default is AfterAllUnits.
ignorePreviousPod (available when the scheduler version is 6.1 and later): This parameter must be used together with the max parameter in the units section. If this parameter is set to true, the value of the max parameter does not include pods that are scheduled before the ResourcePolicy is created.
ignoreTerminatingPod (available when the scheduler version is 6.1 and later): This parameter must be used together with the max parameter in the units section. If this parameter is set to true, the value of the max parameter does not include pods in the Terminating state.
matchLabelKeys (available when the scheduler version is 6.2 and later): This parameter must be used together with the max parameter in the units section. This parameter specifies the label keys of the pods specified by the max parameter. After you configure the matchLabelKeys parameter, pods without the specified label keys are not scheduled.
whenTryNextUnits (available when the Kubernetes version of the cluster is 1.24 or later and the scheduler version is 6.4 or later): This parameter specifies the conditions under which pods are allowed to use resources in subsequent units.
- policy: the pod scheduling policy. Valid values: ExceedMax, LackResourceAndNoTerminating, TimeoutOrExceedMax, and LackResourceOrExceedMax (default value).
  - ExceedMax: If the Max limit is not configured for the current unit or the number of pods in the current unit is greater than the Max limit, pods are allowed to use resources of the next level. This policy can be used together with Auto Scaling and Elastic Container Instance to preferably use Auto Scaling to scale a node pool.
    Important
    If the autoscaler fails to add nodes to a node pool for a long period of time after this policy is used, pending pods may exist.
    The autoscaler is unaware of the Max limit of the ResourcePolicy. The actual number of instances that are added may be greater than the Max limit. This issue will be resolved in later versions.
  - TimeoutOrExceedMax: This policy applies when one of the following conditions is met:
    - The Max limit is configured for the current unit and the number of pods in the unit is smaller than the Max limit.
    - The Max limit is not configured for the current unit and the resource type of the unit is elastic.
    In these cases, if the current unit does not have sufficient resources for pod scheduling, the pods wait in the current unit for resources. The maximum waiting time equals the value of timeout. This policy can be used together with Auto Scaling and Elastic Container Instance to preferably use Auto Scaling to scale a node pool and use elastic container instances when the timeout period ends.
    Important
    If the newly added nodes fail to reach the Ready state before the timeout period ends and pods are not configured to tolerate the NotReady taint, pods are still scheduled to elastic container instances.
  - LackResourceOrExceedMax: If the number of pods in the current unit is equal to or greater than the Max limit or the unit does not have idle resources, pods are allowed to use resources of the next level. This is the default policy and suitable for most scenarios.
  - LackResourceAndNoTerminating: If the number of pods in the current unit is equal to or greater than the Max limit or the unit does not have idle resources, and no Terminating pods exist in the unit, pods are allowed to use resources of the next level. This policy can be used together with a rolling update policy to prevent scheduling new pods to subsequent units when Terminating pods exist in the current unit.
- timeout: When the TimeoutOrExceedMax policy is used, this parameter specifies the timeout period. When this parameter is set to an empty string, the timeout period is 15 minutes.
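
The parameters above can be combined for a rolling-update scenario. The following is a sketch under stated assumptions, not a recommended configuration. It assumes a cluster that runs Kubernetes 1.24 or later with scheduler version 6.4 or later, so that whenTryNextUnits is available. The policy name, the app=demo pod label, and the unit=first node label are hypothetical. With LackResourceAndNoTerminating, new pods are not expected to move on to the eci unit while Terminating pods still exist in the first unit.

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: rolling-update-example   # hypothetical name
  namespace: default
spec:
  selector:
    app: demo                    # hypothetical pod label
  strategy: prefer
  preemptPolicy: AfterAllUnits   # attempt preemption only after all units fail
  units:
  - resource: ecs
    nodeSelector:
      unit: first                # hypothetical node label
    max: 10                      # arbitrary value for illustration
  - resource: eci
  whenTryNextUnits:
    # Use the next unit only if the current unit is full or lacks resources
    # and no Terminating pods exist in the current unit.
    policy: LackResourceAndNoTerminating
```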

Examples

Example 1: Schedule pods to specified node pools

You want to schedule the pods of a Deployment to specific node pools, for example, Node Pool A and Node Pool B. You want to prioritize the use of Node Pool A and schedule pods to Node Pool B only if the computing resources of Node Pool A are insufficient. In scale-in activities, you want to delete pods from the nodes in Node Pool B first. In this example, Node Pool A contains the following nodes: cn-beijing.10.0.3.137 and cn-beijing.10.0.3.138. Node Pool B contains the following nodes: cn-beijing.10.0.6.47 and cn-beijing.10.0.6.46. Each of these nodes has 2 vCores and 4 GB of memory. Perform the following steps to configure priority-based resource scheduling:

Create a ResourcePolicy with the following template:

```
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # You must specify the label of the pods to which you want to apply the ResourcePolicy.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****
```

Note

To view the ID of a node pool, choose Nodes > Node Pools on the details page of a cluster in the ACK console. For more information, see Create a node pool.

Use the following template to create a Deployment that provisions two pods:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # The pod label must be the same as the one that you specified for the selector in the ResourcePolicy.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
```

Deploy an NGINX application and query the pods.

Run the following command to deploy an NGINX application:
```
kubectl apply -f nginx.yaml
```
Expected output:
```
deployment.apps/nginx created
```

Run the following command to query the pods:

```
kubectl get pods -o wide
```
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          17s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running   0          17s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
```

The output shows that the two pods are scheduled to Node Pool A.

Scale out pods for the NGINX application.

Run the following command to increase the number of pods to four:

```
kubectl scale deployment nginx --replicas 4
```
Expected output:
```
deployment.apps/nginx scaled
```

Run the following command to query the status of the pods:

```
kubectl get pods -o wide
```
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE    IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          101s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running   0          101s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
nginx-9cdf7bbf9-m****   1/1     Running   0          18s    172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-x****   1/1     Running   0          18s    172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
```

The output shows that new pods are scheduled to Node Pool B because the computing resources in Node Pool A are insufficient.

Scale in pods for the NGINX application.

Run the following command to reduce the number of pods to two:
```
kubectl scale deployment nginx --replicas 2
```
Expected output:
```
deployment.apps/nginx scaled
```

Run the following command to query the status of the pods:

```
kubectl get pods -o wide
```
Expected output:
```
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running       0          2m41s   172.29.112.216   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-k****   1/1     Running       0          2m41s   172.29.113.24    cn-beijing.10.0.3.138   <none>           <none>
nginx-9cdf7bbf9-m****   0/1     Terminating   0          78s     172.29.113.156   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-x****   0/1     Terminating   0          78s     172.29.113.89    cn-beijing.10.0.6.46    <none>           <none>
```

The output shows that pods that run on the nodes in Node Pool B are deleted.
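
As a variation on this example, the max parameter can cap the number of pods that are placed in Node Pool A before Node Pool B is used. The following is a hedged sketch rather than part of the tested walkthrough. It assumes a cluster that runs Kubernetes 1.22 or later with scheduler version 5.0 or later, as required for max in the Limits section. The value 1 is arbitrary and chosen only for illustration.

```yaml
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np7ec79f2235954e879de07b780058****   # Node Pool A
    max: 1   # illustrative cap on the number of pods in Node Pool A
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: npab2df797738644e3a7b7cbf532bb****   # Node Pool B
```

On scheduler versions that support whenTryNextUnits, the default LackResourceOrExceedMax policy allows additional pods to use Node Pool B after the cap is reached.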

Example 2: Schedule pods to ECS instances and elastic container instances

You want to schedule the pods of a Deployment to multiple types of resources, such as subscription Elastic Compute Service (ECS) instances, pay-as-you-go ECS instances, and elastic container instances. To reduce the resource cost, you want to schedule pods to resources based on the following priorities: subscription ECS instances > pay-as-you-go ECS instances > elastic container instances. In scale-in activities, you want to delete pods from these resources based on the following sequence: elastic container instances, pay-as-you-go ECS instances, and subscription ECS instances. In this example, each of the ECS instances has 2 vCores and 4 GB of memory. Perform the following steps to configure priority-based resource scheduling:

Run the following command to add labels that indicate different billing methods to the nodes. You can also use node pools to automatically add labels to the nodes.

```
kubectl label node cn-beijing.10.0.3.137 paidtype=subscription
kubectl label node cn-beijing.10.0.3.138 paidtype=subscription
kubectl label node cn-beijing.10.0.6.46 paidtype=pay-as-you-go
kubectl label node cn-beijing.10.0.6.47 paidtype=pay-as-you-go
```

Create a ResourcePolicy with the following template:

```
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx # You must specify the label of the pods to which you want to apply the ResourcePolicy.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      paidtype: subscription
  - resource: ecs
    nodeSelector:
      paidtype: pay-as-you-go
  - resource: eci
```

Use the following template to create a Deployment that provisions two pods:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx # The pod label must be the same as the one that you specified for the selector in the ResourcePolicy.
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          limits:
            cpu: 2
          requests:
            cpu: 2
```

Deploy an NGINX application and query the pods.

Run the following command to deploy an NGINX application:
```
kubectl apply -f nginx.yaml
```
Expected output:
```
deployment.apps/nginx created
```

Run the following command to query the pods:

```
kubectl get pods -o wide
```
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          66s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          66s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
```

The output shows that two pods are scheduled to nodes that have the paidtype=subscription label.

Scale out pods for the NGINX application.

Run the following command to increase the number of pods to four:
```
kubectl scale deployment nginx --replicas 4
```
Expected output:
```
deployment.apps/nginx scaled
```

Run the following command to query the status of the pods:

```
kubectl get pods -o wide
```
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running   0          16s     172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running   0          3m48s   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running   0          16s     172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          3m48s   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
```

The output shows that new pods are scheduled to nodes that have the paidtype=pay-as-you-go label because the resources of the nodes that have the paidtype=subscription label are insufficient.

Run the following command to increase the number of pods to six:
```
kubectl scale deployment nginx --replicas 6
```
Expected output:
```
deployment.apps/nginx scaled
```

Run the following command to query the status of the pods:

```
kubectl get pods -o wide
```
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running   0          3m10s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running   0          6m42s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running   0          3m10s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          6m42s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Running   0          36s     10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Running   0          36s     10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
```

The output shows that new pods are scheduled to elastic container instances because the resources of the ECS nodes are insufficient.

Scale in pods for the NGINX application.

Run the following command to reduce the number of pods to four:
```
kubectl scale deployment nginx --replicas 4
```
Expected output:
```
deployment.apps/nginx scaled
```

Run the following command to query the status of the pods:

```
kubectl get pods -o wide
```
Expected output:
```
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                           NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   1/1     Running       0          4m59s   172.29.113.155   cn-beijing.10.0.6.47           <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          8m31s   172.29.112.215   cn-beijing.10.0.3.137          <none>           <none>
nginx-9cdf7bbf9-f****   1/1     Running       0          4m59s   172.29.113.88    cn-beijing.10.0.6.46           <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          8m31s   172.29.113.23    cn-beijing.10.0.3.138          <none>           <none>
nginx-9cdf7bbf9-s****   1/1     Terminating   0          2m25s   10.0.6.68        virtual-kubelet-cn-beijing-j   <none>           <none>
nginx-9cdf7bbf9-v****   1/1     Terminating   0          2m25s   10.0.6.67        virtual-kubelet-cn-beijing-j   <none>           <none>
```

The output shows that the pods that run on elastic container instances are deleted.

Run the following command to reduce the number of pods to two:
```
kubectl scale deployment nginx --replicas 2
```
Expected output:
```
deployment.apps/nginx scaled
```

Run the following command to query the status of the pods:

```
kubectl get pods -o wide
```
Expected output:
```
NAME                    READY   STATUS        RESTARTS   AGE     IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-4****   0/1     Terminating   0          6m43s   172.29.113.155   cn-beijing.10.0.6.47    <none>           <none>
nginx-9cdf7bbf9-b****   1/1     Running       0          10m     172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-f****   0/1     Terminating   0          6m43s   172.29.113.88    cn-beijing.10.0.6.46    <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running       0          10m     172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
```

The output shows that the pods on the nodes that have the paidtype=pay-as-you-go label are deleted.

After the Terminating pods are removed, run the following command again to query the status of the pods:

```
kubectl get pods -o wide
```
Expected output:
```
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
nginx-9cdf7bbf9-b****   1/1     Running   0          11m   172.29.112.215   cn-beijing.10.0.3.137   <none>           <none>
nginx-9cdf7bbf9-r****   1/1     Running   0          11m   172.29.113.23    cn-beijing.10.0.3.138   <none>           <none>
```

The output shows that pods run only on the nodes with the paidtype=subscription label.
