ack-koordinator provides the load-aware hotspot descheduling feature, which can sense the changes in the loads on cluster nodes and automatically optimize the nodes that exceed the safety load to prevent extreme load imbalance. This topic describes how to work with the load-aware hotspot descheduling feature and how to configure advanced settings for this feature.
Limits
Only Container Service for Kubernetes (ACK) Pro clusters support this feature. For more information about how to create an ACK Pro cluster, see Create an ACK managed cluster.
To use the load-aware hotspot descheduling feature, make sure that the following requirements are met.
Component | Required version |
ACK scheduler | v1.22.15-ack-4.0 or later, or v1.24.6-ack-4.0 or later |
ack-koordinator (ack-slo-manager) | 1.1.1-ack.1 or later |
Helm | 3.0 or later |
Introduction to load-aware hotspot descheduling
This section describes the terms used in load-aware hotspot descheduling.
Load-aware scheduling
The ACK scheduler supports load-aware scheduling, which can schedule pods to nodes that run with low loads. Due to the changes in the cluster environment, traffic, and requests, the utilization of nodes dynamically changes and may break the load balance between nodes in the cluster, and even result in extreme load imbalance. This affects the runtime quality of the workload. ack-koordinator can identify changes in the loads of nodes and automatically optimize nodes that exceed the safety load to prevent extreme load imbalance. You can use a combination of load-aware scheduling and hotspot descheduling to achieve optimal load balancing among nodes. For more information about load-aware scheduling, see Use load-aware pod scheduling.
How koord-descheduler works
The ack-koordinator component provides the koord-descheduler module. The LowNodeLoad plug-in in koord-descheduler is responsible for sensing the loads on nodes and reducing load hotspots. Unlike the Kubernetes-native LowNodeUtilization descheduler plug-in, which makes descheduling decisions based on resource allocation, the LowNodeLoad plug-in makes decisions based on the actual resource utilization of nodes.
Descheduling procedure
koord-descheduler periodically performs descheduling. The following figure shows the steps of descheduling within each cycle.
Data collection: collects information about nodes and workloads in the cluster and the resource utilization statistics.
Policy execution: the policy plug-in implements the descheduling policy. The following steps use LowNodeLoad as an example.
Identifies hotspot nodes. For more information about the classification of nodes, see Load thresholds.
Traverses all hotspot nodes, identifies the nodes on which pods can be migrated, and sorts the pods. For more information about how pods are sorted, see Pod sorting policy.
Traverses all pods to be migrated and checks whether the pods meet the requirements for migration based on constraints such as the cluster size, resource utilization, and the ratio of replicated pods. For more information, see Load-aware hotspot descheduling policies.
Only pods that meet the requirements are migrated. If no pod meets the requirements on the current node, LowNodeLoad continues to traverse the pods on other hotspot nodes.
Pod eviction and migration: evicts the pods that meet the requirements for migration. For more information, see Pod eviction.
Load thresholds
The LowNodeLoad plug-in allows you to set the following load thresholds:
highThresholds: specifies the high load threshold. Pods on nodes whose load is higher than this threshold are descheduled.
lowThresholds: specifies the low load threshold. Pods on nodes whose load is lower than this threshold are not descheduled.
In the following figure, lowThresholds is set to 45% and highThresholds is set to 70%. Nodes are classified based on their loads and the thresholds. If the values of lowThresholds and highThresholds change, the standards for node classification also change.
The resource utilization statistics are updated every minute and the average values within the previous 5 minutes are displayed.
Idle nodes: nodes whose resource utilization is lower than 45%.
Normal nodes: nodes whose resource utilization is higher than or equal to 45% but lower than or equal to 70%. This is the desired resource utilization range for cluster nodes.
Hotspot nodes: nodes whose resource utilization is higher than 70%. Pods on hotspot nodes are evicted until the resource utilization of these nodes drops to 70% or lower.
Load-aware hotspot descheduling policies
Policy | Description |
Hotspot detection frequency policy | To accurately identify hotspot nodes and avoid frequent pod migration caused by delayed monitoring data, koord-descheduler allows you to specify the frequency of hotspot detection. A node is considered a hotspot node only if the number of times that the node consecutively exceeds the load threshold reaches the specified frequency value. |
Node sorting policy | When hotspot nodes are identified, koord-descheduler sorts the nodes in descending order of resource usage and then deschedules the nodes in sequence. koord-descheduler compares the memory and CPU usage of the hotspot nodes and preferentially deschedules nodes whose resource usage is higher. |
Pod sorting policy | koord-descheduler sorts the pods on each hotspot node and then evicts the pods in sequence so that they can be rescheduled to idle nodes. koord-descheduler attempts to evict the following pods in order: pods with lower QoS classes, pods with lower priorities, pods with higher memory usage, pods with higher CPU usage, and pods that were created later. For example, if pods have the same QoS class, koord-descheduler first evicts the pods with lower priorities. |
Filtering policy | koord-descheduler allows you to configure various pod and node filters to control the scope of descheduling. |
Precheck policy | koord-descheduler can precheck pods before it migrates the pods. |
Migration control policy | To ensure the high availability of applications during pod migration, koord-descheduler provides multiple features that allow you to control pod migration. You can specify the maximum number of pods that can be migrated at the same time per node, namespace, or workload. koord-descheduler also allows you to specify a pod migration time window to prevent pods that belong to the same workload from being migrated too frequently. koord-descheduler is compatible with the Pod Disruption Budgets (PDB) mechanism of open source Kubernetes, which helps you guarantee the high availability of your applications in a fine-grained manner. For more information, see Pod Disruption Budgets. |
Observability policy | You can collect events to monitor the descheduling procedure, and view the reason and status of the descheduling in the event details. |
Step 1: Install or modify ack-koordinator and enable load-aware hotspot descheduling
Install ack-koordinator
Install ack-koordinator. On the Install ack-koordinator(ack-slo-manager) page, select Enable Descheduler for ack-koordinator. For more information, see Install ack-koordinator.
Modify ack-koordinator (ack-koordinator is already installed)
Modify ack-koordinator. On the ack-koordinator Parameters page, select Enable Descheduler for ack-koordinator. For more information, see Modify ack-koordinator.
Step 2: Enable the LowNodeLoad plug-in
Create a file named koord-descheduler-config.yaml and add the following content to the file.
koord-descheduler-config.yaml is a ConfigMap object used to enable the LowNodeLoad plug-in.
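The original manifest is not reproduced here. A minimal sketch of such a ConfigMap, based on the open source Koordinator configuration format (the API version, profile name, and threshold values are assumptions and may differ across component versions), might look like the following:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    deschedulingInterval: 120s        # run a descheduling cycle every 2 minutes
    profiles:
      - name: koord-descheduler
        plugins:
          balance:
            enabled:
              - name: LowNodeLoad     # enable the load-aware hotspot plug-in
        pluginConfig:
          - name: LowNodeLoad
            args:
              apiVersion: descheduler/v1alpha2
              kind: LowNodeLoadArgs
              lowThresholds:          # below these values, a node is idle
                cpu: 20
                memory: 30
              highThresholds:         # above these values, a node is a hotspot
                cpu: 50
                memory: 60
```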
Run the following command to apply the configuration to the cluster:
kubectl apply -f koord-descheduler-config.yaml
Run the following command to restart koord-descheduler.
After koord-descheduler is restarted, the modified configuration takes effect.
kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
deployment.apps/ack-koord-descheduler scaled
kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
deployment.apps/ack-koord-descheduler scaled
Step 3: (Optional) Enable the load-aware scheduling plug-in
For more information about how to enable the load-aware scheduling plug-in to achieve optimal load balancing among nodes, see Step 1 in the Use load-aware pod scheduling topic.
Step 4: Verify load-aware hotspot descheduling
In this section, a cluster that contains three nodes is used as an example. Each node has 104 vCores and 396 GB of memory.
Create a file named stress-demo.yaml and add the following content to the file.
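The original manifest is not reproduced here. A sketch of a stress Deployment that fits this scenario (the image, command, and resource values are assumptions; adjust them to your node size) might look like the following:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-demo
  namespace: default
  labels:
    app: stress-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress-demo
  template:
    metadata:
      labels:
        app: stress-demo
    spec:
      containers:
        - name: stress
          image: polinux/stress   # a commonly used stress-testing image
          command:
            - stress
            - --cpu
            - "64"                # keep about 64 vCores busy
          resources:
            requests:
              cpu: "32"
              memory: 10Gi
```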
Run the following command to create a pod for stress testing:
kubectl create -f stress-demo.yaml
deployment.apps/stress-demo created
Run the following command to view the status of the pod until it starts to run.
kubectl get pod -o wide
Expected output:
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE                     NOMINATED NODE   READINESS GATES
stress-demo-588f9646cf-s****   1/1     Running   0          82s   10.XX.XX.53   cn-beijing.10.XX.XX.53   <none>           <none>
The output indicates that pod stress-demo-588f9646cf-s**** is scheduled to node cn-beijing.10.XX.XX.53.

Increase the load of node cn-beijing.10.XX.XX.53. Then, run the following command to check the load of each node:

kubectl top node
Expected output:
NAME                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
cn-beijing.10.XX.XX.215   17611m       17%    24358Mi         6%
cn-beijing.10.XX.XX.53    63472m       63%    11969Mi         3%
The output indicates that the load of node cn-beijing.10.XX.XX.53 is 63%, which exceeds the high resource threshold of 50%. The load of node cn-beijing.10.XX.XX.215 is 17%, which is lower than the low resource threshold of 20%.

Enable load-aware hotspot descheduling. For more information, see Step 2: Enable the LowNodeLoad plug-in.
Run the following command to view the changes of the pods.
Wait for the descheduler to identify hotspot nodes and evict pods.
Note: A node is considered a hotspot node if the node consecutively exceeds the high resource threshold five times within 10 minutes.
kubectl get pod -w
Expected output:
NAME                           READY   STATUS              RESTARTS   AGE   IP             NODE                      NOMINATED NODE   READINESS GATES
stress-demo-588f9646cf-s****   1/1     Terminating         0          59s   10.XX.XX.53    cn-beijing.10.XX.XX.53    <none>           <none>
stress-demo-588f9646cf-7****   1/1     ContainerCreating   0          10s   10.XX.XX.215   cn-beijing.10.XX.XX.215   <none>           <none>
Run the following command to view the event:
kubectl get event | grep stress-demo-588f9646cf-s****
Expected output:
2m14s   Normal   Evicting        podmigrationjob/00fe88bd-8d4c-428d-b2a8-d15bcdeb****   Pod "default/stress-demo-588f9646cf-s****" evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(68.53%)>threshold(50.00%)"
101s    Normal   EvictComplete   podmigrationjob/00fe88bd-8d4c-428d-b2a8-d15bcdeb****   Pod "default/stress-demo-588f9646cf-s****" has been evicted
2m14s   Normal   Descheduled     pod/stress-demo-588f9646cf-s****                       Pod evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(68.53%)>threshold(50.00%)"
2m14s   Normal   Killing         pod/stress-demo-588f9646cf-s****                       Stopping container stress
The output indicates the migration result. The pods on the hotspot node are migrated to the idle node.
Advanced settings
Advanced settings for koord-descheduler
The configuration of koord-descheduler is stored in a ConfigMap. The following code block shows the advanced settings for load-aware hotspot descheduling.
koord-descheduler system settings
Parameter | Type | Valid value | Description | Example |
dryRun | boolean | true or false | The global read-only mode. After this mode is enabled, pods are not migrated. | false |
deschedulingInterval | time.Duration | >0s | The descheduling interval. | 120s |
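In the ConfigMap, these system settings sit at the top level of the configuration object. A sketch, assuming the open source Koordinator DeschedulerConfiguration format:

```yaml
apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
dryRun: false               # set to true to log decisions without migrating pods
deschedulingInterval: 120s  # interval between descheduling cycles
```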
Migration control settings
Parameter | Type | Valid value | Description | Example |
maxMigratingPerNode | int64 | ≥ 0 (default value: 2) | The maximum number of pods that can be migrated at the same time on a node. A value of 0 indicates that no limit is set. | 2 |
maxMigratingPerNamespace | int64 | ≥ 0 (default value: 0) | The maximum number of pods that can be migrated at the same time in a namespace. A value of 0 indicates that no limit is set. | 1 |
maxMigratingPerWorkload | intOrString | ≥ 0 (default value: 10%) | The maximum number or percentage of pods that can be migrated at the same time in a workload, such as a Deployment. A value of 0 indicates that no limit is set. If a workload contains only one replicated pod, the workload is excluded from descheduling. | 1 or 10% |
maxUnavailablePerWorkload | intOrString | ≥ 0 (default value: 10%) and smaller than the number of replicated pods of the workload | The maximum number or percentage of unavailable replicated pods that are allowed in a workload, such as a Deployment. A value of 0 indicates that no limit is set. | 1 or 10% |
evictLocalStoragePods | boolean | true or false | Specifies whether pods that are configured with emptyDir or hostPath volumes can be descheduled. By default, this feature is disabled to ensure data security. | false |
objectLimiters.workload | Structure | | Workload-specific pod migration control. The example indicates that only one replicated pod in a workload can be migrated within 5 minutes. | |
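A sketch of how these migration control settings could appear in the descheduler profile, assuming the open source Koordinator MigrationController format (the plug-in name and the objectLimiters field layout are assumptions):

```yaml
pluginConfig:
  - name: MigrationController
    args:
      apiVersion: descheduler/v1alpha2
      kind: MigrationControllerArgs
      maxMigratingPerNode: 2          # at most 2 pods migrating per node
      maxMigratingPerNamespace: 1
      maxMigratingPerWorkload: 10%    # percentage of workload replicas
      maxUnavailablePerWorkload: 10%
      evictLocalStoragePods: false    # keep emptyDir/hostPath pods in place
      objectLimiters:
        workload:
          duration: 5m                # time window for the limit below
          maxMigrating: 1             # at most 1 replica migrated per window
```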
LowNodeLoad settings
Parameter | Type | Valid value | Description | Example |
highThresholds | map[string]float64. Note: Set the parameter to a percentage value. You can specify this parameter for pods or nodes. | [0,100] | The high resource threshold. Pods on nodes whose load exceeds this threshold are descheduled. | |
lowThresholds | map[string]float64. Note: Set the parameter to a percentage value. You can specify this parameter for pods or nodes. | [0,100] | The low resource threshold. Pods on nodes whose load is lower than this threshold are not descheduled. | |
anomalyCondition.consecutiveAbnormalities | int64 | > 0 (default value: 5) | The hotspot detection frequency. A node is considered a hotspot node if the node exceeds highThresholds for the specified number of consecutive hotspot detection cycles. Hotspot nodes are descheduled and then the counter is reset. | 5 |
evictableNamespaces | | Namespaces in the cluster | The namespaces that you want to include in or exclude from descheduling. If you leave this parameter empty, all pods can be descheduled. You can specify an include list or an exclude list. The two lists are mutually exclusive. | |
nodeSelector | metav1.LabelSelector | For more information about the format of LabelSelector, see Labels and Selectors. | The LabelSelector that is used to select nodes. You can specify one or more node pools when you configure this parameter. | |
podSelectors | A list of PodSelector objects | For more information about the format of LabelSelector, see Labels and Selectors. | The pods that can be descheduled. | |
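A sketch of LowNodeLoad arguments that combines the parameters above, assuming the open source Koordinator format (the label keys, namespace names, and selector names are hypothetical):

```yaml
apiVersion: descheduler/v1alpha2
kind: LowNodeLoadArgs
highThresholds:
  cpu: 50
  memory: 60
lowThresholds:
  cpu: 20
  memory: 30
anomalyCondition:
  consecutiveAbnormalities: 5      # 5 consecutive abnormal cycles => hotspot
evictableNamespaces:
  include:                         # only pods in these namespaces are evictable
    - default
nodeSelector:
  matchLabels:
    alibabacloud.com/nodepool-id: np-example   # hypothetical node pool ID
podSelectors:
  - name: app-pods                 # hypothetical selector name
    selector:
      matchLabels:
        app: stress-demo
```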
FAQ
Issue 1: What do I do if the resource utilization of a node has reached the high threshold but no pod on the node is evicted?
The following table describes the possible causes.
Category | Cause | Solution |
Ineffective component configuration | No pods or nodes are specified | No pods or nodes are specified in the configuration of the descheduler. Check whether namespaces and nodes are specified. |
Ineffective component configuration | The descheduler is not restarted after the configuration is modified | After you modify the configuration of the descheduler, you must restart the descheduler for the modification to take effect. For more information about how to restart the descheduler, see Step 2: Enable the LowNodeLoad plug-in. |
Invalid node status | The average resource utilization of the node remains lower than the threshold | The descheduler continuously monitors the resource utilization within a period of time and calculates the average value. Descheduling is triggered only if the average value remains above the threshold for a certain period of time. The default time period is 10 minutes. The resource utilization returned by kubectl top node is a near-real-time value, which may differ from the average value that the descheduler uses. |
Invalid node status | Insufficient available resources in the cluster | The descheduler checks the other nodes in the cluster to ensure that they can provide sufficient available resources before the descheduler evicts pods. For example, if the descheduler wants to evict a pod that requests 8 vCores and 16 GB of memory but no node in the cluster can provide sufficient available resources, the descheduler does not evict the pod. To resolve this issue, you can add nodes to the cluster. |
Workload limits | Only one replicated pod in the workload | By default, if a workload contains only one replicated pod, the pod is excluded from descheduling. This ensures the high availability of the application that runs in the pod. To deschedule such a pod, you must add a specific annotation to the pod. |
Workload limits | Pods configured with emptyDir or hostPath volumes | By default, pods that are configured with emptyDir or hostPath volumes are excluded from descheduling to ensure data security. If you want to deschedule these pods, refer to the evictLocalStoragePods setting. For more information, see Migration control settings. |
Workload limits | Excessive number of unavailable replicated pods or replicated pods that are being migrated | The number of unavailable replicated pods or replicated pods that are being migrated in a workload (Deployment or StatefulSet) exceeds the upper limit specified by maxUnavailablePerWorkload or maxMigratingPerWorkload. For example, both parameters are set to 20% and the expected number of replicated pods for the Deployment is set to 10. If two pods are being migrated or released, the descheduler does not evict more pods. Wait until the migration or release is complete, or increase the values of the preceding parameters. |
Workload limits | Incorrect replicated pod limits | If the number of replicated pods in a workload is smaller than or equal to the maximum number of migrating pods specified by maxMigratingPerWorkload or the maximum number of unavailable pods specified by maxUnavailablePerWorkload, the descheduler does not deschedule the pods in the workload. Decrease the values of the preceding parameters and set the parameters to percentage values. |
Issue 2: Why does the descheduler frequently restart?
The format of the ConfigMap of the descheduler is invalid or the ConfigMap does not exist. Refer to Advanced settings and check the content and format of the ConfigMap, modify the ConfigMap, and then restart the descheduler. For more information about how to restart the descheduler, see Step 2: Enable the LowNodeLoad plug-in.
Issue 3: What do I do if new pods are scheduled to the node on which pods are evicted?
After the system evicts the pods on a node, the system automatically recreates and schedules the pods. The scheduler selects nodes based on the specified configuration. We recommend that you enable load-aware pod scheduling for the scheduler and use this feature together with descheduling. This issue is more likely to occur if you have specified the nodes to which pods can be scheduled, only a few number of idle nodes are available, and the resource utilizations of these nodes are close. In this scenario, you can disable the nodeSelector parameter.