The ack-koordinator component provides a load-based hot spot descheduling feature. This feature detects changes in node loads within a cluster and automatically deschedules pods from nodes that exceed a safe load threshold, which prevents severe load imbalances. This topic describes how to use load-based hot spot descheduling and its advanced configuration parameters.
Limits
Only ACK managed Pro clusters are supported.
The related components must meet the following version requirements.

| Component | Version requirements |
| --- | --- |
| ACK Scheduler | v1.22.15-ack-4.0 or later, v1.24.6-ack-4.0 or later |
| ack-koordinator | v1.1.1-ack.1 or later |
| Helm | v3.0 or later |
The Koordinator Descheduler module only evicts pods. The pods are then rescheduled by the ACK Scheduler. We recommend that you use the descheduling feature in conjunction with load-aware scheduling. This allows the ACK Scheduler to avoid rescheduling pods to hot spot nodes.
During descheduling, old pods are evicted before new pods are created. Make sure that your application has enough redundant replicas to prevent the eviction from affecting application availability.
Descheduling uses the standard Kubernetes eviction API to evict pods. Make sure that the logic of your application pods is re-entrant so that your service is not disrupted by restarts after the eviction.
Billing
The ack-koordinator component is free to install and use. However, additional fees may be incurred in the following scenarios:
ack-koordinator is a self-managed component and consumes worker node resources after installation. You can configure the resource requests for each module when you install the component.
By default, ack-koordinator exposes monitoring metrics for features such as resource profiling and fine-grained scheduling in Prometheus format. If you select the Enable Prometheus Monitoring for ACK-Koordinator option when you configure the component and use the Alibaba Cloud Prometheus service, these metrics are considered custom metrics and incur fees. The fees depend on factors such as your cluster size and the number of applications. Before you enable this feature, carefully read the Billing of Prometheus instances documentation for Alibaba Cloud Prometheus to understand the free quota and billing policies for custom metrics. You can monitor and manage your resource usage by querying usage data.
Introduction to load-based hot spot descheduling
Load-aware scheduling
The ACK scheduler supports load-aware scheduling, which can schedule pods to nodes that run with low loads. Because the cluster environment, traffic, and requests change, node utilization also changes dynamically. This can disrupt the load balance between nodes in the cluster and even result in severe load imbalances. This affects the runtime quality of the workload. ack-koordinator can identify changes in node loads and automatically deschedule pods from nodes that exceed a safe load threshold to prevent severe load imbalances. You can combine load-aware scheduling with hot spot descheduling to achieve optimal load balancing among nodes. For more information, see Use load-aware pod scheduling.
How the Koordinator Descheduler module works
The ack-koordinator component provides the Koordinator Descheduler module. In this module, the LowNodeLoad plug-in detects load levels and performs load-based hot spot descheduling. Unlike the LowNodeUtilization plug-in of the native Kubernetes descheduler, which makes decisions based on the resource allocation rate, the LowNodeLoad plug-in makes descheduling decisions based on the actual resource utilization of nodes.
Execution procedure
The Koordinator Descheduler module runs periodically. Each execution cycle consists of the following three stages:

1. Data collection: obtains information about the nodes and workloads in the cluster and their related resource utilization data.
2. Policy plug-in execution. The following steps use the LowNodeLoad plug-in as an example:
   1. Identify hot spot nodes. For more information about how nodes are classified, see LowNodeLoad load threshold parameters.
   2. Traverse all hot spot nodes, identify the pods that can be migrated, and sort the pods. For more information about how pods are scored and sorted, see Pod scoring policy.
   3. Traverse all pods to be migrated and check whether each pod meets the requirements for migration based on constraints such as the cluster size, resource utilization, and the ratio of replicated pods. For more information, see Load-based hot spot descheduling policies.
   4. If a pod meets the conditions, classify it as a replica to be migrated. If it does not, continue to traverse other pods and hot spot nodes.
3. Pod eviction and migration: evicts the pods that meet the requirements for migration. For more information, see API-initiated Eviction.
LowNodeLoad load threshold parameters
The LowNodeLoad plug-in has two important parameters:

- highThresholds: the hot spot load threshold. Pods on nodes whose load exceeds this threshold are eligible for descheduling. Pods on nodes whose load is below this threshold are not descheduled. We recommend that you also enable the load-aware scheduling feature of the scheduler. For more information, see Scheduling policies. For more information about how to use these features together, see How do I use load-aware scheduling and load-based hot spot descheduling together?.
- lowThresholds: the idle load threshold. If the load level of all nodes is higher than lowThresholds, the overall cluster load is considered high. In this case, the Koordinator Descheduler does not perform descheduling even if the load level of some nodes is higher than highThresholds.
For example, if lowThresholds is set to 45% and highThresholds is set to 70%, nodes are classified based on the following criteria. If the values of lowThresholds and highThresholds change, the classification criteria change accordingly.
By default, resource utilization data is updated every minute and is calculated as the average value over the previous 5 minutes.
Idle Node: A node with resource utilization below 45%.
Normal Node: A node with resource utilization greater than or equal to 45% and less than or equal to 70%. This load level is the desired range.
Hot Spot Node: A node with resource utilization above 70%. Some pods on a hot spot node are evicted to lower its load level to 70% or less.
Load-based hot spot descheduling policies
| Policy name | Description |
| --- | --- |
| Hot spot check retry policy | To ensure the accuracy of hot spot detection and avoid frequent application migration caused by glitches in monitoring data, Koordinator Descheduler supports retries for hot spot checks. A node is identified as a hot spot only if it exceeds the threshold multiple consecutive times. |
| Node sorting policy | Among the identified hot spot nodes, Koordinator Descheduler initiates descheduling on nodes in descending order of resource usage. During node sorting, memory and CPU usage are compared in sequence, and nodes with higher resource usage are prioritized. |
| Pod scoring policy | For each hot spot node, Koordinator Descheduler scores and sorts the pods on it, and then initiates eviction operations to migrate the pods to idle nodes. Note: If you have requirements for the pod eviction order, configure different priorities or QoS classes for your pods. |
| Filter policy | The Koordinator Descheduler module supports multiple filter parameters for pods and nodes to support gradual, canary-style rollout of descheduling. |
| Pre-check policy | The Koordinator Descheduler module performs a pre-check before pod migration to ensure that each migration is as safe as possible. |
| Migration throttling policy | To ensure the high availability of applications during pod migration, Koordinator Descheduler provides multiple features to control pod migration. You can specify the maximum number of pods that can be migrated at the same time per node, namespace, or workload. You can also specify a pod migration time window to prevent pods that belong to the same workload from being migrated too frequently. Koordinator Descheduler is also compatible with the Pod Disruption Budget (PDB) mechanism of open source Kubernetes, which lets you configure more fine-grained management policies to ensure the high availability of your workloads. |
| Observability policy | You can observe the migration process of descheduling through events and view the specific reasons and current status of a migration in the event details. |
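As an illustration of the PDB mechanism mentioned in the migration throttling policy, the following sketch shows a standard Kubernetes PodDisruptionBudget. The name and the `app: stress-demo` label are hypothetical; substitute the labels of your own workload.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stress-demo-pdb   # hypothetical name
  namespace: default
spec:
  # Keep at least 2 replicas available at all times. The descheduler
  # does not evict a pod if the eviction would violate this budget.
  minAvailable: 2
  selector:
    matchLabels:
      app: stress-demo    # illustrative label selector
```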
Step 1: Enable descheduling in ack-koordinator
If the ack-koordinator component is not installed in the cluster, install the component and select Enable Descheduling For Ack-koordinator on the component configuration page. For more information, see Install the ack-koordinator component.
If the ack-koordinator component is already installed in your cluster, on the component configuration page, select Enable Descheduling For Ack-koordinator. For the procedure, see Modify the ack-koordinator component.
Step 2: Enable the load-based hot spot descheduling plug-in
Create a `koord-descheduler-config.yaml` file using the following YAML content.
The `koord-descheduler-config.yaml` file is a ConfigMap object that is used to enable the LowNodeLoad descheduling plug-in.
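The exact YAML is not reproduced here. The following is a minimal sketch based on the open source Koordinator descheduler configuration format; the ConfigMap name, API version, and field layout are assumptions that may differ across ack-koordinator versions. The threshold values match the verification example later in this topic (idle threshold 20%, hot spot threshold 50%).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config   # assumed name; check your component documentation
  namespace: kube-system
data:
  koord-descheduler-config: |
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    # How often a descheduling cycle runs.
    deschedulingInterval: 120s
    profiles:
      - name: koord-descheduler
        plugins:
          balance:
            enabled:
              - name: LowNodeLoad   # enable the load-based hot spot descheduling plug-in
        pluginConfig:
          - name: LowNodeLoad
            args:
              apiVersion: descheduler/v1alpha2
              kind: LowNodeLoadArgs
              # Nodes below this utilization are treated as idle.
              lowThresholds:
                cpu: 20
              # Nodes above this utilization are treated as hot spots.
              highThresholds:
                cpu: 50
```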
Run the following command to apply the configuration to the cluster:

```shell
kubectl apply -f koord-descheduler-config.yaml
```

Run the following commands to restart the Koordinator Descheduler module. After the restart, the Koordinator Descheduler uses the most recently modified configuration.

```shell
kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
```

Expected output:

```
deployment.apps/ack-koord-descheduler scaled
deployment.apps/ack-koord-descheduler scaled
```
Step 3 (Optional): Enable the scheduler load balancing plug-in
To enable the scheduler load balancing plug-in for optimal load balancing among nodes, see Step 1: Enable load-aware scheduling.
Step 4: Verify the descheduling feature
The following example uses a cluster that has three nodes, each with 104 cores and 396 GiB of memory.
Create a `stress-demo.yaml` file using the following YAML content.
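The YAML content is not reproduced here. The following is an illustrative sketch of a CPU stress workload; the image and the resource figures are assumptions, and any stress-testing image (such as the public `polinux/stress` image used below) can be substituted.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-demo
  namespace: default
  labels:
    app: stress-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress-demo
  template:
    metadata:
      labels:
        app: stress-demo
    spec:
      containers:
        - name: stress
          image: polinux/stress       # illustrative stress-testing image
          command: ["stress"]
          # Drive CPU load so that the node crosses the hot spot threshold.
          args: ["--cpu", "16", "--timeout", "3600s"]
          resources:
            requests:
              cpu: "16"
            limits:
              cpu: "16"
```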
Create the stress testing pod:

```shell
kubectl create -f stress-demo.yaml
```

Expected output:

```
deployment.apps/stress-demo created
```

Observe the status of the pod until it is running:

```shell
kubectl get pod -o wide
```

Expected output:

```
NAME                           READY   STATUS    RESTARTS   AGE   IP            NODE                     NOMINATED NODE   READINESS GATES
stress-demo-588f9646cf-s****   1/1     Running   0          82s   10.XX.XX.53   cn-beijing.10.XX.XX.53   <none>           <none>
```

The output shows that the pod stress-demo-588f9646cf-s**** is scheduled to the node cn-beijing.10.XX.XX.53.

Increase the load level of the node cn-beijing.10.XX.XX.53 and then check the load of each node:

```shell
kubectl top node
```

Expected output:

```
NAME                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
cn-beijing.10.XX.XX.215   17611m       17%    24358Mi         6%
cn-beijing.10.XX.XX.53    63472m       63%    11969Mi         3%
```

The output shows that the load of the node cn-beijing.10.XX.XX.53 is high at 63%, which exceeds the configured hot spot threshold of 50%. The load of the node cn-beijing.10.XX.XX.215 is low at 17%, which is below the configured idle threshold of 20%.

Enable load-based hot spot descheduling. For more information, see Step 2: Enable the load-based hot spot descheduling plug-in.
Observe the pod changes. Wait for the descheduler to check for hot spot nodes and perform eviction and migration.

Note: By default, a node is identified as a hot spot only if it exceeds the hot spot threshold for five consecutive checks, which takes 10 minutes.

```shell
kubectl get pod -w
```

Expected output:

```
NAME                           READY   STATUS              RESTARTS   AGE   IP             NODE                      NOMINATED NODE   READINESS GATES
stress-demo-588f9646cf-s****   1/1     Terminating         0          59s   10.XX.XX.53    cn-beijing.10.XX.XX.53    <none>           <none>
stress-demo-588f9646cf-7****   1/1     ContainerCreating   0          10s   10.XX.XX.215   cn-beijing.10.XX.XX.215   <none>           <none>
```

Observe the events:

```shell
kubectl get event | grep stress-demo-588f9646cf-s****
```

Expected output:

```
2m14s   Normal   Evicting        podmigrationjob/00fe88bd-8d4c-428d-b2a8-d15bcdeb****   Pod "default/stress-demo-588f9646cf-s****" evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(68.53%)>threshold(50.00%)"
101s    Normal   EvictComplete   podmigrationjob/00fe88bd-8d4c-428d-b2a8-d15bcdeb****   Pod "default/stress-demo-588f9646cf-s****" has been evicted
2m14s   Normal   Descheduled     pod/stress-demo-588f9646cf-s****                       Pod evicted from node "cn-beijing.10.XX.XX.53" by the reason "node is overutilized, cpu usage(68.53%)>threshold(50.00%)"
2m14s   Normal   Killing         pod/stress-demo-588f9646cf-s****                       Stopping container stress
```

The output shows the migration record. The pod on the hot spot node is descheduled to another node, which is the expected result.
Advanced configuration
Advanced configuration parameters of the Koordinator Descheduler module
All parameter configurations for the Koordinator Descheduler are provided in a ConfigMap. The following shows the format of the advanced configuration parameters for load-based hot spot descheduling.
Koordinator Descheduler system configuration
| Parameter | Type | Value | Description | Example |
| --- | --- | --- | --- | --- |
| dryRun | boolean | `true` or `false` (default: `false`) | The read-only mode switch. If enabled, no pod migration is initiated. | false |
| deschedulingInterval | time.Duration | >0s | The descheduling execution interval. When you use the load-based hot spot descheduling feature, make sure that the value of this parameter is not greater than the cache duration that the LowNodeLoad plug-in uses for hot spot checks. | 120s |
Eviction and migration control configuration
| Parameter | Type | Value | Description | Example |
| --- | --- | --- | --- | --- |
| maxMigratingPerNode | int64 | ≥0 (default: 2) | The maximum number of pods that can be in the migrating state on each node. 0 indicates no limit. | 2 |
| maxMigratingPerNamespace | int64 | ≥0 (default: unlimited) | The maximum number of pods that can be in the migrating state in each namespace. 0 indicates no limit. | 1 |
| maxMigratingPerWorkload | intOrString | ≥0 (default: 10%) | The maximum number or percentage of pods that can be in the migrating state in each workload, such as a Deployment. 0 indicates no limit. If a workload has only a single replica, it is not descheduled. | 1 or 10% |
| maxUnavailablePerWorkload | intOrString | ≥0 (default: 10%), and less than the total number of replicas of the workload | The maximum number or percentage of unavailable replicas for each workload, such as a Deployment. 0 indicates no limit. | 1 or 10% |
| evictLocalStoragePods | boolean | `true` or `false` (default: `false`) | Specifies whether pods configured with HostPath or EmptyDir volumes can be descheduled. For security reasons, this is disabled by default. | false |
| objectLimiters.workload | struct |  | Throttling for pod migration at the workload level. | A configuration that allows a maximum of one replica to be migrated for a single workload within 5 minutes. |
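The eviction and migration parameters above are configured in the same descheduler ConfigMap. The following is a hedged sketch in the open source Koordinator format; the plug-in name (`MigrationController`) and the field layout are assumptions that may vary across ack-koordinator versions.

```yaml
pluginConfig:
  - name: MigrationController
    args:
      apiVersion: descheduler/v1alpha2
      kind: MigrationControllerArgs
      maxMigratingPerNode: 2          # at most 2 pods migrating per node
      maxMigratingPerNamespace: 1     # at most 1 pod migrating per namespace
      maxMigratingPerWorkload: 10%    # at most 10% of a workload's replicas migrating
      maxUnavailablePerWorkload: 10%  # at most 10% of a workload's replicas unavailable
      evictLocalStoragePods: false    # do not evict pods with HostPath/EmptyDir volumes
```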
LowNodeLoad plug-in configuration
| Parameter | Type | Value | Description | Example |
| --- | --- | --- | --- | --- |
| highThresholds | map[string]float64. Supports the CPU and memory dimensions. The value is a percentage. | [0,100] | The hot spot load threshold. Only pods on nodes that exceed this threshold participate in descheduling; pods on nodes below this threshold are not descheduled. We recommend that you also enable the load-aware scheduling feature of the scheduler. For more information, see Policy Description. For information about how to use the two features in combination, see How do I use load-aware scheduling and load-based hot spot descheduling together?. If the load levels of all nodes are higher than lowThresholds, descheduling is not performed even if some nodes exceed this threshold. |  |
| lowThresholds | map[string]float64. Supports the CPU and memory dimensions. The value is a percentage. | [0,100] | The idle load threshold. If the load levels of all nodes are higher than lowThresholds, the overall cluster load is considered high, and the Koordinator Descheduler does not perform descheduling even if the load levels of some nodes exceed highThresholds. |  |
|  | int64 | >0 (default: 5) | The number of retries for hot spot checks. A node is identified as a hot spot only if it exceeds highThresholds for multiple consecutive execution cycles. The counter is reset after the hot spot node is evicted. | 5 |
|  | *metav1.Duration | For more information about the Duration format, see Duration. (default: 5m) | The cache duration for hot spot checks. When you use the load-based hot spot descheduling feature, make sure that the value of this parameter is not less than the descheduling execution interval (deschedulingInterval). |  |
|  |  | Namespaces in the cluster | The namespaces that can be descheduled. If you leave this parameter empty, all pods can be descheduled. The include and exclude policies are supported, and the two policies are mutually exclusive. |  |
|  | metav1.LabelSelector | For more information about the LabelSelector format, see Labels and Selectors. | Selects target nodes by using a LabelSelector. | You can configure this parameter in two ways: one for specifying a single node pool and another for specifying multiple node pools. |
|  | A list of PodSelector objects. You can configure multiple groups of pods. | For more information about the LabelSelector format, see Labels and Selectors. | Selects the pods for which descheduling is enabled by using a LabelSelector. |  |
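The selector parameters above can be combined in a single plug-in configuration. The following is a hedged sketch; the field names follow the open source Koordinator LowNodeLoadArgs format and are assumptions for this component, and the node pool label value and pod labels are illustrative.

```yaml
args:
  apiVersion: descheduler/v1alpha2
  kind: LowNodeLoadArgs
  highThresholds:
    cpu: 70        # percentage; nodes above this are hot spots
    memory: 80
  lowThresholds:
    cpu: 45        # percentage; nodes below this are idle
    memory: 60
  # Restrict descheduling to specific namespaces.
  # include and exclude are mutually exclusive.
  evictableNamespaces:
    include:
      - default
  # Only consider nodes in a specific node pool (label value is illustrative).
  nodeSelector:
    matchLabels:
      alibabacloud.com/nodepool-id: np-example
  # Only deschedule pods that match this selector.
  podSelectors:
    - name: app-pods
      selector:
        matchLabels:
          app: stress-demo
```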
FAQ
What do I do if the node utilization reaches the threshold but pods on the node are not evicted?
This issue may occur for the following reasons. You can refer to the corresponding solutions to resolve the issue.
| Cause classification | Cause description | Solution |
| --- | --- | --- |
| Component configuration not in effect | The enabled scope is not specified. | The descheduler configuration includes the enabled scope for pods and nodes. Check whether the corresponding namespace and node are enabled. |
| Component configuration not in effect | The descheduler is not restarted after its configuration is modified. | After you modify the configuration of the descheduler, you must restart it for the modification to take effect. For more information about how to restart the descheduler, see Step 2: Enable the load-based hot spot descheduling plug-in. |
| Improper component configuration | The execution interval of the descheduler component is longer than the cache duration of the LowNodeLoad plug-in. As a result, hot spot node detection becomes invalid. | Make sure that the descheduling execution interval (deschedulingInterval) is not greater than the cache duration of the LowNodeLoad plug-in. |
| Node status does not meet conditions | The average utilization of the node is below the threshold for a long time. | The descheduler continuously monitors utilization for a period of time and calculates a smoothed average of the monitoring data. Therefore, descheduling is triggered only when a node's utilization continuously exceeds the threshold. By default, this period is about 10 minutes. In addition, the utilization returned by `kubectl top node` may differ from the value that the descheduler calculates. For more information, see What is the utilization algorithm that descheduling references?. |
| Node status does not meet conditions | Insufficient remaining capacity in the cluster. | Before evicting a pod, the descheduler checks other nodes in the cluster to ensure sufficient capacity for migration. For example, if a pod that requires 8 cores and 16 GiB of memory is selected for eviction, but the available capacity of all other nodes in the cluster is below this value, the descheduler does not migrate the pod for security reasons. In this case, consider adding nodes to ensure sufficient cluster capacity. |
| Workload property constraints | The workload has only a single replica. | To ensure the high availability of single-replica applications, these pods are not descheduled by default. If you have evaluated such a single-replica application and want its pod to be descheduled, you can add an annotation to the pod to explicitly allow eviction. Note: This annotation configuration is not supported in v1.3.0-ack1.6, v1.3.0-ack1.7, or v1.3.0-ack1.8. To upgrade the component to the latest version, see Install and manage the component. |
| Workload property constraints | The pod specifies HostPath or EmptyDir. | By default, pods configured with `emptyDir` or `hostPath` volumes are excluded from descheduling to ensure data security. If you want to deschedule these pods, refer to the `evictLocalStoragePods` setting. For more information, see Eviction and migration control configuration. |
| Workload property constraints | The number of unavailable or migrating replicas is too high. | When the number of unavailable or migrating replicas of a workload, such as a Deployment or StatefulSet, exceeds the configured limit (maxUnavailablePerWorkload or maxMigratingPerWorkload), the descheduler does not initiate an eviction. For example, if maxUnavailablePerWorkload and maxMigratingPerWorkload are set to 20%, the desired number of replicas for the Deployment is 10, and two pods are being evicted or released, the descheduler does not evict any more pods. Wait for the pod eviction or release to complete, or increase the values of these two configurations. |
| Workload property constraints | Incorrect replica count constraint configuration. | If the total number of replicas of a workload is less than or equal to the value of maxMigratingPerWorkload or maxUnavailablePerWorkload, the descheduler does not deschedule the workload for security reasons. To resolve this, decrease the values of these two configurations or change them to percentages. |
Why does the descheduler frequently restart?
The descheduler may frequently restart if its ConfigMap is invalid or does not exist. For more information, see Advanced configuration. Check the content and format of the ConfigMap, modify the ConfigMap, and then restart the descheduler. For more information about how to restart the descheduler, see Step 2: Enable the load-based hot spot descheduling plug-in.
How do I use load-aware scheduling and load-based hot spot descheduling together?
After you enable load-based hot spot descheduling, pods on hot spot nodes are evicted. The ACK scheduler selects appropriate nodes for pods that are created by upper-layer controllers, such as Deployments. To achieve optimal load balancing, we recommend that you enable load-aware scheduling at the same time. For more information, see Use load-aware scheduling.
We recommend that you set the loadAwareThreshold parameter for load-aware scheduling to the same value as the highThresholds parameter of the descheduler. For more information, see Scheduling policies. When the load of a node exceeds highThresholds, the Koordinator Descheduler evicts pods from that node. The scheduler then uses loadAwareThreshold to prevent new pods from being scheduled to the hot spot node. If you do not set the parameters to the same value, evicted pods may be rescheduled to the hot spot node. This issue is more likely to occur if a pod specifies a scope of schedulable nodes, but only a small number of those nodes are available and their resource utilization is similar.
What is the utilization algorithm that descheduling references?
The descheduler continuously monitors resource usage for a period and calculates an average value. A node is descheduled only if its average resource usage stays above a threshold for a certain period, which is 10 minutes by default. For memory, the descheduler's usage calculation excludes the page cache because the operating system can reclaim these resources. In contrast, the usage value returned by the kubectl top node command includes the page cache. You can use Managed Service for Prometheus to view the actual memory usage.