Descheduling evicts running pods that match a descheduling policy so that they can be rescheduled onto more suitable nodes. The scheduler places a pod only once and does not re-evaluate it afterward, so use descheduling to fix pods that became mis-scheduled after their node's taints, affinity rules, or load profile changed.
This topic walks you through enabling descheduling via the ack-koordinator component, using the RemovePodsViolatingNodeTaints plug-in as an example.
Prerequisites
Before you begin, make sure you have:
- An ACK managed Pro cluster. For more information, see Create an ACK managed cluster.
- A kubectl client connected to the cluster. For more information, see Get a cluster kubeconfig and connect to the cluster using kubectl.
Descheduling is not supported on virtual nodes.
Usage notes
- Koordinator Descheduler only evicts running pods. It does not recreate or reschedule them. After eviction, the workload controller (such as a Deployment or StatefulSet) recreates the pod, which is then scheduled by the standard scheduler.
- During descheduling, old pods are evicted before new pods are created. Make sure your application has enough replicas to maintain availability during eviction.
- ack-descheduler is discontinued. If you are using it, migrate to Koordinator Descheduler. For more information, see How do I migrate from ack-descheduler to Koordinator Descheduler?.
Choose a descheduling plug-in
Select a plug-in based on the problem you want to solve:
| Scenario | Plug-in | Policy type |
|---|---|---|
| Pods remain on nodes that acquired a NoSchedule taint after scheduling | RemovePodsViolatingNodeTaints | Deschedule |
| Pods violate inter-pod anti-affinity rules | RemovePodsViolatingInterPodAntiAffinity | Deschedule |
| Pods no longer satisfy node affinity rules | RemovePodsViolatingNodeAffinity | Deschedule |
| Pods restart too frequently | RemovePodsHavingTooManyRestarts | Deschedule |
| Pods have exceeded their time-to-live | PodLifeTime | Deschedule |
| Pods are in the Failed state | RemoveFailedPod | Deschedule |
| Replicated pods are unevenly spread across nodes | RemoveDuplicates | Balance |
| Nodes are unevenly utilized by resource allocation | LowNodeUtilization | Balance |
| Pods violate topology spread constraints | RemovePodsViolatingTopologySpreadConstraint | Balance |
| Nodes are overloaded by actual resource utilization | LowNodeLoad | Balance |
This topic uses RemovePodsViolatingNodeTaints as the example. Read the descheduling concepts and Koordinator Descheduler vs. Kubernetes Descheduler before getting started.
How it works
The RemovePodsViolatingNodeTaints plug-in checks every node for NoSchedule taints at the configured interval. If a running pod lacks a toleration for a node's NoSchedule taint, the plug-in evicts the pod. The workload controller then recreates the pod, and the scheduler places it on a node the pod can tolerate.
Use the excludedTaints field to exempt specific taints from this check. If a taint's key, or its key=value pair, matches an entry in excludedTaints, the plug-in ignores that taint.
Example scenario used in this topic:
A three-node cluster has a Deployment with one pod on each node. An administrator adds NoSchedule taints to two nodes after the pods are already running:
- Node A gets deschedule=not-allow:NoSchedule. Because deschedule=not-allow is in excludedTaints, this taint is ignored and the pod stays.
- Node B gets deschedule=allow:NoSchedule. This taint is not excluded, so the pod is evicted and rescheduled to Node C (which has no NoSchedule taint).
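To reproduce this scenario, a workload along the following lines could be used. This is a minimal sketch and not part of the official procedure: the Deployment name taint-demo, the app: taint-demo label, and the nginx image are hypothetical placeholders. The preferred (not required) pod anti-affinity only nudges the scheduler to spread the three replicas across nodes, so the pod evicted from Node B can still land on Node C next to an existing replica. The pod template defines no tolerations, so the pods tolerate neither taint.

```yaml
# Hypothetical demo workload; name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: taint-demo
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: taint-demo
  template:
    metadata:
      labels:
        app: taint-demo
    spec:
      affinity:
        podAntiAffinity:
          # Preferred, not required: spreading is best effort, so a pod
          # evicted from Node B can still be placed on Node C.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: taint-demo
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: nginx
```

Because the anti-affinity is only a preference, check the actual placement before tainting the nodes.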
Step 1: Install ack-koordinator and enable descheduling
If ack-koordinator is already installed, make sure the version is 1.2.0-ack.2 or later before proceeding.
- Log on to the ACK console. In the left navigation pane, click Clusters.
- On the Clusters page, find the target cluster and click its name. In the left navigation pane, click Add-ons.
- Find ack-koordinator and click Install in the lower-right corner.
- In the Install dialog box, select Enable Descheduler for ACK-Koordinator, then configure and install the component as prompted.
Koordinator Descheduler is deployed as a Deployment on cluster nodes.
Step 2: Enable the RemovePodsViolatingNodeTaints plug-in
Configure the plug-in
Create a file named koord-descheduler-config.yaml with the following content. This ConfigMap enables RemovePodsViolatingNodeTaints and configures it to ignore the deschedule=not-allow taint.
# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    deschedulingInterval: 120s # The interval at which the descheduler runs. Set to 120 seconds here.
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations.
    # The preceding configuration is the system configuration.
    profiles:
    - name: koord-descheduler
      plugins:
        deschedule:
          enabled:
          - name: RemovePodsViolatingNodeTaints # Enable the node taint verification plug-in.
      pluginConfig:
      - name: RemovePodsViolatingNodeTaints # Configure the node taint verification plug-in.
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore taints whose key is deschedule and whose value is not-allow.
      # Required for RemovePodsViolatingNodeTaints to take effect. Do not remove.
      - name: MigrationController # Configure the migration controller.
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly
RemovePodsViolatingNodeTaints parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| excludedTaints | list(string) | — | Taint keys or key=value pairs to ignore. Pods on nodes with these taints are not evicted. |
| includePreferNoSchedule | bool | false | When true, also checks taints with effect PreferNoSchedule, not just NoSchedule. |
| namespaces.include | list(string) | — | Restrict descheduling to specific namespaces. Mutually exclusive with namespaces.exclude. |
| namespaces.exclude | list(string) | — | Skip descheduling in specific namespaces. Mutually exclusive with namespaces.include. |
| labelSelector | map | — | Restrict descheduling to pods that match the specified labels. |
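As an illustration of how these parameters combine, the fragment below is a sketch of a stricter variant of the plug-in arguments. It uses only the parameters listed in the table. The choice of kube-system as the excluded namespace is an assumption made for the example, not a recommendation from this topic.

```yaml
# Fragment of the pluginConfig list inside the koord-descheduler-config ConfigMap.
# Values are illustrative.
pluginConfig:
- name: RemovePodsViolatingNodeTaints
  args:
    includePreferNoSchedule: true   # Also check PreferNoSchedule taints.
    excludedTaints:
    - deschedule=not-allow          # Ignore this exact key=value pair.
    namespaces:
      exclude:                      # Do not combine with namespaces.include.
      - kube-system
```

With includePreferNoSchedule set to true, more pods are likely to become eviction candidates, so it may be worth narrowing the scope with namespaces or excludedTaints at the same time.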
Apply the configuration
- Apply the ConfigMap to the cluster:

      kubectl apply -f koord-descheduler-config.yaml

- Restart Koordinator Descheduler to pick up the new configuration:

      kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
      # Expected output:
      # deployment.apps/ack-koord-descheduler scaled
      kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
      # Expected output:
      # deployment.apps/ack-koord-descheduler scaled
Configure advanced parameters
Use a ConfigMap to configure global behavior and fine-grained template settings for Koordinator Descheduler.
Example of advanced configurations
The following example shows a fully configured Koordinator Descheduler ConfigMap. It uses DeschedulerConfiguration for global settings, enables RemovePodsViolatingNodeTaints as the descheduling policy, and uses MigrationController as the evictor.
# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations.
    deschedulingInterval: 120s # The interval at which the descheduler runs. The interval is set to 120 seconds in this example.
    nodeSelector: # The nodes that are involved in descheduling. By default, all nodes are involved.
      matchLabels:
        alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements.
    maxNoOfPodsToEvictPerNode: 10 # The maximum number of pods that can be evicted from a node. The limit takes effect on a global scale. By default, no limit is configured.
    maxNoOfPodsToEvictPerNamespace: 10 # The maximum number of pods that can be evicted from a namespace. The limit takes effect on a global scale. By default, no limit is configured.
    # The preceding configuration is the system configuration.
    # The template list.
    profiles:
    - name: koord-descheduler # The name of the template.
      # Scope: apply this template only to the specified nodes.
      # Method 1: Select nodes in one node pool.
      nodeSelector:
        matchLabels:
          alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements.
      # Method 2: Select nodes in multiple node pools.
      # nodeSelector:
      #   matchExpressions:
      #   - key: alibabacloud.com/nodepool-id
      #     operator: In
      #     values:
      #     - nodepool-1
      #     - nodepool-2
      plugins:
        deschedule: # All plug-ins are disabled by default. Specify the ones to enable.
          enabled:
          - name: RemovePodsViolatingNodeTaints # Enable the node taint verification plug-in.
        balance: # All plug-ins are disabled by default.
          disabled:
          - name: "*" # Disable all Balance plug-ins.
        evict:
          enabled:
          - name: MigrationController # MigrationController is enabled by default.
        filter:
          enabled:
          - name: MigrationController # Use MigrationController's filtering policy by default.
      pluginConfig:
      - name: RemovePodsViolatingNodeTaints
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore taints whose key is deschedule and whose value is not-allow.
          - reserved # Ignore taints whose key is reserved.
          includePreferNoSchedule: false # When false, only checks NoSchedule taints.
          namespaces:
            include: # Restrict descheduling to these namespaces.
            - "namespace1"
            - "namespace2"
            # exclude: # Alternatively, exclude these namespaces.
            # - "namespace1"
            # - "namespace2"
          labelSelector: # Only deschedule pods matching these labels.
            accelerator: nvidia-tesla-p100
      - name: MigrationController
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly
          evictLocalStoragePods: false # When false, pods using emptyDir or hostPath are not descheduled.
          maxMigratingPerNode: 1 # Maximum pods migrated simultaneously on a node.
          maxMigratingPerNamespace: 1 # Maximum pods migrated simultaneously in a namespace.
          maxMigratingPerWorkload: 1 # Maximum pods migrated simultaneously in a workload.
          maxUnavailablePerWorkload: 2 # Maximum unavailable replicated pods allowed in a workload.
          objectLimiters:
            workload: # Throttle workload-level migration. Default: only 1 pod per workload within 5 minutes.
              duration: 5m
              maxMigrating: 1
          evictionPolicy: Eviction # Use the Eviction API by default.
System configurations
Configure global, system-level behavior in DeschedulerConfiguration.
| Parameter | Type | Valid value | Description | Example |
|---|---|---|---|---|
| dryRun | boolean | true / false (default: false) | Read-only mode. When enabled, no pods are migrated. | false |
| deschedulingInterval | time.Duration | >0s | How often the descheduler runs. | 120s |
| nodeSelector | Structure | — | Limit which nodes are eligible for descheduling. Accepts matchLabels (one node pool) or matchExpressions (multiple node pools). See Kubernetes labelSelector. | See example YAML above |
| maxNoOfPodsToEvictPerNode | int | ≥0 (default: 0) | Maximum pods evicted from a single node per descheduling cycle. 0 means no limit. | 10 |
| maxNoOfPodsToEvictPerNamespace | int | ≥0 (default: 0) | Maximum pods evicted from a single namespace per descheduling cycle. 0 means no limit. | 10 |
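One cautious way to introduce a new policy is to start in read-only mode with tight eviction limits. The fragment below is a sketch of that idea using only the parameters from the table. The specific limit values are assumptions chosen for illustration.

```yaml
# Fragment of the DeschedulerConfiguration inside the ConfigMap.
# Other fields stay unchanged. Values are illustrative.
dryRun: true                        # Read-only mode: no pods are migrated.
deschedulingInterval: 120s
maxNoOfPodsToEvictPerNode: 1        # At most one eviction per node per cycle.
maxNoOfPodsToEvictPerNamespace: 5   # At most five evictions per namespace per cycle.
```

Once the configuration behaves as intended, set dryRun back to false, apply the ConfigMap again, and restart Koordinator Descheduler so the change takes effect.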
Template configurations
Koordinator Descheduler uses templates (profiles) to group descheduling policies and evictors. Each template has the following fields:
- `name`: A string identifier for the template.
- `plugins`: Enables or disables descheduling policies (deschedule, balance), evictors (evict), and pre-eviction filters (filter).
- `pluginConfig`: Per-plug-in advanced arguments. Set the name field to match the plug-in name and configure args. See Configure policy plug-ins and Configure evictor plug-ins.
- `nodeSelector`: Limits the template scope to specific nodes. If not set, the template applies to all nodes.
The nodeSelector field in template configurations requires ack-koordinator v1.6.1-ack.1.16 or later.
`plugins` field reference:
| Field | Supported plug-ins | Description |
|---|---|---|
| deschedule | RemovePodsViolatingNodeTaints, RemovePodsViolatingInterPodAntiAffinity, RemovePodsViolatingNodeAffinity, RemovePodsHavingTooManyRestarts, PodLifeTime, RemoveFailedPod | All disabled by default. Specify plug-ins to enable. |
| balance | RemoveDuplicates, LowNodeUtilization, HighNodeUtilization, RemovePodsViolatingTopologySpreadConstraint, LowNodeLoad | All disabled by default. Specify plug-ins to enable. |
| evict | MigrationController, DefaultEvictor | The pod evictor. MigrationController is enabled by default. Do not enable multiple evict plug-ins at the same time. |
| filter | MigrationController, DefaultEvictor | Pre-eviction filtering policy. MigrationController is enabled by default. Do not enable multiple filter plug-ins at the same time. |
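For example, a profile that enables two Deschedule plug-ins while keeping all Balance plug-ins off might look like the sketch below. It is an illustrative fragment only: some plug-ins, possibly including RemovePodsViolatingNodeAffinity, may need their own entries under pluginConfig, which are not shown here.

```yaml
# Fragment of one profile inside the ConfigMap. Plug-in arguments are omitted.
profiles:
- name: koord-descheduler
  plugins:
    deschedule:
      enabled:
      - name: RemovePodsViolatingNodeTaints
      - name: RemovePodsViolatingNodeAffinity
    balance:
      disabled:
      - name: "*"                   # Keep all Balance plug-ins disabled.
    evict:
      enabled:
      - name: MigrationController   # Only one evict plug-in at a time.
    filter:
      enabled:
      - name: MigrationController   # Only one filter plug-in at a time.
```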
Configure policy plug-ins
Koordinator Descheduler supports six Deschedule plug-ins and five Balance plug-ins. All of them except LowNodeLoad come from Kubernetes Descheduler and are listed in the following table. The LowNodeLoad plug-in is provided by Koordinator. For more information, see Work with load-aware hotspot descheduling.
| Policy type | Plug-in | Description |
|---|---|---|
| Deschedule | RemovePodsViolatingInterPodAntiAffinity | Evicts pods that violate inter-pod anti-affinity rules. |
| Deschedule | RemovePodsViolatingNodeAffinity | Evicts pods that no longer satisfy node affinity rules. |
| Deschedule | RemovePodsViolatingNodeTaints | Evicts pods that cannot tolerate node taints. |
| Deschedule | RemovePodsHavingTooManyRestarts | Evicts pods that restart too frequently. |
| Deschedule | PodLifeTime | Evicts pods whose TTL has expired. |
| Deschedule | RemoveFailedPod | Evicts pods in the Failed state. |
| Balance | RemoveDuplicates | Spreads replicated pods evenly across nodes. |
| Balance | LowNodeUtilization | Redistributes pods based on node resource allocation. |
| Balance | HighNodeUtilization | Consolidates pods from underutilized nodes to more utilized ones. |
| Balance | RemovePodsViolatingTopologySpreadConstraint | Evicts pods that violate topology spread constraints. |
Configure evictor plug-ins
Koordinator Descheduler supports two evictor plug-ins: DefaultEvictor and MigrationController.
MigrationController
MigrationController provides fine-grained eviction control and observability through migration jobs.
| Parameter | Type | Valid value | Description | Example |
|---|---|---|---|---|
| evictLocalStoragePods | boolean | true / false (default: false) | When false, pods using emptyDir or hostPath are not descheduled. | false |
| maxMigratingPerNode | int64 | ≥0 (default: 2) | Maximum pods migrated simultaneously on a node. 0 means no limit. | 2 |
| maxMigratingPerNamespace | int64 | ≥0 (default: 0) | Maximum pods migrated simultaneously in a namespace. 0 means no limit. | 1 |
| maxMigratingPerWorkload | intOrString | ≥0 (default: 10%) | Maximum pods or percentage migrated simultaneously in a workload. 0 means no limit. If a workload has only one pod, it is excluded from descheduling. | 1 or 10% |
| maxUnavailablePerWorkload | intOrString | ≥0 and < replica count (default: 10%) | Maximum unavailable replicated pods allowed in a workload. 0 means no limit. | 1 or 10% |
| objectLimiters.workload | Structure | Duration >0 (default: 5m); MaxMigrating ≥0 (default: 10%) | Throttles workload-level migration within a time window. Duration sets the window length. MaxMigrating sets the maximum pods migrated within that window. | duration: 5m, maxMigrating: 1 |
| evictionPolicy | string | Eviction (default), Delete, Soft | Controls how pods are evicted. Eviction: calls the Eviction API for graceful eviction. Delete: calls the Delete API. Soft: adds the scheduling.koordinator.sh/soft-eviction annotation for custom downstream handling. | Eviction |
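For workloads with only a few replicas, absolute limits can be easier to reason about than the percentage defaults. The fragment below is a sketch of a conservative MigrationController configuration built only from the parameters in the table. The values, including the 10-minute throttling window, are assumptions for illustration and should be tuned to your availability requirements.

```yaml
# Fragment of the pluginConfig list inside the ConfigMap. Values are illustrative.
pluginConfig:
- name: MigrationController
  args:
    apiVersion: descheduler/v1alpha2
    kind: MigrationControllerArgs
    defaultJobMode: EvictDirectly
    evictLocalStoragePods: false    # Leave pods that use emptyDir or hostPath alone.
    maxMigratingPerNode: 1          # One migration at a time per node.
    maxMigratingPerWorkload: 1      # One migration at a time per workload.
    maxUnavailablePerWorkload: 1    # Tolerate at most one unavailable replica.
    objectLimiters:
      workload:
        duration: 10m               # Throttling window.
        maxMigrating: 1             # At most one migration per workload per window.
    evictionPolicy: Eviction        # Use the Eviction API for graceful eviction.
```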
DefaultEvictor
DefaultEvictor is the standard Kubernetes Descheduler evictor. For configuration details, see DefaultEvictor.
MigrationController vs. DefaultEvictor
| Capability | DefaultEvictor | MigrationController |
|---|---|---|
| Eviction methods | Eviction API only | Eviction API, Delete API, or Soft annotation |
| Per-node eviction limit | Supported | Supported |
| Per-namespace eviction limit | Supported | Supported |
| Per-workload eviction limit | Not supported | Supported |
| Per-workload unavailability limit | Not supported | Supported |
| Eviction throttling | Not supported | Time window-based throttling per workload |
| Eviction observability | Component logs only | Component logs and Kubernetes events with per-pod migration status |
What's next
- To understand descheduling concepts, features, and workflow, see Descheduling.
- If you use the community Kubernetes Descheduler, see Koordinator Descheduler and Kubernetes Descheduler to learn about the differences and complete the component migration.
- To configure load-aware hotspot descheduling using LowNodeLoad, see Work with load-aware hotspot descheduling.
- To understand cluster resource usage and get cost-saving recommendations, see Cost Insight.
- For troubleshooting, see Scheduling FAQs.
- For the ack-koordinator release notes and component overview, see ack-koordinator (ack-slo-manager).