Enable descheduling and advanced parameters - Container Service for Kubernetes

Enable Koordinator Descheduler to automatically evict and rebalance pods when node taints, affinity rules, or load profiles change.

The following procedure uses the RemovePodsViolatingNodeTaints plug-in as an example.

Prerequisites

Make sure that the following requirements are met:

An ACK managed Pro cluster is created.
kubectl is connected to the cluster.

Descheduling is not supported on virtual nodes.

Usage notes

Koordinator Descheduler only evicts pods — it does not recreate them. The workload controller (such as a Deployment or StatefulSet) recreates evicted pods, and the standard scheduler places them.
Old pods are evicted before new pods are created. Ensure your application has enough replicas to maintain availability during eviction.
ack-descheduler is discontinued. If you still use it, see How do I migrate from ack-descheduler to Koordinator Descheduler?

Choose a descheduling plug-in

Select a plug-in for your scenario:

Scenario	Plug-in	Policy type
Pods remain on nodes that acquired a `NoSchedule` taint after scheduling	`RemovePodsViolatingNodeTaints`	Deschedule
Pods violate inter-pod anti-affinity rules	`RemovePodsViolatingInterPodAntiAffinity`	Deschedule
Pods no longer satisfy node affinity rules	`RemovePodsViolatingNodeAffinity`	Deschedule
Pods restart too frequently	`RemovePodsHavingTooManyRestarts`	Deschedule
Pods have exceeded their time-to-live	`PodLifeTime`	Deschedule
Pods are in the Failed state	`RemoveFailedPod`	Deschedule
Replicated pods are unevenly spread across nodes	`RemoveDuplicates`	Balance
Nodes are unevenly utilized by resource allocation	`LowNodeUtilization`	Balance
Pods violate topology spread constraints	`RemovePodsViolatingTopologySpreadConstraint`	Balance
Nodes are overloaded by actual resource utilization	`LowNodeLoad`	Balance

The examples below use RemovePodsViolatingNodeTaints. Read the descheduling concepts and Koordinator Descheduler vs. Kubernetes Descheduler before you start.

How it works

RemovePodsViolatingNodeTaints checks every node for NoSchedule taints at the configured interval. If a running pod lacks a toleration for a node's NoSchedule taint, the plug-in evicts it. The workload controller recreates the pod, and the scheduler places it on a node whose taints the pod tolerates.

Use excludedTaints to exempt specific taints. If a taint's key or key=value pair matches an entry in excludedTaints, the plug-in ignores it.

Example scenario:

A three-node cluster runs a Deployment with one pod per node. An administrator adds NoSchedule taints to two nodes after deployment:

Node A gets deschedule=not-allow:NoSchedule. Because deschedule=not-allow is in excludedTaints, this taint is ignored — the pod stays.
Node B gets deschedule=allow:NoSchedule. This taint is not excluded — the pod is evicted and rescheduled to Node C (which has no NoSchedule taint).
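
The matching rule described above can be modeled with a small shell function. This is only an illustrative sketch of the key or key=value comparison, not the plug-in's implementation; the function name taint_is_excluded and its argument layout are hypothetical.

```shell
#!/bin/sh
# Illustrative model of excludedTaints matching (not the plug-in's own code).
# Usage: taint_is_excluded <key> <value> <excluded-entry>...
# Returns 0 if any entry equals the taint key or the key=value pair.
taint_is_excluded() {
  local key="$1" value="$2" entry
  shift 2
  for entry in "$@"; do
    if [ "$entry" = "$key" ] || [ "$entry" = "$key=$value" ]; then
      return 0
    fi
  done
  return 1
}

# Node A: deschedule=not-allow matches an excludedTaints entry, so the pod stays.
if taint_is_excluded deschedule not-allow "deschedule=not-allow"; then
  echo "node-a: taint ignored, pod stays"
fi

# Node B: deschedule=allow matches no entry, so a pod without a toleration is evicted.
if ! taint_is_excluded deschedule allow "deschedule=not-allow"; then
  echo "node-b: taint enforced, pod evicted"
fi
```

An entry that contains only a key, such as reserved, matches that key with any value.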

Step 1: Install ack-koordinator and enable descheduling

If ack-koordinator is already installed, ensure the version is 1.2.0-ack.2 or later.

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the target cluster name. In the left navigation pane, click Add-ons.
Find ack-koordinator and click Install.
In the Install dialog box, select Enable Descheduler for ACK-Koordinator and complete the installation.

Koordinator Descheduler is deployed as a Deployment on cluster nodes.

Step 2: Enable the RemovePodsViolatingNodeTaints plug-in

Configure the plug-in

Create a file named koord-descheduler-config.yaml. This ConfigMap enables RemovePodsViolatingNodeTaints and excludes the deschedule=not-allow taint.

# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    deschedulingInterval: 120s # The interval at which the descheduler runs. Set to 120 seconds here.
    dryRun: false # The global read-only mode. When enabled, koord-descheduler does not migrate any pods.
    # The preceding configuration is the system configuration.

    profiles:
    - name: koord-descheduler
      plugins:
        deschedule:
          enabled:
            - name: RemovePodsViolatingNodeTaints  # Enable the node taint verification plug-in.

      pluginConfig:
      - name: RemovePodsViolatingNodeTaints # Configure the node taint verification plug-in.
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore taints whose key is deschedule and whose value is not-allow.

      # Required for RemovePodsViolatingNodeTaints to take effect. Do not remove.
      - name: MigrationController # Configure the migration controller.
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly

RemovePodsViolatingNodeTaints parameters:

Parameter	Type	Default	Description
`excludedTaints`	list(string)	—	Taint keys or `key=value` pairs to ignore. Pods on nodes with these taints are not evicted.
`includePreferNoSchedule`	bool	false	When true, also checks taints with effect `PreferNoSchedule`, not just `NoSchedule`.
`namespaces.include`	list(string)	—	Restrict descheduling to specific namespaces. Mutually exclusive with `namespaces.exclude`.
`namespaces.exclude`	list(string)	—	Skip descheduling in specific namespaces. Mutually exclusive with `namespaces.include`.
`labelSelector`	map	—	Restrict descheduling to pods that match the specified labels.

Apply the configuration

Apply the ConfigMap to the cluster:

kubectl apply -f koord-descheduler-config.yaml

Restart Koordinator Descheduler to load the new configuration:

kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
# Expected output:
# deployment.apps/ack-koord-descheduler scaled
kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
# Expected output:
# deployment.apps/ack-koord-descheduler scaled
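
The restart above can also be wrapped in a small script that waits for the new pod before you continue. The sketch below is a convenience wrapper, not part of the product; the run and restart_descheduler helpers and the DRY_RUN switch are hypothetical names, and the 120-second timeout is an arbitrary choice. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: restart ack-koord-descheduler by scaling to 0 and back to 1.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

restart_descheduler() {
  local ns="kube-system" deploy="ack-koord-descheduler"
  run kubectl -n "$ns" scale deploy "$deploy" --replicas 0
  run kubectl -n "$ns" scale deploy "$deploy" --replicas 1
  # Block until the new pod is available or the timeout expires.
  run kubectl -n "$ns" rollout status deploy "$deploy" --timeout=120s
}

restart_descheduler
```

Run it with DRY_RUN=0 once the printed commands look right for your cluster.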

Step 3: Verify descheduling

This example uses a three-node cluster.

Create a file named stress-demo.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress-demo
  namespace: default
  labels:
    app: stress-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: stress-demo
  template:
    metadata:
      name: stress-demo
      labels:
        app: stress-demo
    spec:
      containers:
        - args:
            - '--vm'
            - '2'
            - '--vm-bytes'
            - '1600M'
            - '-c'
            - '2'
            - '--vm-hang'
            - '2'
          command:
            - stress
          image: registry-cn-beijing.ack.aliyuncs.com/acs/stress:v1.0.4
          imagePullPolicy: Always
          name: stress
          resources:
            limits:
              cpu: '2'
              memory: 4Gi
            requests:
              cpu: '2'
              memory: 4Gi
      restartPolicy: Always

Deploy the test workload:
```
kubectl create -f stress-demo.yaml
```

Wait for the pods to reach the Running state:

kubectl get pod -o wide

Expected output:

NAME                         READY   STATUS    RESTARTS   AGE    IP              NODE                        NOMINATED NODE   READINESS GATES
stress-demo-5f6cddf9-9****   1/1     Running   0          10s    192.XX.XX.27   cn-beijing.192.XX.XX.247   <none>           <none>
stress-demo-5f6cddf9-h****   1/1     Running   0          10s    192.XX.XX.20   cn-beijing.192.XX.XX.249   <none>           <none>
stress-demo-5f6cddf9-v****   1/1     Running   0          10s    192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>

Add NoSchedule taints to two nodes:
- Add deschedule=not-allow:NoSchedule to cn-beijing.192.XX.XX.247 (excluded by excludedTaints — pod should stay):
```
kubectl taint nodes cn-beijing.192.XX.XX.247 deschedule=not-allow:NoSchedule
```
  Expected output:
```
node/cn-beijing.192.XX.XX.247 tainted
```
- Add deschedule=allow:NoSchedule to cn-beijing.192.XX.XX.248 (not excluded — pod should be evicted):
```
kubectl taint nodes cn-beijing.192.XX.XX.248 deschedule=allow:NoSchedule
```
  Expected output:
```
node/cn-beijing.192.XX.XX.248 tainted
```

Watch pod changes. The descheduler checks taints every deschedulingInterval (120 seconds):

kubectl get pod -o wide -w

Expected output:

NAME                         READY   STATUS              RESTARTS   AGE     IP             NODE                    NOMINATED NODE   READINESS GATES
stress-demo-5f6cddf9-9****   1/1     Running             0          5m34s   192.XX.XX.27   cn-beijing.192.XX.XX.247   <none>           <none>
stress-demo-5f6cddf9-h****   1/1     Running             0          5m34s   192.XX.XX.20   cn-beijing.192.XX.XX.249   <none>           <none>
stress-demo-5f6cddf9-v****   1/1     Running             0          5m34s   192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
stress-demo-5f6cddf9-v****   1/1     Terminating         0          7m58s   192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
stress-demo-5f6cddf9-j****   0/1     ContainerCreating   0          0s      <none>         cn-beijing.192.XX.XX.249   <none>           <none>
stress-demo-5f6cddf9-j****   1/1     Running             0          2s      192.XX.XX.32   cn-beijing.192.XX.XX.249   <none>           <none>

The output confirms:

The pod on cn-beijing.192.XX.XX.248 (taint deschedule=allow:NoSchedule, not excluded) is evicted.
The pod on cn-beijing.192.XX.XX.247 (taint deschedule=not-allow:NoSchedule, excluded) stays running.
The evicted pod is rescheduled to cn-beijing.192.XX.XX.249, which has no NoSchedule taint.
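
To check the result without reading the table by eye, you can filter the kubectl get pod -o wide output by node. The following is a minimal sketch: the pods_on_node helper is hypothetical, the sample rows use made-up pod names, node names, and IP addresses, and the column positions assume the default wide layout with a single-word RESTARTS value.

```shell
#!/bin/sh
# Sketch: list Running pods on a given node from `kubectl get pod -o wide` output.
# Reads the table on stdin; column 3 is STATUS and column 7 is NODE.
pods_on_node() {
  awk -v node="$1" 'NR > 1 && $3 == "Running" && $7 == node { print $1 }'
}

# Usage against a live cluster (node name is a placeholder):
#   kubectl get pod -o wide | pods_on_node <node-name>

# Demo with made-up sample rows shaped like the output after eviction:
sample='NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
stress-demo-a 1/1 Running 0 8m 10.0.0.27 node-247 <none> <none>
stress-demo-h 1/1 Running 0 8m 10.0.0.20 node-249 <none> <none>
stress-demo-j 1/1 Running 0 2s 10.0.0.32 node-249 <none> <none>'

printf '%s\n' "$sample" | pods_on_node node-248   # prints nothing: no pod remains
printf '%s\n' "$sample" | pods_on_node node-249   # prints stress-demo-h and stress-demo-j
```

An empty result for the node that carries the non-excluded taint indicates that its pod was evicted.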

Check eviction events for the evicted pod. Replace the masked pod name with the actual name from your cluster:

kubectl get event | grep stress-demo-5f6cddf9-v****

Expected output:

3m24s       Normal    Evicting            podmigrationjob/b0fba65f-7fab-4a99-96a9-c71a3798****   Pod "default/stress-demo-5f6cddf9-v****" evicted from node "cn-beijing.192.XX.XX.248" by the reason "RemovePodsViolatingNodeTaints"
2m51s       Normal    EvictComplete       podmigrationjob/b0fba65f-7fab-4a99-96a9-c71a3798****   Pod "default/stress-demo-5f6cddf9-v****" has been evicted
3m24s       Normal    Descheduled         pod/stress-demo-5f6cddf9-v****                         Pod evicted from node "cn-beijing.192.XX.XX.248" by the reason "RemovePodsViolatingNodeTaints"
3m24s       Normal    Killing             pod/stress-demo-5f6cddf9-v****                         Stopping container stress

Each event maps to a migration lifecycle phase:

Event	Source	Meaning
`Evicting`	PodMigrationJob	Descheduler started a migration job to evict the pod.
`Descheduled`	Pod	Pod received the eviction signal.
`Killing`	Pod	Container runtime stops the container.
`EvictComplete`	PodMigrationJob	Pod fully evicted. Workload controller recreates it.

The pod on cn-beijing.192.XX.XX.248 had no toleration for the deschedule=allow:NoSchedule taint (not in excludedTaints), so it was evicted.

Configure advanced parameters

Configure global behavior and template settings with a ConfigMap.

Advanced configuration example

This ConfigMap uses DeschedulerConfiguration for global settings, enables RemovePodsViolatingNodeTaints as the descheduling policy, and uses MigrationController as the evictor.

# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    dryRun: false # The global read-only mode. When enabled, koord-descheduler does not migrate any pods.
    deschedulingInterval: 120s # The interval at which the descheduler runs. The interval is set to 120 seconds in this example.
    nodeSelector: # The nodes that are involved in descheduling. By default, all nodes are included.
      matchLabels:
        alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements.
    maxNoOfPodsToEvictPerNode: 10 # The maximum number of pods that can be evicted from a node. The limit takes effect on a global scale. By default, no limit is configured.
    maxNoOfPodsToEvictPerNamespace: 10 # The maximum number of pods that can be evicted from a namespace. The limit takes effect on a global scale. By default, no limit is configured.
    # The preceding configuration is the system configuration.

    # The template list.
    profiles:
    - name: koord-descheduler # The name of the template.

      # Scope: apply this template only to the specified nodes.
      # Method 1: Select nodes in one node pool.
      nodeSelector:
        matchLabels:
          alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements
      # Method 2: Select nodes in multiple node pools.
      # nodeSelector:
      #   matchExpressions:
      #   - key: alibabacloud.com/nodepool-id
      #     operator: In
      #     values:
      #     - nodepool-1
      #     - nodepool-2

      plugins:
        deschedule: # All plug-ins are disabled by default. Specify the ones to enable.
          enabled:
            - name: RemovePodsViolatingNodeTaints  # Enable the node taint verification plug-in.
        balance: # All plug-ins are disabled by default.
          disabled:
            - name: "*" # Disable all Balance plug-ins.
        evict:
          enabled:
            - name: MigrationController # MigrationController is enabled by default.
        filter:
          enabled:
            - name: MigrationController # Use MigrationController's filtering policy by default.

      pluginConfig:
      - name: RemovePodsViolatingNodeTaints
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore taints whose key is deschedule and whose value is not-allow.
          - reserved # Ignore taints whose key is reserved.
          includePreferNoSchedule: false # When false, only checks NoSchedule taints.
          namespaces:
            include: # Restrict descheduling to these namespaces.
              - "namespace1"
              - "namespace2"
            # exclude: # Alternatively, exclude these namespaces.
            #   - "namespace1"
            #   - "namespace2"
          labelSelector: # Only deschedule pods matching these labels.
            accelerator: nvidia-tesla-p100

      - name: MigrationController
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly
          evictLocalStoragePods: false # When false, pods using emptyDir or hostPath are not descheduled.
          maxMigratingPerNode: 1 # Maximum pods migrated simultaneously on a node.
          maxMigratingPerNamespace: 1  # Maximum pods migrated simultaneously in a namespace.
          maxMigratingPerWorkload: 1 # Maximum pods migrated simultaneously in a workload.
          maxUnavailablePerWorkload: 2 # Maximum unavailable replicated pods allowed in a workload.
          objectLimiters:
            workload: # Throttle workload-level migration. Default: only 1 pod per workload within 5 minutes.
              duration: 5m
              maxMigrating: 1
          evictionPolicy: Eviction # Use the Eviction API by default.
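
The objectLimiters.workload setting in this example (duration: 5m, maxMigrating: 1) can be read as a budget of migrations per time window. The sketch below models that idea as a simple sliding window. This is an assumption made for illustration: the component's actual limiter may be implemented differently, and the window_allows helper is hypothetical.

```shell
#!/bin/sh
# Sketch: time-window throttle modeled on objectLimiters.workload.
# ASSUMPTION: a simple sliding window; the component's real limiter may differ.
# Usage: window_allows <now> <window-seconds> <max> <past-migration-times>...
# Returns 0 if fewer than <max> past migrations fall inside the window.
window_allows() {
  local now="$1" window="$2" max="$3" t count=0
  shift 3
  for t in "$@"; do
    if [ $(( now - t )) -lt "$window" ]; then
      count=$(( count + 1 ))
    fi
  done
  [ "$count" -lt "$max" ]
}

# duration: 5m (300 seconds), maxMigrating: 1, one earlier migration at t=900.
if window_allows 1000 300 1 900; then echo "migrate"; else echo "wait"; fi   # prints wait
if window_allows 1300 300 1 900; then echo "migrate"; else echo "wait"; fi   # prints migrate
```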

System configurations

Configure global, system-level behavior in DeschedulerConfiguration.

Parameter	Type	Valid value	Description	Example
`dryRun`	boolean	true / false (default: false)	Read-only mode. When enabled, no pods are migrated.	false
`deschedulingInterval`	time.Duration	>0s	How often the descheduler runs.	120s
`nodeSelector`	Structure	—	Limit which nodes are eligible for descheduling. Accepts `matchLabels` (one node pool) or `matchExpressions` (multiple node pools). See Kubernetes labelSelector.	See example YAML above
`maxNoOfPodsToEvictPerNode`	int	≥0 (default: 0)	Maximum pods evicted from a single node per descheduling cycle. 0 means no limit.	10
`maxNoOfPodsToEvictPerNamespace`	int	≥0 (default: 0)	Maximum pods evicted from a single namespace per descheduling cycle. 0 means no limit.	10

Template configurations

Each template (an entry in the profiles list) groups descheduling policies and evictors. A template has the following fields:

`name`: Template identifier.
`plugins`: Enables or disables descheduling policies (deschedule, balance), evictors (evict), and pre-eviction filters (filter).
`pluginConfig`: Per-plug-in arguments. Match the name field to the plug-in name and configure args. See Configure policy plug-ins and Configure evictor plug-ins.
`nodeSelector`: Limits the template to specific nodes. If unset, applies to all nodes.

Template-level nodeSelector requires ack-koordinator v1.6.1-ack.1.16 or later.

`plugins` field reference:

Field	Supported plug-ins	Description
`deschedule`	`RemovePodsViolatingNodeTaints`, `RemovePodsViolatingInterPodAntiAffinity`, `RemovePodsViolatingNodeAffinity`, `RemovePodsHavingTooManyRestarts`, `PodLifeTime`, `RemoveFailedPod`	All disabled by default. Specify plug-ins to enable.
`balance`	`RemoveDuplicates`, `LowNodeUtilization`, `HighNodeUtilization`, `RemovePodsViolatingTopologySpreadConstraint`, `LowNodeLoad`	All disabled by default. Specify plug-ins to enable.
`evict`	`MigrationController`, `DefaultEvictor`	The pod evictor. `MigrationController` is enabled by default. Do not enable multiple `evict` plug-ins simultaneously.
`filter`	`MigrationController`, `DefaultEvictor`	Pre-eviction filtering policy. `MigrationController` is enabled by default. Do not enable multiple `filter` plug-ins simultaneously.

Configure policy plug-ins

Koordinator Descheduler supports six Deschedule plug-ins and five Balance plug-ins. The plug-ins in the following table come from Kubernetes Descheduler. The fifth Balance plug-in, LowNodeLoad, is provided by Koordinator; see Work with load-aware hotspot descheduling.

Policy type	Plug-in	Description
Deschedule	RemovePodsViolatingInterPodAntiAffinity	Evicts pods that violate inter-pod anti-affinity rules.
Deschedule	RemovePodsViolatingNodeAffinity	Evicts pods that no longer satisfy node affinity rules.
Deschedule	RemovePodsViolatingNodeTaints	Evicts pods that cannot tolerate node taints.
Deschedule	RemovePodsHavingTooManyRestarts	Evicts pods that restart too frequently.
Deschedule	PodLifeTime	Evicts pods whose TTL has expired.
Deschedule	RemoveFailedPod	Evicts pods in the Failed state.
Balance	RemoveDuplicates	Spreads replicated pods evenly across nodes.
Balance	LowNodeUtilization	Redistributes pods based on node resource allocation.
Balance	HighNodeUtilization	Consolidates pods from underutilized nodes to more utilized ones.
Balance	RemovePodsViolatingTopologySpreadConstraint	Evicts pods that violate topology spread constraints.

Configure evictor plug-ins

Koordinator Descheduler supports two evictor plug-ins: DefaultEvictor and MigrationController.

MigrationController

MigrationController provides fine-grained eviction control and observability through migration jobs.

Parameter	Type	Valid value	Description	Example
`evictLocalStoragePods`	boolean	true / false (default: false)	When false, pods using `emptyDir` or `hostPath` are not descheduled.	false
`maxMigratingPerNode`	int64	≥0 (default: 2)	Maximum pods migrated simultaneously on a node. 0 means no limit.	2
`maxMigratingPerNamespace`	int64	≥0 (default: 0)	Maximum pods migrated simultaneously in a namespace. 0 means no limit.	1
`maxMigratingPerWorkload`	intOrString	≥0 (default: 10%)	Maximum pods or percentage migrated simultaneously in a workload. 0 means no limit. If a workload has only one pod, it is excluded from descheduling.	1 or 10%
`maxUnavailablePerWorkload`	intOrString	≥0 and < replica count (default: 10%)	Maximum unavailable replicated pods allowed in a workload. 0 means no limit.	1 or 10%
`objectLimiters.workload`	Structure	`duration` >0 (default: 5m); `maxMigrating` ≥0 (default: 10%)	Throttles workload-level migration within a time window. `duration` sets the window length. `maxMigrating` sets the maximum pods migrated within that window.	`duration: 5m` `maxMigrating: 1`
`evictionPolicy`	string	`Eviction` (default), `Delete`, `Soft`	Controls how pods are evicted. `Eviction`: calls the Eviction API for graceful eviction. `Delete`: calls the Delete API. `Soft`: adds the `scheduling.koordinator.sh/soft-eviction` annotation for custom downstream handling.	Eviction
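
Several of these parameters accept either an integer or a percentage, and 0 means no limit. The following sketch shows one way to reason about such a limit for a given replica count. The resolve_limit and can_migrate helpers are hypothetical, and rounding percentages up is an assumption; the table does not state how the component rounds.

```shell
#!/bin/sh
# Sketch: resolve an int-or-percent limit (such as maxMigratingPerWorkload)
# against a workload's replica count.
# ASSUMPTION: percentages round up; the component's actual rounding may differ.
resolve_limit() {
  local limit="$1" replicas="$2" pct
  case "$limit" in
    *%)
      pct="${limit%\%}"
      echo $(( (replicas * pct + 99) / 100 ))
      ;;
    *)
      echo "$limit"
      ;;
  esac
}

# Returns 0 if one more migration fits under the limit; a limit of 0 means no limit.
can_migrate() {
  local in_flight="$1" limit="$2"
  [ "$limit" -eq 0 ] || [ "$in_flight" -lt "$limit" ]
}

resolve_limit 10% 25   # prints 3 under the round-up assumption
resolve_limit 1 25     # prints 1
```

For example, with 25 replicas and a resolved limit of 3, can_migrate 2 3 succeeds and can_migrate 3 3 fails.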

DefaultEvictor

DefaultEvictor is the standard Kubernetes Descheduler evictor. See DefaultEvictor for configuration.

MigrationController vs. DefaultEvictor

Capability	DefaultEvictor	MigrationController
Eviction methods	Eviction API only	Eviction API, Delete API, or Soft annotation
Per-node eviction limit	Supported	Supported
Per-namespace eviction limit	Supported	Supported
Per-workload eviction limit	Not supported	Supported
Per-workload unavailability limit	Not supported	Supported
Eviction throttling	Not supported	Time window-based throttling per workload
Eviction observability	Component logs only	Component logs and Kubernetes events with per-pod migration status

Next steps

Descheduling — concepts, features, and workflow.
Using the community Kubernetes Descheduler? See Koordinator Descheduler and Kubernetes Descheduler for differences and migration.
Configure load-aware descheduling with LowNodeLoad: Work with load-aware hotspot descheduling.
Analyze cluster resource usage and get cost-saving recommendations: Cost Insight.
Troubleshooting: Scheduling FAQs.
Release notes and component overview: ack-koordinator (ack-slo-manager).