Container Service for Kubernetes: Enable the descheduling feature

Last Updated: Mar 26, 2026

Descheduling evicts running pods that violate the configured descheduling rules so that they can be rescheduled onto suitable nodes. Because the scheduler does not re-evaluate a pod after placing it, use descheduling to fix pods that became mis-scheduled after their node's taints, affinity rules, or load profile changed.

This topic walks you through enabling descheduling via the ack-koordinator component, using the RemovePodsViolatingNodeTaints plug-in as an example.

Prerequisites

Before you begin, make sure you have:

Descheduling is not supported on virtual nodes.

Usage notes

  • Koordinator Descheduler only evicts running pods. It does not recreate or reschedule them. After eviction, the workload controller (such as a Deployment or StatefulSet) recreates the pod, which is then scheduled by the standard scheduler.

  • During descheduling, old pods are evicted before new pods are created. Make sure your application has enough replicas to maintain availability during eviction.

  • ack-descheduler is discontinued. If you are using it, migrate to Koordinator Descheduler. For more information, see How do I migrate from ack-descheduler to Koordinator Descheduler?
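
Because old pods are evicted before their replacements exist, one way to guard availability is a PodDisruptionBudget, which the Kubernetes Eviction API honors. The following is a minimal sketch, not part of the original procedure. The helper name protect_workload and its arguments are hypothetical, and the sketch assumes the descheduler evicts through the Eviction API; an eviction policy that deletes pods directly would bypass the budget.

```shell
# Hypothetical helper: create a PodDisruptionBudget so that Eviction API calls
# are rejected when they would drop the workload below a minimum pod count.
# Arguments (all illustrative): PDB name, namespace, label selector,
# minimum number of available pods.
protect_workload() {
  local name="$1" namespace="$2" selector="$3" min_available="$4"
  kubectl -n "$namespace" create poddisruptionbudget "$name" \
    --selector="$selector" \
    --min-available="$min_available"
}
```

For example, protect_workload my-app-pdb default app=my-app 2 would keep at least two pods labeled app=my-app available in the default namespace while evictions are in progress.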

Choose a descheduling plug-in

Select a plug-in based on the problem you want to solve:

| Scenario | Plug-in | Policy type |
| --- | --- | --- |
| Pods remain on nodes that acquired a NoSchedule taint after scheduling | RemovePodsViolatingNodeTaints | Deschedule |
| Pods violate inter-pod anti-affinity rules | RemovePodsViolatingInterPodAntiAffinity | Deschedule |
| Pods no longer satisfy node affinity rules | RemovePodsViolatingNodeAffinity | Deschedule |
| Pods restart too frequently | RemovePodsHavingTooManyRestarts | Deschedule |
| Pods have exceeded their time-to-live | PodLifeTime | Deschedule |
| Pods are in the Failed state | RemoveFailedPod | Deschedule |
| Replicated pods are unevenly spread across nodes | RemoveDuplicates | Balance |
| Nodes are unevenly utilized by resource allocation | LowNodeUtilization | Balance |
| Pods violate topology spread constraints | RemovePodsViolatingTopologySpreadConstraint | Balance |
| Nodes are overloaded by actual resource utilization | LowNodeLoad | Balance |

This topic uses RemovePodsViolatingNodeTaints as the example. Read the descheduling concepts and Koordinator Descheduler vs. Kubernetes Descheduler before getting started.

How it works

The RemovePodsViolatingNodeTaints plug-in checks every node for NoSchedule taints at the configured interval. If a running pod lacks a toleration for a node's NoSchedule taint, the plug-in evicts the pod. The workload controller then recreates the pod, and the scheduler places it on a node the pod can tolerate.

Use the excludedTaints field to exempt specific taints from this check. If a taint's key, or its key=value pair, matches an entry in excludedTaints, the plug-in ignores that taint.

Example scenario used in this topic:

A three-node cluster has a Deployment with one pod on each node. An administrator adds NoSchedule taints to two nodes after the pods are already running:

  • Node A gets deschedule=not-allow:NoSchedule. Because deschedule=not-allow is in excludedTaints, this taint is ignored — the pod stays.

  • Node B gets deschedule=allow:NoSchedule. This taint is not excluded — the pod is evicted and rescheduled to Node C (which has no NoSchedule taint).
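
The excludedTaints matching rule described above can be modeled in a few lines of shell. This is a simplified sketch for illustration only, not the plug-in's actual code: the function name taint_is_excluded is hypothetical, taints are written as key=value without the effect, and pod tolerations are not modeled.

```shell
# Simplified model of the excludedTaints check (not the plug-in's actual code).
# $1 is a node taint written as key=value; the remaining arguments are
# excludedTaints entries. Returns 0 (excluded) when an entry equals the
# taint's key or its full key=value pair.
taint_is_excluded() {
  local taint="$1"; shift
  local key="${taint%%=*}"
  local entry
  for entry in "$@"; do
    if [ "$entry" = "$key" ] || [ "$entry" = "$taint" ]; then
      return 0
    fi
  done
  return 1
}

# The scenario above: Node A's taint is excluded, Node B's taint is not.
for taint in deschedule=not-allow deschedule=allow; do
  if taint_is_excluded "$taint" deschedule=not-allow; then
    echo "$taint: ignored, pod stays"
  else
    echo "$taint: pod without a matching toleration is evicted"
  fi
done
```

In this model, listing only the key deschedule in excludedTaints would exempt both taints, because a key-only entry matches every value of that key.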

Step 1: Install ack-koordinator and enable descheduling

If ack-koordinator is already installed, make sure the version is 1.2.0-ack.2 or later before proceeding.
  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the target cluster and click its name. In the left navigation pane, click Add-ons.

  3. Find ack-koordinator and click Install in the lower-right corner.

  4. In the Install dialog box, select Enable Descheduler for ACK-Koordinator, then configure and install the component as prompted.

Koordinator Descheduler is deployed as a Deployment on cluster nodes.
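
To confirm that the component is up after installation, you can compare the Deployment's ready replicas with its desired replicas. The sketch below assumes the Deployment is named ack-koord-descheduler in the kube-system namespace, which is the name used by the restart commands in Step 2. The helper name descheduler_ready is hypothetical.

```shell
# Hypothetical helper: succeed only when every desired replica of the
# descheduler Deployment is ready.
descheduler_ready() {
  local desired ready
  desired="$(kubectl -n kube-system get deploy ack-koord-descheduler -o jsonpath='{.spec.replicas}')"
  ready="$(kubectl -n kube-system get deploy ack-koord-descheduler -o jsonpath='{.status.readyReplicas}')"
  # readyReplicas is omitted from the status when no replica is ready, so default to 0.
  [ "${ready:-0}" -ge 1 ] && [ "${ready:-0}" -eq "${desired:-0}" ]
}
```

A call such as descheduler_ready && echo "descheduler is running" prints the message only when the Deployment is fully ready.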

Step 2: Enable the RemovePodsViolatingNodeTaints plug-in

Configure the plug-in

Create a file named koord-descheduler-config.yaml with the following content. This ConfigMap enables RemovePodsViolatingNodeTaints and configures it to ignore the deschedule=not-allow taint.

# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    deschedulingInterval: 120s # The interval at which the descheduler runs. Set to 120 seconds here.
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations.
    # The preceding configuration is the system configuration.

    profiles:
    - name: koord-descheduler
      plugins:
        deschedule:
          enabled:
            - name: RemovePodsViolatingNodeTaints  # Enable the node taint verification plug-in.

      pluginConfig:
      - name: RemovePodsViolatingNodeTaints # Configure the node taint verification plug-in.
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore taints whose key is deschedule and whose value is not-allow.

      # Required for RemovePodsViolatingNodeTaints to take effect. Do not remove.
      - name: MigrationController # Configure the migration controller.
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly

RemovePodsViolatingNodeTaints parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| excludedTaints | list(string) | | Taint keys or key=value pairs to ignore. Pods on nodes with these taints are not evicted. |
| includePreferNoSchedule | bool | false | When true, also checks taints with effect PreferNoSchedule, not just NoSchedule. |
| namespaces.include | list(string) | | Restrict descheduling to specific namespaces. Mutually exclusive with namespaces.exclude. |
| namespaces.exclude | list(string) | | Skip descheduling in specific namespaces. Mutually exclusive with namespaces.include. |
| labelSelector | map | | Restrict descheduling to pods that match the specified labels. |

Apply the configuration

  1. Apply the ConfigMap to the cluster:

    kubectl apply -f koord-descheduler-config.yaml
  2. Restart Koordinator Descheduler to pick up the new configuration:

    kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
    # Expected output:
    # deployment.apps/ack-koord-descheduler scaled
    kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
    # Expected output:
    # deployment.apps/ack-koord-descheduler scaled
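
After scaling back to one replica, you may want to wait until the Deployment is available again before moving on. The following sketch wraps kubectl rollout status for that purpose and then prints the ConfigMap applied in the previous step so you can review it. The helper name wait_for_descheduler is hypothetical, and the 120-second timeout is an arbitrary illustrative value.

```shell
# Hypothetical helper: block until the restarted descheduler Deployment is
# available again, then print the ConfigMap applied in the previous step.
# The 120s timeout is an arbitrary illustrative value.
wait_for_descheduler() {
  kubectl -n kube-system rollout status deploy ack-koord-descheduler --timeout=120s &&
    kubectl -n kube-system get configmap koord-descheduler-config -o yaml
}
```

If the rollout does not complete within the timeout, the function returns a non-zero status and skips printing the ConfigMap.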

Step 3: Verify descheduling

This example uses a three-node cluster.

  1. Create a file named stress-demo.yaml with the following content:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: stress-demo
      namespace: default
      labels:
        app: stress-demo
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: stress-demo
      template:
        metadata:
          name: stress-demo
          labels:
            app: stress-demo
        spec:
          containers:
            - args:
                - '--vm'
                - '2'
                - '--vm-bytes'
                - '1600M'
                - '-c'
                - '2'
                - '--vm-hang'
                - '2'
              command:
                - stress
              image: registry-cn-beijing.ack.aliyuncs.com/acs/stress:v1.0.4
              imagePullPolicy: Always
              name: stress
              resources:
                limits:
                  cpu: '2'
                  memory: 4Gi
                requests:
                  cpu: '2'
                  memory: 4Gi
          restartPolicy: Always
  2. Deploy the test workload:

    kubectl create -f stress-demo.yaml
  3. Wait for the pods to reach the Running state:

    kubectl get pod -o wide

    Expected output:

    NAME                         READY   STATUS    RESTARTS   AGE    IP              NODE                        NOMINATED NODE   READINESS GATES
    stress-demo-5f6cddf9-9****   1/1     Running   0          10s    192.XX.XX.27   cn-beijing.192.XX.XX.247   <none>           <none>
    stress-demo-5f6cddf9-h****   1/1     Running   0          10s    192.XX.XX.20   cn-beijing.192.XX.XX.249   <none>           <none>
    stress-demo-5f6cddf9-v****   1/1     Running   0          10s    192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
  4. Add NoSchedule taints to two nodes:

    • Add deschedule=not-allow:NoSchedule to cn-beijing.192.XX.XX.247 (excluded by excludedTaints — pod should stay):

      kubectl taint nodes cn-beijing.192.XX.XX.247 deschedule=not-allow:NoSchedule

      Expected output:

      node/cn-beijing.192.XX.XX.247 tainted
    • Add deschedule=allow:NoSchedule to cn-beijing.192.XX.XX.248 (not excluded — pod should be evicted):

      kubectl taint nodes cn-beijing.192.XX.XX.248 deschedule=allow:NoSchedule

      Expected output:

      node/cn-beijing.192.XX.XX.248 tainted
  5. Watch the pod changes. The descheduler checks taints at each deschedulingInterval (120 seconds):

    kubectl get pod -o wide -w

    Expected output:

    NAME                         READY   STATUS              RESTARTS   AGE     IP             NODE                    NOMINATED NODE   READINESS GATES
    stress-demo-5f6cddf9-9****   1/1     Running             0          5m34s   192.XX.XX.27   cn-beijing.192.XX.XX.247   <none>           <none>
    stress-demo-5f6cddf9-h****   1/1     Running             0          5m34s   192.XX.XX.20   cn-beijing.192.XX.XX.249   <none>           <none>
    stress-demo-5f6cddf9-v****   1/1     Running             0          5m34s   192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
    stress-demo-5f6cddf9-v****   1/1     Terminating         0          7m58s   192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
    stress-demo-5f6cddf9-j****   0/1     ContainerCreating   0          0s      <none>         cn-beijing.192.XX.XX.249   <none>           <none>
    stress-demo-5f6cddf9-j****   1/1     Running             0          2s      192.XX.XX.32   cn-beijing.192.XX.XX.249   <none>           <none>

    The output confirms:

    • The pod on cn-beijing.192.XX.XX.248 (taint deschedule=allow:NoSchedule, not excluded) is evicted.

    • The pod on cn-beijing.192.XX.XX.247 (taint deschedule=not-allow:NoSchedule, excluded) stays running.

    • The evicted pod is rescheduled to cn-beijing.192.XX.XX.249, which has no NoSchedule taint.

  6. Check the eviction events for the evicted pod:

    kubectl get event | grep stress-demo-5f6cddf9-v****

    Expected output:

    3m24s       Normal    Evicting            podmigrationjob/b0fba65f-7fab-4a99-96a9-c71a3798****   Pod "default/stress-demo-5f6cddf9-v****" evicted from node "cn-beijing.192.XX.XX.248" by the reason "RemovePodsViolatingNodeTaints"
    2m51s       Normal    EvictComplete       podmigrationjob/b0fba65f-7fab-4a99-96a9-c71a3798****   Pod "default/stress-demo-5f6cddf9-v****" has been evicted
    3m24s       Normal    Descheduled         pod/stress-demo-5f6cddf9-v****                         Pod evicted from node "cn-beijing.192.XX.XX.248" by the reason "RemovePodsViolatingNodeTaints"
    3m24s       Normal    Killing             pod/stress-demo-5f6cddf9-v****                         Stopping container stress

    Each event corresponds to a phase in the migration lifecycle:

    | Event | Source | Meaning |
    | --- | --- | --- |
    | Evicting | PodMigrationJob | The descheduler has decided to evict the pod and started the migration job. |
    | Descheduled | Pod | The pod received the eviction signal. |
    | Killing | Pod | The container runtime is stopping the container. |
    | EvictComplete | PodMigrationJob | The pod has been fully evicted. The workload controller will recreate it. |

    The pod on cn-beijing.192.XX.XX.248 had no toleration for the deschedule=allow:NoSchedule taint (which is not in excludedTaints), so it was evicted. The result matches expectations.
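
Once verification is done, the test taints and workload can be removed. The following cleanup sketch is not part of the original procedure: the helper name cleanup_descheduling_demo is hypothetical, and it assumes stress-demo.yaml is still in the current directory. It relies on the standard kubectl convention that a trailing hyphen after a taint removes it.

```shell
# Hypothetical helper: undo the verification steps.
# $1 = node that received deschedule=not-allow:NoSchedule
# $2 = node that received deschedule=allow:NoSchedule
cleanup_descheduling_demo() {
  local excluded_node="$1" evicting_node="$2"
  # A trailing "-" after the taint removes it from the node.
  kubectl taint nodes "$excluded_node" deschedule=not-allow:NoSchedule-
  kubectl taint nodes "$evicting_node" deschedule=allow:NoSchedule-
  # Delete the test Deployment created from stress-demo.yaml.
  kubectl delete -f stress-demo.yaml
}
```

With the node names from this example, the call would be cleanup_descheduling_demo cn-beijing.192.XX.XX.247 cn-beijing.192.XX.XX.248.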

Configure advanced parameters

Use a ConfigMap to configure global behavior and fine-grained template settings for Koordinator Descheduler.

Example of advanced configurations

The following example shows a fully configured Koordinator Descheduler ConfigMap. It uses DeschedulerConfiguration for global settings, enables RemovePodsViolatingNodeTaints as the descheduling policy, and uses MigrationController as the evictor.

# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations.
    deschedulingInterval: 120s # The interval at which the descheduler runs. The interval is set to 120 seconds in this example.
    nodeSelector: # The nodes that are involved in descheduling. By default, all nodes are included.
      matchLabels:
        alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements.
    maxNoOfPodsToEvictPerNode: 10 # The maximum number of pods that can be evicted from a node. The limit takes effect on a global scale. By default, no limit is configured.
    maxNoOfPodsToEvictPerNamespace: 10 # The maximum number of pods that can be evicted from a namespace. The limit takes effect on a global scale. By default, no limit is configured.
    # The preceding configuration is the system configuration.

    # The template list.
    profiles:
    - name: koord-descheduler # The name of the template.

      # Scope: apply this template only to the specified nodes.
      # Method 1: Select nodes in one node pool.
      nodeSelector:
        matchLabels:
          alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements
      # Method 2: Select nodes in multiple node pools.
      # nodeSelector:
      #   matchExpressions:
      #   - key: alibabacloud.com/nodepool-id
      #     operator: In
      #     values:
      #     - nodepool-1
      #     - nodepool-2

      plugins:
        deschedule: # All plug-ins are disabled by default. Specify the ones to enable.
          enabled:
            - name: RemovePodsViolatingNodeTaints  # Enable the node taint verification plug-in.
        balance: # All plug-ins are disabled by default.
          disabled:
            - name: "*" # Disable all Balance plug-ins.
        evict:
          enabled:
            - name: MigrationController # MigrationController is enabled by default.
        filter:
          enabled:
            - name: MigrationController # Use MigrationController's filtering policy by default.

      pluginConfig:
      - name: RemovePodsViolatingNodeTaints
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore taints whose key is deschedule and whose value is not-allow.
          - reserved # Ignore taints whose key is reserved, regardless of value.
          includePreferNoSchedule: false # When false, only checks NoSchedule taints.
          namespaces:
            include: # Restrict descheduling to these namespaces.
              - "namespace1"
              - "namespace2"
            # exclude: # Alternatively, exclude these namespaces.
            #   - "namespace1"
            #   - "namespace2"
          labelSelector: # Only deschedule pods matching these labels.
            accelerator: nvidia-tesla-p100

      - name: MigrationController
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly
          evictLocalStoragePods: false # When false, pods using emptyDir or hostPath are not descheduled.
          maxMigratingPerNode: 1 # Maximum pods migrated simultaneously on a node.
          maxMigratingPerNamespace: 1  # Maximum pods migrated simultaneously in a namespace.
          maxMigratingPerWorkload: 1 # Maximum pods migrated simultaneously in a workload.
          maxUnavailablePerWorkload: 2 # Maximum unavailable replicated pods allowed in a workload.
          objectLimiters:
            workload: # Throttle workload-level migration. Default: only 1 pod per workload within 5 minutes.
              duration: 5m
              maxMigrating: 1
          evictionPolicy: Eviction # Use the Eviction API by default.

System configurations

Configure global, system-level behavior in DeschedulerConfiguration.

| Parameter | Type | Valid value | Description | Example |
| --- | --- | --- | --- | --- |
| dryRun | boolean | true / false (default: false) | Read-only mode. When enabled, no pods are migrated. | false |
| deschedulingInterval | time.Duration | >0s | How often the descheduler runs. | 120s |
| nodeSelector | Structure | | Limit which nodes are eligible for descheduling. Accepts matchLabels (one node pool) or matchExpressions (multiple node pools). See Kubernetes labelSelector. | See example YAML above |
| maxNoOfPodsToEvictPerNode | int | ≥0 (default: 0) | Maximum pods evicted from a single node per descheduling cycle. 0 means no limit. | 10 |
| maxNoOfPodsToEvictPerNamespace | int | ≥0 (default: 0) | Maximum pods evicted from a single namespace per descheduling cycle. 0 means no limit. | 10 |

Template configurations

Koordinator Descheduler uses templates (profiles) to group descheduling policies and evictors. Each template has the following fields:

  • `name`: A string identifier for the template.

  • `plugins`: Enables or disables descheduling policies (deschedule, balance), evictors (evict), and pre-eviction filters (filter).

  • `pluginConfig`: Per-plug-in advanced arguments. Set the name field to match the plug-in name and configure args. See Configure policy plug-ins and Configure evictor plug-ins.

  • `nodeSelector`: Limits the template scope to specific nodes. If not set, the template applies to all nodes.

The nodeSelector field in template configurations requires ack-koordinator v1.6.1-ack.1.16 or later.

`plugins` field reference:

| Field | Supported plug-ins | Description |
| --- | --- | --- |
| deschedule | RemovePodsViolatingNodeTaints, RemovePodsViolatingInterPodAntiAffinity, RemovePodsViolatingNodeAffinity, RemovePodsHavingTooManyRestarts, PodLifeTime, RemoveFailedPod | All disabled by default. Specify plug-ins to enable. |
| balance | RemoveDuplicates, LowNodeUtilization, HighNodeUtilization, RemovePodsViolatingTopologySpreadConstraint, LowNodeLoad | All disabled by default. Specify plug-ins to enable. |
| evict | MigrationController, DefaultEvictor | The pod evictor. MigrationController is enabled by default. Do not enable multiple evict plug-ins at the same time. |
| filter | MigrationController, DefaultEvictor | Pre-eviction filtering policy. MigrationController is enabled by default. Do not enable multiple filter plug-ins at the same time. |

Configure policy plug-ins

Koordinator Descheduler supports six Deschedule plug-ins and four Balance plug-ins from Kubernetes Descheduler, listed in the following table. A fifth Balance plug-in, LowNodeLoad, is provided by Koordinator. For more information, see Work with load-aware hotspot descheduling.

| Policy type | Plug-in | Description |
| --- | --- | --- |
| Deschedule | RemovePodsViolatingInterPodAntiAffinity | Evicts pods that violate inter-pod anti-affinity rules. |
| Deschedule | RemovePodsViolatingNodeAffinity | Evicts pods that no longer satisfy node affinity rules. |
| Deschedule | RemovePodsViolatingNodeTaints | Evicts pods that cannot tolerate node taints. |
| Deschedule | RemovePodsHavingTooManyRestarts | Evicts pods that restart too frequently. |
| Deschedule | PodLifeTime | Evicts pods whose TTL has expired. |
| Deschedule | RemoveFailedPod | Evicts pods in the Failed state. |
| Balance | RemoveDuplicates | Spreads replicated pods evenly across nodes. |
| Balance | LowNodeUtilization | Redistributes pods based on node resource allocation. |
| Balance | HighNodeUtilization | Consolidates pods from underutilized nodes to more utilized ones. |
| Balance | RemovePodsViolatingTopologySpreadConstraint | Evicts pods that violate topology spread constraints. |

Configure evictor plug-ins

Koordinator Descheduler supports two evictor plug-ins: DefaultEvictor and MigrationController.

MigrationController

MigrationController provides fine-grained eviction control and observability through migration jobs.

| Parameter | Type | Valid value | Description | Example |
| --- | --- | --- | --- | --- |
| evictLocalStoragePods | boolean | true / false (default: false) | When false, pods using emptyDir or hostPath are not descheduled. | false |
| maxMigratingPerNode | int64 | ≥0 (default: 2) | Maximum pods migrated simultaneously on a node. 0 means no limit. | 2 |
| maxMigratingPerNamespace | int64 | ≥0 (default: 0) | Maximum pods migrated simultaneously in a namespace. 0 means no limit. | 1 |
| maxMigratingPerWorkload | intOrString | ≥0 (default: 10%) | Maximum pods or percentage migrated simultaneously in a workload. 0 means no limit. If a workload has only one pod, it is excluded from descheduling. | 1 or 10% |
| maxUnavailablePerWorkload | intOrString | ≥0 and < replica count (default: 10%) | Maximum unavailable replicated pods allowed in a workload. 0 means no limit. | 1 or 10% |
| objectLimiters.workload | Structure | Duration >0 (default: 5m); MaxMigrating ≥0 (default: 10%) | Throttles workload-level migration within a time window. Duration sets the window length. MaxMigrating sets the maximum pods migrated within that window. | duration: 5m, maxMigrating: 1 |
| evictionPolicy | string | Eviction (default), Delete, Soft | Controls how pods are evicted. Eviction: calls the Eviction API for graceful eviction. Delete: calls the Delete API. Soft: adds the scheduling.koordinator.sh/soft-eviction annotation for custom downstream handling. | Eviction |

DefaultEvictor

DefaultEvictor is the standard Kubernetes Descheduler evictor. For configuration details, see DefaultEvictor.

MigrationController vs. DefaultEvictor

| Capability | DefaultEvictor | MigrationController |
| --- | --- | --- |
| Eviction methods | Eviction API only | Eviction API, Delete API, or Soft annotation |
| Per-node eviction limit | Supported | Supported |
| Per-namespace eviction limit | Supported | Supported |
| Per-workload eviction limit | Not supported | Supported |
| Per-workload unavailability limit | Not supported | Supported |
| Eviction throttling | Not supported | Time window-based throttling per workload |
| Eviction observability | Component logs only | Component logs and Kubernetes events with per-pod migration status |
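
The Kubernetes events that MigrationController emits can be filtered by reason instead of grepping for a pod name. The sketch below is an illustration under stated assumptions: the helper name migration_events is hypothetical, the reason values are the ones that appear in the sample event output in Step 3, and the events are assumed to be recorded in the namespace you query.

```shell
# Hypothetical helper: list events in a namespace whose reason matches one of
# the migration phases shown in the verification step, oldest first.
# $1 = namespace, $2 = event reason (for example Evicting, Descheduled, or EvictComplete).
migration_events() {
  local namespace="$1" reason="$2"
  kubectl -n "$namespace" get events \
    --field-selector "reason=$reason" \
    --sort-by=.lastTimestamp
}
```

For example, migration_events default EvictComplete lists completed evictions recorded in the default namespace.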
