The descheduler component evicts pods whose current placement no longer matches the scheduling policies of the cluster so that the scheduler can reschedule them to more suitable nodes. This helps avoid resource waste and improves resource utilization in Container Service for Kubernetes (ACK) clusters. This topic describes how to install, configure, and use the descheduler.

Prerequisites

An ACK cluster is created, and you can connect to the cluster by using kubectl.

Install ack-descheduler

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, choose Marketplace > App Catalog.
  3. On the App Catalog page, click the Alibaba Cloud Apps tab. Find and click ack-descheduler.
    A large number of applications are displayed on the Alibaba Cloud Apps tab. To quickly find the component, enter ack-descheduler in the Name search box. You can also enter a keyword to perform a fuzzy match.
  4. In the Deploy section on the right side of the View Details page, select the cluster in which you want to deploy the application, and then click Create.
    After ack-descheduler is installed, you are redirected to the ack-descheduler-default release page, and a CronJob is automatically created in the kube-system namespace. By default, the CronJob runs every 2 minutes. If all the related resources are displayed on the release page, the component is installed.
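    To confirm that the CronJob was created, you can also list the CronJobs in the kube-system namespace. This is only a quick check; the exact name of the CronJob depends on the Helm release.

    kubectl get cronjob -n kube-system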

Use ack-descheduler to optimize pod scheduling

  1. Check the DeschedulerPolicy setting of the ack-descheduler-default ConfigMap.
    kubectl describe cm ack-descheduler-default -n kube-system

    Expected output:

    Name:         ack-descheduler-default
    Namespace:    kube-system
    Labels:       app.kubernetes.io/instance=descheduler
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=descheduler
                  app.kubernetes.io/version=0.20.0
                  helm.sh/chart=descheduler-0.20.0
    Annotations:  meta.helm.sh/release-name: descheduler
                  meta.helm.sh/release-namespace: kube-system
    Data
    ====
    policy.yaml:
    ----
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":  
         enabled: true
      "RemovePodsViolatingInterPodAntiAffinity": 
         enabled: true
      "LowNodeUtilization": 
         enabled: true
         params:
           nodeResourceUtilizationThresholds:
             thresholds:
               "cpu" : 20
               "memory": 20
               "pods": 20
             targetThresholds:
               "cpu" : 50
               "memory": 50
               "pods": 50
      "RemovePodsHavingTooManyRestarts":
         enabled: true
         params:
           podsHavingTooManyRestarts:
             podRestartThreshold: 100
             includingInitContainers: true
    Events:  <none>

    The following table describes the scheduling policies returned in the preceding output. For more information about the policy settings in the strategies section, see Descheduler.

    Policy Description
    RemoveDuplicates This policy ensures that at most one pod that belongs to the same ReplicaSet, ReplicationController, StatefulSet, or Job runs on a node. Duplicate pods are evicted so that they can be rescheduled to other nodes.
    RemovePodsViolatingInterPodAntiAffinity This policy evicts pods that violate inter-pod anti-affinity rules.
    LowNodeUtilization This policy finds underutilized nodes and evicts pods from overutilized nodes so that the evicted pods can be recreated on the underutilized nodes. The parameters of this policy are configured in the nodeResourceUtilizationThresholds section. With the preceding settings, a node is considered underutilized if its CPU, memory, and pod usage are all below 20%, and overutilized if the usage of any of these resources exceeds 50%.
    RemovePodsHavingTooManyRestarts This policy evicts pods whose number of restarts exceeds the threshold specified by podRestartThreshold. In the preceding output, the threshold is 100 and restarts of init containers are counted.
  2. Verify pod scheduling before the scheduling policy is modified.
    1. Create a Deployment to test the scheduling.
      Create an nginx.yaml file and copy the following content to the file:
      apiVersion: apps/v1 # for versions before 1.8.0 use apps/v1beta1
      kind: Deployment
      metadata:
        name: nginx-deployment-basic
        labels:
          app: nginx
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: nginx
        template:
          metadata:
            labels:
              app: nginx
          spec:
            containers:
            - name: nginx
              image: nginx:1.7.9 # Replace with the image that you want to use. The value must be in the <image_name>:<tag> format. 
              ports:
              - containerPort: 80

      Run the following command to create a Deployment with the nginx.yaml file:

      kubectl apply -f nginx.yaml

      Expected output:

      deployment.apps/nginx-deployment-basic created
    2. Wait 2 minutes for the descheduler CronJob to run. Then, run the following command to check the nodes to which the pods are scheduled:
      kubectl get pod -o wide | grep nginx

      Expected output:

      NAME                          READY   STATUS     RESTARTS   AGE    IP               NODE                         NOMINATED NODE   READINESS GATES
      nginx-deployment-basic-**1    1/1     Running    0          36s    172.25.XXX.XX1   cn-hangzhou.172.16.XXX.XX2   <none>           <none>
      nginx-deployment-basic-**2    1/1     Running    0          11s    172.25.XXX.XX2   cn-hangzhou.172.16.XXX.XX3   <none>           <none>
      nginx-deployment-basic-**3    1/1     Running    0          36s    172.25.XXX.XX3   cn-hangzhou.172.16.XXX.XX3   <none>           <none>

      The output shows that pod nginx-deployment-basic-**2 and pod nginx-deployment-basic-**3 are scheduled to the same node cn-hangzhou.172.16.XXX.XX3.

      Note If you use the default settings of the ack-descheduler-default ConfigMap, the scheduling result varies based on the actual conditions of the cluster.
  3. Modify the scheduling policy.
    If multiple scheduling policies are enabled at the same time, they may interact and produce unexpected scheduling results. To prevent this issue in this example, modify the ConfigMap described in Step 1 so that only the RemoveDuplicates policy is retained.
    Note The RemoveDuplicates policy ensures that pods managed by replication controllers are evenly distributed to different nodes.
    In this example, the modified ConfigMap is saved to a file named newPolicy.yaml. The file contains the following content:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ack-descheduler-default
      namespace: kube-system
      labels:
        app.kubernetes.io/instance: descheduler
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: descheduler
        app.kubernetes.io/version: 0.20.0
        helm.sh/chart: descheduler-0.20.0  
      annotations:
        meta.helm.sh/release-name: descheduler
        meta.helm.sh/release-namespace: kube-system
    data: 
      policy.yaml: |-
        apiVersion: "descheduler/v1alpha1"
        kind: "DeschedulerPolicy"
        strategies:
          "RemoveDuplicates": # Retain only the RemoveDuplicates policy. 
             enabled: true
  4. Verify pod scheduling after the scheduling policy is modified.
    1. Run the following command to apply the new scheduling policy:
      kubectl apply -f newPolicy.yaml

      Expected output:

      configmap/ack-descheduler-default configured
    2. Wait 2 minutes for the descheduler CronJob to run. Then, run the following command to check the nodes to which the pods are scheduled:
      kubectl get pod -o wide | grep nginx

      Expected output:

      NAME                          READY   STATUS     RESTARTS   AGE      IP               NODE                         NOMINATED NODE   READINESS GATES
      nginx-deployment-basic-**1    1/1     Running    0          8m26s    172.25.XXX.XX1   cn-hangzhou.172.16.XXX.XX2   <none>           <none>
      nginx-deployment-basic-**2    1/1     Running    0          8m1s     172.25.XXX.XX2   cn-hangzhou.172.16.XXX.XX1   <none>           <none>
      nginx-deployment-basic-**3    1/1     Running    0          8m26s    172.25.XXX.XX3   cn-hangzhou.172.16.XXX.XX3   <none>           <none>

      The output shows that pod nginx-deployment-basic-**2 is rescheduled to node cn-hangzhou.172.16.XXX.XX1 by the descheduler. Each of the three test pods now runs on a different node, which balances pod scheduling among the nodes.
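
      To confirm that the rescheduling was performed by the descheduler, you can inspect the logs of the most recent descheduler Job. The following commands are only a sketch: they assume that the Jobs created by the CronJob run in the kube-system namespace, and <descheduler_job_name> is a placeholder that you must replace with the name of the latest Job in your cluster.

      kubectl get jobs -n kube-system
      kubectl logs -n kube-system job/<descheduler_job_name>

      The log typically contains an entry for each pod that the descheduler evicted. If you do not want to wait for the next scheduled run, you can manually trigger the descheduler by creating a Job from the CronJob, for example: kubectl create job -n kube-system --from=cronjob/<descheduler_cronjob_name> descheduler-manual-run. The placeholder <descheduler_cronjob_name> also varies; you can obtain the CronJob name from the output of kubectl get cronjob -n kube-system.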