负载感知调度是ACK基于Scheduling Framework实现感知节点实际资源负载的调度策略。调度过程中,通过参考节点负载的历史统计,将Pod优先调度到负载较低的节点,实现节点负载均衡的目标,避免出现因单个节点负载过高而导致的应用程序或节点故障。本文介绍如何使用负载感知调度。

前提条件

  • 本功能仅适应于ACK Pro版集群,关于如何创建ACK Pro版集群,请参见创建ACK Pro版集群
  • 系统组件版本要求具体如下表所示。
    组件 版本要求
    Kubernetes ≥v1.18
    Helm ≥v3.0
    Docker v19.03.5

使用方式

通过添加Annotations来标识是否开启负载感知调度。

      annotations:
        alibabacloud.com/loadAwareScheduleEnabled: "true" # 开启负载感知调度。

安装ack-slo-manager组件

  1. 登录容器服务管理控制台
  2. 在控制台左侧导航栏中,选择市场 > 应用目录
  3. 应用目录页面,单击阿里云应用页签,在名称文本框中搜索ack-slo-manager
    5
  4. ack-slo-manager区域单击,在详情页面的右下角单击创建
  5. 创建完成后,执行以下命令检查是否成功部署ack-slo-manager
    helm get manifest ack-slo-manager -n kube-system | kubectl get -n kube-system -f -

    预期输出:

    NAME                               DATA   AGE
    configmap/ack-slo-manager-config   1      74s
    
    NAME                                SECRETS   AGE
    serviceaccount/ack-slo-controller   1         74s
    serviceaccount/ack-slo-agent        1         74s
    
    NAME                                                                               CREATED AT
    customresourcedefinition.apiextensions.k8s.io/nodemetrics.nodes.alibabacloud.com   2021-08-06T06:54:11Z
    customresourcedefinition.apiextensions.k8s.io/nodeslos.nodes.alibabacloud.com      2021-08-06T06:54:11Z
    
    NAME                                                               CREATED AT
    clusterrole.rbac.authorization.k8s.io/ack-slo-manager-cluserrole   2021-08-06T06:54:11Z
    
    NAME                                                                                 ROLE                                     AGE
    clusterrolebinding.rbac.authorization.k8s.io/ack-slo-agent-custom-rolebinding        ClusterRole/ack-slo-manager-cluserrole   74s
    clusterrolebinding.rbac.authorization.k8s.io/ack-slo-controller-custom-rolebinding   ClusterRole/ack-slo-manager-cluserrole   74s
    
    NAME                                                                 CREATED AT
    role.rbac.authorization.k8s.io/ack-slo-manager-role:leaderelection   2021-08-06T06:54:11Z
    
    NAME                                                                                      ROLE                                       AGE
    rolebinding.rbac.authorization.k8s.io/ack-slo-manager-custom-rolebinding:leaderelection   Role/ack-slo-manager-role:leaderelection   74s
    
    NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    daemonset.apps/ack-slo-agent   6         6         6       6            6           <none>          74s
    
    NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/ack-slo-controller   1/1     1            1           74s

    如果输出信息与以上信息类似,说明您已成功部署ack-slo-manager

使用负载感知调度

本文示例的集群有3台4核8 GiB的节点。

  1. 使用以下YAML文件,创建一个视频转码作业。
    apiVersion: batch/v1
    kind: Job
    metadata:
      generateName: ffmpeg
      namespace: default
    spec:
      backoffLimit: 1
      parallelism: 1
      template:
        metadata:
        spec:
          containers:
          - image: registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/ffmpeg:latest
            name: ffmpeg
            resources:
              requests:
                cpu: 1
                memory: 2Gi
              limits:
                cpu: 2
                memory: 4Gi
            command: ["/bin/sh"]
            args: ["-c","while true;do ffmpeg -i /root/mp4/statefulset.mp4 -c:v libx264 -crf 100 -threads 200 destination.flv -y;sleep 1;done"]
          restartPolicy: OnFailure
  2. 执行以下命令,检查Pod的运行和部署的情况。
    # 镜像较大,pod启动大概需要2min
    kubectl get pods -o wide

    预期输出:

    NAME                READY   STATUS    RESTARTS   AGE    IP             NODE                   NOMINATED NODE   READINESS GATES
    ffmpegjjpv2-xcl6t   1/1     Running   0          2m6s   172.20.177.8   cn-beijing.10.0.0.46   <none>           <none>

    从上述输出可知,该作业的Pod运行在节点cn-beijing.10.0.0.46上。

  3. 执行以下命令,查看该集群所有节点的负载详情。
    kubectl top node

    预期输出:

    NAME                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    cn-beijing.10.0.0.44   251m         6%     1531Mi          22%
    cn-beijing.10.0.0.45   352m         8%     1881Mi          27%
    cn-beijing.10.0.0.46   1802m        45%    1689Mi          25%

    从上述输出可知,此时节点cn-beijing.10.0.0.46的负载最高。

  4. 使用以下YAML文件,创建没有开启负载感知调度的Pod。
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
      namespace: default
      labels:
        app: nginx
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          name: nginx
          labels:
            app: nginx
        spec:
          containers:
          - name: nginx
            image: nginx
            resources:
              limits:
                cpu: 500m
              requests:
                cpu: 500m
  5. 执行以下命令,查看Pod调度详情。
    kubectl get pods | grep nginx

    预期输出:

    nginx-6976ccb95-96xmf   1/1     Running   0          108s    172.20.177.10    cn-beijing.10.0.0.46   <none>           <none>
    nginx-6976ccb95-b6xtp   1/1     Running   0          108s    172.20.177.70    cn-beijing.10.0.0.44   <none>           <none>
    nginx-6976ccb95-hmqcw   1/1     Running   0          108s    172.20.177.69    cn-beijing.10.0.0.44   <none>           <none>
    nginx-6976ccb95-l2gnn   1/1     Running   0          108s    172.20.177.9     cn-beijing.10.0.0.46   <none>           <none>
    nginx-6976ccb95-rzsfc   1/1     Running   0          108s    172.20.176.200   cn-beijing.10.0.0.45   <none>           <none>
    nginx-6976ccb95-shv52   1/1     Running   0          108s    172.20.177.71    cn-beijing.10.0.0.44   <none>           <none>

    从上述输出可知,由于集群没有开启负载感知调度功能,不能感知节点负载,Pod被均衡调度到集群的所有节点上。

  6. 执行以下命令,删除没有开启负载感知调度的Pod。
    kubectl delete deployment nginx
  7. 使用以下YAML文件,创建开启负载感知调度的Pod。
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
      namespace: default
      labels:
        app: nginx
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          name: nginx
          labels:
            app: nginx
          annotations:
            alibabacloud.com/loadAwareScheduleEnabled: "true" # 开启负载感知调度。
        spec:
          containers:
          - name: nginx
            image: nginx
            resources:
              limits:
                cpu: 500m
              requests:
                cpu: 500m
  8. 执行以下命令,查看Pod调度详情。
    kubectl get pods | grep nginx

    预期输出:

    nginx-7759585d4c-hfblh   1/1     Running   0          72s     172.20.177.74    cn-beijing.10.0.0.44   <none>           <none>
    nginx-7759585d4c-hzzt9   1/1     Running   0          72s     172.20.176.203   cn-beijing.10.0.0.45   <none>           <none>
    nginx-7759585d4c-m4gmm   1/1     Running   0          72s     172.20.177.72    cn-beijing.10.0.0.44   <none>           <none>
    nginx-7759585d4c-v7wfb   1/1     Running   0          72s     172.20.176.201   cn-beijing.10.0.0.45   <none>           <none>
    nginx-7759585d4c-wx59r   1/1     Running   0          72s     172.20.177.73    cn-beijing.10.0.0.44   <none>           <none>
    nginx-7759585d4c-xbwh6   1/1     Running   0          72s     172.20.176.202   cn-beijing.10.0.0.45   <none>           <none>

    从上述输出可知,由于集群开启了负载感知调度功能,能感知节点负载,通过运用调度策略,此时Pod被优先调度到除cn-beijing.10.0.0.46以外的节点上。