ack-slo-manager can dynamically overcommit resources. It monitors the load of a node in real time and makes resources that are allocated to pods but not in use available for scheduling. This topic describes how to use the dynamic resource overcommitment feature.

Prerequisites

  • Only Container Service for Kubernetes (ACK) Pro clusters support the dynamic resource overcommitment feature. For more information, see Create an ACK Pro cluster.
  • ack-slo-manager is installed in your cluster. For more information, see Install ack-slo-manager.

Background information

In Kubernetes, the kubelet manages the resources that are used by the pods on a node based on the quality of service (QoS) classes of the pods. For example, the kubelet controls the out of memory (OOM) priorities. The QoS class of a pod can be Guaranteed, Burstable, or BestEffort. The QoS classes of pods depend on the requests and limits of CPU and memory resources that are configured for the pods.
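
For example, the kubelet assigns the BestEffort class to a pod that declares no requests or limits, and the Guaranteed class to a pod whose requests equal its limits for every container. The following minimal sketch shows a BestEffort pod; the pod name and image are illustrative:

#Pod
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
spec:
  containers:
  - name: app
    image: nginx
    #No resources field is set, so the QoS class of this pod is BestEffort.
    #If requests and limits were set and equal for all containers, the class would be Guaranteed.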

To improve the stability of applications, application administrators reserve resources for Guaranteed or Burstable pods. The reserved resources are used to handle fluctuating workloads. In most cases, the resource request of a pod is much higher than the actual resource utilization. To improve the resource utilization in a cluster, application administrators may provision BestEffort pods. These pods can share the resources that are allocated to other pods but are not in use. This mechanism is known as resource overcommitment. Resource overcommitment has the following disadvantages:
  • BestEffort pods do not have resource requests or limits. As a result, even if a node is overloaded, the system can still schedule BestEffort pods to the node.
  • Resources cannot be fairly allocated among BestEffort pods because these pods do not declare requests and limits that bound the amount of resources each pod can use.
(Figure: line graph of node resource usage, classified into Usage, Buffered, and Reclaimed)

You can use the Service Level Objective (SLO) capability of ACK to control the resources that are used by BestEffort pods. In the preceding line graph, the SLO of ACK classifies the resources into three categories: Usage, Buffered, and Reclaimed. Usage refers to the actual resource usage and is represented by the red line. Buffered refers to reserved resources and is represented by the area between the blue line and red line. Reclaimed refers to reclaimed resources and is represented by the area in green.

Reclaimed resources are resources that can be dynamically overcommitted, as shown in the following figure. ack-slo-manager monitors the load of a node and synchronizes resource statistics to the node metadata as extended resources in real time. To allow BestEffort pods to use reclaimed resources, you can configure requests and limits of reclaimed resources for the BestEffort pods. In addition, you can configure settings that are related to reclaimed resources in the node configuration. This ensures that resources are fairly scheduled among BestEffort pods.

(Figure: Dynamic resource overcommitment)

Limits

Component         Required version
Kubernetes        ≥ 1.18
ack-slo-manager   ≥ 0.3.0
Helm              ≥ 3.0
OS                Alibaba Cloud Linux 2, CentOS 7.6, and CentOS 7.7
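
To check your environment against these requirements, you can run commands similar to the following. This assumes that ack-slo-manager was installed as a Helm release in the kube-system namespace; adjust the namespace and release name to match your installation.

#Check the Kubernetes version of the cluster. 
kubectl version
#Check the Helm version. 
helm version --short
#Check the installed version of the ack-slo-manager Helm release. 
helm list -n kube-system | grep ack-slo-manager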

Procedure

  1. Run the following command to query the total amount of reclaimed resources:
    Make sure that the relevant parameters are configured before you query the total amount of reclaimed resources. For more information, see the description in Step 3.
    #Replace $nodeName with the name of the node that you want to query. 
    kubectl get node $nodeName -o yaml
    Expected output:
    #Node
    status:
      allocatable:
        #Unit: millicores. In the following example, 50 cores can be allocated. 
        alibabacloud.com/reclaimed-cpu: 50000
        #Unit: bytes. In the following example, 50 GB of memory can be allocated. 
        alibabacloud.com/reclaimed-memory: 53687091200
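    If you only need the extended resource values, you can instead extract them with a JSONPath expression. Note that the dots in the resource name must be escaped:
    #Query only the allocatable reclaimed resources. Replace $nodeName with the node name. 
    kubectl get node $nodeName -o jsonpath='{.status.allocatable.alibabacloud\.com/reclaimed-cpu}'
    kubectl get node $nodeName -o jsonpath='{.status.allocatable.alibabacloud\.com/reclaimed-memory}'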
  2. Create a pod and apply for the reclaimed resources. Add an annotation to specify the QoS class of the pod, and add requests and limits for the reclaimed resources. This way, the pod can use overcommitted resources. The following code block shows an example:
    #Pod
    metadata:
      annotations:
        #Required. Set the QoS class of the pod to BestEffort. 
        alibabacloud.com/qosClass: "BE"
    spec:
      containers:
      - resources:
          requests:
            #Unit: millicores. In the following example, the CPU request is set to one core. 
            alibabacloud.com/reclaimed-cpu: "1k"
            #Unit: bytes. In the following example, the memory request is set to 1 GB. 
            alibabacloud.com/reclaimed-memory: "1Gi"
          limits:
            alibabacloud.com/reclaimed-cpu: "1k"
            alibabacloud.com/reclaimed-memory: "1Gi"
    When you apply for reclaimed resources, take note of the following items:
    • If you provision a pod by using a Deployment or another type of workload, you only need to modify the pod template based on the format in the preceding code block, as shown in the sketch after this list. A pod cannot apply for reclaimed resources and regular resources at the same time.
    • The amount of reclaimed resources on a node is calculated based on the load of the node in real time. If the kubelet fails to synchronize the most recent statistics about reclaimed resources to the node metadata, the kubelet may reject the request for reclaimed resources. If the request is rejected, you can delete the pod that sent the request.
    • In Kubernetes, the amount of an extended resource must be an integer. This is why the unit of the reclaimed-cpu parameter is millicores.
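    The following sketch shows where the annotation and the reclaimed resources go when you use a Deployment. The Deployment name, labels, and image are illustrative:
    #Deployment
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: be-deploy-demo
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: be-deploy-demo
      template:
        metadata:
          labels:
            app: be-deploy-demo
          annotations:
            #Required. Set the QoS class of the pods to BestEffort. 
            alibabacloud.com/qosClass: "BE"
        spec:
          containers:
          - name: app
            image: nginx
            resources:
              requests:
                alibabacloud.com/reclaimed-cpu: "1k"
                alibabacloud.com/reclaimed-memory: "1Gi"
              limits:
                alibabacloud.com/reclaimed-cpu: "1k"
                alibabacloud.com/reclaimed-memory: "1Gi"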
  3. Manage resources that are dynamically overcommitted.
    The amount of reclaimed resources on a node is calculated based on the actual resource utilization. You can use the following formula to calculate the amount of reclaimed resources:

    reclaimed = nodeAllocatable * thresholdPercent - podUsage(non-BE) - systemUsage

    The following section describes the factors in the formula:
    • nodeAllocatable: the amount of allocatable resources on the node.
    • thresholdPercent: the resource threshold, expressed as a percentage.
    • podUsage(non-BE): the resource usage of pods whose QoS classes are Burstable or Guaranteed.
    • systemUsage: the usage of system resources on the node.
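    For example, if a node has 100 allocatable CPU cores, cpuReclaimThresholdPercent is set to 65, Burstable and Guaranteed pods use 30 cores, and system components use 5 cores, the amount of reclaimed CPU resources is 100 * 65% - 30 - 5 = 30 cores, which is advertised as 30000 millicores. The numbers are illustrative.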
    The thresholdPercent factor is configurable. The following code block shows how to manage resources by modifying a ConfigMap:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ack-slo-manager-config
      namespace: kube-system
    data:
      colocation-config: |
        {
          "enable": true,
          "metricAggregateDurationSeconds": 60,
          "cpuReclaimThresholdPercent": 60,
          "memoryReclaimThresholdPercent": 70
        }
    The following list describes the parameters:
    • enable (Boolean): Specifies whether to dynamically update the statistics about reclaimed resources. If you set this parameter to false, the amount of reclaimed resources is reset to 0. Default value: false.
    • metricAggregateDurationSeconds (Int): The minimum interval at which the statistics about reclaimed resources are updated. Unit: seconds. Default value: 60. We recommend that you use the default setting.
    • cpuReclaimThresholdPercent (Int): The threshold of reclaimed-cpu resources, expressed as a percentage. Default value: 65.
    • memoryReclaimThresholdPercent (Int): The threshold of reclaimed-memory resources, expressed as a percentage. Default value: 65.
    Note ack-slo-manager provides features that are used to limit the resource usage of BestEffort pods and evict BestEffort pods. You can use these features to prevent the negative impact of BestEffort pods on your business. For more information, see Elastic resource limit, Memory QoS, and Resource isolation based on the L3 cache and MBA. If you have questions or further requirements, Submit a ticket.
  4. Run the following command to update the ack-slo-manager-config ConfigMap.
    To avoid overwriting other settings in the ack-slo-manager-config ConfigMap, we recommend that you run the kubectl patch command. In the following command, configmap.yaml is a local file that contains the ConfigMap content from the preceding step.
    kubectl patch cm -n kube-system ack-slo-manager-config --patch "$(cat configmap.yaml)"
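    To confirm that the update took effect, you can print the ConfigMap:
    kubectl get cm -n kube-system ack-slo-manager-config -o yaml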
  5. Optional. View the usage of reclaimed resources in Prometheus.
    • If this is the first time you use Prometheus dashboards, reset the dashboards and install the Dynamic Resource Overcommitment dashboard. For more information about how to reset Prometheus dashboards, see Reset dashboards.

      To view the reclaimed resources on the Prometheus Monitoring page of the ACK console, perform the following steps:

      1. Log on to the ACK console.
      2. In the left-side navigation pane of the ACK console, click Clusters.
      3. On the Clusters page, find the cluster that you want to manage and click its name or click Details in the Actions column.
      4. In the left-side navigation pane of the cluster details page, choose Operations > Prometheus Monitoring.
      5. On the Prometheus Monitoring page, click the Dynamic Resource Overcommitment tab.

        On the Dynamic Resource Overcommitment tab, you can view details about the reclaimed resources, including the amount of reclaimed resources on each node, the amount of reclaimed resources requested by the containers on each node, and the corresponding totals for the cluster. For more information, see Enable ARMS Prometheus.

    • If you use a self-managed Prometheus monitoring system, you can use the metrics that are provided by kube-state-metrics to monitor the extended resources:
      #The amount of allocatable reclaimed-cpu resources on the node. 
      kube_node_status_allocatable{resource="alibabacloud_com_reclaimed_cpu",node="$node"}
      #The amount of reclaimed-cpu resources that are allocated on the node. 
      kube_pod_container_resource_requests{resource="alibabacloud_com_reclaimed_cpu",node="$node"}
      #The amount of allocatable reclaimed-memory resources on the node. 
      kube_node_status_allocatable{resource="alibabacloud_com_reclaimed_memory",node="$node"}
      #The amount of reclaimed-memory resources that are allocated on the node. 
      kube_pod_container_resource_requests{resource="alibabacloud_com_reclaimed_memory",node="$node"}
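      For example, to chart cluster-wide totals instead of per-node values, you can aggregate these metrics with sum(). This is a sketch based on the metric names listed above:
      #The total amount of allocatable reclaimed-cpu resources in the cluster. 
      sum(kube_node_status_allocatable{resource="alibabacloud_com_reclaimed_cpu"})
      #The total amount of reclaimed-cpu resources requested by containers in the cluster. 
      sum(kube_pod_container_resource_requests{resource="alibabacloud_com_reclaimed_cpu"})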

Examples

  1. Run the following command to query the total amount of reclaimed resources on the node:
    Make sure that the relevant parameters are configured before you query the total amount of reclaimed resources. For more information, see the description in Step 3.
    kubectl get node $nodeName -o yaml

    Expected output:

    #The node metadata.
    status:
      allocatable:
        #Unit: millicores. In the following example, 50 cores can be allocated. 
        alibabacloud.com/reclaimed-cpu: 50000
        #Unit: bytes. In the following example, 50 GB of memory can be allocated. 
        alibabacloud.com/reclaimed-memory: 53687091200
  2. Create a YAML file named be-pod-demo.yaml based on the following content:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        alibabacloud.com/qosClass: BE
      name: be-demo
    spec:
      containers:
      - command:
        - "sleep"
        - "100h"
        image: polinux/stress
        imagePullPolicy: Always
        name: be-demo
        resources:
          limits:
            alibabacloud.com/reclaimed-cpu: "50k"
            alibabacloud.com/reclaimed-memory: "10Gi"
          requests:
            alibabacloud.com/reclaimed-cpu: "50k"
            alibabacloud.com/reclaimed-memory: "10Gi"
      schedulerName: default-scheduler
  3. Run the following command to deploy be-pod-demo:
    kubectl apply -f be-pod-demo.yaml
  4. Check whether the resource limits of the BestEffort pod take effect in the cgroup of the node. For how to obtain the pod UID and container ID that appear in the cgroup paths, see the note after these sub-steps.
    1. Run the following command to query the CPU limit:
      cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8_042d_471c_b6ef_b7e0686a****.slice/cri-containerd-11111c202adfefdd63d7d002ccde8907d08291e706671438c4ccedfecba5****.scope/cpu.cfs_quota_us

      Expected output:

      #The CPU limit in the cgroup is set to 50 cores. 
      5000000
    2. Run the following command to query the memory limit:
      cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8_042d_471c_b6ef_b7e0686a****.slice/cri-containerd-11111c202adfefdd63d7d002ccde8907d08291e706671438c4ccedfecba5****.scope/memory.limit_in_bytes

      Expected output:

      #The memory limit in the cgroup is set to 10 GB. 
      10737418240
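    The masked segments in the preceding cgroup paths are derived from the pod UID and the container ID, and the exact path layout can vary with the OS and container runtime. You can look up both values for the be-demo pod as follows:
    #Query the UID of the pod. In the cgroup path, the dashes in the UID are replaced with underscores. 
    kubectl get pod be-demo -o jsonpath='{.metadata.uid}'
    #Query the container ID that appears in the cri-containerd-*.scope segment. 
    kubectl get pod be-demo -o jsonpath='{.status.containerStatuses[0].containerID}'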