ack-koordinator can dynamically overcommit resources. It monitors the load of a node in real time and then schedules resources that are allocated to pods but are not in use. This topic describes how to use the dynamic resource overcommitment feature.

Prerequisites

  • Only Container Service for Kubernetes (ACK) Pro clusters support the dynamic resource overcommitment feature. For more information, see Create an ACK Pro cluster.
  • ack-koordinator (formerly known as ack-slo-manager) is installed. For more information, see ack-koordinator.

Background information

In Kubernetes, the kubelet manages the resources that are used by the pods on a node based on the quality of service (QoS) classes of the pods. For example, the kubelet controls the out of memory (OOM) priorities. The QoS class of a pod can be Guaranteed, Burstable, or BestEffort. The QoS classes of pods depend on the requests and limits of CPU and memory resources that are configured for the pods.
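As a reminder of how these classes are assigned, the following sketch shows the request and limit patterns that produce each QoS class (pod and container names are illustrative):

```yaml
# Guaranteed: every container sets requests equal to limits for CPU and memory.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed  # illustrative name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests: {cpu: "500m", memory: "256Mi"}
      limits: {cpu: "500m", memory: "256Mi"}
# Burstable: at least one container sets requests lower than limits,
# or sets requests without limits.
# BestEffort: no container sets any requests or limits.
```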

To improve the stability of applications, application administrators reserve resources for Guaranteed or Burstable pods. The reserved resources are used to handle fluctuating workloads. In most cases, the resource request of a pod is much higher than the actual resource utilization. To improve the resource utilization in a cluster, application administrators may provision BestEffort pods. These pods can share the resources that are allocated to other pods but are not in use. This mechanism is known as resource overcommitment. Resource overcommitment has the following disadvantages:
  • BestEffort pods do not have resource requests or limits. As a result, even if a node is overloaded, the system can still schedule BestEffort pods to the node.
  • You cannot guarantee that resources are fairly scheduled among BestEffort pods due to the lack of requests and limits that specify the amount of resources used by a pod.
[Figure: line graph of node resource usage, showing the Usage, Buffered, and Reclaimed categories]

You can use the Service Level Objective (SLO) capability of ACK to control the resources that are used by BestEffort pods. In the preceding line graph, the SLO of ACK classifies the resources into three categories: Usage, Buffered, and Reclaimed. Usage refers to the actual resource usage and is represented by the red line. Buffered refers to reserved resources and is represented by the area between the blue line and red line. Reclaimed refers to reclaimed resources and is represented by the area in green.

Reclaimed resources are resources that can be dynamically overcommitted, as shown in the following figure. ack-koordinator monitors the loads of a node and synchronizes resource statistics to the node metadata as extended resources in real time. To allow BestEffort pods to use reclaimed resources, you can configure requests and limits of reclaimed resources for the BestEffort pods. In addition, you can configure settings that are related to reclaimed resources in the node configuration. This ensures that resources are fairly scheduled among BestEffort pods.

To differentiate reclaimed resources from regular resources, ack-koordinator assigns the Batch priority to reclaimed resources, including batch-cpu and batch-memory.

[Figure: dynamic resource overcommitment]

Limits

Component          Required version
Kubernetes         ≥ 1.18
ack-koordinator    ≥ 0.8.0
Helm               ≥ 3.0

Procedure

  1. Run the following command to query the total amount of Batch resources on a node.
    Make sure that the relevant parameters are configured before you query the total amount of reclaimed resources. For more information, see the description in Step 3.
    # Replace $nodeName with the name of the node that you want to query. 
    kubectl get node $nodeName -o yaml
    Expected output:
    # Node metadata.
    status:
      allocatable:
        # Unit: millicores. In the following example, 50 cores can be allocated. 
        kubernetes.io/batch-cpu: 50000
        # Unit: bytes. In the following example, 50 GB of memory can be allocated. 
        kubernetes.io/batch-memory: 53687091200
  2. Create a pod and apply for reclaimed resources. Add a label to the pod to specify the QoS class of the pod and specify the Batch resource request and Batch resource limit. This way, the pod can use reclaimed resources.
    #Pod
    metadata:
      labels:
        # Required. Set the QoS class of the pod to BestEffort. 
        koordinator.sh/qosClass: "BE"
    spec:
      containers:
      - resources:
          requests:
            # Unit: millicores. In the following example, the CPU request is set to one core. 
            kubernetes.io/batch-cpu: "1k"
            # Unit: bytes. In the following example, the memory request is set to 1 GB. 
            kubernetes.io/batch-memory: "1Gi"
          limits:
            kubernetes.io/batch-cpu: "1k"
            kubernetes.io/batch-memory: "1Gi"
    When you apply for Batch resources, take note of the following items:
    • If you provision a pod by using a Deployment or another type of workload, you only need to modify the YAML template based on the format in the preceding code block. A pod cannot apply for reclaimed resources and regular resources at the same time.
    • The amount of reclaimed resources on a node is calculated based on the loads of the node in real time. If the kubelet fails to synchronize the most recent statistics about reclaimed resources to the node metadata, the kubelet may reject the request for reclaimed resources. If the request is rejected, you can delete the pod that sends the request.
    • You must set the amount of extended resources to an integer in Kubernetes clusters. The unit of batch-cpu resources is millicores.
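Because extended resources must be specified as integers, it can help to compute the millicore and byte values programmatically before writing them into a pod spec. The helpers below are illustrative conversions, not part of ack-koordinator:

```python
def to_millicores(cores: float) -> int:
    """Convert a core count to the integer millicore value used by batch-cpu."""
    millicores = cores * 1000
    if millicores != int(millicores):
        raise ValueError("batch-cpu must be an integer number of millicores")
    return int(millicores)

def gib_to_bytes(gib: float) -> int:
    """Convert GiB to the byte value used by batch-memory."""
    return int(gib * 1024 ** 3)

print(to_millicores(1))   # 1000, i.e. the "1k" request in the example above
print(gib_to_bytes(50))   # 53687091200, the node-level value shown in step 1
```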
  3. Manage resources that are dynamically overcommitted.
    The amount of Batch resources on a node is calculated based on the actual resource utilization. You can use the following formula to calculate the amount of Batch CPU resources and the amount of Batch memory resources:
    nodeBatchAllocatable = nodeAllocatable * thresholdPercent - podUsage(non-BE) - systemUsage
    The following section describes the factors in the formula:
    • nodeAllocatable: the amount of allocatable resources on the node.
    • thresholdPercent: the resource threshold, as a percentage.
    • podUsage(non-BE): the resource usage of pods whose QoS classes are Burstable or Guaranteed.
    • systemUsage: the usage of system resources on the node.
    ack-koordinator can calculate reclaimed memory resources based on the following formula and the resource requests of pods. For more information, see the memoryCalculatePolicy parameter in the following section. In the following formula, podRequest(non-BE) refers to the resource requests of pods whose QoS classes are Burstable or Guaranteed.
    nodeBatchAllocatable = nodeAllocatable * thresholdPercent - podRequest(non-BE) - systemUsage
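The two formulas differ only in whether the usage or the requests of non-BestEffort pods are subtracted, so they can be sketched as one function (the numbers below are illustrative, not taken from a real node):

```python
def batch_allocatable(node_allocatable: float, threshold_percent: float,
                      non_be_consumed: float, system_usage: float) -> float:
    """nodeBatchAllocatable = nodeAllocatable * thresholdPercent - <non-BE usage or request> - systemUsage

    Pass the usage of Guaranteed/Burstable pods for the "usage" policy,
    or their requests for the "request" policy.
    """
    return node_allocatable * threshold_percent / 100 - non_be_consumed - system_usage

# A node with 100 allocatable cores, a 65% CPU reclaim threshold,
# 10 cores used by Guaranteed/Burstable pods, and 5 cores of system usage:
print(batch_allocatable(100, 65, 10, 5))  # 50.0 cores of batch-cpu
```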
    The thresholdPercent factor is configurable. The following code block shows how to manage resources by modifying a ConfigMap:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ack-slo-config
      namespace: kube-system
    data:
      colocation-config: |
        {
          "enable": true,
          "metricAggregateDurationSeconds": 60,
          "cpuReclaimThresholdPercent": 60,
          "memoryReclaimThresholdPercent": 70,
          "memoryCalculatePolicy": "usage"
        }
    The following list describes the parameters:
    • enable (Boolean): Specifies whether to dynamically update the statistics about Batch resources. If you disable this feature, the amount of reclaimed resources is reset to 0. Default value: false.
    • metricAggregateDurationSeconds (Int): The minimum interval at which the statistics about Batch resources are updated. Unit: seconds. Default value: 60. We recommend that you use the default setting.
    • cpuReclaimThresholdPercent (Int): The reclaim threshold of batch-cpu resources, as a percentage. Default value: 65.
    • memoryReclaimThresholdPercent (Int): The reclaim threshold of batch-memory resources, as a percentage. Default value: 65.
    • memoryCalculatePolicy (String): The policy for calculating the amount of batch-memory resources. Valid values:
      • "usage": The amount of batch-memory resources is calculated based on the actual memory usage of pods whose QoS classes are Burstable or Guaranteed. If this policy is used, the batch-memory resources include resources that are not allocated and resources that are allocated but are not in use. This is the default value.
      • "request": The amount of batch-memory resources is calculated based on the memory requests of pods whose QoS classes are Burstable or Guaranteed. If this policy is used, the batch-memory resources include only resources that are not allocated.
    Note ack-koordinator provides features that are used to limit the resource usage of BestEffort pods and evict BestEffort pods. You can use these features to eliminate the negative impact of BestEffort pods on your business. For more information, see Elastic resource limit, Memory QoS for containers, and Resource isolation based on the L3 cache and MBA.
  4. Check whether the ack-slo-config ConfigMap exists in the kube-system namespace.
    • If the ack-slo-config ConfigMap exists, we recommend that you run the kubectl patch command to update the ConfigMap. This avoids changing other settings in the ConfigMap.
      kubectl patch cm -n kube-system ack-slo-config --patch "$(cat configmap.yaml)"
    • If the ack-slo-config ConfigMap does not exist, run the following command to create a ConfigMap named ack-slo-config:
      kubectl apply -f configmap.yaml
  5. Optional. View the usage of Batch resources in Prometheus.

    If this is the first time you use Prometheus dashboards, reset the dashboards and install the Dynamic Resource Overcommitment dashboard. For more information about how to reset Prometheus dashboards, see Reset dashboards.

    To view details about the usage of Batch resources on the Prometheus Monitoring page of the ACK console, perform the following steps:

    1. Log on to the ACK console.
    2. In the left-side navigation pane of the ACK console, click Clusters.
    3. On the Clusters page, find the cluster that you want to manage and click its name or click Details in the Actions column.
    4. In the left-side navigation pane of the cluster details page, choose Operations > Prometheus Monitoring.
    5. On the Prometheus Monitoring page, click the Dynamic Resource Overcommitment tab.

      On the Dynamic Resource Overcommitment tab, you can view details about the Batch resources. The details include the total amount of Batch resources provided by each node, the total amount of Batch resources provided by the cluster, the amount of Batch resources requested by the containers on each node, and the total amount of Batch resources requested by the containers in the cluster. For more information, see Enable ARMS Prometheus.

    # The amount of allocatable batch-cpu resources on the node. 
    koordlet_node_resource_allocatable{resource="kubernetes.io/batch-cpu",node="$node"}
    # The amount of batch-cpu resources that are allocated on the node. 
    koordlet_container_resource_requests{resource="kubernetes.io/batch-cpu",node="$node"}
    # The amount of allocatable batch-memory resources on the node. 
    koordlet_node_resource_allocatable{resource="kubernetes.io/batch-memory",node="$node"}
    # The amount of batch-memory resources that are allocated on the node. 
    koordlet_container_resource_requests{resource="kubernetes.io/batch-memory",node="$node"}
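If you want to chart how much of the Batch capacity on each node is already requested, a ratio along the following lines may work (this expression is an assumption built from the metrics above; verify the metric and label names in your Prometheus setup):

```
# Fraction of batch-cpu on each node that is already requested by containers.
sum by (node) (koordlet_container_resource_requests{resource="kubernetes.io/batch-cpu"})
  / on (node)
sum by (node) (koordlet_node_resource_allocatable{resource="kubernetes.io/batch-cpu"})
```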

Examples

  1. Run the following command to query the total amount of reclaimed resources on the node.
    Make sure that the relevant parameters are configured before you query the total amount of reclaimed resources. For more information, see the description in Step 3.
    kubectl get node $nodeName -o yaml

    Expected output:

    # The node metadata.
    status:
      allocatable:
        # Unit: millicores. In the following example, 50 cores can be allocated. 
        kubernetes.io/batch-cpu: 50000
        # Unit: bytes. In the following example, 50 GB of memory can be allocated. 
        kubernetes.io/batch-memory: 53687091200
  2. Create a YAML file named be-pod-demo.yaml based on the following content:
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        koordinator.sh/qosClass: "BE"
      name: be-demo
    spec:
      containers:
      - command:
        - "sleep"
        - "100h"
        image: polinux/stress
        imagePullPolicy: Always
        name: be-demo
        resources:
          limits:
            kubernetes.io/batch-cpu: "50k"
            kubernetes.io/batch-memory: "10Gi"
          requests:
            kubernetes.io/batch-cpu: "50k"
            kubernetes.io/batch-memory: "10Gi"
      schedulerName: default-scheduler
  3. Run the following command to deploy be-pod-demo:
    kubectl apply -f be-pod-demo.yaml
  4. Check whether the resource limits of the BestEffort pod take effect in the cgroup of the node.
    1. Run the following command to query the CPU limit:
      cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8_042d_471c_b6ef_b7e0686a****.slice/cri-containerd-11111c202adfefdd63d7d002ccde8907d08291e706671438c4ccedfecba5****.scope/cpu.cfs_quota_us

      Expected output:

      # The CPU limit in the cgroup is set to 50 cores. 
      5000000
    2. Run the following command to query the memory limit:
      cat /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod4b6e96c8_042d_471c_b6ef_b7e0686a****.slice/cri-containerd-11111c202adfefdd63d7d002ccde8907d08291e706671438c4ccedfecba5****.scope/memory.limit_in_bytes

      Expected output:

      # The memory limit in the cgroup is set to 10 GB. 
      10737418240
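The cgroup values map back to the pod's Batch limits by simple arithmetic, which you can verify as follows:

```python
CFS_PERIOD_US = 100_000  # default CFS period used by the kubelet, in microseconds

cfs_quota_us = 5_000_000  # value read from cpu.cfs_quota_us
print(cfs_quota_us / CFS_PERIOD_US)  # 50.0 -> 50 cores, matching the "50k" batch-cpu limit

memory_limit_bytes = 10_737_418_240  # value read from memory.limit_in_bytes
print(memory_limit_bytes / 1024 ** 3)  # 10.0 -> 10 GiB, matching the "10Gi" batch-memory limit
```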

FAQ

Is resource overcommitment that was enabled by using the earlier version of the ack-slo-manager protocol still supported after I upgrade from ack-slo-manager to ack-koordinator?

The earlier version of the ack-slo-manager protocol includes the following components:

  • The alibabacloud.com/qosClass pod annotation.
  • The alibabacloud.com/reclaimed field that is used to specify the resource requests and limits of pods.

ack-koordinator is compatible with the earlier version of the ack-slo-manager protocol. The ACK Pro scheduler can calculate the amount of requested resources and the amount of available resources for both the earlier protocol version and the new protocol version. You can seamlessly upgrade from ack-slo-manager to ack-koordinator.
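For reference, a pod that uses the earlier protocol version looks roughly like the following. The exact resource names under the alibabacloud.com/reclaimed field are an assumption and should be checked against your existing workloads:

```yaml
metadata:
  annotations:
    # Earlier protocol: QoS class set through an annotation.
    alibabacloud.com/qosClass: "BE"
spec:
  containers:
  - resources:
      requests:
        # Assumed resource names for the earlier reclaimed-resource fields.
        alibabacloud.com/reclaimed-cpu: "1k"
        alibabacloud.com/reclaimed-memory: "1Gi"
      limits:
        alibabacloud.com/reclaimed-cpu: "1k"
        alibabacloud.com/reclaimed-memory: "1Gi"
```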

Note ack-koordinator is compatible with the earlier protocol version until July 30, 2023. We recommend that you upgrade the resource parameters of the earlier protocol version to the latest version.

The following table describes the compatibility between the ACK Pro scheduler, ack-koordinator, and different types of protocols.

ACK scheduler version           ack-koordinator (ack-slo-manager)  alibabacloud.com protocol  koordinator.sh protocol
≥ 1.18 and < 1.22.15-ack-2.0    ≥ 0.3.0                            Supported                  Not supported
≥ 1.22.15-ack-2.0               ≥ 0.8.0                            Supported                  Supported