Memory QoS for containers - Container Service for Kubernetes

The ack-koordinator component provides the memory quality of service (QoS) feature for containers. You can use this feature to optimize the performance of memory-sensitive applications while ensuring fair memory scheduling among containers. This topic describes how to enable the memory QoS feature for containers.

Background information

The following memory limits apply to containers:

The memory limit of the container. If the amount of memory that a container uses, including the page cache, is about to reach the memory limit of the container, the memory reclaim mechanism of the OS kernel is triggered. As a result, the application in the container may not be able to request or release memory resources as normal.
The memory limit of the node. If the memory limit of a container is greater than the memory request of the container, the container can overcommit memory resources. In this case, the available memory on the node may become insufficient. This causes the OS kernel to reclaim memory from containers. As a result, the performance of your application is downgraded. In extreme cases, the node cannot run as normal.

To improve the performance of applications and the stability of nodes, ACK provides the memory QoS feature for containers. To use this feature, you must use Alibaba Cloud Linux 2 as the node OS and install ack-koordinator. After you enable the memory QoS feature for a container, ack-koordinator automatically configures the memory control group (memcg) based on the configuration of the container. This helps you optimize the performance of memory-sensitive applications while ensuring fair memory scheduling on the node.

Prerequisites

A Container Service for Kubernetes (ACK) Pro cluster is created. Only ACK Pro clusters support the memory QoS feature. For more information, see Create an ACK Pro cluster.
ack-koordinator is installed. For more information, see ack-koordinator (ack-slo-manager).

Limits

The following table describes the versions of the system components that are required to enable the memory QoS feature for containers.

Component	Required version
Kubernetes	1.18 and later
ack-koordinator (ack-slo-manager)	0.8.0 and later
Helm	3.0 and later
Operating system	Alibaba Cloud Linux 2. For more information about the versions, see the following topics about kernel interfaces: Memcg backend asynchronous reclaim, Memcg QoS feature of the cgroup v1 interface, and Memcg global minimum watermark rating.

Billing

No fee is charged when you install and use the ack-koordinator component. However, fees may be charged in the following scenarios:

ack-koordinator is a non-managed component that occupies worker node resources after it is installed. You can specify the amount of resources requested by each module when you install the component.
By default, ack-koordinator exposes the monitoring metrics of features such as resource profiling and fine-grained scheduling as Prometheus metrics. If you enable Prometheus metrics for ack-koordinator and use Managed Service for Prometheus, these metrics are considered custom metrics and fees are charged for them. The fee depends on factors such as the size of your cluster and the number of applications. Before you enable Prometheus metrics, we recommend that you read the Billing topic of Managed Service for Prometheus to learn about the free quota and billing rules of custom metrics. For more information about how to monitor and manage resource usage, see Query the amount of observable data and bills.

Introduction

The amount of computing resources that can be used by an application in a Kubernetes cluster is limited by the resource requests and resource limits of the containers for the application. The memory request of a pod in the following figure is only used to schedule the pod to a node with sufficient memory resources. The memory limit of the pod is used to limit the amount of memory that the pod can use. memory.limit_in_bytes in the following figure indicates the upper limit of memory that can be used by a pod.

Request-Limit model

If the memory that a pod uses, including the page cache, approaches the memory limit of the pod, memcg-level direct memory reclaim is triggered for the pod and the processes in the pod are blocked. If the pod requests memory faster than memory is reclaimed, the OOMKilled error occurs and the memory that is used by the pod is released. To reduce the risk of triggering the OOMKilled error, you can increase the memory limit of the pod. However, if the sum of the memory limits of all pods on the node exceeds the physical memory of the node, the node is overcommitted. If a pod on an overcommitted node requests a large amount of memory, the available memory on the node may become insufficient. In this case, the memory that is used by other pods may be reclaimed, and the OOMKilled error may occur in other pods when they request memory. Swap is disabled by default in Kubernetes, so this reclaim degrades the performance of the applications that run in these pods. In the preceding scenarios, the behavior of one pod can adversely affect the memory used by other pods, even if those pods use less memory than they request. This puts application performance at risk.
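
The overcommitment condition described above can be checked with simple arithmetic. The following Python sketch is illustrative only: the function name and the pod limits are hypothetical and are not part of ack-koordinator or Kubernetes.

```python
GIB = 1024 ** 3


def is_overcommitted(pod_limits_bytes, node_memory_bytes):
    """Return True when the sum of pod memory limits exceeds node memory."""
    return sum(pod_limits_bytes) > node_memory_bytes


# Hypothetical node with 32 GiB of memory that runs three pods.
pod_limits = [6 * GIB, 16 * GIB, 12 * GIB]  # 34 GiB of limits in total
print(is_overcommitted(pod_limits, 32 * GIB))
```

A result of True means that the pods cannot all use their full memory limits at the same time, which is the situation in which node-level reclaim can affect pods that stay within their requests.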

ack-koordinator works together with Alibaba Cloud Linux 2 to enable the memory QoS feature for pods. ack-koordinator automatically configures the memcg based on the container configuration, and allows you to enable the memcg QoS feature, the memcg backend asynchronous reclaim feature, and the global minimum watermark rating feature for containers. This optimizes the performance of memory-sensitive applications while ensuring fair memory scheduling among containers. For more information, see Memcg QoS feature of the cgroup v1 interface, Memcg backend asynchronous reclaim, and Memcg global minimum watermark rating.

Memory QoS provides the following optimizations to improve the memory utilization of pods:

When the memory used by a pod is about to reach the memory limit of the pod, the memcg performs asynchronous reclaim for a specific amount of memory. This prevents the reclaim of all the memory that the pod uses and therefore minimizes the adverse impact on the application performance caused by direct memory reclaim.
Memory reclaim is performed in a fairer manner among pods. When the available memory on a node becomes insufficient, memory reclaim is first performed on pods that use more memory than their memory requests. This ensures sufficient memory on the node when a pod applies for a large amount of memory.
If the BestEffort pods on a node use more memory than their memory requests, the system prioritizes the memory requirements of Guaranteed pods and Burstable pods over the memory requirements of BestEffort pods.
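
The fairness rule in the second item can be pictured as an ordering of reclaim candidates. The following Python sketch is a conceptual model only and does not reproduce the kernel implementation. The `PodMemory` type, the function name, and the choice to order pods by the amount of memory above their requests are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PodMemory:
    name: str
    request_bytes: int
    usage_bytes: int


def reclaim_candidates(pods):
    """Order pods so that those above their memory request come first.

    Ordering by the size of the excess is an assumption of this sketch.
    """
    over = [p for p in pods if p.usage_bytes > p.request_bytes]
    within = [p for p in pods if p.usage_bytes <= p.request_bytes]
    over.sort(key=lambda p: p.usage_bytes - p.request_bytes, reverse=True)
    return [p.name for p in over + within]


pods = [
    PodMemory("a", request_bytes=2, usage_bytes=3),
    PodMemory("b", request_bytes=4, usage_bytes=1),
    PodMemory("c", request_bytes=1, usage_bytes=5),
]
print(reclaim_candidates(pods))
```

In this model, pod b stays at the end of the list because it uses less memory than it requests.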

If you enable service-level objective (SLO)-aware workload scheduling for a cluster, the system prioritizes the memory requirements of latency-sensitive (LS) pods over the memory requirements of other types of pods. This delays the reclaim of the memory used by LS pods. In the following figure, memory.limit_in_bytes indicates the upper limit of memory that can be used by a pod, memory.high indicates the memory throttling threshold, memory.wmark_high indicates the memory reclaim threshold, and memory.min indicates the minimum amount of memory that must be allocated to a pod.

Enable memory QoS

For more information about how to enable the memory QoS kernel feature of Alibaba Cloud Linux 2, see Overview of kernel features and interfaces.

Note

Kubernetes 1.22 introduced a native memory QoS feature. You can enable it by configuring kubelet. It allows you to specify the minimum amount of memory that must be allocated to a pod and to enable proactive memory throttling for a pod, which makes memory scheduling among pods fairer. The memory QoS feature provided by Kubernetes is an alpha feature, and supports only cgroups v2 and Linux kernel 4.15 and later versions. This feature is incompatible with cgroups v1. If you enable this feature on a node that uses cgroups v1, all containers on the node are adversely affected. The memory QoS feature provided by ACK is compatible with cgroups v1. In addition, ACK provides the memcg backend asynchronous reclaim feature, the memcg global minimum watermark rating feature, and SLO-awareness features.
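
Because the behavior described in this note depends on the cgroup version of the node, it can help to check which version a node uses. The following Python sketch relies on the fact that the cgroup v2 unified hierarchy exposes a cgroup.controllers file at its root. The function name and the default mount point /sys/fs/cgroup are assumptions, and the sketch treats every other layout, including hybrid setups, as v1.

```python
import os


def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 if cgroup.controllers exists under root, otherwise 1.

    A missing file is treated as cgroup v1, which is a simplification.
    """
    marker = os.path.join(root, "cgroup.controllers")
    return 2 if os.path.exists(marker) else 1


if __name__ == "__main__":
    print(f"cgroup v{cgroup_version()}")
```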

Procedure

When you enable memory QoS for the containers in a pod, the memcg is automatically configured based on the specified ratios and pod parameters. To enable memory QoS for the containers in a pod, perform the following steps:

Add the following annotations to enable memory QoS for the containers in a pod:

```
annotations:
  # To enable memory QoS for the containers in a pod, set the value to auto.
  koordinator.sh/memoryQOS: '{"policy": "auto"}'
  # To disable memory QoS for the containers in a pod, set the value to none.
  #koordinator.sh/memoryQOS: '{"policy": "none"}'
```
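
The annotation value is a JSON string, so tooling that generates pod manifests has to serialize it instead of passing a nested object. The following Python sketch shows one way to build the annotation. The helper name is hypothetical, and the accepted policy values are taken from the parameter table in this topic.

```python
import json

VALID_POLICIES = ("auto", "default", "none")


def memory_qos_annotation(policy):
    """Build the koordinator.sh/memoryQOS annotation for a pod manifest."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unsupported policy: {policy!r}")
    # The annotation value must be a JSON-encoded string, not a mapping.
    return {"koordinator.sh/memoryQOS": json.dumps({"policy": policy})}


print(memory_qos_annotation("auto"))
```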

Use a ConfigMap to enable memory QoS for all the containers in a cluster.
1. To enable memory QoS for all the containers in a cluster, create a file named configmap.yaml with the following content:
```
apiVersion: v1
data:
  resource-qos-config: |-
    {
      "clusterStrategy": {
        "lsClass": {
           "memoryQOS": {
             "enable": true
           }
         },
        "beClass": {
           "memoryQOS": {
             "enable": true
           }
         }
      }
    }
kind: ConfigMap
metadata:
  name: ack-slo-config
  namespace: kube-system
```
2. When you create a pod in the cluster, specify the QoS class of the pod. The pod uses the cluster-wide memory QoS settings.
```
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
  labels:
    koordinator.sh/qosClass: 'LS' # Set the QoS class of the pod to LS.
```
  After you create the ConfigMap in the cluster, you can set the QoS class of a pod to LS or BE. To set the QoS class of a pod, you only need to add the koordinator.sh/qosClass label. If the configuration of a pod does not contain the koordinator.sh/qosClass label, ack-koordinator applies the parameters in the ConfigMap based on the original QoS class of the pod. A Guaranteed pod is assigned the default memory QoS settings. A Burstable pod is assigned the default memory QoS settings for the LS QoS class. A BestEffort pod is assigned the default memory QoS settings for the BE QoS class. For more information about the default memory QoS settings, see Parameters.
  We recommend that you use the koordinator.sh/qosClass label to centrally manage memory QoS parameters.
3. Check whether the ack-slo-config ConfigMap exists in the kube-system namespace.
  - If the ack-slo-config ConfigMap exists, we recommend that you run the kubectl patch command to update the ConfigMap. This avoids changing other settings in the ConfigMap.
```
kubectl patch cm -n kube-system ack-slo-config --patch "$(cat configmap.yaml)"
```
  - If the ack-slo-config ConfigMap does not exist, run the following command to create a ConfigMap:
```
kubectl apply -f configmap.yaml
```
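
The fallback rule in step 2, which decides the settings that a pod receives when it has no koordinator.sh/qosClass value, can be summarized as a lookup. The following Python sketch is a reading of that rule and not the ack-koordinator implementation. The function name and the returned string "default" are hypothetical.

```python
def effective_memory_qos_class(kubernetes_qos, koordinator_qos=None):
    """Return the class whose memory QoS settings apply to a pod.

    kubernetes_qos: "Guaranteed", "Burstable", or "BestEffort".
    koordinator_qos: value of koordinator.sh/qosClass, if the pod sets one.
    """
    if koordinator_qos in ("LS", "BE"):
        return koordinator_qos
    fallback = {
        "Guaranteed": "default",  # default memory QoS settings
        "Burstable": "LS",        # defaults of the LS class
        "BestEffort": "BE",       # defaults of the BE class
    }
    return fallback[kubernetes_qos]


print(effective_memory_qos_class("Burstable"))
print(effective_memory_qos_class("Burstable", "BE"))
```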

Use a ConfigMap to enable memory QoS for pods in specified namespaces.

If you want to enable or disable memory QoS for pods of the LS and BE QoS classes in specific namespaces, specify the namespaces in the ConfigMap. The following ConfigMap is provided as an example.

Create a file named ack-slo-pod-config.yaml with the following code block.

Enable or disable memory QoS for pods in the kube-system namespace.

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: ack-slo-pod-config
  namespace: koordinator-system # You must manually create the namespace when you use this ConfigMap for the first time.
data:
  # Enable or disable memory QoS for pods in the specified namespaces.
  memory-qos: |
    {
      "enabledNamespaces": ["allow-ns"],
      "disabledNamespaces": ["block-ns"]
    }
```

Run the following command to update the ConfigMap:

```
kubectl patch cm -n kube-system ack-slo-pod-config --patch "$(cat ack-slo-pod-config.yaml)"
```
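
If you generate the memory-qos value with a script, a small check can catch a namespace that is listed as both enabled and disabled. The following Python sketch is an assumption-laden helper: the function name is hypothetical, and this topic does not state how ack-koordinator resolves such a conflict, so the sketch rejects it.

```python
import json


def memory_qos_namespace_config(enabled, disabled):
    """Return the JSON text for the memory-qos key of ack-slo-pod-config."""
    overlap = sorted(set(enabled) & set(disabled))
    if overlap:
        # Assumption: a namespace in both lists is treated as a mistake.
        raise ValueError(f"namespaces listed twice: {overlap}")
    config = {
        "enabledNamespaces": list(enabled),
        "disabledNamespaces": list(disabled),
    }
    return json.dumps(config, indent=2)


print(memory_qos_namespace_config(["allow-ns"], ["block-ns"]))
```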

Optional. Configure the advanced parameters.

The following table describes the advanced parameters that you can use to configure fine-grained memory QoS configurations at the pod level and cluster level. If you have other requirements, submit a ticket.

Parameter	Type	Valid value	Description
`enable`	Boolean	`true` `false`	`true`: enables memory QoS for all the containers in a cluster. The recommended memcg settings for the QoS class of the containers are used. `false`: disables memory QoS for all the containers in a cluster. The memcg settings are restored to the original settings for the QoS class of the containers.
`policy`	String	`auto` `default` `none`	`auto`: enables memory QoS for the containers in the pod and uses the recommended settings. The recommended settings are prioritized over the settings that are specified in the ack-slo-pod-config ConfigMap. `default`: specifies that the pod inherits the settings that are specified in the ack-slo-pod-config ConfigMap. `none`: disables memory QoS for the pod. The relevant memcg settings are restored to the original settings. The original settings are prioritized over the settings that are specified in the ack-slo-pod-config ConfigMap.
`minLimitPercent`	Int	0~100	Unit: %. Default value: `0`. The default value indicates that this parameter is disabled. This parameter specifies the unreclaimable proportion of the memory request of a pod. The amount of unreclaimable memory is calculated based on the following formula: `Value of memory.min = Memory request × Value of minLimitPercent/100`. This parameter is suitable for scenarios where applications are sensitive to the page cache. You can use this parameter to cache files to optimize read and write performance. For example, if you specify `Memory Request=100MiB` and `minLimitPercent=100` for a container, `the value of memory.min is 104857600`. For more information, see the Alibaba Cloud Linux 2 topic Memcg QoS feature of the cgroup v1 interface.
`lowLimitPercent`	Int	0~100	Unit: %. Default value: `0`. The default value indicates that this parameter is disabled. This parameter specifies the relatively unreclaimable proportion of the memory request of a pod. The amount of relatively unreclaimable memory is calculated based on the following formula: `Value of memory.low = Memory request × Value of lowLimitPercent/100`. For example, if you specify `Memory Request=100MiB` and `lowLimitPercent=100` for a container, `the value of memory.low is 104857600`. For more information, see the Alibaba Cloud Linux 2 topic Memcg QoS feature of the cgroup v1 interface.
`throttlingPercent`	Int	0~100	Unit: %. Default value: `0`. The default value indicates that this parameter is disabled. This parameter specifies the memory throttling threshold as a percentage of the memory limit of a container. The threshold is calculated based on the following formula: `Value of memory.high = Memory limit × Value of throttlingPercent/100`. If the memory usage of a container exceeds the memory throttling threshold, the memory used by the container is reclaimed. This parameter is suitable for container memory overcommitment scenarios. You can use this parameter to prevent cgroups from triggering OOM. For example, if you specify `Memory Limit=100MiB` and `throttlingPercent=80` for a container, the value of `memory.high` is `83886080`, which is equal to 80 MiB. For more information, see the Alibaba Cloud Linux 2 topic Memcg QoS feature of the cgroup v1 interface.
`wmarkRatio`	Int	0~100	Unit: %. Default value: `95`. A value of `0` indicates that this parameter is disabled. This parameter specifies the asynchronous memory reclaim threshold as a percentage of the memory limit or of the value of `memory.high`. If `throttlingPercent` is disabled, the threshold is calculated based on the following formula: `Value of memory.wmark_high = Memory limit × wmarkRatio/100`. If `throttlingPercent` is enabled, the threshold is calculated based on the following formula: `Value of memory.wmark_high = Value of memory.high × wmarkRatio/100`. If the memory usage exceeds the reclaim threshold, asynchronous memory reclaim is triggered in the background. For example, if you specify `Memory Limit=100MiB`, `wmarkRatio=95`, and `throttlingPercent=80` for a container, the memory throttling threshold `memory.high` is `83886080` (80 MiB), the memory reclaim ratio `memory.wmark_ratio` is `95`, and the memory reclaim threshold `memory.wmark_high` is `79691776` (76 MiB). For more information, see the Alibaba Cloud Linux 2 topic Memcg backend asynchronous reclaim.
`wmarkMinAdj`	Int	-25~50	Unit: %. The default value is `-25` for the `LS` QoS class and `50` for the `BE` QoS class. A value of `0` indicates that this parameter is disabled. This parameter specifies the adjustment to the global minimum watermark for a container. A negative value decreases the global minimum watermark and therefore postpones memory reclaim for the container. A positive value increases the global minimum watermark and therefore brings memory reclaim forward for the container. For example, if you create a pod whose QoS class is LS, the default setting of this parameter is `memory.wmark_min_adj=-25`, which indicates that the minimum watermark is decreased by 25% for the containers in the pod. For more information, see the Alibaba Cloud Linux 2 topic Memcg global minimum watermark rating.
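
The formulas in the table can be combined into one calculation. The following Python sketch reproduces the worked examples from the table for a container with a 100 MiB memory request and a 100 MiB memory limit. The function name and the use of integer division for rounding are assumptions. ack-koordinator performs the real configuration.

```python
MIB = 1024 * 1024


def memcg_settings(request_bytes, limit_bytes, min_limit_percent=0,
                   low_limit_percent=0, throttling_percent=0, wmark_ratio=95):
    """Derive memcg values from the percent parameters in the table."""
    settings = {
        "memory.min": request_bytes * min_limit_percent // 100,
        "memory.low": request_bytes * low_limit_percent // 100,
        "memory.wmark_ratio": wmark_ratio,
    }
    # memory.wmark_high is based on memory.high when throttling is enabled,
    # and on the memory limit otherwise.
    wmark_base = limit_bytes
    if throttling_percent > 0:
        settings["memory.high"] = limit_bytes * throttling_percent // 100
        wmark_base = settings["memory.high"]
    if wmark_ratio > 0:
        settings["memory.wmark_high"] = wmark_base * wmark_ratio // 100
    return settings


example = memcg_settings(100 * MIB, 100 * MIB, min_limit_percent=100,
                         throttling_percent=80, wmark_ratio=95)
for key, value in example.items():
    print(key, value)
```

With these inputs, the sketch yields 104857600 for memory.min, 83886080 for memory.high, and 79691776 for memory.wmark_high, which matches the examples in the table.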

Example

In this example, the following conditions are met:

An ACK Pro cluster that runs Kubernetes 1.20 is created.
The cluster contains 2 nodes, each of which has 8 vCPUs and 32 GB of memory. One node is used to perform stress tests. The other node runs the workload and serves as the tested machine.

Create a file named redis-demo.yaml and add the following YAML content to the file:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-demo-config
data:
  redis-config: |
    appendonly yes
    appendfsync no
---
apiVersion: v1
kind: Pod
metadata:
  name: redis-demo
  labels:
    name: redis-demo # Matches the selector of the redis-demo Service.
    koordinator.sh/qosClass: 'LS' # Set the QoS class of the Redis pod to LS.
  annotations:
    koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS. 
spec:
  containers:
  - name: redis
    image: redis:5.0.4
    command:
      - redis-server
      - "/redis-master/redis.conf"
    env:
    - name: MASTER
      value: "true"
    ports:
    - containerPort: 6379
    resources:
      limits:
        cpu: "2"
        memory: "6Gi"
      requests:
        cpu: "2"
        memory: "2Gi"
    volumeMounts:
    - mountPath: /redis-master-data
      name: data
    - mountPath: /redis-master
      name: config
  volumes:
    - name: data
      emptyDir: {}
    - name: config
      configMap:
        name: redis-demo-config
        items:
        - key: redis-config
          path: redis.conf
  nodeName: # Set nodeName to the name of the tested node. 
---
apiVersion: v1
kind: Service
metadata:
  name: redis-demo
spec:
  ports:
  - name: redis-port
    port: 6379
    protocol: TCP
    targetPort: 6379
  selector:
    name: redis-demo
  type: ClusterIP

Run the following command to deploy Redis Server as the test application.
You can access the redis-demo Service from within the cluster.
```
kubectl apply -f redis-demo.yaml
```

Simulate memory overcommitment.

Use the Stress tool to increase the load on memory and trigger memory reclaim. The sum of the memory limits of all pods on the node exceeds the physical memory of the node.

Create a file named stress-demo.yaml and add the following YAML content to the file:

apiVersion: v1
kind: Pod
metadata:
  name: stress-demo
  labels:
    koordinator.sh/qosClass: 'BE' # Set the QoS class of the Stress pod to BE. 
  annotations:
    koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS. 
spec:
  containers:
    - args:
        - '--vm'
        - '2'
        - '--vm-bytes'
        - 11G
        - '-c'
        - '2'
        - '--vm-hang'
        - '2'
      command:
        - stress
      image: polinux/stress
      imagePullPolicy: Always
      name: stress
  restartPolicy: Always
  nodeName: # Set nodeName to the name of the tested node, which is the node on which the Redis pod is deployed.

Run the following command to deploy stress-demo:
```
kubectl apply -f stress-demo.yaml
```

Note
In memory overcommitment scenarios, if the global minimum watermark of the node is set to a low value, the OOM killer may be triggered for all pods on the node even before memory reclaim is performed. Therefore, we recommend that you set the global minimum watermark to a high value. In this example, the global minimum watermark is set to 4,000,000 KB for the tested node, which has 32 GiB of memory.
Run the following command to query the global minimum watermark of the node:
```
cat /proc/sys/vm/min_free_kbytes
```
Expected output:
```
4000000
```
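
To judge whether a watermark value is high or low for a node, it can help to express it as a share of the node memory. The following Python sketch does this for the values in this example. The function names are hypothetical, and the node memory is assumed to be exactly 32 GiB, although the total memory that the kernel reports is usually slightly lower.

```python
def read_min_free_kbytes(path="/proc/sys/vm/min_free_kbytes"):
    """Read the global minimum watermark from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


def min_free_share(min_free_kbytes, mem_total_kbytes):
    """Return the minimum watermark as a percentage of node memory."""
    return min_free_kbytes / mem_total_kbytes * 100


node_kbytes = 32 * 1024 * 1024  # 32 GiB expressed in KiB
print(f"{min_free_share(4_000_000, node_kbytes):.1f}%")
```

For the tested node, the watermark corresponds to roughly 12% of the node memory.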

Use the following YAML template to deploy the memtier-benchmark tool to send requests to the tested node:

apiVersion: v1
kind: Pod
metadata:
  labels:
    name: memtier-demo
  name: memtier-demo
spec:
  containers:
    - command:
        - memtier_benchmark
        - '-s'
        - 'redis-demo'
        - '--data-size'
        - '200000'
        - "--ratio"
        - "1:4"
      image: 'redislabs/memtier_benchmark:1.3.0'
      name: memtier
  restartPolicy: Never
  nodeName: # Set nodeName to the name of the node that is used to send requests.

Run the following command to query the test results from memtier-benchmark:
```
kubectl logs -f memtier-demo
```

Use the following YAML template to disable memory QoS for the Redis pod and Stress pod. Then, perform stress tests again and compare the results.

apiVersion: v1
kind: Pod
metadata:
  name: redis-demo
  labels:
    koordinator.sh/qosClass: 'LS'
  annotations:
    koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS. 
spec:
  ...

---
apiVersion: v1
kind: Pod
metadata:
  name: stress-demo
  labels:
    koordinator.sh/qosClass: 'BE'
  annotations:
    koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS.

Analyze the results

The following table describes the stress test results when memory QoS is enabled and disabled.

Disabled: The memory QoS policy of the pod is set to none.
Enabled: The memory QoS policy of the pod is set to auto and the recommended memory QoS settings are used.

Metric	Disabled	Enabled
`Latency-avg`	51.32 ms	47.25 ms
`Throughput-avg`	149.0 MB/s	161.9 MB/s

The table shows that the latency of the Redis pod is reduced by 7.9% and the throughput of the Redis pod is increased by 8.7% after memory QoS is enabled. This indicates that the memory QoS feature can optimize the performance of applications in memory overcommitment scenarios.
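
The percentages above follow from the values in the table. The following Python sketch shows the calculation. The function name is hypothetical.

```python
def percent_change(before, after):
    """Return the relative change from before to after, in percent."""
    return (after - before) / before * 100


latency = percent_change(51.32, 47.25)      # ms, disabled vs. enabled
throughput = percent_change(149.0, 161.9)   # MB/s, disabled vs. enabled
print(f"latency: {latency:+.1f}%")
print(f"throughput: {throughput:+.1f}%")
```

A negative value for latency indicates a reduction, and a positive value for throughput indicates an increase.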

FAQ

Is the memory QoS feature that is enabled based on the earlier version of the ack-slo-manager protocol supported after I upgrade from ack-slo-manager to ack-koordinator?

In an earlier version (< 0.8.0) of the ack-slo-manager protocol, the following pod annotations are used:

alibabacloud.com/qosClass
alibabacloud.com/memoryQOS

ack-koordinator is compatible with the earlier versions of the ack-slo-manager protocol. You can seamlessly upgrade from ack-slo-manager to ack-koordinator. ack-koordinator is compatible with the earlier protocol versions until July 30, 2023. We recommend that you upgrade the resource parameters in an earlier protocol version to the latest version.

The following table describes the compatibility between different versions of ack-koordinator and the memory QoS feature.

ack-koordinator version	alibabacloud.com protocol	koordinator.sh protocol
≥ 0.3.0 and < 0.8.0	Supported	Not supported
≥ 0.8.0	Supported	Supported
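
The compatibility table can be expressed as a version lookup, for example in a script that audits clusters before an upgrade. The following Python sketch encodes only the two rows of the table and does not model the end of compatibility on July 30, 2023. The function names are hypothetical, and version strings are assumed to be plain dotted numbers such as 0.8.0.

```python
def parse_version(text):
    """Convert a dotted version string such as '0.8.0' to a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def supported_protocols(component_version):
    """Return the annotation protocols listed for a component version."""
    version = parse_version(component_version)
    if version >= (0, 8, 0):
        return ["alibabacloud.com", "koordinator.sh"]
    if version >= (0, 3, 0):
        return ["alibabacloud.com"]
    return []  # versions earlier than 0.3.0 are not covered by the table


print(supported_protocols("0.7.2"))
print(supported_protocols("0.8.0"))
```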