The ack-koordinator component provides the memory quality of service (QoS) feature for containers. You can use this feature to optimize the performance of memory-sensitive applications while ensuring fair memory scheduling among containers. This topic describes how to enable the memory QoS feature for containers.

Prerequisites

  • A Container Service for Kubernetes (ACK) Pro cluster is created. Only ACK Pro clusters support the memory QoS feature. For more information, see Create an ACK Pro cluster.
  • ack-koordinator is installed. For more information, see ack-koordinator (FKA ack-slo-manager).

Background information

The following memory limits apply to containers:
  • The memory limit of the container. If the amount of memory that a container uses, including the memory used by the page cache, is about to reach the memory limit of the container, the memory reclaim mechanism of the OS kernel is triggered. As a result, the application in the container may not be able to request or release memory resources as normal.
  • The memory limit of the node. If the memory limit of a container is greater than the memory request of the container, the container can overcommit memory resources. In this case, the available memory on the node may become insufficient. This causes the OS kernel to reclaim memory from containers. As a result, the performance of your application is downgraded. In extreme cases, the node cannot run as normal.

To improve the performance of applications and the stability of nodes, ACK provides the memory QoS feature for containers. To use this feature, you must use Alibaba Cloud Linux 2 as the node OS and install ack-koordinator. After you enable the memory QoS feature for a container, ack-koordinator automatically configures the memory control group (memcg) based on the configuration of the container. This helps you optimize the performance of memory-sensitive applications while ensuring fair memory scheduling on the node.
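
The memcg parameters that ack-koordinator configures are exposed as files in the cgroup filesystem of each node. The following sketch, which assumes the cgroup v1 memory hierarchy of Alibaba Cloud Linux 2 and the systemd cgroup driver (the kubepods.slice path), is one way to confirm that the node kernel provides the required interfaces before you enable the feature:

    # Run on a node. Checks whether the memcg interfaces used by memory QoS exist.
    # The kubepods.slice path assumes the systemd cgroup driver; with the cgroupfs
    # driver, use /sys/fs/cgroup/memory/kubepods instead.
    CGROUP_DIR=/sys/fs/cgroup/memory/kubepods.slice
    for f in memory.min memory.low memory.high memory.wmark_ratio memory.wmark_min_adj; do
      [ -e "$CGROUP_DIR/$f" ] && echo "$f: available" || echo "$f: not found (check the kernel version)"
    done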

Limits

The following table describes the versions of the system components that are required to enable the memory QoS feature for containers.

Component                               Required version
Kubernetes                              1.18 and later
ack-koordinator (FKA ack-slo-manager)   0.8.0 and later
Helm                                    3.0 and later
Operating system                        Alibaba Cloud Linux 2 (for the required kernel versions, see the kernel interface topics Memcg backend asynchronous reclaim, Memcg QoS feature of the cgroup v1 interface, and Memcg global minimum watermark rating)
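
The following commands provide one way to check these versions. The Helm release name and namespace of ack-koordinator are assumptions; adjust them to match your installation:

    kubectl version                               # Kubernetes version of the cluster
    helm version --short                          # Helm version
    helm list -n kube-system | grep koordinator   # ack-koordinator release version, if installed with Helm
    kubectl get nodes -o wide                     # OS image and kernel version of each node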

Introduction

The amount of computing resources that can be used by an application in a Kubernetes cluster is limited by the resource requests and resource limits of the containers of the application. As shown in the following figure, the memory request of a pod is used only to schedule the pod to a node with sufficient memory resources, whereas the memory limit restricts the amount of memory that the pod can use. memory.limit_in_bytes indicates the upper limit of memory that can be used by a pod.

Request-Limit model

If the amount of memory that is used by a pod, including the memory used by the page cache, is about to reach the memory limit of the pod, memcg-level direct memory reclaim is triggered for the pod and the processes in the pod are blocked. If the pod requests memory faster than the memory can be reclaimed, the OOMKilled error occurs and the memory that is used by the pod is released.

To reduce the risk of triggering the OOMKilled error, you can increase the memory limit of the pod. However, if the sum of the memory limits of all pods on the node exceeds the physical memory of the node, the node is overcommitted. If a pod on an overcommitted node requests a large amount of memory, the available memory on the node may become insufficient. In this case, the memory that is used by other pods may be reclaimed, and the OOMKilled error may occur in those pods when they request memory. Because swap is disabled in Kubernetes by default, memory reclaim mainly evicts the page cache, which downgrades the performance of the applications that run in these pods.

In the preceding scenarios, the behavior of one pod may adversely affect the memory available to other pods, even if those pods use less memory than their memory requests. This imposes a risk of downgrading the performance of applications.

ack-koordinator works together with Alibaba Cloud Linux 2 to enable the memory QoS feature for pods. ack-koordinator automatically configures the memcg based on the container configuration, and allows you to enable the memcg QoS feature, the memcg backend asynchronous reclaim feature, and the global minimum watermark rating feature for containers. This optimizes the performance of memory-sensitive applications while ensuring fair memory scheduling among containers. For more information, see Memcg QoS feature of the cgroup v1 interface, Memcg backend asynchronous reclaim, and Memcg global minimum watermark rating.

Memory QoS provides the following optimizations to improve the memory utilization of pods:

  • When the memory used by a pod is about to reach the memory limit of the pod, the memcg performs asynchronous reclaim for a specific amount of memory. This prevents the reclaim of all the memory that the pod uses and therefore minimizes the adverse impact on the application performance caused by direct memory reclaim.
  • Memory reclaim is performed in a fairer manner among pods. When the available memory on a node becomes insufficient, memory reclaim is first performed on pods that use more memory than their memory requests. This ensures sufficient memory on the node when a pod applies for a large amount of memory.
  • If the BestEffort pods on a node use more memory than their memory requests, the system prioritizes the memory requirements of Guaranteed pods and Burstable pods over the memory requirements of BestEffort pods.
If you enable service-level objective (SLO)-aware workload scheduling for a cluster, the system prioritizes the memory requirements of Latency-Sensitive (LS) pods over the memory requirements of other types of pods. This delays the reclaim of all the memory used by LS pods. In the following figure, memory.limit_in_bytes indicates the upper limit of memory that can be used by a pod, memory.high indicates the memory throttling threshold, memory.wmark_high indicates the memory reclaim threshold, and memory.min indicates the minimum amount of memory that must be allocated to a pod.

Enable memory QoS

For more information about how to enable the memory QoS feature of the kernel of Alibaba Cloud Linux 2, see Overview.

Note The memory QoS feature is supported in Kubernetes 1.22. You can enable memory QoS by configuring kubelet. This allows you to specify the minimum amount of memory that must be allocated to a pod and enable proactive memory throttling for a pod. This way, memory scheduling among pods is implemented in a fairer manner. The memory QoS feature provided by Kubernetes is in private preview, and supports only cgroups v2 and Linux kernel 4.15 and later versions. This feature is incompatible with cgroups v1. If you enable this feature on a node that uses cgroups v1, all containers on the node are adversely affected. ACK enables the compatibility between the memory QoS feature and cgroups v1. In addition, ACK provides the memcg backend asynchronous reclaim feature, the memcg global minimum watermark rating feature, and SLO-awareness features.

Procedure

When you enable memory QoS for the containers in a pod, the memcg is automatically configured based on the specified ratios and pod parameters. To enable memory QoS for the containers in a pod, perform the following steps:

  1. Add the following annotations to enable memory QoS for the containers in a pod:
    annotations:
      # To enable memory QoS for the containers in a pod, set the value to auto. 
      koordinator.sh/memoryQOS: '{"policy": "auto"}'
      # To disable memory QoS for the containers in a pod, set the value to none. 
      #koordinator.sh/memoryQOS: '{"policy": "none"}'
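    For reference, the following is a minimal sketch of a pod manifest with the annotation in place. The pod name and image are placeholders for illustration only:
    apiVersion: v1
    kind: Pod
    metadata:
      name: memory-qos-demo # Placeholder name. 
      annotations:
        koordinator.sh/memoryQOS: '{"policy": "auto"}' # Enable memory QoS with the recommended settings. 
    spec:
      containers:
      - name: app
        image: nginx:1.25 # Placeholder image; replace it with your application image. 
        resources:
          requests:
            memory: "1Gi"
          limits:
            memory: "2Gi"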
  2. Use a ConfigMap to enable memory QoS for all the containers in a cluster.
    1. To enable memory QoS for all the containers in a cluster, create a file named configmap.yaml with the following content:
      apiVersion: v1
      data:
        resource-qos-config: |-
          {
            "clusterStrategy": {
              "lsClass": {
                 "memoryQOS": {
                   "enable": true
                 }
               },
              "beClass": {
                 "memoryQOS": {
                   "enable": true
                 }
               }
            }
          }
      kind: ConfigMap
      metadata:
        name: ack-slo-config
        namespace: kube-system
    2. When you create a pod in the cluster, specify the QoS class of the pod. The pod uses the cluster-wide memory QoS settings.
      apiVersion: v1
      kind: Pod
      metadata:
        name: pod-demo
        labels:
          koordinator.sh/qosClass: 'LS' # Set the QoS class of the pod to LS. 

      After you create the ConfigMap in the cluster, you can set the QoS class of a pod to LS or BE by adding the koordinator.sh/qosClass label, as shown in the preceding example. The earlier alibabacloud.com/qosClass annotation is also supported; for more information, see the FAQ section of this topic. If the configuration of a pod does not specify a QoS class, ack-koordinator applies the parameters in the ConfigMap based on the original QoS class of the pod: a Guaranteed pod is assigned the default memory QoS settings, a Burstable pod is assigned the default memory QoS settings for the LS QoS class, and a BestEffort pod is assigned the default memory QoS settings for the BE QoS class. For more information about the default memory QoS settings, see Parameters.

      To enable centralized management, we recommend that you use the koordinator.sh/qosClass label to manage memory QoS parameters.

    3. Check whether the ack-slo-config ConfigMap exists in the kube-system namespace.
      • If the ack-slo-config ConfigMap exists, we recommend that you run the kubectl patch command to update the ConfigMap. This avoids changing other settings in the ConfigMap.
        kubectl patch cm -n kube-system ack-slo-config --patch "$(cat configmap.yaml)"
      • If the ack-slo-config ConfigMap does not exist, run the following kubectl apply command to create a ConfigMap named ack-slo-config:
        kubectl apply -f configmap.yaml
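      After the ConfigMap is created or updated, you can check its content to confirm that the memory QoS settings are in place:
        kubectl get configmap ack-slo-config -n kube-system -o yaml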
  3. Use a ConfigMap to enable memory QoS for pods in specified namespaces.

    If you want to enable or disable memory QoS for pods of the LS and BE QoS classes in specific namespaces, specify the namespaces in the ConfigMap. The following ConfigMap is provided as an example.

    1. Create a file named ack-slo-pod-config.yaml with the following content. In this example, memory QoS is enabled for pods in the allow-ns namespace and disabled for pods in the block-ns namespace:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: ack-slo-pod-config
        namespace: kube-system
      data:
        # Enable or disable memory QoS for pods in the specified namespaces. 
        memory-qos: |
          {
            "enabledNamespaces": ["allow-ns"],
            "disabledNamespaces": ["block-ns"]
          }
    2. Run the following command to update the ConfigMap:
      kubectl patch cm -n kube-system ack-slo-pod-config --patch "$(cat ack-slo-pod-config.yaml)"
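      If the ack-slo-pod-config ConfigMap does not exist yet, the kubectl patch command fails because there is nothing to patch. In that case, create the ConfigMap from the file first, in the same way as in the previous step:
      kubectl apply -f ack-slo-pod-config.yaml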
  4. Optional. Configure advanced parameters.
    The following parameters can be used to configure fine-grained memory QoS settings at the pod level and at the cluster level. If you have further requirements, Submit a ticket. A usage sketch that combines several of these parameters is shown after this list.

    Parameter: enable
    Type: Boolean
    Valid values: true and false
    Description:
    • true: enables memory QoS for all the containers in a cluster. The recommended memcg settings for the QoS class of the containers are used.
    • false: disables memory QoS for all the containers in a cluster. The memcg settings are restored to the original settings for the QoS class of the containers.

    Parameter: policy
    Type: String
    Valid values: auto, default, and none
    Description:
    • auto: enables memory QoS for the containers in the pod and uses the recommended settings. The recommended settings are prioritized over the settings that are specified in the ack-slo-config ConfigMap.
    • default: specifies that the pod inherits the settings that are specified in the ack-slo-config ConfigMap.
    • none: disables memory QoS for the pod. The relevant memcg settings are restored to the original settings. The original settings are prioritized over the settings that are specified in the ack-slo-config ConfigMap.

    Parameter: minLimitPercent
    Type: Int
    Valid values: 0 to 100. Unit: %. Default value: 0. The default value indicates that this parameter is disabled.
    Description: This parameter specifies the unreclaimable proportion of the memory request of a pod. The amount of unreclaimable memory is calculated based on the following formula: Value of memory.min = Memory request × Value of minLimitPercent/100. This parameter is suitable for scenarios where applications are sensitive to the page cache. You can use this parameter to cache files to optimize read and write performance. For example, if you specify Memory Request=100MiB and minLimitPercent=100 for a container, the value of memory.min is 104857600. For more information, see the Alibaba Cloud Linux 2 topic Memcg QoS feature of the cgroup v1 interface.

    Parameter: lowLimitPercent
    Type: Int
    Valid values: 0 to 100. Unit: %. Default value: 0. The default value indicates that this parameter is disabled.
    Description: This parameter specifies the relatively unreclaimable proportion of the memory request of a pod. The amount of relatively unreclaimable memory is calculated based on the following formula: Value of memory.low = Memory request × Value of lowLimitPercent/100. For example, if you specify Memory Request=100MiB and lowLimitPercent=100 for a container, the value of memory.low is 104857600. For more information, see the Alibaba Cloud Linux 2 topic Memcg QoS feature of the cgroup v1 interface.

    Parameter: throttlingPercent
    Type: Int
    Valid values: 0 to 100. Unit: %. Default value: 0. The default value indicates that this parameter is disabled.
    Description: This parameter specifies the memory throttling threshold as a proportion of the memory limit of a container. The threshold is calculated based on the following formula: Value of memory.high = Memory limit × Value of throttlingPercent/100. If the memory usage of a container exceeds this threshold, the memory used by the container is reclaimed. This parameter is suitable for memory overcommitment scenarios and can be used to prevent cgroups from triggering OOM. For example, if you specify Memory Limit=100MiB and throttlingPercent=80 for a container, the value of memory.high is 83886080, which is equal to 80 MiB. For more information, see the Alibaba Cloud Linux 2 topic Memcg QoS feature of the cgroup v1 interface.

    Parameter: wmarkRatio
    Type: Int
    Valid values: 0 to 100. Unit: %. Default value: 95. A value of 0 indicates that this parameter is disabled.
    Description: This parameter specifies the threshold at which asynchronous memory reclaim is triggered, as a proportion of the memory limit or of the value of memory.high. If throttlingPercent is disabled, the threshold is calculated based on the following formula: Value of memory.wmark_high = Memory limit × wmarkRatio/100. If throttlingPercent is enabled, the threshold is calculated based on the following formula: Value of memory.wmark_high = Value of memory.high × wmarkRatio/100. If the memory usage exceeds this threshold, asynchronous memory reclaim is triggered in the background. For example, if you specify Memory Limit=100MiB, throttlingPercent=80, and wmarkRatio=95 for a container, the memory throttling threshold memory.high is 83886080 (80 MiB), the memory reclaim ratio memory.wmark_ratio is 95, and the memory reclaim threshold memory.wmark_high is 79691776 (76 MiB). For more information, see the Alibaba Cloud Linux 2 topic Memcg backend asynchronous reclaim.

    Parameter: wmarkMinAdj
    Type: Int
    Valid values: -25 to 50. Unit: %. The default value is -25 for the LS QoS class and 50 for the BE QoS class. A value of 0 indicates that this parameter is disabled.
    Description: This parameter specifies the adjustment to the global minimum watermark for a container. A negative value decreases the global minimum watermark and therefore postpones memory reclaim for the container. A positive value increases the global minimum watermark and therefore advances memory reclaim for the container. For example, if you create a pod whose QoS class is LS, the default setting of this parameter is memory.wmark_min_adj=-25, which indicates that the minimum watermark is decreased by 25% for the containers in the pod. For more information, see the Alibaba Cloud Linux 2 topic Memcg global minimum watermark rating.
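
    The following sketch illustrates how several of these parameters could be combined in the cluster-wide ack-slo-config ConfigMap from step 2. The field names and their placement are assumptions based on the parameter names listed above, and the values are examples only; verify them before use.
    # Hypothetical fragment of the data section of the ack-slo-config ConfigMap.
    # Field names follow the parameter names listed above; the values are examples only.
    resource-qos-config: |-
      {
        "clusterStrategy": {
          "lsClass": {
            "memoryQOS": {
              "enable": true,
              "minLimitPercent": 30,
              "throttlingPercent": 80,
              "wmarkRatio": 95
            }
          }
        }
      }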

Example

In this example, the following conditions are met:

  • An ACK Pro cluster of Kubernetes 1.20 is created.
  • The cluster contains 2 nodes, each of which has 8 vCPUs and 32 GB of memory. One node is used to perform stress tests. The other node runs the workload and serves as the tested machine.
  1. Create a file named redis-demo.yaml with the following YAML template:
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: redis-demo-config
    data:
      redis-config: |
        appendonly yes
        appendfsync no
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: redis-demo
      labels:
        name: redis-demo # Label that matches the selector of the redis-demo Service. 
        koordinator.sh/qosClass: 'LS' # Set the QoS class of the Redis pod to LS. 
      annotations:
        koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS. 
    spec:
      containers:
      - name: redis
        image: redis:5.0.4
        command:
          - redis-server
          - "/redis-master/redis.conf"
        env:
        - name: MASTER
          value: "true"
        ports:
        - containerPort: 6379
        resources:
          limits:
            cpu: "2"
            memory: "6Gi"
          requests:
            cpu: "2"
            memory: "2Gi"
        volumeMounts:
        - mountPath: /redis-master-data
          name: data
        - mountPath: /redis-master
          name: config
      volumes:
        - name: data
          emptyDir: {}
        - name: config
          configMap:
            name: redis-demo-config
            items:
            - key: redis-config
              path: redis.conf
      nodeName: # Set nodeName to the name of the tested node. 
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: redis-demo
    spec:
      ports:
      - name: redis-port
        port: 6379
        protocol: TCP
        targetPort: 6379
      selector:
        name: redis-demo
      type: ClusterIP
  2. Run the following command to deploy Redis Server as the test application.
    You can access the redis-demo Service from within the cluster.
    kubectl apply -f redis-demo.yaml
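    Optionally, confirm that the pod is running and that ack-koordinator applied the memory QoS settings to the container. The cgroup path depends on the cgroup driver of the node, so the find command below is used to locate it; treat the paths as examples only.
    kubectl get pod redis-demo -o wide    # Confirm that the pod is Running and note the node that runs it. 
    # On the tested node, locate the memcg files configured by memory QoS and read one of them.
    find /sys/fs/cgroup/memory -name memory.wmark_ratio
    cat <one-of-the-paths-printed-above>  # Placeholder; substitute a path printed by the find command. 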
  3. Simulate memory overcommitment.
    Use the Stress tool to increase the load on memory and trigger memory reclaim. The sum of the memory limits of all pods on the node exceeds the physical memory of the node.
    1. Create a file named stress-demo.yaml with the following YAML template:
      apiVersion: v1
      kind: Pod
      metadata:
        name: stress-demo
        labels:
          koordinator.sh/qosClass: 'BE' # Set the QoS class of the Stress pod to BE. 
        annotations:
          koordinator.sh/memoryQOS: '{"policy": "auto"}' # Add this annotation to enable memory QoS. 
      spec:
        containers:
          - args:
              - '--vm'
              - '2'
              - '--vm-bytes'
              - 11G
              - '-c'
              - '2'
              - '--vm-hang'
              - '2'
            command:
              - stress
            image: polinux/stress
            imagePullPolicy: Always
            name: stress
        restartPolicy: Always
        nodeName: # Set nodeName to the name of the tested node, which is the node on which the Redis pod is deployed. 
    2. Run the following command to deploy stress-demo:
      kubectl apply -f stress-demo.yaml
  4. Run the following command to query the global minimum watermark of the node:
    Note In memory overcommitment scenarios, if the global minimum watermark of the node is set to a low value, OOM killers may be triggered for all pods on the node even before memory reclaim is performed. Therefore, we recommend that you set the global minimum watermark to a high value. In this example, the global minimum watermark is set to 4,000,000 KB for the tested node that has 32 GiB of memory.
    cat /proc/sys/vm/min_free_kbytes

    Expected output:

    4000000
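    If the value on your node is lower than expected, you can raise it on the tested node. The following is a minimal sketch; run it as the root user, and note that the sysctl.d file name is only an assumption about how you persist the setting:
    sysctl -w vm.min_free_kbytes=4000000  # Set the global minimum watermark to 4,000,000 KB. 
    echo "vm.min_free_kbytes=4000000" > /etc/sysctl.d/90-min-free-kbytes.conf  # Persist the setting across reboots. 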
  5. Use the following YAML template to deploy the memtier-benchmark tool to send requests to the tested node:
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        name: memtier-demo
      name: memtier-demo
    spec:
      containers:
        - command:
            - memtier_benchmark
            - '-s'
            - 'redis-demo'
            - '--data-size'
            - '200000'
            - "--ratio"
            - "1:4"
          image: 'redislabs/memtier_benchmark:1.3.0'
          name: memtier
      restartPolicy: Never
      nodeName: # Set nodeName to the name of the node that is used to send requests. 
  6. Run the following command to query the test results from memtier-benchmark:
    kubectl logs -f memtier-demo
  7. Use the following YAML template to disable memory QoS for the Redis pod and Stress pod. Then, perform stress tests again and compare the results.
    apiVersion: v1
    kind: Pod
    metadata:
      name: redis-demo
      labels:
        koordinator.sh/qosClass: 'LS'
      annotations:
        koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS. 
    spec:
      ...
    
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: stress-demo
      labels:
        koordinator.sh/qosClass: 'BE'
      annotations:
        koordinator.sh/memoryQOS: '{"policy": "none"}' # Disable memory QoS. 
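    To rerun the comparison, recreate the pods with the updated annotations and launch the benchmark again. The file names below are assumptions; use the names of the manifests that you edited.
    kubectl delete pod redis-demo stress-demo memtier-demo --ignore-not-found
    kubectl apply -f redis-demo.yaml -f stress-demo.yaml  # Manifests updated with "policy": "none". 
    kubectl apply -f memtier-demo.yaml                    # Redeploy the benchmark pod. 
    kubectl logs -f memtier-demo                          # Collect the new results. 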

Analyze the results

The following table describes the stress test results when memory QoS is enabled and disabled.

  • Disabled: The memory QoS policy of the pod is set to none.
  • Enabled: The memory QoS policy of the pod is set to auto and the recommended memory QoS settings are used.
Metric          Disabled      Enabled
Latency-avg     51.32 ms      47.25 ms
Throughput-avg  149.0 MB/s    161.9 MB/s

The table shows that the latency of the Redis pod is reduced by 7.9% and the throughput of the Redis pod is increased by 8.7% after memory QoS is enabled. This indicates that the memory QoS feature can optimize the performance of applications in memory overcommitment scenarios.

FAQ

Is the memory QoS feature that is enabled based on the earlier version of the ack-slo-manager protocol supported after I upgrade from ack-slo-manager to ack-koordinator?

In ack-slo-manager versions earlier than 0.8.0, the following pod annotations are used:
  • alibabacloud.com/qosClass
  • alibabacloud.com/memoryQOS
ack-koordinator is compatible with the earlier versions of the ack-slo-manager protocol. You can seamlessly upgrade from ack-slo-manager to ack-koordinator. ack-koordinator is compatible with the earlier protocol versions until July 30, 2023. We recommend that you upgrade the resource parameters in an earlier protocol version to the latest version.
The following table describes the compatibility between different versions of ack-koordinator and the memory QoS feature.
ack-koordinator version    alibabacloud.com protocol    koordinator.sh protocol
≥ 0.3.0 and < 0.8.0        Supported                    Not supported
≥ 0.8.0                    Supported                    Supported
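
For reference, a pod that still uses the earlier protocol carries annotations similar to the following sketch. The annotation values are assumptions modeled on the koordinator.sh examples in this topic; verify them against the documentation of your ack-slo-manager version.

    apiVersion: v1
    kind: Pod
    metadata:
      name: legacy-protocol-demo # Placeholder name. 
      annotations:
        alibabacloud.com/qosClass: 'LS' # Earlier-protocol counterpart of the koordinator.sh/qosClass label. 
        alibabacloud.com/memoryQOS: '{"policy": "auto"}' # Earlier-protocol counterpart of koordinator.sh/memoryQOS. 
    spec:
      containers:
      - name: app
        image: nginx:1.25 # Placeholder image. 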