How to use Alibaba Cloud Container Service to ensure the quality of container memory resources


In cloud native scenarios, applications are typically deployed as containers, which is also how they are allocated physical resources. In a Kubernetes cluster, for example, an application workload declares resource Requests and Limits in its Pod, and Kubernetes schedules the application and guarantees its service quality based on that declaration.

When the memory resources of a container or host are tight, application performance suffers, for example through high service latency or out-of-memory (OOM) kills. Generally speaking, the memory performance of an application in a container is affected by two factors:

1. Its own memory limit: When the container's memory usage (including page cache) approaches the container's upper limit, the kernel's memory subsystem starts reclaiming memory, which slows memory allocation and release within the container.

2. The host memory limit: When containers are overcommitted on memory (Memory Limit > Request) and the host as a whole runs short of memory, the kernel triggers global memory reclamation. This process has a greater performance impact, and in extreme cases it can bring down the entire machine.

The previous articles "Alibaba Cloud Container Service Differentiation SLO Hybrid Technology Practice" and "How to Reasonably Use CPU Management Strategies to Improve Container Performance" described Alibaba Cloud's practical experience and optimization methods in cloud native colocation and container CPU resource management, respectively. This article discusses the problems that arise when containers use memory resources, and the strategies for guaranteeing memory quality.

The troubles of container memory resources

Memory resource management for Kubernetes

Applications deployed in a Kubernetes cluster follow the standard Kubernetes Request/Limit model for resource usage. In the memory dimension, the scheduler makes decisions based on the Memory Request declared by the Pod. On the node side, Kubelet and the container runtime write the declared Memory Limit to the cgroups interface of the Linux kernel, as shown in the following figure:

Control groups (cgroups) are the Linux mechanism for managing container resource usage. The system can use cgroups to set fine-grained limits on the CPU and memory usage of the processes in a container. Kubelet constrains the resources available to Pods and containers on the node by writing the containers' Requests and Limits to the cgroups interface, roughly as follows:

Based on the Memory Limit of the Pod/container, Kubelet sets the cgroups interface memory.limit_in_bytes, which caps the container's memory usage. The CPU dimension has similar limits, such as constraints on CPU time slices or on the number of bound cores. At the Request level, Kubelet sets the cgroups interface cpu.shares from the CPU Request, as the relative weight with which containers compete for CPU. When the node's CPU resources are tight, CPU time is divided between containers in proportion to their Requests to preserve fairness. For Memory Requests, no cgroups interface is set by default; they serve mainly as a reference for scheduling and eviction.

Kubernetes versions 1.22 and above support resource mapping for Memory Requests based on cgroups v2 (kernel version not lower than 4.15, not compatible with cgroups v1, and enabling it will affect all containers on the node).

For example, if Pod A's CPU Request on a node is 2 cores and Pod B's is 4 cores, then when the node's CPU resources are tight, Pod A and Pod B receive CPU time in a 1:2 ratio. When the node's memory resources are tight, however, Memory Requests are not mapped to the cgroups interface, so available memory is not divided between containers in proportion to their Requests as CPU time is, and there is no fairness protection for memory.
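The mapping described above can be sketched in a few lines of Python. This is an illustration rather than Kubelet's code: the function name and the dictionary layout are hypothetical, and the cpu.shares formula (1024 shares per core, with a floor of 2) is an assumption based on the conventional cgroups v1 weighting.

```python
def cgroup_v1_settings(cpu_request_millicores, memory_limit_bytes=None):
    """Map a container's CPU Request and Memory Limit to cgroups v1 values.

    Hypothetical helper for illustration only.
    """
    # Assumed conversion: 1 core (1000 millicores) == 1024 shares, minimum 2.
    shares = max(2, cpu_request_millicores * 1024 // 1000)
    settings = {"cpu.shares": shares}
    if memory_limit_bytes is not None:
        settings["memory.limit_in_bytes"] = memory_limit_bytes
    # Note: the Memory Request produces no cgroups v1 setting at all.
    return settings


GiB = 1024 ** 3
pod_a = cgroup_v1_settings(2000, 4 * GiB)  # CPU Request: 2 cores
pod_b = cgroup_v1_settings(4000, 8 * GiB)  # CPU Request: 4 cores

# Under CPU contention the weights give Pod A and Pod B a 1:2 split.
print(pod_a["cpu.shares"], pod_b["cpu.shares"])  # 2048 4096
```

Nothing in the returned settings depends on the Memory Request, which is exactly the fairness gap described above.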

Memory resource usage in cloud native scenarios

In cloud native scenarios, the memory limit setting of the container affects the memory resource quality of the container itself and the entire host.

Because the Linux kernel prefers to use as much memory as possible rather than reclaim it continuously, memory usage tends to keep growing as processes in the container allocate memory. When the container's memory usage approaches its Limit, container-level synchronous memory reclamation is triggered, adding latency. If memory is allocated faster than it can be reclaimed, the container may also be OOM (Out of Memory) killed, interrupting and restarting the applications in it.

The memory usage between containers is also affected by the host's memory limit. If the memory usage of the entire machine is high, it will trigger a global memory reclamation, and in severe cases, it will slow down the memory allocation of all containers, resulting in a decrease in the quality of memory resources for the entire node.

In a Kubernetes cluster, Pods may need to be prioritized relative to one another. For example, high-priority Pods require better resource stability, so when the machine's resources are tight, the impact on them should be avoided as far as possible. In real-world scenarios, however, low-priority Pods often run resource-hungry tasks, which makes them more likely to cause widespread memory pressure and to degrade the resource quality of high-priority Pods; they are the real "troublemakers". Currently, Kubernetes relies mainly on Kubelet evicting low-priority Pods, but eviction may take effect only after global memory reclamation has already occurred.

Using Container Memory Service Quality to Ensure Container Memory Resources

Container Memory Service Quality

Kubelet provides MemoryQoS features in Kubernetes 1.22 and above, further ensuring the quality of container memory resources through the memcg QoS capability provided by Linux cgroups v2, including:

Set the container's Memory Request on the cgroups v2 interface memory.min, which locks the requested memory so that global memory reclamation cannot reclaim it.

Set the cgroups v2 interface memory.high based on the container's Memory Limit. When a Pod overuses memory (Memory Usage > Request), throttling is triggered first, which avoids OOM caused by unrestricted memory overuse.
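As a rough illustration of these two settings, the sketch below derives memory.min and memory.high from a container's Request and Limit. The function name is hypothetical, and the 80% throttling percentage and the rule of skipping memory.high when it would not exceed the Request are assumptions made for this example, not values confirmed by this article.

```python
def memory_qos_v2(request_bytes, limit_bytes, throttling_percent=80):
    """Derive cgroups v2 memory QoS values for one container.

    Hypothetical helper; throttling_percent is an assumed default.
    """
    # memory.min protects the requested memory from global reclamation.
    settings = {"memory.min": request_bytes}
    # memory.high is a throttling threshold derived from the Limit.
    high = limit_bytes * throttling_percent // 100
    if high > request_bytes:
        settings["memory.high"] = high
    return settings


GiB = 1024 ** 3
burstable = memory_qos_v2(request_bytes=1 * GiB, limit_bytes=2 * GiB)
guaranteed = memory_qos_v2(request_bytes=1 * GiB, limit_bytes=1 * GiB)
print(burstable)
print(guaranteed)
```

With Request equal to Limit, the sketch produces no throttling threshold at all, so such a container can still run into direct reclamation as its usage approaches the Limit.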

The upstream solution can effectively solve the fairness issue of memory resources between Pods, but from the perspective of user usage of resources, there are still some shortcomings:

When Pod's memory declaration Request=Limit, there may still be resource constraints within the container, triggering memcg level direct memory reclamation that may affect the RT (response time) of the application service.

The solution currently does not consider compatibility with cgroups v1, and the issue of memory resource fairness on cgroups v1 has not been resolved.

Alibaba Cloud Container Service ACK is based on the memory subsystem enhancement of Alibaba Cloud Linux 2, allowing users to access more complete container memory QoS functions in advance on cgroups v1, as shown below:

1. Ensure the fairness of memory reclamation between Pods. When overall memory is tight, memory is reclaimed first from Pods that overuse memory (Usage > Request), which constrains the troublemakers and avoids a drop in overall resource quality.

2. When a Pod's memory usage approaches its Limit, a portion of memory is first reclaimed asynchronously in the background, which reduces the performance impact of direct memory reclamation.

3. When node memory is tight, the memory quality of Guaranteed/Burstable Pods is protected first.

Typical Scenarios

Memory oversold

In cloud native scenarios, application administrators may set a container's Memory Limit higher than its Request to increase scheduling flexibility, reduce OOM risk, and improve memory availability. For clusters with low memory utilization, resource administrators may use the same approach to raise utilization and so cut costs. However, this can cause the sum of the Memory Limits of all containers on a node to exceed its physical capacity, leaving the machine in a memory overcommitted (oversold) state. On an oversold node, even if every container's memory usage is well below its Limit, overall memory may still reach the global memory reclamation watermark. An oversold node is therefore more likely to run short of resources than one that is not. Once a container allocates a large amount of memory, other containers on the node may be forced onto the slow path of direct memory reclamation, and a machine-wide OOM may even be triggered, severely affecting application service quality.
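The arithmetic behind this can be made concrete with a small sketch. The node size, the container Limits, and the 60% usage figure are made-up numbers chosen only for illustration.

```python
GiB = 1024 ** 3


def overcommit_ratio(limits_bytes, node_capacity_bytes):
    """Sum of container Memory Limits divided by node memory capacity."""
    return sum(limits_bytes) / node_capacity_bytes


# Hypothetical node: 32 GiB of memory, four containers.
node_capacity = 32 * GiB
limits = [8 * GiB, 8 * GiB, 16 * GiB, 16 * GiB]

ratio = overcommit_ratio(limits, node_capacity)
print(ratio)  # 1.5, so the Limits exceed physical capacity

# Every container uses only 60% of its Limit, far from any per-container
# reclamation, yet the node as a whole is already about 90% full.
usages = [limit * 60 // 100 for limit in limits]
node_utilization = sum(usages) / node_capacity
print(round(node_utilization, 2))  # 0.9
```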

The Memory QoS function enables container-level background asynchronous reclamation, which reclaims a portion of memory before direct reclamation occurs and so reduces the latency caused by direct reclamation; For Pods that declare a Memory Request

Mixed deployment

The Kubernetes cluster may have deployed Pods with different resource usage characteristics on the same node. For example, Pod A runs the workload of online services, with relatively stable memory utilization, and belongs to latency sensitive business (LS); Pod B runs batch processing jobs for big data applications, often requesting a large amount of memory immediately after startup, and is a resource consuming business (Best Effort, abbreviated as BE). When the overall memory resources are tight, both Pod A and Pod B will be disturbed by the global memory recycling mechanism. In fact, even if the current memory usage of Pod A does not exceed its Request value, its service quality will be greatly affected; Pod B may have set a large limit or even not configured a limit value, using far more memory resources than requested, making it a true 'trouble maker' but not fully constrained, thereby damaging the overall memory resource quality of the machine.

The Memory QoS function enables global minimum watermark classification and kernel memcg QoS. When overall memory is tight, memory is reclaimed first from BE containers, which reduces the impact of global memory reclamation on LS containers. Overused memory is also reclaimed first, which preserves the fairness of memory resources.

Technical Internals

The Memory Reclamation Mechanism of Linux

If the Memory Limit declared by a container is too low, processes in the container may see extra latency or even OOM when they allocate more memory. If the Limit is too high, the processes can consume a large share of the machine's memory, interfere with other applications on the node, and cause widespread latency jitter. These delays and OOMs triggered by memory allocation are closely tied to the memory reclamation mechanism of the Linux kernel.

The memory pages used by processes within the container mainly include:

Anonymous pages: From the heap, stack, or data segment. They can only be reclaimed by swapping out to the swap area.

File pages: From code segments and file mappings. They are reclaimed through page-out, and dirty pages must be written back to disk first.

Shared memory: From anonymous mmap mappings and shmem shared memory. It can only be reclaimed through the swap area.

Kubernetes does not support swapping by default, so the directly reclaimable pages in a container are mainly file pages, also called page cache (the Cached part of the kernel interface statistics, which also includes a small amount of shared memory). Because accessing memory is much faster than accessing disk, the Linux kernel prefers to use as much memory as possible, and memory reclamation (of page cache, for example) is triggered mainly when memory usage is already high.

Specifically, when the container's own memory usage (including page cache) approaches its Limit, direct reclaim is triggered at the cgroup level (the memory cgroup, referred to here as memcg) to reclaim clean file pages. This happens in the context of the process that is allocating memory, so the application in the container stalls. If memory is being allocated faster than it can be reclaimed, the kernel's OOM Killer terminates some processes to free more memory, choosing them based on how the processes in the container are running and how much memory they use.
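To see how much of a container's memory falls into this reclaimable category, one can look at the cache and shmem counters in the memcg statistics. The sketch below parses text in the "key value" layout of a cgroups v1 memory.stat file; the sample numbers are invented, the helper names are hypothetical, and the result is only an upper bound, since dirty pages must be written back before they can be reclaimed.

```python
SAMPLE_MEMORY_STAT = """\
cache 524288000
rss 1073741824
shmem 20971520
mapped_file 104857600
"""


def parse_memory_stat(text):
    """Parse 'key value' lines into a dict of integer byte counts."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def reclaimable_without_swap(stats):
    """Upper bound on reclaimable bytes when swap is disabled.

    Page cache counts, but the shared-memory part of it needs swap.
    """
    return max(0, stats.get("cache", 0) - stats.get("shmem", 0))


stats = parse_memory_stat(SAMPLE_MEMORY_STAT)
print(reclaimable_without_swap(stats))  # 503316480
```

On a real node the same function could be fed the contents of a container's memory.stat file instead of the sample string.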

When the machine's overall memory is tight, the kernel triggers reclamation based on the level of free memory (the Free part of the kernel interface statistics). When free memory falls to the Low watermark, background memory reclamation is triggered. It is carried out by the kernel thread kswapd, which does not block application processes and can also reclaim dirty pages; When free memory falls to the Min watermark (Min

Cgroups-v1 Memcg QoS

A Pod's Memory Request is not fully guaranteed in a Kubernetes cluster, so when node memory is tight, global memory reclamation can break the fairness of memory resources between Pods. Containers that overuse memory (Usage > Request) may compete for memory with containers that do not.

For memory quality of service in the container memory subsystem (memcg QoS), the Linux community provides the ability to cap memory usage on cgroups v1, which Kubelet sets to the container's Limit value. However, cgroups v1 lacks the ability to guarantee (lock) memory against reclamation; that capability is supported only on the cgroups v2 interface.

The Alibaba Cloud Linux 2 kernel enables memcg QoS by default on the cgroups v1 interface. Alibaba Cloud Container Service ACK automatically sets appropriate memcg QoS configurations for Pods through the Memory QoS function. Nodes do not need to upgrade to cgroups v2 to support Memory Request locking and Limit throttling for containers, as shown in the above figure:

memory.min: Set to the container's Memory Request. With the memory locking capability of this interface, a Pod can lock the memory covered by its Request so that it is not reclaimed globally. When the node's memory resources are tight, memory is reclaimed only from containers that overuse memory.

memory.high: Set to a percentage of the Limit when the container's Memory Request is less than its Limit or no Limit is set. With the throttling capability of this interface, Pods that overuse memory enter the throttling process, and BestEffort Pods cannot severely overuse the machine's memory, which reduces the risk of triggering global memory reclamation or a machine-wide OOM when memory is oversold.
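The effect of memory.min on global reclamation can be pictured with a deliberately simplified model: only the part of a container's usage above its memory.min is a candidate for reclamation. The Pod names and sizes below are hypothetical, and the real kernel also weighs page types and activity, which this sketch ignores.

```python
GiB = 1024 ** 3


def reclaimable_under_pressure(usage_bytes, memory_min_bytes):
    """Bytes that global reclamation may take from one container.

    Simplified model: usage at or below memory.min is fully protected.
    """
    return max(0, usage_bytes - memory_min_bytes)


# Hypothetical Pods: pod-a stays within its Request, pod-b overuses.
pods = {
    "pod-a": {"usage": 3 * GiB, "memory.min": 4 * GiB},
    "pod-b": {"usage": 10 * GiB, "memory.min": 2 * GiB},
}

candidates = {
    name: reclaimable_under_pressure(p["usage"], p["memory.min"])
    for name, p in pods.items()
}
print({name: size // GiB for name, size in candidates.items()})
# {'pod-a': 0, 'pod-b': 8}
```

In this model the Pod that stays within its Request contributes nothing to reclamation, while the overusing Pod bears all of it.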

For more descriptions of Alibaba Cloud Linux 2 memcg QoS capabilities, please refer to the official website document:

Memory background asynchronous reclamation

As mentioned earlier, memory reclamation happens not only at the level of the whole machine but also inside a container (that is, at the memcg level). When memory usage inside the container approaches its Limit, direct memory reclamation is triggered in the process context, which blocks the application in the container.

To address this issue, Alibaba Cloud Linux 2 adds a background asynchronous reclamation feature for containers. Unlike global memory reclamation, where the kswapd kernel thread reclaims asynchronously, this feature does not create a per-memcg kswapd thread. It is implemented with the workqueue mechanism instead, and it supports both the cgroups v1 and cgroups v2 interfaces.

As shown in the above figure, Alibaba Cloud Container Service ACK automatically sets an appropriate background reclamation watermark, memory.wmark_high, for Pods through the Memory QoS function. When the container's memory usage reaches this threshold, the kernel automatically starts background reclamation. Because this happens before direct memory reclamation would, it avoids the latency that direct reclamation causes and improves the running quality of applications in the container.
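A background reclamation threshold of this kind is typically derived from the container's Limit. The sketch below shows one way to compute it, assuming the watermark is a fixed percentage of the Limit and that reclamation stops at a slightly lower mark. The function name, the parameter names wmark_ratio and wmark_scale_factor, and their values (95 and 50) are assumptions for this example, not settings confirmed by this article.

```python
def background_reclaim_watermarks(limit_bytes, wmark_ratio=95,
                                  wmark_scale_factor=50):
    """Compute assumed start/stop thresholds for background reclamation.

    wmark_ratio: percent of the Limit at which reclamation starts.
    wmark_scale_factor: gap to the stop threshold, in 1/10000 of the Limit.
    """
    wmark_high = limit_bytes * wmark_ratio // 100
    wmark_low = wmark_high - limit_bytes * wmark_scale_factor // 10000
    return wmark_high, wmark_low


limit = 1_000_000_000  # hypothetical container Limit of 1 GB
wmark_high, wmark_low = background_reclaim_watermarks(limit)
print(wmark_high, wmark_low)  # 950000000 945000000
```

Under these assumed values, background reclamation would start at 95% of the Limit, leaving the last 5% as headroom before direct reclamation.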

For more descriptions of Alibaba Cloud Linux 2's memory backend asynchronous recycling capability, please refer to the official website document:

Global minimum watermark classification

Global direct memory reclamation has a significant impact on system performance, especially in memory oversold scenarios where latency-sensitive services (LS) and resource-consuming tasks (BE) are mixed. Resource-consuming tasks often allocate a large amount of memory in a burst, driving the system's free memory down to the global minimum watermark (global wmark_min) and forcing all tasks in the system onto the slow path of direct memory reclamation. This in turn causes performance jitter in latency-sensitive services. Neither global kswapd background reclamation nor memcg background reclamation can effectively prevent this.

For this scenario, Alibaba Cloud Linux 2 adds the memcg global minimum watermark classification function, which uses memory.wmark_min_adj to adjust, on top of the global wmark_min, the watermark that takes effect for a given memcg. Alibaba Cloud Container Service ACK sets tiered watermarks for containers through the Memory QoS function. Starting from the machine's global wmark_min, it raises the global wmark_min of BE containers so that they enter direct memory reclamation earlier, and lowers the global wmark_min of LS containers so that they avoid direct memory reclamation as far as possible, as shown in the following figure:

In this way, when a BE task allocates a large amount of memory in a burst, the raised global wmark_min throttles it for a short time and keeps LS containers out of direct memory reclamation. Once global kswapd has reclaimed a certain amount of memory, the short-term throttling of the BE task is lifted.
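The tiering can be illustrated with a small calculation. The sketch assumes that the adjustment is a percentage in the range -25 to 50, that a negative value lowers the effective watermark toward zero, and that a positive value raises it toward the Low watermark. This formula, the range, and the sample values are assumptions made for illustration; the function name is hypothetical.

```python
def memcg_wmark_min(global_min, global_low, wmark_min_adj):
    """Effective minimum watermark (in pages) for one memcg.

    Assumed model: negative adjustments scale within [0, global_min],
    positive adjustments scale within [global_min, global_low].
    """
    if not -25 <= wmark_min_adj <= 50:
        raise ValueError("wmark_min_adj is assumed to be within [-25, 50]")
    if wmark_min_adj < 0:
        return global_min + global_min * wmark_min_adj // 100
    return global_min + (global_low - global_min) * wmark_min_adj // 100


# Hypothetical global watermarks, in pages.
GLOBAL_MIN, GLOBAL_LOW = 1000, 1250

ls_min = memcg_wmark_min(GLOBAL_MIN, GLOBAL_LOW, -25)  # latency sensitive
be_min = memcg_wmark_min(GLOBAL_MIN, GLOBAL_LOW, 50)   # best effort
print(ls_min, be_min)  # 750 1125
```

As free memory falls, the BE container crosses its raised watermark (1125 pages) well before the LS container reaches its lowered one (750 pages), so the BE task is throttled first.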

For more information on the global minimum watermark classification capability of Alibaba Cloud Linux 2 memcg, please refer to the official website document:


In summary, the Memory QoS of containers is based on the Alibaba Cloud Linux 2 kernel to ensure the quality of container memory resources. The recommended usage scenarios for each capability are as follows:

We use Redis Server as a latency-sensitive online application, and verify how enabling Memory QoS improves application latency and throughput by simulating memory overselling and sending stress-test requests:

Comparing the above data shows that after container Memory QoS is enabled, both the average latency and the average throughput of the Redis application improve to some extent.


To address the problems of container memory usage in cloud native scenarios, Alibaba Cloud Container Service ACK provides the container memory quality of service (Memory QoS) function, based on the Alibaba Cloud Linux 2 kernel. By tuning the memory reclamation and throttling mechanisms of containers, it ensures memory fairness and improves the memory performance of running applications. Memory QoS is a relatively static resource quality scheme and is suitable as a fallback mechanism for safeguarding memory usage in Kubernetes clusters. For complex overselling and mixed deployment scenarios, more dynamic and fine-grained memory guarantee strategies are indispensable. For example, when memory usage fluctuates frequently, an eviction strategy based on real-time resource pressure indicators can reschedule load flexibly in user mode. In addition, memory can be oversold more efficiently by managing it at a finer granularity, for example through memory reclamation based on hot and cold page marking or through runtime (e.g. JVM) GC.
