By Huaixin Chang from Alibaba Cloud, core member of Cloud Kernel SIG in the OpenAnolis community.
CPU throttling degrades container performance. To avoid it, people sometimes have to sacrifice container deployment density. The CPU Burst technology we designed preserves the service quality of containers without reducing deployment density.
This is the first article in a three-part series. The second part will analyze the size effects of CPU Burst. The third part will cover how to evaluate the impact of CPU Burst and how to configure it for the best results.
The CPU Burst feature has been incorporated into Linux 5.14. Anolis OS 8.2, Alibaba Cloud Linux 2, and Alibaba Cloud Linux 3 also support the CPU Burst feature.
In Kubernetes container scheduling, the upper limit of a container's CPU resources is specified by the CPU limits parameter. Setting this upper limit prevents some containers from consuming excessive CPU time and ensures that other containers get enough CPU resources. In the Linux kernel, CPU limits are implemented with CPU Bandwidth Controller, which limits the resource consumption of cgroups through CPU throttling. Therefore, when processes in a container use more CPU time than the CPU limits allow, the kernel throttles them. Their CPU time is restricted, and key latency metrics of those processes deteriorate.
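As a rough illustration of how a CPU limit maps onto Bandwidth Controller settings, the sketch below converts a Kubernetes-style CPU limit string into a CFS quota in microseconds. It assumes the common default period of 100ms (100,000 microseconds); the helper names `parse_cpu_limit_millicores` and `cfs_quota_us` are hypothetical and exist only for this example.

```python
def parse_cpu_limit_millicores(limit: str) -> int:
    """Parse a Kubernetes-style CPU quantity ("4", "2.5", "500m") into millicores."""
    limit = limit.strip()
    if limit.endswith("m"):
        return int(limit[:-1])
    return int(round(float(limit) * 1000))


def cfs_quota_us(limit: str, period_us: int = 100_000) -> int:
    """CPU time (in microseconds) the cgroup may consume per period.

    Assumption: quota = limit in CPUs * period, with a 100ms default period.
    """
    return parse_cpu_limit_millicores(limit) * period_us // 1000


if __name__ == "__main__":
    for limit in ("500m", "2.5", "4"):
        print(f"limit={limit:>5} -> quota={cfs_quota_us(limit)}us per 100000us period")
```

A limit of 4 CPUs (400%) therefore corresponds to 400ms of CPU time per 100ms period, summed across all CPUs the container runs on.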
How do we deal with this situation? Usually, we set the container's CPU limits to its daily peak CPU utilization multiplied by a safety factor. This is meant to avoid the service quality degradation caused by throttling while still keeping CPU resources reasonably utilized. Here is a simple example:
If a container's daily peak CPU utilization is around 250%, we set its CPU limits to 400% to protect its service quality. At its peak, the container then uses 62.5% of its limit (250%/400%).
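The arithmetic in this example can be written out as follows. This is only a sketch: the safety factor of 1.6 is an assumption chosen so that a 250% peak yields the 400% limit used above, and the function names are hypothetical.

```python
def limit_from_peak(peak_pct: float, safety_factor: float) -> float:
    """CPU limit (in percent of one CPU) derived from the observed daily peak."""
    return peak_pct * safety_factor


def utilization_at_peak(peak_pct: float, limit_pct: float) -> float:
    """Fraction of the configured limit that the container uses at its peak."""
    return peak_pct / limit_pct


if __name__ == "__main__":
    limit = limit_from_peak(250, 1.6)  # assumed safety factor
    print(f"limit: {limit:.0f}%")
    print(f"utilization at peak: {utilization_at_peak(250, 400):.1%}")
```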
However, is this approach perfect? It is not. CPU throttling occurs much more frequently than expected, and the only apparent remedy is to keep raising the CPU limits. In many cases, the service quality of a container is only reasonably protected once its CPU limits have been enlarged 5 to 10 times, at which point the container's overall CPU utilization is only 10% to 20%. In other words, the deployment density of containers must be reduced to cope with possible peaks in container CPU usage.
In the past, people fixed some CPU throttling problems that were caused by bugs in CPU Bandwidth Controller. The unexpected throttling we see today is instead caused by bursts of CPU usage at the 100ms level. We proposed CPU Burst technology to permit a certain amount of bursty CPU usage, so that throttling is avoided as long as the average CPU utilization stays below the CPU limit. In cloud computing scenarios, this lets containers keep their service quality without giving up deployment density.
CPU utilization measured at one-second granularity does not reflect CPU usage at the 100ms granularity at which Bandwidth Controller operates. This is the cause of unexpected CPU throttling. Bandwidth Controller applies to CFS tasks and uses a period and a quota to manage the CPU time consumed by a cgroup. If a cgroup's period is 100ms and its quota is 50ms, the processes in the cgroup can use at most 50ms of CPU time in every 100ms period. When CPU usage within a 100ms period reaches 50ms, the processes are throttled, and the cgroup's CPU usage is limited to 50%.
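The period and quota mechanism can be sketched with a small simulation. This is a simplified model written for this article, not the kernel's implementation: it assumes demand is given as milliseconds of CPU time wanted per 100ms period, and that work left unfinished because of throttling carries over into the next period. The name `simulate_bandwidth` is hypothetical.

```python
def simulate_bandwidth(demand_ms, quota_ms):
    """Return (cpu_used_ms, throttled) for each period under a fixed quota.

    Simplified model: unused quota is discarded at the end of each period,
    and unmet demand is carried over as backlog.
    """
    backlog = 0
    results = []
    for demand in demand_ms:
        wanted = backlog + demand
        used = min(wanted, quota_ms)   # never more than the quota per period
        backlog = wanted - used        # leftover work waits for the next period
        results.append((used, backlog > 0))
    return results


if __name__ == "__main__":
    # Three 100ms periods with a 50ms quota; the second period bursts to 80ms.
    for i, (used, throttled) in enumerate(simulate_bandwidth([30, 80, 20], 50)):
        print(f"period {i}: used={used}ms throttled={throttled}")
```

In this run the average demand is about 43ms per period, below the 50ms quota, yet the second period is still throttled and its leftover work spills into the third.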
CPU utilization is the average CPU usage over a time interval. Measured at coarse granularity, CPU utilization tends to look stable. As the granularity gets finer, the bursty nature of CPU usage becomes more obvious. If we observe the same container workload at both 1s and 100ms granularity, CPU utilization averages about 250% at the one-second level, while the peak observed at the 100ms level, where Bandwidth Controller operates, exceeds 400%.
Suppose we set the container's quota and period to 400ms and 100ms, a limit chosen from the 250% CPU utilization observed at one-second granularity. The container's fine-grained bursts above 400% are still throttled by Bandwidth Controller, which restricts the CPU usage of the container's processes.
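To make the granularity effect concrete, the sketch below summarizes one second of CPU utilization sampled every 100ms. The sample values are made up for illustration (they are not measurements from this article) and are chosen so that the one-second average is 250% while individual 100ms samples exceed 400%.

```python
from statistics import mean

# Hypothetical CPU utilization (percent of one CPU) in ten consecutive 100ms windows.
samples_100ms = [150, 200, 450, 300, 150, 200, 420, 230, 200, 200]


def summarize(samples, limit_pct):
    """Compare the coarse (1s) average with the fine-grained (100ms) view."""
    return {
        "mean_1s": mean(samples),
        "peak_100ms": max(samples),
        "periods_over_limit": sum(1 for s in samples if s > limit_pct),
    }


if __name__ == "__main__":
    print(summarize(samples_100ms, limit_pct=400))
```

A monitor that reports only the one-second average would show 250% and no sign of trouble, while two of the ten 100ms windows exceed a 400% limit.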
We use CPU Burst technology to meet this fine-grained demand for CPU bursts. On top of the quota and period of the traditional CPU Bandwidth Controller, we introduce the concept of burst. When the container's CPU usage is below its quota, the unused portion accumulates as burst resources. When the container's CPU usage exceeds its quota, it is allowed to draw on the accumulated burst resources. As a result, the container's average CPU consumption over a longer time span is still limited to its quota, while its CPU usage is allowed to exceed the quota for short periods.
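The accumulate-then-spend behavior described above can be sketched as follows. This is a simplified model under stated assumptions, not the kernel code: at the start of each period the available runtime is refilled by the quota and capped at quota plus burst, and unfinished work carries over to the next period. The name `simulate_burst` and the sample demands are hypothetical.

```python
def simulate_burst(demand_ms, quota_ms, burst_ms):
    """Return (cpu_used_ms, throttled) for each period with burst accumulation.

    Simplified model: unused runtime is kept across periods, but the
    runtime available in any one period never exceeds quota + burst.
    """
    runtime = 0
    backlog = 0
    results = []
    for demand in demand_ms:
        # Refill at the start of the period, capped at quota + burst.
        runtime = min(runtime + quota_ms, quota_ms + burst_ms)
        wanted = backlog + demand
        used = min(wanted, runtime)
        runtime -= used
        backlog = wanted - used
        results.append((used, backlog > 0))
    return results


if __name__ == "__main__":
    demand = [20, 80, 20]  # ms of CPU wanted in three 100ms periods
    print("burst=0 :", simulate_burst(demand, quota_ms=50, burst_ms=0))
    print("burst=50:", simulate_burst(demand, quota_ms=50, burst_ms=50))
```

With a burst of 0 the model behaves like the plain quota and throttles the second period. With a burst of 50ms, the 30ms left unused in the first period covers the 80ms spike, and the average usage (40ms per period) still stays below the 50ms quota.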
As an analogy, suppose we manage vacation time with the Bandwidth Controller algorithm. The period is one year, and the quota is the number of vacation days granted per year. Without burst, unused vacation days expire at the end of the year. CPU Burst allows vacation days that were not used up this year to be taken later.
After CPU Burst was enabled in container scenarios, the service quality of the test containers improved. Mean response time (RT) decreased by 68% (from over 30 ms to 9.6 ms), and 99th-percentile RT decreased by 94.5% (from over 500 ms to 27.37 ms).