Kubernetes Resource QoS Classes

Basic concepts

Kubernetes assigns each Pod a QoS class based on the request and limit values set for the resources of its containers. A container's request is the minimum amount of a resource that the system guarantees to it. A container's limit is the maximum amount of that resource the system allows it to use.

To run stably over the long term, a Pod needs a guaranteed minimum of resources. Beyond that minimum, however, the resources a Pod can use are often not guaranteed.

Kubernetes improves resource utilization by overselling resources: the gap between the request and limit values determines the overselling ratio. Kubernetes schedules on request, not limit. Borg improved resource utilization by 20% by using non-guaranteed resources.

In a system where resources are oversold (total limits > machine capacity), containers are eventually killed when resources are exhausted. Ideally, the less important containers are killed first.
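
As a small illustration of what "oversold" means here, the sketch below sums the limits on a node and divides by its capacity. The function name oversellRatio and the sample numbers are made up for this example; CPU is expressed in millicores.

```go
package main

import "fmt"

// oversellRatio returns the sum of limits divided by the node capacity.
// A value above 1 means the node is oversold.
func oversellRatio(limits []int64, capacity int64) float64 {
	var total int64
	for _, l := range limits {
		total += l
	}
	return float64(total) / float64(capacity)
}

func main() {
	// Three containers, each limited to 2000m CPU, on a 4000m node.
	fmt.Println(oversellRatio([]int64{2000, 2000, 2000}, 4000)) // 1.5
}
```

A ratio of 1.5 means that if every container used its full limit at the same time, the node would be 50% short of CPU.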

For each resource, containers can be divided into three QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of priority. Kubernetes uses the limit and request values to assign the class.

  • Guaranteed: if, for every container in a Pod, the limit and request of every resource are equal and non-zero, the QoS class of the Pod is Guaranteed.

Note that if only a limit is specified for a container and no request is specified, the request defaults to the value of the limit.

Examples:

containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 100Mi

containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 100m
        memory: 100Mi

  • Best-Effort: if no container in a Pod sets a request or a limit for any resource, the QoS class of the Pod is Best-Effort.

Examples:

containers:
  - name: foo
    resources: {}
  - name: bar
    resources: {}

  • Burstable: any Pod that meets neither the Guaranteed nor the Best-Effort criteria is Burstable. When a limit is not specified, the effective limit is the Node's capacity for that resource.

Examples: container bar does not specify any resources.

containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar

Containers foo and bar set limits on different resources.

containers:
  - name: foo
    resources:
      limits:
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m

Container foo does not specify limits, and container bar specifies neither requests nor limits.

containers:
  - name: foo
    resources:
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
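
The three rules above can be sketched in Go as follows. This is a simplified illustration, not the kubelet's implementation (which is quoted in the Source Code Analysis section): the container type, the qosClass function, and the use of plain integers (CPU in millicores, memory in bytes) are assumptions made for the example.

```go
package main

import "fmt"

// resources maps a resource name ("cpu", "memory") to an amount.
// CPU is in millicores and memory in bytes; both are illustrative units.
type resources map[string]int64

// container is a hypothetical, stripped-down stand-in for a Pod container.
type container struct {
	name     string
	requests resources
	limits   resources
}

// qosClass applies the three rules described above to a list of containers.
func qosClass(containers []container) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		for _, res := range []string{"cpu", "memory"} {
			limit, hasLimit := c.limits[res]
			request, hasRequest := c.requests[res]
			if hasLimit || hasRequest {
				anySet = true
			}
			// A missing request defaults to the limit.
			if hasLimit && !hasRequest {
				request = limit
			}
			// Guaranteed needs a non-zero limit equal to the request
			// for every resource of every container.
			if !hasLimit || limit == 0 || request != limit {
				guaranteed = false
			}
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	const gi = int64(1) << 30
	limitsOnly := []container{
		{name: "foo", limits: resources{"cpu": 10, "memory": gi}},
		{name: "bar", limits: resources{"cpu": 100, "memory": 100 << 20}},
	}
	mixed := []container{
		{name: "foo", requests: resources{"cpu": 10, "memory": gi}},
		{name: "bar"},
	}
	empty := []container{{name: "foo"}, {name: "bar"}}
	fmt.Println(qosClass(limitsOnly), qosClass(mixed), qosClass(empty))
}
```

The three sample Pods correspond to the first Guaranteed example, the last Burstable example, and the Best-Effort example above.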

Differences between compressible and incompressible resources

When kube-scheduler schedules a Pod, it selects a Node based on the Pod's request values. A Pod and its containers cannot consume more than the specified limits, if any are set.

How request and limit take effect depends on whether the resource is compressible.

Guarantees for compressible resources

  • Currently, only CPU is supported.
  • Pods are guaranteed to get the amount of CPU they request, but they may or may not get additional CPU time, depending on what else is running. Originally this guarantee was not complete, because CPU isolation was applied at the container level. Pod-level cgroup isolation was later introduced to solve this problem.
  • Excess or contended CPU is distributed in proportion to the CPU requests, using cpu.shares to allocate time slices. If container A requests 600 milli CPUs and container B requests 300 milli CPUs, then when the two compete for CPU time it is apportioned in a 2:1 ratio.
  • If a Pod reaches its CPU limit, it is slowed down (throttled) rather than killed. If no limit is set, the Pod can use excess CPU whenever it is available.
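
As a rough illustration of the proportional sharing described above, the sketch below converts CPU requests into cpu.shares values and splits contended CPU time accordingly. The conversion (1024 shares per core, with a floor of 2) follows the convention the kubelet uses for cgroup v1; the function names and the splitContended helper are made up for this example.

```go
package main

import "fmt"

const (
	sharesPerCPU  = 1024 // cpu.shares corresponding to one full core
	milliCPUToCPU = 1000 // millicores per core
	minShares     = 2    // smallest value the kernel accepts for cpu.shares
)

// milliCPUToShares converts a CPU request in millicores to cpu.shares.
func milliCPUToShares(milliCPU int64) int64 {
	shares := milliCPU * sharesPerCPU / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

// splitContended divides contended CPU time (in millicores) between
// containers in proportion to their cpu.shares.
func splitContended(contendedMilli int64, shares []int64) []int64 {
	var total int64
	for _, s := range shares {
		total += s
	}
	out := make([]int64, len(shares))
	for i, s := range shares {
		out[i] = contendedMilli * s / total
	}
	return out
}

func main() {
	a := milliCPUToShares(600) // container A requests 600m
	b := milliCPUToShares(300) // container B requests 300m
	fmt.Println(a, b)          // 614 307
	fmt.Println(splitContended(1000, []int64{a, b}))
}
```

With one fully contended core, A ends up with roughly twice the CPU time of B, which is the 2:1 split from the example above.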

Guarantees for incompressible resources

  • Currently, only memory is supported.
  • Pods are guaranteed the amount of memory they request. If a Pod exceeds its memory request, it may be killed when other Pods need memory. If a Pod uses less memory than its request, it will not be killed unless system tasks or daemons need more memory. (To put it bluntly, it depends on the scores of all processes in the system when the OOM killer is triggered.)
  • When a Pod uses more memory than its limit, the process using the most memory inside one of the Pod's containers is killed by the kernel.

Management and scheduling policies

  • Pods are admitted by the kubelet and scheduled by the scheduler based on the sum of the requests of their containers. The scheduler and the kubelet ensure that the sum of the requests of all containers stays within the allocatable capacity of the Node. See https://github.com/fabric8io/jenkinshift/blob/master/vendor/k8s.io/kubernetes/docs/proposals/node-allocatable.md
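
The admission rule above boils down to a sum-and-compare check on requests. The sketch below shows that check for a single resource; the fits function and the sample sizes are assumptions for illustration and do not reproduce the scheduler's actual code.

```go
package main

import "fmt"

// fits reports whether a new Pod with the given memory request can be
// placed on a node, looking only at requests (never at limits).
func fits(allocatable int64, existingRequests []int64, newRequest int64) bool {
	var used int64
	for _, r := range existingRequests {
		used += r
	}
	return used+newRequest <= allocatable
}

func main() {
	const mi = int64(1) << 20
	allocatable := 4096 * mi
	existing := []int64{1024 * mi, 2048 * mi}
	fmt.Println(fits(allocatable, existing, 1024*mi)) // true
	fmt.Println(fits(allocatable, existing, 1025*mi)) // false
}
```

Because limits play no part in this check, the sum of limits on a node can exceed its capacity, which is exactly the overselling scenario described earlier.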

How resources are reclaimed for each QoS class

  • CPU: when a Pod cannot get its requested CPU, for example because system tasks and daemons are using a large amount of CPU, the Pod is not killed; it is temporarily throttled.
  • Memory: memory is an incompressible resource, so it is handled differently for each QoS class:
    • Best-Effort pods have the lowest priority. If the system runs out of memory, processes in pods of this class are killed first. These containers can use any amount of free memory on the system.
    • Guaranteed pods have the highest priority. They are guaranteed not to be killed as long as they stay below their limits, unless the system is under memory pressure and there are no lower-priority containers left to evict.
    • Burstable pods have some form of minimum resource guarantee, but can use more resources when available. Under memory pressure, once they exceed their request and no Best-Effort containers exist, these containers are killed first.

OOM Score configuration on Node

Pod OOM scoring configuration

badness() in mm/oom_kill.c gives each process an OOM score, and processes with higher OOM scores are more likely to be killed. The score depends on:

  • Mainly the memory consumption of the process, including resident memory, page tables, and swap usage.
    • Generally, it is the percentage of memory consumed times 10 (percent-times-ten).
  • User privileges. For example, for processes started with root permissions, the score is reduced by 30.
  • OOM adjustment factors: /proc/pid/oom_score_adj (added or subtracted) and /proc/pid/oom_adj (multiplied or divided)
    • oom_adj: a coefficient in the range -16 to 15 (-17 disables OOM kills for the process)
    • oom_score_adj: this value is added to the OOM score.
    • The final OOM score is still between 0 and 1000.
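
The scoring rules above can be approximated as follows. This is a deliberately simplified model, not the kernel's badness() code: it expresses memory usage as "percent times ten", adds oom_score_adj, and clamps the result to the range 0 to 1000. The root discount and the legacy oom_adj coefficient are ignored, and the function name oomScore is invented for this example.

```go
package main

import "fmt"

// oomScore approximates the score described above: memory usage as
// "percent times ten" plus oom_score_adj, clamped to the range [0, 1000].
func oomScore(usedBytes, totalBytes, oomScoreAdj int64) int64 {
	score := 1000*usedBytes/totalBytes + oomScoreAdj
	if score < 0 {
		return 0
	}
	if score > 1000 {
		return 1000
	}
	return score
}

func main() {
	const gi = int64(1) << 30
	total := 16 * gi
	fmt.Println(oomScore(4*gi, total, 0))    // 250
	fmt.Println(oomScore(4*gi, total, 1000)) // 1000
	fmt.Println(oomScore(4*gi, total, -998)) // 0
}
```

The last two calls use the adjustments that Kubernetes applies to Best-Effort and Guaranteed containers, which is why those classes end up at the two extremes of the scale.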

Here is a script that lists the ten processes with the highest oom_score on the system (the processes most likely to be killed by the OOM killer):

# vim oomscore.sh
#!/bin/bash
for proc in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
    printf "%2d %5d %s\n" \
        "$(cat $proc/oom_score)" \
        "$(basename $proc)" \
        "$(cat $proc/cmdline | tr '\0' ' ' | head -c 50)"
done 2>/dev/null | sort -nr | head -n 10

The following are the OOM scores for each Kubernetes QoS class:

Best-effort

  • Set OOM_SCORE_ADJ: 1000
  • Therefore, processes in best-effort containers have an OOM_SCORE of 1000.

Guaranteed

  • Set OOM_SCORE_ADJ: -998
  • Therefore, processes in guaranteed containers have an OOM_SCORE of 0 or 1.

Burstable

  • If the total memory request is greater than 99.8% of the available memory, OOM_SCORE_ADJ is set to 2. Otherwise, OOM_SCORE_ADJ = 1000 - 10 * (% of memory requested), which ensures that the OOM_SCORE of a burstable pod is greater than 1.
  • If the memory request is 0, OOM_SCORE_ADJ is set to 999. As a result, if burstable pods conflict with guaranteed pods, the burstable pods are killed.
  • If a burstable pod uses less memory than its request, its OOM_SCORE is less than 1000. If best-effort pods conflict with such burstable pods, the best-effort pods are killed first.
  • If a process in a burstable pod's container uses more memory than the container requested, its OOM_SCORE is 1000. Otherwise, its OOM_SCORE is less than 1000.
  • Among burstable pods, a pod that uses more memory than its request is killed before a pod that uses less memory than its request.
  • If burstable containers with multiple processes conflict, the OOM score formula is only a heuristic and does not ensure the request and limit guarantees.

Pod infra containers or Special Pod init process

  • OOM_SCORE_ADJ: -998

Kubelet, Docker

  • OOM_SCORE_ADJ: -999 (won't be OOM killed)
  • This is a workaround: without it, these critical system processes could be killed when they conflict with guaranteed containers. In the future, user pods are to be placed in a separate cgroup with a limit on the memory they can consume.

Known issues and potential optimization points

  • Supporting swap: the current QoS policy assumes that swap is disabled. If swap is enabled, guaranteed containers that reach their memory limit can keep allocating memory by using disk space. Eventually, when swap space runs out, processes in the pods are killed. To provide deterministic isolation, the node would need to take swap space into account.
  • Allowing users to specify priority: let the user tell the kubelet which tasks can be killed.

Source Code Analysis

The QoS source code lives in pkg/kubelet/qos. The code is very simple and mainly consists of two files: pkg/kubelet/qos/policy.go and pkg/kubelet/qos/qos.go. The OOM_SCORE_ADJ of each QoS class discussed above is defined as follows:

pkg/kubelet/qos/policy.go:21
const (
	PodInfraOOMAdj        int = -998
	KubeletOOMScoreAdj    int = -999
	DockerOOMScoreAdj     int = -999
	KubeProxyOOMScoreAdj  int = -999
	guaranteedOOMScoreAdj int = -998
	besteffortOOMScoreAdj int = 1000
)

The OOM_SCORE_ADJ of a container is calculated as follows:

pkg/kubelet/qos/policy.go:40
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	switch GetPodQOS(pod) {
	case Guaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case BestEffort:
		return besteffortOOMScoreAdj
	}

	// Burstable containers are a middle tier, between Guaranteed and Best-Effort. Ideally,
	// we want to protect Burstable containers that consume less memory than requested.
	// The formula below is a heuristic. A container requesting for 10% of a system's
	// memory will have an OOM score adjust of 900. If a process in container Y
	// uses over 10% of memory, its OOM score will be 1000. The idea is that containers
	// which use more than their request will have an OOM score of 1000 and will be prime
	// targets for OOM kills.
	// Note that this is a heuristic, it won't work if a container has many small processes.
	memoryRequest := container.Resources.Requests.Memory().Value()
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// A guaranteed pod using 100% of memory can have an OOM score of 10. Ensure
	// that burstable pods have a higher OOM score adjustment.
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	// Give burstable pods a higher chance of survival over besteffort pods.
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}
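
To see what this formula yields in practice, the sketch below reimplements only the Burstable branch with plain integers and evaluates it for a node with 10 GiB of memory. The function name burstableOOMScoreAdj and the sample sizes are invented for this example; the two constants are copied from the const block above.

```go
package main

import "fmt"

const (
	guaranteedOOMScoreAdj = -998
	besteffortOOMScoreAdj = 1000
)

// burstableOOMScoreAdj mirrors the Burstable branch of
// GetContainerOOMScoreAdjust using plain byte counts.
func burstableOOMScoreAdj(memoryRequest, memoryCapacity int64) int {
	adj := 1000 - (1000*memoryRequest)/memoryCapacity
	// Never go below 2, so Burstable stays above Guaranteed.
	if int(adj) < 1000+guaranteedOOMScoreAdj {
		return 1000 + guaranteedOOMScoreAdj
	}
	// Never reach 1000, so Burstable stays below BestEffort.
	if int(adj) == besteffortOOMScoreAdj {
		return int(adj - 1)
	}
	return int(adj)
}

func main() {
	const gi = int64(1) << 30
	capacity := 10 * gi
	fmt.Println(burstableOOMScoreAdj(0, capacity))        // 999
	fmt.Println(burstableOOMScoreAdj(1*gi, capacity))     // 900
	fmt.Println(burstableOOMScoreAdj(5*gi, capacity))     // 500
	fmt.Println(burstableOOMScoreAdj(capacity, capacity)) // 2
}
```

The larger the share of node memory a Burstable container requests, the lower its adjustment, so containers with larger requests are better protected as long as they stay within them.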

The QoS class of a Pod is determined as follows:

pkg/kubelet/qos/qos.go:50
// GetPodQOS returns the QoS class of a pod.
// A pod is besteffort if none of its containers have specified any requests or limits.
// A pod is guaranteed only when requests and limits are specified for all the containers and they are equal.
// A pod is burstable if limits and requests do not match across all containers.
func GetPodQOS(pod *v1.Pod) QOSClass {
	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	for _, container := range pod.Spec.Containers {
		// process requests
		for name, quantity := range container.Resources.Requests {
			if !supportedQoSComputeResources.Has(string(name)) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.Copy()
				if _, exists := requests[name]; !exists {
					requests[name] = *delta
				} else {
					delta.Add(requests[name])
					requests[name] = *delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		for name, quantity := range container.Resources.Limits {
			if !supportedQoSComputeResources.Has(string(name)) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.Copy()
				if _, exists := limits[name]; !exists {
					limits[name] = *delta
				} else {
					delta.Add(limits[name])
					limits[name] = *delta
				}
			}
		}

		if len(qosLimitsFound) != len(supportedQoSComputeResources) {
			isGuaranteed = false
		}
	}
	if len(requests) == 0 && len(limits) == 0 {
		return BestEffort
	}
	// Check is requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	if isGuaranteed &&
		len(requests) == len(limits) {
		return Guaranteed
	}
	return Burstable
}

GetPodQOS is called by the eviction_manager and in the Predicates phase of the scheduler. In other words, it is used when Kubernetes handles overcommitted nodes and during the scheduler's pre-selection phase.

This article is from Open Source China: Introduction to Kubernetes Resource QoS Classes
