By Jiazhuo, Senior Development Engineer at Alibaba Cloud
The evolution of container runtimes can be divided into three phases:
Phase 1: June 2014
Kubernetes was officially made open source, and Docker was the default and only container runtime at the time.
Phase 2: Kubernetes 1.3
rkt was integrated into the main Kubernetes code base and became the second container runtime.
Phase 3: Kubernetes 1.5
An increasing number of container runtimes wanted to connect to Kubernetes. Building each of them into Kubernetes in the same way as rkt and Docker would make the Kubernetes code difficult to maintain and its quality hard to guarantee.
To solve this problem, the community introduced the container runtime interface (CRI) in Kubernetes 1.5. The CRI decouples container runtimes from Kubernetes, saving developers the trouble of adapting Kubernetes to every container runtime in the community and the worry about version maintenance caused by inconsistent iteration cycles between container runtimes and Kubernetes. For example, the CRI plug-in of containerd allows container runtimes such as Kata Containers and gVisor to directly connect to containerd.
As more container runtimes emerge, different container runtimes suit different scenarios, resulting in the need to run multiple container runtimes in one cluster. Before you can do so, you must be able to answer two questions: how does a pod specify which container runtime it requires, and how is that pod scheduled to a node that supports this runtime?
To answer the preceding questions, the community introduced RuntimeClass. RuntimeClass was initially introduced in Kubernetes 1.12 in the form of a CustomResourceDefinition (CRD). In Kubernetes 1.14, RuntimeClass was reintroduced as a built-in cluster resource. Kubernetes 1.16 extended the 1.14 version with scheduling support and the pod overhead capability.
The following describes the workflow of RuntimeClass in Kubernetes 1.16. The preceding figure shows the RuntimeClass workflow on the left and a YAML file on the right.
The YAML file consists of two parts. The first part is used to create a RuntimeClass object named runv. The second part is used to create a pod that references the RuntimeClass named runv through spec.runtimeClassName.
The RuntimeClass object has a handler at its core. The handler indicates the program that receives container creation requests and corresponds to a container runtime. In this example, the containers in the pod are created by the container runtime runv. The Scheduling field constrains the nodes to which the pod can be scheduled.
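The two-part YAML described above might look like the following sketch (the node label under scheduling and the image name are illustrative assumptions):

```yaml
# Part 1: a RuntimeClass named runv whose handler maps to the runv runtime.
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: runv
handler: runv
scheduling:
  nodeSelector:
    runtime: runv        # assumed label; nodes must be labeled in advance
---
# Part 2: a pod that references the RuntimeClass through spec.runtimeClassName.
apiVersion: v1
kind: Pod
metadata:
  name: runv-pod
spec:
  runtimeClassName: runv
  containers:
  - name: app
    image: nginx
```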
The left part of the preceding figure shows the RuntimeClass workflow.
Here, we will use the RuntimeClass in Kubernetes 1.16 as an example. The RuntimeClass structure is defined as follows:
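Simplified from the node.k8s.io/v1beta1 API types in the Kubernetes 1.16 source, the structure looks roughly like this sketch:

```go
// RuntimeClass defines a class of container runtime supported in the cluster
// (simplified; field comments paraphrase the upstream documentation).
type RuntimeClass struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// Handler names the program that receives container creation requests,
	// i.e., a container runtime configured on the node.
	Handler string

	// Overhead accounts for the resources consumed by the pod infrastructure.
	Overhead *Overhead

	// Scheduling constrains which nodes pods referencing this class may run on.
	Scheduling *Scheduling
}
```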
A RuntimeClass object represents a container runtime. Its structure includes the Handler, Overhead, and Scheduling fields.
RuntimeClass can be referenced in a pod by setting a RuntimeClass name in the runtimeClassName field.
The Scheduling field is related to the scheduling of the pod that references the RuntimeClass object.
The Scheduling field consists of two fields: NodeSelector and Tolerations. These two fields are similar to NodeSelector and Tolerations for a pod.
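A simplified sketch of the Scheduling structure, following the same v1beta1 types:

```go
// Scheduling (simplified) describes the constraints applied to pods that
// reference this RuntimeClass.
type Scheduling struct {
	// NodeSelector lists labels a node must carry to support this runtime.
	NodeSelector map[string]string

	// Tolerations are merged into the tolerations of the referencing pod.
	Tolerations []corev1.Toleration
}
```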
NodeSelector provides a list of labels that indicate that a node supports a certain RuntimeClass. After a pod references the RuntimeClass, the RuntimeClass admission merges the label lists of the pod and the RuntimeClass. If the two lists contain a label with the same key but different values, the admission rejects the pod. Note that RuntimeClass does not automatically set labels on nodes; you need to label the nodes in advance.
Tolerations provide the toleration list of the RuntimeClass. After a pod references the RuntimeClass, the RuntimeClass admission merges the toleration lists of the pod and the RuntimeClass. If the two lists have the same toleration configuration, they are merged into one list.
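The merge-and-reject behavior described above can be sketched as follows (a simplified model for illustration, not the actual admission code):

```python
def merge_node_selectors(pod_selector, runtime_class_selector):
    """Merge a pod's nodeSelector with a RuntimeClass nodeSelector.

    Raises ValueError when the same key maps to different values,
    mirroring how the RuntimeClass admission rejects such pods.
    """
    merged = dict(pod_selector)
    for key, value in runtime_class_selector.items():
        if key in merged and merged[key] != value:
            raise ValueError(f"conflicting nodeSelector value for key {key!r}")
        merged[key] = value
    return merged

# Compatible selectors are merged into one list.
print(merge_node_selectors({"disk": "ssd"}, {"runtime": "runv"}))
# Conflicting values for the same key cause the pod to be rejected.
try:
    merge_node_selectors({"runtime": "runc"}, {"runtime": "runv"})
except ValueError as err:
    print("rejected:", err)
```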
The left part of the preceding figure shows a Docker pod, and the right part shows a Kata pod. The Docker pod includes a conventional container and a pause container; the pause container is excluded from the pod's resource accounting. For the Kata pod, only the containers themselves are accounted for; the Kata agent, pause container, and guest kernel are not. These components may consume roughly 100 MB, which cannot be ignored.
This is why the Pod Overhead field was introduced. Its structure is defined as follows:
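A simplified sketch of the structure, following the same v1beta1 types:

```go
// Overhead (simplified) captures the resources consumed by the pod
// infrastructure on top of the containers' own requests and limits.
type Overhead struct {
	// PodFixed maps a resource name (e.g. cpu, memory) to a fixed quantity.
	PodFixed corev1.ResourceList
}
```

In a RuntimeClass manifest this appears under `overhead.podFixed`, for example `overhead.podFixed: {cpu: 250m, memory: 160Mi}` (the quantities here are illustrative).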
Its definition is simple and contains only one field: PodFixed. This field is a key-value mapping in which the key is a resource name and the value is a quantity, indicating the fixed amount of that resource consumed by the pod infrastructure. For example, you can set PodFixed to specify the CPU and memory overhead of a pod.
The Pod Overhead field is used in three scenarios: pod scheduling, resource quotas, and Kubelet pod eviction.
Before overhead was introduced, a pod could be scheduled to a node if the available resources of the node were no less than the amount requested by the pod. After the Pod Overhead field was introduced, a pod can be scheduled to a node only when the available resources of the node are no less than the amount requested by the pod plus the pod overhead.
A resource quota limits the resources that can be used in a namespace. For example, suppose a namespace has a 1 GB memory quota and each pod requests 500 MB of memory. A maximum of two such pods can run in the namespace. If a 200 MB overhead is added to each pod, each pod is charged 700 MB against the quota, so only one such pod fits in the namespace.
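The quota arithmetic above can be checked with a short sketch (1 GB is treated as 1,000 MB here for simplicity):

```python
def max_schedulable_pods(quota_mb, request_mb, overhead_mb=0):
    """How many identical pods fit under a namespace memory quota.

    With Pod Overhead, each pod is charged request + overhead
    against the quota instead of the request alone.
    """
    per_pod_mb = request_mb + overhead_mb
    return quota_mb // per_pod_mb

print(max_schedulable_pods(1000, 500))       # two pods without overhead
print(max_schedulable_pods(1000, 500, 200))  # one pod with 200 MB overhead
```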
Kubelet pod eviction
After the Overhead field was introduced, overhead was included in a node's resource usage. This increases the proportion of used resources and affects Kubelet pod eviction. Next, let's look at the limits and precautions of using the Pod Overhead field.
The Pod Overhead field is permanently injected into the pod and cannot be manually modified. The Pod Overhead field persists and remains valid even when the RuntimeClass is deleted or updated.
Currently, the Pod Overhead field can be automatically injected only by the RuntimeClass admission, and cannot be manually added or modified. Any manual actions are denied.
The Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) aggregate container metrics and are not affected by the Pod Overhead field.
Alibaba Cloud's ACK security sandbox container can run multiple container runtimes. The following describes how to run multiple container runtimes in the environment shown in the preceding figure.
As shown in the preceding figure, the pod on the left has the container runtime runc, which corresponds to the RuntimeClass runc. The pod on the right has the container runtime runv, which references the RuntimeClass runv. Requests are highlighted in different colors. Blue requests are runc requests, and red requests are runv requests. In the lower part of the figure, the core component is containerd. Multiple container runtimes can be configured in containerd, which forwards requests.
When the Kubernetes API server receives a runc request, it forwards the request to the Kubelet, which passes it to the CRI plug-in. The CRI plug-in looks up the handler for runc in the containerd configuration file and finds that containerd-shim should be invoked through shim API v1. containerd-shim then creates the pod. This is the workflow for runc.
The workflow for runv is similar. A request passes through the Kubernetes API server, the Kubelet, and the CRI plug-in in sequence. The CRI plug-in determines from the containerd configuration file that containerd-shim-kata-v2 should be invoked through shim API v2, and containerd-shim-kata-v2 then creates a Kata pod.
Now, let's look at the containerd configuration.
By default, the containerd configuration file is stored in /etc/containerd/config.toml. The core configuration is under the plugins.cri.containerd section. Each runtime configuration item shares the name prefix plugins.cri.containerd.runtimes and is suffixed with a runtime name such as runc or runv; these suffixes correspond to the handler names in the RuntimeClass objects described earlier. The plugins.cri.containerd.default_runtime configuration item specifies that a pod that is scheduled to the current node without specifying a RuntimeClass uses the container runtime runc by default.
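A minimal config.toml along these lines might look like the following sketch (the runtime_type values are assumptions for containerd with the v1 runc shim and the Kata v2 shim):

```toml
# /etc/containerd/config.toml (excerpt)
[plugins.cri.containerd.default_runtime]
  # Pods without a RuntimeClass fall back to runc on this node.
  runtime_type = "io.containerd.runtime.v1.linux"

[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runtime.v1.linux"

[plugins.cri.containerd.runtimes.runv]
  # The Kata v2 shim is invoked through shim API v2.
  runtime_type = "io.containerd.kata.v2"
```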
The following example creates two RuntimeClass objects: runc and runv. You can view all available container runtimes by using kubectl get runtimeclass.
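The two objects can be created with a manifest like this sketch:

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: runc
handler: runc
---
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: runv
handler: runv
```

After applying the manifest, `kubectl get runtimeclass` lists both objects.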
The following figure shows how to create a runc pod and a runv pod. Pay special attention to the runtimeClassName field, which references the container runtimes runc and runv for the two pods.
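The two pods might be declared as follows (pod and image names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: runc-pod
spec:
  runtimeClassName: runc
  containers:
  - name: app
    image: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: runv-pod
spec:
  runtimeClassName: runv
  containers:
  - name: app
    image: nginx
```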
After the pods are created, run kubectl to view the running state of the pods and the container runtimes they use. The cluster includes two pods: one references the RuntimeClass runc, and the other references the RuntimeClass runv. Both pods are in the Running state.
Let's summarize what we have learned in this article.