Understand K8S from the perspective of resource management

Resource types

In the era of physical or virtual machine clusters, the resources we managed were mainly organized around hosts. A single host node could be a physical machine or a VM running on a hypervisor. The resource types mainly included CPU, memory, disk I/O, and network bandwidth.

A K8S cluster generally consists of at least one master node and several worker nodes. The master node runs the scheduler, the controller manager, the resource object interface (API Server) and the persistent store etcd. Each worker node communicates with the master node through the Kubelet, and its resources and workload Pods are controlled and scheduled by the master. In this sense the K8S architecture follows the C/S (client/server) model: resource management and scheduling for the worker nodes are controlled by the master nodes.

When looking at K8S cluster resources, we no longer divide them node by node as in the traditional approach. The author introduces the resource types from three aspects: workload, storage and network. These are based on the core resource objects Pod, PVC/PV and Service/Ingress, with which K8S implements the most basic cluster resource management and container orchestration. In a K8S cluster, CPU and memory resource definitions are part of the Pod, and users do not manage them as separate K8S objects. The PVC/PV model decouples storage resources from node resources. Service provides automatic service discovery, which supports the cluster's self-healing capability. Ingress provides control over cluster network traffic, enabling K8S to consider workload and network load together to achieve elastic scaling.

Workload

K8S manages the life cycle of a group of containers through the Pod, which is the smallest unit of K8S scheduling. Application services are also deployed and scheduled as Pods, which is why a Pod can also be called a workload. Stateless application deployments are defined and managed through ReplicaSet (usually via a Deployment); StatefulSet is used to define and manage deployments of applications with dependencies and state. ReplicaSet and StatefulSet are another interesting topic, which the author will introduce in detail later in this series. The CPU/memory resources required by the application are defined in the Pod. Because CPU/memory resources are container-level, they are set under the Pod's spec.containers[].resources field.

K8S provides two configuration parameters, requests and limits, to define the scope and quota of resources.

Requests defines the minimum amount of resources the workload needs. It is the amount K8S reserves for the container at scheduling time and allocates by default when the container is started.

Limits defines the upper bound of resources for the workload. It is the quota that the container cannot exceed while it is running.
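As a minimal sketch of where these two parameters sit, the Python snippet below builds a Pod manifest as a plain dictionary and prints it as JSON. The Pod name demo-app, the image nginx:1.25 and all quantities are assumptions chosen only for illustration; they do not come from the original article.

```python
import json


def build_pod_manifest(name, image, requests, limits):
    """Return a Pod manifest dict with per-container requests and limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Resources are declared per container, not per Pod.
                    "resources": {"requests": requests, "limits": limits},
                }
            ]
        },
    }


# Hypothetical values chosen only for illustration.
pod = build_pod_manifest(
    name="demo-app",
    image="nginx:1.25",
    requests={"cpu": "250m", "memory": "128Mi"},
    limits={"cpu": "500m", "memory": "256Mi"},
)
print(json.dumps(pod, indent=2))
```

Here requests is lower than limits, so the container is reserved a quarter of a CPU core and 128 MiB of memory, and may use up to twice that at runtime.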

Containers share the storage volumes and network namespace of the Pod they belong to, and can share its PID namespace when process namespace sharing is enabled. Each container has its own CPU/memory resource allocation and quota. When performing resource management, you should distinguish between the CPU/memory allocation quota and the actual CPU/memory utilization of the workload. The K8S cluster console generally displays the CPU/memory allocation quota defined by the workload.

Storage

PVC stands for PersistentVolumeClaim, a claim for a persistent volume; PV stands for PersistentVolume, the persistent volume itself. As the name implies, a persistent volume is a resource object with which K8S describes various types of storage resources (block storage, NAS, and object storage). With the storage plug-ins provided by cloud vendors, you can use IaaS-layer storage resources directly as K8S cluster storage.

It may seem that the PV abstraction alone is enough to adapt K8S storage resources to the IaaS storage of the various cloud vendors. Why abstract a further PVC resource object? PVC is similar to an abstract class in OOP. Abstract classes are usually used in development to decouple the caller of an object from its implementation. Binding a Pod to a PVC instead of a PV decouples the deployment of the Pod from the allocation of PV resources.

Pod deployment is generally one step in the R&D team's application deployment and is controlled by developers. The definition and allocation of PV storage resources belong to the DevOps team and are controlled by the cluster administrator.

Pod and PVC are namespace-scoped resource objects, while PV is a cluster-scoped resource object.

The PVC/PV model decouples storage resources from node resources. That is, the storage resources defined by PVC/PV are independent of node resources and, through dynamic binding, can follow the workload as it migrates between nodes. The details depend on the implementation of the storage plug-in provided by the cloud vendor.
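The decoupling can be illustrated with a simplified sketch of how a claim might be matched to a volume. This is not the actual K8S binding code: the dictionary layout, the volume names and the smallest-fit rule are assumptions made for illustration, and only the Gi suffix is parsed.

```python
def parse_gi(quantity):
    """Convert a quantity such as '10Gi' to an integer number of GiB."""
    if not quantity.endswith("Gi"):
        raise ValueError("this sketch only understands the Gi suffix")
    return int(quantity[:-2])


def find_matching_pv(pvc, pvs):
    """Return the smallest available PV that satisfies the claim, or None."""
    candidates = [
        pv for pv in pvs
        if pv["phase"] == "Available"
        and pv["storageClassName"] == pvc["storageClassName"]
        and set(pvc["accessModes"]) <= set(pv["accessModes"])
        and parse_gi(pv["capacity"]) >= parse_gi(pvc["request"])
    ]
    return min(candidates, key=lambda pv: parse_gi(pv["capacity"]), default=None)


# Hypothetical volumes prepared by the cluster administrator.
pvs = [
    {"name": "pv-small", "capacity": "5Gi", "accessModes": ["ReadWriteOnce"],
     "storageClassName": "standard", "phase": "Available"},
    {"name": "pv-large", "capacity": "50Gi", "accessModes": ["ReadWriteOnce"],
     "storageClassName": "standard", "phase": "Available"},
    {"name": "pv-medium", "capacity": "20Gi", "accessModes": ["ReadWriteOnce"],
     "storageClassName": "standard", "phase": "Available"},
]

# Hypothetical claim written by a developer, who never names a specific PV.
pvc = {"name": "data-claim", "request": "10Gi",
       "accessModes": ["ReadWriteOnce"], "storageClassName": "standard"}

bound = find_matching_pv(pvc, pvs)
print(bound["name"])
```

The developer only states what is needed (size, access mode, storage class); which volume backs the claim is decided on the administrator's side.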

Network

Service implements server-side automatic discovery: a group of Pods that provide an application service can be reached through the same service domain name and access address. Adding or destroying Pods does not affect the overall availability of the service. The Service binds the workload to a virtual IP and port, so clients are not sensitive to which nodes the Pods are allocated on. Because the service is decoupled from node network resources, the entire cluster gains the ability of load migration and fault self-healing.

Unlike the Service object, Ingress is the resource object through which K8S exposes services outside the cluster. Before K8S emerged, we used Nginx as a reverse proxy and load balancer to control load and traffic and to scale the servers' capacity horizontally. An Ingress is typically backed by an ingress controller such as an Nginx Pod, which is itself exposed through a Service object and provides reverse proxying and load balancing, for example in LoadBalancer mode. Ingress acts much like the cluster's router and entry portal. By configuring routing rules in the Ingress that bind domain names and paths to workloads, a single external IP address can expose multiple services within the cluster, thus saving IP resources.
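The routing idea can be sketched in a few lines of Python. This is an illustrative model of host and path-prefix routing, not the behavior of any particular ingress controller; the host names, paths and service names are hypothetical.

```python
def route(rules, host, path):
    """Return the backend service for a request, using longest path-prefix match."""
    best = None
    for rule in rules:
        if rule["host"] != host:
            continue
        prefix = rule["path"]
        # A prefix matches the path itself or any sub-path below it.
        matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if matches and (best is None or len(prefix) > len(best["path"])):
            best = rule
    return best["service"] if best else None


# Hypothetical routing rules: several services behind one entry point.
rules = [
    {"host": "shop.example.com", "path": "/", "service": "web"},
    {"host": "shop.example.com", "path": "/api", "service": "api"},
    {"host": "admin.example.com", "path": "/", "service": "admin"},
]

print(route(rules, "shop.example.com", "/api/orders"))
print(route(rules, "shop.example.com", "/index.html"))
print(route(rules, "admin.example.com", "/"))
```

All three services share one external entry point; only the host name and path decide which workload receives the request.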

Pod scheduling and management

As the most popular container orchestration platform, K8S provides platform-level elastic scaling and fault self-healing solutions. The K8S cluster is based on the C/S architecture. The master controls and manages the cluster resources and load balancing. Pod is the smallest unit of K8S scheduling. If we want to master cluster resource management and scheduling, we should start with Pod scheduling and management.

Health check

K8S needs to know the running status of application services before it can perform platform-level elastic scaling and fault self-healing. The simplest way is to constantly check whether the container process is running. If K8S detects that the container process has failed, it automatically tries to restart it. In many cases a restart solves the problem, so this health check is simple but very effective and necessary.

However, consider a Java application that throws an OOM exception or deadlocks at runtime while the JVM process keeps running: the Pod is still running, but the application can no longer provide services. The process health check above cannot handle this case. For it, K8S provides the livenessProbe, which captures abnormal states at the application service level and gives a fuller picture of whether the application service is running healthily.

The livenessProbe is similar to the process health check: it checks the health of the container and, when a failure is detected, restarts the container to achieve automatic repair.

The difference is that the livenessProbe calls an HTTP GET API defined by the application service on the Pod's exposed IP and port, and judges the running status of the container by whether the response status code falls in the range 200~399.

In addition to HTTP, you can also determine the running status of the application service by whether a TCP socket connection succeeds.

Because this judgment of the application's running state is made by K8S from outside rather than from within the application service, K8S can track health at the application service level.
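The decision logic of an HTTP liveness check can be sketched as follows. The code simulates the probe with a list of status codes instead of making real requests, and the threshold of 3 consecutive failures is an assumption that stands in for the probe's failureThreshold setting.

```python
def is_success(status_code):
    """An HTTP probe succeeds when the status code is in the range 200-399."""
    return 200 <= status_code < 400


def should_restart(status_codes, failure_threshold=3):
    """Return True once failure_threshold consecutive probes have failed."""
    consecutive_failures = 0
    for code in status_codes:
        if is_success(code):
            # A single success resets the failure counter.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
    return False


# Simulated probe results: isolated failures do not trigger a restart.
print(should_restart([200, 500, 500, 200, 500]))
# Three failures in a row do.
print(should_restart([200, 503, 503, 503]))
```

Counting consecutive failures rather than single ones avoids restarting a container because of one slow or dropped response.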

When the load is too high, the application service may be unable to serve requests normally even though the container process is healthy. K8S detects this case through the readinessProbe.

The readinessProbe offers the same access methods as the livenessProbe: the Kubelet probes the Pod through an HTTP GET API or a TCP socket connection and reports the result to the K8S control plane.

When the readiness probe fails and the application service process cannot process requests normally, the Pod is not restarted. Instead, it is removed from the Service endpoints and no longer receives request load from the Service. This is similar to traffic degradation and ensures that a Pod only receives request load it can handle correctly.
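The effect of a failed readiness probe on traffic can be sketched like this. The Pod names, IP addresses and the ready flag are hypothetical; the point is only that an unready Pod keeps running but drops out of the endpoint list.

```python
def service_endpoints(pods):
    """Return the IPs of Pods that are currently ready to receive traffic."""
    return [pod["ip"] for pod in pods if pod["ready"]]


# Hypothetical Pods behind one Service; web-1 is overloaded and fails readiness.
pods = [
    {"name": "web-0", "ip": "10.0.0.11", "ready": True},
    {"name": "web-1", "ip": "10.0.0.12", "ready": False},
    {"name": "web-2", "ip": "10.0.0.13", "ready": True},
]

print(service_endpoints(pods))

# Once web-1 recovers and passes the probe again, it rejoins the endpoints.
pods[1]["ready"] = True
print(service_endpoints(pods))
```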

Automatic scheduling policy

Around the release of Kubernetes 1.0 in 2015, Google put forward the initial cloud native definition: containerized applications, a microservice-oriented architecture, and applications that support container orchestration and scheduling. Container scheduling is the core technology of K8S. In a cluster running hundreds of microservice containers, Pod scheduling and resource management become complex. The containers in a Pod are related at runtime: they run on the same node and share node resources. When the application load changes, the containers' consumption of node resources changes as well. The capacity and availability of node resources in turn affect the performance and stability of application services.

The Scheduler component of K8S selects a suitable node for each Pod based on the Pod resource object submitted through the API Server and the resource usage of each node reported by the Kubelet. The Scheduler decides node placement when Pods are created, scaled elastically, or migrated after a failure. It makes its decisions based on container runtime dependencies, resource requirement settings, and the default scheduling policies. The default scheduling policy usually aims to place workloads on nodes that ensure high availability, high performance and low latency. Therefore, unless there is a special purpose for node selection, it is recommended that users use the default scheduling policy directly.

A customized scheduler can be configured through a cluster policy configuration file in JSON format. In it, predicates means that only nodes meeting the listed rules are considered when scheduling a Pod. Priorities means that the set of nodes that passed the predicates filter is ranked according to the weighted priorities, and the node with the highest weight is selected for scheduling. For example, the Scheduler can prioritize the nodes with the lowest amount of requested resources.

When the Scheduler detects a Pod resource object definition created through the API Server, it first uses the predicates to filter out the set of nodes that meet the rules; then it weights these nodes according to the priorities to select the best node; finally it allocates that node's resources to create the Pod.
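The filter-then-score flow can be sketched in Python. This is a simplified illustration, not the Scheduler's real implementation: a single predicate checks that the Pod's requests fit, and a single priority favors the node that stays least utilized. The node names, field names and numbers are assumptions.

```python
def fits(node, pod):
    """Predicate: the node must have enough free CPU and memory for the Pod's requests."""
    return node["free_cpu_m"] >= pod["cpu_m"] and node["free_mem_mi"] >= pod["mem_mi"]


def least_requested_score(node, pod):
    """Priority: higher score for nodes that remain less utilized after placement."""
    cpu_free = (node["free_cpu_m"] - pod["cpu_m"]) / node["alloc_cpu_m"]
    mem_free = (node["free_mem_mi"] - pod["mem_mi"]) / node["alloc_mem_mi"]
    return (cpu_free + mem_free) / 2


def schedule(nodes, pod):
    """Filter nodes with the predicate, then pick the highest-scoring one."""
    feasible = [node for node in nodes if fits(node, pod)]
    if not feasible:
        return None  # the Pod stays unscheduled
    return max(feasible, key=lambda node: least_requested_score(node, pod))["name"]


# Hypothetical cluster state (CPU in millicores, memory in MiB).
nodes = [
    {"name": "node-a", "alloc_cpu_m": 4000, "free_cpu_m": 500,
     "alloc_mem_mi": 8192, "free_mem_mi": 1024},
    {"name": "node-b", "alloc_cpu_m": 4000, "free_cpu_m": 3000,
     "alloc_mem_mi": 8192, "free_mem_mi": 6144},
    {"name": "node-c", "alloc_cpu_m": 8000, "free_cpu_m": 2000,
     "alloc_mem_mi": 16384, "free_mem_mi": 4096},
]
pod = {"cpu_m": 1000, "mem_mi": 2048}

print(schedule(nodes, pod))
```

In this example node-a is filtered out because it lacks free CPU, and node-b wins over node-c because a larger share of its resources would remain free after placement.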

Pods are generally scheduled to nodes whose available resource capacity is greater than the Pod's resource demand. The node OS and the K8S management components usually reserve some node resources for themselves, so the capacity that can be allocated is usually less than the node's total resources. The resources the Scheduler can schedule against are the resources a node can allocate, also known as the node's allocatable capacity. It is calculated as follows:

Allocatable Capacity = Node Capacity - Kube-Reserved - System-Reserved

Allocatable Capacity is the node resource that the Scheduler can allocate to application service Pods.

Node Capacity is the total amount of node resources.

Kube-Reserved is the resource reserved for K8S background processes, such as the Kubelet, CRI, CNI and other components.

System-Reserved is the resource reserved for the node's operating system background processes, such as sshd, udev, etc.
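As a worked example of the formula, the sketch below applies it per resource. The node size and the reserved amounts are hypothetical numbers chosen for illustration.

```python
def allocatable(node_capacity, kube_reserved, system_reserved):
    """Apply Allocatable = Capacity - Kube-Reserved - System-Reserved per resource."""
    return {
        resource: node_capacity[resource]
        - kube_reserved.get(resource, 0)
        - system_reserved.get(resource, 0)
        for resource in node_capacity
    }


# Hypothetical node: 4 CPU cores (4000 millicores) and 16 GiB (16384 MiB) of memory.
node_capacity = {"cpu_m": 4000, "memory_mi": 16384}
kube_reserved = {"cpu_m": 100, "memory_mi": 512}
system_reserved = {"cpu_m": 100, "memory_mi": 256}

print(allocatable(node_capacity, kube_reserved, system_reserved))
```

With these numbers, 3800 millicores and 15616 MiB remain for application Pods.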

When introducing the workload above, we mentioned that requests and limits determine the amount and quota of resources a container may consume. Because not all node resources are available for scheduling, it is recommended to define the container's resources.requests and resources.limits explicitly in the Pod template, so that the workload does not compete with K8S components for resources and cause scheduling failures.

Limits defines the upper limit of the container's resource usage, and Requests defines the container's initial resource allocation. Generally, resources are allocated according to requests when the container is started. When the container is running, its actual resource consumption is generally less than the amount allocated by requests, as shown in the following figure.

This pyramid-style resource allocation leaves a lot of resource fragmentation. For the case where workloads compete for resources, K8S provides three levels of quality of service (QoS) assurance.

Best-Effort: the Pod sets neither requests nor limits. This kind of Pod has the lowest QoS priority. When node resources are insufficient or contended, it is the first to be destroyed or migrated.

Burstable: the Pod sets requests and limits, but requests are less than limits. This kind of Pod has a minimum resource guarantee: under node resource competition, once no Best-Effort Pods remain, Pods of this kind are the next to be destroyed or migrated.

Guaranteed: the Pod sets requests and limits, and requests equal limits. This kind of Pod has the highest resource guarantee, and its QoS level is higher than Burstable and Best-Effort. Therefore, when defining the Pod template of an application service, it is recommended to configure appropriate requests and limits wherever possible for Pods with high requirements for resource assurance and service quality.
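The three levels can be expressed as a small classification function. This is a simplified sketch of the rules described above, assuming that only cpu and memory are considered and that quantities are compared as plain strings (so "500m" and "0.5" would not be treated as equal).

```python
def qos_class(containers):
    """Classify a Pod from its containers' cpu/memory requests and limits."""
    resources = ("cpu", "memory")
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in resources:
            limit = limits.get(resource)
            # When a request is omitted but a limit is set, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"cpu": "250m", "memory": "128Mi"},
                  "limits": {"cpu": "500m", "memory": "256Mi"}}]))
print(qos_class([{"requests": {"cpu": "500m", "memory": "256Mi"},
                  "limits": {"cpu": "500m", "memory": "256Mi"}}]))
```

Note that every container in the Pod has to satisfy the condition for the Pod to be classified as Guaranteed; one container without limits is enough to make the whole Pod Burstable.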

Service Discovery

The application services running in a K8S cluster are mostly distributed systems based on a microservice architecture, and services often call one another. When scheduling an application service Pod, the Scheduler selects the best node and allocates resources there to create the Pod. Before the container starts, the Pod is assigned an IP address that is not known in advance. Therefore, when a Pod of another application service wants to communicate with this Pod, it is difficult for it to obtain the dynamically assigned IP.

Client-side discovery

Traditional distributed systems, for example those built on ZooKeeper, often use client-side discovery between services. The client service has a built-in discovery agent that can look up the service registry and select a service instance to communicate with. The server instances report their status to the service registry. By querying the registry, the client selects and calls a responding service instance.

Figure: client-side service discovery by agent

Server-side discovery

The service discovery implemented by K8S is server-side: the server-side Pods actively report their service capability to the service registry. The client Pod must be able to reach the service registry, obtain service information through it, and access a responding server Pod. The client Pod uses a constant virtual IP to access the service through a proxy, regardless of which Pod actually serves the request.

Figure: server-side service discovery by proxy

In K8S, the server-side discovery logic is implemented by the Service resource object. Through a Pod selector and port numbers, a Service binds a virtual IP, also known as the ClusterIP, to a group of Pods.
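How a Pod selector picks out a group of Pods can be sketched as a label match. The labels and Pod names below are hypothetical, and the sketch covers only equality-based selectors.

```python
def select_pods(selector, pods):
    """Return names of Pods whose labels contain every key/value pair in the selector."""
    return [
        pod["name"] for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


# Hypothetical Pods with labels.
pods = [
    {"name": "orders-0", "labels": {"app": "orders", "tier": "backend"}},
    {"name": "orders-1", "labels": {"app": "orders", "tier": "backend"}},
    {"name": "web-0", "labels": {"app": "web", "tier": "frontend"}},
]

# Hypothetical Service selector: all Pods labelled app=orders sit behind one virtual IP.
selector = {"app": "orders"}
print(select_pods(selector, pods))
```

Because membership is decided by labels rather than by Pod names or IPs, Pods can be replaced or rescheduled without the client noticing.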

Since the ClusterIP is assigned dynamically when the Service is created, how do the Pods of other services discover and communicate with it? There are two main ways:

Environment variables

When a Pod is created, the ClusterIP and port of every Service that already exists at that time are automatically injected into the Pod as environment variables, and the application can reach those Services through the injected ClusterIP and port.

Because environment variables cannot be injected after the Pod has started, discovering a Service's ClusterIP and port through environment variables only works for Services that exist before the Pod starts.
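The naming pattern of these variables can be sketched as follows: the Service name is upper-cased, dashes become underscores, and the suffixes _SERVICE_HOST and _SERVICE_PORT are appended. The Service name, ClusterIP and port in the example are hypothetical.

```python
def service_env_vars(service_name, cluster_ip, port):
    """Build the host/port environment variables a Pod would see for a Service."""
    prefix = service_name.upper().replace("-", "_")
    return {
        f"{prefix}_SERVICE_HOST": cluster_ip,
        f"{prefix}_SERVICE_PORT": str(port),
    }


# Hypothetical Service that existed before the client Pod started.
env = service_env_vars("order-service", "10.96.12.34", 8080)
for name, value in env.items():
    print(f"{name}={value}")
```

A client process could then read these variables from its environment at startup to locate the Service.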

DNS query

K8S provides a platform-level DNS service that can be configured for all Pods. When a Service resource object is created, the DNS service assigns it a DNS name through which the corresponding Pods can be reached. The DNS service maintains the mapping between the application service's DNS name and the ClusterIP and port assigned to the Service, so that traffic sent to the DNS name is resolved and delivered to the corresponding Pods.

If the client service knows the Service name and the corresponding namespace, it can access the application service Pods directly through the internal domain name service-name.namespace.svc.cluster.local.

service-name is the name defined on the Service object.

namespace is the namespace of the Service and its Pods.

svc indicates that the name refers to a Service resource.

cluster.local is the default domain suffix that the K8S core DNS service uses for access addresses inside the cluster.
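Putting the parts together, the internal domain name can be assembled with a small helper. The Service name orders and the namespace shop are hypothetical, and cluster.local is used as the default suffix described above.

```python
def service_fqdn(service_name, namespace, cluster_domain="cluster.local"):
    """Build the in-cluster DNS name of a Service."""
    return f"{service_name}.{namespace}.svc.{cluster_domain}"


print(service_fqdn("orders", "shop"))
```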

Epilogue

Recently, the author has been sorting out the basic knowledge points for getting started with K8S. This article is a companion to "Understanding K8S from the perspective of application development" and takes a fresh look at K8S from the perspective of resource management. Users of the K8S platform are generally divided into application developers and cluster administrators. When learning K8S, cluster administrators more often need to understand it at the cluster level, the resource level and the performance level. For that, we need to understand the most basic knowledge points, such as Pod scheduling and resource management, storage and network resources, and service and traffic management.
