Understanding 13 Typical K8s Pod Anomalies

In K8s, the Pod is the core resource object and the runtime carrier of a workload. A Pod has a complex life cycle, and many different exceptions can occur at each stage of it. As a complex system, K8s often demands a deep reserve of knowledge and experience for anomaly diagnosis. Based on hands-on experience and real scenarios from EDAS users, we have summarized 13 common abnormal scenarios of K8s Pods, listed the common error states for each scenario, and analyzed the causes and troubleshooting approaches.

This article is more than 7,000 words long and takes about 20 minutes to read. Its content is distilled from the analysis of a large number of real-world scenarios, and we recommend bookmarking it for reference.

Pod Life Cycle

Over its whole life cycle, a Pod goes through five phases.

• Pending: A Pod starts in the Pending phase after it is created by K8s. In this phase the Pod is scheduled and assigned to a target node, which then pulls images, loads dependencies, and creates the containers.

• Running: When all containers of the Pod have been created and at least one container is running, the Pod will enter the Running phase.

• Succeeded: When all containers in the Pod have terminated successfully after running and will not be restarted, the Pod enters the Succeeded phase.

• Failed: If all containers in the Pod have terminated and at least one of them terminated due to a failure, that is, it exited with a non-zero status or was killed by the system, the Pod enters the Failed phase.

• Unknown: If the Pod's status cannot be obtained for some reason, the Pod is set to the Unknown phase.

Generally speaking, for Job-type workloads a Pod ends in the Succeeded phase after its task completes successfully. For Deployments and other workloads, the Pod is expected to keep providing service until it is deleted, or until it enters the Failed phase due to an abnormal exit or system termination.

The five phases are only a coarse, macro-level summary of where a Pod is in its life cycle, not a comprehensive description of the container or Pod state. A Pod also has finer-grained conditions (PodConditions), such as Ready/NotReady, Initialized, and PodScheduled/Unschedulable. These conditions describe the specific cause behind the Pod's current phase. For example, if a Pod's phase is Pending and the corresponding condition is Unschedulable, there is a problem with Pod scheduling.

Containers also have their own lifecycle states: Waiting, Running, and Terminated, with corresponding state reasons (Reason) such as ContainerCreating, Error, OOMKilled, CrashLoopBackOff, and Completed. For containers that have been restarted or terminated, the Last State field records not only the reason for the previous state but also the Exit Code of the last exit. For example, if a container's last exit status code is 137 and the state reason is OOMKilled, the container was forcibly terminated by the system. During anomaly diagnosis, the container's exit status is crucial information.
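
For example, the status section of kubectl get pod -o yaml for such a Pod looks roughly like the following excerpt (illustrative only; the container name and counts are made up):

status:
  containerStatuses:
  - name: app                      # hypothetical container name
    state:
      waiting:
        reason: CrashLoopBackOff   # current state: waiting to be restarted
    lastState:
      terminated:
        reason: OOMKilled          # the last run was killed by the system
        exitCode: 137              # 128 + 9 (SIGKILL)
    restartCount: 3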

In addition to the necessary cluster and application monitoring, you generally need to collect abnormal status information with kubectl commands.

Pod exception scenarios

Different exceptions may occur at many points in a Pod's life cycle. Using whether the Pod's containers are running as the dividing line, we roughly group the exception scenarios into two categories:

1. An exception occurs during Pod scheduling or container creation, and the Pod is stuck in the Pending phase.

2. An exception occurs while the Pod's containers are running; in this case the Pod's phase depends on the specific scenario.

The following will describe and analyze the 13 specific scenarios.

Scheduling failed

Common error status: Unschedulable

After the Pod is created, it enters the scheduling phase. The K8s scheduler selects a suitable node for the Pod according to the resource requests and scheduling rules the Pod declares. When no node in the cluster can satisfy the Pod's scheduling requirements, the Pod stays in the Pending state. Typical reasons for scheduling failure are as follows:

• Insufficient node resources

K8s quantifies node resources (CPU, memory, disk, and so on) and defines a node's resource capacity and its allocatable resources. Capacity is the node's total resources as reported by the Kubelet, while the allocatable amount is what is actually available to Pods. A Pod container has two resource quota concepts: the request value and the limit value; the container is guaranteed at least the request and may use at most the limit. A Pod's resource request is the sum of the requests of all its containers, and its resource limit is the sum of their limits. The K8s default scheduler uses the smaller request value as the basis for scheduling, ensuring that a schedulable node's allocatable resources are not less than the Pod's resource request (see the sketch below). When no node in the cluster satisfies the Pod's resource request, the Pod is stuck in the Pending state.

A Pod pending because its resource requests cannot be met may indicate that cluster resources are insufficient and capacity needs to be expanded, or it may be caused by cluster fragmentation. In a typical scenario, a user's cluster has more than ten 4c8g nodes and an overall resource utilization of about 60%; every node has leftover fragments, but each fragment is too small to fit a 2c4g Pod. In general, clusters built from small nodes are more likely to produce resource fragments that cannot be used for Pod scheduling. If you want to minimize resource waste, using larger nodes may give better results.
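
As a minimal sketch (the Pod name, image, and values are illustrative), the request and limit of each container are declared as follows; the scheduler only checks that a node's allocatable resources cover the sum of the requests:

apiVersion: v1
kind: Pod
metadata:
  name: demo-requests            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25            # any runnable image
    resources:
      requests:                  # what the scheduler uses
        cpu: "500m"
        memory: 512Mi
      limits:                    # what the runtime enforces
        cpu: "1"
        memory: 1Gi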

• Namespace resource quota exceeded

K8s users can limit resource usage within a Namespace through a resource quota, along two dimensions:

1. Limit the total number of objects that can be created for an object type (such as Pod).

2. Limit the total amount of compute resources (such as CPU and memory) that can be consumed.

If the requested resources exceed the quota when a Pod is created or updated, the Pod cannot be scheduled. In that case, check the status of the Namespace's resource quota and adjust it appropriately.
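
A minimal ResourceQuota sketch (the namespace, name, and numbers are illustrative); Pods whose creation would push a total past these values are rejected:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: demo-quota               # hypothetical name
  namespace: dev                 # hypothetical namespace
spec:
  hard:
    pods: "20"                   # cap on the number of Pods
    requests.cpu: "10"           # total CPU that may be requested
    requests.memory: 20Gi        # total memory that may be requested
    limits.memory: 40Gi          # total memory limit across the namespace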

• NodeSelector not satisfied

A NodeSelector schedules the Pod onto nodes that carry specific labels. If no available node matches the NodeSelector, the Pod cannot be scheduled, and the NodeSelector or the node labels need to be adjusted.
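
A NodeSelector sketch (the label key and value are illustrative); the Pod can only be scheduled onto nodes carrying this label:

apiVersion: v1
kind: Pod
metadata:
  name: demo-selector            # hypothetical name
spec:
  nodeSelector:
    disktype: ssd                # only nodes labeled disktype=ssd are candidates
  containers:
  - name: app
    image: nginx:1.25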

• Affinity rules not satisfied

Node affinity and anti-affinity constrain which nodes a Pod can be scheduled to; affinity is further divided into soft affinity (Preferred) and hard affinity (Required). For soft affinity rules, the K8s scheduler tries to find nodes that satisfy them, but still schedules the Pod if none match. When hard affinity rules are not satisfied, the Pod cannot be scheduled; check the Pod's scheduling rules and the target node's status, and adjust the rules or the nodes accordingly.
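
A sketch combining one hard rule and one soft rule (the labels are illustrative); if the hard rule cannot be met the Pod stays Pending, while the soft rule is only a preference:

apiVersion: v1
kind: Pod
metadata:
  name: demo-affinity            # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard affinity
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["zone-a"]
      preferredDuringSchedulingIgnoredDuringExecution:  # soft affinity
      - weight: 1
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx:1.25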

• The node has taints

K8s provides the taints and tolerations mechanism to keep Pods off inappropriate nodes. If a node carries a taint and the Pod does not declare a matching toleration, the Pod will not be scheduled onto that node. In that case, confirm whether the node really needs the taint; if not, remove it. If the Pod is allowed to run on the tainted node, add the corresponding toleration to the Pod.
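
A toleration sketch (the taint key, value, and effect are illustrative); it lets the Pod be scheduled onto nodes tainted with dedicated=gpu:NoSchedule:

apiVersion: v1
kind: Pod
metadata:
  name: demo-toleration          # hypothetical name
spec:
  tolerations:
  - key: dedicated               # must match the node taint's key
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.25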

• No nodes available

A node may become unavailable (NotReady) due to insufficient resources, network failure, an unready Kubelet, and so on. When the cluster has no schedulable node, the Pod is also stuck in the Pending state. In that case, check the node status, troubleshoot and repair the unavailable nodes, or scale out the cluster.

Image pull failed

Common error status: ImagePullBackOff

After scheduling, the Pod is assigned to a target node, which must pull the images required by the Pod before creating the containers. The image pull phase may fail for the following reasons:

• The image name is misspelled or incorrectly configured

When an image pull failure occurs, first confirm that the image address is configured correctly.

• The password-free configuration for the private registry is incorrect

To pull a private image, the cluster needs a password-free (pull credential) configuration. For a self-built image registry, you need to create the credential Secret in the cluster and either specify it in the Pod's imagePullSecrets or embed the Secret in a ServiceAccount and have the Pod use that ServiceAccount. Managed image services such as ACR generally provide a password-free plug-in, which must be correctly installed in the cluster before images in the registry can be pulled. Typical plug-in problems include: the plug-in is not installed in the cluster, the plug-in's Pod is abnormal, or the plug-in is misconfigured; check the relevant information for further troubleshooting.
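
A sketch of referencing a pull credential from the Pod (the Secret name and registry address are illustrative); the Secret itself is typically created with kubectl create secret docker-registry:

apiVersion: v1
kind: Pod
metadata:
  name: demo-private-image       # hypothetical name
spec:
  imagePullSecrets:
  - name: my-registry-secret     # docker-registry type Secret in the same namespace
  containers:
  - name: app
    image: registry.example.com/team/app:1.0   # hypothetical private image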

• Network failure

There are three common scenarios of network failure:

1. The cluster accesses the image registry over the public network, but the registry is not configured for public access. For a self-built registry, the port may not be open, or the image service may not be listening on a public IP. For managed image services such as ACR, you need to confirm that the public access endpoint is enabled and configure access control policies such as a whitelist.

2. The cluster is in a VPC, and VPC access control must be configured on the image service so that the cluster nodes can connect to it.

3. Pulling images hosted overseas, such as images from the gcr.io registry, requires configuring an image acceleration service.

• Image pull timeout

Pull timeouts are commonly caused by insufficient bandwidth or overly large images. You can manually pull the image on the node to observe the transfer rate and duration. If necessary, upgrade the cluster bandwidth, or adjust the Kubelet's --image-pull-progress-deadline and --runtime-request-timeout options appropriately.

• Pulling many images at the same time triggers parallelism limits

This commonly happens when a user elastically scales out nodes and a large number of pending Pods are scheduled at the same time, so many Pods start on one node simultaneously and pull multiple images from the registry in parallel. Because of factors such as cluster bandwidth, the stability of the registry service, and the container runtime's image pull parallelism control, image pulling does not tolerate heavy parallelism well. In this case, you can manually interrupt some of the pulls and let the images be pulled in batches by priority.

Dependency Error

Common error status: Error

Before a Pod starts, the Kubelet checks all of its dependencies on other K8s objects. There are three main types of dependencies: PersistentVolume, ConfigMap, and Secret. If a dependency does not exist or cannot be read, the Pod's containers cannot be created and the Pod stays in the Pending state until the dependency is satisfied. If the dependencies can be read correctly but are misconfigured, container creation also fails; for example, mounting a read-only PersistentVolume in read-write mode, or mounting a volume at an illegal path such as /proc, will cause container creation to fail.

Container creation failed

Common error status: Error

An error occurred during the creation of the Pod container. Common causes include:

• The Pod violates the cluster's security policy, such as a PodSecurityPolicy.

• The container lacks permission to operate on cluster resources; for example, after RBAC is enabled, the ServiceAccount needs the proper role binding.

• The startup command is missing: neither the Pod spec nor the image's Dockerfile specifies a startup command.

• The startup command is misconfigured. The Pod spec defines the command line through the command field and its parameters through the args field. Misconfigured startup commands are very common; pay special attention to the format of the command and its arguments.
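
For example, a sketch of starting a hypothetical Python server with one flag (the names and values are illustrative); each argument must be a separate list element:

apiVersion: v1
kind: Pod
metadata:
  name: demo-command             # hypothetical name
spec:
  containers:
  - name: app
    image: python:3.11
    command: ["python", "server.py"]   # replaces the image's ENTRYPOINT
    args: ["--port=8080"]              # replaces the image's CMD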

Initialization failed

Common error status: CrashLoopBackOff

K8s provides the init container feature to run one or more initialization containers before the application containers start, so that the preconditions required by the application are satisfied. An init container is essentially the same as an application container, but it is a task that runs exactly once to completion, and the system moves on to the next container only after it finishes. If one of a Pod's init containers fails, the start of the business containers is blocked. After locating the failing init container through the Pod's status and events, check that init container's logs to further pinpoint the fault.
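
A sketch of an init container that waits for a dependency before the business container starts (the service name and port are illustrative); if it keeps failing, the Pod shows states such as Init:Error or Init:CrashLoopBackOff, and its logs can be read with kubectl logs <pod> -c <init-container-name>:

apiVersion: v1
kind: Pod
metadata:
  name: demo-init                # hypothetical name
spec:
  initContainers:
  - name: wait-for-db            # must finish before the app container starts
    image: busybox:1.36
    command: ["sh", "-c", "until nc -z db-service 3306; do sleep 2; done"]
  containers:
  - name: app
    image: nginx:1.25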

Callback failed

Common error status: FailedPostStartHook or FailedPreStopHook event

K8s provides two container lifecycle hooks, PostStart and PreStop. PostStart is executed immediately after the container is created, but it runs asynchronously, so there is no guaranteed ordering relative to the container's startup command. PreStop is called synchronously before the container terminates and is often used to release resources gracefully before the container ends. If a PostStart or PreStop hook fails, the container is killed, and whether it restarts is determined by the restart policy. When a hook fails, a FailedPostStartHook or FailedPreStopHook event is emitted; combine it with the logs printed by the container for further troubleshooting.
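
A sketch with both hooks on one container (the commands are illustrative); a non-zero exit from either hook produces the corresponding FailedPostStartHook or FailedPreStopHook event:

apiVersion: v1
kind: Pod
metadata:
  name: demo-hooks               # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25
    lifecycle:
      postStart:                 # runs right after the container is created, asynchronously
        exec:
          command: ["sh", "-c", "echo started >> /tmp/hook.log"]
      preStop:                   # runs synchronously before the container is terminated
        exec:
          command: ["sh", "-c", "nginx -s quit; sleep 5"]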

Readiness probe failed

Common error status: all containers have started, but the Pod is NotReady and Service traffic cannot reach the Pod

K8s uses the readiness probe to determine whether a container is ready to accept traffic. Only when all containers in the Pod are ready does K8s consider the Pod ready and forward Service traffic to it. Readiness probe failures generally fall into the following cases:

• Application-side causes: the port or script configured in the health check cannot be probed successfully, for example because the application in the container did not start properly.

• Improper probe configuration: the check port is written incorrectly; the probe interval and failure threshold are unreasonable, for example probing every 1s and declaring failure after a single failed check; or the startup delay is too short, for example the application normally needs 15s to start while the probe begins 10s after the container starts (see the sketch after this list).

• System layer problem: High node load causes container processes to hang.

• Insufficient CPU resources: The CPU resource limit is too low, resulting in slow response of the container process.
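
A readiness probe sketch with the timing parameters mentioned above (the path, port, and values are illustrative); an initialDelaySeconds or failureThreshold that is too small easily produces false failures:

apiVersion: v1
kind: Pod
metadata:
  name: demo-readiness           # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25
    readinessProbe:
      httpGet:
        path: /healthz           # must be an endpoint the application actually serves
        port: 80
      initialDelaySeconds: 20    # give the application enough time to start
      periodSeconds: 5           # probe interval
      failureThreshold: 3        # tolerate transient failures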

Note that for microservice applications, service registration and discovery are managed by a service registry, and traffic does not go through the K8s Service but flows directly from upstream Pods to downstream Pods. The registry, however, has no check mechanism like the K8s readiness probe. For Java applications that start slowly, required resources may still be initializing after the service has registered successfully, causing traffic loss right after release. For this kind of scenario, EDAS provides solutions such as delayed registration and service warm-up to keep K8s microservice applications from losing traffic when they go online.

Liveness probe failed

Common error status: CrashLoopBackOff

K8s uses the liveness probe to determine whether a container is still running. If the liveness check fails, the container is killed, and whether it restarts is determined by the restart policy. The causes of liveness probe failures are similar to those of readiness probe failures, but because the container is killed after a liveness failure, troubleshooting is much harder. A typical user scenario: during a load test, HPA elastically scales out several new Pods, but the new Pods are overwhelmed by heavy traffic as soon as they start, cannot respond to the liveness probe, and are killed; after being killed they restart and get stuck again, oscillating between the Running and CrashLoopBackOff states. In microservice scenarios, you can use methods such as delayed registration and service warm-up to keep containers from being crushed by instantaneous traffic. If the blocking is caused by the program itself, it is recommended to remove the liveness probe first, then find the root cause of the blocking after the traffic influx using monitoring and process stack information collected after the Pod starts.

Container Exit

Common error status: CrashLoopBackOff

Container exit is divided into two scenarios:

• Exit immediately after startup, possible reasons are:

1. The path to start the command is not included in the environment variable PATH.

2. The startup command refers to a non-existent file or directory.

3. The startup command fails to execute, either because the runtime environment lacks dependencies or because of a problem in the program itself.

4. The startup command has no permission to execute.

5. The container has no foreground process. A container must contain at least one long-running foreground process; the main process must not be started in the background, for example via nohup or Tomcat's startup.sh script.

When a container exits immediately after startup, it is usually hard to investigate on the spot, because the container disappears right away and its output stream cannot be retrieved. A simple troubleshooting method is to pin the container with a special startup command (such as tail -f /dev/null), then exec into the container and run the original command by hand to observe the result and confirm the cause of the problem.
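
A sketch of pinning such a container for debugging (the image name is illustrative); with this placeholder command the container stays up, and you can kubectl exec into it and run the real startup command by hand:

apiVersion: v1
kind: Pod
metadata:
  name: demo-debug               # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/team/app:1.0    # hypothetical image that keeps exiting
    command: ["tail", "-f", "/dev/null"]         # placeholder process that never exits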

• The container exits after running for a while. In this case, a process in the container crashed or was terminated by the system. First check the container's exit status code, then examine the surrounding context to locate the error. When this happens, the container has already been deleted and you cannot exec into it to view logs, stacks, and other on-site information, so it is generally recommended to configure persistent storage for logs, error records, and similar files to preserve more diagnostic information.

OOMKilled

Common error status: OOMKilled

K8s distinguishes two kinds of resources: compressible resources (CPU) and incompressible resources (memory, disk). When a compressible resource such as CPU is insufficient, a Pod only "starves" but does not exit; when an incompressible resource such as memory or disk I/O is insufficient, the Pod is killed or evicted. An abnormal Pod exit caused by insufficient or over-limit memory is called OOMKilled. Two scenarios in K8s lead to a Pod being OOMKilled:

• Container Limit Reached: the container's memory usage exceeds its limit

Each container in a Pod can configure its own memory quota. When the memory a container actually uses exceeds its limit, the container is OOMKilled and exits with status code 137. This usually happens after the Pod has been running normally for some time, perhaps because traffic increased or because memory accumulated gradually during long-term operation. In this case, check the program logs to understand why the Pod used more memory than expected and whether there was abnormal behavior. If OOM occurs even though the program behaves as expected, increase the Pod's memory limit appropriately. A common error scenario is a Java container whose memory limit is set but whose JVM maximum heap size is larger than that limit; the heap keeps growing while the process runs until the container is finally OOMKilled. For Java containers, it is generally recommended that the container memory limit be slightly larger than the JVM's maximum heap size.
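
A sketch for a Java container (the image, variable, and sizes are illustrative, and JAVA_OPTS is only honored if the image's startup script reads it); the JVM maximum heap is kept below the container memory limit to leave room for non-heap memory:

apiVersion: v1
kind: Pod
metadata:
  name: demo-java                # hypothetical name
spec:
  containers:
  - name: app
    image: eclipse-temurin:17-jre
    env:
    - name: JAVA_OPTS            # hypothetical variable consumed by the startup script
      value: "-Xmx3g"            # max heap kept below the 4Gi container limit
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 4Gi              # exceeding this gets the container OOMKilled (exit code 137)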

• Limit Overcommit: node memory is exhausted

K8s has two resource quota concepts: the request value and the limit value. By default, the scheduler schedules based on the smaller request value, ensuring that the total requests of all Pods on a node do not exceed the node's capacity, while the total limits are allowed to exceed it. This is the overcommit phenomenon in K8s resource design. Overcommitting can improve throughput and resource utilization to some extent, but node resources may be exhausted: when the memory actually used by the Pods on a node exceeds a certain threshold, K8s terminates one or more of them. To avoid this as much as possible, it is recommended to set memory request and limit values that are equal or close to each other when creating Pods, and to use scheduling rules to spread memory-sensitive Pods across different nodes.

Pod eviction

Common error status: Pod Evicted

When incompressible resources such as node memory or disk run low, K8s evicts some Pods on the node according to their QoS class to release resources and keep the node available. When a Pod is evicted, an upper-level controller such as a Deployment creates a new Pod to maintain the replica count, and the new Pod is scheduled to another node. For memory, as analyzed above, setting reasonable request and limit values avoids exhausting node memory. For disk, Pods generate temporary files and logs at runtime, so Pod disk usage must be limited, otherwise some Pods may quickly fill the disk. Similar to limiting memory and CPU usage, you can limit the amount of local ephemeral storage when creating a Pod. In addition, the Kubelet's default eviction condition triggers when available disk space falls below 10%; you can adjust the disk alarm threshold in cloud monitoring to get alerted in advance.
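
A sketch limiting a Pod's local ephemeral storage (the values are illustrative); a Pod that exceeds its ephemeral-storage limit is evicted:

apiVersion: v1
kind: Pod
metadata:
  name: demo-ephemeral           # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        ephemeral-storage: 1Gi   # considered during scheduling
      limits:
        ephemeral-storage: 2Gi   # exceeding this gets the Pod evicted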

Pod lost

Common error status: Unknown

The Pod is in the Unknown state and its details cannot be obtained. This generally happens because the node's Kubelet is abnormal and cannot report the Pod's information to the API server. First check the node status, locate the error through the Kubelet and container runtime logs, and repair it. If the node cannot be repaired in time, you can remove it from the cluster first.

Cannot be deleted

Common error status: stuck in Terminating

After a Pod is deleted, it stays in the Terminating state for a long time. There are several possible reasons:

• A finalizer associated with the Pod has not completed. First check whether the Pod's metadata contains finalizers, and use the relevant context to determine what each finalizer's task is; an unfinished finalizer is often related to a Volume. If the finalizer cannot complete, you can remove it from the Pod with a patch operation so that the deletion finishes (see the sketch after this list).

• The Pod did not respond to the termination signal. If the Pod's process does not respond to the signal, you can try to force-delete the Pod.

• Node failure. Check the status of other Pods on the same node to confirm whether the node itself is faulty, and try restarting the Kubelet and the container runtime. If the node cannot be repaired, remove it from the cluster first.
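
A sketch of what a stuck finalizer looks like in the Pod's metadata (the finalizer name and timestamp are illustrative); removing it with a patch such as kubectl patch pod <name> -p '{"metadata":{"finalizers":null}}' lets the deletion complete:

apiVersion: v1
kind: Pod
metadata:
  name: demo-terminating          # hypothetical name
  deletionTimestamp: "2024-01-01T00:00:00Z"   # set when the delete was requested
  finalizers:
  - example.com/volume-cleanup    # hypothetical finalizer blocking deletion
spec:
  containers:
  - name: app
    image: nginx:1.25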

EDAS troubleshooting toolchain

EDAS has distilled and analyzed most of the anomalies that occur across the application life cycle, reducing the learning cost and the troubleshooting time for users. It provides a series of solutions and tools for handling anomalies throughout the application life cycle, including pre-checks before application changes, observable tracking of change and runtime events, and diagnostic tools for when application exceptions occur.

Application change pre-check

EDAS runs a pre-check phase before an application change task is issued. The pre-check verifies that the cluster status and the change parameters are valid before deployment, which effectively avoids errors during the change process and reduces change risk. The current pre-check covers items such as available cluster resources, cluster health, and various dependency configurations, and provides analysis and handling suggestions for unexpected results. For example, if the cluster's spare resources cannot satisfy the Pod's scheduling requirements, the pre-check reports a failed resource check, and the user can make targeted adjustments right away.

Application event observation

EDAS provides observability for events across the application life cycle. It presents the full application change process so that users can follow every step of the change and its context directly in the console. When a change goes wrong, the specific events and related resource information are surfaced in the console, the abnormal events are analyzed and interpreted, and operational suggestions are given. For example, if a Pod is configured with an image from a Container Registry repository but the cluster's password-free plug-in is not configured correctly, EDAS raises an image pull failure event and guides the user to check image pull permissions.

Diagnostic toolbox

For an abnormal Pod, you usually need to connect to its containers to diagnose the business process and, if necessary, reproduce the exception. EDAS provides a cloud-native toolbox that lets users open a shell into the Pod's containers from the web page, and offers tools such as Arthas and tcpdump to make up for software missing from the image. For scenarios where the Pod has already disappeared, or where it is not appropriate to diagnose inside the business Pod, the toolbox can replicate the Pod, and users can enable a diagnostic Pod as needed for different diagnosis scenarios.

For the scenario above, where heavy traffic blocks the container process and the Pod is killed by the liveness probe, users can start a diagnostic Pod with the liveness probe removed, set end-to-end traffic control rules, inject some test traffic, and use Arthas tools such as trace, stack, and watch to pinpoint the problem accurately.
