
Container Service for Kubernetes:Pod troubleshooting

Last Updated:Mar 26, 2026

This topic describes how to troubleshoot pod issues, including diagnostic procedures and solutions to common issues.

Note

To perform common pod troubleshooting tasks on the console, such as viewing a pod's status, basic information, configuration, events, and logs, accessing a container using the terminal, and enabling pod diagnostics, see Common Troubleshooting Procedures.

Quick diagnostic procedure

To diagnose an abnormal workload pod, go to the details page of the target pod. Click the Events tab to review the descriptions of abnormal events, and then click the Logs tab to check for recent error logs.

Pod in Pending state

If a Pod has an Unschedulable status in its Status Details or a FailedScheduling event appears in Events, go to Nodes > Nodes to check the health status and resource levels (CPU and memory) of the target node. In addition, check whether the Pod's affinity policy is too strict, including its nodeSelector, nodeAffinity, and Taint and Toleration configurations. To further troubleshoot the issue, see Scheduling issues.

Image pull fails (ImagePullBackOff/ErrImagePull)

On the Pods details page, go to the Container tab and check the Image address. Log on to the pod's node and run crictl pull <image-address> or curl -v https://<image-address> to verify network connectivity to the image repository. In the upper-right corner, click Edit YAML and check that the Secret specified in the workload's spec.imagePullSecrets field exists and is valid. For further troubleshooting, see image pulling issues.
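
The check above maps to the workload YAML as follows. This is a minimal sketch; the image address and the Secret name my-registry-secret are placeholders for your own values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
  - name: app
    image: registry.example.com/ns/app:1.0  # verify this address on the Container tab
  imagePullSecrets:
  - name: my-registry-secret  # must exist in the pod's namespace and hold valid credentials
```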

Pod fails to start (CrashLoopBackOff)

This error occurs when an application repeatedly crashes and restarts. On the Pods details page, click the Logs tab and select Show the log of the last container exit to view the cause of the failure. For further troubleshooting, see Troubleshoot pod startup failures.

Pod Running but not ready

This status occurs when the pod's readiness probe fails. On the Edit page of the target Workloads, verify that the health check request path (for example, /healthz) and port match those provided by the application. For further troubleshooting, see The pod is Running but not ready (Ready: False).

You can temporarily disable the health check. Then, access the pod terminal or its host node and use a command, such as curl, to verify that the health check passes.
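
For reference, a hedged sketch of a readiness probe inside a container spec. The path /healthz and port 8080 are placeholders that must match what the application actually serves:

```yaml
containers:
- name: app
  image: registry.example.com/ns/app:1.0  # placeholder image
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /healthz        # must be a path the application actually serves
      port: 8080            # must match the port the application listens on
    initialDelaySeconds: 10 # give the application time to start before probing
    periodSeconds: 5
    failureThreshold: 3
```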

Pod is OOMKilled

On the Pods details page, click the Logs tab and select Show the log of the last container exit to view OOM logs. Check if the application has a memory leak or an out-of-memory (OOM) error. For Java applications, you can optimize the -Xmx parameter. Adjust the application's memory resource limit (resources.limits.memory) as needed. For further troubleshooting, see OOMKilled.

If a liveness probe is configured, the pod remains in the OOMKilled state only briefly before it automatically restarts.
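
The memory settings discussed above map to the container spec as follows. The values are placeholders to size for your workload; for Java applications, keep -Xmx below the limit:

```yaml
resources:
  requests:
    memory: "512Mi"  # what the scheduler reserves for the container
  limits:
    memory: "1Gi"    # the container is OOM-killed if it exceeds this value
```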

Diagnostic workflow

To diagnose an abnormal pod, inspect its events, logs, and configuration.

Troubleshooting workflow


Phase 1: Scheduling issues

Pod not scheduled to a node

If a pod remains in the Pending state for an extended period, it has not been scheduled to a node. This section describes common causes and solutions.

Error message

Description

Solution

no nodes available to schedule pods.

The cluster has no available nodes for pod scheduling.

  1. Check if any nodes in the cluster are in the NotReady state. If a node is NotReady, inspect and repair it.

  2. Check if the pod defines a nodeSelector, nodeAffinity, or taint tolerations. If no such scheduling constraints are defined, consider adding more nodes to the node pool.

  • 0/x nodes are available: x Insufficient cpu.

  • 0/x nodes are available: x Insufficient memory.

No available nodes in the cluster can meet the pod's CPU or memory resource requests.

A node is considered unschedulable if the sum of its allocated resource requests has reached its capacity, even if the actual CPU or memory utilization is low.

On the target cluster's details page, go to Nodes > Nodes and check the CPU or memory requests allocation rate for the target node. You can hover over the allocation rate to view the specific resource allocation values.


To view detailed node resource usage, see Use kubectl to view node resource usage.

  • Optimize resource configuration:

    • If a node's resource usage is consistently lower than its requests, it indicates that resources are being wasted. You can lower the requests configuration for the workload. For more information, see Set CPU and memory resource limits for a container.

      You can enable resource profiling to obtain the recommended requests configuration.
    • Enable the Horizontal Pod Autoscaler (HPA) for your business containers to reduce the number of replicas during off-peak hours, thereby lowering overall resource consumption.

  • Clean up unnecessary workloads: Decommission or scale down non-essential pods.

  • Scale out the node pool: If the resource usage on the target nodes is consistently high, the nodes are saturated. You can scale out the node pool.
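
The HPA suggestion above can be sketched as follows. The workload name app-demo, the replica bounds, and the 70% CPU target are placeholders to tune for your workload:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-demo        # placeholder workload name
  minReplicas: 2          # floor during off-peak hours
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # scale out when average CPU usage exceeds 70% of requests
```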

x node(s) didn't match pod's node affinity/selector.

The existing nodes in the cluster do not match the node affinity policy (nodeAffinity/nodeSelector) declared for the Pod. For more information, see Assigning Pods to Nodes.

  1. View all labels on a node.

    Console

    1. On the target cluster's details page, go to Nodes > Nodes.

    2. On the Nodes page, find the target node, and in the Actions column, click More > Manage Labels and Taints to view its labels.

    Kubectl

    Replace <YOUR_NODE_NAME> with your actual node name.

    kubectl get node <YOUR_NODE_NAME> --show-labels
  2. Check and adjust the node affinity rule for the workload (deployment).

    Console

    When creating a new workload:

    1. On the Advanced page of the Create Deployment wizard, find Node Affinity in the Scheduling section, and click Add.

    2. Configure either Required (hard affinity) or Optional (soft affinity) based on your business needs. Multiple Selectors within a rule have a logical AND relationship, while multiple Rules have a logical OR relationship.

    For existing workloads:

    1. On the Workloads > Deployments page, click More > Node Affinity in the Actions column of the target Deployment.

    2. The configuration method is the same as described above.

    YAML example

    NodeAffinity

    Affinity policies are divided into hard affinity (requiredDuringSchedulingIgnoredDuringExecution) and soft affinity (preferredDuringSchedulingIgnoredDuringExecution). Hard affinity specifies a rule that must be met, while soft affinity specifies a preference. The following example uses hard affinity.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: app-demo-node-affinity-deploy
      labels:
        app: demo-node-affinity
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: demo-node-affinity
      template:
        metadata:
          labels:
            app: demo-node-affinity
        spec:
          containers:
          - name: nginx
            image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
          affinity:
            nodeAffinity:
              # Hard affinity: The rule must be met.
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: disktype
                    operator: In
                    values:
                    - ssd
                    - nvme  # Logic: The node's 'disktype' label must be either 'ssd' or 'nvme'.

    NodeSelector

    This provides a simple exact match. The pod is scheduled only if the node's labels meet the conditions.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: app-demo-node-selector-deploy
      labels:
        app: demo-node-selector
    spec:
      replicas: 2  
      selector:
        matchLabels:
          app: demo-node-selector  
      template:
        metadata:
          labels:
            app: demo-node-selector
        spec:
          containers:
          - name: nginx
            image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
          # The pod is scheduled only if the node has the label disktype=ssd.
          nodeSelector:
            disktype: ssd
  • x node(s) didn't match pod affinity rules.

  • x node(s) didn't match pod anti-affinity rules.

  • Affinity rule mismatch. The pod has a pod affinity rule (for example, requiring a specific label), but no nodes host a pod with a matching label, preventing scheduling.

  • Anti-affinity conflict. The pod has a pod anti-affinity rule (for example, it cannot coexist with another application), but all available nodes already host a conflicting pod, preventing scheduling.

  1. View the pod labels on a node.

    Console

    1. On the target cluster's details page, go to Nodes > Nodes.

    2. On the Nodes page, click the name of the target node to view its details page. Scroll down to the Pods section to view the label values for different pods in the Label column.

    Kubectl

    • View pods and their labels on a specific node: Replace <YOUR_NAMESPACE> with your namespace name and <YOUR_NODE_NAME> with your actual node name.

      kubectl get pods -n <YOUR_NAMESPACE> --field-selector spec.nodeName=<YOUR_NODE_NAME> -o custom-columns=NAME:.metadata.name,LABELS:.metadata.labels
    • Query pods by label: Replace <LABEL> with the actual label key-value pair, such as app=nginx.

      kubectl get pods -A -l <LABEL> -o wide
  2. Check and adjust the pod affinity rule for the workload (deployment).

    Console

    1. When you create a new workload, on the Create Deployment's Advanced page, find Pod Affinity/Pod Anti-affinity in the Scheduling section, and click Add.

    2. Configure either Required (hard affinity) or Optional (soft affinity) based on your business needs. Multiple Selectors within a rule have a logical AND relationship, while multiple rules added with Add Rule have a logical OR relationship.

    YAML example

    Affinity policies are classified into hard affinity (requiredDuringSchedulingIgnoredDuringExecution) and soft affinity (preferredDuringSchedulingIgnoredDuringExecution). Hard affinity rules must be met, while soft affinity rules are preferred. The following example shows a configuration for a required pod affinity.

    To configure pod anti-affinity, simply replace podAffinity with podAntiAffinity.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: app-demo-podaffinity-deploy
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: demo-podaffinity
      template:
        metadata:
          labels:
            app: demo-podaffinity
        spec:
          containers:
          - name: nginx
            image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
          affinity:
            podAffinity:
              # Hard affinity: Pod must be co-located with a pod that has the 'app: nginx' label.
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - nginx
                # Topology domain scope: host-level isolation.
                topologyKey: kubernetes.io/hostname

0/x nodes are available: x node(s) had volume node affinity conflict.

Scheduling fails due to a volume node affinity conflict. This typically occurs because a cloud disk cannot be mounted across different zones.

  • For a statically provisioned PV, configure the pod's node affinity to ensure it is scheduled to a node in the same zone as the PV.

  • For a dynamically provisioned PV, set the volumeBindingMode of the StorageClass to WaitForFirstConsumer. This ensures that the PV is created only after the pod has been scheduled to a node, ensuring the cloud disk is created in the same zone as the pod's node.
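
The dynamic provisioning fix above can be sketched as a StorageClass, assuming the ACK CSI disk plugin; the class name and disk type are placeholders:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-wffc           # placeholder name
provisioner: diskplugin.csi.alibabacloud.com
volumeBindingMode: WaitForFirstConsumer  # delay disk creation until the pod is scheduled
parameters:
  type: cloud_essd                   # placeholder disk type
```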

InvalidInstanceType.NotSupportDiskCategory

The ECS instance does not support the specified cloud disk type.

See Instance families to confirm the cloud disk types supported by your ECS instance. When mounting, update the cloud disk type to one that is supported by the ECS instance.

0/x nodes are available: x node(s) had taints that the pod didn't tolerate.

The pod cannot be scheduled to a node because it lacks a toleration for one of the node's taints.

  • If the taint was added manually, you can remove the unintended taint. If the taint cannot be removed, you can configure a corresponding toleration for the pod. For more information, see Taints and Tolerations and Manage node labels and taints.

  • If the taint was added automatically by the system, resolve the underlying issue as described below and wait for the pod to be rescheduled.

    View taints added by the system

    • node.kubernetes.io/not-ready: The node is in the NotReady state.

    • node.kubernetes.io/unreachable: The node is unreachable from the node controller. This is equivalent to the node's Ready status being Unknown.

    • node.kubernetes.io/memory-pressure: The node is under memory pressure.

    • node.kubernetes.io/disk-pressure: The node is under disk pressure.

    • node.kubernetes.io/pid-pressure: The node is under PID pressure.

    • node.kubernetes.io/network-unavailable: The node's network is unavailable.

    • node.kubernetes.io/unschedulable: The node is marked as unschedulable.
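
A manually added taint that cannot be removed can be tolerated with a configuration like the following. The key and value are placeholders that must match the taint shown on the node:

```yaml
tolerations:
- key: "example.com/dedicated"  # placeholder: the node taint's key
  operator: "Equal"
  value: "gpu"                  # placeholder: the taint's value
  effect: "NoSchedule"          # must match the taint's effect
```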

0/x nodes are available: x Insufficient ephemeral-storage.

The node has insufficient ephemeral storage.

  1. Check the Pod's ephemeral storage request, which is the value of spec.containers.resources.requests.ephemeral-storage in the Pod YAML. If the value is too high and exceeds the actual available capacity of the node, the Pod will fail to be scheduled.

  2. Run the kubectl describe node | grep -A10 Capacity command to view the total ephemeral storage capacity on each node. If the capacity is insufficient, expand the node's disk or add more nodes.
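
The ephemeral storage request from step 1 sits in the container spec. The values below are placeholders; the request must fit within the node's allocatable capacity:

```yaml
resources:
  requests:
    ephemeral-storage: "2Gi"  # must not exceed the node's allocatable ephemeral storage
  limits:
    ephemeral-storage: "4Gi"  # the pod is evicted if it exceeds this value
```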

0/x nodes are available: pod has unbound immediate persistent volume claims.

The pod failed to bind to a persistent volume claim (PVC).

Check if the PVC or PV specified by the pod has been created. Run kubectl describe pvc <pvc-name> or kubectl describe pv <pv-name> to view the events of the PVC and PV for further diagnosis. For more information, see Storage FAQ - CSI.

Pod is scheduled but remains Pending

If a pod has been scheduled to a node but remains in the Pending state, follow these steps to resolve the issue.

  1. Determine if a Pod is configured with hostPort: If a Pod is configured with hostPort, only one Pod instance that uses that hostPort can run on each node. Therefore, the Replicas value in a Deployment or ReplicationController cannot exceed the number of nodes in the cluster. If this port is in use by another application, Pod scheduling fails.

    hostPort introduces some management and scheduling complexities. We recommend that you use a Service to access Pods. For more information, see Service.

  2. If the Pod is not configured with hostPort, follow the steps below to troubleshoot.

    1. Run kubectl describe pod <pod-name> to view the pod's events and resolve any issues found. The events can explain why the pod failed to start, with common reasons including image pull failures, insufficient resources, security policy restrictions, or configuration errors.

    2. If the Event object contains no useful information, check the kubelet logs on the node to troubleshoot issues during the Pod startup process. You can use the grep -i <pod name> /var/log/messages* | less command to search the system log file (/var/log/messages*) for log entries that contain the specified Pod name.
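
The earlier recommendation to use a Service instead of hostPort can be sketched as follows. The Service name, selector label, and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-svc
spec:
  selector:
    app: demo          # must match the pod labels of the workload
  ports:
  - port: 80           # port exposed by the Service
    targetPort: 8080   # container port; no hostPort needed on the pod
```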

Phase 2: Image pull issues

ImagePullBackOff or ErrImagePull

A pod status of ImagePullBackOff or ErrImagePull indicates that the image pull has failed. In this case, examine the pod events and use the information below to troubleshoot the issue.

Error message

Description

Suggested solution

Failed to pull image "xxx": rpc error: code = Unknown desc = Error response from daemon: Get xxx: denied:

Access to the image repository is denied because an imagePullSecret was not specified when the pod was created.

Verify that the Secret specified in the spec.imagePullSecrets field of the workload's YAML file exists.

When using Container Registry (ACR), you can use a credential helper to pull images without a password. For more information, see Pull images from the same account.

Failed to pull image "xxxx:xxx": rpc error: code = Unknown desc = Error response from daemon: Get https://xxxxxx/xxxxx/: dial tcp: lookup xxxxxxx.xxxxx: no such host

The image repository address could not be resolved when pulling an image over HTTPS.

  1. Verify that the image repository address in spec.containers.image of the pod's YAML file is correct. If it is incorrect, update it.

  2. If the address is correct, verify the network connectivity from the node where the pod is running to the image repository. Log on to the node (for more information, see Choose an ECS remote connection method) and run the curl -kv https://xxxxxx/xxxxx/ command to check if the address is accessible. If an error occurs, investigate for potential network issues, such as incorrect network configuration, firewall rules, or DNS resolution problems.

Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "xxxxxxxxx": Error response from daemon: mkdir xxxxx: no space left on device

The node has insufficient disk space.

Log on to the node where the pod is running (for more information, see Choose an ECS remote connection method) and run df -h to check the disk space. If the disk is full, resize the cloud disk. For more information, see Step 1: Resize a cloud disk.

Failed to pull image "xxx": rpc error: code = Unknown desc = error pulling image configuration: xxx x509: certificate signed by unknown authority

The third-party image repository uses a certificate signed by an unknown or insecure Certificate Authority (CA).

  1. The third-party repository should use a certificate issued by a trusted CA.

  2. If you are using a private image repository, see Create an application from a private image repository.

  3. If you cannot change the certificate, you can configure the node to allow pulling and pushing images from a repository that uses an insecure certificate. We recommend using this method only in test environments, as it may affect other pods on the node.

View detailed steps

Console

Configure containerd parameters using the console

Important

This change does not affect existing containers. To keep your cluster stable, perform this operation during off-peak hours.

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.

  3. On the node pool list page, click More > Containerd Configuration in the Actions column of the target node pool.

  4. Read the important notes on the current page. Add the parameters you need, select the target nodes, and set the batch configuration policy. Then click Submit. See the configuration examples below.

    • Removing a container runtime configuration parameter reverts it to its default value automatically.

    • After you click Submit, the configuration applies to nodes in batches. You can track progress and control execution in the Events section — pause, resume, or cancel as needed. If a node task fails, troubleshoot the node and click Continue to retry. When you pause, nodes currently being configured finish applying the changes before pausing. Nodes not yet started wait until you resume. Complete the task as soon as possible — tasks paused for more than 7 days are canceled automatically, and the related events and logs are cleaned up.

Configuration examples

Configure a replacement image repository for docker.io

Skip certificate verification for a private repository

Configure an HTTP private image repository


CLI

  1. Create a certificate directory for containerd to store certificate configuration files for specific image repositories.

    mkdir -p /etc/containerd/cert.d/xxxxx
  2. Configure containerd to trust a specific insecure image repository.

    cat << EOF > /etc/containerd/cert.d/xxxxx/hosts.toml
    server = "https://harbor.test-cri.com"
    [host."https://harbor.test-cri.com"]
      capabilities = ["pull", "resolve", "push"]
      skip_verify = true
      # ca = "/opt/ssl/ca.crt"  # Or upload a CA certificate
    EOF
  3. If the node uses the Docker runtime instead of containerd, modify the Docker daemon configuration to add the insecure repository.

    vi /etc/docker/daemon.json

    Add the following content. Replace your-insecure-registry with your private repository's address.

       {
         "insecure-registries": ["your-insecure-registry"]
       }
  4. Restart the containerd service for the changes to take effect.

    systemctl restart containerd

Failed to pull image "XXX": rpc error: code = Unknown desc = context canceled

The operation was canceled, possibly because the image file is too large. Kubernetes has a default timeout for pulling images. If the pull makes no progress for a specific period, Kubernetes assumes the operation has failed or is unresponsive and cancels the task.

  1. Verify that imagePullPolicy is set to IfNotPresent in the pod's YAML file.

  2. Log on to the node where the pod is running (for more information, see Choose an ECS remote connection method) and run docker pull or crictl pull to check if the image can be pulled.

Failed to pull image "xxxxx": rpc error: code = Unknown desc = Error response from daemon: Get https://xxxxxxx: xxxxx/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Cannot connect to the image repository due to network issues.

  1. Log on to the node where the pod is running (for more information, see Choose an ECS remote connection method) and run the curl https://xxxxxx/xxxxx/ command to check if the address is accessible. If an error occurs, investigate for potential network issues, such as incorrect network configuration, firewall rules, or DNS resolution problems.

  2. Verify the node's public network policy, including configurations for SNAT entries and bound Elastic IP Addresses (EIPs).

Failed to pull image "xxxx:xxx": failed to pull and unpack image "xxxx:xxx": failed to resolve reference "xxxx:xxx": failed to do request: Head "xxxx:xxx": dial tcp xxx.xxx.xx.x:xxx: i/o timeout

Connection timed out due to network issues when pulling an image from an overseas repository.

Pulling images from overseas repositories, such as Docker Hub, may fail in ACK clusters due to unstable cross-border carrier networks. To resolve this, host the image closer to the cluster, for example by uploading it to Container Registry (ACR) in the same region and pulling it from there.

Too Many Requests.

Docker Hub imposes rate limits on image pull requests.

Upload the image to Container Registry (ACR) and pull it from an ACR image repository.

The status Pulling image is consistently displayed

The kubelet's image pull rate limiting mechanism may have been triggered.

Adjust the registryPullQPS (maximum QPS for the image repository) and registryBurst (maximum number of burst image pulls) using the Customize kubelet configurations for a node pool feature.
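
The two fields above are standard kubelet configuration fields. A minimal sketch follows; the upstream defaults are registryPullQPS: 5 and registryBurst: 10, and the values below are illustrative:

```yaml
# Kubelet image pull rate limits (KubeletConfiguration fields).
registryPullQPS: 10   # max sustained image pulls per second; 0 disables the limit
registryBurst: 20     # max burst of pulls; only used when registryPullQPS > 0
```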

Phase 3: Startup issues

Pod is in the Init state

Error message

Description

Solution

Stuck in the Init:N/M state

The pod contains M init containers. N of them have completed, but the remaining M-N init containers have failed to start.

  1. Run the kubectl describe pod -n <ns> <pod name> command to view the pod's events and check for issues with the unstarted init containers.

  2. Run the kubectl logs -n <ns> <pod name> -c <container name> command to view the logs of the unstarted init containers and use them to troubleshoot the issue.

  3. Review the pod's configuration, such as the health check settings, to ensure the init containers are configured correctly.

For more information about init containers, see Debug init containers.

Stuck in the Init:Error state

An init container in the pod failed to start.

Stuck in the Init:CrashLoopBackOff state

An init container in the pod failed to start and is in a restart loop.
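
For reference, a minimal sketch of a pod with an init container. The service name db-service, its port, and the images are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
  - name: wait-for-db            # must complete successfully before app containers start
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z db-service 5432; do sleep 2; done']
  containers:
  - name: app
    image: registry.example.com/ns/app:1.0  # placeholder image
```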

Pod is in the Creating state

Error message

Description

Solution

failed to allocate for range 0: no IP addresses available in range set: xx.xxx.xx.xx-xx.xx.xx.xx

This is expected behavior due to the design of the Flannel network plugin.

Upgrade the Flannel component to v0.15.1.11-7e95fe23-aliyun or later. For more information, see Flannel.

In clusters that run a Kubernetes version earlier than 1.20, an IP address leak can occur if a pod restarts repeatedly or if pods from a CronJob complete their tasks and exit quickly.

Upgrade the cluster to Kubernetes 1.20 or later. We recommend using the latest cluster version. For more information, see Manually upgrade a cluster.

Defects in containerd and runC cause this issue.

For an emergency fix, see Why does my pod fail to start with the error "no IP addresses available in range"?

error parse config, can't found dev by mac 00:16:3e:01:c2:e8: not found

The Terway network plugin maintains an internal database on the node to track and manage elastic network interfaces (ENIs). This error occurs when the database state is inconsistent with the actual network device configuration, causing ENI allocation to fail.

  1. Network interfaces load asynchronously. The interface might still be loading during CNI configuration, which triggers an automatic CNI retry. This process does not affect the final ENI allocation. Check the pod's final status to confirm success.

  2. If pod creation still fails after a long time and this error persists, the driver likely failed to load the ENI due to insufficient high-order memory. Restart the ECS instance to resolve this issue. For more information, see Restart an instance.

  • cmdAdd: error alloc ip rpc error: code = DeadlineExceeded desc = context deadline exceeded

  • cmdAdd: error alloc ip rpc error: code = Unknown desc = error wait pod eni info, timed out waiting for the condition

The Terway network plugin may have failed to request an IP address from the vSwitch.

  1. View the logs of the Terway container within the Terway component pod on the node to check the ENI allocation process.

  2. Run the kubectl logs -n kube-system <terwayPodName> -c terway | grep <podName> command to view ENI information for the Terway pod. Obtain the Request ID for the IP address request and the OpenAPI error message.

  3. Use the Request ID and error message to investigate the failure.

Pod fails to start (CrashLoopBackOff)

Error message

Description

Solution

The log contains exit(0).

  1. Log on to the node where the abnormal workload is deployed.

  2. Run the docker ps -a | grep $podName command to check the container's exit status. An exit code of 0 means the main process ran to completion and exited normally, which usually indicates that the container lacks a long-running foreground process. Adjust the container's startup command so that it runs a persistent foreground process.

The pod's events show Liveness probe failed: ....

The liveness probe failed, causing the application to restart.

  • Liveness probe configuration: On the Edit page of the target Workloads, verify that the health check request path (for example, /healthz) and port match those that the application provides. Increase the Initial Delay (s) to ensure the liveness probe starts only after the application has fully launched.

    You can temporarily disable the liveness probe. Then, access the pod terminal or its host node and use a command, such as curl, to verify that the health check endpoint responds correctly.
  • Troubleshoot application issues: Investigate the issue by checking the pod's Events and Log. Select Show the log of the last container exit.

The pod's events show Startup probe failed: ....

The startup probe failed, causing the application to restart.

  • Startup probe configuration: On the Edit page of the target Workloads, verify that the health check request path (for example, /healthz) and port match those that the application provides. If the application takes a long time to start, increase the Unhealthy Threshold to prevent premature restarts.

    You can temporarily disable the startup probe. Then, access the pod terminal or its host node and use a command, such as curl, to verify that the health check endpoint responds correctly.
  • Troubleshoot application issues: Investigate the issue by checking the pod's Events and Logs. Select Show the log of the last container exit.

The pod log contains no space left on device.

Insufficient cloud disk space.

  • Resize the cloud disk. For more information, see Step 1: Resize a cloud disk.

  • Clean up unnecessary images to free up disk space, and configure imageGCHighThresholdPercent to set the threshold for image garbage collection on the node.
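
The image garbage collection thresholds mentioned above are standard kubelet configuration fields (defaults are 85 and 80); the values below are illustrative:

```yaml
imageGCHighThresholdPercent: 80  # disk usage % at which image garbage collection starts
imageGCLowThresholdPercent: 70   # disk usage % that garbage collection tries to free down to
```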

Startup fails without event information.

This issue occurs when a container requires more resources than its declared limits, causing it to fail.

Check whether the pod's resource configuration is correct. You can enable resource profiling to get recommended Request and Limit configurations for the container.

The pod log shows Address already in use.

A port conflict exists between containers in the same pod.

  1. Check whether the pod is configured with hostNetwork: true. This setting causes containers in the pod to share the host's network namespace and port space. If this is not required, change it to hostNetwork: false.

  2. If the pod requires hostNetwork: true, configure pod anti-affinity to ensure that pods from the same replica set are scheduled to different nodes.

  3. Verify that no other pod on the same node is using the port.

The pod log shows container init caused "setenv: invalid argument": unknown.

The workload mounts a Secret, but the value in the Secret is not Base64-encoded.

  • Create the Secret in the console, where values are automatically Base64-encoded. For more information, see Manage Secrets.

  • Create the Secret from a YAML file and manually Base64-encode the value by running the echo -n "xxxxx" | base64 command.
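
The manual encoding step can be verified locally before the value is placed in the Secret. This sketch uses printf rather than echo -n for portability; the value password is just an example:

```shell
# Base64-encode a Secret value. printf '%s' avoids encoding a trailing newline,
# which is the same pitfall that echo -n works around.
printf '%s' 'password' | base64
# -> cGFzc3dvcmQ=

# Decode to confirm the round-trip before pasting the value into the Secret YAML.
printf '%s' 'cGFzc3dvcmQ=' | base64 -d
# -> password
```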

Application-specific issue.

Examine the pod logs to troubleshoot the issue.

Pod is Running but not ready (Ready: False)

Error message

Description

Solution

The pod's events show Readiness probe failed: ....

The readiness probe failed, preventing the target pod from receiving traffic.

  • Readiness probe configuration: On the Edit page of the target Workloads, verify that the health check request path (for example, /healthz) and port match those that the application provides. If the application has a long startup time, increase the Unhealthy Threshold to avoid premature failures.

    You can temporarily disable the readiness probe, log on to the pod's terminal or its host, and use a command, such as curl, to verify that the health check endpoint responds correctly.
  • Troubleshoot application issues: Investigate the issue by checking the pod's Events and Logs. Select Show the log of the last container exit.

The pod status is the same as above. The pod's events show Startup probe failed: ....

A failed startup probe causes the container to restart. This error should not result in a persistent Running/NotReady state but rather a 'CrashLoopBackOff' state.

Troubleshoot this issue as described in the "Pod fails to start (CrashLoopBackOff)" section under startup probe failures.

Phase 4: Pod runtime issues

OOMKilled

When a container in your cluster uses more memory than its specified limit, it may be terminated due to an out-of-memory (OOM) event, causing the container to exit unexpectedly. For more information about OOM events, see Assign Memory Resources to Containers and Pods.

  • If the terminated process is the container's main process, the container might restart unexpectedly.

  • When an OOM event occurs, it appears on the Events tab of the pod details page in the console, such as pod was OOM killed. node:XXX pod:XXX namespace:XXX.

  • If you configure an alert for container replica exceptions for the cluster, you will receive a notification when an OOM event occurs. For more information, see Container replica exception alert rule set.

OOM level

Description

Recommended solution

OS level

Check the kernel log at /var/log/messages on the pod's node. If the log shows a killed process but contains no cgroup logs, the OOM event occurred at the OS level.

cgroup level

Check the kernel log at /var/log/messages on the pod's node. If the log contains an error message similar to Task in /kubepods.slice/xxxxx killed as a result of limit of /kubepods.slice/xxxx, the OOM event occurred at the cgroup level.

  • Increase the pod's memory limits based on your business requirements. Actual memory usage should not exceed 80% of the limit. For more information, see Manage pods and Scale node resources.

  • Enable resource profiling to get recommended request and limit configurations for your containers.

For more information about OOM event causes and solutions, see Causes and solutions for OOM Killer.
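The cgroup-versus-OS distinction above can be sketched offline; the kernel log lines below are fabricated samples in the format the table describes, standing in for /var/log/messages on the node.

```shell
# Fabricated /var/log/messages excerpt for illustration only.
cat > /tmp/messages.sample <<'EOF'
kernel: Task in /kubepods.slice/kubepods-pod123.slice killed as a result of limit of /kubepods.slice/kubepods-pod123.slice
kernel: Out of memory: Killed process 4321 (java)
EOF
# A "killed as a result of limit of /kubepods..." line indicates a cgroup-level
# OOM (the pod hit its own memory limit). A plain "Out of memory: Killed process"
# line with no cgroup context indicates an OS-level OOM on the node.
if grep -q 'killed as a result of limit of /kubepods' /tmp/messages.sample; then
  echo "cgroup-level OOM found"
fi
```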

Terminating

Possible cause

Description

Recommended solution

The node is in the NotReady state.

The pod is automatically deleted after the node recovers from the NotReady state.

The pod is configured with finalizers.

If a pod is configured with finalizers, Kubernetes performs the cleanup operations specified by the finalizers before deleting the pod. If a cleanup operation fails to respond, the pod remains in the Terminating state.

Run the kubectl get pod -n <ns> <pod name> -o yaml command to view the pod's finalizer configuration and investigate the cause.

The pod's preStop hook is invalid or stuck.

If a preStop hook is configured for the pod, Kubernetes executes the hook before terminating the container. The pod remains in the Terminating state while the hook is running.

Run the kubectl get pod -n <ns> <pod name> -o yaml command to view the pod's preStop hook configuration and investigate the cause.

A graceful shutdown period is configured for the pod.

If a Pod is configured with a graceful shutdown period (terminationGracePeriodSeconds), the Pod enters the Terminating state after it receives a termination command, such as kubectl delete pod <pod_name>. Kubernetes considers the Pod to be successfully shut down only after the time specified in terminationGracePeriodSeconds elapses or the container exits.

Kubernetes automatically deletes the pod after the container completes a graceful shutdown.

The container is unresponsive.

When you request to stop or delete a pod, Kubernetes sends a SIGTERM signal to the containers in the pod. If a container does not correctly handle the SIGTERM signal during termination, the pod may remain in the Terminating state.

  1. Run the kubectl delete pod <pod-name> --grace-period=0 --force command to forcefully delete the pod and release its resources.

  2. Check the containerd or Docker logs on the pod's node to investigate further.
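The finalizer check above can be sketched offline; the manifest below is fabricated, standing in for the output of kubectl get pod -n <ns> <pod name> -o yaml on a real cluster.

```shell
# Fabricated pod manifest dump; on a real cluster, produce it with:
#   kubectl get pod -n <ns> <pod name> -o yaml > /tmp/pod.yaml
cat > /tmp/pod.yaml <<'EOF'
metadata:
  name: my-pod
  finalizers:
    - example.com/cleanup-hook
EOF
# List the finalizer names. Each listed controller must complete its cleanup
# before Kubernetes can finish deleting the pod.
awk '/^  finalizers:/{f=1; next} f && /^ *- /{print $2} f && $0 !~ /^ *- /{f=0}' /tmp/pod.yaml
```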

Evicted

Possible cause

Description

Recommended solution

The node is under resource pressure from factors like memory or disk usage.

The node may be experiencing memory pressure, disk pressure, or PID pressure.

  • Run the kubectl describe node <node name> | grep Taints command. The output may include the following taints:

    • Memory pressure: The node has the node.kubernetes.io/memory-pressure taint.

    • Disk pressure: The node has the node.kubernetes.io/disk-pressure taint.

    • PID pressure: The node has the node.kubernetes.io/pid-pressure taint.

  • The pod status is one of the following:

    • Evicted

    • ContainerStatusUnknown, and the reason field in the pod's YAML file shows Evicted.

  • Memory pressure:

    • Adjust the pod's resource configuration based on your business requirements. For more information, see Manage pods.

    • Upgrade the node. For more information, see Scale node resources.

  • Disk pressure:

    • Periodically clear application logs from pods on the node to free up disk space.

    • Expand the node's disk. For more information, see Step 1: Resize a cloud disk.

  • PID pressure: Adjust the pod's resource configuration based on your business requirements. For more information, see Process ID Limits and Reservations.

An unexpected eviction occurs.

A manually added NoExecute taint on the pod's node caused an unexpected eviction.

Run the kubectl describe node <node name> | grep Taints command to check if the node has a NoExecute taint. If it does, remove it.

Eviction does not proceed as expected.

  • --pod-eviction-timeout: Pods on a failed node are evicted after this timeout period. The default is 5 minutes.

  • --node-eviction-rate: The number of pods evicted from a node per second. The default is 0.1, meaning at most one pod is evicted from a node every 10 seconds.

  • --secondary-node-eviction-rate: The secondary node eviction rate. If too many nodes in a cluster fail, the eviction rate is reduced to this value. The default is 0.01.

  • --unhealthy-zone-threshold: The unhealthy availability zone threshold. The default is 0.55. When the fraction of failed nodes in an availability zone exceeds this threshold, the zone is considered unhealthy.

  • --large-cluster-size-threshold: The large cluster size threshold. The default is 50. A cluster is considered large when it has more than 50 nodes.

In a small cluster (50 nodes or fewer), if more than 55% of the nodes fail, pod eviction stops. For more information, see Rate limits on eviction.

In a large cluster (more than 50 nodes), if the fraction of unhealthy nodes exceeds the --unhealthy-zone-threshold (default 0.55), the eviction rate reduces to the value of --secondary-node-eviction-rate (default 0.01 pods per second). For more information, see Rate limits on eviction.

A pod is frequently rescheduled to its original node after being evicted.

The kubelet evicts pods based on actual resource usage, whereas the scheduler places pods based on resource requests. Because an eviction frees up resources, the scheduler might reschedule a pod to the same node if its requests still fit.

Ensure the pod's resource requests are appropriate for the node's allocatable resources, and adjust them if necessary. For more information, see Set CPU and memory resources for a container. You can also enable resource profiling to get recommended request and limit configurations for your containers.
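As a quick sanity check, the eviction-rate defaults quoted above translate into intervals as follows (the rates are the defaults from the table; the arithmetic is just their reciprocals).

```shell
# --node-eviction-rate 0.1            -> one eviction per 10 seconds
# --secondary-node-eviction-rate 0.01 -> one eviction per 100 seconds
awk 'BEGIN {
  printf "normal:  one pod every %.0fs\n", 1/0.1
  printf "reduced: one pod every %.0fs\n", 1/0.01
}'
```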

Completed

When a pod is in the Completed state, all its containers have finished their commands and exited successfully. This state is common for workloads such as jobs and init containers.

FAQ

Pod is running but not working

Errors in your application's YAML file can cause a pod to enter the Running state but fail to function correctly.

  1. Verify the container settings in the pod's configuration.

  2. Use the following methods to check your YAML configuration for spelling errors.

    When you create a Pod, if a key in the YAML file is misspelled (for example, spelling command as commnd), the cluster ignores the error and successfully creates the resource. However, the system cannot execute the command specified in the YAML file while the container is running.

    The following example, in which command is misspelled as commnd, describes how to troubleshoot spelling issues.

    1. Run the kubectl apply --validate -f XXX.yaml command to validate the file against the API schema before applying it.

      If you misspell a word, an error is reported: XXX] unknown field: commnd XXX] this may be a false alarm, see https://gXXXb.XXX/6842pods/test.

    2. Run the following command and compare the output pod.yaml with the original YAML file used to create the pod.

      Note

      [$Pod] is the name of the abnormal Pod, which you can obtain by running the kubectl get pods command.

        kubectl get pods [$Pod] -o yaml > pod.yaml
      • If the pod.yaml file has more lines than the original file, it means the pod was created as expected, and the cluster added default values.

      • If lines from your original YAML file are missing from pod.yaml, this indicates a spelling error in your original file.

  3. Check the pod's logs to troubleshoot the issue.

  4. Access the container through a terminal and verify that the local files within the container are as expected.
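Step 2 above can be rehearsed offline. Both files below are fabricated: the "live" dump simulates the API server silently dropping the misspelled commnd key, which is exactly the symptom the comparison surfaces.

```shell
# Fabricated original manifest containing the typo.
cat > /tmp/original.yaml <<'EOF'
containers:
- name: app
  commnd: ["sleep", "3600"]
EOF
# Fabricated live dump; on a real cluster it comes from:
#   kubectl get pods [$Pod] -o yaml > pod.yaml
cat > /tmp/pod.yaml <<'EOF'
containers:
- name: app
EOF
# Lines prefixed "<" exist only in your original file. A field that vanished
# from the live object without an error is a strong hint of a misspelled key.
diff /tmp/original.yaml /tmp/pod.yaml | grep '^<'
```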

Check node resource usage with kubectl

  1. Check the CPU and memory usage of all nodes in the cluster.

    kubectl describe nodes | awk '/^Name:/{print "\n"$2} /Resource +Requests +Limits/{print $0} /^[ \t]+cpu.*%/{print $0} /^[ \t]+memory.*%/{print $0}'

    Expected output:

    cn-hangzhou.192.168.0.xxx
      Resource           Requests      Limits
      cpu                1725m (44%)   10320m (263%)
      memory             1750Mi (11%)  16044Mi (109%)
    
    cn-hangzhou.192.168.16.xxx
      Resource           Requests      Limits
      cpu                1885m (48%)   16820m (429%)
      memory             2536Mi (17%)  25760Mi (179%)

    A node with high request utilization may be unable to satisfy the requests of a new Pod, preventing the Pod from being scheduled.

  2. Replace YOUR_NODE_NAME with the actual node name to view the resource usage of all Pods on the node.

    kubectl describe node YOUR_NODE_NAME | awk '/Non-terminated Pods/,/Allocated resources/{ if ($0 !~ /Allocated resources/) print }'

    Expected output:

    Non-terminated Pods:          (11 in total)
      Namespace                   Name                                                        CPU Requests  CPU Limits   Memory Requests  Memory Limits  Age
      ---------                   ----                                                        ------------  ----------   ---------------  -------------  ---
      arms-prom                   node-exporter-gp95p                                         20m (0%)      1020m (26%)  160Mi (1%)       1152Mi (7%)    6d21h
      csdr                        csdr-velero-77c8bbc9c7-w46lq                                500m (12%)    1 (25%)      128Mi (0%)       2Gi (13%)      6d19h
      kube-system                 ack-cost-exporter-5b647ffc65-zdrsl                          100m (2%)     1 (25%)      200Mi (1%)       1Gi (6%)       6d21h
      kube-system                 ack-node-local-dns-admission-controller-5dfd74f5f4-9rl6n    100m (2%)     1 (25%)      100Mi (0%)       1Gi (6%)       6d21h
      kube-system                 ack-node-problem-detector-daemonset-6wql2                   200m (5%)     1200m (30%)  300Mi (2%)       1324Mi (9%)    6d21h
      kube-system                 coredns-7784559f6-dr9sn                                     100m (2%)     0 (0%)       100Mi (0%)       2Gi (13%)      6d21h
      kube-system                 csi-plugin-knz7j                                            130m (3%)     2 (51%)      176Mi (1%)       4Gi (27%)      6d21h
      kube-system                 kube-proxy-worker-rkbzv                                     100m (2%)     0 (0%)       100Mi (0%)       0 (0%)         6d21h
      kube-system                 loongcollector-ds-kw7cj                                     100m (2%)     2 (51%)      256Mi (1%)       2Gi (13%)      6d21h
      kube-system                 node-local-dns-pgzcn                                        25m (0%)      0 (0%)       30Mi (0%)        1Gi (6%)       6d21h
      kube-system                 terway-eniip-lnn8n                                          350m (8%)     1100m (28%)  200Mi (1%)       256Mi (1%)     6d21h

    You can adjust the requests configuration based on actual resource consumption.
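The filter in step 1 can be exercised offline against a canned kubectl describe nodes excerpt (values fabricated) to show exactly which lines it keeps: the node name plus the cpu and memory request/limit rows.

```shell
# Canned excerpt of `kubectl describe nodes` output for illustration only.
cat > /tmp/describe.sample <<'EOF'
Name:               cn-hangzhou.192.168.0.xxx
Allocated resources:
  Resource           Requests      Limits
  cpu                1725m (44%)   10320m (263%)
  memory             1750Mi (11%)  16044Mi (109%)
EOF
# Same awk program as step 1, pointed at the sample file instead of kubectl.
awk '/^Name:/{print "\n"$2} /Resource +Requests +Limits/{print $0} /^[ \t]+cpu.*%/{print $0} /^[ \t]+memory.*%/{print $0}' /tmp/describe.sample
```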

Intermittent network disconnections from pods to databases

If a pod in your ACK cluster intermittently disconnects from a database, follow these steps to troubleshoot the issue.

1. Check pod
  • Check the pod's events for signs of connection instability, such as network issues, restarts, or insufficient resources.

  • Check the pod's logs for any error messages related to the database connection, such as timeouts, authentication failures, or reconnection triggers.

  • Monitor the pod's CPU and memory usage to ensure resource exhaustion does not cause the application or database driver to crash.

  • Review the pod's resource requests and limits to ensure it has sufficient CPU and memory.

2. Check node
  • Check the node's resource usage for shortages of memory, disk space, or other resources. For more information, see Monitor nodes.

  • Test for intermittent network disruptions between the node and the target database.

3. Check database
  • Check the status and performance metrics of the database for any restarts or performance bottlenecks.

  • Review the number of abnormal connections and the connection timeout settings, and adjust them based on your application's requirements.

  • Inspect the database logs for any records related to disconnections.

4. Check cluster component status

Faulty cluster components can disrupt a pod's network communication.

kubectl get pod -n kube-system  # Check the status of component pods.

Also, check the following network components:

  • CoreDNS: Check the component's status and logs to ensure the pod can correctly resolve the database service address.

  • Flannel: Check the status and logs of the kube-flannel component.

  • Terway: Check the status and logs of the terway-eniip component.

5. Analyze network traffic

You can use tcpdump to capture packets and analyze network traffic to help identify the cause of the problem.

  1. Get Pod and node information:

    Run the following command to get information about the pods in a specific namespace and the nodes they are running on:

    kubectl get pod -n [namespace] -o wide
  2. Log on to the target node and run the following commands to find the container PID.

    Containerd

    1. Run the following command to find the container ID (the CONTAINER column).

      crictl ps |grep <Pod name keyword>

      Expected output:

      CONTAINER           IMAGE               CREATED             STATE                      
      a1a214d2*****       35d28df4*****       2 days ago          Running
    2. Run the following command with the CONTAINER ID parameter to view the container PID.

      crictl inspect a1a214d2***** |grep -i PID

      Expected output:

          "pid": 2309838,    # The PID of the target container.
                  "pid": 1
                  "type": "pid"

    Docker

    1. Run the following command to view the container's CONTAINER ID.

      docker ps |grep <pod name keyword>

      Expected output:

      CONTAINER ID        IMAGE                  COMMAND     
      a1a214d2*****       35d28df4*****          "/nginx
    2. Run the following command with the CONTAINER ID parameter to view the container PID.

      docker inspect  a1a214d2***** |grep -i PID

      Expected output:

                  "Pid": 2309838,  # The PID of the target container.
                  "PidMode": "",
                  "PidsLimit": null,
  3. Run the packet capture commands.

    Use the container PID to run the following command and capture network packets between the pod and the target database.

    nsenter -t <container PID> -n tcpdump -i any -n -s 0 tcp and host <database IP address>

    Use the container PID to run the following command and capture network packets between the pod and the host.

    nsenter -t <container PID> -n tcpdump -i any -n -s 0 tcp and host <node IP address>

    Run the following command to capture network packets between the host and the database.

    tcpdump -i any -n -s 0 tcp and host <database IP address> 
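The PID lookup in step 2 can be sketched offline against a fabricated crictl inspect fragment; on a real node the input comes from crictl inspect <CONTAINER ID> instead. The first top-level "pid" is the container's host PID, which is the value to pass to nsenter -t.

```shell
# Fabricated fragment of `crictl inspect` JSON output for illustration only.
cat > /tmp/inspect.sample <<'EOF'
{"info":{"pid":2309838,"runtimeSpec":{"linux":{"namespaces":[{"type":"pid"}]}}}}
EOF
# Extract the first "pid" value; subsequent pid fields describe namespaces,
# not the container's host PID.
grep -o '"pid":[0-9]*' /tmp/inspect.sample | head -n1 | cut -d: -f2
```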
6. Optimize application
  • Implement an automatic reconnection mechanism in your application to ensure it can restore connections automatically during a database switchover or migration.

  • Use persistent connections instead of short-lived connections to communicate with the database. Persistent connections can significantly reduce performance overhead and resource consumption, improving overall system efficiency.

Console troubleshooting

Log on to the ACK console and go to the details page of your cluster to troubleshoot Pod issues.

Actions

Console

Check the status of a Pod

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.

  2. In the upper-left corner of the Pods page, select the Pod's Namespace and check its status.

    • If the status is Running, the Pod is working as expected.

    • If the status is not Running, the Pod is in an abnormal state. See this topic for troubleshooting steps.

Check the basic information of a Pod

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.

  2. In the upper-left corner of the Pods page, select the target Pod's Namespace. Then, click the Pod's name or click Details in the Actions column to view details such as the Pod name, image, IP address, and the node on which it runs.

Check the configuration of a Pod

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.

  2. In the upper-left corner of the Pods page, select the target Pod's Namespace. Then, click the Pod's name or click Details in the Actions column.

  3. In the upper-right corner of the Pod details page, click Edit YAML to view the Pod's YAML configuration file.

Check the events of a Pod

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.

  2. In the upper-left corner of the Pods page, select the target Pod's Namespace. Then, click the Pod's name or click Details in the Actions column.

  3. At the bottom of the Pod details page, click the Events tab to view the Pod's events.

    Note

    By default, Kubernetes retains events for the past hour. To store events for a longer period, see Create and use K8s Event Center.

View the logs of a Pod

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.

  2. In the upper-left corner of the Pods page, select the target Pod's Namespace. Then, click the Pod's name or click Details in the Actions column.

  3. At the bottom of the Pod details page, click the Logs tab to view the Pod's logs.

Note

ACK clusters are integrated with Simple Log Service (SLS). You can enable SLS in your cluster to quickly collect container logs. For more information, see Collect container logs from an ACK cluster.

Check the monitoring data of a Pod

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Operations > Prometheus Monitoring.

  2. On the Prometheus Monitoring page, click the Cluster Overview tab to view monitoring dashboards for the Pod's CPU, memory, and network I/O.

Note

ACK clusters are integrated with Managed Service for Prometheus. You can quickly enable Managed Service for Prometheus for your cluster to monitor the health of your cluster and containers in real time and view Grafana dashboards. For more information, see Connect to and configure Managed Service for Prometheus.

Use a terminal to access a container and view local files

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.

  2. On the Pods page, find the target Pod and click Terminal in the Actions column.

Run Pod diagnostics

  1. On the Clusters page, click the name of your cluster. In the left navigation pane, click Workloads > Pods.

  2. On the Pods page, find the target Pod and click Diagnose in the Actions column. Resolve any identified issues based on the diagnostic results.

Note

Container Intelligent Service provides a one-click diagnostics feature to help you identify issues in your cluster. For more information, see Use cluster diagnostics.

Unexpected Pod deletion

When a cluster contains a large number of Pods with a Completed status, the kube-controller-manager (KCM) garbage-collects them to prevent performance degradation of its controllers. This cleanup occurs when the number of completed Pods exceeds the default threshold of 12,500. The --terminated-pod-gc-threshold parameter configures this threshold. For more information, see the community KCM parameter documentation.

Recommendation: Periodically clean up Pods with a Completed status in your cluster to prevent them from affecting controller efficiency.