
Container Compute Service: Troubleshoot pod issues

Last Updated: Feb 28, 2026

When a pod fails to start, crashes repeatedly, or behaves unexpectedly, use this guide to identify the root cause and resolve the issue.

Diagnostic workflow

  1. Check the pod status. If the status is anything other than Running, see Pod status reference for targeted solutions.

  2. If the pod is Running but not behaving as expected, see Running but not working as expected.

  3. If the pod was terminated due to an out-of-memory (OOM) error, see Troubleshoot OOM errors.

  4. If the issue persists after troubleshooting, submit a ticket.
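
If you also have kubectl access to the cluster, you can perform the first check from the command line. The namespace below is a placeholder:

# List pods and their current status; the STATUS column corresponds to the statuses in the Pod status reference below
kubectl get pods -n <namespace> -o wide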

Pod status reference

Each status below is followed by its meaning and the section that covers it:

  • Pending -- The pod has not been scheduled to a node. See Pods stuck in Pending.

  • Init:N/M -- N of the pod's M init containers have completed. See Pods stuck in init container states.

  • Init:Error -- An init container failed. See Pods stuck in init container states.

  • Init:CrashLoopBackOff -- An init container is crashing repeatedly. See Pods stuck in init container states.

  • ImagePullBackOff -- The pod failed to pull a container image. See Pods stuck in ImagePullBackOff.

  • CrashLoopBackOff -- The application is crashing repeatedly. See Pods stuck in CrashLoopBackOff.

  • Completed -- All containers exited after finishing the startup command. See Pods stuck in Completed.

  • Running -- All containers are running. If the application is not behaving as expected, see Running but not working as expected.

  • Terminating -- The pod is being deleted. See Pods stuck in Terminating.

Diagnostic tools

All diagnostic tools are available in the ACS console. Start by navigating to the pod:

  1. Log on to the ACS console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Workloads > Pods.

  3. In the upper-left corner of the Pods page, select the namespace to which the pod belongs. Find the pod in the list.

Then use any of the following tools:

View pod details

Click the pod name or click View Details in the Actions column to view information such as the pod name, image, and IP address.

View pod configuration (YAML)

Open the pod details page and click Edit in the upper-right corner to view the YAML file and configuration of the pod.

View pod events

Open the pod details page and click the Events tab in the lower section.

Note

Kubernetes retains events from the previous hour by default. To retain events for a longer period, create and use an Event Center.
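
If you also use kubectl, the same events are available from the command line; the pod and namespace names are placeholders:

# Show pod details, including the Events section at the end of the output
kubectl describe pod <pod-name> -n <namespace>

# List only the events that reference the pod, newest last
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp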

View pod logs

Open the pod details page and click the Logs tab in the lower section.

Note

Alibaba Cloud Container Compute Service (ACS) integrates with Simple Log Service. Enable Simple Log Service when creating a cluster to collect log data from standard output and text files. For more information, see Collect application logs by using the environment variables of pods.
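
If you also use kubectl, the pod's standard output is available directly; the names below are placeholders:

# Print the logs of the pod's container
kubectl logs <pod-name> -n <namespace>

# Select a specific container in a multi-container pod
kubectl logs <pod-name> -n <namespace> -c <container-name>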

View pod monitoring

  1. Log on to the ACS console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the target cluster. On the Prometheus Monitoring page, click the Cluster Overview tab to view CPU usage, memory usage, and network I/O for pods.

Connect to a container terminal

  1. On the Pods page, find the pod and click Terminal in the Actions column to open a shell session in the container.
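
If you prefer kubectl, you can open a shell in the container instead, assuming the image ships a shell such as /bin/sh; the names are placeholders:

# Open an interactive shell in the pod's container (fails if the image has no shell)
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh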

Run pod diagnostics

  1. On the Pods page, find the pod and click Diagnose in the Actions column. Review the diagnostic result after it completes. For more information, see Work with cluster diagnostics.

Pods stuck in Pending

A Pending pod has not been scheduled to any node. This typically happens when required resources are missing or quota configurations are invalid.

Diagnose the issue:

Check the pod events to identify the scheduling failure reason. Common causes include:

  • Missing resource dependencies: Some pods depend on specific cluster resources such as ConfigMaps or persistent volume claims (PVCs). For example, a PVC must be bound to a persistent volume (PV) before it can be used in a pod spec.

  • Invalid quota configurations: The pod's resource requests may exceed the available quota. Check the pod events and audit logs for details.
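
For the first cause, you can verify from the command line that the referenced dependencies exist and that each PVC is Bound; the namespace is a placeholder:

# PVCs referenced by the pod must show STATUS Bound
kubectl get pvc -n <namespace>

# ConfigMaps and Secrets referenced by the pod must exist
kubectl get configmap,secret -n <namespace>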

Pods stuck in init container states

These states indicate that one or more init containers failed to complete:

  • Init:N/M -- The pod has M init containers, and N of them have completed so far.

  • Init:Error -- An init container exited with an error.

  • Init:CrashLoopBackOff -- An init container keeps crashing and restarting.

Diagnose the issue:

  1. Check the pod events for errors in the failing init container. See View pod events.

  2. Check the logs of the failing init container for error details. See View pod logs.

  3. Verify that the init container configuration is correct in the pod YAML. See View pod configuration (YAML). For more information about debugging init containers, see Debug init containers.
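
The same checks can also be run with kubectl; the init container name below is a placeholder taken from your pod spec:

# Show the state of each init container (waiting, running, or terminated with a reason)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.initContainerStatuses[*].state}'

# Read the logs of the failing init container
kubectl logs <pod-name> -n <namespace> -c <init-container-name>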

Pods stuck in ImagePullBackOff

The pod is scheduled but cannot pull one or more container images. Check the pod events to identify which image failed.

Diagnose the issue:

  1. Verify the image name and tag. A typo in the image name or tag is the most common cause.

  2. If the image is stored in a private repository, make sure the correct image pull secret is configured. See Use an image stored in an image repository to create an ACS workload.
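
With kubectl, you can quickly confirm the exact image references and whether an image pull secret is attached to the pod; the names are placeholders:

# Print the image references used by the pod's containers
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Print the names of the image pull secrets attached to the pod (empty if none)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.imagePullSecrets[*].name}'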

Pods stuck in CrashLoopBackOff

The application inside the pod is crashing. Kubernetes restarts it automatically, but the crash recurs.

Diagnose the issue:

  1. Check the pod events for error messages. See View pod events.

  2. Check the pod logs for application errors. See View pod logs.

  3. Review the health check configuration. Misconfigured liveness, readiness, or startup probes can cause Kubernetes to kill a healthy container. See View pod configuration (YAML). For more information about configuring probes, see Configure Liveness, Readiness and Startup Probes.
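
Two kubectl checks that support steps 2 and 3; the names are placeholders:

# Read the logs of the previous, crashed instance of the container
kubectl logs <pod-name> -n <namespace> --previous

# Print the liveness probe configuration to spot overly aggressive settings
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].livenessProbe}'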

Pods stuck in Completed

All containers in the pod finished running the startup command and exited. This is expected for batch jobs but not for long-running services.

Diagnose the issue:

  1. Check the startup command in the pod configuration. The containers may have completed their intended command without errors. See View pod configuration (YAML).

  2. Check the pod logs for additional context. See View pod logs.
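
With kubectl, you can confirm which command ran and how it exited; the names are placeholders, and an exit code of 0 means the command finished without errors:

# Print the configured startup command of each container (empty if the image's default entrypoint is used)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].command}'

# Print the exit code of each terminated container
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].state.terminated.exitCode}'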

Running but not working as expected

The pod shows Running status but the application is not functioning correctly. This is often caused by errors in the pod YAML.

Diagnose the issue:

  1. Compare the pod configuration against your expectations. See View pod configuration (YAML).

  2. Check for typos in field names and environment variable keys. Kubernetes silently ignores misspelled field names -- the pod starts successfully, but the intended configuration does not apply. To detect field-level typos before deploying:

    1. Run the following command:

         kubectl apply --validate -f <your-file>.yaml

      If a field name is misspelled (for example, commnd instead of command), the output includes a warning:

         [<path>] unknown field: commnd
    2. Alternatively, export the running pod's YAML and compare it against the original:

         kubectl get pods <pod-name> -o yaml > pod.yaml

      If the exported YAML is missing fields that appear in the original file, the original may contain misspelled keys.

  3. Check the pod logs for runtime errors. See View pod logs.

  4. Connect to the container terminal to inspect local files and application state. See Connect to a container terminal.
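
From the command line, you can likewise verify that the intended configuration actually reached the container; the names and paths below are placeholders:

# Print the environment variables as seen inside the container
kubectl exec <pod-name> -n <namespace> -- env

# Inspect a mounted configuration file inside the container
kubectl exec <pod-name> -n <namespace> -- cat <mount-path>/<config-file>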

Pods stuck in Terminating

A Terminating pod is being deleted but has not yet stopped. Pods in this state typically resolve on their own after the grace period expires.

If a pod remains in Terminating for an extended period, force-delete it:

kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
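
If you want to understand why the pod is stuck before resorting to force deletion, one common cause is a finalizer that has not been removed; you can check for one as follows (placeholder names):

# List any finalizers still set on the pod; an empty result means no finalizer is blocking deletion
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'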

Troubleshoot OOM errors

When a container's memory usage exceeds its configured memory limit, the kernel terminates the container with an out-of-memory (OOM) kill. The terminated container may restart automatically.

Symptoms:

  • The container restarts unexpectedly.

  • The Events tab on the pod details page shows the event: pod was OOM killed.
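
You can also confirm an OOM kill from the command line, because the reason for the container's last termination is recorded in its status; the names are placeholders:

# Prints OOMKilled if the container was terminated by the OOM killer before its last restart
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'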

Diagnose the issue:

  1. Check the memory usage graph to identify when spikes occurred. See View pod monitoring.

  2. Determine whether the high memory usage is caused by a memory leak or by legitimate workload demands:

    • Memory leak: Investigate the application code based on the timing of memory spikes, log entries, and process names.

    • Normal memory growth: Increase the pod's memory limit. Set the limit so that actual memory usage stays below 80% of the configured limit. For details, see Manage pods.
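
If the pod is managed by a Deployment (an assumption; adjust for your workload type), one way to raise the limit from the command line is shown below; the value is an example:

# Raise the memory limit of all containers in the Deployment's pod template
kubectl set resources deployment <deployment-name> -n <namespace> --limits=memory=1Gi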

For more background, see Assign Memory Resources to Containers and Pods.