Troubleshoot Container Service for Kubernetes clusters

Last Updated: Aug 20, 2021

Overview

This topic describes how to troubleshoot a Container Service for Kubernetes (ACK) cluster.

Troubleshooting method

Check the nodes in the cluster

  1. Run the following command to check the status of the nodes in the cluster. Make sure that all nodes are listed and they are in the Ready state.
    kubectl get nodes
    The following figure shows the sample command output.
  2. If a node is abnormal, run the following command to view the details and events of the node:
    kubectl describe node [$NODE_NAME]
    Note:
    • Replace [$NODE_NAME] with the name of the node.
    • For more information about how to parse the information returned by the kubectl client, see Nodes in official Kubernetes documentation.
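In a large cluster, you can filter the output of `kubectl get nodes` for abnormal nodes instead of scanning it by eye. The following is a minimal sketch; the node names in the here-document are hypothetical sample output, and on a real cluster you would pipe the live output of `kubectl get nodes --no-headers` into the same awk filter.

```shell
# Filter nodes whose STATUS column is not "Ready".
# The here-document stands in for: kubectl get nodes --no-headers
not_ready=$(awk '$2 != "Ready" {print $1}' <<'EOF'
cn-hangzhou.192.168.0.1   Ready      <none>   30d   v1.20.4
cn-hangzhou.192.168.0.2   NotReady   <none>   30d   v1.20.4
EOF
)
echo "$not_ready"
```

For each node name that is printed, run `kubectl describe node [$NODE_NAME]` to view its details and events.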

Check the components in the cluster

If you still cannot locate issues, you can view the logs of components on the master node.

  1. Run the following command to view all components in the kube-system namespace:
    kubectl get pods -n kube-system
    The following figure shows the sample command output. The pods whose names start with kube- are system components. The pods whose names start with coredns- are plug-ins for domain name resolution.
  2. If a component is abnormal, run the following command to view the logs of the component. Locate and resolve issues based on the logs.
    kubectl logs -f [$Component_Name] -n kube-system
    Note: Replace [$Component_Name] with the name of the abnormal component.
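The abnormal components in the kube-system namespace can likewise be picked out by filtering the STATUS column of `kubectl get pods -n kube-system`. The following is a minimal sketch; the pod names and statuses in the here-document are hypothetical sample output, and on a real cluster you would pipe in the live output of `kubectl get pods -n kube-system --no-headers` instead.

```shell
# Print the name and status of each kube-system pod that is not Running.
# The here-document stands in for: kubectl get pods -n kube-system --no-headers
abnormal=$(awk '$3 != "Running" {print $1, $3}' <<'EOF'
coredns-5d4b7c9c6-abcde          1/1   Running            3   30d
kube-proxy-worker-xyz12          1/1   Running            0   30d
kube-flannel-ds-vwxyz            0/1   CrashLoopBackOff   7   30d
EOF
)
echo "$abnormal"
```

You can then view the logs of each pod that is printed by running `kubectl logs -f [$Component_Name] -n kube-system`.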

Check the kubelet

  1. Run the following command to view the status of the kubelet:
    systemctl status kubelet
  2. If the kubelet is not running, run the following command to view the logs of the kubelet. Locate and resolve issues based on the logs.
    journalctl -u kubelet
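The two steps above can be combined into a small helper that decides the next action from the kubelet's systemd state. This is a sketch that assumes a systemd-based node; pass it the output of `systemctl is-active kubelet`.

```shell
# Decide the next troubleshooting step from the kubelet's systemd state.
# $1 is the output of: systemctl is-active kubelet
next_step() {
  case "$1" in
    active) echo "kubelet is running; check the nodes and components instead" ;;
    *)      echo "kubelet is $1; inspect logs: journalctl -u kubelet -n 100 --no-pager" ;;
  esac
}
next_step active
next_step failed
```

The `-n 100 --no-pager` flags limit journalctl to the most recent 100 log entries and print them directly to the terminal.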

Common failure causes

The following list describes common failure causes and the corresponding solutions for ACK clusters.

Failure cause: The API server or a component on the master node stops.
Results:
  • You cannot create, stop, or update resources such as pods, Services, and Deployments.
  • Existing pods and Services continue to work unless they need to call API operations.
Solution: The components of ACK support high availability. We recommend that you check whether the components are abnormal. For example, the API server of ACK is exposed through a Server Load Balancer (SLB) instance. You can check why your SLB instance stops working.

Failure cause: The backend data of the API server is lost.
Results:
  • The API server cannot be started.
  • Existing pods and Services continue to work unless they need to call API operations.
  • You must restore or re-create the backend data of the API server before you can start the API server.
Solution: If you have created a snapshot, you can restore data from the snapshot. If no snapshot is created, you must re-create the data. We recommend that you take the following measures to prevent the same issues from occurring again:
  • Create snapshots on a regular basis for the volumes that are used by the kubelet. For more information, see Use volume snapshots created from disks.
  • Do not directly create pods. Instead, create pods by using controllers such as Deployments, StatefulSets, and DaemonSets. Controllers can monitor the status of pods and reschedule pods from a failed node to healthy nodes.

Failure cause: A node fails.
Result: All pods on this node stop working.
Solution: Do not directly create pods. Instead, create pods by using controllers such as Deployments, StatefulSets, and DaemonSets. Controllers can monitor the status of pods and reschedule pods from a failed node to healthy nodes.

Failure cause: The kubelet fails.
Results:
  • You cannot create pods on the node where the kubelet fails.
  • The kubelet may delete specific pods by mistake.
  • The node is marked as unhealthy.
  • Deployments or replication controllers create replacement pods on other nodes.
Solution: Perform the steps described in the Check the kubelet section to locate and resolve the issues.

Failure cause: Other causes, such as invalid configurations.
Solution: If you have created a snapshot, you can restore data from the snapshot. If no snapshot is created, you must re-create the data. We recommend that you create snapshots on a regular basis for the volumes that are used by the kubelet. For more information, see Use volume snapshots created from disks.
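The advice above to create pods through controllers rather than directly can be sketched as a minimal Deployment manifest. The name, labels, and image below are hypothetical placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                 # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25   # hypothetical image
```

After you apply the manifest with `kubectl apply -f deployment.yaml`, the Deployment controller monitors the three pods and recreates them on healthy nodes if the node that hosts them fails.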

If the issues persist, locate and resolve the issues based on the diagnostics data. For more information, see How do I collect diagnostics data from nodes in an ACK cluster? Alternatively, you can submit the diagnostics data through a ticket to Alibaba Cloud technical support.


Application scope

  • ACK