Use cluster diagnostics to locate cluster issues - Container Service for Kubernetes

Artificial Intelligence for IT Operations (AIOps) provides one-click diagnostics for nodes, pods, Services, Ingresses, memory, and networks, as well as AI Profiling. These diagnostics help you locate issues in your cluster. This topic describes how to use the cluster diagnostics feature in an ACK cluster.

Prerequisites

An ACK managed cluster has been created. For more information, see Create an ACK managed cluster.
The status of the Kubernetes cluster is Running.
Note
You can log on to the Container Service Management Console. On the Clusters page, check the Cluster Status column to confirm that the cluster status is Running.

Diagnostic features

AIOps provides the diagnostic features described in the following table.

Diagnostic item	Description
Node diagnostics	Diagnose node-related issues, such as Kubernetes nodes in the NotReady state.
Pod diagnostics	Diagnose issues related to abnormal pod status, such as pod startup failures or frequent pod restarts.
Service diagnostics	Diagnose Service-related issues, such as issues with Service configurations, resource quotas, and anomalous events.
Ingress diagnostics	Diagnose Ingress-related issues, such as issues with traffic configurations.
Memory diagnostics	Diagnose node memory issues, such as memory leaks, cgroup leaks, and out-of-memory (OOM) errors. The diagnostic results show the overall memory usage in a visualized chart.
Network diagnostics	Diagnose common network issues, such as connectivity issues between pods, between the cluster and the Internet, or between the Internet and a LoadBalancer.
AI Profiling	Collects real-time data from online GPU containers, including CPU calls, Python processes, system calls, and CUDA kernel functions. You can analyze the data on a visualized chart interface.
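
Before you start a node or pod diagnosis in the console, it can help to know which nodes and pods look unhealthy. The following is a minimal sketch, not part of the ACK diagnostics feature, that scans the JSON printed by `kubectl get nodes -o json` and `kubectl get pods -A -o json`. The function names, the restart threshold of 5, and the sample data are assumptions made for illustration. Only the standard Kubernetes fields `status.conditions` and `status.containerStatuses[].restartCount` are relied on.

```python
def not_ready_nodes(node_list: dict) -> list:
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, or a status of "False" or "Unknown",
        # is treated as NotReady.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


def restarting_pods(pod_list: dict, threshold: int = 5) -> list:
    """Return (namespace/name, restarts) for pods at or above the threshold.

    The threshold is an arbitrary assumption, not a value defined by ACK.
    """
    flagged = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        if restarts >= threshold:
            meta = pod["metadata"]
            key = f'{meta.get("namespace", "default")}/{meta["name"]}'
            flagged.append((key, restarts))
    return flagged


# Hand-written sample data shaped like kubectl JSON output (not real cluster data).
SAMPLE_NODES = {"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    {"metadata": {"name": "node-c"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]}

SAMPLE_PODS = {"items": [
    {"metadata": {"name": "web-0", "namespace": "default"},
     "status": {"containerStatuses": [{"restartCount": 0}]}},
    {"metadata": {"name": "worker-1", "namespace": "jobs"},
     "status": {"containerStatuses": [{"restartCount": 4}, {"restartCount": 3}]}},
]}

if __name__ == "__main__":
    print("Candidates for node diagnostics:", not_ready_nodes(SAMPLE_NODES))
    print("Candidates for pod diagnostics:", restarting_pods(SAMPLE_PODS))
```

To run the sketch against a real cluster, replace the sample dictionaries with `json.load` applied to the saved kubectl output. The nodes and pods it lists are only candidates. The console diagnostics determine the actual cause.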

Configure diagnostics

Important

When you use the cluster diagnostics feature, a data collection program runs on your cluster nodes to collect check results. The collected information includes the system version, load, Docker and kubelet running status, and critical error messages from system logs. The data collection program does not collect your business information or sensitive data.

The procedures for configuring diagnostics for nodes, pods, services, and Ingresses are similar. The following section uses node diagnostics as an example to show how to configure this feature.

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Inspections and Diagnostics > Diagnostics.
On the Diagnostics page, click Node diagnosis. On the Node diagnosis page that appears, click Diagnosis in the upper-left corner.
In the Select Node panel, select a Node Name, read the notes, select I know and agree, and then click Create diagnosis.
You can view the diagnostic progress on the page. After the diagnosis is complete, the page displays the diagnostic results and a list of diagnostic items. You can then review the results to identify the cause of any issues and resolve them.

View diagnostic results

On the Diagnostics page, find the diagnostic report in the list and click Diagnosis details in the Operation column to view the detailed results.

Note

The diagnostic items may vary based on the cluster configuration. If the items on the Diagnostics page differ from those listed here, the items on the page take precedence.

Diagnostic item	Check item status	Description
Node diagnosis	Normal: No action is required. Warning: Review the item and handle any condition that could cause cluster anomalies. Abnormal: Resolve the issue as soon as possible to keep the cluster working. Unknown: The check was not completed or the result is unknown.	Node diagnostics include the Node, NodeComponent, ClusterComponent, ECSControllerManager, and GPUNode check items. The cause of a node anomaly is determined based on the node status, node component status, cluster component status, and ECS status. On the diagnostic details page, you can view the node diagnostic results, repair suggestions, and a list of specific check items. Hover the mouse pointer over the icon to the right of a check item to view its description. If there are check items with an abnormal or warning status, they are displayed on the Troubleshoot tab. If a check item has an abnormal status, you can view the anomaly in the tip that appears when you hover over Details in the Status column for that item.
Pod diagnosis	Same as Node diagnosis.	Pod diagnostics include the Pod, ClusterComponent, Node, NodeComponent, and ECSControllerManager check items. The cause of a pod anomaly is determined based on the pod status, cluster component status, node status, node component status, and ECS status. On the diagnostic details page, you can view the pod diagnostic results, repair suggestions, and a list of specific check items. Hover the mouse pointer over the icon to the right of a check item to view its description. If there are check items with an abnormal or warning status, they are displayed on the Troubleshoot tab. If a check item has an abnormal status, you can view the anomaly in the tip that appears when you hover over Details in the Status column for that item.
Service diagnosis	Same as Node diagnosis.	Service diagnostics include the Service and ResourceQuotas check items. The cause of a service anomaly is determined by checking items such as the CLB billing type, certificates, quotas, and anomalous events. Hover the mouse pointer over the icon to the right of a check item to view its description. If there are check items with an abnormal or warning status, they are displayed on the Troubleshoot tab. If a check item has an abnormal status, you can view the anomaly in the tip that appears when you hover over Details in the Status column for that item.
Ingress diagnosis	Same as Node diagnosis.	Ingress diagnostics include the Ingress, Addon, and SLB check items. The cause of an Ingress anomaly is determined based on the Ingress status, Ingress plug-in status, and SLB status. Hover the mouse pointer over the icon to the right of a check item to view its description. If there are check items with an abnormal or warning status, they are displayed on the Troubleshoot tab. If a check item has an abnormal status, you can view the anomaly in the tip that appears when you hover over Details in the Status column for that item.
Memory diagnosis	None.	On the diagnostic details page, you can view the Memory Overview, Memory Analysis, and OOM Analysis, which include information such as memory leak status, memory utilization, and the memory occupied by each process.
Network diagnostics	Normal: No action is required. Abnormal: Handle as soon as possible.	On the Diagnosis result page, you can view the network diagnostic results. The Packet paths area renders a complete map of the access path for the diagnosis. Abnormal nodes are highlighted in a different color from normal nodes.
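
The way check item statuses drive the Troubleshoot tab can be sketched in a few lines. The example below is an illustration only: the `CheckItem` type, the severity ordering, and the sample details are assumptions, not the console's data model. It keeps the check items whose status is Abnormal or Warning and lists the most severe first, which mirrors how the table above describes the Troubleshoot tab.

```python
from dataclasses import dataclass

# Status names come from the table above. The numeric ordering is an
# assumption used only for sorting in this sketch.
SEVERITY = {"Abnormal": 0, "Warning": 1, "Unknown": 2, "Normal": 3}


@dataclass
class CheckItem:
    name: str         # check item name, for example "NodeComponent"
    status: str       # one of the keys in SEVERITY
    detail: str = ""  # hypothetical free-text description of the anomaly


def troubleshoot_items(items):
    """Return items with Abnormal or Warning status, most severe first."""
    flagged = [i for i in items if i.status in ("Abnormal", "Warning")]
    # sorted() is stable, so items with the same status keep their input order.
    return sorted(flagged, key=lambda i: SEVERITY[i.status])


# Hypothetical results for a node diagnosis.
SAMPLE_ITEMS = [
    CheckItem("Node", "Normal"),
    CheckItem("NodeComponent", "Warning", "example warning detail"),
    CheckItem("ClusterComponent", "Unknown"),
    CheckItem("ECSControllerManager", "Abnormal", "example anomaly detail"),
    CheckItem("GPUNode", "Normal"),
]

if __name__ == "__main__":
    for item in troubleshoot_items(SAMPLE_ITEMS):
        print(f"{item.status}: {item.name} ({item.detail})")
```

Items with an Unknown status are left out here because the source states only that abnormal and warning items appear on the Troubleshoot tab. Whether to surface Unknown items as well is a choice for whoever adapts the sketch.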