Artificial Intelligence for IT Operations (AIOps) provides one-click diagnostics for nodes, pods, services, Ingresses, memory, networks, and AI Profiling. This feature helps you locate issues in your cluster. This topic describes how to use the cluster diagnostics feature in an ACK cluster.
Prerequisites
An ACK managed cluster has been created. For more information, see Create an ACK managed cluster.
The status of the Kubernetes cluster is Running.
Note: You can log on to the Container Service Management Console. On the Clusters page, check the Cluster Status column to confirm that the cluster status is Running.
Diagnostic features
AIOps provides the diagnostic features described in the following table.
Diagnostic item | Description |
Node diagnosis | Diagnoses node-related issues, such as Kubernetes nodes in the NotReady state. |
Pod diagnosis | Diagnoses issues related to abnormal pod status, such as pod startup failures or frequent pod restarts. |
Service diagnosis | Diagnoses service-related issues, such as service configurations, resource quotas, and anomalous activity. |
Ingress diagnosis | Diagnoses Ingress-related issues, such as traffic configurations. |
Memory diagnosis | Diagnoses node memory issues, such as memory leaks, cgroup leaks, and out-of-memory (OOM) errors. The diagnostic results show the overall memory usage in a visualized chart. |
Network diagnostics | Diagnoses common network issues, such as connectivity issues between pods, between the cluster and the Internet, or between the Internet and a LoadBalancer. |
AI Profiling | Collects real-time data from online GPU containers, including CPU calls, Python processes, system calls, and CUDA kernel functions. You can analyze the data on a visualized chart interface. |
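Before you run node diagnostics, it can help to know which nodes are in the NotReady state. The following is a minimal sketch, not part of the AIOps feature, that assumes you have saved the output of `kubectl get nodes -o json` and parsed it into a dictionary. The function name `not_ready_nodes` and the sample node names are illustrative assumptions. The `status.conditions` fields follow the standard Kubernetes Node object.

```python
def not_ready_nodes(node_list: dict) -> list[str]:
    """Return the names of nodes whose Ready condition is not "True".

    `node_list` is assumed to be the parsed JSON from
    `kubectl get nodes -o json` (a NodeList object).
    """
    result = []
    for node in node_list.get("items", []):
        name = node.get("metadata", {}).get("name", "<unknown>")
        conditions = node.get("status", {}).get("conditions") or []
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, or a status of "False" or "Unknown",
        # is treated as NotReady.
        if ready is None or ready.get("status") != "True":
            result.append(name)
    return result


# Sample data shaped like a NodeList; the node names are made up.
SAMPLE_NODES = {
    "items": [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
        {"metadata": {"name": "node-c"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}

print(not_ready_nodes(SAMPLE_NODES))  # ['node-b', 'node-c']
```

Nodes returned by this check are natural candidates for Node diagnosis.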
Configure diagnostics
When you use the cluster diagnostics feature, a data collection program runs on your cluster nodes to collect check results. The collected information includes the system version, load, Docker and kubelet running status, and critical error messages from system logs. The data collection program does not collect your business information or sensitive data.
The procedures for configuring diagnostics for nodes, pods, services, and Ingresses are similar. The following section uses node diagnostics as an example to show how to configure this feature.
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the target cluster. In the navigation pane on the left, choose .
On the Diagnostics page, click Node diagnosis. On the Node diagnosis page that appears, click Diagnosis in the upper-left corner.
In the Select Node panel, select a Node Name, read the notes, select I know and agree, and then click Create diagnosis.
You can view the diagnostic progress on the page. After the diagnosis is complete, the page displays the diagnostic results and a list of diagnostic items. You can then review the results to identify the cause of any issues and resolve them.
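When you are unsure which node to select in the Select Node panel, one heuristic is to look for nodes that host frequently restarting pods. The sketch below is an illustration only and assumes the parsed output of `kubectl get pods -A -o json`. The function name `restarts_by_node`, the default threshold of 5, and the sample pod names are assumptions, not values defined by ACK.

```python
from collections import defaultdict


def restarts_by_node(pod_list: dict, threshold: int = 5) -> dict[str, list[tuple[str, int]]]:
    """Group pods with at least `threshold` total container restarts by node.

    `pod_list` is assumed to be the parsed JSON from
    `kubectl get pods -A -o json` (a PodList object).
    """
    by_node = defaultdict(list)
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        node = pod.get("spec", {}).get("nodeName")
        statuses = pod.get("status", {}).get("containerStatuses") or []
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        # Pods that are not scheduled yet have no nodeName and are skipped.
        if node and restarts >= threshold:
            pod_id = f'{meta.get("namespace", "default")}/{meta.get("name", "<unknown>")}'
            by_node[node].append((pod_id, restarts))
    return dict(by_node)


# Sample data shaped like a PodList; all names are made up.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"namespace": "default", "name": "web-1"},
         "spec": {"nodeName": "node-a"},
         "status": {"containerStatuses": [{"restartCount": 7}, {"restartCount": 1}]}},
        {"metadata": {"namespace": "default", "name": "web-2"},
         "spec": {"nodeName": "node-a"},
         "status": {"containerStatuses": [{"restartCount": 0}]}},
        {"metadata": {"namespace": "jobs", "name": "batch-1"},
         "spec": {"nodeName": "node-b"},
         "status": {"containerStatuses": [{"restartCount": 12}]}},
        {"metadata": {"namespace": "jobs", "name": "pending-1"},
         "spec": {}, "status": {}},
    ]
}

print(restarts_by_node(SAMPLE_PODS))
```

A node that appears here with several restarting pods may be worth diagnosing first. A single restarting pod is more likely a candidate for Pod diagnosis.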
View diagnostic results
On the diagnostics page, find the diagnostic report in the list and click Diagnosis details in the Operation column to view the detailed diagnostic results.
The diagnostic items may vary based on the cluster configuration. The items shown on the diagnostics page for your cluster are authoritative.
Diagnostic item | Check item status | Description |
Node diagnosis | | Node diagnostics include the Node, NodeComponent, ClusterComponent, ECSControllerManager, and GPUNode check items. The cause of a node anomaly is determined based on the node status, node component status, cluster component status, and ECS status. On the diagnostic details page, you can view the node diagnostic results, repair suggestions, and a list of specific check items. Check items with an abnormal or warning status are displayed on the Troubleshoot tab. For an abnormal check item, hover over Details in the Status column to view the anomaly in the tip that appears. |
Pod diagnosis | | Pod diagnostics include the Pod, ClusterComponent, Node, NodeComponent, and ECSControllerManager check items. The cause of a pod anomaly is determined based on the pod status, cluster component status, node status, node component status, and ECS status. On the diagnostic details page, you can view the pod diagnostic results, repair suggestions, and a list of specific check items. Check items with an abnormal or warning status are displayed on the Troubleshoot tab. For an abnormal check item, hover over Details in the Status column to view the anomaly in the tip that appears. |
Service diagnosis | | Service diagnostics include the Service and ResourceQuotas check items. The cause of a service anomaly is determined by checking items such as the CLB billing type, certificates, quotas, and anomalous events. Check items with an abnormal or warning status are displayed on the Troubleshoot tab. For an abnormal check item, hover over Details in the Status column to view the anomaly in the tip that appears. |
Ingress diagnosis | | Ingress diagnostics include the Ingress, Addon, and SLB check items. The cause of an Ingress anomaly is determined based on the Ingress status, Ingress plug-in status, and SLB status. Check items with an abnormal or warning status are displayed on the Troubleshoot tab. For an abnormal check item, hover over Details in the Status column to view the anomaly in the tip that appears. |
Memory diagnosis | None. | On the diagnostic details page, you can view the Memory Overview, Memory Analysis, and OOM Analysis, which include information such as memory leak status, memory utilization, and the memory occupied by each process. |
Network diagnostics | | On the Diagnosis result page, you can view the network diagnostic results. The Packet paths area renders a complete map of the access path for the diagnosis. Abnormal nodes are highlighted in a different color from normal nodes. |
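The rule that check items with an abnormal or warning status appear on the Troubleshoot tab can be pictured as a simple filter. The sketch below is purely illustrative: the `CheckItem` class, its fields, the status strings, and the sample item names are hypothetical and do not reflect the console's actual data format or any ACK API. Sorting abnormal items ahead of warnings is a design choice of this sketch, not documented console behavior.

```python
from dataclasses import dataclass


@dataclass
class CheckItem:
    """Hypothetical model of one check item in a diagnostic report."""
    group: str        # for example "Node" or "NodeComponent"
    name: str         # hypothetical check item name
    status: str       # assumed values: "normal", "warning", "abnormal"
    detail: str = ""  # anomaly text, as shown in the Details tip


def troubleshoot_items(items: list[CheckItem]) -> list[CheckItem]:
    """Keep only warning and abnormal items, with abnormal items first."""
    order = {"abnormal": 0, "warning": 1}
    flagged = [item for item in items if item.status in order]
    # sorted() is stable, so items keep their original order within a status.
    return sorted(flagged, key=lambda item: order[item.status])


# Made-up report entries for illustration.
REPORT = [
    CheckItem("Node", "NodeReady", "normal"),
    CheckItem("NodeComponent", "KubeletStatus", "warning", "kubelet restarted recently"),
    CheckItem("ECSControllerManager", "ECSStatus", "abnormal", "instance is stopped"),
]

for item in troubleshoot_items(REPORT):
    print(f"[{item.status}] {item.group}/{item.name}: {item.detail}")
```

Items with a normal status are dropped, which mirrors how the Troubleshoot tab narrows the full list of check items to the ones that need attention.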