Diagnose the health status of a GPU by using the Cloud Assistant plug-in - Elastic GPU Service

You can use the Cloud Assistant plug-in to diagnose the GPU and GPU driver of a GPU-accelerated instance and identify common runtime errors, such as GPU anomalies and driver anomalies. When an anomaly is detected, the system automatically initiates O&M operations, for example, by sending you a notification.

Procedure

Note

The procedure in this topic applies to GPU-accelerated Linux instances. The Cloud Assistant plug-in is pre-installed on these instances when they are created. For more information about Cloud Assistant, see Overview.

Log on to the ECS console.
In the left-side navigation pane, choose Maintenance & Monitoring > Cloud Assistant.
In the upper part of the page, select the region where the desired GPU-accelerated instance resides.
On the ECS Instances tab, find the instance and click Run Command in the Actions column.
In the Create Command panel, configure parameters in the Command Information section.
The following section describes key parameters. Use default values for other parameters. For more information, see Create a command.
Important
You must set the parameters to the values that are provided in the following section. Otherwise, Cloud Assistant may fail to run the command.
① Command Type: Select Shell.
② Command content: Paste the following command. For more information about sample shell commands, see View the system configurations of ECS instances.
```
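# If a local copy of the ACS-ECS-GpuCheck plug-in exists, remove it first.
if acs-plugin-manager --list --local | grep ACS-ECS-GpuCheck > /dev/null 2>&1
then
    acs-plugin-manager --remove --plugin ACS-ECS-GpuCheck
fi
# Run the GPU diagnostic plug-in.
acs-plugin-manager --exec --plugin ACS-ECS-GpuCheck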
```
③ Timeout: Specify the timeout period for the command. If the command runs longer than this period, Cloud Assistant forcefully terminates the command process. In this example, the value is set to 180 seconds.
Note
The value of the Timeout parameter must be an integer from 10 to 86400. Unit: seconds. A value of 86400 is equivalent to 24 hours.
Click Run. Cloud Assistant runs the command and diagnoses the health status of the GPU-accelerated instance.
- If every diagnostic item in the execution result is in the OK state, no anomalies were detected on the GPU of the instance.
- If one or more diagnostic items, such as Double Bit Error Check, are in the Failed state, anomalies were detected on the GPU of the instance.
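
Two checks from the procedure above can also be scripted locally: the allowed Timeout range and the reading of the execution result. The following is a minimal sketch, not part of Cloud Assistant. The function names `validate_timeout` and `summarize_gpu_check` are hypothetical, and the sketch assumes that the execution result prints each diagnostic item on its own line together with its state (OK or Failed); adjust the pattern if the actual output differs.

```shell
#!/bin/sh
# Hypothetical helpers; neither function is part of Cloud Assistant.

# Succeeds if $1 is an integer from 10 to 86400 (the documented Timeout range).
validate_timeout() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;      # empty or contains a non-digit character
    esac
    [ "${#1}" -le 5 ] || return 1     # reject inputs longer than 5 digits (86400 has 5)
    [ "$1" -ge 10 ] && [ "$1" -le 86400 ]
}

# Reads saved command output on stdin.
# ASSUMPTION: each diagnostic item is printed on its own line together with
# its state (OK or Failed). The real output format may differ.
summarize_gpu_check() {
    gpu_check_failed=$(grep -w 'Failed' || true)
    if [ -n "$gpu_check_failed" ]; then
        # Print every line that reports a failed item and signal an anomaly.
        printf 'Anomalies detected:\n%s\n' "$gpu_check_failed"
        return 1
    fi
    echo 'All diagnostic items are OK'
}
```

For example, `validate_timeout 180` succeeds and `validate_timeout 5` fails. If you save the execution result to a file, `summarize_gpu_check < result.txt` (where `result.txt` is a hypothetical file name) prints the lines that contain Failed and returns a non-zero exit status when any item failed.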

Diagnostic items and troubleshooting methods

The following table describes the diagnostic items that the Cloud Assistant plug-in checks on a GPU-accelerated instance and the troubleshooting method for each item.

| Diagnostic item | Description | Troubleshooting method |
| --- | --- | --- |
| Double Bit Error Check | Checks whether double-bit errors exist on the GPU. | Restart the instance based on the number of errors returned by the system. |
| Info Rom Corrupted Check | Checks the infoROM information about the GPU. | Perform operations based on the O&M notifications that are sent by the system. |
| eRDMA Incorrect Check | Checks the status of the elastic RDMA interface (ERI) of the GPU. | Perform operations based on the O&M notifications that are sent by the system. |
| Kernel Upgrade Check | Checks whether driver anomalies caused by kernel updates exist. | Uninstall the current driver and install a new driver. |
| Fabricmanager running Check | Checks the running status of the Fabricmanager component. | Install or start the Fabricmanager component. |
| Power Cable Error Check | Checks the status of the power cable and power supply of the GPU. | Perform operations based on the O&M notifications that are sent by the system. |
| GPU Device Lost Check | Checks whether the GPU can be found. | Perform operations based on the O&M notifications that are sent by the system. |
| GPU Driver Install Check | Checks the installation status of the GPU driver. | Install the driver. |
| GPU Xid Error Check | Checks whether XID errors exist on the GPU. | Restart the instance based on different XID errors that are reported by the system. |
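
If you process diagnostic results in a script, the table above can be expressed as a simple lookup. The following sketch is illustrative only: `troubleshooting_hint` is a hypothetical function name, and the item names and actions are copied from the table rather than read from the plug-in.

```shell
#!/bin/sh
# Hypothetical helper: prints the troubleshooting method from the table
# for a given diagnostic item name. Not part of Cloud Assistant.
troubleshooting_hint() {
    case "$1" in
        "Double Bit Error Check")
            echo "Restart the instance based on the number of errors returned by the system." ;;
        "GPU Xid Error Check")
            echo "Restart the instance based on the XID errors reported by the system." ;;
        "Kernel Upgrade Check")
            echo "Uninstall the current driver and install a new driver." ;;
        "Fabricmanager running Check")
            echo "Install or start the Fabricmanager component." ;;
        "GPU Driver Install Check")
            echo "Install the driver." ;;
        "Info Rom Corrupted Check"|"eRDMA Incorrect Check"|"Power Cable Error Check"|"GPU Device Lost Check")
            echo "Perform operations based on the O&M notifications sent by the system." ;;
        *)
            # Unknown names are reported on stderr with a non-zero status.
            echo "Unknown diagnostic item: $1" >&2
            return 1 ;;
    esac
}
```

For example, `troubleshooting_hint "GPU Driver Install Check"` prints `Install the driver.`, and a name that is not in the table returns a non-zero exit status.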