GPU-accelerated instances may encounter faults or security vulnerabilities, such as GPU malfunctions or driver anomalies. The Elastic Compute Service (ECS) console provides a troubleshooting feature that performs health checks on GPU devices. Use it to determine whether the GPU or driver of your GPU-accelerated instance is abnormal, so that you can identify and resolve potential problems as early as possible.
Procedure
Before you perform operations, make sure that your GPU-accelerated instance is in the Running state.
Go to the Self-service Troubleshooting page in the ECS console. At the top of the page, select the region where the GPU-accelerated instance is located.
On the Troubleshooting page, configure the issue type, diagnostic item, instance ID, and troubleshooting cycle. Then, click Start.
Note: After you click Start, the system automatically creates a diagnostic task. The system runs only one diagnostic task on an instance within a specific period of time. After the diagnostic task is complete, you must wait for at least 5 minutes before you can start another diagnostic task on the instance.
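The waiting rule in the note can be sketched as a small check. This is only an illustration of the rule as stated, not how the console implements it; the function name `can_start_diagnostic` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Taken from the note: wait at least 5 minutes after a task completes.
COOLDOWN = timedelta(minutes=5)


def can_start_diagnostic(
    last_completed_at: Optional[datetime],
    now: datetime,
    task_running: bool = False,
) -> bool:
    """Return True if a new diagnostic task may be started on the instance.

    Hypothetical helper: it models the two constraints from the note,
    namely one task at a time and a 5-minute wait after completion.
    """
    if task_running:
        # Only one diagnostic task runs on an instance at a time.
        return False
    if last_completed_at is None:
        # No previous task, so nothing to wait for.
        return True
    return now - last_completed_at >= COOLDOWN
```

For example, if the previous task finished at 12:00:00, a new task is rejected at 12:04:59 and accepted from 12:05:00 onward.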

The following table describes the configuration items.
| Serial number | Configuration item | Description |
| --- | --- | --- |
| ① | Issue type | Select Instance Device Check to check whether the instance devices, such as the GPU, run as expected. |
| ② | Diagnostic item | Select GPU Health Check to check the status of the instance devices, such as the status of the GPU and driver. |
| ③ | Instance ID | Select the ID of the GPU-accelerated instance that you want to check. |
|  | Troubleshooting cycle | Specify a time period as needed. By default, the system troubleshoots issues within the most recent 12 hours. |
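The default troubleshooting cycle can be illustrated with a short calculation. The sketch below assumes the window ends at the current time, which the text does not state explicitly; `troubleshooting_window` and its parameters are hypothetical names, not part of any ECS API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# From the table: the default cycle covers the most recent 12 hours.
DEFAULT_WINDOW = timedelta(hours=12)


def troubleshooting_window(
    end: Optional[datetime] = None,
    span: timedelta = DEFAULT_WINDOW,
) -> Tuple[datetime, datetime]:
    """Return (start, end) of the period to troubleshoot.

    Assumption: the window ends "now" unless an explicit end is given.
    """
    if span <= timedelta(0):
        raise ValueError("span must be positive")
    if end is None:
        end = datetime.now(timezone.utc)
    return end - span, end
```

Calling `troubleshooting_window()` with no arguments yields the most recent 12 hours; pass a different `span` to model a custom time period.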
After the instance is diagnosed, view the diagnostic report.

A diagnostic report includes the following items.
| Item | Description |
| --- | --- |
| Diagnostic result | If all diagnostic items are normal, the system displays "No exceptions are detected on the instance." If abnormal diagnostic items exist, the system displays "*** exceptions are detected on the instance.", where *** is replaced with the actual number of exceptions. The system also provides solutions that you can reference to resolve the exceptions. |
| Diagnostic item details | For this check, the system displays only the GPU device and driver status check item. Severity levels are serious, warning, and passed. |
| Basic diagnostic information | The system displays the basic diagnostic information, including the Resource ID, Report ID, and Start At parameters. |
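The relationship between severity levels and the diagnostic result message can be sketched as follows. This assumes that both serious and warning items count as exceptions, which the report description does not confirm; the function `summarize` and the item names in the usage note are hypothetical.

```python
from typing import Dict

# Severity levels named in the report description.
SEVERITY_LEVELS = {"serious", "warning", "passed"}
# Assumption: any level other than "passed" counts as an exception.
ABNORMAL_LEVELS = {"serious", "warning"}


def summarize(items: Dict[str, str]) -> str:
    """Build the diagnostic result line from per-item severity levels."""
    for name, level in items.items():
        if level not in SEVERITY_LEVELS:
            raise ValueError(f"unknown severity level for {name!r}: {level!r}")
    count = sum(1 for level in items.values() if level in ABNORMAL_LEVELS)
    if count == 0:
        return "No exceptions are detected on the instance."
    # The count replaces the *** placeholder in the report message.
    return f"{count} exceptions are detected on the instance."
```

For instance, `summarize({"gpu_device_and_driver_status": "passed"})` returns the no-exceptions message, while an item marked serious produces a message that starts with the exception count.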
(Optional) On the Troubleshooting page, click View History to view the historical diagnostic details of the instance on the Check History page.
Note: On the Instance Health Diagnostics tab of the Check History page, you can click the icon on the right of the Status column to filter reports by state.