All Products
Search
Document Center

Elastic Compute Service:Best practices for improving GPU stability

Last Updated:Mar 26, 2026

With the development of technologies such as AI, deep learning, scientific computing, and big data processing, GPUs become a key component of high-performance computing. To maintain the stable operation of servers, Alibaba Cloud provides the inspection feature to detect hardware faults in advance and allows you to quickly troubleshoot issues by using the self-service diagnostics feature.

Important

This topic applies only to Linux Elastic Compute Service (ECS) instances.

Scenarios

  • GPU-accelerated instance diagnostics

    The ECS self-service diagnostics feature can detect risks on GPU-accelerated instances to help you troubleshoot issues.

  • GPU O&M

    • Alibaba Cloud can detect GPU hardware issues and send O&M event notifications. You can respond to the ECS system events to automatically perform O&M operations to resolve the issues.

    • You can use the self-service diagnostics feature or call the ReportInstancesStatus operation to report the exceptions that occur on one or more GPU-accelerated instances. After Alibaba Cloud receives the report from the instances, Alibaba Cloud checks the instances for exceptions. If an issue occurs, Alibaba Cloud sends an O&M event notification. To resolve the issue, respond to the event to automatically perform O&M operations.

    Important

    After you respond to a system event for an instance, the instance restarts, which interrupts services. We recommend that you use one of the following methods to restart the instance during off-peak hours:

GPU hardware diagnostics

A running GPU-accelerated instance may encounter potential faults or security risks, such as GPU faults or driver errors. You can use one of the following methods to diagnose the instance:

  • Enable the GPU Health Check feature in the ECS console to check whether the GPUs or GPU drivers on the instance work as expected.

  • Use a Cloud Assistant plug-in to comprehensively check the GPUs or GPU drivers on the instance and identify the errors that occur when the GPUs run, such as GPU and driver anomalies. If anomalies are diagnosed, the system automatically performs O&M operations. For example, the system sends notifications.

Diagnose the GPUs of a GPU-accelerated instance by using the self-service diagnostics feature in the ECS console

  1. Go to ECS console - Troubleshooting.

  2. In the top navigation bar, select the region and resource group of the resource that you want to manage. Region

  3. On the Troubleshooting page, specify the issue type, check item, instance ID, and troubleshooting period and click Start.

    111

    View the diagnostic report after the diagnosis is complete.

    image

    The following table describes the items in the diagnostic report.

    Item

    Description

    Diagnostic result

    • If all diagnostic items and the memory usage are normal, the system displays "No exceptions are detected on the instance."

    • If abnormal diagnostic items exist, the system displays "*** exceptions are detected on the instance." Asterisks (***) indicate the number of detected exceptions. The system also provides solutions that you can use to resolve the exceptions.

    Diagnostic item details

    The details of the diagnostic report include the status of GPUs, NVIDIA transaction ID (XID) exceptions, GPU memory status, Fabricmanager component exceptions, and GPU driver status. The report also contains alerts of Critical, Warning, and Passed levels.

    Basic diagnostic information

    The resource ID, report ID, diagnostic time range, and start time of the diagnosis.

Diagnose a GPU-accelerated instance by using Cloud Assistant

  1. Go to ECS console - ECS Cloud Assistant.

  2. In the top navigation bar, select the region and resource group of the resource that you want to manage. Region

  3. On the ECS Instances tab, find the GPU-accelerated instance that you want to diagnose and click Run Command in the Actions column.

  4. In the Create Command panel, configure parameters in the Command Information section.

    The following figure shows the main parameters. Use default values for other parameters. For more information, see Create command.

    2025-02-24_17-23-58

    Important

    You must set the parameters to the following values. Otherwise, Cloud Assistant may fail to run the command.

    • Command Type: Select Shell.

    • Command content: Copy and paste the following command. For information about shell commands, see View the system configurations of ECS instances.

      if acs-plugin-manager --list --local | grep ACS-ECS-GpuCheck > /dev/null 2>&1
      then
          acs-plugin-manager --remove --plugin ACS-ECS-GpuCheck
      fi
      acs-plugin-manager --exec --plugin ACS-ECS-GpuCheck
    • Timeout: the timeout period of the command execution. If a command execution times out, Cloud Assistant forcefully stops the command execution.

      Note

      The value of the Timeout parameter must be a positive integer that ranges from 10 to 86400. Unit: seconds. A value of 86400 is equivalent to 24 hours.

  5. Click Run to run the command to check the health status of the GPU-accelerated instance by using Cloud Assistant.

    • If the execution result shows that each diagnostic item is in the OK state, the GPUs of the instance do not have anomalies.

      image

    • If the execution result shows that one or more diagnostic items are in the Failed state, the GPUs have anomalies.

      image

    The following table describes diagnostic information in the command output.

    Item

    Description

    Troubleshooting method

    Double Bit Error Check

    Checks whether double-bit errors exist on the GPUs of the instance.

    Restart the instance based on the on-screen instructions. The system determines whether an instance restart is required based on the number of returned errors.

    Info Rom Corrupted Check

    Checks the infoROM information of the GPUs of the instance.

    Perform operations based on the O&M notifications.

    eRDMA Incorrect Check

    Checks the status of the elastic RDMA interface (ERI) of the GPU-accelerated instance.

    Perform operations based on the O&M notifications.

    Kernel Upgrade Check

    Checks whether driver anomalies occur due to kernel updates.

    Uninstall and re-install the driver.

    Fabricmanager running Check

    Checks the status of the Fabricmanager component.

    Install or start the Fabricmanager component service.

    Power Cable Error Check

    Checks the status of the power cable and power supply of the GPU-accelerated instance.

    Perform operations based on the O&M notifications.

    GPU Device Lost Check

    Checks whether the GPUs can be found.

    Perform operations based on the O&M notifications.

    GPU Driver Install Check

    Checks the installation status of the GPU driver.

    Install the GPU driver.

    GPU Xid Error Check

    Checks whether XID errors exist on the GPUs.

    Restart the instance based on the on-screen instructions. The system determines whether an instance restart is required based on the detected XID error.