Elastic GPU Service: What do I do if a GPU falls off the bus due to an XID 119 or XID 120 error?

Last Updated: Sep 02, 2024

This topic describes the cause of and solution to GPU initialization errors, such as XID 119 or XID 120, that occur on a GPU-accelerated Linux instance. The errors may be caused by exceptions in the GPU System Processor (GSP) component.

Problem description

A GPU falls off the bus on a GPU-accelerated Linux instance. For example, an error message appears indicating that the GPU failed to initialize. After you run the sh nvidia-bug-report.sh command, you can find XID 119 or XID 120 error messages in the generated report. The following figure shows an example of XID 119 error messages.

[Figure: example of XID 119 error messages in the bug report]

Note

For information about other XID errors, visit NVIDIA Common XID Errors.
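
You can also confirm the XID code directly in the kernel log before you generate a full bug report. The following command is a minimal sketch; the exact message format varies by driver version:

    sudo dmesg | grep -i xid

If the GPU fell off the bus due to this issue, the output contains NVRM: Xid entries that report code 119 or 120.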

Cause

The preceding issue may be caused by an exception in the GSP component. First, update the NVIDIA driver to the latest version. If the issue persists after the update, we recommend that you disable the GSP component.

Note

For more information about GSP, see Chapter 42. GSP Firmware in the official NVIDIA documentation.
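
Before you update the driver or disable GSP, you can check the installed driver version and whether GSP firmware is currently in use. The following commands are a minimal sketch and assume that the NVIDIA driver and the nvidia-smi tool are installed:

    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    nvidia-smi -q | grep -i "GSP Firmware"

On drivers that support GSP, the second command returns a GSP Firmware Version field. If the field shows N/A, GSP firmware is not in use.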

Solution

  1. Connect to the GPU-accelerated instance.

    For more information, see Connect to a Linux instance by using a password or key.

  2. Run the following commands to disable the GSP component:

    sudo su
    echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf
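
    On some distributions, the NVIDIA kernel module and its options are loaded from the initial ramdisk, so the new option may not take effect until the ramdisk is rebuilt. The following commands are a minimal sketch; run only the one that matches your distribution:

    # Debian or Ubuntu
    sudo update-initramfs -u
    # RPM-based distributions, such as Alibaba Cloud Linux or CentOS
    sudo dracut --force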
  3. Restart the GPU-accelerated instance.

    For more information, see Restart instances.

  4. Reconnect to the GPU-accelerated instance.

  5. Run the following command to obtain the value of the EnableGpuFirmware parameter:

    cat /proc/driver/nvidia/params | grep EnableGpuFirmware:
    • If 0 is returned for the EnableGpuFirmware parameter, the GSP component is disabled. In this case, the preceding issue is resolved.


      Note

      If the value of the EnableGpuFirmware parameter is 0, running the nvidia-smi command to check the status of the NVIDIA GPU shows that the GPU runs as expected.

    • If 0 is not returned for the EnableGpuFirmware parameter, the GSP component is not disabled. In this case, proceed to the next step to check whether the NVIDIA GPU runs as expected.
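
    Before you proceed, you can also verify that the option file was written correctly and that modprobe reads it. The following commands are a minimal sketch and assume the /etc/modprobe.d/nvidia-gsp.conf file that you created in Step 2:

    cat /etc/modprobe.d/nvidia-gsp.conf
    modprobe --showconfig | grep -i EnableGpuFirmware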

  6. Run the nvidia-smi command to check whether the NVIDIA GPU runs as expected.

    • If the command output indicates that the GPU runs as expected, for example, the output displays normal values for the fan speed, temperature, and performance mode of the GPU, as shown in the following figure, the preceding issue is resolved.

      [Figure: nvidia-smi output of a GPU that runs as expected]

    • If an error is returned, the issue persists on the GPU. Contact Alibaba Cloud technical support to shut down the instance and migrate data.
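
      Before you contact technical support, you can regenerate the bug report so that you can attach it to your ticket. The following command is a minimal sketch; the script writes an nvidia-bug-report.log.gz file to the current directory:

      sudo sh nvidia-bug-report.sh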