Elastic GPU Service: Resolve XID 119/120 errors causing GPU drops

Last Updated: Jun 21, 2026

On a Linux GPU-accelerated instance, an issue with the GPU System Processor (GSP) component can cause the GPU to fail to initialize and generate XID 119 or XID 120 error messages. This topic explains how to resolve this issue.

Symptoms

A GPU falls off the bus and fails to initialize on a Linux system. When you run the sh nvidia-bug-report.sh command, the generated log contains XID 119 or XID 120 error messages. The following example shows XID 119 errors:

Xid (PCI:0000:69:00): 119, pid=18584, name=cache_mgr_main, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0120 0x0).
Xid (PCI:0000:69:00): 119, pid=18584, name=cache_mgr_main, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
Xid (PCI:0000:69:00): 119, pid=18584, name=cache_mgr_main, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014a 0x10c).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800810 0x7c).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208001a4 0x10).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800609 0x8).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).

Note

For more information about other XID errors, see NVIDIA Common XID Errors.
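
To check a log for these two error codes, a filter along the following lines may help. This is a minimal sketch: the function name filter_gsp_xid is hypothetical, and the pattern only assumes the "Xid (PCI:...): 119," line shape shown in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print only the XID 119 / XID 120 lines from a log on stdin.
# The pattern assumes the "Xid (PCI:<address>): <code>," format shown in the example.
# As with grep itself, the exit status is non-zero when no line matches.
filter_gsp_xid() {
    grep -E 'Xid \(PCI:[^)]*\): (119|120),'
}

# Possible uses on the instance (reading the kernel log may require root):
#   dmesg | filter_gsp_xid
#   gzip -dc nvidia-bug-report.log.gz | filter_gsp_xid
```

The second usage line assumes that nvidia-bug-report.sh wrote its default compressed log, nvidia-bug-report.log.gz, to the current directory. Adjust the path if the log was saved elsewhere.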

Cause

This issue is caused by an abnormal state of the GPU System Processor (GSP) component. If the issue persists after you upgrade to the latest NVIDIA driver, disable the GSP feature as described in the Solution section.

Note

To learn more about the impact of the GSP feature, see Impact of enabling or disabling the GSP feature.
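
To see whether the driver is currently running with GSP firmware, you can inspect the detailed nvidia-smi query output. The following sketch assumes that the output of nvidia-smi -q contains a "GSP Firmware Version" field, as it does on recent drivers. The function name gsp_firmware_version is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the "GSP Firmware Version" value from
# `nvidia-smi -q` output supplied on stdin.
# Assumption: the field is printed as "GSP Firmware Version   : <value>".
gsp_firmware_version() {
    awk -F': *' '/GSP Firmware Version/ { print $2 }'
}

# Possible use on the instance:
#   nvidia-smi -q | gsp_firmware_version
```

Under that assumption, a version number suggests that GSP firmware is in use, and N/A suggests that it is not. On a GPU that has already dropped off the bus, nvidia-smi itself may time out, so this check is more useful before the failure or after the fix.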

Solution

  1. Connect to the GPU-accelerated instance.

    For more information, see Connect to a Linux instance by using Workbench.

  2. Run the following commands to disable the GSP component.

    sudo su
    echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
  3. Restart the GPU-accelerated instance.

    For more information, see Restart an instance.

  4. Connect to the GPU-accelerated instance again.

  5. Run the following command to check the value of the EnableGpuFirmware parameter.

    cat /proc/driver/nvidia/params | grep EnableGpuFirmware:
    • If the output is EnableGpuFirmware: 0, the GSP component is disabled and the issue is resolved. For example:

      cat /proc/driver/nvidia/params | grep EnableGpuFirmware:
      EnableGpuFirmware: 0
      Note

      If the output is EnableGpuFirmware: 0, the nvidia-smi command will report a normal GPU status.

    • If the output shows any other value, the GSP component is still enabled. Proceed to the next step to verify the status of the NVIDIA GPU.

  6. Run the nvidia-smi command to verify the status of the NVIDIA GPU.

    • If the command output shows a normal GPU status, the issue is resolved. In the following example output, the fan speed (Fan), temperature (Temp), and performance state (Perf) are reported normally.

      [ecs-usexxxukZ ~]$ nvidia-smi
      Wed Aug 14 11:02:11 2024
      +-----------------------------------------------------------------------------------------+
      | NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
      |-----------------------------------------+------------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
      |                                         |                        |               MIG M. |
      |=========================================+========================+======================|
      |   0  NVIDIA A10                     On  |   00000000:00:07.0 Off |                  Off |
      |  0%   26C    P8              9W /  150W |       1MiB /  24564MiB |      0%      Default |
      |                                         |                        |                  N/A |
      +-----------------------------------------+------------------------+----------------------+
      +-----------------------------------------------------------------------------------------+
      | Processes:                                                                              |
      |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
      |        ID   ID                                                               Usage      |
      |=========================================================================================|
      |  No running processes found                                                             |
      +-----------------------------------------------------------------------------------------+
    • If the output is abnormal, contact Alibaba Cloud technical support to request an offline migration.
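
The disable and verify steps above can also be sketched as two small shell helpers. This is an illustrative sketch rather than an official script: the function names write_gsp_conf and gsp_disabled are hypothetical, and the optional path arguments exist only so that the helpers can be tried against test files.

```shell
#!/bin/sh
# Hypothetical helper for step 2: write the modprobe option that disables GSP.
# $1: target file (defaults to the path used in step 2). Requires root for the default.
write_gsp_conf() {
    conf="${1:-/etc/modprobe.d/nvidia-gsp.conf}"
    printf '%s\n' 'options nvidia NVreg_EnableGpuFirmware=0' > "$conf"
}

# Hypothetical helper for step 5: succeed only if the driver reports
# "EnableGpuFirmware: 0".
# $1: params file (defaults to /proc/driver/nvidia/params).
gsp_disabled() {
    params="${1:-/proc/driver/nvidia/params}"
    # Compare the whole key so that similar names (for example,
    # EnableGpuFirmwareLogs) are not matched by accident.
    value=$(awk -F': *' '$1 == "EnableGpuFirmware" { print $2 }' "$params")
    [ "$value" = "0" ]
}

# Possible use after the restart in step 3:
#   if gsp_disabled; then echo "GSP is disabled"; else echo "GSP is still enabled"; fi
```

The new setting takes effect only after the instance is restarted, so run the check after you reconnect in step 4.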