On a Linux GPU-accelerated instance, an issue with the GPU System Processor (GSP) component can cause the GPU to fail initialization and generate XID 119 or XID 120 error messages. This topic explains how to resolve this issue.
Symptoms
A GPU falls off the bus and fails to initialize on a Linux system. When you run the sh nvidia-bug-report.sh command, the log contains XID 119 or XID 120 error messages. The following example shows an XID 119 error:
Xid (PCI:0000:69:00): 119, pid=18584, name=cache_mgr_main, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0120 0x0).
Xid (PCI:0000:69:00): 119, pid=18584, name=cache_mgr_main, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
Xid (PCI:0000:69:00): 119, pid=18584, name=cache_mgr_main, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014a 0x10c).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800810 0x7c).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208001a4 0x10).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800609 0x8).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
Xid (PCI:0000:69:00): 119, pid=25394, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
For more information about other XID errors, see NVIDIA Common XID Errors.
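To check for these errors without generating a full bug report, you can filter kernel log text for the two XID codes. The following is a minimal sketch, not an official tool: the function name find_gsp_xids is hypothetical, and the pattern assumes the log lines follow the format shown in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log text on stdin and print only
# the lines that report XID 119 or XID 120.
find_gsp_xids() {
    # Match "Xid (PCI:<address>): 119," or "Xid (PCI:<address>): 120,"
    grep -E 'Xid \(PCI:[^)]*\): (119|120),'
}

# Usage on a live instance (reading the kernel log may require root):
#   sudo dmesg | find_gsp_xids
```

If the filter prints nothing, the log text contains no XID 119 or XID 120 entries in this format.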
Cause
An abnormal state of the GPU System Processor (GSP) component causes this issue. If the issue persists after you upgrade to the latest NVIDIA driver, you should disable the GSP feature.
To learn more about the impact of the GSP feature, see Impact of enabling or disabling the GSP feature.
Solution
-
Connect to the GPU-accelerated instance.
For more information, see Connect to a Linux instance by using Workbench.
-
Run the following commands to disable the GSP component.
sudo su
echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf
-
Restart the GPU-accelerated instance.
For more information, see Restart an instance.
-
Connect to the GPU-accelerated instance again.
-
Run the following command to check the value of the EnableGpuFirmware parameter.
cat /proc/driver/nvidia/params | grep EnableGpuFirmware:
-
If the output is EnableGpuFirmware: 0, the GSP component is disabled and the issue is resolved.
cat /proc/driver/nvidia/params | grep EnableGpuFirmware:
EnableGpuFirmware: 0
Note: If the output is EnableGpuFirmware: 0, the nvidia-smi command reports a normal GPU status.
-
If the output is not EnableGpuFirmware: 0, the GSP component is not disabled. Proceed to the next step to verify the status of the NVIDIA GPU.
-
Run the nvidia-smi command to verify the status of the NVIDIA GPU.
-
The issue is resolved if the command output shows a normal GPU status. For example, the fan speed, temperature, and performance mode are normal in the following output.
[ecs-usexxxukZ ~]$ nvidia-smi
Wed Aug 14 11:02:11 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     On  |   00000000:00:07.0 Off |                  Off |
|  0%   26C    P8              9W /  150W |       1MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
-
If the output is abnormal, contact Alibaba Cloud technical support to request an offline migration.
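The disable and verify steps above can be sketched as two small shell functions. This is an illustrative sketch under stated assumptions, not part of the official procedure: the function names write_gsp_conf and gsp_firmware_state are hypothetical, and the optional path arguments exist only so that the functions can be tried against a scratch file before they touch system paths.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the module option that disables GSP firmware.
# The default path is the one used in the steps above; writing to it
# requires root privileges.
write_gsp_conf() {
    local conf_file="${1:-/etc/modprobe.d/nvidia-gsp.conf}"
    printf 'options nvidia NVreg_EnableGpuFirmware=0\n' > "$conf_file"
}

# Hypothetical helper: report the GSP state recorded in the driver
# parameters file after the instance restarts.
gsp_firmware_state() {
    local params_file="${1:-/proc/driver/nvidia/params}"
    local value
    # Compare the whole key so that other parameters sharing the
    # EnableGpuFirmware prefix are ignored.
    value=$(awk -F': *' '$1 == "EnableGpuFirmware" { print $2; exit }' "$params_file" 2>/dev/null)
    case "$value" in
        0)  echo "disabled" ;;
        "") echo "unknown" ;;
        *)  echo "not disabled (value: $value)" ;;
    esac
}

# Usage on the instance:
#   sudo bash -c "$(declare -f write_gsp_conf); write_gsp_conf"
#   ...restart the instance, reconnect, then:
#   gsp_firmware_state
```

A result of "unknown" means the parameters file could not be read or does not contain the key, for example because the NVIDIA kernel module is not loaded.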