On GPU-accelerated instances that run Ubuntu, such as ebmgn7 and ebmgn7e instances, the apt-daily service may automatically update the nvidia-fabricmanager package if you installed the nvidia-fabricmanager service from an installation package. The update causes a version mismatch with the Tesla driver, which prevents the nvidia-fabricmanager service from starting and makes the GPU unavailable. This topic describes how to resolve this issue.
Problem description
After you install nvidia-fabricmanager by using an installation package, the service status shows the following error and the GPU fails to work as expected. In this log, the nvidia-fabricmanager version (550.90.07) does not match the Tesla driver version (550.54.15).
root@xxx:~# systemctl status nvidia-fabricmanager
× nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2024-09-09 18:05:58 CST; 22s ago
Process: 36178 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
CPU: 66ms
Sep 09 18:05:58 iZ2xxx0d5fZ systemd[1]: Starting NVIDIA fabric manager service...
Sep 09 18:05:58 iZ2xxx fZ nv-fabricmanager[36180]: fabric manager NVIDIA GPU driver interface version 550.90.07 don't match with driver version 550.54.15. Please up
Sep 09 18:05:58 iZ2xxx fZ nv-fabricmanager[36180]: fabric manager NVIDIA GPU driver interface version 550.90.07 don't match with driver version 550.54.15. Please up
Sep 09 18:05:58 iZ2xxx fZ systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Sep 09 18:05:58 ixxxd5fZ systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 09 18:05:58 iZ2xxx5fZ systemd[1]: Failed to start NVIDIA fabric manager service.
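If the journal is long, the two versions can be pulled out of the mismatch message with a small filter. The following is a minimal sketch, not an official tool: the helper name parse_mismatch is hypothetical, and the pattern assumes the message wording shown in the log above.

```shell
#!/usr/bin/env bash
# Print "<fabricmanager version> <driver version>" for every mismatch
# message read from stdin; lines that do not match print nothing.
# "don.t" matches "don't" without embedding a quote in the pattern.
parse_mismatch() {
  sed -nE 's/.*interface version ([0-9.]+) don.t match with driver version ([0-9]+(\.[0-9]+)*).*/\1 \2/p'
}

# Query the journal only where journalctl is available.
if command -v journalctl >/dev/null 2>&1; then
  journalctl -u nvidia-fabricmanager --no-pager 2>/dev/null | parse_mismatch | sort -u
fi
```

The first value is the version reported by nvidia-fabricmanager, and the second is the version of the installed Tesla driver.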
Cause
If you install nvidia-fabricmanager by using an installation package on a GPU-accelerated compute-optimized instance that runs Ubuntu, the apt-daily service automatically updates nvidia-fabricmanager. This results in version inconsistency between nvidia-fabricmanager and the Tesla driver. As a result, nvidia-fabricmanager fails to start and the GPU fails to work as expected.
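To confirm that a package update changed the nvidia-fabricmanager version, you can look for an upgrade record in the dpkg log. The following is a minimal sketch: the helper name fabricmanager_upgrades is hypothetical, and the script assumes the default Ubuntu log location /var/log/dpkg.log. Older records may have been rotated into other files.

```shell
#!/usr/bin/env bash
# Read dpkg log text on stdin and print one line per upgrade of a
# fabric manager package: "<date> <time> <package> <old> -> <new>".
fabricmanager_upgrades() {
  awk '$3 == "upgrade" && $4 ~ /^nvidia-fabricmanager/ { print $1, $2, $4, $5, "->", $6 }'
}

LOG=/var/log/dpkg.log   # assumed default dpkg log location on Ubuntu
if [ -r "$LOG" ]; then
  fabricmanager_upgrades < "$LOG"
fi
```

An upgrade record dated shortly before the service started to fail suggests that the package was updated independently of the Tesla driver.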
Solution
The GPU can work as expected only if the nvidia-fabricmanager version is consistent with the Tesla driver version. To prevent or resolve GPU unavailability caused by version inconsistency between nvidia-fabricmanager and the Tesla driver, perform the following steps:
- Check the nvidia-fabricmanager version and the Tesla driver version.
  - Run the following command to check the nvidia-fabricmanager version:
    sudo dpkg --list | grep nvidia-fabricmanager
    In this example, the nvidia-fabricmanager version is 550.90.07. nvidia-fabricmanager-550 is the name of the installation package.
    ii nvidia-fabricmanager-550 550.90.07-1 amd64 Fabric Manager for NVSwitch based systems.
  - Run the following command to check the Tesla driver version:
    nvidia-smi
    In this example, the Tesla driver version is 550.90.07.
    NVIDIA-SMI 550.90.07    Driver Version: 550.90.07    CUDA Version: 12.4

    GPU  Name           Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC
    Fan  Temp   Perf    Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M.
                                      |                        |               MIG M.
    =================================================================================
      0  NVIDIA A10                On |   00000000:00:07.0 Off |                    0
     0%   35C    P8         9W / 150W |        1MiB / 23028MiB |     0%       Default
                                      |                        |                  N/A

    Processes:
    GPU   GI   CI        PID   Type   Process name        GPU Memory
          ID   ID                                         Usage
    No running processes found
- Check whether the nvidia-fabricmanager version is consistent with the Tesla driver version.
  - If the two versions are consistent, proceed to the next step.
  - If the two versions are inconsistent, perform one of the following operations:
    - Upgrade the Tesla driver so that its version matches the nvidia-fabricmanager version. For more information, see Upgrade an NVIDIA Tesla driver.
    - Uninstall nvidia-fabricmanager and reinstall the version that matches the Tesla driver version. Then, proceed to the next step.
      Note: For information about how to uninstall nvidia-fabricmanager, see Step 1: Uninstall nvidia-fabricmanager.
- Run the following command to prevent nvidia-fabricmanager from being automatically updated:
  In this example, the installation package nvidia-fabricmanager-550 is used. Replace the package name in the command with the actual nvidia-fabricmanager package name.
  sudo apt-mark hold nvidia-fabricmanager-550
  If the following output is returned, the package is on hold and is no longer updated automatically:
  nvidia-fabricmanager-550 set on hold.
- Run the following command to verify that updates to nvidia-fabricmanager are blocked:
  sudo apt-mark showhold
  If the output lists cloud-init and nvidia-fabricmanager-550, updates to nvidia-fabricmanager are blocked.
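The version check and the hold check above can also be combined into one script, for example to run after maintenance. The following is a minimal sketch under stated assumptions: the helper names upstream_version, versions_match, and is_held are hypothetical, the package name nvidia-fabricmanager-550 is taken from this example, and the script assumes that the package version is the driver version followed by a revision suffix such as -1.

```shell
#!/usr/bin/env bash
# Strip the Debian revision suffix (for example "-1") from a package version.
upstream_version() { printf '%s\n' "${1%%-*}"; }

# Succeed only when the package's upstream version equals the driver version.
versions_match() { [ "$(upstream_version "$1")" = "$2" ]; }

# Succeed when package $1 appears as a whole line in the
# `apt-mark showhold` output read from stdin.
is_held() { grep -qxF -- "$1"; }

PKG=nvidia-fabricmanager-550   # replace with your actual package name

# Run the live checks only where all required tools are installed.
if command -v dpkg-query >/dev/null 2>&1 &&
   command -v nvidia-smi >/dev/null 2>&1 &&
   command -v apt-mark >/dev/null 2>&1; then
  fm=$(dpkg-query -W -f='${Version}' "$PKG" 2>/dev/null)
  drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
  if versions_match "$fm" "$drv"; then
    echo "OK: $PKG $fm matches driver $drv"
  else
    echo "MISMATCH: $PKG '$fm' vs driver '$drv'"
  fi
  if apt-mark showhold | is_held "$PKG"; then
    echo "OK: $PKG is on hold"
  else
    echo "WARNING: $PKG is not on hold"
  fi
fi
```

A held package stays at its current version until the hold is released. Before you intentionally upgrade the Tesla driver and nvidia-fabricmanager together, release the hold by running sudo apt-mark unhold with the package name, and set the hold again after the upgrade.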