When your CUDA libraries require a higher NVIDIA driver version than what is currently installed, upgrade the driver manually by uninstalling the old version and installing the new one. This topic describes how to do that on a GPU-accelerated node in an ACK cluster.
## How it works
NVIDIA driver upgrades require the GPU kernel modules to be fully unloaded before a new driver can be loaded. Because running processes — including DaemonSet pods — hold references to those modules, you must release all GPU consumers before the uninstall can proceed.
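You can see these module references directly with `lsmod`. The snippet below parses a hypothetical `lsmod` excerpt (the module names are typical for an NVIDIA driver install, but the sizes and use counts are invented) to show which modules still have holders; a kernel module can only be unloaded once its use count drops to zero.

```shell
# Hypothetical `lsmod | grep nvidia` output from a GPU node. The third
# column is the use count; the fourth lists dependent modules.
lsmod_output='nvidia_uvm 1089536 4
nvidia_drm 61440 0
nvidia_modeset 1241088 1 nvidia_drm
nvidia 40841216 120 nvidia_uvm,nvidia_modeset'

# A module can only be unloaded once its use count (column 3) reaches 0.
echo "$lsmod_output" | awk '$3 > 0 {print $1, "is held by", $3, "reference(s)"}'
```

Until every use count is zero, `nvidia-uninstall` cannot remove the modules, which is why the steps below drain the node and stop GPU consumers first.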
The upgrade follows this sequence:
1. Mark the node as unschedulable and evict workload pods.
2. Stop kubelet and containerd to release DaemonSet pods that `kubectl drain` cannot evict.
3. Terminate any remaining processes that hold GPU device references.
4. Uninstall the current driver (and NVIDIA Fabric Manager, if present).
5. Install the new driver and apply the required post-install settings.
6. Restore the node to the cluster.
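The sequence above can be sketched as a single script. This is a minimal sketch, not a turnkey tool: `<NODE_NAME>` and the driver file name are placeholders, it omits the manual cleanup of GPU-holding processes, and it defaults to a dry run that only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the upgrade sequence. Set DRY_RUN=0 to execute for real.
# NODE and DRIVER_RUN are placeholders, not values from a real cluster.
set -euo pipefail

NODE="${NODE:-<NODE_NAME>}"
DRIVER_RUN="${DRIVER_RUN:-NVIDIA-Linux-x86_64-510.108.03.run}"

# In dry-run mode, print the command instead of running it.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run kubectl cordon "$NODE"                                             # 1. make the node unschedulable
run kubectl drain "$NODE" --grace-period=120 --ignore-daemonsets=true  # 1. evict workload pods
run sudo systemctl stop kubelet containerd                             # 2. release DaemonSet pods
# 3. (manual) terminate remaining GPU-holding processes; see Step 2
run sudo nvidia-uninstall                                              # 4. remove the old driver
run sudo bash "$DRIVER_RUN" -a -s -q                                   # 5. install the new driver
run sudo systemctl restart containerd kubelet                          # 6. restore the node
run kubectl uncordon "$NODE"                                           # 6. make it schedulable again
```

The steps that follow walk through each of these commands in detail, including the checks between them.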
## Prerequisites
Before you begin, ensure that you have:
- `kubectl` configured with access to the cluster
- SSH access to the target GPU-accelerated node
- The new NVIDIA driver `.run` file, downloaded from the NVIDIA official site to the node
- (Conditional) The matching NVIDIA Fabric Manager `.rpm` package, if required (see Step 3, substep 5)
## Step 1: Cordon and drain the node
1. Mark the GPU-accelerated node as unschedulable:

   ```shell
   kubectl cordon <NODE_NAME>
   ```

   Replace `<NODE_NAME>` with the name of the target node. The expected output is:

   ```
   node/<NODE_NAME> cordoned
   ```

2. Evict workload pods from the node:

   ```shell
   kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true
   ```

   `--grace-period=120` gives pods up to 120 seconds to shut down cleanly. `--ignore-daemonsets=true` skips DaemonSet-managed pods; you will handle those in the next step. The expected output is:

   ```
   There are pending nodes to be drained: <NODE_NAME>
   ```
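Before moving on, you can confirm the cordon took effect by checking the node's STATUS column. The snippet below runs that check against a hypothetical `kubectl get nodes <NODE_NAME>` output (the node name, age, and version are invented); on a live cluster you would pipe the real command output instead.

```shell
# Hypothetical output of `kubectl get nodes <NODE_NAME>` after cordoning.
node_status='NAME             STATUS                     ROLES    AGE   VERSION
cn-gpu-node-01   Ready,SchedulingDisabled   <none>   92d   v1.26.3'

# A cordoned node reports SchedulingDisabled in its STATUS column.
if echo "$node_status" | grep -q 'SchedulingDisabled'; then
  echo "node is cordoned"
fi
```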
## Step 2: Uninstall the current NVIDIA driver
1. Log in to the node and stop kubelet and containerd. This terminates DaemonSet pods that hold GPU references and that `kubectl drain` cannot evict:

   ```shell
   sudo systemctl stop kubelet containerd
   ```

2. Check whether any processes are still using GPU devices:

   ```shell
   sudo fuser -v /dev/nvidia*
   ```

   If the command produces no output, no processes are holding GPU references and you can proceed. If output is returned, terminate each listed process. For example:

   ```
                        USER        PID ACCESS COMMAND
   /dev/nvidia0:        root       3781 F....  dcgm-exporter
   /dev/nvidiactl:      root       3781 F...m  dcgm-exporter
   ```

   Terminate the process:

   ```shell
   sudo kill 3781
   ```

   Run `sudo fuser -v /dev/nvidia*` again and repeat until no processes remain.

3. Uninstall the current NVIDIA driver:

   ```shell
   sudo nvidia-uninstall
   ```

4. (Optional) Uninstall NVIDIA Fabric Manager if it is installed. Check whether it is installed:

   ```shell
   sudo rpm -qa | grep ^nvidia-fabric-manager
   ```

   If the command produces no output, NVIDIA Fabric Manager is not installed and you can skip this substep. Otherwise, uninstall it:

   ```shell
   sudo yum remove nvidia-fabric-manager
   ```
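When several processes hold GPU devices, the kill-and-recheck loop in substep 2 can be tedious. The snippet below extracts the unique PIDs from captured `fuser` output so they can be killed in one pass. It parses a hypothetical capture here (the second PID and process name are invented); on the node you would capture the real output of `sudo fuser -v /dev/nvidia* 2>&1` instead.

```shell
# Hypothetical `sudo fuser -v /dev/nvidia*` output captured as text
# (same layout as the example in Step 2; the 4125 entry is invented).
fuser_output='                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root       3781 F....  dcgm-exporter
/dev/nvidiactl:      root       3781 F...m  dcgm-exporter
/dev/nvidiactl:      root       4125 F...m  nvidia-persiste'

# Collect every purely numeric field (the PID column, wherever it lands),
# then de-duplicate numerically, one PID per line.
pids=$(echo "$fuser_output" \
  | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) print $i}' \
  | sort -un)
echo "$pids"
```

On the node you would then run `sudo kill $pids` and re-check with `sudo fuser -v /dev/nvidia*` until nothing is listed.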
## Step 3: Install the new NVIDIA driver on the node
1. Install the new driver. The following example installs `NVIDIA-Linux-x86_64-510.108.03.run`:

   ```shell
   sudo bash NVIDIA-Linux-x86_64-510.108.03.run -a -s -q
   ```

2. Verify the installation:

   ```shell
   sudo nvidia-smi
   ```

   The output should show the new driver version. For example, after installing `510.108.03`:

   ```
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                   0 |
   | N/A   35C    P0    40W / 300W |      0MiB / 32768MiB |      0%     Default |
   |                               |                      |                 N/A |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

3. Apply the required post-install settings:

   ```shell
   sudo nvidia-smi -pm 1 || true                      # Enable persistence mode
   sudo nvidia-smi -acp 0 || true                     # Set permission to UNRESTRICTED
   sudo nvidia-smi --auto-boost-default=0 || true     # Disable auto boost mode
   sudo nvidia-smi --auto-boost-permission=0 || true  # Allow non-admin users to control auto boost mode
   sudo nvidia-modprobe -u -c=0 -m || true            # Load the NVIDIA kernel module
   ```

4. (Optional) To apply these settings and load the NVIDIA driver automatically on boot, make sure `/etc/rc.d/rc.local` contains:

   ```shell
   sudo nvidia-smi -pm 1 || true
   sudo nvidia-smi -acp 0 || true
   sudo nvidia-smi --auto-boost-default=0 || true
   sudo nvidia-smi --auto-boost-permission=0 || true
   sudo nvidia-modprobe -u -c=0 -m || true
   ```

5. Check whether NVIDIA Fabric Manager is required. Nodes with NVSwitch-based multi-GPU interconnects need NVIDIA Fabric Manager:

   ```shell
   sudo lspci | grep -i 'Bridge:.*NVIDIA'
   ```

   If the command produces no output, NVIDIA Fabric Manager is not required and you can skip this substep. If output is returned, download the matching NVIDIA Fabric Manager package from the NVIDIA YUM repository. The package version must match the new NVIDIA driver version. Install and start NVIDIA Fabric Manager:

   ```shell
   # Install NVIDIA Fabric Manager
   sudo yum localinstall nvidia-fabric-manager-510.108.03-1.x86_64.rpm

   # Enable NVIDIA Fabric Manager on boot
   sudo systemctl enable nvidia-fabricmanager.service

   # Start NVIDIA Fabric Manager
   sudo systemctl start nvidia-fabricmanager.service
   ```

6. Restart kubelet and containerd:

   ```shell
   sudo systemctl restart containerd kubelet
   ```
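Before returning the node to the cluster, it is worth asserting that the running driver is exactly the version you installed rather than eyeballing the table. The snippet below parses the banner line of `nvidia-smi` output; here it uses a hypothetical captured line matching the example above, whereas on the node you would capture the real banner (for example with `nvidia-smi | grep 'Driver Version'`).

```shell
expected='510.108.03'

# Hypothetical banner line from `nvidia-smi` (see the example output above).
banner='| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |'

# Extract the value after "Driver Version:" and compare it.
driver=$(echo "$banner" | grep -oE 'Driver Version: [0-9.]+' | awk '{print $3}')
if [ "$driver" = "$expected" ]; then
  echo "driver $driver matches expected version"
else
  echo "driver mismatch: got $driver, expected $expected" >&2
fi
```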
## Step 4: Return the node to the cluster

Mark the node as schedulable:

```shell
kubectl uncordon <NODE_NAME>
```

Replace `<NODE_NAME>` with the name of the node. Kubernetes resumes scheduling pods on the node.