When your CUDA libraries require a higher NVIDIA driver version than what is currently installed, upgrade the driver manually by uninstalling the old version and installing the new one. This topic describes how to do that on a GPU-accelerated node in an ACK cluster.
## How it works
NVIDIA driver upgrades require the GPU kernel modules to be fully unloaded before a new driver can be loaded. Because running processes — including DaemonSet pods — hold references to those modules, you must release all GPU consumers before the uninstall can proceed.
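You can see these module references directly with `lsmod`. The snippet below parses a hypothetical `lsmod` excerpt (the module names are typical for an NVIDIA driver install, but the sizes and use counts are invented) to show which modules still have holders; a kernel module can only be unloaded once its use count drops to zero.

```shell
# Hypothetical `lsmod | grep nvidia` output from a GPU node. The third
# column is the use count; the fourth lists dependent modules.
lsmod_output='nvidia_uvm 1089536 4
nvidia_drm 61440 0
nvidia_modeset 1241088 1 nvidia_drm
nvidia 40841216 120 nvidia_uvm,nvidia_modeset'

# A module can only be unloaded once its use count (column 3) reaches 0.
echo "$lsmod_output" | awk '$3 > 0 {print $1, "is held by", $3, "reference(s)"}'
```

Until every use count is zero, `nvidia-uninstall` cannot remove the modules, which is why the steps below drain the node and stop GPU consumers first.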
The upgrade follows this sequence:
1. Mark the node as unschedulable and evict workload pods.
2. Stop kubelet and containerd to release DaemonSet pods that `kubectl drain` cannot evict.
3. Terminate any remaining processes that hold GPU device references.
4. Uninstall the current driver (and NVIDIA Fabric Manager, if present).
5. Install the new driver and apply the required post-install settings.
6. Restore the node to the cluster.
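The sequence above can be sketched as a single script. This is a minimal sketch, not a turnkey tool: `<NODE_NAME>` and the driver file name are placeholders, it omits the manual cleanup of GPU-holding processes, and it defaults to a dry run that only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the upgrade sequence. Set DRY_RUN=0 to execute for real.
# NODE and DRIVER_RUN are placeholders, not values from a real cluster.
set -euo pipefail

NODE="${NODE:-<NODE_NAME>}"
DRIVER_RUN="${DRIVER_RUN:-NVIDIA-Linux-x86_64-510.108.03.run}"

# In dry-run mode, print the command instead of running it.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run kubectl cordon "$NODE"                                             # 1. make the node unschedulable
run kubectl drain "$NODE" --grace-period=120 --ignore-daemonsets=true  # 1. evict workload pods
run sudo systemctl stop kubelet containerd                             # 2. release DaemonSet pods
# 3. (manual) terminate remaining GPU-holding processes; see Step 2
run sudo nvidia-uninstall                                              # 4. remove the old driver
run sudo bash "$DRIVER_RUN" -a -s -q                                   # 5. install the new driver
run sudo systemctl restart containerd kubelet                          # 6. restore the node
run kubectl uncordon "$NODE"                                           # 6. make it schedulable again
```

The steps that follow walk through each of these commands in detail, including the checks between them.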
## Prerequisites
Before you begin, ensure that you have:
- `kubectl` configured with access to the cluster
- SSH access to the target GPU-accelerated node
- The new NVIDIA driver `.run` file, downloaded from the NVIDIA official site to the node
- (Conditional) The matching NVIDIA Fabric Manager `.rpm` package, if required (see Step 3, substep 5)
## Step 1: Cordon and drain the node
1. Mark the GPU-accelerated node as unschedulable:

   ```shell
   kubectl cordon <NODE_NAME>
   ```

   Replace `<NODE_NAME>` with the name of the target node. The expected output is:

   ```
   node/<NODE_NAME> cordoned
   ```

2. Evict workload pods from the node:

   ```shell
   kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true
   ```

   `--grace-period=120` gives pods up to 120 seconds to shut down cleanly. `--ignore-daemonsets=true` skips DaemonSet-managed pods; you will handle those in the next step. The expected output is:

   ```
   There are pending nodes to be drained: <NODE_NAME>
   ```
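Before moving on, you can confirm the cordon took effect by checking the node's STATUS column. The snippet below runs that check against a hypothetical `kubectl get nodes <NODE_NAME>` output (the node name, age, and version are invented); on a live cluster you would pipe the real command output instead.

```shell
# Hypothetical output of `kubectl get nodes <NODE_NAME>` after cordoning.
node_status='NAME             STATUS                     ROLES    AGE   VERSION
cn-gpu-node-01   Ready,SchedulingDisabled   <none>   92d   v1.26.3'

# A cordoned node reports SchedulingDisabled in its STATUS column.
if echo "$node_status" | grep -q 'SchedulingDisabled'; then
  echo "node is cordoned"
fi
```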
## Step 2: Uninstall the current NVIDIA driver
1. Log in to the node and stop kubelet and containerd. This terminates DaemonSet pods that hold GPU references and that `kubectl drain` cannot evict:

   ```shell
   sudo systemctl stop kubelet containerd
   ```

2. Check whether any processes are still using GPU devices:

   ```shell
   sudo fuser -v /dev/nvidia*
   ```

   If the command produces no output, no processes are holding GPU references and you can proceed. If output is returned, terminate each listed process. For example:

   ```
                        USER        PID ACCESS COMMAND
   /dev/nvidia0:        root       3781 F....  dcgm-exporter
   /dev/nvidiactl:      root       3781 F...m  dcgm-exporter
   ```

   Terminate the process:

   ```shell
   sudo kill 3781
   ```

   Run `sudo fuser -v /dev/nvidia*` again and repeat until no processes remain.

3. Uninstall the current NVIDIA driver:

   ```shell
   sudo nvidia-uninstall
   ```

4. (Optional) Uninstall NVIDIA Fabric Manager if it is installed. Check whether it is installed:

   ```shell
   sudo rpm -qa | grep ^nvidia-fabric-manager
   ```

   If the command produces no output, NVIDIA Fabric Manager is not installed and you can skip this substep. Otherwise, uninstall it:

   ```shell
   sudo yum remove nvidia-fabric-manager
   ```
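When several processes hold GPU devices, the kill-and-recheck loop in substep 2 can be tedious. The snippet below extracts the unique PIDs from captured `fuser` output so they can be killed in one pass. It parses a hypothetical capture here (the second PID and process name are invented); on the node you would capture the real output of `sudo fuser -v /dev/nvidia* 2>&1` instead.

```shell
# Hypothetical `sudo fuser -v /dev/nvidia*` output captured as text
# (same layout as the example in Step 2; the 4125 entry is invented).
fuser_output='                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root       3781 F....  dcgm-exporter
/dev/nvidiactl:      root       3781 F...m  dcgm-exporter
/dev/nvidiactl:      root       4125 F...m  nvidia-persiste'

# Collect every purely numeric field (the PID column, wherever it lands),
# then de-duplicate numerically, one PID per line.
pids=$(echo "$fuser_output" \
  | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) print $i}' \
  | sort -un)
echo "$pids"
```

On the node you would then run `sudo kill $pids` and re-check with `sudo fuser -v /dev/nvidia*` until nothing is listed.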
## Step 3: Install the new NVIDIA driver on the node
1. Install the new driver. The following example installs `NVIDIA-Linux-x86_64-510.108.03.run`:

   ```shell
   sudo bash NVIDIA-Linux-x86_64-510.108.03.run -a -s -q
   ```

2. Verify the installation:

   ```shell
   sudo nvidia-smi
   ```

   The output should show the new driver version. For example, after installing `510.108.03`:

   ```
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                   0 |
   | N/A   35C    P0    40W / 300W |      0MiB / 32768MiB |      0%     Default |
   |                               |                      |                 N/A |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

3. Apply the required post-install settings:

   ```shell
   sudo nvidia-smi -pm 1 || true                      # Enable persistence mode
   sudo nvidia-smi -acp 0 || true                     # Set permission to UNRESTRICTED
   sudo nvidia-smi --auto-boost-default=0 || true     # Disable auto boost mode
   sudo nvidia-smi --auto-boost-permission=0 || true  # Allow non-admin users to control auto boost mode
   sudo nvidia-modprobe -u -c=0 -m || true            # Load the NVIDIA kernel module
   ```

4. (Optional) To apply these settings and load the NVIDIA driver automatically on boot, make sure `/etc/rc.d/rc.local` contains:

   ```shell
   sudo nvidia-smi -pm 1 || true
   sudo nvidia-smi -acp 0 || true
   sudo nvidia-smi --auto-boost-default=0 || true
   sudo nvidia-smi --auto-boost-permission=0 || true
   sudo nvidia-modprobe -u -c=0 -m || true
   ```

5. Check whether NVIDIA Fabric Manager is required. Nodes with NVSwitch-based multi-GPU interconnects need NVIDIA Fabric Manager:

   ```shell
   sudo lspci | grep -i 'Bridge:.*NVIDIA'
   ```

   If the command produces no output, NVIDIA Fabric Manager is not required and you can skip this substep. If output is returned, download the matching NVIDIA Fabric Manager package from the NVIDIA YUM repository. The package version must match the new NVIDIA driver version. Install and start NVIDIA Fabric Manager:

   ```shell
   # Install NVIDIA Fabric Manager
   sudo yum localinstall nvidia-fabric-manager-510.108.03-1.x86_64.rpm

   # Enable NVIDIA Fabric Manager on boot
   sudo systemctl enable nvidia-fabricmanager.service

   # Start NVIDIA Fabric Manager
   sudo systemctl start nvidia-fabricmanager.service
   ```

6. Restart kubelet and containerd:

   ```shell
   sudo systemctl restart containerd kubelet
   ```
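Before returning the node to the cluster, it is worth asserting that the running driver is exactly the version you installed rather than eyeballing the table. The snippet below parses the banner line of `nvidia-smi` output; here it uses a hypothetical captured line matching the example above, whereas on the node you would capture the real banner (for example with `nvidia-smi | grep 'Driver Version'`).

```shell
expected='510.108.03'

# Hypothetical banner line from `nvidia-smi` (see the example output above).
banner='| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |'

# Extract the value after "Driver Version:" and compare it.
driver=$(echo "$banner" | grep -oE 'Driver Version: [0-9.]+' | awk '{print $3}')
if [ "$driver" = "$expected" ]; then
  echo "driver $driver matches expected version"
else
  echo "driver mismatch: got $driver, expected $expected" >&2
fi
```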
## Step 4: Return the node to the cluster

Mark the node as schedulable:

```shell
kubectl uncordon <NODE_NAME>
```

Replace `<NODE_NAME>` with the name of the node. Kubernetes resumes scheduling pods on the node.