
Container Service for Kubernetes:Manually update the NVIDIA driver of a node

Last Updated: Mar 26, 2026

When your CUDA libraries require a higher NVIDIA driver version than what is currently installed, upgrade the driver manually by uninstalling the old version and installing the new one. This topic describes how to do that on a GPU-accelerated node in an ACK cluster.

How it works

NVIDIA driver upgrades require the GPU kernel modules to be fully unloaded before a new driver can be loaded. Because running processes — including DaemonSet pods — hold references to those modules, you must release all GPU consumers before the uninstall can proceed.

The upgrade follows this sequence:

  1. Mark the node as unschedulable and evict workload pods.

  2. Stop kubelet and containerd to release DaemonSet pods that kubectl drain cannot evict.

  3. Terminate any remaining processes that hold GPU device references.

  4. Uninstall the current driver (and NVIDIA Fabric Manager if present).

  5. Install the new driver and apply the required post-install settings.

  6. Restore the node to the cluster.
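The sequence can be sketched end to end as a single shell function. This is an illustrative outline, not a substitute for the detailed steps below: `upgrade_gpu_driver` and its `DRY_RUN` switch are hypothetical names, and step 3 (terminating GPU holders) and the post-install settings are deliberately elided here.

```shell
#!/usr/bin/env bash
# Hedged outline of the whole upgrade as one function. The node name and
# the driver .run file are parameters; set DRY_RUN=1 to print the commands
# instead of executing them.

run() {
  # Print the command in dry-run mode; execute it otherwise.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

upgrade_gpu_driver() {
  local node="$1" driver_run="$2"
  run kubectl cordon "$node"                                   # 1. mark unschedulable
  run kubectl drain "$node" --grace-period=120 --ignore-daemonsets=true
  run sudo systemctl stop kubelet containerd                   # 2. release DaemonSet pods
  # 3. (elided) terminate any remaining GPU holders, see Step 2
  run sudo nvidia-uninstall                                    # 4. remove the old driver
  run sudo bash "$driver_run" -a -s -q                         # 5. install the new driver
  run sudo systemctl restart containerd kubelet                # 6. restore the node
  run kubectl uncordon "$node"
}
```

For example, `DRY_RUN=1; upgrade_gpu_driver gpu-node-1 NVIDIA-Linux-x86_64-510.108.03.run` previews every command without touching the node.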

Prerequisites

Before you begin, ensure that you have:

  • kubectl configured with access to the cluster

  • SSH access to the target GPU-accelerated node

  • The new NVIDIA driver .run file downloaded from the NVIDIA official site to the node

  • (Conditional) The matching NVIDIA Fabric Manager .rpm package, if required — see Step 3, substep 5

Step 1: Cordon and drain the node

  1. Mark the GPU-accelerated node as unschedulable:

    kubectl cordon <NODE_NAME>

    Replace <NODE_NAME> with the name of the target node. The expected output is:

    node/<NODE_NAME> cordoned
  2. Evict workload pods from the node:

    kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true

    --grace-period=120 gives pods up to 120 seconds to shut down cleanly. --ignore-daemonsets=true skips DaemonSet-managed pods; you will handle those in the next step. The command prints an evicting pod line for each workload pod, and when the drain completes the output ends with:

    node/<NODE_NAME> drained
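Once the drain finishes, you can confirm the node's state before touching the driver. This is a sketch, assuming kubectl access; only DaemonSet-managed pods should remain on the node:

```shell
# The node should report Ready,SchedulingDisabled, and the pod list
# should contain only DaemonSet pods (for example, monitoring agents).
kubectl get node <NODE_NAME>
kubectl get pods --all-namespaces --field-selector spec.nodeName=<NODE_NAME>
```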

Step 2: Uninstall the current NVIDIA driver

  1. Log in to the node and stop kubelet and containerd. This terminates DaemonSet pods that hold GPU references and that kubectl drain cannot evict.

    sudo systemctl stop kubelet containerd
  2. Check whether any processes are still using GPU devices:

    sudo fuser -v /dev/nvidia*

    If the command produces no output, no processes are holding GPU references and you can proceed. If output is returned, terminate each listed process. For example:

                         USER        PID ACCESS COMMAND
    /dev/nvidia0:        root       3781 F.... dcgm-exporter
    /dev/nvidiactl:      root       3781 F...m dcgm-exporter

    Terminate the process:

    sudo kill 3781

    Run sudo fuser -v /dev/nvidia* again and repeat until no processes remain.
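If several processes hold GPU references, a small loop can clear them. This is a sketch; note that fuser -k sends SIGKILL to every process still holding a device file, so run it only after the node has been drained:

```shell
# Repeat until fuser finds no process holding a GPU device file.
# fuser exits non-zero when nothing matches, which ends the loop.
while sudo fuser -v /dev/nvidia* 2>/dev/null; do
  sudo fuser -k /dev/nvidia*   # SIGKILL every remaining holder
  sleep 2
done
```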

  3. Uninstall the current NVIDIA driver:

    sudo nvidia-uninstall
  4. (Conditional) Check whether NVIDIA Fabric Manager is installed:

    sudo rpm -qa | grep ^nvidia-fabric-manager

    If the command produces no output, NVIDIA Fabric Manager is not installed and you can skip this substep. Otherwise, uninstall it:

    sudo yum remove nvidia-fabric-manager
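After the uninstall, you can verify that the NVIDIA kernel modules are fully unloaded before installing the new driver (a sketch; grep exits non-zero and prints nothing when no module matches):

```shell
lsmod | grep nvidia   # no output means the NVIDIA kernel modules are unloaded
```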

Step 3: Install the new NVIDIA driver on the node

  1. Install the new driver. The -a, -s, and -q options accept the license agreement and run the installer in silent, non-interactive mode. The following example installs NVIDIA-Linux-x86_64-510.108.03.run; replace the file name with the version you downloaded:

    sudo bash NVIDIA-Linux-x86_64-510.108.03.run -a -s -q
  2. Verify the installation:

    sudo nvidia-smi

    The output should show the new driver version. For example, after installing 510.108.03:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
    | N/A   35C    P0    40W / 300W |      0MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
  3. Apply the required post-install settings:

    sudo nvidia-smi -pm 1 || true                            # Enable Persistence mode
    sudo nvidia-smi -acp 0 || true                           # Set permission to UNRESTRICTED
    sudo nvidia-smi --auto-boost-default=0 || true           # Disable auto boost mode
    sudo nvidia-smi --auto-boost-permission=0 || true        # Allow non-admin users to control auto boost mode
    sudo nvidia-modprobe -u -c=0 -m || true                  # Load the NVIDIA kernel module
  4. (Optional) To load the NVIDIA kernel module and reapply these settings automatically on boot, make sure that /etc/rc.d/rc.local contains the following commands:

    sudo nvidia-smi -pm 1 || true
    sudo nvidia-smi -acp 0 || true
    sudo nvidia-smi --auto-boost-default=0 || true
    sudo nvidia-smi --auto-boost-permission=0 || true
    sudo nvidia-modprobe -u -c=0 -m || true
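On many images, /etc/rc.d/rc.local runs at boot only if the file is executable; this varies by distribution, so check your node and add the executable bit if needed:

```shell
sudo chmod +x /etc/rc.d/rc.local   # rc.local is often non-executable by default
```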
  5. Check whether NVIDIA Fabric Manager is required. Nodes with NVSwitch-based multi-GPU interconnects need NVIDIA Fabric Manager.

    sudo lspci | grep -i 'Bridge:.*NVIDIA'

    If the command produces no output, NVIDIA Fabric Manager is not required. If output is returned, download the matching NVIDIA Fabric Manager package from the NVIDIA YUM repository. The package version must match the new NVIDIA driver version. Install and start NVIDIA Fabric Manager:

    # Install NVIDIA Fabric Manager
    sudo yum localinstall nvidia-fabric-manager-510.108.03-1.x86_64.rpm
    
    # Enable NVIDIA Fabric Manager on boot
    sudo systemctl enable nvidia-fabricmanager.service
    
    # Start NVIDIA Fabric Manager
    sudo systemctl start nvidia-fabricmanager.service
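After starting the service, a quick status check (a sketch) confirms it is running; NVIDIA Fabric Manager typically fails to start when its version does not match the installed driver, so a failure here usually points to a version mismatch:

```shell
sudo systemctl is-active nvidia-fabricmanager.service   # expected output: active
```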
  6. Restart kubelet and containerd:

    sudo systemctl restart containerd kubelet

Step 4: Return the node to the cluster

Mark the node as schedulable:

kubectl uncordon <NODE_NAME>

Replace <NODE_NAME> with the name of the node. Kubernetes resumes scheduling pods on the node.
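To confirm the node is back and advertising GPUs, you can check its status and allocatable resources. This is a sketch; the nvidia.com/gpu resource name assumes the NVIDIA device plugin is running in the cluster:

```shell
kubectl get node <NODE_NAME>                                    # SchedulingDisabled should be gone
kubectl describe node <NODE_NAME> | grep 'nvidia.com/gpu'       # allocatable GPU count
```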