When your CUDA library requires a newer NVIDIA driver, you need to upgrade the driver on the affected GPU nodes. Container Service for Kubernetes (ACK) lets you manage different NVIDIA driver versions across a cluster using node pools — assign nodes to a node pool configured with the target driver version, and ACK installs the specified driver when those nodes join the pool.
This topic describes how to move existing nodes to a new node pool to upgrade their NVIDIA driver.
How it works
Upgrading the NVIDIA driver on an existing node requires removing it from the current node pool and adding it to a new node pool configured with the target driver version. You cannot upgrade the driver for a specific node within its existing node pool, because the pool may contain other nodes you do not want to change.
ACK installs the specified driver only when a node joins a node pool. The installation does not affect nodes already in the pool. To apply a new driver to nodes already in a pool, you must remove them from the pool and then add them to a node pool configured with the target driver version.
During this process, the node's operating system and NVIDIA driver are reinstalled. Make sure the node has no running workloads and no important data before proceeding. Upgrade a single node first to verify the process before upgrading in batches.
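Before removing a node, you can confirm it is empty by listing the pods scheduled on it. This is a quick check using standard kubectl; the node name is illustrative:

```
# List all pods scheduled on a specific node; replace the node name with your own.
# DaemonSet pods (e.g., nvidia-device-plugin) are expected and are recreated automatically.
kubectl get pods --all-namespaces --field-selector spec.nodeName=cn-beijing.192.168.1.128
```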
Use cases
- Selective driver upgrade: Upgrade the NVIDIA driver on specific nodes without affecting others in the cluster.
- Multi-version cluster: Run nodes with different driver versions in the same cluster. For example, upgrade some nodes to version 550.144.03 and others to 535.161.07 by assigning them to separate node pools.
- Workload targeting: After the upgrade, schedule workloads to the upgraded nodes by setting the workload's `nodeSelector` to the label of the new node pool, as in the sketch after this list.
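The following is a minimal sketch of workload targeting. It assumes the new node pool applies the label `ack.aliyun.com/nvidia-driver-version=550.144.03` to its nodes (the label used in Method 1 below); the CUDA image tag is illustrative:

```
# Minimal sketch: pin a GPU workload to the upgraded node pool via nodeSelector.
# Assumes the node pool labels its nodes with ack.aliyun.com/nvidia-driver-version=550.144.03.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  nodeSelector:
    ack.aliyun.com/nvidia-driver-version: "550.144.03"
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04   # illustrative image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```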
Prerequisites
Before you begin, ensure that you have:
- An ACK cluster with GPU nodes
- Nodes with no running workloads and no important data to be preserved
- Access to the ACK console
Limitations
- ACK does not guarantee compatibility between the NVIDIA driver version and the CUDA Toolkit version. Verify compatibility before proceeding.
- For driver version requirements for specific NVIDIA GPU models, see the NVIDIA official website.
- For custom operating system images with pre-installed GPU components (such as the GPU driver or NVIDIA Container Runtime), ACK cannot guarantee that the pre-installed driver is compatible with other ACK GPU components, such as monitoring components.
- If you specify a driver version that is not in ACK's list of supported NVIDIA driver versions, ACK installs the default driver version instead. If the target version is incompatible with the node's operating system, node addition may fail; in that case, select the latest supported version.
- If you use an NVIDIA driver uploaded to an Object Storage Service (OSS) bucket, the driver may be incompatible with the operating system, Elastic Compute Service (ECS) instance type, or container runtime, and nodes that install it may fail to be added. ACK does not guarantee that all such nodes can be added to the cluster.
Instance type compatibility
Certain ECS instance types have driver version restrictions:
| Instance type | Restriction | Recommended versions |
|---|---|---|
| gn7 and ebmgn7 | Versions 510.xxx and 515.xxx have compatibility issues | Versions earlier than 510 (e.g., 470.xxx.xxxx) with GPU System Processor (GSP) disabled, or 525.125.06 and later |
| ebmgn7 and ebmgn7e | Minimum version required | 460.32.03 or later |
Step 1: Determine the NVIDIA driver version
Check the CUDA Toolkit Release Notes to find NVIDIA driver versions compatible with your CUDA library, then select the target driver version.
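To see what you are upgrading from, you can query the driver version currently installed on a node, for example through its nvidia-device-plugin pod. The pod name below is illustrative; Step 5 shows how to list these pods:

```
# Print only the installed driver version for each GPU on the node.
kubectl exec -n kube-system nvidia-device-plugin-cn-beijing.192.168.1.128 -- \
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
```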
Step 2: Remove nodes from the current node pool
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.
3. Select the nodes to upgrade, then click Batch Remove. In the Remove Node dialog box, select Drain Node and click OK.
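If you prefer to drain from the command line before removing the nodes in the console, the standard kubectl commands work as well (the node name is illustrative):

```
# Cordon the node so no new pods are scheduled on it, then evict its workloads.
kubectl cordon cn-beijing.192.168.1.128
kubectl drain cn-beijing.192.168.1.128 --ignore-daemonsets --delete-emptydir-data
```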
Step 3: Create a node pool with the target driver version
Choose one of the following methods based on whether the target driver version is available in ACK's supported version list.
Method 1: Select a driver from the version list (recommended)
Use this method when the target driver version is available in ACK's supported NVIDIA driver versions list. This example uses driver version 550.144.03.
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Node Pools.
3. In the upper-left corner, click Create Node Pool. For details about configuration parameters, see Create and manage a node pool.
4. Under Advanced Options, go to the Node Labels section and click the add icon to add a label. Set the Key to `ack.aliyun.com/nvidia-driver-version` and the Value to `550.144.03`.
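After the nodes join this pool in Step 4, you can confirm that the label is applied and carries the expected version:

```
# Show the driver-version label as an extra column for every node.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version
```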
Method 2: Use a custom driver version
Use this method when the target driver version is not available in ACK's supported list. This example uses driver version 515.86.01.
Part 1: Prepare the driver files
1. Download the target driver from the NVIDIA official website.
2. Download NVIDIA Fabric Manager from the NVIDIA YUM repository. The Fabric Manager version must match the driver version.

   ```
   wget https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/nvidia-fabric-manager-515.86.01-1.x86_64.rpm
   ```

3. Log on to the OSS console and create an OSS bucket in the same region as the target ACK cluster. Using the same region lets nodes pull the driver over the internal network during installation. For instructions, see Create buckets.
4. Upload the `NVIDIA-Linux-x86_64-515.86.01.run` and `nvidia-fabric-manager-515.86.01-1.x86_64.rpm` files to the root directory of the bucket. You can also upload them from the command line, as in the sketch after this list.
5. In the OSS console, go to Files > Files for the target bucket. Find the uploaded driver file, then click View Details in the Actions column.
6. In the View Details panel, turn off the HTTPS switch.
7. Return to the bucket's Overview page and copy the internal endpoint.
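As an alternative to the console upload in step 4, the ossutil command-line tool can copy the files into the bucket. This sketch assumes ossutil is installed and configured with credentials for your account; the bucket name is illustrative:

```
# Upload the driver and the matching Fabric Manager package to the bucket root.
ossutil cp NVIDIA-Linux-x86_64-515.86.01.run oss://my-nvidia-driver/
ossutil cp nvidia-fabric-manager-515.86.01-1.x86_64.rpm oss://my-nvidia-driver/
```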
Part 2: Create the node pool
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Node Pools.
3. In the upper-left corner, click Create Node Pool. For details about configuration parameters, see Create and manage a node pool.
4. Under Advanced Options, go to the Node Labels section and click the add icon to add the following labels:

   | Key | Value |
   |---|---|
   | `ack.aliyun.com/nvidia-driver-oss-endpoint` | The internal endpoint of the OSS bucket (e.g., `my-nvidia-driver.oss-cn-beijing-internal.aliyuncs.com`) |
   | `ack.aliyun.com/nvidia-driver-runfile` | The driver file name (e.g., `NVIDIA-Linux-x86_64-515.86.01.run`) |
   | `ack.aliyun.com/nvidia-fabricmanager-rpm` | The Fabric Manager file name (e.g., `nvidia-fabric-manager-515.86.01-1.x86_64.rpm`) |
Step 4: Add the nodes to the new node pool
Add the nodes removed in Step 2 to the new node pool. ACK reinstalls the operating system and the specified NVIDIA driver on each node as it joins.
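Reinstallation takes several minutes per node. Once a node object reappears in the cluster, you can block until it reports Ready; the node name and timeout are illustrative:

```
# Wait until the node reports Ready, or give up after 15 minutes.
kubectl wait --for=condition=Ready node/cn-beijing.192.168.1.128 --timeout=15m
```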
Step 5: Verify the upgrade
1. Get the nvidia-device-plugin pods running on the upgraded nodes:

   ```
   kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide
   ```

   Expected output:

   ```
   NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
   nvidia-device-plugin-cn-beijing.192.168.1.127   1/1     Running   0          6d    192.168.1.127   cn-beijing.192.168.1.127   <none>           <none>
   nvidia-device-plugin-cn-beijing.192.168.1.128   1/1     Running   0          17m   192.168.1.128   cn-beijing.192.168.1.128   <none>           <none>
   nvidia-device-plugin-cn-beijing.192.168.8.12    1/1     Running   0          9d    192.168.8.12    cn-beijing.192.168.8.12    <none>           <none>
   nvidia-device-plugin-cn-beijing.192.168.8.13    1/1     Running   0          9d    192.168.8.13    cn-beijing.192.168.8.13    <none>           <none>
   nvidia-device-plugin-cn-beijing.192.168.8.14    1/1     Running   0          9d    192.168.8.14    cn-beijing.192.168.8.14    <none>           <none>
   ```

   The newly added node appears with a shorter `AGE` value. In this example, the pod on the newly upgraded node is `nvidia-device-plugin-cn-beijing.192.168.1.128`.

2. Run nvidia-smi in the pod to check the installed driver version:

   ```
   kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- nvidia-smi
   ```

   Expected output:

   ```
   Mon Mar 24 08:51:55 2025
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI xxx.xxx.xx    Driver Version: xxx.xxx.xx    CUDA Version: N/A    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |===============================+======================+======================|
   |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
   | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   |   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
   | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   |   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
   | N/A   31C    P0    39W / 300W |      0MiB / 16130MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   |   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
   | N/A   27C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                       GPU Memory |
   |  GPU       PID   Type   Process name                             Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

   If the Driver Version in the output matches your target version, the upgrade is successful.