When your CUDA library requires a newer NVIDIA driver, you need to upgrade the driver on the affected GPU nodes. Container Service for Kubernetes (ACK) lets you manage different NVIDIA driver versions across a cluster using node pools — assign nodes to a node pool configured with the target driver version, and ACK installs the specified driver when those nodes join the pool.
This topic describes how to move existing nodes to a new node pool to upgrade their NVIDIA driver.
How it works
Upgrading the NVIDIA driver on an existing node requires removing it from the current node pool and adding it to a new node pool configured with the target driver version. You cannot upgrade the driver for a specific node within its existing node pool, because the pool may contain other nodes you do not want to change.
ACK installs the specified driver only when a node joins a node pool. The installation does not affect nodes already in the pool. To apply a new driver to nodes already in a pool, you must remove them from the pool and then add them to a node pool configured with the target driver version.
During this process, the node's operating system and NVIDIA driver are reinstalled. Make sure the node has no running workloads and no important data before proceeding. Upgrade a single node first to verify the process before upgrading in batches.
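Before removing a node, you can confirm it is empty by listing the pods scheduled on it. This is a quick check using standard kubectl; the node name is illustrative:

```
# List all pods scheduled on a specific node; replace the node name with your own.
# DaemonSet pods (e.g., nvidia-device-plugin) are expected and are recreated automatically.
kubectl get pods --all-namespaces --field-selector spec.nodeName=cn-beijing.192.168.1.128
```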
Use cases
- Selective driver upgrade: Upgrade the NVIDIA driver on specific nodes without affecting others in the cluster.
- Multi-version cluster: Run nodes with different driver versions in the same cluster. For example, upgrade some nodes to version 550.144.03 and others to 535.161.07 by assigning them to separate node pools.
- Workload targeting: After the upgrade, schedule workloads to the upgraded nodes by setting the workload's `nodeSelector` to the label of the new node pool, as in the sketch after this list.
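The following is a minimal sketch of workload targeting. It assumes the new node pool applies the label `ack.aliyun.com/nvidia-driver-version=550.144.03` to its nodes (the label used in Method 1 below); the CUDA image tag is illustrative:

```
# Minimal sketch: pin a GPU workload to the upgraded node pool via nodeSelector.
# Assumes the node pool labels its nodes with ack.aliyun.com/nvidia-driver-version=550.144.03.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  nodeSelector:
    ack.aliyun.com/nvidia-driver-version: "550.144.03"
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04   # illustrative image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```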
Prerequisites
Before you begin, ensure that you have:
- An ACK cluster with GPU nodes
- Nodes with no running workloads and no important data to be preserved
- Access to the ACK console
Limitations
- ACK does not guarantee compatibility between the NVIDIA driver version and the CUDA Toolkit version. Verify compatibility before proceeding.
- For driver version requirements for specific NVIDIA GPU models, see the NVIDIA official website.
- For custom operating system images with pre-installed GPU components (such as the GPU driver or NVIDIA Container Runtime), ACK cannot guarantee that the pre-installed driver is compatible with other ACK GPU components, such as monitoring components.
- If you specify a driver version that is not in ACK's list of supported NVIDIA driver versions, ACK installs the default driver version instead. If the target version is incompatible with the node's operating system, node addition may fail; in that case, select the latest supported version.
- If you use an NVIDIA driver uploaded to an Object Storage Service (OSS) bucket, the driver may be incompatible with the operating system, Elastic Compute Service (ECS) instance type, or container runtime, and nodes that install it may fail to be added. ACK does not guarantee that all such nodes can be added to the cluster.
Instance type compatibility
Certain ECS instance types have driver version restrictions:
| Instance type | Restriction | Recommended versions |
|---|---|---|
| gn7 and ebmgn7 | Versions 510.xxx and 515.xxx have compatibility issues | Versions earlier than 510 (e.g., 470.xxx.xxxx) with GPU System Processor (GSP) disabled, or 525.125.06 and later |
| ebmgn7 and ebmgn7e | Minimum version required | 460.32.03 or later |
Step 1: Determine the NVIDIA driver version
Check the CUDA Toolkit Release Notes to find NVIDIA driver versions compatible with your CUDA library, then select the target driver version.
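To see what you are upgrading from, you can query the driver version currently installed on a node, for example through its nvidia-device-plugin pod. The pod name below is illustrative; Step 5 shows how to list these pods:

```
# Print only the installed driver version for each GPU on the node.
kubectl exec -n kube-system nvidia-device-plugin-cn-beijing.192.168.1.128 -- \
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
```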
Step 2: Remove nodes from the current node pool
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Nodes.
3. Select the nodes to upgrade, then click Batch Remove. In the Remove Node dialog box, select Drain Node and click OK.
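If you prefer to drain from the command line before removing the nodes in the console, the standard kubectl commands work as well (the node name is illustrative):

```
# Cordon the node so no new pods are scheduled on it, then evict its workloads.
kubectl cordon cn-beijing.192.168.1.128
kubectl drain cn-beijing.192.168.1.128 --ignore-daemonsets --delete-emptydir-data
```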
Step 3: Create a node pool with the target driver version
Choose one of the following methods based on whether the target driver version is available in ACK's supported version list.
Method 1: Select a driver from the version list (recommended)
Use this method when the target driver version is available in ACK's supported NVIDIA driver versions list. This example uses driver version 550.144.03.
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Node Pools.
3. In the upper-left corner, click Create Node Pool. For details about configuration parameters, see Create and manage a node pool.
4. Under Advanced Options, go to the Node Labels section and click the add icon to add a label. Set the Key to `ack.aliyun.com/nvidia-driver-version` and the Value to `550.144.03`.
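After the nodes join this pool in Step 4, you can confirm that the label is applied and carries the expected version:

```
# Show the driver-version label as an extra column for every node.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version
```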
Method 2: Use a custom driver version
Use this method when the target driver version is not available in ACK's supported list. This example uses driver version 515.86.01.
Part 1: Prepare the driver files
1. Download the target driver from the NVIDIA official website.
2. Download NVIDIA Fabric Manager from the NVIDIA YUM repository. The Fabric Manager version must match the driver version.

   ```
   wget https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/nvidia-fabric-manager-515.86.01-1.x86_64.rpm
   ```

3. Log on to the OSS console and create an OSS bucket in the same region as the target ACK cluster. Using the same region lets nodes pull the driver over the internal network during installation. For instructions, see Create buckets.
4. Upload the `NVIDIA-Linux-x86_64-515.86.01.run` and `nvidia-fabric-manager-515.86.01-1.x86_64.rpm` files to the root directory of the bucket. You can also upload them from the command line, as in the sketch after this list.
5. In the OSS console, go to Files > Files for the target bucket. Find the uploaded driver file, then click View Details in the Actions column.
6. In the View Details panel, turn off the HTTPS switch.
7. Return to the bucket's Overview page and copy the internal endpoint.
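As an alternative to the console upload in step 4, the ossutil command-line tool can copy the files into the bucket. This sketch assumes ossutil is installed and configured with credentials for your account; the bucket name is illustrative:

```
# Upload the driver and the matching Fabric Manager package to the bucket root.
ossutil cp NVIDIA-Linux-x86_64-515.86.01.run oss://my-nvidia-driver/
ossutil cp nvidia-fabric-manager-515.86.01-1.x86_64.rpm oss://my-nvidia-driver/
```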
Part 2: Create the node pool
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the cluster name. In the left navigation pane, choose Nodes > Node Pools.
3. In the upper-left corner, click Create Node Pool. For details about configuration parameters, see Create and manage a node pool.
4. Under Advanced Options, go to the Node Labels section and click the add icon to add the following labels:

   | Key | Value |
   |---|---|
   | `ack.aliyun.com/nvidia-driver-oss-endpoint` | The internal endpoint of the OSS bucket (e.g., `my-nvidia-driver.oss-cn-beijing-internal.aliyuncs.com`) |
   | `ack.aliyun.com/nvidia-driver-runfile` | The driver file name (e.g., `NVIDIA-Linux-x86_64-515.86.01.run`) |
   | `ack.aliyun.com/nvidia-fabricmanager-rpm` | The Fabric Manager file name (e.g., `nvidia-fabric-manager-515.86.01-1.x86_64.rpm`) |
Step 4: Add the nodes to the new node pool
Add the nodes removed in Step 2 to the new node pool. ACK reinstalls the operating system and the specified NVIDIA driver on each node as it joins.
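Reinstallation takes several minutes per node. Once a node object reappears in the cluster, you can block until it reports Ready; the node name and timeout are illustrative:

```
# Wait until the node reports Ready, or give up after 15 minutes.
kubectl wait --for=condition=Ready node/cn-beijing.192.168.1.128 --timeout=15m
```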
Step 5: Verify the upgrade
1. Get the nvidia-device-plugin pods running on the upgraded nodes:

   ```
   kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide
   ```

   Expected output:

   ```
   NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
   nvidia-device-plugin-cn-beijing.192.168.1.127   1/1     Running   0          6d    192.168.1.127   cn-beijing.192.168.1.127   <none>           <none>
   nvidia-device-plugin-cn-beijing.192.168.1.128   1/1     Running   0          17m   192.168.1.128   cn-beijing.192.168.1.128   <none>           <none>
   nvidia-device-plugin-cn-beijing.192.168.8.12    1/1     Running   0          9d    192.168.8.12    cn-beijing.192.168.8.12    <none>           <none>
   nvidia-device-plugin-cn-beijing.192.168.8.13    1/1     Running   0          9d    192.168.8.13    cn-beijing.192.168.8.13    <none>           <none>
   nvidia-device-plugin-cn-beijing.192.168.8.14    1/1     Running   0          9d    192.168.8.14    cn-beijing.192.168.8.14    <none>           <none>
   ```

   The newly added node appears with a shorter `AGE` value. In this example, the pod on the newly upgraded node is `nvidia-device-plugin-cn-beijing.192.168.1.128`.

2. Run nvidia-smi in the pod to check the installed driver version:

   ```
   kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- nvidia-smi
   ```

   Expected output:

   ```
   Mon Mar 24 08:51:55 2025
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI xxx.xxx.xx    Driver Version: xxx.xxx.xx    CUDA Version: N/A    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |===============================+======================+======================|
   |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
   | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   |   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
   | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   |   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
   | N/A   31C    P0    39W / 300W |      0MiB / 16130MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+
   |   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
   | N/A   27C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                       GPU Memory |
   |  GPU       PID   Type   Process name                             Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

   If the Driver Version in the output matches your target version, the upgrade is successful.