If the Compute Unified Device Architecture (CUDA) toolkit that you use requires a later version of the NVIDIA driver, you must update the NVIDIA driver of the node. To do this, you remove the node from the cluster and then add it back. When the node is added back, the operating system of the node is reinstalled and the NVIDIA driver of the specified version is installed. This topic describes how to use a node pool to update the NVIDIA driver for a node.

Background information

Container Service for Kubernetes (ACK) does not support updating the NVIDIA driver of a node in place; you must remove the node from the cluster and then add it back. In addition, the node pool to which the node belongs may contain nodes with different configurations. Therefore, you cannot update the NVIDIA driver for the entire node pool in a single operation.

Benefits

This solution allows you to manage the NVIDIA driver versions of nodes in groups by using node pools. It covers the following two scenarios:

  • You manage cluster nodes in two groups: A and B. You want to update the NVIDIA driver of the nodes in Group A to version 418.181.07 and the NVIDIA driver of the nodes in Group B to version 450.102.04. In this case, you can add the nodes in Group A to Node Pool A and the nodes in Group B to Node Pool B.
  • Node Pool A consists of all nodes whose NVIDIA drivers are to be updated to version 418.181.07. If you want to schedule a task only to nodes that run NVIDIA driver version 418.181.07, you need only to set the node selector of the task to the label of Node Pool A.
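In the second scenario, the node pool label doubles as a scheduling constraint. The following manifest is a minimal sketch of a pod that is scheduled only to nodes that carry the ack.aliyun.com/nvidia-driver-version=418.181.07 node label described later in this topic. The file name, pod name, and container image are examples, not requirements.

    # cuda-task-example.yaml (example file name)
    # The pod is scheduled only to nodes that carry the NVIDIA driver version
    # label configured on Node Pool A. Replace the image with your own image.
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-task-example
    spec:
      nodeSelector:
        ack.aliyun.com/nvidia-driver-version: "418.181.07"
      containers:
      - name: cuda-container
        image: nvidia/cuda:11.0.3-base-ubuntu20.04   # example image
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: 1   # requires the NVIDIA device plugin

After the node pool is created and the node is added back, you can apply the manifest with kubectl apply -f cuda-task-example.yaml.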

Default NVIDIA driver versions of ACK clusters

The following table lists the default NVIDIA driver version and the supported custom NVIDIA driver versions for each Kubernetes version.

Kubernetes version | Default NVIDIA driver version | Support for custom NVIDIA driver versions | Supported custom NVIDIA driver versions
1.14.8  | 418.181.07 | Yes | 418.181.07, 450.102.04, 460.32.03, 460.73.01, 470.57.02
1.16.6  | 418.87.01  | No  | N/A
1.16.9  | 418.181.07 | Yes | 418.181.07, 450.102.04, 460.32.03, 460.73.01, 470.57.02, 510.47.03
1.18.8  | 418.181.07 | Yes |
1.20.4  | 450.102.04 | Yes |
1.22.10 | 460.91.03  | Yes |
1.24.3  | 460.91.03  | Yes |
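To find which row of the table applies to your cluster, you can check the Kubernetes version that your nodes run. A minimal sketch using kubectl:

    # Print the kubelet version of each node in the cluster.
    kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion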

Step 1: Determine the NVIDIA driver version

Note
  • To update the NVIDIA driver for a node, you must remove the node from the cluster and then add the node back to the cluster. When the node is added to the cluster, the operating system of the node is reinstalled and the NVIDIA driver of the specified version is installed. Before you perform the update, make sure that no task is running on the node and critical data is backed up.
  • To lower the risk of failures, we recommend that you first update the NVIDIA driver for one node. If no error occurs during this process, you can then perform the update on multiple nodes.
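Before you remove the node, you can confirm that no workloads other than DaemonSet pods are still running on it. A minimal sketch; the node name is an example taken from the output later in this topic:

    # List all pods that are still scheduled on the node to be updated.
    # Replace cn-beijing.192.168.1.128 with the name of your node.
    kubectl get pods --all-namespaces --field-selector spec.nodeName=cn-beijing.192.168.1.128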

You must first obtain the NVIDIA driver versions that are compatible with the CUDA toolkit that you use. The following table lists CUDA Toolkit versions and the minimum NVIDIA driver versions that they require. For more information, see cuda-toolkit-release-notes.

CUDA Toolkit | Linux x86_64 Driver Version
CUDA 11.7 Update 1 | ≥ 515.65.01
CUDA 11.7 GA | ≥ 515.43.04
CUDA 11.6 Update 2 | ≥ 510.47.03
CUDA 11.6 Update 1 | ≥ 510.47.03
CUDA 11.6 GA | ≥ 510.39.01
CUDA 11.5 Update 2 | ≥ 495.29.05
CUDA 11.5 Update 1 | ≥ 495.29.05
CUDA 11.5 GA | ≥ 495.29.05
CUDA 11.4 Update 4 | ≥ 470.82.01
CUDA 11.4 Update 3 | ≥ 470.82.01
CUDA 11.4 Update 2 | ≥ 470.57.02
CUDA 11.4 Update 1 | ≥ 470.57.02
CUDA 11.4.0 GA | ≥ 470.42.01
CUDA 11.3.1 Update 1 | ≥ 465.19.01
CUDA 11.3.0 GA | ≥ 465.19.01
CUDA 11.2.2 Update 2 | ≥ 460.32.03
CUDA 11.2.1 Update 1 | ≥ 460.32.03
CUDA 11.2.0 GA | ≥ 460.27.03
CUDA 11.1.1 Update 1 | ≥ 455.32
CUDA 11.1 GA | ≥ 455.23
CUDA 11.0.3 Update 1 | ≥ 450.51.06
CUDA 11.0.2 GA | ≥ 450.51.05
CUDA 11.0.1 RC | ≥ 450.36.06
CUDA 10.2.89 | ≥ 440.33
CUDA 10.1 (10.1.105 general release, and updates) | ≥ 418.39
CUDA 10.0.130 | ≥ 410.48
CUDA 9.2 (9.2.148 Update 1) | ≥ 396.37
CUDA 9.2 (9.2.88) | ≥ 396.26
CUDA 9.1 (9.1.85) | ≥ 390.46
CUDA 9.0 (9.0.76) | ≥ 384.81
CUDA 8.0 (8.0.61 GA2) | ≥ 375.26
CUDA 8.0 (8.0.44) | ≥ 367.48
CUDA 7.5 (7.5.16) | ≥ 352.31
CUDA 7.0 (7.0.28) | ≥ 346.46
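If you are not sure which CUDA toolkit version your workload uses, you can query it from the application image. A minimal sketch, assuming Docker is available, a hypothetical image name, and an image that is based on an official nvidia/cuda base image:

    # Print the CUDA toolkit version shipped in the image. This works only if the
    # image contains nvcc, for example an image built on an nvidia/cuda *-devel tag.
    docker run --rm registry.example.com/my-cuda-app:latest nvcc --version

    # Images based on the official nvidia/cuda base images also record the version
    # in the CUDA_VERSION environment variable.
    docker run --rm registry.example.com/my-cuda-app:latest sh -c 'echo $CUDA_VERSION'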

Step 2: Remove a node from the cluster

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane of the details page, choose Nodes > Nodes.
  5. Select the node for which you want to update the NVIDIA driver and click Batch Remove.
  6. In the Remove Node dialog box, select Drain the Node and click OK.
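Selecting Drain the Node evicts the pods for you. If you prefer to cordon and drain the node manually from Cloud Shell before you remove it, a minimal sketch using an example node name:

    # Stop scheduling new pods to the node, then evict the pods that run on it.
    # Replace cn-beijing.192.168.1.128 with the name of your node.
    kubectl cordon cn-beijing.192.168.1.128
    kubectl drain cn-beijing.192.168.1.128 --ignore-daemonsets --delete-emptydir-data
    # On older kubectl versions, use --delete-local-data instead of --delete-emptydir-data.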

Step 3: Create a node pool and specify an NVIDIA driver version

Method 1: Select an NVIDIA driver version provided by ACK

Note This method is simple. You need only to add the ack.aliyun.com/nvidia-driver-version=<Driver Version> label when you create a node pool and then add the node that you removed to the node pool.

The following example shows how to set the NVIDIA driver version to 418.181.07:

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
  5. In the upper-right corner of the Node Pools page, click Create Node Pool.
  6. In the Create Node Pool dialog box, configure the parameters. For more information about the parameters, see Create an ACK dedicated cluster. The following table describes some of the parameters.
    Parameter Description
    vSwitch and Instance Type These parameters take effect only when new nodes are added to the node pool. In this example, an existing node is added to the node pool. Therefore, you can set these parameters to any valid values.
    Operating System The operating system that the node runs after it is added to the cluster.
    Quantity Set the value to 0. Otherwise, ACK creates new Elastic Compute Service (ECS) instances.
    1. Click Show Advanced Options.
    2. In the Node Label section, click the + icon, set Key to ack.aliyun.com/nvidia-driver-version, and then set Value to 418.181.07.

      For more information about the NVIDIA driver versions provided by ACK, see Default NVIDIA driver versions of ACK clusters.

      Note The ECS instance types ecs.ebmgn7 and ecs.ebmgn7e support only NVIDIA driver versions later than 460.32.03.
    3. After you set the parameters, click Confirm Order.

Method 2: Use a custom NVIDIA driver version

Custom driver version

The following example shows how to specify a custom driver version by uploading the NVIDIA-Linux-x86_64-460.32.03.run package. You can download NVIDIA driver packages from the official NVIDIA website.

Note The NVIDIA-Linux-x86_64-460.32.03.run package must be stored in the root directory of an Object Storage Service (OSS) bucket.
  1. Create an OSS bucket in the OSS console. For more information, see Create buckets.
  2. Upload the NVIDIA-Linux-x86_64-460.32.03.run package to the OSS bucket. For more information, see Upload objects. You can also upload the package from the command line, as shown in the sketch after this list.
    Note If the ECS instance type is ecs.ebmgn7 or ecs.ebmgn7e, you must also upload the NVIDIA Fabric Manager package to the OSS bucket. The version of the uploaded NVIDIA Fabric Manager package must match the version of the uploaded NVIDIA driver package. The following limits apply when you upload the NVIDIA Fabric Manager package:
    • If the Kubernetes version of the cluster is 1.18.8-aliyun.1, you must upload the NVIDIA Fabric Manager package in RPM format. For more information, see RPM.
    • If the Kubernetes version of the cluster is 1.20.11-aliyun.1 or 1.22.3-aliyun.1, you must upload the NVIDIA Fabric Manager package in tar.gz format. For more information, see tar.
  3. After the package is uploaded to the bucket, click Files in the left-side navigation pane of the bucket details page.
  4. On the Files page, find the package that you uploaded and click View Details in the Actions column.
  5. In the View Details panel, turn off HTTPS.
    Note ACK downloads the NVIDIA driver package from its URL, which must use the HTTP protocol. By default, OSS generates object URLs that use the HTTPS protocol. Therefore, you must turn off HTTPS so that the URL uses HTTP.
  6. Check and record the URL of the NVIDIA driver package. The URL consists of two parts: the endpoint and the runfile. For example, the URL http://nvidia-XXX-XXX-cn-beijing.aliyuncs.com/NVIDIA-Linux-x86_64-460.32.03.run can be divided into the following parts:
    • endpoint: nvidia-XXX-XXX-cn-beijing.aliyuncs.com
    • runfile: NVIDIA-Linux-x86_64-460.32.03.run
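If you prefer the command line, you can upload the driver package with the ossutil tool (step 2) and then verify that the recorded URL is reachable over plain HTTP (step 6). A minimal sketch; the bucket name is a placeholder, and the URL uses the placeholder endpoint from this example:

    # Upload the driver package to the root directory of the OSS bucket.
    # Replace <your-bucket-name> with the name of your bucket.
    ossutil cp NVIDIA-Linux-x86_64-460.32.03.run oss://<your-bucket-name>/

    # Send a HEAD request to confirm that the package can be downloaded over HTTP.
    curl -fI http://nvidia-XXX-XXX-cn-beijing.aliyuncs.com/NVIDIA-Linux-x86_64-460.32.03.run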

Create a node pool with a custom driver version

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
  5. In the upper-right corner of the Node Pools page, click Create Node Pool.
  6. In the Create Node Pool dialog box, configure the parameters. For more information about the parameters, see Create an ACK dedicated cluster. The following table describes some of the parameters.
    Parameter Description
    vSwitch and Instance Type These parameters take effect only when new nodes are added to the node pool. In this example, an existing node is added to the node pool. Therefore, you can set these parameters to any valid values.
    Operating System The operating system that the node runs after it is added to the cluster.
    Quantity Set the value to 0. Otherwise, ACK creates new Elastic Compute Service (ECS) instances.
    1. Click Show Advanced Options.
    2. In the Node Label section, click the + icon to add the following labels:
      • For the first label, set Key to ack.aliyun.com/nvidia-driver-oss-endpoint and Value to nvidia-XXX-XXX-cn-beijing.aliyuncs.com.
      • For the second label, set Key to ack.aliyun.com/nvidia-driver-runfile and Value to NVIDIA-Linux-x86_64-460.32.03.run.
      • For the third label, set Key and Value based on the Kubernetes version of the cluster:
        • If the Kubernetes version of the cluster is 1.18.8-aliyun.1, set Key to ack.aliyun.com/nvidia-fabricmanager-rpm and Value to nvidia-fabric-manager-460.32.03-1.x86_64.rpm.
        • If the Kubernetes version of the cluster is 1.20.11-aliyun.1 or 1.22.3-aliyun.1, set Key to ack.aliyun.com/nvidia-fabricmanager-tarball and Value to fabricmanager-linux-x86_64-460.32.03.tar.gz.
    3. After you set the parameters, click Confirm Order.

Step 4: Add the node to the node pool

After the node pool is created, add the node that you removed to the node pool. For more information, see Add existing ECS instances to an ACK cluster.
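After the node is added, you can confirm from Cloud Shell that the node is Ready and carries the driver version label that you configured on the node pool. A minimal sketch using the example node name and the label from Method 1:

    # Show the node status together with the NVIDIA driver version label of the node pool.
    kubectl get node cn-beijing.192.168.1.128 -L ack.aliyun.com/nvidia-driver-version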

Step 5: Check whether the NVIDIA driver version is updated

  1. In the ACK console, select the cluster to which the node belongs and click More > Open Cloud Shell in the Actions column.
  2. Run the following command to query the pods that have the component=nvidia-device-plugin label:
    kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide

    Expected output:

    NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
    nvidia-device-plugin-cn-beijing.192.168.1.127   1/1     Running   0          6d    192.168.1.127   cn-beijing.192.168.1.127   <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.1.128   1/1     Running   0          17m   192.168.1.128   cn-beijing.192.168.1.128   <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.12    1/1     Running   0          9d    192.168.8.12    cn-beijing.192.168.8.12    <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.13    1/1     Running   0          9d    192.168.8.13    cn-beijing.192.168.8.13    <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.14    1/1     Running   0          9d    192.168.8.14    cn-beijing.192.168.8.14    <none>           <none>

    You can check the NODE column to find the node that is newly added to the cluster. In this example, the pod that runs on the newly added node is nvidia-device-plugin-cn-beijing.192.168.1.128.

  3. Run the following command to query the NVIDIA driver version of the node:
    kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- nvidia-smi 

    Expected output:

    Sun Feb  7 04:09:01 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.181.07   Driver Version: 418.181.07   CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
    | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
    | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   31C    P0    39W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
    | N/A   27C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    The output shows that the NVIDIA driver version is 418.181.07. This indicates that the NVIDIA driver is updated.
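If you only need the driver version, you can also query it directly instead of reading the full nvidia-smi table. A minimal sketch using the example pod name from the previous step:

    # Print only the driver version reported by each GPU on the node.
    kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- \
      nvidia-smi --query-gpu=driver_version --format=csv,noheader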