By default, different types and versions of Alibaba Cloud Container Service for Kubernetes (ACK) clusters install different NVIDIA GPU driver versions. If your Compute Unified Device Architecture (CUDA) toolkit requires a newer driver for compatibility, you can install a custom version on your GPU nodes. This topic explains how to use a node pool label to customize the NVIDIA GPU driver version on GPU nodes.
Important notes
ACK does not guarantee compatibility between the NVIDIA driver version and the CUDA toolkit version. You must verify their compatibility yourself.
For detailed driver version requirements for different NVIDIA GPU models, see the official NVIDIA documentation.
For custom operating system images that already have GPU components installed, such as the GPU driver or NVIDIA Container Runtime, ACK cannot guarantee that the pre-installed driver is compatible with other ACK GPU components, such as monitoring components.
This method applies the custom driver only to newly added or scaled-out nodes. Installation is triggered when a node is added and does not affect existing nodes. To apply a new driver to existing nodes, you must first remove the nodes and then add them back to the cluster (see the drain sketch after these notes).
For the gn7 and ebmgn7 instance families, driver versions 510.xxx and 515.xxx have known compatibility issues. We recommend that you use a driver version earlier than 510 (for example, 470.xxx.xx) with GSP disabled, or version 525.125.06 or later.
Elastic Compute Service (ECS) instances of the ebmgn7 or ebmgn7e instance types support only NVIDIA driver versions 460.32.03 or later.
When you create a node pool, if the specified driver version is not in the list of NVIDIA driver versions supported by ACK, ACK automatically installs the default driver version. Specifying a driver version that is incompatible with the latest operating system image may cause node addition failures; in that case, select the latest supported driver version instead.
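If you need to reapply a driver to an existing node, drain the node before you remove it from the cluster so that its workloads are rescheduled elsewhere. The following is a minimal sketch, assuming a node named cn-qingdao.10.117.XXX.XX; replace it with your own node name:

# Mark the node unschedulable so no new pods land on it.
kubectl cordon cn-qingdao.10.117.XXX.XX

# Evict running pods; DaemonSet pods such as the device plugin are skipped.
kubectl drain cn-qingdao.10.117.XXX.XX --ignore-daemonsets --delete-emptydir-data

You can then remove the node and scale the node pool back out so that the custom driver is installed on the replacement node.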
Step 1: Select the NVIDIA GPU driver version
Select the required NVIDIA GPU driver version from the list of driver versions supported by ACK. This topic uses driver version 550.144.03 as an example.
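If you want to compare against the driver currently installed on an existing GPU node, nvidia-smi can print just the driver version. A quick check, assuming you have shell access to the node (for example, over SSH):

# Print only the installed driver version, one line per GPU.
nvidia-smi --query-gpu=driver_version --format=csv,noheader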
Step 2: Create a node pool and specify the driver version
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster to manage and click its name. In the left navigation pane, choose Nodes > Node Pools.
In the upper-left corner, click Create Node Pool. For details about the configuration parameters, see Create and manage a node pool. Configure the key parameters as follows:
In the Node Labels section under Advanced Options, click the + icon to add a label. In the Key field, enter ack.aliyun.com/nvidia-driver-version. In the Value field, enter 550.144.03.
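After the node pool is created and nodes are added, you can confirm that the label reached the nodes. Node pool labels are applied to the nodes in the pool, and the -L flag prints the label value as a column:

kubectl get nodes -L ack.aliyun.com/nvidia-driver-version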
Step 3: Verify the custom NVIDIA driver installation
Run the following command to view pods with the component=nvidia-device-plugin label:

kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide

Expected output:

NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE                       NOMINATED NODE   READINESS GATES
ack-nvidia-device-plugin-fnctc   1/1     Running   0          2m33s   10.117.227.43   cn-qingdao.10.117.XXX.XX   <none>           <none>

The output shows a pod named ack-nvidia-device-plugin-fnctc running on the newly added node.

Run the following command to verify that the node uses the expected driver version:

kubectl exec -ti ack-nvidia-device-plugin-fnctc -n kube-system -- nvidia-smi

Expected output:

Mon Mar 24 08:51:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P4                       On  |   00000000:00:07.0 Off |                    0 |
| N/A   33C    P8              7W /  75W  |       0MiB /  7680MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The output shows Driver Version: 550.144.03, which confirms that the node pool label successfully applied the custom NVIDIA driver.
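As an additional check, you can schedule a short-lived pod that requests one GPU and prints the driver version from inside a container. This is a minimal sketch, not part of the official procedure: the pod name gpu-smoke-test and the CUDA image tag are example values; replace them as needed.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Example CUDA base image; nvidia-smi is injected by the NVIDIA container runtime.
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# After the pod completes, its logs should show Driver Version: 550.144.03.
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test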
Alternative methods
Alternatively, you can set the custom driver label in the node pool configuration when you call the CreateClusterNodePool API operation. The following example shows the tags section of the request:
{
// Other fields are not shown.
......
"tags": [
{
"key": "ack.aliyun.com/nvidia-driver-version",
"value": "550.144.03"
}
],
// Other fields are not shown.
......
}
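If you prefer the command line to the console, the same request body can be sent with the Alibaba Cloud CLI. This is a sketch under assumptions: the CLI is installed and configured, nodepool.json contains a complete CreateClusterNodePool request body (including the tags shown above), and <cluster_id> is a placeholder for your cluster ID:

# Create the node pool; the custom driver label travels in the request body.
aliyun cs POST /clusters/<cluster_id>/nodepools \
  --header "Content-Type=application/json" \
  --body "$(cat nodepool.json)"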