
Container Service for Kubernetes: Specify an NVIDIA driver version for nodes by adding an OSS URL

Last Updated: Mar 17, 2026

This guide shows how to customize the NVIDIA driver on GPU nodes by using node pool labels and an Object Storage Service (OSS) URL.

Prerequisites

  • You have a running ACK cluster with GPU-accelerated nodes, or you plan to create one.

  • You have access to an OSS bucket in the same region as your cluster.

Precautions

Compatibility constraints

ACK does not guarantee compatibility between custom-uploaded drivers and cluster components. You are responsible for verifying CUDA compatibility between the driver version you specify and the GPU applications running in your workloads. Refer to the List of NVIDIA driver versions supported by ACK for the versions that ACK has validated.

The following instance-type constraints apply regardless of the driver version you upload:

| Instance family | Unsupported driver versions | Recommended versions |
| --- | --- | --- |
| ecs.gn7.xxxxx, ecs.ebmgn7.xxxx | 510.xxx, 515.xxx | Earlier than 510.xxx with GSP disabled (for example, 470.xxx.xxxx), or 525.125.06 or later |
| ebmgn7, ebmgn7e | Earlier than or equal to 460.32.03 | Later than 460.32.03 |

Operational behavior

Important

Node pool labels take effect only when new nodes are added to the node pool. The specified driver is NOT installed on nodes that already exist in the node pool. To apply the driver to existing nodes, remove each node from the node pool and re-add it. For instructions, see Remove a node and Add existing ECS instances to an ACK cluster.

Step 1: Download the NVIDIA driver

Download the NVIDIA driver runfile for your target version from the NVIDIA Driver Downloads page.

The file name follows the pattern NVIDIA-Linux-x86_64-<version>.run, for example: NVIDIA-Linux-x86_64-550.90.07.run.
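The filename can be derived from the target version before downloading; a minimal sketch using this guide's example version 550.90.07 (the Tesla download URL pattern in the comment is an assumption; confirm the exact URL on the NVIDIA Driver Downloads page):

```shell
# Target driver version from this guide's example; substitute your own.
VERSION="550.90.07"
RUNFILE="NVIDIA-Linux-x86_64-${VERSION}.run"

# Download step (the URL pattern below is an assumption; confirm the
# exact URL on the NVIDIA Driver Downloads page before using it):
# wget "https://us.download.nvidia.com/tesla/${VERSION}/${RUNFILE}"

# Sanity-check that the filename matches the documented pattern.
if printf '%s\n' "$RUNFILE" | grep -Eq '^NVIDIA-Linux-x86_64-[0-9]+(\.[0-9]+)+\.run$'; then
  echo "runfile name OK: $RUNFILE"
else
  echo "unexpected runfile name: $RUNFILE" >&2
  exit 1
fi
```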

Step 2: Download NVIDIA Fabric Manager

Download the NVIDIA Fabric Manager RPM package that matches your driver version from the NVIDIA YUM repository. The version of NVIDIA Fabric Manager must be the same as that of the NVIDIA driver.

wget https://developer.download.nvidia.cn/compute/cuda/repos/rhel7/x86_64/nvidia-fabric-manager-550.90.07-1.x86_64.rpm

The file name follows the pattern nvidia-fabric-manager-<version>-1.x86_64.rpm, for example: nvidia-fabric-manager-550.90.07-1.x86_64.rpm.
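Because the Fabric Manager version must match the driver version exactly, it can help to compare the two filenames before uploading; a small sketch using this guide's example filenames:

```shell
# Filenames below are this guide's examples; substitute your own.
RUNFILE="NVIDIA-Linux-x86_64-550.90.07.run"
RPM="nvidia-fabric-manager-550.90.07-1.x86_64.rpm"

# Strip the fixed prefix and suffix from each filename to get the version.
run_ver="${RUNFILE#NVIDIA-Linux-x86_64-}"; run_ver="${run_ver%.run}"
rpm_ver="${RPM#nvidia-fabric-manager-}";   rpm_ver="${rpm_ver%-1.x86_64.rpm}"

if [ "$run_ver" = "$rpm_ver" ]; then
  echo "versions match: $run_ver"
else
  echo "version mismatch: driver $run_ver vs fabric manager $rpm_ver" >&2
  exit 1
fi
```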

Step 3: Create an OSS bucket

Log on to the OSS console and create a bucket. For step-by-step instructions, see Create a bucket.

Note

Create the OSS bucket in the same region as your ACK cluster. ACK pulls the driver from OSS through the internal network during node initialization, so using the same region reduces latency and avoids cross-region access failures.

Step 4: Upload the NVIDIA driver and NVIDIA Fabric Manager to the OSS bucket

Important

ACK pulls the NVIDIA driver from an HTTP URL (not HTTPS) during node initialization, and constructs the file URL from the bucket endpoint and filename directly. To ensure successful retrieval:

  • Upload files to the root directory of the bucket, not to a subdirectory.

  • Disable HTTPS for each file after you upload it.

  • Use the bucket's internal endpoint (containing the -internal keyword) or an accelerated domain name (containing oss-accelerate) when you configure the node pool label in Step 5. Do not use an external endpoint — it is slow and may cause ACK to fail when adding GPU-accelerated nodes.

  1. Log on to the OSS console and upload both files, NVIDIA-Linux-x86_64-550.90.07.run and nvidia-fabric-manager-550.90.07-1.x86_64.rpm, to the root directory of the bucket.

  2. In the bucket's left navigation pane, choose Files > Objects. In the Actions column for the driver file, click Details.

  3. In the Details panel, disable HTTPS.

  4. Repeat steps 2–3 for the Fabric Manager RPM file.

  5. In the bucket's left navigation pane, click Overview and record the bucket's internal endpoint. You will use this value in Step 5.

    Note

    If you encounter file retrieval failures, check your bucket access configuration. See OSS access control.
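Before configuring the label in the next step, you can sanity-check that the recorded endpoint is one ACK accepts (internal or accelerated); a minimal sketch using the example endpoint from this guide:

```shell
# The endpoint recorded from the bucket's Overview page (example value).
ENDPOINT="my-nvidia-driver.oss-cn-beijing-internal.aliyuncs.com"

# ACK requires the internal endpoint (-internal) or an accelerated
# domain name (oss-accelerate); reject anything else early.
case "$ENDPOINT" in
  *-internal*|*oss-accelerate*) echo "endpoint OK: $ENDPOINT" ;;
  *) echo "not an internal or accelerated endpoint: $ENDPOINT" >&2; exit 1 ;;
esac
```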

Step 5: Configure node pool labels

Add three node pool labels to instruct ACK to pull your driver files from OSS during node initialization. Each label key controls a specific part of the retrieval process:

| Label key | Purpose | Example value |
| --- | --- | --- |
| ack.aliyun.com/nvidia-driver-oss-endpoint | OSS bucket endpoint that ACK uses to locate your files. Must be the internal or accelerated endpoint from Step 4. | my-nvidia-driver.oss-cn-beijing-internal.aliyuncs.com |
| ack.aliyun.com/nvidia-driver-runfile | Filename of the NVIDIA driver runfile uploaded in Step 4. | NVIDIA-Linux-x86_64-550.90.07.run |
| ack.aliyun.com/nvidia-fabricmanager-rpm | Filename of the Fabric Manager RPM uploaded in Step 4. | nvidia-fabric-manager-550.90.07-1.x86_64.rpm |

To configure these labels on a new node pool through the console:

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Nodes > Node Pools.

  3. Click Create Node Pool and configure your GPU-accelerated nodes. For details on all parameters, see Create and manage node pools.

  4. In the Node Labels section, click the add icon and add the three label keys from the table above, replacing each example value with your actual OSS endpoint, driver filename, and Fabric Manager filename.

  5. Complete the node pool creation. ACK installs the specified driver on each new node as it is added to the pool.
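Once a node from the pool is ready, the node pool labels should also be visible as Kubernetes node labels, which gives a quick way to confirm the configuration reached the node. A sketch, assuming the labels propagate to the node (the kubectl line in the comment is the real command; the LABELS value is a sample for illustration only):

```shell
# On a real cluster, list the labels on a new node, for example:
#   kubectl get node <node-name> --show-labels
# LABELS below is a sample value for illustration only.
LABELS='ack.aliyun.com/nvidia-driver-runfile=NVIDIA-Linux-x86_64-550.90.07.run'

if printf '%s\n' "$LABELS" | grep -q 'ack.aliyun.com/nvidia-driver-runfile'; then
  echo "driver label present"
else
  echo "driver label missing" >&2
  exit 1
fi
```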

Step 6: Verify that the specified NVIDIA driver version is installed

After nodes are added to the node pool, verify that the specified driver version is running.

  1. Run the following command to list the nvidia-device-plugin pods and identify the pod running on your new node:

    kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide

    The output includes a NODE column showing which node each pod is scheduled on. Identify the pod name for your new node (for example, nvidia-device-plugin-cn-beijing.192.168.1.128).

  2. Run the following command to check the driver version on the node:

    kubectl exec -ti <pod-name> -n kube-system -- nvidia-smi

    Replace <pod-name> with the pod name from the previous step. The nvidia-smi output shows the driver version and CUDA version. Confirm that the driver version matches the version you uploaded — for example, 550.90.07 and CUDA Version 12.4.
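The version check above can be scripted; a sketch that extracts the driver version from nvidia-smi output (SMI_OUTPUT here is a captured sample line for illustration; on a real node, capture the kubectl exec output instead):

```shell
EXPECTED="550.90.07"

# Sample nvidia-smi banner line; on a real node, capture it with:
#   kubectl exec -ti <pod-name> -n kube-system -- nvidia-smi
SMI_OUTPUT='| NVIDIA-SMI 550.90.07    Driver Version: 550.90.07    CUDA Version: 12.4 |'

# Pull out the version that follows "Driver Version:".
ver=$(printf '%s\n' "$SMI_OUTPUT" | grep -o 'Driver Version: [0-9.]*' | awk '{print $3}')

if [ "$ver" = "$EXPECTED" ]; then
  echo "driver version OK: $ver"
else
  echo "unexpected driver version: $ver" >&2
  exit 1
fi
```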

Configure via API

Instead of using the ACK console, create a node pool through the CreateClusterNodePool API. Add the three label keys to the tags array in the request body:

{
  "tags": [
    {
      "key": "ack.aliyun.com/nvidia-driver-oss-endpoint",
      "value": "my-nvidia-driver.oss-cn-beijing-internal.aliyuncs.com"
    },
    {
      "key": "ack.aliyun.com/nvidia-driver-runfile",
      "value": "NVIDIA-Linux-x86_64-550.90.07.run"
    },
    {
      "key": "ack.aliyun.com/nvidia-fabricmanager-rpm",
      "value": "nvidia-fabric-manager-550.90.07-1.x86_64.rpm"
    }
  ]
}

Replace each value with your actual OSS endpoint, driver runfile name, and Fabric Manager RPM name from your setup.