If the Compute Unified Device Architecture (CUDA) toolkit that you use requires a later version of the NVIDIA driver, you must update the NVIDIA driver of the node. To do this, you remove the node from the cluster and then add it back. When the node is added back, the operating system of the node is reinstalled and the NVIDIA driver of the specified version is installed. This topic describes how to use a node pool to update the NVIDIA driver for a node.

Background information

Container Service for Kubernetes (ACK) does not support updating the NVIDIA driver of a node in place; you must remove the node from the cluster and then add it back. In addition, the node pool to which the node belongs may contain nodes with different configurations. Therefore, you cannot update the NVIDIA driver for the entire node pool in a single operation.

Benefits

This solution allows you to manage the NVIDIA driver versions of nodes in groups by using node pools. It covers the following two scenarios:

  • You manage cluster nodes in two groups: A and B. You want to update the NVIDIA driver of the nodes in Group A to version 418.181.07 and the NVIDIA driver of the nodes in Group B to version 450.102.04. In this case, you can add the nodes in Group A to Node Pool A and the nodes in Group B to Node Pool B.
  • Node Pool A consists of all nodes whose NVIDIA drivers are to be updated to version 418.181.07. If you want to schedule a task only to nodes that run NVIDIA driver version 418.181.07, you need only to set the node selector of the task to the label of Node Pool A.
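In the second scenario, the node pool label doubles as a scheduling constraint. The following manifest is a minimal sketch of a pod that is scheduled only to nodes that carry the ack.aliyun.com/nvidia-driver-version=418.181.07 node label described later in this topic. The file name, pod name, and container image are examples, not requirements.

    # cuda-task-example.yaml (example file name)
    # The pod is scheduled only to nodes that carry the NVIDIA driver version
    # label configured on Node Pool A. Replace the image with your own image.
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-task-example
    spec:
      nodeSelector:
        ack.aliyun.com/nvidia-driver-version: "418.181.07"
      containers:
      - name: cuda-container
        image: nvidia/cuda:11.0.3-base-ubuntu20.04   # example image
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: 1   # requires the NVIDIA device plugin

After the node pool is created and the node is added back, you can apply the manifest with kubectl apply -f cuda-task-example.yaml.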

Default NVIDIA driver versions of ACK clusters

The following table lists the default NVIDIA driver version and the supported custom NVIDIA driver versions for each Kubernetes version.

Kubernetes version | Default NVIDIA driver version | Support for custom NVIDIA driver versions | Supported custom NVIDIA driver versions
1.14.8  | 418.181.07 | Yes | 418.181.07, 450.102.04, 460.32.03, 460.73.01, 470.57.02
1.16.6  | 418.87.01  | No  | N/A
1.16.9  | 418.181.07 | Yes | 418.181.07, 450.102.04, 460.32.03, 460.73.01, 470.57.02, 510.47.03
1.18.8  | 418.181.07 | Yes |
1.20.4  | 450.102.04 | Yes |
1.22.10 | 460.91.03  | Yes |
1.24.3  | 460.91.03  | Yes |
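To find which row of the table applies to your cluster, you can check the Kubernetes version that your nodes run. A minimal sketch using kubectl:

    # Print the kubelet version of each node in the cluster.
    kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion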

Step 1: Determine the NVIDIA driver version

Note
  • To update the NVIDIA driver for a node, you must remove the node from the cluster and then add the node back to the cluster. When the node is added to the cluster, the operating system of the node is reinstalled and the NVIDIA driver of the specified version is installed. Before you perform the update, make sure that no task is running on the node and critical data is backed up.
  • To lower the risk of failures, we recommend that you first update the NVIDIA driver for one node. If no error occurs during this process, you can then perform the update on multiple nodes.
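Before you remove the node, you can confirm that no workloads other than DaemonSet pods are still running on it. A minimal sketch; the node name is an example taken from the output later in this topic:

    # List all pods that are still scheduled on the node to be updated.
    # Replace cn-beijing.192.168.1.128 with the name of your node.
    kubectl get pods --all-namespaces --field-selector spec.nodeName=cn-beijing.192.168.1.128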

You must first obtain the NVIDIA driver versions that are compatible with the CUDA toolkit that you use. The following table lists CUDA Toolkit versions and the minimum NVIDIA driver versions that they require. For more information, see cuda-toolkit-release-notes.

CUDA Toolkit | Linux x86_64 Driver Version
CUDA 11.7 Update 1 | ≥ 515.65.01
CUDA 11.7 GA | ≥ 515.43.04
CUDA 11.6 Update 2 | ≥ 510.47.03
CUDA 11.6 Update 1 | ≥ 510.47.03
CUDA 11.6 GA | ≥ 510.39.01
CUDA 11.5 Update 2 | ≥ 495.29.05
CUDA 11.5 Update 1 | ≥ 495.29.05
CUDA 11.5 GA | ≥ 495.29.05
CUDA 11.4 Update 4 | ≥ 470.82.01
CUDA 11.4 Update 3 | ≥ 470.82.01
CUDA 11.4 Update 2 | ≥ 470.57.02
CUDA 11.4 Update 1 | ≥ 470.57.02
CUDA 11.4.0 GA | ≥ 470.42.01
CUDA 11.3.1 Update 1 | ≥ 465.19.01
CUDA 11.3.0 GA | ≥ 465.19.01
CUDA 11.2.2 Update 2 | ≥ 460.32.03
CUDA 11.2.1 Update 1 | ≥ 460.32.03
CUDA 11.2.0 GA | ≥ 460.27.03
CUDA 11.1.1 Update 1 | ≥ 455.32
CUDA 11.1 GA | ≥ 455.23
CUDA 11.0.3 Update 1 | ≥ 450.51.06
CUDA 11.0.2 GA | ≥ 450.51.05
CUDA 11.0.1 RC | ≥ 450.36.06
CUDA 10.2.89 | ≥ 440.33
CUDA 10.1 (10.1.105 general release, and updates) | ≥ 418.39
CUDA 10.0.130 | ≥ 410.48
CUDA 9.2 (9.2.148 Update 1) | ≥ 396.37
CUDA 9.2 (9.2.88) | ≥ 396.26
CUDA 9.1 (9.1.85) | ≥ 390.46
CUDA 9.0 (9.0.76) | ≥ 384.81
CUDA 8.0 (8.0.61 GA2) | ≥ 375.26
CUDA 8.0 (8.0.44) | ≥ 367.48
CUDA 7.5 (7.5.16) | ≥ 352.31
CUDA 7.0 (7.0.28) | ≥ 346.46
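If you are not sure which CUDA toolkit version your workload uses, you can query it from the application image. A minimal sketch, assuming Docker is available, a hypothetical image name, and an image that is based on an official nvidia/cuda base image:

    # Print the CUDA toolkit version shipped in the image. This works only if the
    # image contains nvcc, for example an image built on an nvidia/cuda *-devel tag.
    docker run --rm registry.example.com/my-cuda-app:latest nvcc --version

    # Images based on the official nvidia/cuda base images also record the version
    # in the CUDA_VERSION environment variable.
    docker run --rm registry.example.com/my-cuda-app:latest sh -c 'echo $CUDA_VERSION'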

Step 2: Remove a node from the cluster

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane of the details page, choose Nodes > Nodes.
  5. Select the node for which you want to update the NVIDIA driver and click Batch Remove.
  6. In the Remove Node dialog box, select Drain the Node and click OK.
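Selecting Drain the Node evicts the pods for you. If you prefer to cordon and drain the node manually from Cloud Shell before you remove it, a minimal sketch using an example node name:

    # Stop scheduling new pods to the node, then evict the pods that run on it.
    # Replace cn-beijing.192.168.1.128 with the name of your node.
    kubectl cordon cn-beijing.192.168.1.128
    kubectl drain cn-beijing.192.168.1.128 --ignore-daemonsets --delete-emptydir-data
    # On older kubectl versions, use --delete-local-data instead of --delete-emptydir-data.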

Step 3: Create a node pool and specify an NVIDIA driver version

Method 1: Select an NVIDIA driver version provided by ACK

Note This method is simple. You need only to add the ack.aliyun.com/nvidia-driver-version=<Driver Version> label when you create a node pool and then add the node that you removed to the node pool.

The following example shows how to set the NVIDIA driver version to 418.181.07:

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
  5. In the upper-right corner of the Node Pools page, click Create Node Pool.
  6. In the Create Node Pool dialog box, configure the parameters. For more information about the parameters, see Create an ACK dedicated cluster. The following table describes some of the parameters.
    Parameter Description
    vSwitch and Instance Type These parameters take effect only when new nodes are added to the node pool. In this example, an existing node is added to the node pool. Therefore, you can set these parameters to any valid values.
    Operating System The operating system that the node runs after it is added to the cluster.
    Quantity Set the value to 0. Otherwise, ACK creates new Elastic Compute Service (ECS) instances.
    1. Click Show Advanced Options.
    2. In the Node Label section, click the + icon, set Key to ack.aliyun.com/nvidia-driver-version, and then set Value to 418.181.07.

      For more information about the NVIDIA driver versions provided by ACK, see Default NVIDIA driver versions of ACK clusters.

      Note The ECS instance types ecs.ebmgn7 and ecs.ebmgn7e support only NVIDIA driver versions later than 460.32.03.
    3. After you set the parameters, click Confirm Order.

Method 2: Use a custom NVIDIA driver version

Custom driver version

The following example shows how to specify a custom driver version by uploading the NVIDIA-Linux-x86_64-460.32.03.run package. You can download NVIDIA driver packages from the official NVIDIA website.

Note The NVIDIA-Linux-x86_64-460.32.03.run package must be stored in the root directory of an Object Storage Service (OSS) bucket.
  1. Create an OSS bucket in the OSS console. For more information, see Create buckets.
  2. Upload the NVIDIA-Linux-x86_64-460.32.03.run package to the OSS bucket. For more information, see Upload objects. You can also upload the package from the command line, as shown in the sketch after this list.
    Note If the ECS instance type is ecs.ebmgn7 or ecs.ebmgn7e, you must also upload the NVIDIA Fabric Manager package to the OSS bucket. The version of the uploaded NVIDIA Fabric Manager package must match the version of the uploaded NVIDIA driver package. The following limits apply when you upload the NVIDIA Fabric Manager package:
    • If the Kubernetes version of the cluster is 1.18.8-aliyun.1, you must upload the NVIDIA Fabric Manager package in RPM format. For more information, see RPM.
    • If the Kubernetes version of the cluster is 1.20.11-aliyun.1 or 1.22.3-aliyun.1, you must upload the NVIDIA Fabric Manager package in tar.gz format. For more information, see tar.
  3. After the package is uploaded to the bucket, click Files in the left-side navigation pane of the bucket details page.
  4. On the Files page, find the package that you uploaded and click View Details in the Actions column.
  5. In the View Details panel, turn off HTTPS.
    Note ACK downloads the NVIDIA driver package from its URL, which must use the HTTP protocol. By default, OSS generates object URLs that use the HTTPS protocol. Therefore, you must turn off HTTPS so that the URL uses HTTP.
  6. Check and record the URL of the NVIDIA driver package. The URL consists of two parts: the endpoint and the runfile. For example, the URL http://nvidia-XXX-XXX-cn-beijing.aliyuncs.com/NVIDIA-Linux-x86_64-460.32.03.run can be divided into the following parts:
    • endpoint: nvidia-XXX-XXX-cn-beijing.aliyuncs.com
    • runfile: NVIDIA-Linux-x86_64-460.32.03.run
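If you prefer the command line, you can upload the driver package with the ossutil tool (step 2) and then verify that the recorded URL is reachable over plain HTTP (step 6). A minimal sketch; the bucket name is a placeholder, and the URL uses the placeholder endpoint from this example:

    # Upload the driver package to the root directory of the OSS bucket.
    # Replace <your-bucket-name> with the name of your bucket.
    ossutil cp NVIDIA-Linux-x86_64-460.32.03.run oss://<your-bucket-name>/

    # Send a HEAD request to confirm that the package can be downloaded over HTTP.
    curl -fI http://nvidia-XXX-XXX-cn-beijing.aliyuncs.com/NVIDIA-Linux-x86_64-460.32.03.run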

Create a node pool with a custom driver version

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
  5. In the upper-right corner of the Node Pools page, click Create Node Pool.
  6. In the Create Node Pool dialog box, configure the parameters. For more information about the parameters, see Create an ACK dedicated cluster. The following table describes some of the parameters.
    Parameter Description
    vSwitch and Instance Type These parameters take effect only when new nodes are added to the node pool. In this example, an existing node is added to the node pool. Therefore, you can set these parameters to any valid values.
    Operating System The operating system that the node runs after it is added to the cluster.
    Quantity Set the value to 0. Otherwise, ACK creates new Elastic Compute Service (ECS) instances.
    1. Click Show Advanced Options.
    2. In the Node Label section, click the + icon to add the following labels:
      • For the first label, set Key to ack.aliyun.com/nvidia-driver-oss-endpoint and Value to nvidia-XXX-XXX-cn-beijing.aliyuncs.com.
      • For the second label, set Key to ack.aliyun.com/nvidia-driver-runfile and Value to NVIDIA-Linux-x86_64-460.32.03.run.
      • For the third label, set Key and Value based on the Kubernetes version of the cluster:
        • If the Kubernetes version of the cluster is 1.18.8-aliyun.1, set Key to ack.aliyun.com/nvidia-fabricmanager-rpm and Value to nvidia-fabric-manager-460.32.03-1.x86_64.rpm.
        • If the Kubernetes version of the cluster is 1.20.11-aliyun.1 or 1.22.3-aliyun.1, set Key to ack.aliyun.com/nvidia-fabricmanager-tarball and Value to fabricmanager-linux-x86_64-460.32.03.tar.gz.
    3. After you set the parameters, click Confirm Order.

Step 4: Add the node to the node pool

After the node pool is created, add the node that you removed to the node pool. For more information, see Add existing ECS instances to an ACK cluster.
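After the node is added, you can confirm from Cloud Shell that the node is Ready and carries the driver version label that you configured on the node pool. A minimal sketch using the example node name and the label from Method 1:

    # Show the node status together with the NVIDIA driver version label of the node pool.
    kubectl get node cn-beijing.192.168.1.128 -L ack.aliyun.com/nvidia-driver-version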

Step 5: Check whether the NVIDIA driver version is updated

  1. In the ACK console, select the cluster to which the node belongs and click More > Open Cloud Shell in the Actions column.
  2. Run the following command to query the pods that have the component=nvidia-device-plugin label:
    kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide

    Expected output:

    NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
    nvidia-device-plugin-cn-beijing.192.168.1.127   1/1     Running   0          6d    192.168.1.127   cn-beijing.192.168.1.127   <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.1.128   1/1     Running   0          17m   192.168.1.128   cn-beijing.192.168.1.128   <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.12    1/1     Running   0          9d    192.168.8.12    cn-beijing.192.168.8.12    <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.13    1/1     Running   0          9d    192.168.8.13    cn-beijing.192.168.8.13    <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.14    1/1     Running   0          9d    192.168.8.14    cn-beijing.192.168.8.14    <none>           <none>

    You can check the NODE column to find the node that is newly added to the cluster. In this example, the pod that runs on the newly added node is nvidia-device-plugin-cn-beijing.192.168.1.128.

  3. Run the following command to query the NVIDIA driver version of the node:
    kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- nvidia-smi 

    Expected output:

    Sun Feb  7 04:09:01 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.181.07   Driver Version: 418.181.07   CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
    | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
    | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   31C    P0    39W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
    | N/A   27C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    The output shows that the NVIDIA driver version is 418.181.07. This indicates that the NVIDIA driver is updated.
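If you only need the driver version, you can also query it directly instead of reading the full nvidia-smi table. A minimal sketch using the example pod name from the previous step:

    # Print only the driver version reported by each GPU on the node.
    kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- \
      nvidia-smi --query-gpu=driver_version --format=csv,noheader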