Container Service for Kubernetes:Use a node pool to update the NVIDIA driver for an existing node

Last Updated: Mar 15, 2024

If the Compute Unified Device Architecture (CUDA) toolkit that you use requires a later version of the NVIDIA driver, you must update the NVIDIA driver on the node. Updating the NVIDIA driver through node pools allows you to manage the driver versions of different groups of nodes in batches. This topic describes how to update the NVIDIA driver for the nodes in a node pool.

Limits

You cannot directly update the NVIDIA driver for a node within its current node pool, because the node pool may contain other nodes that still require the original driver version. To update the NVIDIA driver for a node, you must remove the node from its node pool and then add it to a newly created node pool that specifies the required driver version.

Important

When you add the node to a new node pool, the operating system of the node is reinstalled and the NVIDIA driver of the specified version is installed. Before you perform the update, make sure that no task is running on the node and business-critical data is backed up. To reduce the risk of failures when you update the NVIDIA driver for multiple nodes, we recommend that you first update the NVIDIA driver for one node. If no error occurs during the update, you can then perform the update on the remaining nodes.

Usage notes

When you use custom GPU driver versions, take note of the following items:

  • You need to check the compatibility between the GPU driver version and your application (specifically, the compatibility between the GPU driver version and the CUDA library version). ACK does not guarantee the compatibility.

  • For custom OS images that are pre-installed with GPU components such as the GPU driver and NVIDIA container runtime, ACK does not guarantee that the pre-installed GPU driver is compatible with other GPU components used in ACK, such as the monitoring components.

  • If you use a custom GPU driver that is uploaded to OSS and the GPU driver is incompatible with the OS, ECS instance type, or container runtime, you may fail to add GPU-accelerated nodes. ACK does not guarantee that you can add GPU-accelerated nodes without any failures.

Update the NVIDIA driver for the nodes in a node pool

Mappings between the default NVIDIA driver versions and Kubernetes versions

| Kubernetes version | Default NVIDIA driver version | Support for custom NVIDIA driver version | Supported custom NVIDIA driver version |
| --- | --- | --- | --- |
| 1.14.8 | 418.181.07 | Yes | 418.181.07, 450.102.04, 460.32.03, 460.73.01, 470.57.02 |
| 1.16.6 | 418.87.01 | No | N/A |
| 1.16.9 | 418.181.07 | Yes | 418.181.07, 450.102.04, 460.32.03, 460.73.01, 470.57.02, 510.47.03 |
| 1.18.8 | 418.181.07 | Yes | |
| 1.20.4 | 450.102.04 | Yes | |
| 1.22.10 | 460.91.03 | Yes | |
| 1.24.3 | 460.91.03 | Yes | |
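
To find the row that applies to your cluster, you can check the Kubernetes version that your nodes report, for example:

    # The VERSION column shows the kubelet version of each node.
    kubectl get nodes -o wide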

Examples

  • Node Pool A consists of nodes on which you want to update the NVIDIA driver to version 418.181.07. If you want to schedule a task to a node that runs NVIDIA driver 418.181.07, you only need to set the selector of the task to the label of Node Pool A (see the example after this list).

  • You manage nodes in two groups: A and B. You want to update the NVIDIA driver to version 418.181.07 for the nodes in Group A and to version 450.102.04 for the nodes in Group B. In this case, you can add the nodes in Group A to Node Pool A and the nodes in Group B to Node Pool B.
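
As a minimal sketch of the first example, the following pod pins itself to nodes that carry the ack.aliyun.com/nvidia-driver-version=418.181.07 node label. The pod name, image, and GPU request are placeholders; adjust them for your workload.

    # Sketch: schedule a GPU pod onto nodes that run NVIDIA driver 418.181.07.
    # Replace <your-cuda-app-image> with your own CUDA application image.
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-test
    spec:
      nodeSelector:
        ack.aliyun.com/nvidia-driver-version: "418.181.07"
      containers:
      - name: cuda-test
        image: <your-cuda-app-image>
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF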

Step 1: Query the NVIDIA driver version

Make sure that the NVIDIA driver version to be installed is compatible with the CUDA toolkit that you use. The following table lists CUDA toolkit versions and the minimum compatible NVIDIA driver versions. For more information, see cuda-toolkit-release-notes.

View the compatibility between the CUDA toolkit and the NVIDIA driver

| CUDA toolkit version | Linux x86_64 driver version |
| --- | --- |
| CUDA 11.7 Update 1 | ≥ 515.65.01 |
| CUDA 11.7 GA | ≥ 515.43.04 |
| CUDA 11.6 Update 2 | ≥ 510.47.03 |
| CUDA 11.6 Update 1 | ≥ 510.47.03 |
| CUDA 11.6 GA | ≥ 510.39.01 |
| CUDA 11.5 Update 2 | ≥ 495.29.05 |
| CUDA 11.5 Update 1 | ≥ 495.29.05 |
| CUDA 11.5 GA | ≥ 495.29.05 |
| CUDA 11.4 Update 4 | ≥ 470.82.01 |
| CUDA 11.4 Update 3 | ≥ 470.82.01 |
| CUDA 11.4 Update 2 | ≥ 470.57.02 |
| CUDA 11.4 Update 1 | ≥ 470.57.02 |
| CUDA 11.4.0 GA | ≥ 470.42.01 |
| CUDA 11.3.1 Update 1 | ≥ 465.19.01 |
| CUDA 11.3.0 GA | ≥ 465.19.01 |
| CUDA 11.2.2 Update 2 | ≥ 460.32.03 |
| CUDA 11.2.1 Update 1 | ≥ 460.32.03 |
| CUDA 11.2.0 GA | ≥ 460.27.03 |
| CUDA 11.1.1 Update 1 | ≥ 455.32 |
| CUDA 11.1 GA | ≥ 455.23 |
| CUDA 11.0.3 Update 1 | ≥ 450.51.06 |
| CUDA 11.0.2 GA | ≥ 450.51.05 |
| CUDA 11.0.1 RC | ≥ 450.36.06 |
| CUDA 10.2.89 | ≥ 440.33 |
| CUDA 10.1 (10.1.105 general release, and updates) | ≥ 418.39 |
| CUDA 10.0.130 | ≥ 410.48 |
| CUDA 9.2 (9.2.148 Update 1) | ≥ 396.37 |
| CUDA 9.2 (9.2.88) | ≥ 396.26 |
| CUDA 9.1 (9.1.85) | ≥ 390.46 |
| CUDA 9.0 (9.0.76) | ≥ 384.81 |
| CUDA 8.0 (8.0.61 GA2) | ≥ 375.26 |
| CUDA 8.0 (8.0.44) | ≥ 367.48 |
| CUDA 7.5 (7.5.16) | ≥ 352.31 |
| CUDA 7.0 (7.0.28) | ≥ 346.46 |
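
If you are not sure which CUDA toolkit version your application uses, you can usually check it inside the application container. The pod name below is a placeholder, and the commands assume that the CUDA toolkit or its version file is present in the image:

    # Print the CUDA toolkit version if nvcc is installed in the image.
    kubectl exec -ti <your-pod> -- nvcc --version
    # For runtime-only images, a version file may be available instead.
    kubectl exec -ti <your-pod> -- cat /usr/local/cuda/version.json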

Step 2: Remove a node from the cluster

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Nodes in the left-side navigation pane.

  3. Select the node for which you want to update the NVIDIA driver and click Batch Remove. In the Remove Node dialog box, select Drain the Node and click OK. (A command-line alternative for draining the node is sketched after these steps.)
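
If you prefer to drain the node from the command line before removing it, a typical sequence looks like the following. The node name is a placeholder:

    # Mark the node unschedulable, then evict its pods (DaemonSet pods are ignored).
    kubectl cordon <node-name>
    kubectl drain <node-name> --ignore-daemonsets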

Step 3: Create a node pool and specify an NVIDIA driver version

Method 1: Select an NVIDIA driver version to create a node pool

Note

This method is simple: you only need to add the ack.aliyun.com/nvidia-driver-version=<Driver version> label when you create the new node pool, and then add the node that you removed in Step 2: Remove a node from the cluster to the node pool.

The following example shows how to create a node pool and set the NVIDIA driver version to 418.181.07:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Node Pools in the left-side navigation pane.

  3. In the upper-right corner of the Node Pools page, click Create Node Pool. In the Create Node Pool dialog box, configure the parameters.

    The following table describes the parameters. For more information about the parameters, see Create an ACK dedicated cluster.

    | Parameter | Description |
    | --- | --- |
    | vSwitch, Instance Type | Specify the values of the vSwitch and Instance Type parameters as needed. These parameters apply only when new nodes are added to the node pool. In this example, an existing node is added to the node pool. Therefore, you can set these parameters to any values. |
    | Operating System | After you add a node to the node pool, the node uses the specified operating system. |
    | Expected Nodes | You can set the value to 0. If you specify a value greater than 0, Container Service for Kubernetes (ACK) automatically creates new instances. |

    1. Click Show Advanced Options.

    2. In the Node Label section, click the add icon, set Key to ack.aliyun.com/nvidia-driver-version, and then set Value to 418.181.07.

      For more information about the NVIDIA driver versions supported by ACK, see Mappings between the default NVIDIA driver versions and Kubernetes versions.

      Important

      The Elastic Compute Service (ECS) instance types ecs.ebmgn7 and ecs.ebmgn7e support only NVIDIA driver versions later than 460.32.03.

    3. After you set the parameters, click Confirm Order.
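
After the node joins this node pool in Step 4: Add the node to the node pool, you can confirm that it carries the driver version label. The node pool label is applied to the node as a Kubernetes node label, so a quick check looks like this:

    # Show the driver version label of each node as an extra column.
    kubectl get nodes -L ack.aliyun.com/nvidia-driver-version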

Method 2: Use a custom NVIDIA driver version to create a node pool

Step 1: Prepare the package of a custom NVIDIA driver version

The following example shows how to upload the NVIDIA-Linux-x86_64-460.32.03.run package to specify a custom driver version. You can download the NVIDIA driver package from the NVIDIA official website.

Note

The NVIDIA-Linux-x86_64-460.32.03.run package must be stored in the root directory of an Object Storage Service (OSS) bucket.

  1. Create an OSS bucket in the OSS console. For more information, see Create buckets.

  2. Upload the NVIDIA-Linux-x86_64-460.32.03.run file to the OSS bucket. For more information, see Upload objects.

    Note

    If the ECS instance type is ecs.ebmgn7 or ecs.ebmgn7e, you must also upload the NVIDIA Fabric Manager package to the OSS bucket. The version of the uploaded NVIDIA Fabric Manager package must match the version of the uploaded NVIDIA driver package. The following limits apply when you upload the NVIDIA Fabric Manager package:

    • If the Kubernetes version of the cluster is 1.18.8-aliyun.1, you must upload the NVIDIA Fabric Manager package in RPM format. For more information, see RPM.

    • If the Kubernetes version of the cluster is 1.20.11-aliyun.1 or 1.22.3-aliyun.1, you must upload the NVIDIA Fabric Manager package in tar.gz format. For more information, see tar.

  3. After the package is uploaded to the bucket, click Files in the left-side navigation pane of the bucket details page. On the Files page, find the uploaded package and click View Details in the Actions column.

  4. In the View Details panel, turn off HTTPS.

    Note

    ACK downloads the NVIDIA driver package through its HTTP URL. By default, OSS serves objects over HTTPS, so you must turn off HTTPS for the object.

  5. Confirm the configuration file of the NVIDIA driver and record the URL of the file.

    The URL consists of the following parts: endpoint and runfile.

    In this example, the URL is http://nvidia-XXX-XXX-cn-beijing.aliyuncs.com/NVIDIA-Linux-x86_64-460.32.03.run, which consists of the following parts:

    • endpoint: nvidia-XXX-XXX-cn-beijing.aliyuncs.com

    • runfile: NVIDIA-Linux-x86_64-460.32.03.run
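
If you prefer the command line, the following sketch uploads the driver package with the ossutil CLI and then verifies that the object is reachable over plain HTTP after HTTPS is turned off. The bucket name and endpoint are placeholders, and ossutil is assumed to be installed and configured with your credentials:

    # Upload the runfile to the root directory of the OSS bucket.
    ossutil cp NVIDIA-Linux-x86_64-460.32.03.run oss://<your-bucket>/
    # Confirm that the runfile can be fetched over HTTP, which is how ACK downloads it.
    curl -I http://<your-bucket>.<oss-endpoint>/NVIDIA-Linux-x86_64-460.32.03.run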

Step 2: Create a node pool that runs a custom NVIDIA driver version

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Node Pools in the left-side navigation pane.

  3. In the upper-right corner of the Node Pools page, click Create Node Pool and set the parameters in the Create Node Pool dialog box.

    The following table describes the parameters. For more information about the parameters, see Create an ACK dedicated cluster.

    | Parameter | Description |
    | --- | --- |
    | vSwitch, Instance Type | Specify the values of the vSwitch and Instance Type parameters as needed. These parameters apply only when new nodes are added to the node pool. In this example, an existing node is added to the node pool. Therefore, you can set these parameters to any values. |
    | Operating System | After you add a node to the node pool, the node uses the specified operating system. |
    | Expected Nodes | You can set the value to 0. If you specify a value greater than 0, Container Service for Kubernetes (ACK) automatically creates new instances. |

    1. Click Show Advanced Options.

    2. In the Node Label section, click the add icon to add the following labels:

      • For the first label, set Key to ack.aliyun.com/nvidia-driver-oss-endpoint and Value to nvidia-XXX-XXX-cn-beijing.aliyuncs.com.

      • For the second label, set Key to ack.aliyun.com/nvidia-driver-runfile and Value to NVIDIA-Linux-x86_64-460.32.03.run.

      • For the third label, set Key and Value based on the Kubernetes version of the cluster.

        • If the Kubernetes version of the cluster is 1.18.8-aliyun.1, set Key to ack.aliyun.com/nvidia-fabricmanager-rpm and Value to nvidia-fabric-manager-460.32.03-1.x86_64.rpm.

        • If the Kubernetes version of the cluster is 1.20.11-aliyun.1 or 1.22.3-aliyun.1, set Key to ack.aliyun.com/nvidia-fabricmanager-tarball and Value to fabricmanager-linux-x86_64-460.32.03.tar.gz.

    3. After you set the parameters, click Confirm Order.

Step 4: Add the node to the node pool

After the node pool is created, add the node that you removed in Step 2: Remove a node from the cluster to the node pool. For more information, see Add existing ECS instances to an ACK cluster.

Step 5: Check whether the NVIDIA driver version is updated

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster to which the node belongs and choose More > Open Cloud Shell in the Actions column.

  3. Run the following command to query pods that have the component: nvidia-device-plugin label:

    kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide

    Expected output:

    NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
    nvidia-device-plugin-cn-beijing.192.168.1.127   1/1     Running   0          6d    192.168.1.127   cn-beijing.192.168.1.127   <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.1.128   1/1     Running   0          17m   192.168.1.128   cn-beijing.192.168.1.128   <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.12    1/1     Running   0          9d    192.168.8.12    cn-beijing.192.168.8.12    <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.13    1/1     Running   0          9d    192.168.8.13    cn-beijing.192.168.8.13    <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.14    1/1     Running   0          9d    192.168.8.14    cn-beijing.192.168.8.14    <none>           <none>

    You can check the NODE column to find the node that is newly added to the cluster. The name of the pod that runs on the node is nvidia-device-plugin-cn-beijing.192.168.1.128.

  4. Run the following command to query the NVIDIA driver version of the node:

    kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- nvidia-smi 

    Expected output:

    Sun Feb  7 04:09:01 2021       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.181.07   Driver Version: 418.181.07   CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
    | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
    | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   31C    P0    39W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
    | N/A   27C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    The output shows that the NVIDIA driver version is 418.181.07. This indicates that the NVIDIA driver is updated.

References

Specify a custom NVIDIA driver version for a node