The version of the NVIDIA driver that is installed in a Container Service for Kubernetes (ACK) cluster varies by Kubernetes version. If the Compute Unified Device Architecture (CUDA) toolkit that you use requires a later version of the NVIDIA driver, you must specify a custom NVIDIA driver version. This topic describes how to use a node pool to create a node that runs a custom NVIDIA driver version.

Background information

  • The following table describes the mappings between the default NVIDIA driver versions and Kubernetes versions.

    Kubernetes version   Default NVIDIA driver version   Supports custom driver version   Supported custom NVIDIA driver versions
    1.14.8               418.181.07                      Yes                              418.181.07, 450.102.04, 460.32.03, 460.73.01, 470.57.02
    1.16.6               418.87.01                       No                               N/A
    1.16.9               418.181.07                      Yes                              418.181.07, 450.102.04, 460.32.03, 460.73.01, 470.57.02, 510.47.03
    1.18.8               418.181.07                      Yes                              Same as Kubernetes 1.16.9
    1.20.4               450.102.04                      Yes                              Same as Kubernetes 1.16.9
    1.22.10              460.91.03                       Yes                              Same as Kubernetes 1.16.9
    1.24.3               460.91.03                       Yes                              Same as Kubernetes 1.16.9
  • Benefits:

    This solution allows you to manage the NVIDIA drivers of different nodes in batches. The following scenarios are covered:

    • Node Pool A consists of nodes on which you want to install the NVIDIA driver of version 418.181.07. If you want to schedule a task to a node that runs the NVIDIA driver of version 418.181.07, you only need to set the selector of the task to the node label of Node Pool A, as shown in the example after this list.
    • You manage cluster nodes in two groups: A and B. You want to install the NVIDIA driver of version 418.181.07 in Group A and the NVIDIA driver of version 450.102.04 in Group B. In this case, you can add the nodes in Group A to Node Pool A and the nodes in Group B to Node Pool B.
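
The following is a minimal sketch of the first scenario. It assumes that the nodes in Node Pool A carry the ack.aliyun.com/nvidia-driver-version=418.181.07 label, which is the label that Method 1 in this topic adds to the node pool. The pod name and container image are examples only; replace them with your own.

# Schedule a GPU task only to nodes that run NVIDIA driver 418.181.07 by using a nodeSelector.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-task                       # example name
spec:
  nodeSelector:
    ack.aliyun.com/nvidia-driver-version: "418.181.07"
  containers:
  - name: cuda-task
    image: nvidia/cuda:10.1-base        # example image; replace with the image of your workload
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1               # request one GPU
EOF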

Step 1: Determine the NVIDIA driver version

You must first determine which NVIDIA driver versions are compatible with the CUDA toolkit that you use. The following table lists CUDA Toolkit versions and the minimum NVIDIA driver versions that they require. For more information, see cuda-toolkit-release-notes.

CUDA Toolkit Linux x86_64 Driver Version
CUDA 11.7 Update 1 ≥515.65.01
CUDA 11.7 GA ≥515.43.04
CUDA 11.6 Update 2 ≥510.47.03
CUDA 11.6 Update 1 ≥510.47.03
CUDA 11.6 GA ≥510.39.01
CUDA 11.5 Update 2 ≥495.29.05
CUDA 11.5 Update 1 ≥495.29.05
CUDA 11.5 GA ≥495.29.05
CUDA 11.4 Update 4 ≥470.82.01
CUDA 11.4 Update 3 ≥470.82.01
CUDA 11.4 Update 2 ≥470.57.02
CUDA 11.4 Update 1 ≥470.57.02
CUDA 11.4.0 GA ≥470.42.01
CUDA 11.3.1 Update 1 ≥465.19.01
CUDA 11.3.0 GA ≥465.19.01
CUDA 11.2.2 Update 2 ≥460.32.03
CUDA 11.2.1 Update 1 ≥460.32.03
CUDA 11.2.0 GA ≥460.27.03
CUDA 11.1.1 Update 1 ≥455.32
CUDA 11.1 GA ≥455.23
CUDA 11.0.3 Update 1 ≥450.51.06
CUDA 11.0.2 GA ≥450.51.05
CUDA 11.0.1 RC ≥450.36.06
CUDA 10.2.89 ≥440.33
CUDA 10.1 (10.1.105 general release, and updates) ≥418.39
CUDA 10.0.130 ≥410.48
CUDA 9.2 (9.2.148 Update 1) ≥396.37
CUDA 9.2 (9.2.88) ≥396.26
CUDA 9.1 (9.1.85) ≥390.46
CUDA 9.0 (9.0.76) ≥384.81
CUDA 8.0 (8.0.61 GA2) ≥375.26
CUDA 8.0 (8.0.44) ≥367.48
CUDA 7.5 (7.5.16) ≥352.31
CUDA 7.0 (7.0.28) ≥346.46
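
If you are not sure which CUDA toolkit version your workload uses, you can query it inside the container image and then look up the minimum required driver version in the preceding table. The following is a minimal sketch that assumes Docker is installed and that the image contains the CUDA toolkit; <your-application-image> is a placeholder for your own image.

  # Print the CUDA toolkit version that is bundled in the image. No GPU is required for this check.
  docker run --rm <your-application-image> nvcc --version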

Step 2: Create a node pool and specify an NVIDIA driver version

Method 1: Select an NVIDIA driver version that is supported by ACK to create a node pool

Note This method is simple. You only need to add the ack.aliyun.com/nvidia-driver-version=<Driver Version> label when you create the node pool, and then add the nodes that you removed from the cluster back to the node pool.

The following example shows how to select the NVIDIA driver version 418.181.07 when you create a node pool:

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
  5. In the upper-right corner of the Node Pools page, click Create Node Pool.
  6. In the Create Node Pool dialog box, configure the parameters. For more information about the parameters, see Create an ACK dedicated cluster. The following list describes some of the parameters.
    1. Click Show Advanced Options.
    2. In the Node Label section, click the add icon, set Key to ack.aliyun.com/nvidia-driver-version, and then set Value to 418.181.07.
      Note The Elastic Compute Service (ECS) instance types ecs.ebmgn7 and ecs.ebmgn7e support only NVIDIA driver versions later than 460.32.03.
    3. After you set the parameters, click Confirm Order.
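
After the nodes in the node pool are initialized, you can check whether the label is applied. The following command is a minimal sketch that lists the cluster nodes together with the value of the ack.aliyun.com/nvidia-driver-version label; nodes in the new node pool are expected to show 418.181.07 in that column.

  kubectl get nodes -L ack.aliyun.com/nvidia-driver-version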

Method 2: Use a custom NVIDIA driver version to create a node pool

Prepare a package of a custom NVIDIA driver version

The following example shows how to upload the NVIDIA-Linux-x86_64-460.32.03.run package to an Object Storage Service (OSS) bucket. You can download the NVIDIA driver package from the NVIDIA official website.

Note The NVIDIA-Linux-x86_64-460.32.03.run package must be stored in the root directory of an OSS bucket.
  1. Create an OSS bucket in the OSS console. For more information, see Create buckets.
  2. Upload the NVIDIA-Linux-x86_64-460.32.03.run package to the OSS bucket. For more information, see Upload objects.
    Note If the ECS instance type is ecs.ebmgn7 or ecs.ebmgn7e, you must also upload the NVIDIA Fabric Manager package to the OSS bucket. The version of the uploaded NVIDIA Fabric Manager package must match the version of the uploaded NVIDIA driver package. The following limits apply when you upload the NVIDIA Fabric Manager package:
    • If the Kubernetes version of the cluster is 1.18.8-aliyun.1, you must upload the NVIDIA Fabric Manager package in RPM format. For more information, see RPM.
    • If the Kubernetes version of the cluster is 1.20.11-aliyun.1 or 1.22.3-aliyun.1, you must upload the NVIDIA Fabric Manager package in tar.gz format. For more information, see tar.
  3. After the package is uploaded to the bucket, click Files in the left-side navigation pane of the bucket details page.
  4. On the Files page, find the package that you uploaded and click View Details in the Actions column.
  5. In the View Details panel, turn off HTTPS.
    Note ACK downloads the NVIDIA driver package from its URL, which uses the HTTP protocol. By default, OSS serves objects over HTTPS. You must turn off HTTPS so that the package can be downloaded over HTTP.
  6. Check and record the URL of the NVIDIA driver package. The URL consists of two parts: the endpoint and the runfile. In this example, the URL is http://nvidia-XXX-XXX-cn-beijing.aliyuncs.com/NVIDIA-Linux-x86_64-460.32.03.run, which consists of the following parts:
    • endpoint: nvidia-XXX-XXX-cn-beijing.aliyuncs.com
    • runfile: NVIDIA-Linux-x86_64-460.32.03.run
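
If you prefer the command line over the console, the following sketch uploads the package with the ossutil tool and then verifies that it is reachable over HTTP. The bucket name is a placeholder, the endpoint is masked as in the example URL, and ossutil must already be configured with your credentials.

  # Upload the driver package to the root directory of the bucket.
  ossutil cp NVIDIA-Linux-x86_64-460.32.03.run oss://<your-bucket>/
  # Send a HEAD request to confirm that the package can be downloaded over HTTP.
  curl -I http://nvidia-XXX-XXX-cn-beijing.aliyuncs.com/NVIDIA-Linux-x86_64-460.32.03.run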

Create a node pool that runs a custom NVIDIA driver version

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
  4. In the left-side navigation pane of the details page, choose Nodes > Node Pools.
  5. In the upper-right corner of the Node Pools page, click Create Node Pool.
  6. In the Create Node Pool dialog box, configure the parameters. For more information about the parameters, see Create an ACK dedicated cluster. The following list describes some of the parameters.
    1. Click Show Advanced Options.
    2. In the Node Label section, click the add icon to add the following labels:
      • For the first label, set Key to ack.aliyun.com/nvidia-driver-oss-endpoint and Value to nvidia-XXX-XXX-cn-beijing.aliyuncs.com.
      • For the second label, set Key to ack.aliyun.com/nvidia-driver-runfile and Value to NVIDIA-Linux-x86_64-460.32.03.run.
      • For the third label, set Key and Value based on the Kubernetes version of the cluster.
        • If the Kubernetes version of the cluster is 1.18.8-aliyun.1, set Key to ack.aliyun.com/nvidia-fabricmanager-rpm and Value to nvidia-fabric-manager-460.32.03-1.x86_64.rpm.
        • If the Kubernetes version of the cluster is 1.20.11-aliyun.1 or 1.22.3-aliyun.1, set Key to ack.aliyun.com/nvidia-fabricmanager-tarball and Value to fabricmanager-linux-x86_64-460.32.03.tar.gz.
    3. After you set the parameters, click Confirm Order.
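
After the nodes in the node pool are initialized, you can check whether both labels are applied. The following command is a minimal sketch that lists the cluster nodes together with the values of the two labels:

  kubectl get nodes -L ack.aliyun.com/nvidia-driver-oss-endpoint -L ack.aliyun.com/nvidia-driver-runfile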

Method 3: Call an API operation to create a node pool that runs a specified NVIDIA driver version

You can call the CreateClusterNodePool operation to create a node pool that runs a specified NVIDIA driver version. For more information, see Create a node pool.

1. Specify an NVIDIA driver version in the API request

When you call an API operation to create or scale out a cluster, you only need to add the ack.aliyun.com/nvidia-driver-version label to the tags field in the request body. Example:
{
  // Other fields are not shown.
  "tags": [
    {
      "key": "ack.aliyun.com/nvidia-driver-version",
      "value": "410.104"
    }
  ],
  // Other fields are not shown.
}

2. Specify the URL of the NVIDIA driver package in the API request

Specify the endpoint and runfile values of the URL in the API request. These values indicate the location of the NVIDIA driver package in OSS. For more information about how to obtain the endpoint and runfile values of an object that is stored in an OSS bucket, see Check the URL of the NVIDIA driver package. Example:
{
  // Other fields are not shown.
  "tags": [
    {
      "key": "ack.aliyun.com/nvidia-driver-oss-endpoint",
      "value": "nvidia-XXX-XXX-cn-beijing.aliyuncs.com"
    },
    {
      "key": "ack.aliyun.com/nvidia-driver-runfile",
      "value": "NVIDIA-Linux-x86_64-410.104.run"
    }
  ],
  // Other fields are not shown.
}
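
The following is a minimal sketch of sending such a request with the Alibaba Cloud CLI (aliyun). It assumes that the CLI is installed and configured, that <cluster-id> is replaced with the ID of your cluster, and that nodepool.json is a hypothetical file that contains the full CreateClusterNodePool request body, including the tags shown above.

  # Create the node pool by calling the CreateClusterNodePool operation.
  aliyun cs POST /clusters/<cluster-id>/nodepools \
    --header "Content-Type=application/json" \
    --body "$(cat nodepool.json)"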

Step 3: Check whether the custom NVIDIA driver version is installed

  1. Log on to the ACK console.
  2. In the left-side navigation pane of the ACK console, click Clusters.
  3. On the Clusters page, find the cluster that you want to manage and choose More > Open Cloud Shell in the Actions column.
  4. Run the following command to query the pods that have the component=nvidia-device-plugin label:
    kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide

    Expected output:

    NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
    nvidia-device-plugin-cn-beijing.192.168.1.127   1/1     Running   0          6d    192.168.1.127   cn-beijing.192.168.1.127   <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.1.128   1/1     Running   0          17m   192.168.1.128   cn-beijing.192.168.1.128   <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.12    1/1     Running   0          9d    192.168.8.12    cn-beijing.192.168.8.12    <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.13    1/1     Running   0          9d    192.168.8.13    cn-beijing.192.168.8.13    <none>           <none>
    nvidia-device-plugin-cn-beijing.192.168.8.14    1/1     Running   0          9d    192.168.8.14    cn-beijing.192.168.8.14    <none>           <none>

    Check the NODE column to find the node that was newly added to the cluster. In this example, the new node is cn-beijing.192.168.1.128, and the pod that runs on it is nvidia-device-plugin-cn-beijing.192.168.1.128.

  5. Run the following command to query the NVIDIA driver version of the node:
    kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- nvidia-smi 

    Expected output:

    Sun Feb  7 04:09:01 2021       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.181.07   Driver Version: 418.181.07   CUDA Version: N/A      |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
    | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
    | N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   31C    P0    39W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
    | N/A   27C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    The output shows that the NVIDIA driver version is 418.181.07. This indicates that the specified NVIDIA driver is installed.