NVIDIA has disclosed the CVE-2021-1056 vulnerability, a flaw in NVIDIA GPU drivers that affects device isolation. Elastic GPU Service (EGS) instances that are deployed in a Container Service for Kubernetes (ACK) cluster may also be exposed to this vulnerability. This topic describes the background information, impact, and fixes for this vulnerability.

Background information

The CVE-2021-1056 vulnerability affects device isolation in NVIDIA GPU drivers. It allows an attacker to gain access to all GPU devices on a node by creating character device files in non-privileged containers that run on that node.

For more information about this vulnerability, see CVE-2021-1056.
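To illustrate the mechanism that this CVE describes, the following sketch shows the kind of character device file creation that a vulnerable driver fails to guard against. The device numbers shown are the ones conventionally used by the NVIDIA driver; this is a simplified illustration of the issue, not an exploitation guide.

# Illustration only: inside a non-privileged container, character device
# files can be created for GPUs that were not assigned to the container.
# Major number 195 is conventionally used by the NVIDIA driver.
mknod /dev/nvidia0   c 195 0
mknod /dev/nvidiactl c 195 255
# With a vulnerable driver, these device files can then be used to access
# GPU devices on the host that should be isolated from the container.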

Affected versions

The affected NVIDIA GPU driver versions are listed in the following figure, which is based on information published by NVIDIA. For more information, see the NVIDIA official website.
  • If you installed a custom NVIDIA driver or upgraded the NVIDIA driver yourself, check the preceding figure to determine whether the installed driver version is affected by this vulnerability.
  • If the NVIDIA driver was installed by default for the ACK cluster, check whether your ACK cluster version is affected by this vulnerability. A version check is sketched after the note below. The following ACK cluster versions are affected:
    • ACK 1.16.9-aliyun.1: NVIDIA driver version 418.87.01 is installed by default.
    • ACK 1.18.8-aliyun.1: NVIDIA driver version 418.87.01 is installed by default.
Note The NVIDIA GPU drivers that are installed by default in other ACK cluster versions are not affected. The Alibaba Cloud ACK team will keep you informed of further updates to this CVE and help you fix the vulnerability.
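If you are not sure which version your ACK cluster runs, the following commands show one way to check, assuming that kubectl is configured to access the cluster. The node list reports the kubelet version of each node, for example v1.16.9-aliyun.1.

# Print the Kubernetes version of the cluster and the kubelet version of each node.
kubectl version
kubectl get nodes -o wide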

Verify the version of the NVIDIA driver on a GPU-accelerated node

Log on to the GPU-accelerated node and run the following command to query the version of the NVIDIA driver.
Note For more information about how to log on to a GPU-accelerated node, see Connect to a Linux instance by using password authentication and Connect to a Windows instance by using password authentication.
nvidia-smi

Expected output:

Fri Apr 16 10:58:19 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    37W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The output indicates that the version of the NVIDIA driver is 418.87.01.
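If you only need the driver version, you can also query it directly. The following command uses the query options provided by nvidia-smi and prints only the driver version.

nvidia-smi --query-gpu=driver_version --format=csv,noheader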

Fixes

Notice When you upgrade the NVIDIA driver for a node, the node must be restarted. This disrupts the services that are deployed on the node.

Upgrade the NVIDIA driver to the fixed version for its branch based on the preceding figure. A quick way to determine the branch of the installed driver is sketched after the following list.

  • If your NVIDIA driver belongs to the R390 branch, upgrade it to version 390.141.
  • If your NVIDIA driver belongs to the R418 branch, upgrade it to version 418.181.07.
  • If your NVIDIA driver belongs to the R450 branch, upgrade it to version 450.102.04.
  • If your NVIDIA driver belongs to the R460 branch, upgrade it to version 460.32.03.
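The following sketch reads the installed driver version on a GPU-accelerated node and derives its branch (the major version number, for example R418 for 418.87.01) so that you can match it against the fixed versions listed above. It assumes that nvidia-smi is available on the node.

# Print the installed driver version and its branch for comparison with
# the fixed versions listed above.
ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
branch="R${ver%%.*}"
echo "Installed NVIDIA driver: ${ver} (branch ${branch})"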

For more information about how to upgrade the NVIDIA driver, see Use a node pool to upgrade the NVIDIA driver for a node, Manually upgrade the NVIDIA driver for a node, and Use a node pool to create a node with a custom NVIDIA driver version.