NVIDIA Container Toolkit 1.17.3 and earlier contain a security vulnerability in the handling of Compute Unified Device Architecture (CUDA) forward compatibility. When a container image contains maliciously crafted symbolic links, libnvidia-container incorrectly mounts host directories into the container in read-only mode. Attackers can exploit this vulnerability to bypass container isolation, which may lead to theft of sensitive information or privilege escalation on the host. For more information about this vulnerability, see NVIDIA Container Toolkit. Fix this vulnerability at the earliest opportunity.
Affected versions
This vulnerability affects clusters that run Kubernetes earlier than 1.32 and that have GPU-accelerated nodes with NVIDIA Container Toolkit 1.17.3 or earlier installed.
You can run nvidia-container-cli --version to check the component version.
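For example, the following commands show one way to locate GPU-accelerated nodes and check the installed component version. The aliyun.accelerator/nvidia_name node label and the sample output are illustrative assumptions; your labels and version strings may differ.

# List GPU-accelerated nodes, assuming ACK applies the aliyun.accelerator/nvidia_name label to them:
kubectl get nodes -l aliyun.accelerator/nvidia_name
# Then, on each GPU-accelerated node, check the installed component version:
nvidia-container-cli --version
# Sample output (illustrative):
# cli-version: 1.17.3
# lib-version: 1.17.3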
How to prevent
Until the vulnerability is fixed, we recommend that you do not run untrusted container images in the cluster, to ensure system security and stability. You can use the following methods:
Enable the ACKAllowedRepos policy so that only images from trusted repositories can be used, and, based on the principle of least privilege, ensure that only trusted personnel have permission to import images. For more information, see Enable the policy governance feature. A sketch of such a constraint is shown after this list.
For more information about how to deploy only trusted images in the cluster, see Use Notation and Ratify for OCI artifact signing and signature verification.
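As an illustration only, the following is a minimal sketch of what applying an ACKAllowedRepos constraint might look like after the policy governance feature is enabled. The matched resource kinds and the repository prefix are placeholder assumptions; refer to Enable the policy governance feature for the exact constraint schema supported by your cluster.

kubectl apply -f - <<'EOF'
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: ACKAllowedRepos
metadata:
  name: only-trusted-repos
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    repos:
      # Placeholder: replace with your own trusted registry or namespace prefix.
      - "registry.cn-hangzhou.aliyuncs.com/my-trusted-namespace/"
EOF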
Solution
Usage notes
This solution is applicable to ACK managed Pro clusters, ACK managed Basic clusters, ACK dedicated clusters, cloud node pools in ACK Edge clusters, and managed node pools in ACK Lingjun clusters.
If your cluster is an ACK Lingjun cluster and the node pool is a Lingjun node pool, submit a ticket.
Patch the nodes in batches. To maintain system stability, do not patch all nodes at the same time.
The fix works by restarting the application pods running on the node. Perform the fix during off-peak hours (see the cordon and drain sketch after these notes).
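Because running pods are restarted, you can optionally cordon and drain a node before fixing it so that its workloads are rescheduled gracefully. This is a minimal sketch; the node name is a placeholder, and the drain flags should be adjusted to your workloads.

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# After the node is fixed and verified, allow scheduling again:
kubectl uncordon <node-name>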
Configuration solution for new GPU-accelerated nodes
This solution is applicable to clusters that run Kubernetes 1.20 or later. If the Kubernetes version of your cluster is earlier than 1.20, see Upgrade clusters.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.
Find the node pool that you want to manage and click Edit. Add the following label to the node pool, and then click Confirm:
ack.aliyun.com/nvidia-container-runtime-version=1.17.5
Note: This label pins the nvidia-container-toolkit version used when the node pool is scaled out to 1.17.5. The version is not automatically upgraded when new versions are released. If you want to use a newer version of nvidia-container-toolkit, you must manually delete this label; nodes added in subsequent scale-out activities then use the latest version by default.
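To verify the result, you can check whether the label is present on the nodes in the pool and confirm the toolkit version on newly added nodes. This sketch assumes that the node pool label is propagated to the node labels; the node name is a placeholder.

kubectl get nodes -L ack.aliyun.com/nvidia-container-runtime-version
# On a newly added GPU-accelerated node, confirm the installed version:
nvidia-container-cli --version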
Fix solution for existing GPU-accelerated nodes
For existing GPU-accelerated nodes, you can fix the issue manually by running the CVE fix script. The following section describes the fix in detail.