A time-of-check to time-of-use (TOCTOU) race condition vulnerability has been identified in NVIDIA Container Toolkit versions 1.17.7 and earlier when used with default configurations. This vulnerability does not affect deployments using the Container Device Interface (CDI). If exploited, this vulnerability could lead to container escape, allowing attackers to execute arbitrary commands on the host or access sensitive host system information. Known attack scenarios require the victim to run malicious container images and use GPU resources within containers via the NVIDIA Container Toolkit.
For official details, see NVIDIA Container Toolkit. Immediate remediation is required for affected clusters.
Affected scope
This vulnerability affects clusters with Kubernetes versions below 1.32 if the GPU-accelerated nodes have NVIDIA Container Toolkit version 1.17.7 or earlier installed.
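The scope rule above can be written as a small version check. This is a minimal sketch, not an official detection tool: the function names are hypothetical, and it assumes you have already collected the Kubernetes version and the installed NVIDIA Container Toolkit version as plain strings. It does not account for the CDI exception described earlier.

```python
import re

# Thresholds taken from the advisory text.
LAST_AFFECTED_TOOLKIT = (1, 17, 7)       # toolkit 1.17.7 and earlier are affected
FIRST_UNAFFECTED_KUBERNETES = (1, 32)    # clusters below 1.32 are in scope


def parse_version(text: str) -> tuple:
    """Turn '1.17.7' or 'v1.31.4-aliyun.1' into a tuple of ints, ignoring any suffix."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_in_affected_scope(kubernetes_version: str, toolkit_version: str) -> bool:
    """Hypothetical helper: True when both version conditions from the advisory hold."""
    k8s_major_minor = parse_version(kubernetes_version)[:2]
    toolkit = parse_version(toolkit_version)
    return (k8s_major_minor < FIRST_UNAFFECTED_KUBERNETES
            and toolkit <= LAST_AFFECTED_TOOLKIT)
```

For example, `is_in_affected_scope("1.31.4", "1.17.7")` evaluates to true, while the same cluster with toolkit 1.17.8 does not match.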
Preventative measures
Until the vulnerability is fully patched, we recommend that you do not run untrusted container images in the cluster to ensure system security and stability. The following methods can be used:
Enable the ACKAllowedRepos policy to use images in the trusted repositories and ensure that only trusted personnel have permission to import images based on the principle of least privilege. For more information, see Enable the policy governance feature.
For more information about how to deploy only trusted images in the cluster, see Use Notation and Ratify for OCI artifact signing and signature verification.
Solutions
Precautions
Fixes apply only to:
Cloud node pools of ACK managed Pro clusters, ACK managed Basic clusters, ACK dedicated clusters, and ACK Edge clusters.
Managed node pools of ACK Lingjun clusters.
For ACK Lingjun clusters using Lingjun node pools, submit a ticket for assistance.
Fix nodes in batches during off-peak hours to maintain stability. Do not fix all nodes at the same time.
Note: This fix will restart running application pods on the nodes.
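One way to plan a batched rollout is to split the list of GPU-accelerated nodes into fixed-size groups and fix one group at a time. The helper below is only a sketch: the function name, the node names, and the batch size are illustrative assumptions, and it does not apply the fix itself.

```python
from typing import Iterator, List, Sequence


def node_batches(nodes: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size node names."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield list(nodes[start:start + batch_size])


# Hypothetical node names, for illustration only.
gpu_nodes = ["gpu-node-1", "gpu-node-2", "gpu-node-3", "gpu-node-4", "gpu-node-5"]
plan = list(node_batches(gpu_nodes, 2))
```

With five nodes and a batch size of 2, the plan contains three batches, the last one holding a single node. A smaller batch size limits how many application pods restart at once.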
Procedures
New GPU-accelerated nodes
This solution applies only to clusters running Kubernetes 1.20 or later. Upgrade your cluster if needed.
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster to manage and click its name. In the left-side navigation pane, choose .
Find the target node pool and click Edit. In the Edit Node Pool dialog box, add the ack.aliyun.com/nvidia-container-runtime-version=1.17.8 node label, then click OK.
Note: This label locks the nvidia-container-toolkit version to 1.17.8 during node pool scaling. Future toolkit releases will not trigger automatic upgrades. To use newer toolkit versions, manually remove this label. Newly scaled nodes will then default to the latest version.
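After new nodes join the node pool, you may want to confirm that they carry the pinning label. The sketch below assumes that the node pool label is propagated to each node and that you have already read a node's labels into a dictionary (for example, from the node's metadata). The function names are hypothetical; only the label key and value come from the step above.

```python
from typing import Mapping, Optional

# Label key and value as given in the procedure.
PIN_LABEL_KEY = "ack.aliyun.com/nvidia-container-runtime-version"
PINNED_VERSION = "1.17.8"


def pinned_toolkit_version(labels: Mapping[str, str]) -> Optional[str]:
    """Return the toolkit version pinned by the label, or None if the label is absent."""
    return labels.get(PIN_LABEL_KEY)


def is_pinned_to_fixed_version(labels: Mapping[str, str]) -> bool:
    """True only when the node carries the label with the value 1.17.8."""
    return pinned_toolkit_version(labels) == PINNED_VERSION
```

A node without the label returns `None` from `pinned_toolkit_version`, which matches the case where the label has been removed and newly scaled nodes default to the latest toolkit version.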
Existing GPU-accelerated nodes
Fix existing nodes manually by running the Common Vulnerabilities and Exposures (CVE) repair script.