A time-of-check to time-of-use (TOCTOU) race condition vulnerability exists in NVIDIA Container Toolkit versions 1.17.7 and earlier when used with default configurations. This vulnerability does not affect deployments using the Container Device Interface (CDI). However, if exploited, this vulnerability could lead to container escape, allowing attackers to execute arbitrary commands on the host or access sensitive host system information. Known attack scenarios require running malicious container images and using GPU resources within containers via the NVIDIA Container Toolkit. Immediate remediation is required for affected clusters.
For official details, see NVIDIA Container Toolkit.
Affected scope
Your cluster is affected if both of the following conditions are met:
The Kubernetes version is 1.32 or earlier.
At least one GPU-accelerated node is running NVIDIA Container Toolkit version 1.17.7 or earlier.
Check the NVIDIA Container Toolkit version
Log on to the GPU-accelerated node.
Run the `nvidia-container-cli --version` command. Sample output:

```
cli-version: 1.17.8
lib-version: 1.17.8
build date: 2025-05-30T13:47+00:00
build revision: 6eda4d76c8c5f8fc174e4abca83e513fb4dd63b0
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
```
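To decide quickly whether a node needs the fix, the reported `cli-version` can be compared against the patched release (1.17.8). The following is a minimal sketch that assumes the `cli-version: X.Y.Z` output format shown above; the hard-coded sample line stands in for a real `nvidia-container-cli --version` call:

```shell
# Sketch: decide whether a reported NVIDIA Container Toolkit version needs the fix.
# The patched version (1.17.8) comes from this advisory.
is_vulnerable() {
    # True (exit 0) when $1 sorts strictly below 1.17.8.
    lowest=$(printf '%s\n%s\n' "$1" "1.17.8" | sort -V | head -n1)
    [ "$lowest" = "$1" ] && [ "$1" != "1.17.8" ]
}

# On a node you would capture real output:
#   output=$(nvidia-container-cli --version)
output="cli-version: 1.17.7"   # sample line for illustration
version=$(printf '%s\n' "$output" | awk -F': ' '/^cli-version/ {print $2}')

if is_vulnerable "$version"; then
    echo "toolkit $version is affected; apply the fix"
else
    echo "toolkit $version is patched"
fi
```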
Preventive measures
Enable the container security policy rule in the policy governance feature to restrict image pulls to trusted image repositories only.
Solutions
New GPU-accelerated nodes
ACK edge clusters (Kubernetes 1.20 or later)
Nodes created on or after August 4, 2025 automatically install the patched NVIDIA Container Toolkit version (1.17.8).
No further action is required.
Clusters with Kubernetes versions earlier than 1.20
You must upgrade your cluster before creating new nodes to ensure they receive the patched version.
For instructions, see Upgrade clusters.
Existing GPU-accelerated nodes
All existing GPU-accelerated nodes created before August 4, 2025 require a manual fix.
For cloud nodes, see Vulnerability CVE-2025-23266.
For edge nodes, follow the procedure below.
To ensure system stability, apply fixes in batches.
Manual fix procedure for edge nodes
Step 1: Drain the node
Drain the node to safely migrate its workloads to other available nodes. Before you begin, confirm you've selected the correct target edge node.
Mark the node as unschedulable:

```shell
kubectl cordon <NODE_NAME>
```

Drain the node:

```shell
kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true
```
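Because the advisory recommends applying fixes in batches, you can wrap the cordon and drain commands in a small loop that stops after each batch. The sketch below is illustrative: the node names and batch size are hypothetical, and `KUBECTL` is set to `echo` so you can dry-run the loop before pointing it at a real cluster (set it to `kubectl` for real use).

```shell
# Sketch: cordon and drain a small batch of edge nodes before patching them.
KUBECTL="echo"   # set to "kubectl" when running against a real cluster
BATCH_SIZE=2
NODES="edge-node-1 edge-node-2 edge-node-3"   # hypothetical node names

count=0
for node in $NODES; do
    "$KUBECTL" cordon "$node"
    "$KUBECTL" drain "$node" --grace-period=120 --ignore-daemonsets=true
    count=$((count + 1))
    if [ "$count" -ge "$BATCH_SIZE" ]; then
        echo "Batch of $BATCH_SIZE drained; patch these nodes, then continue."
        break
    fi
done
```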
Step 2: Apply the fix script
Log on to the affected node and run the following script.
Set the required environment variables `REGION` and `INTERCONNECT_MODE`. Replace the example values with your actual configuration.

```shell
export REGION="cn-hangzhou" INTERCONNECT_MODE="basic"; # Replace "cn-hangzhou" and "basic" with your actual configurations.
```

Parameter reference:

| Parameter | Description | Example |
| --- | --- | --- |
| REGION | Region ID of your ACK Edge cluster. See Supported regions for details. | cn-hangzhou |
| INTERCONNECT_MODE | Network access type: basic for Internet access; private for leased line access. | basic |

Run the fix script:
```shell
#!/bin/bash
set -e

if [[ $REGION == "" ]]; then
    echo "Error: REGION is null"
    exit 1
fi
if [[ $INTERCONNECT_MODE == "" ]]; then
    echo "Error: INTERCONNECT_MODE is null"
    exit 1
fi

NV_TOOLKIT_VERSION=1.17.8
INTERNAL=$( [ "$INTERCONNECT_MODE" = "private" ] && echo "-internal" || echo "" )
PACKAGE=upgrade_nvidia-container-toolkit-${NV_TOOLKIT_VERSION}.tar.gz

cd /tmp
export PKG_URL_PREFIX="http://aliacs-k8s-${REGION}.oss-${REGION}${INTERNAL}.aliyuncs.com"
curl -o ${PACKAGE} ${PKG_URL_PREFIX}/public/pkg/nvidia-container-runtime/${PACKAGE}
tar -xf ${PACKAGE}
cd pkg/nvidia-container-runtime/upgrade/common
bash upgrade-nvidia-container-toolkit.sh
```

Verify the fix.
If you see the following output, the node is not vulnerable and no changes were made:

```
2025-03-22/xxxx INFO No need to upgrade current nvidia-container-toolkit(1.17.8)
```

If you see the following output, the node was vulnerable and has been patched:

```
2025-03-22/xxxxx INFO succeed to upgrade nvidia container toolkit
```
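As an additional check after the script reports success, the installed version can be re-read and compared against the patched release. A minimal sketch, with a hard-coded sample line standing in for the real `nvidia-container-cli --version` output:

```shell
# Sketch: confirm the installed toolkit version after patching.
# On a node you would capture real output:
#   output=$(nvidia-container-cli --version)
output="cli-version: 1.17.8"   # sample line for illustration
version=$(printf '%s\n' "$output" | awk -F': ' '/^cli-version/ {print $2}')

if [ "$version" = "1.17.8" ]; then
    echo "patched toolkit $version confirmed"
else
    echo "unexpected toolkit version: $version" >&2
    exit 1
fi
```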
Step 3: Re-enable the node
Restore the node's scheduling status:

```shell
kubectl uncordon <NODE_NAME>
```

Step 4 (optional): Test the node
After applying the fix, deploy a GPU workload to confirm the node is functioning correctly. Use sample YAML templates from:
Exclusive GPU: Use the default GPU scheduling mode.
Shared GPU: Use the GPU sharing examples.
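For a quick smoke test in the default (exclusive) GPU scheduling mode, a minimal pod such as the following can be applied with `kubectl apply -f`. This is a sketch only: the pod name and image tag are illustrative, and the samples linked above remain the reference templates. If the patched toolkit is working, `nvidia-smi` in the pod log should list the node's GPU.

```yaml
# Minimal sketch of a GPU smoke-test pod (exclusive GPU mode).
# Pod name and image are illustrative, not from the advisory.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```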