CVE-2025-23266 is a critical time-of-check to time-of-use (TOCTOU) race condition in NVIDIA Container Toolkit 1.17.7 and earlier. If exploited, an attacker can escape a container and execute arbitrary commands on the host or access sensitive host data. Immediate remediation is required for affected clusters.
For the official vulnerability disclosure, see NVIDIA Security Bulletin 5659.
Affected scope
Your cluster is affected if both conditions are true:
Kubernetes version is 1.32 or earlier
At least one GPU-accelerated node runs NVIDIA Container Toolkit 1.17.7 or earlier
This vulnerability does not affect deployments that use the Container Device Interface (CDI).
Known attack scenarios require running a malicious container image and accessing GPU resources through the NVIDIA Container Toolkit.
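To confirm the CDI exemption on a given node, you can inspect the toolkit's runtime configuration. The following is a minimal sketch, assuming the toolkit's default config path (`/etc/nvidia-container-runtime/config.toml`) and its `mode` option; pass a different path as the first argument if your node stores the config elsewhere:

```shell
#!/bin/sh
# Sketch: report whether a node's NVIDIA runtime config selects CDI mode.
# Assumes the toolkit's default config path and `mode` option (auto/legacy/cdi).
uses_cdi() {
  conf="${1:-/etc/nvidia-container-runtime/config.toml}"
  [ -r "$conf" ] && grep -Eq '^[[:space:]]*mode[[:space:]]*=[[:space:]]*"cdi"' "$conf"
}

if uses_cdi; then
  echo "CDI mode in use: this deployment is not affected by CVE-2025-23266"
else
  echo "legacy/auto mode (or no config found): check the toolkit version"
fi
```

Nodes that report legacy or auto mode should proceed to the version check below.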
Check the NVIDIA Container Toolkit version
Log on to each GPU-accelerated node and run:
```
nvidia-container-cli --version
```

Sample output (unaffected version):

```
cli-version: 1.17.8
lib-version: 1.17.8
build date: 2025-05-30T13:47+00:00
build revision: 6eda4d76c8c5f8fc174e4abca83e513fb4dd63b0
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
```

If cli-version is 1.17.7 or earlier, the node is affected.
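When checking many nodes, the "1.17.7 or earlier" comparison can be automated with version-aware sorting. A minimal sketch, assuming GNU `sort -V` is available on the node (the helper name `is_affected` is illustrative):

```shell
#!/bin/sh
# Sketch: succeed when a reported cli-version is affected (1.17.7 or earlier),
# i.e. when it sorts below the first patched release, 1.17.8.
# Assumes GNU coreutils `sort -V` for version ordering.
PATCHED="1.17.8"

is_affected() {
  ver="$1"
  [ "$ver" = "$PATCHED" ] && return 1
  # If the candidate version sorts first, it is older than the patched release.
  [ "$(printf '%s\n' "$ver" "$PATCHED" | sort -V | head -n1)" = "$ver" ]
}

for v in 1.17.7 1.17.8 1.18.0; do
  if is_affected "$v"; then echo "$v: affected"; else echo "$v: unaffected"; fi
done
```

On a node, feed it the value reported by `nvidia-container-cli --version` to decide whether the fix below is needed.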
Preventive measures
While you prepare to apply the fix, restrict image pulls to trusted registries only. Enable the container security policy rule in the policy governance feature.
Solution
New GPU-accelerated nodes
ACK edge clusters running Kubernetes 1.20 or later
Nodes created on or after August 4, 2025 automatically install the patched version (1.17.8) of the NVIDIA Container Toolkit. No further action is required.
Clusters running Kubernetes earlier than 1.20
Upgrade the cluster before creating new nodes to ensure new nodes receive the patched version. See Upgrade clusters.
Existing GPU-accelerated nodes
All GPU-accelerated nodes created before August 4, 2025 require a manual fix.
Cloud nodes: Follow Vulnerability CVE-2025-23266.
Edge nodes: Follow the procedure below.
Apply fixes in batches to maintain system stability.
Fix edge nodes
The fix procedure drains each node, upgrades the NVIDIA Container Toolkit to version 1.17.8, and then restores the node to service.
Prerequisites
Before you begin, ensure that you have:
SSH access to the target GPU-accelerated edge node
kubectl access to the cluster with sufficient permissions to cordon and drain nodes
Step 1: Drain the node
Draining a node safely migrates its workloads to other available nodes before you make changes.
Mark the node as unschedulable:
```
kubectl cordon <NODE_NAME>
```

Drain the node:

```
kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true
```
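When fixing nodes in batches, the cordon-and-drain pair can be wrapped so that the sequence can be rehearsed before touching a live cluster. A sketch, assuming the kubectl access from Prerequisites; the `KUBECTL` override is a hypothetical convenience, not part of this guide:

```shell
#!/bin/sh
# Sketch: cordon then drain one node. Set KUBECTL=echo to print the commands
# instead of executing them (dry run); otherwise the real kubectl is used.
drain_node() {
  node="$1"
  ${KUBECTL:-kubectl} cordon "$node" || return 1
  # DaemonSet pods cannot be evicted, hence --ignore-daemonsets.
  ${KUBECTL:-kubectl} drain "$node" --grace-period=120 --ignore-daemonsets=true
}

# Dry run: print the commands that would run against node gpu-node-1.
KUBECTL=echo drain_node gpu-node-1
```

Looping `drain_node` over a small batch of node names, patching them, and then uncordoning keeps the rest of the pool serving workloads, in line with the batching advice above.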
Step 2: Apply the fix
Log on to the affected node and run the following commands.
Set the REGION and INTERCONNECT_MODE environment variables. Replace the example values with your actual configuration:

| Parameter | Description | Example |
| --- | --- | --- |
| REGION | Region ID of your ACK edge cluster. See Supported regions. | cn-hangzhou |
| INTERCONNECT_MODE | Network access type: basic (Internet access) or private (leased line access). | basic |

```
export REGION="cn-hangzhou" INTERCONNECT_MODE="basic"
```

Run the fix script:
```
#!/bin/bash
set -e

if [[ $REGION == "" ]]; then
  echo "Error: REGION is null"
  exit 1
fi
if [[ $INTERCONNECT_MODE == "" ]]; then
  echo "Error: INTERCONNECT_MODE is null"
  exit 1
fi

NV_TOOLKIT_VERSION=1.17.8
INTERNAL=$( [ "$INTERCONNECT_MODE" = "private" ] && echo "-internal" || echo "" )
PACKAGE=upgrade_nvidia-container-toolkit-${NV_TOOLKIT_VERSION}.tar.gz

cd /tmp
export PKG_URL_PREFIX="http://aliacs-k8s-${REGION}.oss-${REGION}${INTERNAL}.aliyuncs.com"
curl -o ${PACKAGE} ${PKG_URL_PREFIX}/public/pkg/nvidia-container-runtime/${PACKAGE}
tar -xf ${PACKAGE}
cd pkg/nvidia-container-runtime/upgrade/common
bash upgrade-nvidia-container-toolkit.sh
```

Verify the output:
| Output | Meaning |
| --- | --- |
| INFO No need to upgrade current nvidia-container-toolkit(1.17.8) | Node was already running the patched version. No changes were made. |
| INFO succeed to upgrade nvidia container toolkit | Node was vulnerable and has been successfully patched. |
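After the script completes, the node should report the patched version. The check can be scripted by extracting the cli-version field; in this sketch, sample output stands in for the real CLI so the parsing can be exercised anywhere, while on a node you would pipe `nvidia-container-cli --version` instead:

```shell
#!/bin/sh
# Sketch: pull the cli-version field out of `nvidia-container-cli --version`
# output and compare it against the patched release.
cli_version() {
  awk -F': ' '$1 == "cli-version" {print $2}'
}

# Sample output stands in for the real CLI here; on a node, use:
#   nvidia-container-cli --version | cli_version
ver=$(printf 'cli-version: 1.17.8\nlib-version: 1.17.8\n' | cli_version)

if [ "$ver" = "1.17.8" ]; then
  echo "patched: $ver"
else
  echo "still affected: $ver" >&2
fi
```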
Step 3: Restore the node
Return the node to service:
```
kubectl uncordon <NODE_NAME>
```

Step 4 (optional): Verify GPU functionality
Deploy a GPU workload to confirm the node is functioning correctly after the fix. Use the sample YAML templates from:
Exclusive GPU: Use the default GPU scheduling mode
Shared GPU: Examples of using GPU sharing to share GPUs
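As a lighter-weight alternative to the sample templates, a minimal smoke-test pod can confirm that the patched toolkit still exposes GPUs to containers. The pod name and CUDA image below are illustrative placeholders, not values from this guide:

```yaml
# Hypothetical smoke-test pod: requests one GPU and runs nvidia-smi once.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder CUDA image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Apply it with `kubectl apply -f`, then check `kubectl logs gpu-smoke-test`; the pod should reach the Succeeded phase and its log should show the `nvidia-smi` device table for the node's GPU.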