A time-of-check to time-of-use (TOCTOU) race condition vulnerability exists in NVIDIA Container Toolkit versions 1.17.7 and earlier when used with default configurations. This vulnerability does not affect deployments using the Container Device Interface (CDI). However, if exploited, this vulnerability could lead to container escape, allowing attackers to execute arbitrary commands on the host or access sensitive host system information. Known attack scenarios require running malicious container images and using GPU resources within containers via the NVIDIA Container Toolkit. Immediate remediation is required for affected clusters.
For official details, see NVIDIA Container Toolkit.
Affected scope
Your cluster is affected if both of the following conditions are met:
The Kubernetes version is 1.32 or earlier.
At least one GPU-accelerated node is running NVIDIA Container Toolkit version 1.17.7 or earlier.
Check the NVIDIA Container Toolkit version
Log on to the GPU-accelerated node.
Run the `nvidia-container-cli --version` command. Sample output:

```
cli-version: 1.17.8
lib-version: 1.17.8
build date: 2025-05-30T13:47+00:00
build revision: 6eda4d76c8c5f8fc174e4abca83e513fb4dd63b0
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
```
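To decide quickly whether a node needs the fix, the reported `cli-version` can be compared against the patched release (1.17.8). The following is a minimal sketch that assumes the `cli-version: X.Y.Z` output format shown above; the hard-coded sample line stands in for a real `nvidia-container-cli --version` call:

```shell
# Sketch: decide whether a reported NVIDIA Container Toolkit version needs the fix.
# The patched version (1.17.8) comes from this advisory.
is_vulnerable() {
    # True (exit 0) when $1 sorts strictly below 1.17.8.
    lowest=$(printf '%s\n%s\n' "$1" "1.17.8" | sort -V | head -n1)
    [ "$lowest" = "$1" ] && [ "$1" != "1.17.8" ]
}

# On a node you would capture real output:
#   output=$(nvidia-container-cli --version)
output="cli-version: 1.17.7"   # sample line for illustration
version=$(printf '%s\n' "$output" | awk -F': ' '/^cli-version/ {print $2}')

if is_vulnerable "$version"; then
    echo "toolkit $version is affected; apply the fix"
else
    echo "toolkit $version is patched"
fi
```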
Preventive measures
Enable the container security policy rule in the policy governance feature to restrict image pulls to trusted image repositories only.
Solutions
New GPU-accelerated nodes
ACK edge clusters (Kubernetes 1.20 or later)
Nodes created on or after August 4, 2025 automatically install the patched NVIDIA Container Toolkit version (1.17.8).
No further action is required.
Clusters with Kubernetes versions earlier than 1.20
You must upgrade your cluster before creating new nodes to ensure they receive the patched version.
For instructions, see Upgrade clusters.
Existing GPU-accelerated nodes
All existing GPU-accelerated nodes created before August 4, 2025 require a manual fix.
For cloud nodes, see Vulnerability CVE-2025-23266.
For edge nodes, follow the procedure below.
To ensure system stability, apply fixes in batches.
Manual fix procedure for edge nodes
Step 1: Drain the node
Drain the node to safely migrate its workloads to other available nodes. Before you begin, confirm you've selected the correct target edge node.
Mark the node as unschedulable:

```shell
kubectl cordon <NODE_NAME>
```

Drain the node:

```shell
kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true
```
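Because the advisory recommends applying fixes in batches, you can wrap the cordon and drain commands in a small loop that stops after each batch. The sketch below is illustrative: the node names and batch size are hypothetical, and `KUBECTL` is set to `echo` so you can dry-run the loop before pointing it at a real cluster (set it to `kubectl` for real use).

```shell
# Sketch: cordon and drain a small batch of edge nodes before patching them.
KUBECTL="echo"   # set to "kubectl" when running against a real cluster
BATCH_SIZE=2
NODES="edge-node-1 edge-node-2 edge-node-3"   # hypothetical node names

count=0
for node in $NODES; do
    "$KUBECTL" cordon "$node"
    "$KUBECTL" drain "$node" --grace-period=120 --ignore-daemonsets=true
    count=$((count + 1))
    if [ "$count" -ge "$BATCH_SIZE" ]; then
        echo "Batch of $BATCH_SIZE drained; patch these nodes, then continue."
        break
    fi
done
```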
Step 2: Apply the fix script
Log on to the affected node and run the following script.
Set the required environment variables `REGION` and `INTERCONNECT_MODE`. Replace the example values with your actual configuration.

```shell
export REGION="cn-hangzhou" INTERCONNECT_MODE="basic"; # Replace "cn-hangzhou" and "basic" with your actual configurations.
```

Parameter reference:

| Parameter | Description | Example |
| --- | --- | --- |
| REGION | Region ID of your ACK Edge cluster. See Supported regions for details. | cn-hangzhou |
| INTERCONNECT_MODE | Network access type: basic for Internet access; private for leased line access. | basic |

Run the fix script:
```shell
#!/bin/bash
set -e

if [[ $REGION == "" ]]; then
    echo "Error: REGION is null"
    exit 1
fi
if [[ $INTERCONNECT_MODE == "" ]]; then
    echo "Error: INTERCONNECT_MODE is null"
    exit 1
fi

NV_TOOLKIT_VERSION=1.17.8
INTERNAL=$( [ "$INTERCONNECT_MODE" = "private" ] && echo "-internal" || echo "" )
PACKAGE=upgrade_nvidia-container-toolkit-${NV_TOOLKIT_VERSION}.tar.gz

cd /tmp
export PKG_URL_PREFIX="http://aliacs-k8s-${REGION}.oss-${REGION}${INTERNAL}.aliyuncs.com"
curl -o ${PACKAGE} ${PKG_URL_PREFIX}/public/pkg/nvidia-container-runtime/${PACKAGE}
tar -xf ${PACKAGE}
cd pkg/nvidia-container-runtime/upgrade/common
bash upgrade-nvidia-container-toolkit.sh
```

Verify the fix.
If you see the following output, the node is not vulnerable and no changes were made:

```
2025-03-22/xxxx INFO No need to upgrade current nvidia-container-toolkit(1.17.8)
```

If you see the following output, the node was vulnerable and has been patched:

```
2025-03-22/xxxxx INFO succeed to upgrade nvidia container toolkit
```
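As an additional check after the script reports success, the installed version can be re-read and compared against the patched release. A minimal sketch, with a hard-coded sample line standing in for the real `nvidia-container-cli --version` output:

```shell
# Sketch: confirm the installed toolkit version after patching.
# On a node you would capture real output:
#   output=$(nvidia-container-cli --version)
output="cli-version: 1.17.8"   # sample line for illustration
version=$(printf '%s\n' "$output" | awk -F': ' '/^cli-version/ {print $2}')

if [ "$version" = "1.17.8" ]; then
    echo "patched toolkit $version confirmed"
else
    echo "unexpected toolkit version: $version" >&2
    exit 1
fi
```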
Step 3: Re-enable the node
Restore the node's scheduling status:

```shell
kubectl uncordon <NODE_NAME>
```

Step 4 (optional): Test the node
After applying the fix, deploy a GPU workload to confirm the node is functioning correctly. Use sample YAML templates from:
Exclusive GPU: Use the default GPU scheduling mode.
Shared GPU: Use the GPU sharing examples.
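For a quick smoke test in the default (exclusive) GPU scheduling mode, a minimal pod such as the following can be applied with `kubectl apply -f`. This is a sketch only: the pod name and image tag are illustrative, and the samples linked above remain the reference templates. If the patched toolkit is working, `nvidia-smi` in the pod log should list the node's GPU.

```yaml
# Minimal sketch of a GPU smoke-test pod (exclusive GPU mode).
# Pod name and image are illustrative, not from the advisory.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```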