
Container Service for Kubernetes:Vulnerability CVE-2025-23266

Last Updated:Aug 05, 2025

A time-of-check to time-of-use (TOCTOU) race condition vulnerability has been identified in NVIDIA Container Toolkit versions 1.17.7 and earlier when used with default configurations. This vulnerability does not affect deployments using the Container Device Interface (CDI). If exploited, this vulnerability could lead to container escape, allowing attackers to execute arbitrary commands on the host or access sensitive host system information. Known attack scenarios require the victim to run malicious container images and use GPU resources within containers via the NVIDIA Container Toolkit.

For official details, see NVIDIA Container Toolkit. Immediate remediation is required for affected clusters.

Affected scope

This vulnerability affects clusters with Kubernetes versions below 1.32 if the GPU-accelerated nodes have NVIDIA Container Toolkit version 1.17.7 or earlier installed.

How to check the NVIDIA Container Toolkit version

Run the following command on GPU-accelerated nodes:

nvidia-container-cli --version

Sample output (version 1.17.8):

cli-version: 1.17.8
lib-version: 1.17.8
build date: 2025-05-30T13:47+00:00
build revision: 6eda4d76c8c5f8fc174e4abca83e513fb4dd63b0
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
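To decide whether a node needs the fix, compare the reported version against 1.17.8. The following is a minimal sketch using `sort -V`; the `INSTALLED` value is a placeholder that you would derive from the actual `nvidia-container-cli --version` output:

```shell
# Example version check: prints "vulnerable" for toolkit versions 1.17.7 and earlier.
# INSTALLED is a placeholder; on a real node derive it from:
#   nvidia-container-cli --version | awk -F': ' '/cli-version/ {print $2}'
INSTALLED="1.17.7"
FIXED="1.17.8"

# sort -V orders version strings numerically; if the installed version sorts
# first and differs from the fixed version, it is older than the fix.
OLDEST=$(printf '%s\n%s\n' "$INSTALLED" "$FIXED" | sort -V | head -n1)

if [ "$INSTALLED" != "$FIXED" ] && [ "$OLDEST" = "$INSTALLED" ]; then
    echo "vulnerable: nvidia-container-toolkit $INSTALLED needs the fix"
else
    echo "not affected: nvidia-container-toolkit $INSTALLED"
fi
```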

Preventative measures

Until the vulnerability is fully patched, we recommend that you do not run untrusted container images in the cluster, to ensure system security and stability. For remediation, follow the Solutions section below.

Solutions

Precautions

  • Fixes apply only to:

    • Cloud node pools of ACK managed Pro clusters, ACK managed Basic clusters, ACK dedicated clusters, and ACK Edge clusters

    • Managed node pools of ACK Lingjun clusters

  • For ACK Lingjun clusters using Lingjun node pools, submit a ticket for assistance.

  • Fix nodes in batches during off-peak hours to maintain stability. Do not fix all nodes at the same time.

    Note

    This fix will restart running application pods on the nodes.

Procedures

New GPU-accelerated nodes

This solution applies only to clusters running Kubernetes 1.20 or later. Upgrade your cluster if needed.

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  3. Find the target node pool and click Edit. In the Edit Node Pool dialog box, add the ack.aliyun.com/nvidia-container-runtime-version=1.17.8 node label, then click OK.

    Note
    • This label locks the nvidia-container-toolkit version to 1.17.8 during node pool scaling. Future toolkit releases will not trigger automatic upgrades.

    • To use newer toolkit versions, manually remove this label. Newly scaled nodes will then default to the latest version.

Existing GPU-accelerated nodes

Fix existing nodes manually by running the Common Vulnerabilities and Exposures (CVE) repair script.


Prerequisites

  • Confirm that the Alibaba Cloud account or RAM user has CloudOps Orchestration Service (OOS) permissions. For details, see AliyunOOSFullAccess.

  • Verify whether the cGPU add-on is installed in the cluster:

    How to check if cGPU is installed and how to upgrade it

    1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

    2. On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Applications > Helm.

    3. On the Helm page, check the add-on version.

      • No ack-cgpu found: The cGPU add-on is not installed.

      • Both ack-ai-installer and ack-cgpu exist: Submit a ticket for assistance.

      • ack-ai-installer exists: If its version is earlier than 1.7.5, upgrade it.

      • ack-cgpu exists: If its version is earlier than 1.5.1, click Update to the right of the add-on and follow the on-screen instructions to upgrade it.

    4. After the add-ons are upgraded, update the existing cGPU nodes in the cluster.

Step 1: Drain the node

Use the ACK console
  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster to manage and click its name. In the navigation pane on the left, choose Nodes > Nodes.

  3. On the Nodes page, select the node that you want to manage and then click Drain in the lower part of the page. In the dialog box that appears, click OK.

kubectl
  1. Run the following command to set the status of the node to unschedulable:

    kubectl cordon <NODE_NAME>
  2. Run the following command to drain the node:

    kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true

Step 2: Run the fix script on the node

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster to manage and click its name. In the navigation pane on the left, choose Nodes > Nodes.

  3. On the Nodes page, select the target nodes, and click Batch Operations at the bottom of the page. In the Batch Operations dialog box, select Run Shell Scripts, then click OK.

  4. On the Basic Information step, configure the parameters according to the following table, then click Next Step: Parameter Settings.

    Parameter           Value
    Template Category   Public Template
    Template            ACS-ECS-BulkyRunCommand
    Execution Mode      Suspend upon Failure

  5. On the Parameter Settings step, click Run Shell Script. Enter the following CVE fix script in the CommandContent field and click Next Step: OK.

    #!/bin/bash
    set -e

    # Determine the region of the node from the instance metadata service.
    REGION=$(curl -s 100.100.100.200/latest/meta-data/region-id 2>/dev/null)

    if [[ -z "$REGION" ]]; then
        echo "Error: failed to get region"
        exit 1
    fi

    NV_TOOLKIT_VERSION=1.17.8
    PACKAGE="upgrade_nvidia-container-toolkit-${NV_TOOLKIT_VERSION}.tar.gz"

    cd /tmp

    # Download the upgrade package from the region-local OSS bucket.
    # -f makes curl exit with an error on HTTP failures so that set -e stops the script.
    curl -sSf -o "${PACKAGE}" "http://aliacs-k8s-${REGION}.oss-${REGION}-internal.aliyuncs.com/public/pkg/nvidia-container-runtime/${PACKAGE}"

    tar -xf "${PACKAGE}"

    cd pkg/nvidia-container-runtime/upgrade/common

    bash upgrade-nvidia-container-toolkit.sh
  6. On the OK step, verify the information and click Create.

  7. After the task runs, in the navigation pane on the left of the CloudOps Orchestration Service console, choose Automated Tasks > Task Execution Management. Find and click the ID of the executed task, and view the Outputs in the Execution Steps and Results section.

    • If the script returns the following output, the node is not vulnerable to CVE-2025-23266. No changes were applied and no further action is required.

      2025-03-22/xxxx  INFO  No need to upgrade current nvidia-container-toolkit(1.17.8)
    • If the script returns the following output, the node was affected by the NVIDIA Container Toolkit vulnerability and has been fixed.

      2025-03-22/xxxxx  INFO  succeed to upgrade nvidia container toolkit
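For reference, the fix script above derives its download URL from the node's region. The construction can be sketched as follows; `cn-hangzhou` is an example region, whereas the real script reads the region from the metadata server at 100.100.100.200:

```shell
# REGION is hard-coded here for illustration; the fix script obtains it from
# the instance metadata service instead.
REGION="cn-hangzhou"
NV_TOOLKIT_VERSION="1.17.8"

PACKAGE="upgrade_nvidia-container-toolkit-${NV_TOOLKIT_VERSION}.tar.gz"

# The package is served from a per-region OSS bucket over the internal
# endpoint, so the download stays on the Alibaba Cloud internal network.
URL="http://aliacs-k8s-${REGION}.oss-${REGION}-internal.aliyuncs.com/public/pkg/nvidia-container-runtime/${PACKAGE}"

echo "$URL"
```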

Step 3: Set the node to schedulable

Console

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, find the cluster to manage and click its name. In the navigation pane on the left, choose Nodes > Nodes.

  3. On the Nodes page, select the target nodes and click Set Node Schedulability at the bottom of the page. In the dialog box that appears, select Set to Schedulable, read the notes, tick the checkbox, and click OK.

kubectl

Run the following command to uncordon the node (remove scheduling isolation):

kubectl uncordon <NODE_NAME>

(Optional) Step 4: Verify the GPU-accelerated nodes

After completing the preceding steps, deploy a GPU application using a sample YAML template from one of the following topics to verify that the node works as expected:
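For example, a minimal GPU smoke-test pod can look like the following. This is a sketch: the image tag and the `nvidia.com/gpu` resource name are assumptions based on the standard NVIDIA device plugin setup; adapt them to your cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Example public CUDA image; replace with an image accessible from your region.
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # requests one GPU via the standard device plugin resource name
```

If `kubectl logs gpu-smoke-test` shows the `nvidia-smi` device table after the pod completes, the fixed node is serving GPU workloads as expected.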