
Container Service for Kubernetes:Fix vulnerability CVE-2025-23266

Last Updated:Aug 13, 2025

A time-of-check to time-of-use (TOCTOU) race condition vulnerability exists in NVIDIA Container Toolkit versions 1.17.7 and earlier when used with the default configuration. Deployments that use the Container Device Interface (CDI) are not affected. If exploited, this vulnerability can lead to container escape, allowing attackers to execute arbitrary commands on the host or access sensitive host system information. Known attack scenarios require running a malicious container image and using GPU resources within containers via the NVIDIA Container Toolkit. Immediate remediation is required for affected clusters.

For official details, see NVIDIA Container Toolkit.

Affected scope

Your cluster is affected if both of the following conditions are met:

  1. The Kubernetes version is 1.32 or earlier.

  2. At least one GPU-accelerated node is running NVIDIA Container Toolkit version 1.17.7 or earlier.

Check the NVIDIA Container Toolkit version

  1. Log on to the GPU-accelerated node.

  2. Run the nvidia-container-cli --version command.

    Sample output:

    cli-version: 1.17.8
    lib-version: 1.17.8
    build date: 2025-05-30T13:47+00:00
    build revision: 6eda4d76c8c5f8fc174e4abca83e513fb4dd63b0
    build compiler: x86_64-linux-gnu-gcc-7 7.5.0
    build platform: x86_64
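To decide from this output whether a node falls in the affected range (1.17.7 or earlier), you can compare versions with `sort -V`. The following is a minimal sketch; `version_le` is a hypothetical helper name, and the `INSTALLED` value is taken from the sample output above rather than queried live:

```shell
# version_le A B: true if version A is less than or equal to version B.
# Hypothetical helper for illustration; relies on sort -V (version sort).
version_le() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# On a real node, read the version from the CLI instead:
#   INSTALLED=$(nvidia-container-cli --version | awk -F': ' '/cli-version/ {print $2}')
INSTALLED="1.17.8"  # example value from the sample output above

if version_le "$INSTALLED" "1.17.7"; then
    echo "Affected: toolkit ${INSTALLED} is 1.17.7 or earlier"
else
    echo "Not affected: toolkit ${INSTALLED}"
fi
```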

Preventive measures

Enable the container security policy rule in the policy governance feature to restrict image pulls to trusted image repositories only.

Solutions

New GPU-accelerated nodes

ACK edge clusters (Kubernetes 1.20 or later)

  • Nodes created on or after August 4, 2025 automatically install the patched version (1.17.8) of the NVIDIA Container Toolkit.

  • No further action is required.

Clusters with Kubernetes versions earlier than 1.20

  • You must upgrade your cluster before creating new nodes to ensure they receive the patched version.

  • For instructions, see Upgrade clusters.

Existing GPU-accelerated nodes

All existing GPU-accelerated nodes created before August 4, 2025 require a manual fix.

Important

To ensure system stability, apply fixes in batches.
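As a sketch of what batching can look like, node names can be split into fixed-size groups before each group is drained, patched, and verified. The node names and batch size below are placeholders; in practice the list would come from kubectl:

```shell
# Sketch: split a node list into fixed-size batches (placeholder node names).
# In practice: NODES=$(kubectl get nodes -l <your-gpu-node-label> -o name)
NODES="node-1 node-2 node-3 node-4 node-5"
BATCH_SIZE=2

# Print one line per batch; a real run would cordon, drain, fix, and verify
# each batch before moving on to the next one.
echo "$NODES" | xargs -n "$BATCH_SIZE" echo "batch:"
```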

Manual fix procedure for edge nodes

Step 1: Drain the node

Drain the node to safely migrate its workloads to other available nodes. Before you begin, confirm you've selected the correct target edge node.

  1. Mark the node as unschedulable:

    kubectl cordon <NODE_NAME>
  2. Drain the node:

    kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true

Step 2: Apply the fix script

Log on to the affected node and run the following script.

  1. Set the required environment variables REGION and INTERCONNECT_MODE. Replace the example values with your actual configuration.

    export REGION="cn-hangzhou" INTERCONNECT_MODE="basic"  # Replace "cn-hangzhou" and "basic" with your actual values.

    Parameter reference:

    • REGION: Region ID of your ACK Edge cluster. Example: cn-hangzhou. For details, see Supported regions.

    • INTERCONNECT_MODE: Network access type. Example: basic.

      • basic: Internet access.

      • private: Leased line access.

  2. Run the fix script:

    #!/bin/bash
    set -e

    if [[ -z "$REGION" ]]; then
        echo "Error: REGION is null"
        exit 1
    fi

    if [[ -z "$INTERCONNECT_MODE" ]]; then
        echo "Error: INTERCONNECT_MODE is null"
        exit 1
    fi

    # Patched toolkit version to install.
    NV_TOOLKIT_VERSION=1.17.8

    # Leased line (private) access uses the internal OSS endpoint.
    INTERNAL=$( [ "$INTERCONNECT_MODE" = "private" ] && echo "-internal" || echo "" )
    PACKAGE=upgrade_nvidia-container-toolkit-${NV_TOOLKIT_VERSION}.tar.gz

    cd /tmp

    # Download and unpack the upgrade package from the region's OSS bucket.
    export PKG_URL_PREFIX="http://aliacs-k8s-${REGION}.oss-${REGION}${INTERNAL}.aliyuncs.com"
    curl -o "${PACKAGE}" "${PKG_URL_PREFIX}/public/pkg/nvidia-container-runtime/${PACKAGE}"

    tar -xf "${PACKAGE}"

    cd pkg/nvidia-container-runtime/upgrade/common

    bash upgrade-nvidia-container-toolkit.sh
    
  3. Verify the fix.

    • If you see the following output, the node is not vulnerable and no changes were made:

      2025-03-22/xxxx  INFO  No need to upgrade current nvidia-container-toolkit(1.17.8)
    • If you see the following output, the node was vulnerable and has been patched:

      2025-03-22/xxxxx  INFO  succeed to upgrade nvidia container toolkit
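The script assembles its download URL from REGION and INTERCONNECT_MODE. The sketch below reproduces that construction for both modes so you can sanity-check the endpoint before running the fix; build_url is a hypothetical helper name used here for illustration only:

```shell
# Sketch: rebuild the package URL the fix script downloads, for both modes.
NV_TOOLKIT_VERSION=1.17.8
PACKAGE=upgrade_nvidia-container-toolkit-${NV_TOOLKIT_VERSION}.tar.gz

# build_url <region> <mode> -- hypothetical helper, not part of the package.
build_url() {
    local region=$1 mode=$2 internal=""
    # Leased line (private) access goes through the internal OSS endpoint.
    [ "$mode" = "private" ] && internal="-internal"
    echo "http://aliacs-k8s-${region}.oss-${region}${internal}.aliyuncs.com/public/pkg/nvidia-container-runtime/${PACKAGE}"
}

build_url cn-hangzhou basic    # Internet access
build_url cn-hangzhou private  # leased line access
```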

Step 3: Re-enable the node

Restore the node's scheduling status:

kubectl uncordon <NODE_NAME>

Step 4 (optional): Test the node

After applying the fix, deploy a GPU workload to confirm the node is functioning correctly. Use sample YAML templates from: