Container Service for Kubernetes: Vulnerability CVE-2025-23359

Last Updated: Apr 07, 2025

NVIDIA Container Toolkit 1.17.3 and earlier versions contain a security vulnerability in the handling of Compute Unified Device Architecture (CUDA) forward compatibility. When a container image contains maliciously crafted symbolic links, libnvidia-container incorrectly mounts host directories into the container in read-only mode. Attackers can exploit this vulnerability to bypass container isolation, potentially leading to sensitive information theft or host privilege escalation. For more information about this vulnerability, see NVIDIA Container Toolkit. We recommend that you fix this vulnerability at the earliest opportunity.

Affected versions

This vulnerability affects clusters that run Kubernetes earlier than 1.32 and that have GPU-accelerated nodes with NVIDIA Container Toolkit 1.17.3 or earlier installed.

Note

You can run nvidia-container-cli --version to check the component version.
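For example, a minimal check on a GPU-accelerated node might look like the following. This is a sketch that assumes you can log on to the node, for example over SSH; a reported version of 1.17.3 or earlier indicates an affected node.

# Run on the GPU-accelerated node.
nvidia-container-cli --version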

How to prevent

Until the vulnerability is fixed, we recommend that you do not run untrusted container images in the cluster, to ensure system security and stability. To fix affected nodes, use the methods described in the following Solution section.
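As a quick audit, you can list the images that currently run in the cluster and confirm that they come from trusted sources. The following one-liner is a sketch that assumes kubectl access to the cluster; adapt it to your environment.

# List the unique container images currently running across all namespaces.
kubectl get pods --all-namespaces -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u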

Solution

Usage notes

  • The solution is applicable to ACK managed Pro clusters, ACK managed Basic clusters, ACK dedicated clusters, cloud node pools in ACK Edge clusters, and managed node pools in ACK Lingjun clusters.

  • If your cluster is an ACK Lingjun cluster and the node pool is a Lingjun node pool, submit a ticket.

  • Patch the nodes in batches. To maintain system stability, do not patch all nodes at the same time.

  • The fix process restarts the application pods running on the node. Perform the fix during off-peak hours.

Configuration solution for new GPU-accelerated nodes

This solution is applicable to clusters that run Kubernetes 1.20 or later. If the Kubernetes version of your cluster is earlier than 1.20, upgrade the cluster first. For more information, see Upgrade clusters.
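If you are not sure which Kubernetes version your cluster runs, you can check it before you proceed. The following commands are a sketch that assumes kubectl access to the cluster; you can also view the version on the cluster details page in the console.

# Print the Kubernetes version of the API server.
kubectl version
# The VERSION column shows the kubelet version of each node.
kubectl get nodes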

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  3. Find the node pool that you want to manage, click Edit, add the ack.aliyun.com/nvidia-container-runtime-version=1.17.5 label to the node pool, and then click Confirm.

    Note
    • This label pins the nvidia-container-toolkit version that is installed when the node pool scales out to 1.17.5. The version is not automatically upgraded when new versions are released.

    • If you want to use a newer version of nvidia-container-toolkit, delete this label manually. Nodes added during subsequent scale-out activities then use the latest version by default.
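After the node pool scales out, you can optionally confirm that the newly added nodes carry the label. The following command is a sketch; it assumes that the node pool label is propagated to the node labels, which is the default behavior for node pool labels.

# Show the pinned toolkit version label for each node.
kubectl get nodes -L ack.aliyun.com/nvidia-container-runtime-version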

Fix solution for existing GPU-accelerated nodes

For existing GPU-accelerated nodes, you can manually fix the issue by running the CVE fix script. The following section describes the details of the fix solution.

Prerequisites

Skip this step if cGPU is not installed in your cluster. If cGPU is installed, make sure that its version is 1.1.0 or later. The following steps describe how to check whether cGPU is installed in the cluster and how to update it:

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Helm.

  3. On the Helm page, check the versions of the installed components. You can also check the versions from the command line, as shown after this list.

    • If both ack-ai-installer and ack-cgpu are installed in your cluster, submit a ticket.

    • If the version of ack-ai-installer in your cluster is 1.7.5 or earlier, update ack-ai-installer. For more information, see Update the GPU sharing component.

    • If the version of ack-cgpu in your cluster is 1.7.5 or earlier, click Update next to the component name and then follow the instructions to update the component.

  4. After the component is updated, you can update the existing GPU-accelerated nodes in the cluster. For more information, see Update the cGPU version on a node.
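As an alternative to the console, the following command is a sketch of how to check the installed component versions from a machine that has helm access to the cluster. It assumes that the components are installed as standard Helm releases, as shown on the Helm page in the console.

# List Helm releases in all namespaces and filter for the GPU sharing components.
helm list -A | grep -E 'ack-ai-installer|ack-cgpu'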

Step 1: Drain a node

Use the ACK console
  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Nodes.

  3. On the Nodes page, select the node that you want to manage and then click Drain in the lower part of the page. In the dialog box that appears, click OK.

Use kubectl
  1. Run the following command to set the status of the node to unschedulable:

    kubectl cordon <NODE_NAME>
  2. Run the following command to drain the node:

    kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true
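After the drain completes, you can optionally confirm that the node is unschedulable and that only DaemonSet-managed pods remain on it before you run the fix script. The following commands are a sketch; replace <NODE_NAME> with the name of the node.

# The node status should include SchedulingDisabled.
kubectl get node <NODE_NAME>
# List the pods that are still running on the node.
kubectl get pods --all-namespaces --field-selector spec.nodeName=<NODE_NAME> -o wide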

Step 2: Run the fix script on the node

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Nodes.

  3. On the Nodes page, select the nodes that you want to manage and click Batch Operations at the bottom. In the dialog box that appears, select Run Shell Scripts, and click OK.

  4. On the Basic Information tab of the CloudOps Orchestration Service (OOS) console, configure the following parameters, and then click Next Step: Parameter Settings.

    • Template Category: Public Template

    • Template: ACS-ECS-BulkyRunCommand

    • Execution Mode: Suspend upon Failure

  5. On the Parameter Settings tab, click Run Shell Script, paste the following CVE fix script into the CommandContent section, and click Next Step: OK.

    #!/bin/bash
    set -e

    # Query the metadata service for the region where the node is located.
    export REGION=$(curl 100.100.100.200/latest/meta-data/region-id 2>/dev/null)

    if [[ "$REGION" == "" ]]; then
        echo "Error: failed to get region"
        exit 1
    fi

    # Target nvidia-container-toolkit version that contains the fix.
    NV_TOOLKIT_VERSION=1.17.5

    PACKAGE=upgrade_nvidia-container-toolkit-${NV_TOOLKIT_VERSION}.tar.gz

    cd /tmp

    # Download the upgrade package from the region-local OSS bucket.
    curl -o "${PACKAGE}" "http://aliacs-k8s-${REGION}.oss-${REGION}-internal.aliyuncs.com/public/pkg/nvidia-container-runtime/${PACKAGE}"

    # Extract the package and run the upgrade script that it contains.
    tar -xf "${PACKAGE}"

    cd pkg/nvidia-container-runtime/upgrade/common

    bash upgrade-nvidia-container-toolkit.sh
  6. On the OK tab, confirm the creation information, and then click Create.

  7. After the task is successfully executed, in the left navigation pane of the OOS console, choose Automated Task > Task Execution Management. Find and click the corresponding task execution ID. Then, check the Outputs in the Execution Steps and Results section.

    • If the following output is displayed, the node is not affected by the CVE, no changes were made to the node, and no further action is required.

      2025-03-22/xxxx  INFO  No need to upgrade current nvidia-container-toolkit(1.17.5)
    • If the following output is displayed, the NVIDIA Container Toolkit vulnerability existed on the node and has been fixed.

      2025-03-22/xxxxx  INFO  succeed to upgrade nvidia container toolkit

Step 3: Uncordon the node

Use the ACK console

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Nodes.

  3. On the Nodes page, select the node that you want to manage and click Set Node Schedulability at the bottom. In the dialog box that appears, select Set to Schedulable, and then click OK.

Use kubectl

Run the following command to uncordon the node:

kubectl uncordon <NODE_NAME>

(Optional) Step 4: Verify the GPU node

After you complete the preceding operations, we recommend that you deploy a GPU-accelerated application based on the sample YAML file in the related topic to verify that the node works as expected.
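For example, a minimal smoke test might schedule a pod that requests one GPU on the fixed node and runs nvidia-smi. The following sketch assumes that a public CUDA base image is accessible from your cluster and that the default NVIDIA device plugin resource name nvidia.com/gpu is used; the pod name and image are illustrative, so adjust them to your environment.

# Create a short-lived pod that requests one GPU and prints the GPU status.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# After the pod completes, the log should show the GPU details reported by nvidia-smi.
kubectl logs pod/gpu-smoke-test

# Clean up the test pod.
kubectl delete pod gpu-smoke-test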