NVIDIA Container Toolkit 1.16.1 or earlier contains a Time-of-Check to Time-of-Use (TOCTOU) vulnerability when it is used with the default configuration. This vulnerability does not affect the use of Container Device Interface (CDI). Successful exploitation may lead to container escape, which allows attackers to execute arbitrary commands on the host or obtain sensitive information. The vulnerability can be exploited if a victim runs a malicious image and manages GPU resources within the container by using the NVIDIA Container Toolkit. For more information about this vulnerability, see NVIDIA Container Toolkit. We recommend that you fix the vulnerability at the earliest opportunity.
Scope of impact
The NVIDIA Container Toolkit component is installed on the GPU-accelerated node in the Container Service for Kubernetes (ACK) Edge cluster, and the component version is 1.16.1 or earlier.
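The version condition above can be checked with a small helper. The following is a minimal sketch, not part of the official repair script: the function name is_affected_version is hypothetical, it assumes a sort that supports -V (for example GNU coreutils), and it leaves it to the caller to determine the installed toolkit version on the node.

```shell
# Hypothetical helper (not from the official script): succeeds when the given
# NVIDIA Container Toolkit version is 1.16.1 or earlier, the affected range
# stated in this bulletin.
is_affected_version() {
  version="${1#v}"            # tolerate a leading "v", e.g. v1.16.1
  version="${version#V}"      # tolerate a leading "V", e.g. V1.16.2
  last_affected="1.16.1"
  # sort -V compares version strings component by component. If the larger
  # of the two values is still 1.16.1, the given version is <= 1.16.1.
  highest=$(printf '%s\n%s\n' "$version" "$last_affected" | sort -V | tail -n 1)
  [ "$highest" = "$last_affected" ]
}

# Usage: pass the version you found on the node.
if is_affected_version "1.16.1"; then
  echo "affected: run the repair steps"
fi
```

The comparison is done on plain release versions only; pre-release suffixes are outside the scope of this sketch.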
Solutions
New nodes: ACK Edge has released NVIDIA Container Toolkit V1.16.2 for clusters that run Kubernetes V1.20 or later, which fixes the vulnerability automatically. GPU-accelerated nodes created on or after October 27, 2024 can be used as normal.
If your cluster runs an earlier Kubernetes version, we recommend that you update the Kubernetes version of your cluster at the earliest opportunity. For more information, see Update an ACK Edge cluster.
Existing nodes: GPU-accelerated nodes created before October 27, 2024 must be fixed manually by running the Common Vulnerabilities and Exposures (CVE) repair script.
For cloud nodes, refer to solutions for repair instructions.
For edge nodes, see the repair methods detailed below.
Perform vulnerability patching on the nodes in batches. To ensure system stability, do not patch all nodes at the same time.
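One way to plan the batches is to split the list of affected nodes into fixed-size groups and work through one group at a time. The following is a minimal sketch under stated assumptions: the function name batch_nodes and the node names are hypothetical, node names contain no whitespace, and the batch size is whatever your workloads can tolerate being drained at once.

```shell
# Hypothetical helper: prints the given node names in groups of <size>,
# one group per line, so that each line can be patched as one batch.
batch_nodes() {
  size="$1"
  shift
  case "$size" in
    ''|*[!0-9]*|0)
      echo "batch size must be a positive integer" >&2
      return 1
      ;;
  esac
  # xargs -n emits at most <size> arguments per output line (default: echo).
  printf '%s\n' "$@" | xargs -n "$size"
}

# Usage with placeholder node names:
batch_nodes 2 edge-gpu-1 edge-gpu-2 edge-gpu-3
```

Patch and verify every node in one line of the output before you start on the next line.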
Step 1: Drain a node
Use the ACK console
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, open the Nodes page.
On the Nodes page, select the node that you want to manage and then click Drain in the lower part of the page. In the dialog box that appears, click OK.
kubectl
Run the following command to set the status of the node to unschedulable:
kubectl cordon <NODE_NAME>
Run the following command to drain the node:
kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true
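The cordon and drain commands can be wrapped in one function so that they always run in the same order. The following is a minimal sketch: the names run and drain_node and the DRY_RUN variable are assumptions of this example (DRY_RUN is not a kubectl feature), and the kubectl flags are the ones shown above.

```shell
# Hypothetical wrapper: with DRY_RUN=1 the command is printed, not executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Cordon the node first so no new pods land on it, then evict existing pods.
drain_node() {
  node="$1"
  if [ -z "$node" ]; then
    echo "usage: drain_node <NODE_NAME>" >&2
    return 1
  fi
  run kubectl cordon "$node" &&
    run kubectl drain "$node" --grace-period=120 --ignore-daemonsets=true
}

# Usage: preview the commands for a placeholder node name.
DRY_RUN=1
drain_node edge-gpu-1
```

Unset DRY_RUN (or set it to 0) to execute the commands against the cluster that your kubeconfig points to.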
Step 2: Run the repair script on the node
Run the following script on affected nodes:
export REGION="" INTERCONNECT_MODE=""
export INTERNAL=$( [ "$INTERCONNECT_MODE" = "private" ] && echo "-internal" || echo "" )
wget http://aliacs-k8s-${REGION}.oss-${REGION}${INTERNAL}.aliyuncs.com/public/pkg/edge/fix-nvidia-cve.sh -O /tmp/fix-nvidia-cve.sh && bash /tmp/fix-nvidia-cve.sh
The following table describes the parameters:
| Parameter | Description | Example |
| --- | --- | --- |
| REGION | The region ID of the cluster. For more information about the regions supported by ACK Edge clusters, see Supported regions. | cn-hangzhou |
| INTERCONNECT_MODE | The network type of connections to the node. Valid values: basic (public network) and private (Express Connect circuits). | basic |
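To see how the two parameters shape the download address, the URL construction from the script can be isolated into a function. The following is a minimal sketch: the function name build_fix_url is hypothetical, and unlike the original one-liner (which treats every value other than private as the public endpoint) it rejects an empty region or an unknown mode, which is a design choice of this example.

```shell
# Hypothetical helper: prints the download URL of the repair script for a
# given region ID and interconnect mode (basic or private).
build_fix_url() {
  region="$1"
  mode="$2"
  if [ -z "$region" ]; then
    echo "REGION must not be empty" >&2
    return 1
  fi
  case "$mode" in
    private) internal="-internal" ;;   # Express Connect: internal endpoint
    basic)   internal="" ;;            # public network: public endpoint
    *)
      echo "INTERCONNECT_MODE must be basic or private" >&2
      return 1
      ;;
  esac
  printf 'http://aliacs-k8s-%s.oss-%s%s.aliyuncs.com/public/pkg/edge/fix-nvidia-cve.sh\n' \
    "$region" "$region" "$internal"
}

# Usage with the example values from the table:
build_fix_url cn-hangzhou basic
```

Printing the URL before downloading makes a mistyped region ID or mode visible early.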
View the output.
If the following output is returned, your current node environment is not affected by the CVE, and no changes are necessary:
The current version of Nvidia container toolkit is safe, no cve.
If the following output is returned, your node environment had the NVIDIA Container Toolkit vulnerability, which has now been remedied:
2024-10-10/xxxxx INFO succeeded to fix nvidia container toolkit cve
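When many nodes are patched, the two outcomes can be told apart automatically by matching the messages quoted above. The following is a minimal sketch: the function name classify_fix_output and the labels it prints are hypothetical, and it assumes the script output has been captured and is passed on standard input.

```shell
# Hypothetical helper: reads captured repair-script output from stdin and
# prints "fixed", "not-affected", or "unknown" (with a non-zero exit status).
classify_fix_output() {
  output=$(cat)
  case "$output" in
    *"succeeded to fix nvidia container toolkit cve"*)
      echo "fixed"
      ;;
    *"is safe, no cve"*)
      echo "not-affected"
      ;;
    *)
      echo "unknown"
      return 1
      ;;
  esac
}

# Usage with the sample message from this section:
echo "2024-10-10/xxxxx INFO succeeded to fix nvidia container toolkit cve" |
  classify_fix_output
```

Treat an unknown result as a failed repair and inspect the full output on that node before you bring it back online.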
Step 3: Bring the node online
Use the ACK console
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, open the Nodes page.
On the Nodes page, select the node that you want to manage and click Set Node Schedulability in the lower part of the page. In the dialog box that appears, select Set to Schedulable, and click OK.
kubectl
Run the following command to make the node schedulable again:
kubectl uncordon <NODE_NAME>
Security hardening suggestions
We recommend that you enable security policy management and implement the ACKAllowedRepos policy to ensure that only images from trusted repositories are used and that permissions to import images are granted based on the principle of least privilege.