NVIDIA Container Toolkit 1.16.1 or earlier contains a Time-of-check to Time-of-Use (TOCTOU) vulnerability when it is used with the default configuration. This vulnerability does not affect the use of Container Device Interface (CDI). Successful exploitation of this vulnerability may lead to container escape, which allows attackers to execute arbitrary commands on the host or obtain sensitive information. This vulnerability can be exploited if a victim uses a malicious image and manages GPU resources within a container by using the NVIDIA Container Toolkit. For more information about this vulnerability, see NVIDIA Container Toolkit. We recommend that you fix the vulnerability at the earliest opportunity.
Scope of impact
The NVIDIA Container Toolkit component is installed on the GPU-accelerated node in the cluster, and the component version is 1.16.1 or earlier.
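To check whether a node falls within the scope of impact, you can log on to the GPU-accelerated node and query the installed toolkit version, for example with `nvidia-container-toolkit --version`. The following sketch is illustrative: the `is_affected` helper is a hypothetical function that performs the version comparison with `sort -V`.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 (affected) when the given version is
# 1.16.1 or earlier, using sort -V for version-aware comparison.
is_affected() {
  [ "$(printf '%s\n' "$1" "1.16.1" | sort -V | head -n1)" = "$1" ]
}

# On the GPU-accelerated node, obtain the installed version first,
# for example: nvidia-container-toolkit --version
if is_affected "1.16.1"; then
  echo "1.16.1 is in scope: upgrade required"
fi
```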
How to prevent
Before the vulnerability is fixed, we recommend that you avoid running untrusted container images in the cluster to ensure the security and stability of the system.
Solution
This solution is available to managed node pools in ACK Pro clusters, ACK Basic clusters, ACK dedicated clusters, ACK Lingjun clusters, and on-cloud node pools in ACK Edge clusters.
Prerequisites
If cGPU is installed in your cluster, make sure that the cGPU version is 1.1.0 or later. Skip this step if cGPU is not installed in your cluster. The following section describes how to check whether cGPU is installed in the cluster and how to update it:
View how to update a cGPU
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Helm.
On the Helm page, check the versions of the installed components.
If ack-ai-installer and ack-cgpu are installed in your cluster, submit a ticket.
If the version of ack-ai-installer in your cluster is 1.7.5 or earlier, update ack-ai-installer. For more information, see Update the GPU sharing component.
If the version of ack-cgpu in your cluster is 1.7.5 or earlier, click Update next to the component name and then follow the instructions to update the component.
After the component is updated, you can update the existing GPU-accelerated nodes in the cluster. For more information, see Update the cGPU version on a node.
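As an alternative to the console, the installed components and their chart versions can also be checked with the Helm CLI. This is a sketch that assumes `helm` is configured against the cluster and that the components are deployed in the `kube-system` namespace:

```shell
#!/bin/bash
# List the GPU sharing components and their chart versions; fall back
# to a message when neither component (or helm itself) is available.
out=$(helm list -n kube-system 2>/dev/null \
  | grep -E 'ack-ai-installer|ack-cgpu' \
  || echo "ack-ai-installer/ack-cgpu not found")
echo "$out"
```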
Suggestion
New GPU-accelerated nodes: For clusters that run Kubernetes 1.20 and later, ACK has released NVIDIA Container Toolkit 1.16.2. GPU-accelerated nodes created on or after October 27, 2024 use this version and are not affected by the vulnerability.
If your cluster runs an earlier Kubernetes version, we recommend that you update the Kubernetes version of your cluster at the earliest opportunity. For more information, see Upgrade clusters.
Existing GPU-accelerated nodes: For GPU-accelerated nodes created before October 27, 2024, you must manually fix the vulnerability by running the CVE patching script. Expand the following panel to view the manual patching solution.
Note Perform vulnerability patching on the nodes in batches. Do not patch all nodes at the same time to ensure system stability.
During the patching process, the pods that run business applications on the node are restarted. Perform this operation during off-peak hours.
View the manual patching solution
Step 1: Drain nodes
Use the ACK console
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the navigation pane on the left, choose Nodes > Nodes.
On the Nodes page, select the node that you want to manage and then click Drain in the lower part of the page. In the dialog box that appears, click OK.
kubectl
Mark the node as unschedulable:
kubectl cordon <NODE_NAME>
Drain the node:
kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true
Step 2: Run the patching script on the node
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the navigation pane on the left, choose Nodes > Nodes.
On the Nodes page, select the node that you want to manage, and then click Batch Operations in the lower part of the page. In the dialog box that appears, select Run Shell Scripts and click OK.
On the Basic Information page of the CloudOps Orchestration Service (OOS) console, configure the parameters and click Next Step: Parameter Settings. The following table describes the parameters.
| Parameter | Value |
| --- | --- |
| Execution Description | Fix the NVIDIA Container Toolkit vulnerability |
| Template Category | Public Template |
| Template | ACS-ECS-BulkyRunCommand |
| Execution Mode | Suspend upon Failure |
On the Parameter Settings page, click the Run Shell Script tab. Add the following CVE patching script to the box next to the CommandContent parameter, and then click Next Step: OK.
#!/bin/bash
set -e
# Query the region where the node resides from the instance metadata server.
export REGION=$(curl -s 100.100.100.200/latest/meta-data/region-id 2>/dev/null)
if [[ "$REGION" == "" ]]; then
  echo "Error: failed to get region"
  exit 1
fi
cd /tmp
# Download and extract the patching package, then run the upgrade script.
curl -fsS -o upgrade_nvidia-container-toolkit.tar.gz "https://aliacs-k8s-${REGION}.oss-${REGION}-internal.aliyuncs.com/public/pkg/nvidia-container-runtime/upgrade_nvidia-container-toolkit.tar.gz"
tar -xf upgrade_nvidia-container-toolkit.tar.gz
cd pkg/nvidia-container-runtime/upgrade/common
bash upgrade-nvidia-container-toolkit.sh
On the Confirmation page, verify your settings and click Create.
After the task is executed, log on to the CloudOps Orchestration Service console. In the left-side navigation pane, choose Task Execution Management. On the Task Execution Management page, find the task that you want to view and click its Execution ID. On the page that appears, view Output in the Execution Steps and Results section.
If the following output is displayed, the vulnerability does not exist on the node. No changes are performed on the host, and you can ignore the vulnerability.
2024-10-22/xxxx INFO No need to upgrade current nvidia-container-toolkit(1.16.2)
If the following output is displayed, the NVIDIA Container Toolkit vulnerability exists on the node and is fixed.
2024-10-10/xxxxx INFO succeed to upgrade nvidia container toolkit
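After the patch, you can also verify the installed version directly on the node. The following sketch assumes the `nvidia-container-toolkit --version` output contains a version number; `is_patched` is a hypothetical helper that checks for 1.16.2 or later:

```shell
#!/bin/bash
# Hypothetical helper: returns 0 when the given version is 1.16.2 or later.
is_patched() {
  [ "$(printf '%s\n' "$1" "1.16.2" | sort -V | head -n1)" = "1.16.2" ]
}

# Extract the version from the toolkit binary if it is present.
installed=$(nvidia-container-toolkit --version 2>/dev/null \
  | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
if [ -z "$installed" ]; then
  echo "nvidia-container-toolkit not found on this host"
elif is_patched "$installed"; then
  echo "patched: $installed"
else
  echo "still vulnerable: $installed"
fi
```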
Step 3: Connect the node to the cluster
Use the ACK console
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the navigation pane on the left, choose Nodes > Nodes.
On the Nodes page, select the node that you want to manage and click Set Node Schedulability in the lower part of the page. In the dialog box that appears, select Set to Schedulable, and click OK.
kubectl
Run the following command to mark the node as schedulable again:
kubectl uncordon <NODE_NAME>
(Optional) Step 4: Verify GPU-accelerated nodes
After you complete the preceding operations, we recommend that you deploy a GPU-accelerated application based on the sample YAML file in the following topic to check whether the GPU-accelerated node works as expected.
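The check can be sketched as a minimal workload that requests one GPU and runs `nvidia-smi`. This is an assumption-based example, not the sample YAML file referenced above: the image tag, pod name, and the standard `nvidia.com/gpu` resource name are all illustrative. The snippet only writes the manifest; apply it with kubectl when you are ready.

```shell
#!/bin/bash
# Write a minimal GPU verification pod manifest to a temporary file.
cat > /tmp/gpu-verify.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-verify
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Illustrative image tag; use an image available in your region.
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
echo "Manifest written. Apply it with: kubectl apply -f /tmp/gpu-verify.yaml"
```

If the pod completes and its logs show the GPU, the node schedules GPU workloads as expected.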