NVIDIA Container Toolkit 1.16.1 or earlier contains a Time-of-check to Time-of-Use (TOCTOU) vulnerability when it is used with the default configuration. This vulnerability does not affect the use of Container Device Interface (CDI). Successful exploitation of this vulnerability may lead to container escape, which allows attackers to execute arbitrary commands on the host or obtain sensitive information. This vulnerability can be exploited if a victim uses a malicious image and manages GPU resources within a container by using the NVIDIA Container Toolkit. For more information about this vulnerability, see NVIDIA Container Toolkit. We recommend that you fix the vulnerability at the earliest opportunity.
Scope of impact
The NVIDIA Container Toolkit component is installed on the GPU-accelerated node in the cluster, and the component version is 1.16.1 or earlier.
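To check whether a node falls within the scope of impact, you can log on to the GPU-accelerated node and query the installed toolkit version, for example with `nvidia-container-toolkit --version`. The following sketch is illustrative: the `is_affected` helper is a hypothetical function that performs the version comparison with `sort -V`.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 (affected) when the given version is
# 1.16.1 or earlier, using sort -V for version-aware comparison.
is_affected() {
  [ "$(printf '%s\n' "$1" "1.16.1" | sort -V | head -n1)" = "$1" ]
}

# On the GPU-accelerated node, obtain the installed version first,
# for example: nvidia-container-toolkit --version
if is_affected "1.16.1"; then
  echo "1.16.1 is in scope: upgrade required"
fi
```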
How to prevent
Before the vulnerability is fixed, we recommend that you avoid running untrusted container images in the cluster to ensure the security and stability of the system.
Solution
This solution is available to managed node pools in ACK Pro clusters, ACK Basic clusters, ACK dedicated clusters, ACK Lingjun clusters, and on-cloud node pools in ACK Edge clusters.
Prerequisites
If cGPU is installed in your cluster, make sure that the cGPU version is 1.1.0 or later. Skip this step if cGPU is not installed in your cluster. The following section describes how to check whether cGPU is installed in the cluster and how to update it:
View how to update a cGPU
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Applications > Helm.
On the Helm page, check the versions of the installed components.
If ack-ai-installer and ack-cgpu are installed in your cluster, submit a ticket.
If the version of ack-ai-installer in your cluster is 1.7.5 or earlier, update ack-ai-installer. For more information, see Update the GPU sharing component.
If the version of ack-cgpu in your cluster is 1.7.5 or earlier, click Update next to the component name and then follow the instructions to update the component.
After the component is updated, you can update the existing GPU-accelerated nodes in the cluster. For more information, see Update the cGPU version on a node.
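As an alternative to the console, the installed components and their chart versions can also be checked with the Helm CLI. This is a sketch that assumes `helm` is configured against the cluster and that the components are deployed in the `kube-system` namespace:

```shell
#!/bin/bash
# List the GPU sharing components and their chart versions; fall back
# to a message when neither component (or helm itself) is available.
out=$(helm list -n kube-system 2>/dev/null \
  | grep -E 'ack-ai-installer|ack-cgpu' \
  || echo "ack-ai-installer/ack-cgpu not found")
echo "$out"
```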
Suggestion
New GPU-accelerated nodes: For clusters that run Kubernetes 1.20 and later, ACK has released NVIDIA Container Toolkit 1.16.2. GPU-accelerated nodes created on or after October 27, 2024 use this version and are not affected by the vulnerability.
If your cluster runs an earlier Kubernetes version, we recommend that you update the Kubernetes version of your cluster at the earliest opportunity. For more information, see Upgrade clusters.
Existing GPU-accelerated nodes: For GPU-accelerated nodes created before October 27, 2024, you must manually fix the vulnerability by running the CVE patching script. Expand the following panel to view the manual patching solution.
Note Perform vulnerability patching on the nodes in batches. Do not patch all nodes at the same time to ensure system stability.
During the patching process, the pods that run business applications on the node are restarted. Perform this operation during off-peak hours.
View the manual patching solution
Step 1: Drain nodes
Use the ACK console
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the navigation pane on the left, choose Nodes > Nodes.
On the Nodes page, select the node that you want to manage and then click Drain in the lower part of the page. In the dialog box that appears, click OK.
kubectl
Mark the node as unschedulable:
kubectl cordon <NODE_NAME>
Drain the node:
kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true
Step 2: Run the patching script on the node
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the navigation pane on the left, choose Nodes > Nodes.
On the Nodes page, select the node that you want to manage, and then click Batch Operations in the lower part of the page. In the dialog box that appears, select Run Shell Scripts and click OK.
On the Basic Information page of the CloudOps Orchestration Service (OOS) console, configure the parameters and click Next Step: Parameter Settings. The following table describes the parameters.
| Parameter | Value |
| --- | --- |
| Execution Description | Fix the NVIDIA Container Toolkit vulnerability |
| Template Category | Public Template |
| Template | ACS-ECS-BulkyRunCommand |
| Execution Mode | Suspend upon Failure |
On the Parameter Settings page, click the Run Shell Script tab. Add the following CVE patching script to the box next to the CommandContent parameter, and then click Next Step: OK.
#!/bin/bash
set -e
# Query the region where the node resides from the instance metadata server.
export REGION=$(curl -s 100.100.100.200/latest/meta-data/region-id 2>/dev/null)
if [[ "$REGION" == "" ]]; then
  echo "Error: failed to get region"
  exit 1
fi
cd /tmp
# Download and extract the patching package, then run the upgrade script.
curl -fsS -o upgrade_nvidia-container-toolkit.tar.gz "https://aliacs-k8s-${REGION}.oss-${REGION}-internal.aliyuncs.com/public/pkg/nvidia-container-runtime/upgrade_nvidia-container-toolkit.tar.gz"
tar -xf upgrade_nvidia-container-toolkit.tar.gz
cd pkg/nvidia-container-runtime/upgrade/common
bash upgrade-nvidia-container-toolkit.sh
On the Confirmation page, verify your settings and click Create.
After the task is executed, log on to the CloudOps Orchestration Service console. In the left-side navigation pane, choose Task Execution Management. On the Task Execution Management page, find the task that you want to view and click its Execution ID. On the page that appears, view Output in the Execution Steps and Results section.
If the following output is displayed, the vulnerability does not exist on the node. No changes are performed on the host, and you can ignore the vulnerability.
2024-10-22/xxxx INFO No need to upgrade current nvidia-container-toolkit(1.16.2)
If the following output is displayed, the NVIDIA Container Toolkit vulnerability exists on the node and is fixed.
2024-10-10/xxxxx INFO succeed to upgrade nvidia container toolkit
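After the patch, you can also verify the installed version directly on the node. The following sketch assumes the `nvidia-container-toolkit --version` output contains a version number; `is_patched` is a hypothetical helper that checks for 1.16.2 or later:

```shell
#!/bin/bash
# Hypothetical helper: returns 0 when the given version is 1.16.2 or later.
is_patched() {
  [ "$(printf '%s\n' "$1" "1.16.2" | sort -V | head -n1)" = "1.16.2" ]
}

# Extract the version from the toolkit binary if it is present.
installed=$(nvidia-container-toolkit --version 2>/dev/null \
  | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
if [ -z "$installed" ]; then
  echo "nvidia-container-toolkit not found on this host"
elif is_patched "$installed"; then
  echo "patched: $installed"
else
  echo "still vulnerable: $installed"
fi
```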
Step 3: Connect the node to the cluster
Use the ACK console
Log on to the ACK console. In the navigation pane on the left, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the navigation pane on the left, choose Nodes > Nodes.
On the Nodes page, select the node that you want to manage and click Set Node Schedulability in the lower part of the page. In the dialog box that appears, select Set to Schedulable, and click OK.
kubectl
Run the following command to mark the node as schedulable again:
kubectl uncordon <NODE_NAME>
(Optional) Step 4: Verify GPU-accelerated nodes
After you complete the preceding operations, we recommend that you deploy a GPU-accelerated application based on the sample YAML file in the following topic to check whether the GPU-accelerated node works as expected.
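The check can be sketched as a minimal workload that requests one GPU and runs `nvidia-smi`. This is an assumption-based example, not the sample YAML file referenced above: the image tag, pod name, and the standard `nvidia.com/gpu` resource name are all illustrative. The snippet only writes the manifest; apply it with kubectl when you are ready.

```shell
#!/bin/bash
# Write a minimal GPU verification pod manifest to a temporary file.
cat > /tmp/gpu-verify.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-verify
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Illustrative image tag; use an image available in your region.
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
echo "Manifest written. Apply it with: kubectl apply -f /tmp/gpu-verify.yaml"
```

If the pod completes and its logs show the GPU, the node schedules GPU workloads as expected.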