NVIDIA Device Plugin is a Kubernetes cluster component that manages the GPUs on each node so that GPU resources can be used more conveniently and efficiently. This topic describes how to upgrade and restart NVIDIA Device Plugin, isolate GPU devices, and check and update the NVIDIA Device Plugin version in Container Service for Kubernetes (ACK) clusters in exclusive GPU scheduling scenarios.
NVIDIA Device Plugin
The implementation and management of NVIDIA Device Plugin vary based on the Kubernetes version of your cluster. If the Kubernetes version of your cluster is earlier than 1.20, we recommend that you manually upgrade the cluster. The following table describes the differences between NVIDIA Device Plugin in clusters that run different Kubernetes versions.
| Item | Kubernetes 1.32 and later | Kubernetes 1.20 to 1.31 |
| --- | --- | --- |
| Deployment | DaemonSet | Static pod |
| Management | Add-ons page in the ACK console | Manual maintenance |
| Node label | ack.node.gpu.schedule=default | No special requirements |
Take note of the following items when you deploy NVIDIA Device Plugin as a DaemonSet:
NVIDIA Device Plugin is automatically installed during cluster creation.
If you uninstall NVIDIA Device Plugin, GPU-accelerated nodes that are added by scale-out activities cannot report GPU resources.
When you upgrade a cluster from an earlier version to 1.32, if NVIDIA Device Plugin is deployed in a static pod, NVIDIA Device Plugin will be upgraded to an ACK component.
This DaemonSet uses a node selector (ack.node.gpu.schedule=default) to select GPU-accelerated nodes for deployment. By default, when you add a GPU-accelerated node to an ACK cluster, ACK automatically adds the ack.node.gpu.schedule=default label to the node when the node initialization script is executed.
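If you need to check whether a GPU-accelerated node already carries this label, you can run a query similar to the following. The label key and value come from the preceding description; <NODE_NAME> is a placeholder for your node name.
# List GPU-accelerated nodes that carry the default exclusive scheduling label.
kubectl get nodes -l ack.node.gpu.schedule=default

# Check whether a specific node carries the label.
kubectl get node <NODE_NAME> --show-labels | grep ack.node.gpu.schedule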
If your nodes run Ubuntu 22.04 or Red Hat Enterprise Linux (RHEL) 9.3 64-bit, NVIDIA Device Plugin automatically sets the environment variable NVIDIA_VISIBLE_DEVICES=all by default. Running systemctl daemon-reload or systemctl daemon-reexec on the node may trigger GPU device access failures, which prevent NVIDIA Device Plugin from running as expected. For resolution steps, see Why does the system prompt failed to initialize NVML: Unknown Error when I run a GPU container on Ubuntu 22.04?
If you upgraded the cluster from an earlier version to 1.32 before May 1, 2025, NVIDIA Device Plugin may be deployed both as static pods and as DaemonSets in the cluster. You can run the following script to identify the nodes on which NVIDIA Device Plugin is still deployed as a static pod.
#!/bin/bash
# List the nodes on which NVIDIA Device Plugin is still deployed as a static pod.
for i in $(kubectl get po -n kube-system -l component=nvidia-device-plugin | grep -v NAME | awk '{print $1}');do
  # Static pods are sourced from files on the node (kubernetes.io/config.source: file).
  if kubectl get po $i -o yaml -n kube-system | grep 'kubernetes.io/config.source: file' &> /dev/null;then
    kubectl get pod $i -n kube-system -o jsonpath='{.spec.nodeName}{"\n"}'
  fi
done
Expected output:
cn-beijing.10.12.XXX.XX
cn-beijing.10.13.XXX.XX
The output shows that NVIDIA Device Plugin is still deployed as a static pod on some nodes. Add the following label to each of these nodes to migrate NVIDIA Device Plugin from a static pod to the DaemonSet:
kubectl label nodes <NODE_NAME> ack.node.gpu.schedule=default
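After you add the label, you can optionally verify that the DaemonSet has scheduled an NVIDIA Device Plugin pod to the node. The command below reuses the component=nvidia-device-plugin label selector shown later in this topic; <NODE_NAME> is a placeholder.
# Confirm that a DaemonSet-managed NVIDIA Device Plugin pod is running on the labeled node.
kubectl get pod -n kube-system -l component=nvidia-device-plugin -o wide | grep <NODE_NAME>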
Upgrade NVIDIA Device Plugin
Kubernetes 1.32 and later
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, click Add-ons.
On the Add-ons page, find the ack-nvidia-device-plugin card and click Upgrade.
In the dialog box that appears, click OK.
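After the upgrade is completed, you can optionally confirm from the command line that the NVIDIA Device Plugin pods are running. This check is a sketch that assumes the component=nvidia-device-plugin label selector used elsewhere in this topic.
# Verify that the NVIDIA Device Plugin pods are running after the upgrade.
kubectl get pod -n kube-system -l component=nvidia-device-plugin -o wide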
Kubernetes 1.20 to 1.31
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Nodes > Nodes.
Select the GPU-accelerated nodes requiring batch maintenance. In the lower part of the node list, click Batch Operations. In the Batch Operations dialog box, select Execute Shell Command and click Confirm.
Important: We recommend that you first upgrade NVIDIA Device Plugin on a small number of GPU-accelerated nodes. After the upgrade is completed, verify that NVIDIA Device Plugin runs as normal on these nodes. Then, upgrade the remaining GPU-accelerated nodes in batches.
In the CloudOps Orchestration Service (OOS) console that appears, set Execution Mode to Suspend upon Failure, and then click Next Step: Parameter Settings.
In the Parameter Settings step, select Run Shell Script and paste the following sample script into the code editor.
Note: You must set the RUN_PKG_VERSION parameter in the following script to the minor Kubernetes version that your cluster runs, such as 1.30. Do not set the value to a patch version, such as 1.30.1. If you specify a patch version, script errors may occur.
#!/bin/bash
set -e

# Set this to the minor Kubernetes version of your cluster, for example 1.30.
RUN_PKG_VERSION=1.30

function update_device_plugin() {
  base_dir=/tmp/update_device_plugin
  rm -rf $base_dir
  mkdir -p $base_dir
  cd $base_dir
  # Query the region ID of the node from the instance metadata service.
  region_id=$(curl -ssL 100.100.100.200/latest/meta-data/region-id 2> /dev/null || echo "")
  if [[ $region_id == "" ]];then
    echo "Error: failed to get region id,region id is null"
    exit 1
  fi
  # Download the installation package for the specified Kubernetes version.
  PKG_URL=https://aliacs-k8s-${region_id}.oss-${region_id}.aliyuncs.com/public/pkg/run/run-${RUN_PKG_VERSION}.tar.gz
  curl -sSL --retry 3 --retry-delay 2 -o run.tar.gz $PKG_URL
  tar -xf run.tar.gz
  local dir=pkg/run/$RUN_PKG_VERSION/module
  # Pull the image from the VPC registry endpoint of the current region.
  sed -i "s@registry.cn-hangzhou.aliyuncs.com/acs@registry-${region_id}-vpc.ack.aliyuncs.com/acs@g" $dir/nvidia-device-plugin.yml
  # Back up the existing static pod manifest, and then deploy the new one.
  mkdir -p /etc/kubernetes/device-plugin-backup
  mkdir -p /etc/kubernetes/manifests
  mv /etc/kubernetes/manifests/nvidia-device-plugin.yml /etc/kubernetes/device-plugin-backup/nvidia-device-plugin.yml.$(date +%s)
  sleep 5
  cp -a $dir/nvidia-device-plugin.yml /etc/kubernetes/manifests
  echo "succeeded to update device plugin"
}

if [ -f /etc/kubernetes/manifests/nvidia-device-plugin.yml ];then
  update_device_plugin
else
  echo "skip to update device plugin"
fi
Click Next: Confirm. After you verify the information, click Create.
You are redirected to the Task Execution Management page, on which you can monitor the task status. If the Execution Output displays succeeded to update device plugin, the update was successful.
Run the following command to verify that NVIDIA Device Plugin runs as normal on the GPU-accelerated node.
kubectl get nodes <NODE_NAME> -o jsonpath='{.metadata.name} ==> nvidia.com/gpu: {.status.allocatable.nvidia\.com/gpu}'
Expected output:
cn-hangzhou.172.16.XXX.XX ==> nvidia.com/gpu: 1
If the GPU-accelerated node reports an extended resource nvidia.com/gpu value that is not 0, NVIDIA Device Plugin runs as normal.
Restart NVIDIA Device Plugin
In ACK exclusive GPU scheduling scenarios, NVIDIA Device Plugin is deployed as a DaemonSet in clusters that run Kubernetes 1.32 and later, and as a static pod in clusters that run Kubernetes 1.20 to 1.31. To restart NVIDIA Device Plugin, perform the following steps based on the Kubernetes version of your cluster:
Kubernetes 1.32 and later
Run the following command to query the pod that runs NVIDIA Device Plugin on a node:
kubectl get pod -n kube-system -l component=nvidia-device-plugin -o wide | grep <NODE>
Run the following command to restart the pod that runs NVIDIA Device Plugin:
kubectl delete po <DEVICE_PLUGIN_POD> -n kube-system
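The deleted pod is managed by the DaemonSet and is therefore recreated automatically. You can rerun the query from the previous step to confirm that a new pod is running on the node.
# Confirm that the DaemonSet has recreated the NVIDIA Device Plugin pod on the node.
kubectl get pod -n kube-system -l component=nvidia-device-plugin -o wide | grep <NODE>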
Kubernetes 1.20 to 1.31
Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Nodes > Nodes.
Select the GPU-accelerated nodes that you want to manage and click Batch Operations. In the Batch Operations dialog box, select Run Shell Scripts and click OK.
Important: We recommend that you first restart NVIDIA Device Plugin on a small number of GPU-accelerated nodes. After the restart is completed, verify that NVIDIA Device Plugin runs as normal on these nodes. Then, restart NVIDIA Device Plugin on the remaining GPU-accelerated nodes.
In the OOS console, select Suspend upon Failure for Execution Mode and click Next Step: Parameter Settings.
In the Parameter Settings step, click Run Shell Script and copy the following sample script to the code editor:
#!/bin/bash
set -e

if [ -f /etc/kubernetes/manifests/nvidia-device-plugin.yml ];then
  # Temporarily move the static pod manifest out of the manifests directory,
  # wait for kubelet to remove the pod, and then move the manifest back to recreate it.
  cp -a /etc/kubernetes/manifests/nvidia-device-plugin.yml /etc/kubernetes
  rm -rf /etc/kubernetes/manifests/nvidia-device-plugin.yml
  sleep 5
  mv /etc/kubernetes/nvidia-device-plugin.yml /etc/kubernetes/manifests
  echo "the nvidia device is restarted"
else
  echo "no need to restart nvidia device plugin"
fi
Click Next Step: OK. Confirm the information and click Create. You are redirected to the Task Execution Management page, on which you can view the task status.
Run the following command to check whether NVIDIA Device Plugin runs as normal on the specified GPU-accelerated node:
kubectl get nodes <NODE_NAME> -o jsonpath='{.metadata.name} ==> nvidia.com/gpu: {.status.allocatable.nvidia\.com/gpu}'
Expected output:
cn-hangzhou.172.16.XXX.XX ==> nvidia.com/gpu: 1
If the value of nvidia.com/gpu in the output is not 0, NVIDIA Device Plugin runs as normal.
Isolate GPU devices
GPU device isolation is supported only by nvidia-device-plugin 0.9.1 and later. For more information about how to check the version, see View the NVIDIA Device Plugin version.
In ACK exclusive GPU scheduling scenarios, you may need to isolate a specific GPU device on a node due to issues like device failure. ACK provides a manual isolation mechanism to prevent new GPU application pods from being assigned to the affected GPU card. Follow these steps:
On the target node, manage the unhealthyDevices.json file in the /etc/nvidia-device-plugin/ directory. Create the file if it does not exist. Ensure that unhealthyDevices.json uses the following JSON format:
{
"index": ["x", "x" ..],
"uuid": ["xxx", "xxx" ..]
}
In the JSON file, enter the index or uuid of each device that you want to isolate. Only one of the two identifiers is required per device. The changes take effect automatically after you save the file.
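For example, the following sketch isolates one device by its index and another by its UUID. The values are placeholders; replace them with the actual index or UUID reported on your node (for example, by running nvidia-smi -L).
{
  "index": ["0"],
  "uuid": ["GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"]
}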
After the settings are complete, verify the isolation by checking the nvidia.com/gpu resources reported on the Kubernetes node.
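For this check, you can reuse the command shown earlier in this topic. If the isolation takes effect, the allocatable nvidia.com/gpu value reported by the node is expected to decrease by the number of isolated devices.
# Check the allocatable GPU resources reported by the node after isolation.
kubectl get nodes <NODE_NAME> -o jsonpath='{.metadata.name} ==> nvidia.com/gpu: {.status.allocatable.nvidia\.com/gpu}'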
View the NVIDIA Device Plugin version
Kubernetes 1.32 and later
For NVIDIA Device Plugin deployed as a DaemonSet, you can locate ack-nvidia-device-plugin on the Add-ons page of the console and check the current version on the component card.
Kubernetes 1.20 to 1.31
For NVIDIA Device Plugin deployed as a static pod, you can run the following command to check the component version.
kubectl get pods -n kube-system -l component=nvidia-device-plugin \
  -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\t"}{.spec.nodeName}{"\n"}{end}' \
  | awk -F'[:/]' '{split($NF, a, "-"); print a[1] "\t" $0}' \
  | sort -k1,1V \
  | cut -f2- \
  | awk -F'\t' '{split($1, img, ":"); print img[NF] "\t" $2}'
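The command prints one line per GPU-accelerated node in the format <image tag> <node name>. The following output is illustrative only; the tag and node name are hypothetical.
v0.9.3-0dd4d5f5-aliyun	cn-hangzhou.172.16.XXX.XX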
Modify the device checkpoint key for NVIDIA Device Plugin
The Device Plugin creates a checkpoint file on the node to record device allocations and corresponding pod information. By default, NVIDIA Device Plugin uses the UUID of the GPU as the unique identifier. Follow the steps below to change the identifier to the device's index, which can resolve issues like UUID loss during VM cold migration.
Kubernetes 1.32 and later
Run the following command to modify the environment variables of NVIDIA Device Plugin. In Kubernetes 1.32 and later, NVIDIA Device Plugin is deployed as a DaemonSet, so edit the DaemonSet instead of an individual pod:
kubectl edit daemonset -n kube-system ack-nvidia-device-plugin
Add the environment variable CHECKPOINT_DEVICE_ID_STRATEGY to the container, as shown in the following example:
env:
- name: CHECKPOINT_DEVICE_ID_STRATEGY
  value: index
Restart NVIDIA Device Plugin to make the modification take effect. For more information, see Restart NVIDIA Device Plugin.
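As an optional check, you can confirm that the environment variable is present on the recreated pods. The command below is a sketch that assumes the component=nvidia-device-plugin label selector used elsewhere in this topic.
# Print the node name and container environment variables of each NVIDIA Device Plugin pod.
kubectl get pod -n kube-system -l component=nvidia-device-plugin \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{": "}{.spec.containers[0].env}{"\n"}{end}'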
Kubernetes 1.20 to 1.31
Check the NVIDIA Device Plugin version in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file on the target node by checking the image tag. If the version is 0.9.3 or later, no change is needed. Otherwise, update the image to the latest version v0.9.3-0dd4d5f5-aliyun.
Modify the environment variables of the static pod in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file. Add the environment variable CHECKPOINT_DEVICE_ID_STRATEGY, as shown in the following example:
env:
- name: CHECKPOINT_DEVICE_ID_STRATEGY
  value: index
Restart NVIDIA Device Plugin to make the modification take effect. For more information, see Restart NVIDIA Device Plugin.
References
If you encounter issues with GPU-accelerated nodes, consult Diagnose GPU-accelerated nodes or refer to the GPU FAQ for assistance.
For more information about shared GPU scheduling, see GPU sharing.