Container Service for Kubernetes:GPU Device Plugin-related operations

Last Updated:Jun 16, 2025

NVIDIA Device Plugin is a Kubernetes cluster component that manages each node's GPU, enabling more convenient and efficient GPU resource utilization. This topic describes how to upgrade and restart NVIDIA Device Plugin, isolate GPU devices, and check and update the NVIDIA Device Plugin version in Container Service for Kubernetes (ACK) clusters in exclusive GPU scheduling scenarios.

NVIDIA Device Plugin

The implementation and management of NVIDIA Device Plugin vary based on the Kubernetes version of your cluster. If the Kubernetes version of your cluster is earlier than 1.20, we recommend that you manually upgrade the cluster. The following table describes the differences between NVIDIA Device Plugin in clusters that run different Kubernetes versions.

Item          Kubernetes 1.32 and later            Kubernetes 1.20 to 1.31
Deployment    DaemonSet                            Static pod
Management    Add-ons page in the ACK console      Manual maintenance
Node label    ack.node.gpu.schedule=default        No special requirements

Take note of the following items when you deploy NVIDIA Device Plugin as a DaemonSet:

  • NVIDIA Device Plugin is automatically installed during cluster creation.

  • If you uninstall NVIDIA Device Plugin, GPU-accelerated nodes that are added by scale-out activities cannot report GPU resources.

  • When you upgrade a cluster from an earlier version to 1.32, if NVIDIA Device Plugin is deployed as a static pod, it is upgraded to an ACK component.

  • The DaemonSet uses a node selector (ack.node.gpu.schedule=default) to select GPU-accelerated nodes for deployment. By default, when you add a GPU-accelerated node to an ACK cluster, ACK automatically adds the ack.node.gpu.schedule=default label to the node when it runs the node initialization script.
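
To check whether a GPU-accelerated node already carries this label, you can run a command similar to the following (the node name is a placeholder):

    kubectl get node <NODE_NAME> --show-labels | grep ack.node.gpu.schedule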

Important
  • If your nodes run Ubuntu 22.04 or Red Hat Enterprise Linux (RHEL) 9.3 64-bit: NVIDIA Device Plugin automatically sets the environment variable NVIDIA_VISIBLE_DEVICES=all by default. Running systemctl daemon-reload or systemctl daemon-reexec on the node may trigger GPU device access failures and cause NVIDIA Device Plugin to stop running as expected. For resolution steps, see Why does the system prompt failed to initialize NVML: Unknown Error when I run a GPU container on Ubuntu 22.04?

  • If you upgraded the cluster from an earlier version to 1.32 before May 1, 2025, NVIDIA Device Plugin may be deployed both as static pods and as a DaemonSet in the cluster. You can run the following script to find the nodes on which it still runs as a static pod.

    #!/bin/bash
    # Print the name of each node whose NVIDIA Device Plugin pod is a static pod (sourced from a file).
    for i in $(kubectl get po -n kube-system -l component=nvidia-device-plugin | grep -v NAME | awk '{print $1}');do
        if kubectl get po $i -o yaml -n kube-system | grep 'kubernetes.io/config.source: file' &> /dev/null;then
            kubectl get pod $i -n kube-system -o jsonpath='{.spec.nodeName}{"\n"}'
        fi
    done

    Expected output:

    cn-beijing.10.12.XXX.XX
    cn-beijing.10.13.XXX.XX

    The output indicates that NVIDIA Device Plugin still runs as a static pod on some nodes. To migrate NVIDIA Device Plugin on such a node from a static pod to the DaemonSet, add the following label to the node:

    kubectl label nodes <NODE_NAME> ack.node.gpu.schedule=default
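
    After you add the label, you can confirm that a DaemonSet-managed NVIDIA Device Plugin pod is scheduled to the node. For example, a command like the following (the node name is a placeholder) lists the plugin pods on that node:

    kubectl get pod -n kube-system -l component=nvidia-device-plugin -o wide | grep <NODE_NAME>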

Upgrade NVIDIA Device Plugin

Kubernetes 1.32 and later

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the one you want to manage and click its name. In the left-side navigation pane, click Add-ons.

  3. On the Add-ons page, find the ack-nvidia-device-plugin card and click Upgrade.

  4. In the dialog box that appears, click OK.

Kubernetes 1.20 to 1.31

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Nodes > Nodes.

  3. Select the GPU-accelerated nodes requiring batch maintenance. In the lower part of the node list, click Batch Operations. In the Batch Operations dialog box, select Execute Shell Command and click Confirm.

    Important

    We recommend that you first upgrade NVIDIA Device Plugin on a small number of GPU-accelerated nodes. After the upgrade is completed, verify that NVIDIA Device Plugin runs as expected on these nodes. Then, upgrade it on the remaining GPU-accelerated nodes at a larger scale.

  4. In the CloudOps Orchestration Service (OOS) console that appears, select Execution Mode as Failure Pause, and then click Next: Parameter Settings.

  5. On the parameter settings page, select Run Shell Script and paste the provided sample script.

    Note

    You must set the RUN_PKG_VERSION parameter in the following script to the minor version run by your Kubernetes cluster, such as 1.30. Do not set the value to a patch version, such as 1.30.1. If you specify a patch version, script errors may occur.

    #!/bin/bash
    set -e
    
    RUN_PKG_VERSION=1.30
    
    function update_device_plugin() {
    	base_dir=/tmp/update_device_plugin
    	rm -rf $base_dir
    	mkdir -p $base_dir
    	cd $base_dir
    	region_id=$(curl -sSL 100.100.100.200/latest/meta-data/region-id 2> /dev/null || echo "")
    	if [[ $region_id == "" ]];then
    		echo "Error: failed to get region id,region id is null"
    		exit 1
    	fi
    	PKG_URL=https://aliacs-k8s-${region_id}.oss-${region_id}.aliyuncs.com/public/pkg/run/run-${RUN_PKG_VERSION}.tar.gz
    	curl -sSL --retry 3 --retry-delay 2 -o run.tar.gz $PKG_URL
    	tar -xf run.tar.gz
    
    	local dir=pkg/run/$RUN_PKG_VERSION/module
    	sed -i "s@registry.cn-hangzhou.aliyuncs.com/acs@registry-${region_id}-vpc.ack.aliyuncs.com/acs@g" $dir/nvidia-device-plugin.yml
    	mkdir -p /etc/kubernetes/device-plugin-backup
    	mkdir -p /etc/kubernetes/manifests
    	mv  /etc/kubernetes/manifests/nvidia-device-plugin.yml /etc/kubernetes/device-plugin-backup/nvidia-device-plugin.yml.$(date +%s)
    	sleep 5
    	cp -a $dir/nvidia-device-plugin.yml /etc/kubernetes/manifests
    	echo "succeeded to update device plugin"
    }
    
    if [ -f /etc/kubernetes/manifests/nvidia-device-plugin.yml ];then
    	update_device_plugin
    else
    	echo "skip to update device plugin"
    fi
  6. Click Next: Confirm. After verifying the information, click Create.

    You will be redirected to the Task Execution Management page to monitor the task status. If the Execution Output displays succeeded to update device plugin, the update was successful.

  7. Run the following command to verify that the Device Plugin on the GPU-accelerated node functions correctly.

    kubectl get nodes <NODE_NAME> -o jsonpath='{.metadata.name} ==> nvidia.com/gpu: {.status.allocatable.nvidia\.com/gpu}'

    Expected output:

    cn-hangzhou.172.16.XXX.XX ==> nvidia.com/gpu: 1

    If the GPU-accelerated node reports a non-zero value for the extended resource nvidia.com/gpu, the Device Plugin is operational.
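
    To check the GPU resources reported by all nodes at once, you can use a jsonpath query over all nodes, for example:

    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name} ==> nvidia.com/gpu: {.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'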

Restart NVIDIA Device Plugin

In ACK exclusive GPU scheduling scenarios, the deployment mode of NVIDIA Device Plugin depends on the Kubernetes version of your cluster. To restart NVIDIA Device Plugin, perform the following steps based on your cluster version:

Kubernetes 1.32 and later

  1. Run the following command to query the pod that runs NVIDIA Device Plugin on a node:

    kubectl get pod -n kube-system -l component=nvidia-device-plugin -o wide | grep <NODE>
  2. Run the following command to restart the pod that runs NVIDIA Device Plugin:

    kubectl delete po <DEVICE_PLUGIN_POD> -n kube-system
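
    Because the pod is managed by a DaemonSet in Kubernetes 1.32 and later, it is recreated automatically after deletion. You can wait for the plugin pods to become ready again with a command such as the following (the timeout value is an example):

    kubectl wait --for=condition=Ready pod -l component=nvidia-device-plugin -n kube-system --timeout=120s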

Kubernetes 1.20 to 1.31

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters. On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Nodes > Nodes.

  2. Select the GPU-accelerated nodes that you want to manage and click Batch Operations. In the Batch Operations dialog box, select Run Shell Scripts and click OK.

    Important

    We recommend that you first restart NVIDIA Device Plugin on a small number of GPU-accelerated nodes. After the restart is completed, verify that NVIDIA Device Plugin runs as expected on these nodes. Then, restart it on the remaining GPU-accelerated nodes at a larger scale.

  3. In the OOS console, select Suspend upon Failure for Execution Mode and click Next Step: Parameter Settings.

  4. In the Parameter Settings step, click Run Shell Script and copy the following sample script to the code editor:

    #!/bin/bash
    set -e
    
    if [ -f /etc/kubernetes/manifests/nvidia-device-plugin.yml ];then
    	cp -a /etc/kubernetes/manifests/nvidia-device-plugin.yml /etc/kubernetes
    	rm -rf /etc/kubernetes/manifests/nvidia-device-plugin.yml
    	sleep 5
    	mv /etc/kubernetes/nvidia-device-plugin.yml /etc/kubernetes/manifests
    	echo "the nvidia device is restarted"
    else
    	echo "no need to restart nvidia device plugin"
    fi
  5. Click Next Step: OK. Confirm the information and click Create. You are redirected to the Task Execution Management page, on which you can view the task status.

  6. Run the following command to check whether NVIDIA Device Plugin runs as normal on the specified GPU-accelerated node:

    kubectl get nodes <NODE_NAME> -o jsonpath='{.metadata.name} ==> nvidia.com/gpu: {.status.allocatable.nvidia\.com/gpu}'

    Expected output:

    cn-hangzhou.172.16.XXX.XX ==> nvidia.com/gpu: 1

    If the value of nvidia.com/gpu in the output is not 0, NVIDIA Device Plugin is running as expected.


Isolate GPU devices

Important

GPU device isolation is only supported by nvidia-device-plugin version 0.9.1 and later. For more information, see View the NVIDIA Device Plugin version.

In ACK exclusive GPU scheduling scenarios, you may need to isolate a specific GPU device on a node due to issues like device failure. ACK provides a manual isolation mechanism to prevent new GPU application pods from being assigned to the affected GPU card. Follow these steps:

On the target node, create or edit the unhealthyDevices.json file in the /etc/nvidia-device-plugin/ directory. Create the file if it does not exist. Make sure that unhealthyDevices.json uses the following JSON format:

{
  "index": ["x", "x", ...],
  "uuid": ["xxx", "xxx", ...]
}

In the JSON file, enter the index or the UUID of each device that you want to isolate (only one identifier is required per device). The changes take effect automatically after you save the file.
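
For example, you can query the index and UUID of each GPU on the node by running nvidia-smi, and then list either value in the file. The following sketch isolates the GPU with index 1 (the value is illustrative and assumes the format shown above):

    nvidia-smi --query-gpu=index,uuid --format=csv

    {
     "index": ["1"]
    }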

Once the settings are complete, verify the isolation by checking the reported nvidia.com/gpu resources on the Kubernetes Node.
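
After isolation, the number of nvidia.com/gpu resources reported by the node should decrease accordingly. You can check the value with the same command used elsewhere in this topic:

    kubectl get nodes <NODE_NAME> -o jsonpath='{.metadata.name} ==> nvidia.com/gpu: {.status.allocatable.nvidia\.com/gpu}'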

View the NVIDIA Device Plugin version

  • Kubernetes 1.32 and later

    For NVIDIA Device Plugin deployed as a DaemonSet, you can locate ack-nvidia-device-plugin on the Add-ons page of the console and check the current version on the component card.

  • Kubernetes 1.20 to 1.31

    For NVIDIA Device Plugin deployed as a static pod, you can run the following command to check the component version.

    kubectl get pods -n kube-system -l component=nvidia-device-plugin \
      -o jsonpath='{range .items[*]}{.spec.containers[0].image}{"\t"}{.spec.nodeName}{"\n"}{end}' \
      | awk -F'[:/]' '{split($NF, a, "-"); print a[1] "\t" $0}' \
      | sort -k1,1V \
      | cut -f2- \
      | awk -F'\t' '{split($1, img, ":"); print img[NF] "\t" $2}'
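
    If you only need the image and node name of each NVIDIA Device Plugin pod without sorting by version, a simpler variant of the same query is:

    kubectl get pods -n kube-system -l component=nvidia-device-plugin \
      -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.spec.containers[0].image}{"\n"}{end}'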

Modify the device checkpoint key for NVIDIA Device Plugin

The Device Plugin creates a checkpoint file on the node to record device allocations and corresponding pod information. By default, NVIDIA Device Plugin uses the UUID of the GPU as the unique identifier. Follow the steps below to change the identifier to the device's index, which can resolve issues like UUID loss during VM cold migration.
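
If you want to inspect the identifiers that are currently recorded on a node, you can view the kubelet device-plugin checkpoint file. The path below is the upstream kubelet default and is an assumption; it may differ in your environment:

    # Default kubelet checkpoint location for device plugins (assumed path).
    cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint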

Kubernetes 1.32 and later

  1. Run the following command to modify the environment variables of the ack-nvidia-device-plugin DaemonSet:

    kubectl edit daemonset ack-nvidia-device-plugin -n kube-system
  2. Add the CHECKPOINT_DEVICE_ID_STRATEGY environment variable to the container:

        env:
          - name: CHECKPOINT_DEVICE_ID_STRATEGY
            value: index
  3. Restart NVIDIA Device Plugin as described in Restart NVIDIA Device Plugin for the modification to take effect.

Kubernetes 1.20 to 1.31

  1. Check the image tag in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file on the target node to verify the Device Plugin version. If the version is 0.9.3 or later, no image update is required. Otherwise, update the image to the latest version v0.9.3-0dd4d5f5-aliyun.

  2. Modify the static pod's environment variables in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file. Add the environment variable CHECKPOINT_DEVICE_ID_STRATEGY as shown in the example code.

        env:
          - name: CHECKPOINT_DEVICE_ID_STRATEGY
            value: index
  3. Restart NVIDIA Device Plugin as described in Restart NVIDIA Device Plugin for the modification to take effect.
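
    To confirm that the variable is present on the running plugin pods after the restart, you can check the pod specs, for example:

    kubectl get pods -n kube-system -l component=nvidia-device-plugin -o yaml | grep -A 1 CHECKPOINT_DEVICE_ID_STRATEGY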

References