In AI training and inference scenarios, multiple applications often share GPU resources. Deploy the NVIDIA Dynamic Resource Allocation (DRA) driver in your ACK cluster to overcome the scheduling limits of traditional device plugins. The Kubernetes DRA API dynamically allocates GPUs across pods and controls resources at a fine-grained level, improving GPU utilization and reducing costs.
How it works
Dynamic Resource Allocation (DRA) generalizes the Kubernetes persistent volume API to generic resources such as GPUs. The experience is similar to dynamic volume provisioning: just as you use a PersistentVolumeClaim to claim storage from a StorageClass, you use a ResourceClaim to claim GPU resources from a DeviceClass.
DRA supports more flexible and fine-grained resource allocation than traditional device plugins:
- Flexible device filtering: Use the Common Expression Language (CEL) to filter devices by specific attributes.
- Device sharing: Share the same GPU across multiple containers or pods by referencing the same ResourceClaim.
- Simplified pod requests: Specify resource requirements declaratively without per-container device counts.
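As a rough sketch of device filtering, the following snippet writes a ResourceClaimTemplate that only matches GPUs whose product name is NVIDIA A10. The template name a10-gpu, the file name, and the productName attribute are assumptions made for illustration. Check the attributes that the driver actually publishes in a ResourceSlice before relying on the expression.

```shell
# Write a ResourceClaimTemplate with a CEL selector (all names are illustrative).
cat > cel-filtered-claim-template.yaml <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: a10-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          allocationMode: ExactCount
          count: 1
          selectors:
          - cel:
              # Assumed attribute name; verify it in your ResourceSlice output.
              expression: 'device.attributes["gpu.nvidia.com"].productName == "NVIDIA A10"'
EOF

# Apply it after the DRA driver is installed:
# kubectl apply -f cel-filtered-claim-template.yaml
```

A pod would reference this template through resourceClaimTemplateName in the same way as any other ResourceClaimTemplate.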
NVIDIA DRA Driver for GPUs implements the DRA API for Kubernetes workloads. It supports controlled GPU sharing and dynamic GPU reconfiguration.
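To illustrate device sharing, the sketch below writes a standalone ResourceClaim and a pod whose two containers both reference it, so both containers see the same GPU. The names shared-gpu, shared-gpu-pod, ctr0, and ctr1 and the file name are hypothetical.

```shell
# Write a shared ResourceClaim and a pod that uses it (names are illustrative).
cat > shared-gpu.yaml <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
        allocationMode: ExactCount
        count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod
spec:
  containers:
  - name: ctr0
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  - name: ctr1
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  # Both containers resolve "gpu" to the same pre-created ResourceClaim.
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-gpu
EOF

# Apply it after the DRA driver is installed:
# kubectl apply -f shared-gpu.yaml
```

Additional pods can share the same GPU by listing resourceClaimName: shared-gpu in their own resourceClaims section.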
Prerequisites
Before you begin, ensure that you have:
- An ACK managed cluster running Kubernetes 1.34 or later
- kubectl installed and configured with your cluster's kubeconfig
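One possible way to check the version prerequisite from a script is a small helper that compares a version string against 1.34. The function name k8s_version_at_least_1_34 is hypothetical, and the commented usage assumes that the kubelet version of the first node is representative of the cluster.

```shell
# Return 0 if a Kubernetes version string such as "v1.34.2" is at least 1.34.
k8s_version_at_least_1_34() {
  _ver="${1#v}"              # strip the leading "v"
  _major="${_ver%%.*}"       # text before the first dot
  _rest="${_ver#*.}"         # text after the first dot
  _minor="${_rest%%.*}"      # text up to the next dot
  _minor="${_minor%%[!0-9]*}" # drop non-numeric suffixes such as "+"
  [ "$_major" -gt 1 ] || { [ "$_major" -eq 1 ] && [ "$_minor" -ge 34 ]; }
}

# Example usage against a live cluster:
# ver="$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}')"
# k8s_version_at_least_1_34 "$ver" && echo "version OK: $ver"
#
# To confirm that the DRA API group is served:
# kubectl api-resources --api-group=resource.k8s.io
```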
Set up the DRA GPU scheduling environment
Step 1: Create a GPU node pool
Create a node pool that uses DRA GPU scheduling. Add a node label to disable default GPU device plugin resource reporting and prevent duplicate GPU allocation.
- Log on to the Container Service console. In the left navigation pane, choose Clusters. Click the cluster name, then choose Node management > Node Pools.
- Click Create Node Pool. Select a GPU instance type from the GPU instance types supported by ACK. Keep all other settings at their default values.
- Click Specify Instance Type. Enter an instance type name, such as ecs.gn7i-c8g1.2xlarge. Set Expected Nodes to 1.
- Click Advanced to expand the node pool configuration. Under Node Labels, add the following label:
    ack.node.gpu.schedule: disabled
  This disables exclusive GPU scheduling and stops GPU device plugin resource reporting on the node.
  Important: Running both the device plugin and DRA on the same node causes duplicate GPU allocation. Always add this label to nodes where DRA is enabled.
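After the node joins the cluster, a quick check along the following lines can confirm that the label is present. The function name show_dra_nodes is hypothetical. The expectation that the second column is empty is an assumption based on the statement above that the label stops device plugin resource reporting.

```shell
# Print each node labeled for DRA, followed by the nvidia.com/gpu count
# that the node reports as allocatable (empty if it reports none).
show_dra_nodes() {
  kubectl get nodes -l ack.node.gpu.schedule=disabled \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# Example usage:
# show_dra_nodes
```

If no node is listed, the label was not applied to the node pool.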
Step 2: Install the NVIDIA DRA driver
Install the NVIDIA DRA GPU driver, which provides the concrete implementation of the DRA API.
- Install the Helm CLI.
    curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
- Add the NVIDIA Helm repository and update it.
    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
      && helm repo update
- Install version 25.3.2 of the NVIDIA DRA GPU driver.
  Important: --set controller.affinity=null removes the node affinity constraint from the controller workload, allowing it to schedule on any node. Evaluate this setting before use in production environments, as it may affect stability.
    helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.2" --create-namespace --namespace nvidia-dra-driver-gpu \
      --set gpuResourcesEnabledOverride=true \
      --set controller.affinity=null \
      --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=ack.node.gpu.schedule" \
      --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=In" \
      --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values[0]=disabled"
  A successful installation produces output similar to:
    NAME: nvidia-dra-driver-gpu
    LAST DEPLOYED: Tue Oct 14 20:42:13 2025
    NAMESPACE: nvidia-dra-driver-gpu
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
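The long --set flags can be hard to review. As an alternative sketch, the same settings can be kept in a values file and passed with -f. The file name dra-values.yaml is hypothetical, and the file is intended to be equivalent to the flags above; compare the rendered manifests before using it in place of the documented command.

```shell
# Write the chart overrides to a values file (file name is illustrative).
cat > dra-values.yaml <<'EOF'
gpuResourcesEnabledOverride: true
controller:
  # null removes the chart's default controller affinity.
  affinity: null
kubeletPlugin:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # Schedule the kubelet plugin only on nodes labeled for DRA.
          - key: ack.node.gpu.schedule
            operator: In
            values:
            - disabled
EOF

# Install with the values file instead of the --set flags:
# helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.2" \
#   --create-namespace --namespace nvidia-dra-driver-gpu -f dra-values.yaml
```

Keeping the overrides in a file also makes later upgrades repeatable, because the same file can be passed to helm upgrade.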
Step 3: Verify the environment
Verify that the NVIDIA DRA driver is running and GPU resources are reported to the cluster.
- Check that the DRA GPU driver pods are running.
    kubectl get pod -n nvidia-dra-driver-gpu
  All pods should show a Running status. If any pod is in Pending or CrashLoopBackOff, check whether the node label ack.node.gpu.schedule: disabled was applied correctly in Step 1.
- Check that DRA-related resources are created.
    kubectl get deviceclass,resourceslice
  The expected output is:
    NAME                                                                    AGE
    deviceclass.resource.k8s.io/compute-domain-daemon.nvidia.com            60s
    deviceclass.resource.k8s.io/compute-domain-default-channel.nvidia.com   60s
    deviceclass.resource.k8s.io/gpu.nvidia.com                              60s
    deviceclass.resource.k8s.io/mig.nvidia.com                              60s

    NAME                                                                                   NODE                      DRIVER                      POOL                      AGE
    resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-compute-domain.nvidia.com-htjqn   cn-beijing.10.11.34.156   compute-domain.nvidia.com   cn-beijing.10.11.34.156   57s
    resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj              cn-beijing.10.11.34.156   gpu.nvidia.com              cn-beijing.10.11.34.156   57s
  If the deviceclass resources do not appear, DRA may not be enabled on your cluster. Confirm that your cluster runs Kubernetes 1.34 or later. If no resourceslice resources appear, the driver pod may not be running. Recheck Step 2.
- View GPU resource details from a ResourceSlice. Replace cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj with your actual ResourceSlice name from the previous step.
    kubectl get resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj -o yaml
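The full YAML output can be long. A compact summary such as the following sketch lists the node, driver, and device names of every ResourceSlice without copying slice names by hand. The function name list_dra_devices is hypothetical, and the field paths assume the resource.k8s.io/v1 ResourceSlice schema.

```shell
# Print one line per ResourceSlice: node name, driver name, device names.
list_dra_devices() {
  kubectl get resourceslice \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.spec.driver}{"\t"}{.spec.devices[*].name}{"\n"}{end}'
}

# Example usage:
# list_dra_devices
```

Each line for the gpu.nvidia.com driver should list one device name per GPU that the node exposes.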
Deploy a workload that uses DRA GPU scheduling
The following steps use a ResourceClaimTemplate to automatically create a ResourceClaim per pod, so each pod gets independent access to a separate GPU.
- Create a file named resource-claim-template.yaml.
    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: single-gpu
    spec:
      spec:
        devices:
          requests:
          - exactly:
              allocationMode: ExactCount
              deviceClassName: gpu.nvidia.com
              count: 1
            name: gpu
  Apply the template to the cluster.
    kubectl apply -f resource-claim-template.yaml
- Create a file named resource-claim-template-pod.yaml.
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod1
      labels:
        app: pod
    spec:
      containers:
      - name: ctr
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu
  Deploy the pod.
    kubectl apply -f resource-claim-template-pod.yaml
- List the ResourceClaim objects created automatically for the pod.
    kubectl get resourceclaim
  The output includes an auto-generated ResourceClaim such as pod1-gpu-wstqm. To inspect it, run the following command, replacing pod1-gpu-wstqm with your actual ResourceClaim name:
    kubectl describe resourceclaim pod1-gpu-wstqm
- Verify that the pod is using the GPU.
    kubectl logs pod1
  The expected output is:
    GPU 0: NVIDIA A10
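Because the generated ResourceClaim name ends in a random suffix, it can be convenient to look it up from the pod status instead of copying it. The sketch below does that and then prints the devices allocated to the claim. The function names pod_claim_name and claim_devices are hypothetical, and the field paths assume the pod status and resource.k8s.io/v1 ResourceClaim schemas.

```shell
# Print the name of the ResourceClaim generated for the pod's first claim entry.
pod_claim_name() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.resourceClaimStatuses[0].resourceClaimName}'
}

# Print the names of the devices allocated to a ResourceClaim.
claim_devices() {
  kubectl get resourceclaim "$1" \
    -o jsonpath='{.status.allocation.devices.results[*].device}'
}

# Example usage:
# claim="$(pod_claim_name pod1)"
# kubectl describe resourceclaim "$claim"
# claim_devices "$claim"
```

An empty result from claim_devices would suggest that the claim has not been allocated yet, for example because the pod is still pending.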
(Optional) Clean up the environment
After testing, delete unused resources to avoid unnecessary charges.
- Delete the pod and ResourceClaimTemplate.
    kubectl delete pod pod1
    kubectl delete resourceclaimtemplate single-gpu
- Uninstall the NVIDIA DRA GPU driver.
    helm uninstall nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu