For AI training and inference tasks in which multiple applications need to share GPU resources, the traditional device plugin model has scheduling limitations. By deploying the NVIDIA DRA driver in your Container Service for Kubernetes (ACK) cluster, you can use the Kubernetes Dynamic Resource Allocation (DRA) API to allocate GPUs to pods dynamically and at a fine granularity, which improves GPU utilization and reduces costs.
How it works
DRA is a Kubernetes API that extends the PersistentVolume model to generic resources such as GPUs, allowing pods to request and share them. Compared to the traditional device plugin model, DRA offers a more flexible and fine-grained way to request resources.
NVIDIA DRA Driver for GPUs implements the DRA API to provide a modern way to allocate GPUs for Kubernetes workloads. It supports controlled sharing and dynamic reconfiguration of GPUs.
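For comparison, the fragments below sketch how a pod asks for one GPU in each model. The first uses the extended resource that the device plugin reports; the second references a DRA claim that is resolved against a DeviceClass. The claim and template names (gpu, single-gpu) match the example used later in this topic; treat this as an illustrative sketch, not a complete manifest.

# Device plugin model: request the extended resource inside a container spec.
resources:
  limits:
    nvidia.com/gpu: 1

# DRA model: the container references a claim by name ...
resources:
  claims:
  - name: gpu
# ... and the pod spec maps that name to a ResourceClaimTemplate.
resourceClaims:
- name: gpu
  resourceClaimTemplateName: single-gpu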
Prerequisite
You have an ACK managed cluster running Kubernetes version 1.34 or later.
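The resource.k8s.io/v1 API used in this topic is served starting from Kubernetes 1.34. You can confirm the control plane version that your cluster reports before you continue.

kubectl version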
Configure the DRA GPU scheduling environment
Step 1: Create a GPU node pool
Create a node pool for DRA-based GPU scheduling. Use a node label to disable the default GPU device plugin, preventing GPUs from being double-counted.
Log on to the ACK console. In the left navigation pane, choose Clusters. Select the target cluster and choose Nodes > Node Pools.
Click Create Node Pool and select an ACK-supported GPU instance type. Keep the default values for the other parameters.
Click Specify Instance Type and enter an instance type, such as ecs.gn7i-c8g1.2xlarge. Set Expected Nodes to 1.
Expand Advanced Options (Optional). In the Node Labels section, enter the key-value pair ack.node.gpu.schedule: disabled. This label disables the default exclusive GPU scheduling feature and prevents the GPU device plugin from reporting resources.
Important: Enabling both the device plugin and DRA causes GPU resources to be allocated twice. Always disable GPU device plugin resource reporting on DRA nodes.
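After the node pool is created and the node is ready, you can optionally confirm that the label took effect before you install the driver.

# List the nodes that carry the label and will run the DRA kubelet plugin.
kubectl get nodes -l ack.node.gpu.schedule=disabled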
Step 2: Install the NVIDIA DRA driver
Install the NVIDIA DRA GPU driver using Helm. This driver provides the implementation of the DRA API for your cluster.
Get a cluster kubeconfig and connect to the cluster using kubectl.
Install the Helm CLI if you haven't already.
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Add and update the NVIDIA Helm repository.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update

Install the NVIDIA DRA GPU driver version 25.3.2. This command configures the driver to run only on the nodes that you labeled in the previous step.

helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.2" --create-namespace --namespace nvidia-dra-driver-gpu \
  --set gpuResourcesEnabledOverride=true \
  --set controller.affinity=null \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=ack.node.gpu.schedule" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=In" \
  --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values[0]=disabled"

Important: The --set controller.affinity=null parameter removes the node affinity declaration from the controller workload. This allows the controller workload to be scheduled on any node, which might cause stability issues. Evaluate the impact before you perform this operation in a production environment.

The following output indicates that the driver is installed.

NAME: nvidia-dra-driver-gpu
LAST DEPLOYED: Tue Oct 14 20:42:13 2025
NAMESPACE: nvidia-dra-driver-gpu
STATUS: deployed
REVISION: 1
TEST SUITE: None
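If you prefer to keep this configuration in a file rather than in long --set flags, the following values file should be equivalent. The file name values-dra.yaml is arbitrary. Helm treats a null value as an instruction to drop the chart's default for that key; if your Helm version handles null in values files differently, keep --set controller.affinity=null on the command line instead.

# values-dra.yaml: same settings as the --set flags above.
gpuResourcesEnabledOverride: true
controller:
  affinity: null   # remove the chart's default controller affinity (see the note above)
kubeletPlugin:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: ack.node.gpu.schedule
            operator: In
            values:
            - disabled

Install the chart with the values file:

helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.2" \
  --create-namespace --namespace nvidia-dra-driver-gpu \
  -f values-dra.yaml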
Step 3: Verify the environment
Verify that the NVIDIA DRA driver is running as expected and GPU resources are reported to the Kubernetes cluster.
Ensure that all pods for the DRA GPU driver are in the Running state.

kubectl get pod -n nvidia-dra-driver-gpu

Confirm that the DRA-related resources are created.

kubectl get deviceclass,resourceslice

Expected output:

NAME                                                                     AGE
deviceclass.resource.k8s.io/compute-domain-daemon.nvidia.com             60s
deviceclass.resource.k8s.io/compute-domain-default-channel.nvidia.com    60s
deviceclass.resource.k8s.io/gpu.nvidia.com                               60s
deviceclass.resource.k8s.io/mig.nvidia.com                               60s

NAME                                                                                     NODE                      DRIVER                      POOL                      AGE
resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-compute-domain.nvidia.com-htjqn    cn-beijing.10.11.34.156   compute-domain.nvidia.com   cn-beijing.10.11.34.156   57s
resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj               cn-beijing.10.11.34.156   gpu.nvidia.com              cn-beijing.10.11.34.156   57s

View the details of the GPU resource reporting in the current environment. Replace cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj with the actual name of the resourceslice resource object.

kubectl get resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj -o yaml
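To see only the devices that the slice advertises instead of the full YAML, a jsonpath query like the one below can help. The field path follows the resource.k8s.io/v1 ResourceSlice schema, and the exact device names depend on the NVIDIA driver's naming; again, replace the slice name with your own.

# Print the names of the devices published in this ResourceSlice.
kubectl get resourceslice cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj \
  -o jsonpath='{.spec.devices[*].name}'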
Deploy a workload using a DRA GPU
This section shows how to deploy a workload that requests GPU resources using DRA. This is done by creating a ResourceClaimTemplate to automatically create a ResourceClaim.

Declare a ResourceClaimTemplate that requests a single GPU. Save the following content as resource-claim-template.yaml.

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - exactly:
          allocationMode: ExactCount
          deviceClassName: gpu.nvidia.com
          count: 1
        name: gpu

Create the ResourceClaimTemplate in the cluster.

kubectl apply -f resource-claim-template.yaml

Create a file named resource-claim-template-pod.yaml.

apiVersion: v1
kind: Pod
metadata:
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu

Create a workload that references the ResourceClaimTemplate.

kubectl apply -f resource-claim-template-pod.yaml

View the automatically created ResourceClaim. Find the ResourceClaim in the current namespace, such as pod1-gpu-wstqm.

kubectl get resourceclaim

View the ResourceClaim details. Replace pod1-gpu-wstqm with the actual name of the ResourceClaim resource object.

kubectl describe resourceclaim pod1-gpu-wstqm

Check the pod's logs to verify that it has access to the GPU. The expected output should list the allocated GPU, such as GPU 0: NVIDIA A10.

kubectl logs pod1
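The walkthrough above gives pod1 its own GPU through a per-pod template. If several pods should share one GPU, the DRA API also allows a standalone ResourceClaim to be referenced by name from more than one pod; the pods are then scheduled onto the node that holds the allocated device and see the same GPU. The following is a minimal sketch of that pattern (the names shared-gpu, pod-a, and pod-b are placeholders and are not part of the verified steps above).

apiVersion: resource.k8s.io/v1
kind: ResourceClaim          # standalone claim, so it can be shared
metadata:
  name: shared-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com
        allocationMode: ExactCount
        count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
spec:
  containers:
  - name: ctr
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-gpu   # reference the claim by name instead of a template
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
spec:
  containers:
  - name: ctr
    image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimName: shared-gpu   # both pods consume the same allocated GPU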
(Optional) Clean up the environment
When you are finished, delete the resources you created to avoid unnecessary costs.
Delete the deployed workloads and claim template.
kubectl delete pod pod1
kubectl delete resourceclaimtemplate single-gpu

Uninstall the NVIDIA DRA GPU driver.

helm uninstall nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu
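helm uninstall removes the chart's workloads but not the namespace that was created with --create-namespace, and deleting Kubernetes objects alone does not stop charges for the GPU instance. If the environment is no longer needed, you can also remove the namespace and then scale in or delete the GPU node pool in the ACK console.

kubectl delete namespace nvidia-dra-driver-gpu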