Container Service for Kubernetes: Schedule GPUs using DRA

Last Updated: Oct 22, 2025

For AI training and inference tasks in which multiple applications need to share GPU resources, the traditional device plugin model has scheduling limitations. By deploying the NVIDIA DRA driver in your Container Service for Kubernetes (ACK) cluster, you can use the Kubernetes Dynamic Resource Allocation (DRA) API to allocate GPUs to pods dynamically and at a fine granularity, which improves GPU utilization and reduces costs.

How it works

  • DRA is a Kubernetes API that extends the PersistentVolume model to generic resources such as GPUs, allowing pods to request and share them. Compared to the traditional device plugin model, DRA offers a more flexible and fine-grained way to request resources.

  • NVIDIA DRA Driver for GPUs implements the DRA API to provide a modern way to allocate GPUs for Kubernetes workloads. It supports controlled sharing and dynamic reconfiguration of GPUs.
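
The difference shows up in how a pod asks for a GPU. The following sketch is illustrative only (the pod names, image, and the claim name example-gpu-claim are placeholders): the first manifest requests a GPU as an extended resource through the device plugin, while the second references a ResourceClaim, so the device request itself lives in a separate, reusable object.

    # Traditional device plugin model: the GPU is requested as an
    # opaque extended resource directly in the container spec.
    apiVersion: v1
    kind: Pod
    metadata:
      name: device-plugin-example
    spec:
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1
    ---
    # DRA model: the container points at a named claim, and the pod maps
    # that name to a ResourceClaim object that describes the device request.
    apiVersion: v1
    kind: Pod
    metadata:
      name: dra-example
    spec:
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["sleep", "infinity"]
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimName: example-gpu-claim  # a pre-created ResourceClaim; the walkthrough below uses a template instead

The rest of this topic uses a ResourceClaimTemplate rather than a standalone ResourceClaim, so each pod gets its own automatically created claim.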

Prerequisite

You have an ACK managed cluster running Kubernetes version 1.34 or later.
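
If you already have kubectl access to the cluster (a kubeconfig is obtained in Step 2 below), the following optional commands can confirm the server version and that the resource.k8s.io/v1 API group used by DRA is served:

    # Print client and server versions; the server should report v1.34 or later.
    kubectl version

    # List the DRA resource types (deviceclasses, resourceclaims,
    # resourceclaimtemplates, resourceslices) served by the API server.
    kubectl api-resources --api-group=resource.k8s.io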

Configure the DRA GPU scheduling environment

Step 1: Create a GPU node pool

Create a node pool for DRA-based GPU scheduling. Use a node label to disable the default GPU device plugin, preventing GPUs from being double-counted.

  1. Log on to the ACK console. In the left navigation pane, choose Clusters. Select the target cluster and choose Nodes > Node Pools.

  2. Click Create Node Pool and select an ACK-supported GPU instance type. Keep the default values for the other parameters.

    1. Click Specify Instance Type and enter an instance type, such as ecs.gn7i-c8g1.2xlarge. Set Expected Nodes to 1.

    2. Expand Advanced Options (Optional). In the Node Labels section, set the label key to ack.node.gpu.schedule and the value to disabled. This label disables the default exclusive GPU scheduling feature and prevents the GPU device plugin from reporting resources.

      Enabling both the device plugin and DRA causes GPU resources to be allocated twice. Disable GPU device plugin resource reporting on DRA nodes.
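
After the node is ready, you can check that the label was applied and that the node no longer advertises the device plugin's extended GPU resource. The node name below is a placeholder.

    # List nodes that carry the label that disables the default GPU device plugin.
    kubectl get nodes -l ack.node.gpu.schedule=disabled

    # Replace <node-name> with a node from the previous command. Empty output is
    # expected, because the device plugin no longer reports nvidia.com/gpu.
    kubectl describe node <node-name> | grep nvidia.com/gpu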

Step 2: Install the NVIDIA DRA driver

Install the NVIDIA DRA GPU driver using Helm. This driver provides the implementation of the DRA API for your cluster.

  1. Get a cluster kubeconfig and connect to the cluster using kubectl.

  2. Install the Helm CLI if you haven't already.

    curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
  3. Add and update the NVIDIA Helm repository.

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
  4. Install the NVIDIA DRA GPU driver version 25.3.2. This command configures the driver to run only on the nodes you labeled in the previous step.

    helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.2" --create-namespace --namespace nvidia-dra-driver-gpu \
        --set gpuResourcesEnabledOverride=true \
        --set controller.affinity=null \
        --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=ack.node.gpu.schedule" \
        --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=In" \
        --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values[0]=disabled"
    Important

    The --set controller.affinity=null parameter removes the node affinity declaration from the controller workload. This allows the controller workload to be scheduled on any node, which might cause stability issues. Evaluate the impact before you perform this operation in a production environment. A more restrictive alternative is sketched after this procedure.

    The following output indicates that the driver is installed.

    NAME: nvidia-dra-driver-gpu
    LAST DEPLOYED: Tue Oct 14 20:42:13 2025
    NAMESPACE: nvidia-dra-driver-gpu
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
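
If removing the controller's node affinity entirely is too permissive for your environment, one alternative is to override controller.affinity with an explicit node affinity instead of null, so that the controller runs only on nodes that you label yourself. The following is a sketch: the label schedule-dra-controller=enabled is a hypothetical example, and the values path simply mirrors the kubeletPlugin.affinity override shown above. Verify it against the chart's values before using it in production.

    # Hypothetical example: label the node(s) that should run the DRA controller.
    kubectl label node <controller-node-name> schedule-dra-controller=enabled

    # Install the driver with an explicit controller affinity instead of
    # --set controller.affinity=null.
    helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.2" \
        --create-namespace --namespace nvidia-dra-driver-gpu \
        --set gpuResourcesEnabledOverride=true \
        --set "controller.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=schedule-dra-controller" \
        --set "controller.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=In" \
        --set "controller.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values[0]=enabled" \
        --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=ack.node.gpu.schedule" \
        --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=In" \
        --set "kubeletPlugin.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values[0]=disabled"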

Step 3: Verify the environment

Verify that the NVIDIA DRA driver is running as expected and GPU resources are reported to the Kubernetes cluster.

  1. Ensure that all pods for the DRA GPU driver are in the Running state.

    kubectl get pod -n nvidia-dra-driver-gpu
  2. Confirm that the DRA-related resources are created.

    kubectl get deviceclass,resourceslice

    Expected output:

    NAME                                                                    AGE
    deviceclass.resource.k8s.io/compute-domain-daemon.nvidia.com            60s
    deviceclass.resource.k8s.io/compute-domain-default-channel.nvidia.com   60s
    deviceclass.resource.k8s.io/gpu.nvidia.com                              60s
    deviceclass.resource.k8s.io/mig.nvidia.com                              60s
    
    NAME                                                                                   NODE                      DRIVER                      POOL                      AGE
    resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-compute-domain.nvidia.com-htjqn   cn-beijing.10.11.34.156   compute-domain.nvidia.com   cn-beijing.10.11.34.156   57s
    resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj              cn-beijing.10.11.34.156   gpu.nvidia.com              cn-beijing.10.11.34.156   57s
  3. View the details of the GPU resources reported in the current environment.

    Replace cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj with the actual name of the resourceslice resource object.
    kubectl get resourceslice.resource.k8s.io/cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj -o yaml
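
The ResourceSlice YAML lists each GPU as a device entry together with the attributes and capacity that the driver reports. To print only the advertised device names, a jsonpath query such as the following can be used (replace the slice name with the one from your cluster):

    kubectl get resourceslice cn-beijing.1x.1x.3x.1x-gpu.nvidia.com-bnwhj \
      -o jsonpath='{.spec.devices[*].name}{"\n"}'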

Deploy a workload using a DRA GPU

This section shows how to deploy a workload that requests GPU resources through DRA. You create a ResourceClaimTemplate, and Kubernetes automatically creates a ResourceClaim from it for the pod that references it.

  1. Declare a ResourceClaimTemplate that requests a single GPU. Save the following content as resource-claim-template.yaml.

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: single-gpu
    spec:
      spec:
        devices:
          requests:
          - exactly:
              allocationMode: ExactCount
              deviceClassName: gpu.nvidia.com
              count: 1
            name: gpu

    Create the ResourceClaimTemplate in the cluster.

    kubectl apply -f resource-claim-template.yaml
  2. Create a file named resource-claim-template-pod.yaml.

    apiVersion: v1
    kind: Pod
    metadata:
      name: pod1
      labels:
        app: pod
    spec:
      containers:
      - name: ctr
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: gpu
      resourceClaims:
      - name: gpu
        resourceClaimTemplateName: single-gpu

    Create a workload that references the ResourceClaimTemplate.

    kubectl apply -f resource-claim-template-pod.yaml
  3. View the automatically created ResourceClaim.

    1. Find the ResourceClaim in the current namespace, such as pod1-gpu-wstqm.

      kubectl get resourceclaim
    2. View the ResourceClaim details.

      Replace pod1-gpu-wstqm with the actual name of the ResourceClaim resource object.
      kubectl describe resourceclaim pod1-gpu-wstqm
  4. Check the pod's logs to verify that it has access to the GPU. The expected output should list the allocated GPU, such as GPU 0: NVIDIA A10.

    kubectl logs pod1
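
Beyond requesting a fixed number of GPUs, a request can also filter devices with CEL selectors against the attributes that the driver publishes in its ResourceSlices. The following sketch assumes that the driver reports a string attribute named productName under the gpu.nvidia.com domain; confirm the exact attribute names in the ResourceSlice YAML from Step 3 before relying on them.

    apiVersion: resource.k8s.io/v1
    kind: ResourceClaimTemplate
    metadata:
      name: single-a10-gpu
    spec:
      spec:
        devices:
          requests:
          - name: gpu
            exactly:
              allocationMode: ExactCount
              count: 1
              deviceClassName: gpu.nvidia.com
              # The attribute name below is an assumption; check the attributes
              # in your ResourceSlice output (Step 3) for the exact names.
              selectors:
              - cel:
                  expression: device.attributes["gpu.nvidia.com"].productName.matches("A10")

A pod references this template the same way pod1 does above, only with resourceClaimTemplateName set to single-a10-gpu.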

(Optional) Clean up the environment

When you are finished, delete the resources you created to avoid unnecessary costs.

  • Delete the deployed workloads and claim template.

    kubectl delete pod pod1
    kubectl delete resourceclaimtemplate single-gpu
  • Uninstall the NVIDIA DRA GPU driver.

    helm uninstall nvidia-dra-driver-gpu -n nvidia-dra-driver-gpu
  • Remove or release node resources.
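
After the driver is uninstalled, the DeviceClasses installed by the Helm chart and the ResourceSlices published by the kubelet plugin should eventually be removed. You can confirm this, for example, with:

    kubectl get deviceclass
    kubectl get resourceslice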