Container Service for Kubernetes: Use the default Kubernetes GPU scheduling

Last Updated: Mar 26, 2026

Container Service for Kubernetes (ACK) supports GPU scheduling and operations management through the standard Kubernetes extended resource request model. This topic describes how to deploy a GPU-accelerated TensorFlow job to verify that your cluster schedules GPU workloads correctly.

Prerequisites

Before you begin, ensure that you have:

  • An ACK cluster with at least one GPU node

  • kubectl configured to connect to the cluster

  • Access to the ACK console

Avoid bypassing standard GPU resource requests

For GPU nodes managed by an ACK cluster, request GPU resources only through the standard Kubernetes extended resource mechanism (nvidia.com/gpu in the resources.limits field). The following actions bypass this mechanism and introduce security risks:

  • Running GPU applications directly on nodes

  • Using docker, podman, or nerdctl to create containers or request GPU resources (for example, docker run --gpus all or docker run -e NVIDIA_VISIBLE_DEVICES=all)

  • Adding NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> to the env section of a pod's YAML file

  • Using the NVIDIA_VISIBLE_DEVICES environment variable to directly request GPU resources for a pod

  • Defaulting NVIDIA_VISIBLE_DEVICES to all in a container image when the variable is not set in the pod's YAML file

  • Setting privileged: true in the pod's securityContext and running a GPU program

Why it matters: GPU resources requested through these non-standard methods are not recorded in the scheduler's resource tracking. The mismatch between actual GPU allocation on a node and what the scheduler tracks can cause it to assign additional GPU workloads to the same node, leading to service failures from resource contention on the same GPU card. These methods may also trigger known errors reported by the NVIDIA community.
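For contrast, the scheduler-visible way to request a GPU is to declare it in resources.limits, as in the following minimal sketch (the pod name and image are placeholders for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example          # illustrative name
spec:
  containers:
  - name: cuda-app
    image: <your-gpu-image>  # placeholder; replace with your image
    resources:
      limits:
        nvidia.com/gpu: 1    # the scheduler records this allocation
```

Because the GPU is requested through the extended resource, the scheduler's view of the node stays consistent with actual usage, avoiding the contention issues described above.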

Verify GPU availability

Before deploying a workload, confirm that your GPU node exposes GPU capacity to the Kubernetes scheduler.

  1. List nodes in the cluster:

    kubectl get nodes
  2. Describe a GPU node to check its capacity:

    kubectl describe node <gpu-node-name>

    In the Capacity section, nvidia.com/gpu must show a non-zero value:

    Capacity:
      nvidia.com/gpu: 1

    If nvidia.com/gpu is missing or shows 0, the GPU device plugin may not be running correctly on that node. Possible causes include the NVIDIA device plugin DaemonSet not being deployed, driver issues, or node configuration problems. Resolve the underlying issue before proceeding.
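To check all nodes at once instead of describing them one by one, you can use a custom-columns query against a live cluster (the column names NODE and GPU are illustrative):

```shell
# List every node with its allocatable nvidia.com/gpu count.
# The backslash escapes the dot inside the resource name.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```

Nodes without a working device plugin show "<none>" in the GPU column.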

Deploy a GPU application

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Workloads > Deployments.

  3. On the Deployments page, click Create from YAML and paste the following manifest:

    apiVersion: v1
    kind: Pod
    metadata:
      name: tensorflow-mnist
      namespace: default
    spec:
      containers:
      - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
        name: tensorflow-mnist
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            nvidia.com/gpu: 1  # Request one GPU card for this container.
        workingDir: /root
      restartPolicy: Always

    Note that the GPU is declared only in limits, not in requests. This is a Kubernetes rule for extended resources such as nvidia.com/gpu: if you set requests, it must equal limits, and when only limits is set, the request defaults to the same value. The scheduler therefore uses the limits value as the effective request.
  4. In the left navigation pane, choose Workloads > Pods. Find the pod you created and click its name.

  5. Click the Logs tab. It may take a few minutes for the image to pull and the pod to start. When the pod is running, the log output confirms that the job is using the GPU correctly.

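As an additional check, assuming the container image includes the NVIDIA utilities, you can run nvidia-smi inside the running pod to confirm that the container sees exactly one GPU:

```shell
# Run nvidia-smi inside the pod created above.
kubectl exec tensorflow-mnist -- nvidia-smi
```

The output should list a single GPU, matching the nvidia.com/gpu: 1 limit in the manifest.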

What's next

  • To schedule pods to specific GPU node types in a heterogeneous cluster, use node labels and node selectors. Label GPU nodes with the accelerator type (for example, kubectl label nodes <node-name> accelerator=<gpu-model>), then add a nodeSelector to your pod spec.

  • To explore advanced GPU scheduling options in ACK, such as GPU sharing and isolation, see the ACK GPU scheduling documentation.
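The node-label approach in the first item can be sketched as follows; the label value nvidia-tesla-v100 and the pod name are illustrative, and the image is a placeholder:

```yaml
# Hypothetical example: pin a pod to nodes labeled accelerator=nvidia-tesla-v100,
# applied beforehand with: kubectl label nodes <node-name> accelerator=nvidia-tesla-v100
apiVersion: v1
kind: Pod
metadata:
  name: gpu-on-v100            # illustrative name
spec:
  nodeSelector:
    accelerator: nvidia-tesla-v100  # must match the node label exactly
  containers:
  - name: app
    image: <your-gpu-image>    # placeholder
    resources:
      limits:
        nvidia.com/gpu: 1
```

If no node carries the matching label, the pod stays Pending, which makes mis-labeled clusters easy to spot.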