
Container Compute Service:Specify GPU models and driver versions for ACS GPU-accelerated pods

Last Updated:Nov 18, 2025

Container Compute Service (ACS) provides serverless container compute power, which significantly reduces infrastructure management and maintenance costs. When you configure GPU resources for pods, ACS lets you declare the GPU model and driver version. This topic describes how to specify a GPU model and driver version when you create a pod.

GPU models

ACS supports a variety of GPU models. You can request GPU resources on demand or configure capacity reservations. How you specify a GPU model depends on the compute class of the pod:

  • GPU-HPN

    You can only apply for node reservations. After you apply for a node reservation, you must associate it with the cluster. Each reservation hosts pods as an independent virtual node in the cluster. For more information, see GPU-HPN capacity reservations.

  • GPU-accelerated

    You can apply for GPU resources on demand or configure capacity reservations. After a pod is created, GPU resources are automatically deducted from the capacity reservation.

Note

To view the supported GPU models, submit a ticket.

Specify a GPU model for a pod

For GPU-HPN pods, you can only apply for node reservations. Each reservation hosts pods as a virtual node in the cluster. The label of the virtual node contains the GPU model. You can configure node affinity scheduling to schedule pods to the virtual node. For more information, see Schedule pods to GPU-HPN virtual nodes with attribute labels.
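The node affinity approach mentioned above can be sketched as follows. This is a minimal example, assuming the GPU-HPN virtual node exposes the alibabacloud.com/gpu-model-series label; the model value is for reference only.

```yaml
# Sketch: schedule a GPU-HPN pod with node affinity (assumes the virtual
# node carries the alibabacloud.com/gpu-model-series label).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-hpn-affinity-example
  labels:
    alibabacloud.com/compute-class: "gpu-hpn"
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: alibabacloud.com/gpu-model-series
            operator: In
            values:
            - "example-model"  # The value is for reference only.
  ...
```

Compared with nodeSelector, matchExpressions also supports set-based operators such as In with multiple values, which is useful when several GPU models are acceptable.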

You must explicitly specify the GPU model in the pod's labels or nodeSelector field, depending on the compute class.

  • GPU-HPN

    Parameter: spec.nodeSelector

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        # Set the compute class to GPU-HPN.
        alibabacloud.com/compute-class: "gpu-hpn"
      name: gpu-example-pod
    spec:
      nodeSelector:
        # Set the GPU model to example-model. The value is for reference only.
        alibabacloud.com/gpu-model-series: "example-model"
      ...

  • GPU-accelerated

    Parameter: metadata.labels["alibabacloud.com/gpu-model-series"]

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        # Set the compute class to gpu.
        alibabacloud.com/compute-class: "gpu"
        # Set the GPU model to example-model. The value is for reference only.
        alibabacloud.com/gpu-model-series: "example-model"
      name: gpu-pod
    spec:
    ...

Driver versions

GPU-heavy applications usually rely on Compute Unified Device Architecture (CUDA), a parallel computing platform and programming model released by NVIDIA in 2007. The following figure shows the architecture of the CUDA software stack. The driver API and the runtime API in the stack differ as follows:

  • Driver API: provides a richer, lower-level feature set and fine-grained control.

  • Runtime API: wraps the driver API and provides implicit driver initialization, which simplifies programming.

The CUDA driver API is included in the NVIDIA driver package, whereas the CUDA runtime and CUDA libraries are included in the CUDA Toolkit package.

(Figure: architecture of the CUDA software stack)
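The difference between the two APIs is visible in code. The following sketch queries both versions through the runtime API's cudaDriverGetVersion and cudaRuntimeGetVersion calls; it requires the CUDA Toolkit to compile and an NVIDIA driver to run, and the file name is only an example.

```cuda
// Sketch: query the driver-supported CUDA version and the runtime version.
// Compile with the CUDA Toolkit, for example: nvcc -o version_check version_check.cu
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int driverVersion = 0, runtimeVersion = 0;
    // Highest CUDA version supported by the installed NVIDIA driver.
    cudaDriverGetVersion(&driverVersion);
    // CUDA runtime version that this binary was built against.
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 1000) / 10,
           runtimeVersion / 1000, (runtimeVersion % 1000) / 10);
    return 0;
}
```

If the runtime version reported here is newer than what the driver supports, CUDA calls fail at initialization, which is why the toolkit-driver compatibility matrix matters.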

When you run GPU-heavy applications in an ACS cluster, take note of the following items:

  1. Use the CUDA base images provided by NVIDIA to build your container images. The CUDA Toolkit is preinstalled in these base images, so select a base image based on the CUDA Toolkit version that your application requires.

  2. To specify a driver version when you create a pod, see Specify a driver version for a pod.

  3. For more information about the compatibility between the CUDA Toolkit and NVIDIA drivers, see CUDA Toolkit Release Notes.

Note

The version of the CUDA runtime API used by an application matches the version of the CUDA base image used to build the Docker image. For example, if your Docker image is built from the CUDA base image nvidia/cuda:12.2.0-base-ubuntu20.04, the version of the CUDA runtime API used by your application is 12.2.0.
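As a quick illustration of that convention, the runtime API version can be read straight off the image tag. This is a shell sketch; the image reference is just the example above.

```shell
# Derive the CUDA runtime API version from a CUDA base image reference.
image="nvidia/cuda:12.2.0-base-ubuntu20.04"
runtime_version="${image#*:}"             # drop the repository -> 12.2.0-base-ubuntu20.04
runtime_version="${runtime_version%%-*}"  # keep the leading version -> 12.2.0
echo "$runtime_version"
```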

Specify a driver version for a pod

ACS lets you add pod labels to specify a driver version.

  • GPU-accelerated

    Parameter: metadata.labels["alibabacloud.com/gpu-driver-version"]

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        # Set the compute class to gpu.
        alibabacloud.com/compute-class: "gpu"
        # Set the GPU model to example-model. The value is for reference only.
        alibabacloud.com/gpu-model-series: "example-model"
        # Set the driver version to 535.161.08.
        alibabacloud.com/gpu-driver-version: "535.161.08"
      name: gpu-pod
    spec:
    ...

  • GPU-HPN

    Parameter: metadata.labels["alibabacloud.com/gpu-driver-version"]

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        # Set the compute class to GPU-HPN.
        alibabacloud.com/compute-class: "gpu-hpn"
        # Set the driver version to 535.161.08.
        alibabacloud.com/gpu-driver-version: "535.161.08"
      name: gpu-pod
    spec:
    ...

GPU driver versions

Make sure that the specified driver version is supported by ACS. For more information about the driver versions for different GPU models, see GPU driver versions supported by ACS.

Default GPU driver versions for pods

ACS lets you configure specific properties for pods based on matching rules. If the default driver version does not meet your requirements, add the following configuration to the acs-profile ConfigMap in the kube-system namespace to set different GPU driver versions for specific types of GPU pods. For more information, see Configure Selectors.

The following configuration sets the driver version to 1.5.0 for all pods with the gpu-hpn compute class in the cluster.

apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-profile
  namespace: kube-system
data:
  # Other system configurations remain unchanged
  selectors: |
    [
      {
        "name": "gpu-hpn-driver",
        "objectSelector": {
          "matchLabels": {
            "alibabacloud.com/compute-class": "gpu-hpn"
          }
        },
        "effect": {
          "annotations": {
            "alibabacloud.com/gpu-driver-version": "1.5.0"
          }
        }
      }
    ]

Example

  1. Create a file named gpu-pod-with-model-and-driver.yaml based on the following YAML content. The file defines a pod whose compute class is gpu, which requests GPU model example-model and driver version 535.161.08.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod-with-model-and-driver
      labels:
        # Set the compute class to gpu.
        alibabacloud.com/compute-class: "gpu"
        # Set the GPU model to example-model. The value is for reference only.
        alibabacloud.com/gpu-model-series: "example-model"
        # Set the driver version to 535.161.08.
        alibabacloud.com/gpu-driver-version: "535.161.08"
    spec:
      containers:
      - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
        name: tensorflow-mnist
        command:
        - sleep
        - infinity
        resources:
          requests:
            cpu: 1
            memory: 1Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 1
            memory: 1Gi
            nvidia.com/gpu: 1
  2. Run the following command to deploy the gpu-pod-with-model-and-driver.yaml file to the cluster:

    kubectl apply -f gpu-pod-with-model-and-driver.yaml
  3. Run the following command to query the status of the pod:

    kubectl get pod

    Expected output:

    NAME                            READY   STATUS    RESTARTS   AGE
    gpu-pod-with-model-and-driver   1/1     Running   0          87s
  4. Run the following command to query the GPU information of the pod:

    Note

    The /usr/bin/nvidia-smi command is included in the sample container image.

    kubectl exec -it gpu-pod-with-model-and-driver -- /usr/bin/nvidia-smi

    Expected output:

    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI xxx.xxx.xx             Driver Version: 535.161.08   CUDA Version: xx.x     |
    |-----------------------------------------+----------------------+----------------------+
    ...
    |=========================================+======================+======================|
    |   x  NVIDIA example-model           xx  | xxxxxxxx:xx:xx.x xxx |                    x |
    | xxx   xxx    xx              xxx / xxxx |      xxxx /       xxx|      x%      xxxxxxxx|
    |                                         |                      |                  xxx |
    +-----------------------------------------+----------------------+----------------------+

    The output indicates that the GPU model is example-model and the driver version is 535.161.08, as expected.

    Important

    The actual output varies based on your GPU model and driver version.