Container Compute Service (ACS) provides serverless container compute power. When you configure GPU resources for pods, ACS lets you specify the GPU model and driver version, which significantly reduces infrastructure management and maintenance costs. This topic describes how to specify a GPU model and driver version when you create a pod.
GPU models
ACS supports a variety of GPU models. You can also configure capacity reservations to apply for GPU resources on demand. You can specify GPU models for pods of different compute classes as follows.
GPU-HPN
You can only apply for node reservations. After you apply for a node reservation, you must associate it with the cluster. Each reservation hosts pods as an independent virtual node in the cluster. For more information, see GPU-HPN capacity reservations.
GPU-accelerated
You can apply for GPU resources on demand or configure capacity reservations. After a pod is created, GPU resources are automatically deducted from the capacity reservation.
To view the supported GPU models, submit a ticket.
Specify a GPU model for a pod
For GPU-HPN pods, you can only apply for node reservations. Each reservation hosts pods as a virtual node in the cluster. The label of the virtual node contains the GPU model. You can configure node affinity scheduling to schedule pods to the virtual node. For more information, see Schedule pods to GPU-HPN virtual nodes with attribute labels.
You must explicitly specify the GPU model in the labels or nodeSelector parameter of a pod, depending on its compute class.
| Compute class | Parameter | Example |
| --- | --- | --- |
| GPU-HPN | spec.nodeSelector | |
| GPU-accelerated | metadata.labels[alibabacloud.com/gpu-model-series] | |
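For example, a GPU-accelerated pod declares the GPU model through a label, as sketched below. The model name example-model is a placeholder; use a model supported by ACS. For GPU-HPN pods, the spec.nodeSelector key depends on the labels of your virtual node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-model-demo
  labels:
    # Use the GPU-accelerated compute class.
    alibabacloud.com/compute-class: "gpu"
    # example-model is a placeholder; specify a GPU model supported by ACS.
    alibabacloud.com/gpu-model-series: "example-model"
spec:
  containers:
  - name: app
    image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
```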
Driver versions
GPU-heavy applications usually rely on Compute Unified Device Architecture (CUDA), a parallel computing platform and programming model released by NVIDIA in 2007. The following figure displays the architecture of CUDA. The driver API and the runtime API in the CUDA software stack differ in the following ways:
Driver API: provides a rich set of fine-grained, low-level features.
Runtime API: encapsulates part of the driver API and provides implicit driver initialization.
The CUDA driver API is included in the NVIDIA driver package, whereas the CUDA libraries and the CUDA runtime are included in the CUDA Toolkit package.

When you run GPU-heavy applications in an ACS cluster, take note of the following items:
Use the CUDA base images provided by NVIDIA to build your container images. The CUDA Toolkit is preinstalled in these base images, so you can select a base image based on the CUDA Toolkit version that your application requires.
To specify a driver version when you create a pod, see Specify a driver version for a pod.
For more information about the compatibility between the CUDA Toolkit and NVIDIA drivers, see CUDA Toolkit Release Notes.
The version of the CUDA runtime API used by an application is the same as the version of the CUDA base image used to build the Docker image. For example, if your Docker image is built from the CUDA base image nvidia/cuda:12.2.0-base-ubuntu20.04, the version of the CUDA runtime API used by your application is 12.2.0.
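As a sketch, a Dockerfile that pins the CUDA runtime to 12.2.0 by building on the corresponding base image might look like the following. The application files and start command are hypothetical placeholders.

```dockerfile
# The base image determines the CUDA runtime version (12.2.0 here).
FROM nvidia/cuda:12.2.0-base-ubuntu20.04

# Hypothetical application files; replace with your own.
COPY app/ /opt/app/

# Hypothetical start command; replace with your own.
CMD ["/opt/app/run.sh"]
```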
Specify a driver version for a pod
ACS lets you add pod labels to specify a driver version.
| Compute class | Parameter | Example |
| --- | --- | --- |
| GPU-accelerated | metadata.labels[alibabacloud.com/gpu-driver-version] | |
| GPU-HPN | | |
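For example, a GPU-accelerated pod can pin the driver version through its labels, as in the following minimal fragment. The version 535.161.08 is only illustrative; specify a driver version supported by ACS.

```yaml
metadata:
  labels:
    # Use the GPU-accelerated compute class.
    alibabacloud.com/compute-class: "gpu"
    # 535.161.08 is only an example; use a driver version supported by ACS.
    alibabacloud.com/gpu-driver-version: "535.161.08"
```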
GPU driver versions
Make sure that the specified driver version is supported by ACS. For more information about the driver versions for different GPU models, see GPU driver versions supported by ACS.
Default GPU driver versions for pods
ACS can apply specific properties to pods that match certain rules. If the default driver version does not meet your requirements, add the following configuration to the kube-system/acs-profile ConfigMap to set different GPU driver versions for specific types of GPU pods. For more information, see Configure Selectors.
The following configuration sets the driver version to 1.5.0 for all pods with the gpu-hpn compute class in the cluster.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-profile
  namespace: kube-system
data:
  # Other system configurations remain unchanged.
  selectors: |
    [
      {
        "name": "gpu-hpn-driver",
        "objectSelector": {
          "matchLabels": {
            "alibabacloud.com/compute-class": "gpu-hpn"
          }
        },
        "effect": {
          "annotations": {
            "alibabacloud.com/gpu-driver-version": "1.5.0"
          }
        }
      }
    ]
```
Example
Create a file named gpu-pod-with-model-and-driver.yaml based on the following YAML content. The file creates a pod whose compute class is gpu. The pod applies for the example-model GPU model and the 535.161.08 driver version.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-with-model-and-driver
  labels:
    # Set the compute class to gpu.
    alibabacloud.com/compute-class: "gpu"
    # Set the GPU model to example-model. The value is for reference only.
    alibabacloud.com/gpu-model-series: "example-model"
    # Set the driver version to 535.161.08.
    alibabacloud.com/gpu-driver-version: "535.161.08"
spec:
  containers:
  - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
    name: tensorflow-mnist
    command:
    - sleep
    - infinity
    resources:
      requests:
        cpu: 1
        memory: 1Gi
        nvidia.com/gpu: 1
      limits:
        cpu: 1
        memory: 1Gi
        nvidia.com/gpu: 1
```
Run the following command to deploy the gpu-pod-with-model-and-driver.yaml file in the cluster.
```shell
kubectl apply -f gpu-pod-with-model-and-driver.yaml
```
Run the following command to query the status of the pod:
```shell
kubectl get pod
```
Expected output:
```
NAME                            READY   STATUS    RESTARTS   AGE
gpu-pod-with-model-and-driver   1/1     Running   0          87s
```
Run the following command to query the GPU information of the pod:
Note: /usr/bin/nvidia-smi is the nvidia-smi command, with its parameters, encapsulated in the sample container image.
```shell
kubectl exec -it gpu-pod-with-model-and-driver -- /usr/bin/nvidia-smi
```
Expected output:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI xxx.xxx.xx             Driver Version: 535.161.08    CUDA Version: xx.x    |
|-----------------------------------------+----------------------+----------------------+
...
|=========================================+======================+======================|
|   x  NVIDIA example-model           xx  | xxxxxxxx:xx:xx.x xxx |                    x |
| xxx  xxx         xx      xxx /  xxxx    |    xxxx / xxx        |      x%     xxxxxxxx |
|                                         |                      |                  xxx |
+-----------------------------------------+----------------------+----------------------+
```
The output indicates that the GPU model is example-model and the driver version is 535.161.08, which meets the expectation.
Important: The actual output may vary.