Alibaba Cloud Container Compute Service (ACS) provides serverless computing power for containers. When you use GPU resources, ACS lets you declare GPU models and supported driver versions in pods. This capability significantly reduces infrastructure management and O&M costs. This topic describes how to specify a GPU model and a driver version when you create a pod.
GPU models
ACS supports multiple GPU models. You can use them with capacity reservations or request them on demand when you create a pod. The method that you use depends on the compute class.
High-performance networking GPU (GPU-HPN)
You can only request node reservations. After you purchase the reservations, you must associate each reservation with your cluster. Each reservation runs as an independent virtual node in the cluster. For more information, see GPU-HPN Capacity Reservation.
GPU instances
You can use GPU resources on demand or configure capacity reservations. After you create a pod, ACS automatically deducts the GPU resources from your capacity reservation.
To view the list of supported GPU models, submit a ticket.
Specify a GPU model for a pod
For GPU-HPN, you can only request node reservations. Each reservation runs as a virtual node in the cluster. The virtual node label includes the GPU model. You can use node affinity scheduling to schedule pods to that virtual node. For more information, see Schedule pods to GPU-HPN virtual nodes with attribute labels.
For GPU-accelerated pods, you must explicitly specify the GPU model in the labels and nodeSelector fields of the pod.
| Compute class | Protocol field | Example |
| --- | --- | --- |
| GPU-HPN | spec.nodeSelector | |
| GPU instances | metadata.labels[alibabacloud.com/gpu-model-series] | |
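To illustrate how the two protocol fields are used, the following fragments sketch both approaches. The model value example-model and the image name are placeholders, and the node label key used for GPU-HPN scheduling is an assumption; check the labels that your GPU-HPN virtual nodes actually carry before using it.

```yaml
# GPU instances: declare the GPU model as a pod label.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-model-demo
  labels:
    alibabacloud.com/compute-class: "gpu"
    alibabacloud.com/gpu-model-series: "example-model"  # placeholder model name
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1                  # placeholder image
---
# GPU-HPN: schedule the pod to a reserved virtual node via spec.nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-hpn-demo
spec:
  nodeSelector:
    alibabacloud.com/gpu-model-series: "example-model"  # assumed label key; verify on your virtual node
  containers:
  - name: app
    image: registry.example.com/app:v1                  # placeholder image
```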
Driver versions
GPU applications usually depend on the Compute Unified Device Architecture (CUDA), which is a parallel computing platform and programming model released by NVIDIA in 2007. The following figure shows the CUDA architecture. The driver API and runtime API in the CUDA software stack differ in the following ways.
Driver API: offers full functionality but is complex to use.
Runtime API: wraps part of the driver API and hides some driver initialization steps. It is easier to use.
The CUDA Driver API is included in the NVIDIA driver package. The CUDA Library and CUDA Runtime are included in the CUDA Toolkit package.

When you run GPU applications in an ACS cluster, note the following:
Use the CUDA base images provided by NVIDIA, which already include the CUDA Toolkit, to build your container image. Build your application image directly on top of a base image that matches the CUDA Toolkit version required by your application.
To specify a driver version when you create a pod, see Specify a driver version for a pod.
For information about the compatibility between the CUDA Toolkit and NVIDIA drivers, see the official NVIDIA document CUDA Toolkit Release Notes.
The CUDA runtime API version used by your application must match the version of the CUDA base image used to build its container image. For example, if your container image is built from the CUDA base image nvidia/cuda:12.2.0-base-ubuntu20.04, your application uses CUDA runtime API version 12.2.0.
Specify a driver version for a pod
ACS lets you specify a driver version using pod labels when your application uses GPU resources. Use the format shown in the following table.
| Compute class | Protocol field | Example |
| --- | --- | --- |
| GPU instances | metadata.labels[alibabacloud.com/gpu-driver-version] | |
| GPU-HPN | | |
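As a sketch of the label format, the following fragment pins the driver version for a GPU-instance pod. It assumes the 535.161.08 driver version used in the example later in this topic; the image name is a placeholder.

```yaml
# Pin the NVIDIA driver version for a GPU-instance pod via a pod label.
apiVersion: v1
kind: Pod
metadata:
  name: driver-version-demo
  labels:
    alibabacloud.com/compute-class: "gpu"
    alibabacloud.com/gpu-driver-version: "535.161.08"  # must be a version supported by ACS
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1                 # placeholder image
```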
GPU driver versions
Ensure that the driver version that you specify is supported by ACS. For more information, see GPU driver version guide.
Default GPU driver version for pods
ACS lets you configure specific pod attributes using rules. If the default driver version does not meet your needs, add the following configuration to the kube-system/acs-profile ConfigMap to assign different GPU driver versions to specific types of GPU pods. For more information, see Configure selectors.
The following configuration sets the driver version to 1.5.0 for all GPU-HPN pods in the cluster.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-profile
  namespace: kube-system
data:
  # Keep other system configurations unchanged.
  selectors: |
    [
      {
        "name": "gpu-hpn-driver",
        "objectSelector": {
          "matchLabels": {
            "alibabacloud.com/compute-class": "gpu-hpn"
          }
        },
        "effect": {
          "labels": {
            "alibabacloud.com/gpu-driver-version": "1.5.0"
          }
        }
      }
    ]
```
Example
1. Create a file named gpu-pod-with-model-and-driver.yaml and add the following YAML content to the file. This file defines a pod that has the gpu compute class. The pod requests a GPU that has the example-model model and the 535.161.08 driver version.

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: gpu-pod-with-model-and-driver
     labels:
       # Set the compute class to gpu.
       alibabacloud.com/compute-class: "gpu"
       # Set the GPU model to example-model, such as T4.
       alibabacloud.com/gpu-model-series: "example-model"
       # Set the driver version to 535.161.08.
       alibabacloud.com/gpu-driver-version: "535.161.08"
   spec:
     containers:
     - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
       name: tensorflow-mnist
       command:
       - sleep
       - infinity
       resources:
         requests:
           cpu: 1
           memory: 1Gi
           nvidia.com/gpu: 1
         limits:
           cpu: 1
           memory: 1Gi
           nvidia.com/gpu: 1
   ```

2. Run the following command to deploy gpu-pod-with-model-and-driver.yaml to your cluster.

   ```shell
   kubectl apply -f gpu-pod-with-model-and-driver.yaml
   ```

3. Run the following command to check the pod status.

   ```shell
   kubectl get pod
   ```

   Expected output:

   ```
   NAME                            READY   STATUS    RESTARTS   AGE
   gpu-pod-with-model-and-driver   1/1     Running   0          87s
   ```

4. Run the following command to check the GPU information in the pod.

   Note: The /usr/bin/nvidia-smi command in the example is preconfigured in the sample container image.

   ```shell
   kubectl exec -it gpu-pod-with-model-and-driver -- /usr/bin/nvidia-smi
   ```

   Expected output:

   ```
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI xxx.xxx.xx             Driver Version: 535.161.08    CUDA Version: xx.x    |
   |-----------------------------------------+----------------------+----------------------+
   ...
   |=========================================+======================+======================|
   |   x  NVIDIA example-model           xx  | xxxxxxxx:xx:xx.x xxx |                    x |
   | xxx  xxx    xx           xxx /  xxxx    |      xxxx / xxx      | x%          xxxxxxxx |
   |                                         |                      |                  xxx |
   +-----------------------------------------+----------------------+----------------------+
   ```

   The output shows the GPU model example-model and driver version 535.161.08, which match the labels in the pod definition.

   Important: The output shown is for reference only. Actual results may vary depending on your environment.