Container Compute Service (ACS) provides serverless container compute resources, which significantly reduces infrastructure management and operational overhead. You can declare a GPU model and a supported driver version in your Pod specification. This topic explains how to specify a GPU model and driver version when you create a Pod.
GPU model details
ACS supports multiple GPU models. You can use them with capacity reservations or request them on-demand when creating a Pod. The method for using them varies by Compute class.
GPU-HPN
This class only supports node reservation. After purchasing a reservation, you must associate it with your cluster. Each reservation then appears in the cluster as an independent virtual node that Pods can be scheduled to. For more information, see GPU-HPN capacity reservations.
GPU-accelerated
This class supports both on-demand use and capacity reservation. Pods automatically deduct from available capacity reservations.
For a list of currently supported GPU models, submit a ticket.
Specify a GPU model for a Pod
For the GPU-HPN Compute class, you can only use resources by applying for a node reservation. Each reservation exists as an independent virtual node within the cluster. The virtual node's labels specify the GPU model, allowing you to use node affinity scheduling to schedule your Pod to the desired virtual node. For details, see Schedule Pods to GPU-HPN virtual nodes with attribute labels.
For the GPU-accelerated Compute class, you must explicitly specify the GPU model in the Pod's labels. The following table shows how to do this.
| Compute class | Field | Example |
| --- | --- | --- |
| GPU-HPN | `spec.nodeSelector` | |
| GPU-accelerated | `metadata.labels["alibabacloud.com/gpu-model-series"]` | `alibabacloud.com/gpu-model-series: "example-model"` |
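As a sketch, the two approaches look like the following Pod fragments. The model name `example-model` is taken from the example later in this topic; the GPU-HPN node label key and the container images are placeholders, not guaranteed values.

```yaml
# GPU-accelerated: declare the model directly in the Pod labels.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-accelerated-pod
  labels:
    alibabacloud.com/compute-class: "gpu"
    # Model name is illustrative; use a GPU model supported by ACS.
    alibabacloud.com/gpu-model-series: "example-model"
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1  # placeholder image
---
# GPU-HPN: schedule to the reservation's virtual node with nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-hpn-pod
spec:
  # The label key below is a placeholder; use the attribute labels
  # actually exposed on your GPU-HPN virtual node.
  nodeSelector:
    example.node.label/gpu-model: "example-model"
  containers:
  - name: app
    image: registry.example.com/app:v1  # placeholder image
```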
Driver version details
GPU applications typically depend on CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model introduced by NVIDIA in 2007. The following diagram shows the CUDA architecture. The Driver API and Runtime API in the CUDA software stack have the following differences:
Driver API: More comprehensive functionality, but more complex to use.
Runtime API: A user-friendly wrapper for the Driver API that hides some initialization operations.
The CUDA Driver API is provided by the NVIDIA driver package, while the CUDA Library and CUDA Runtime are provided by the CUDA Toolkit package.

When running GPU applications on an ACS cluster, note the following:
Build your container image from an official NVIDIA CUDA base image. These images come pre-installed with the CUDA Toolkit and are the ideal foundation for your application. You can also choose different CUDA base images for various CUDA Toolkit versions.
When creating an application, specify the driver version required by the Pod. For details, see Specify a driver version for a Pod.
Refer to the official NVIDIA CUDA Toolkit Release Notes for the compatibility matrix between CUDA Toolkit and NVIDIA driver versions.
Your application's CUDA Runtime API version must match the CUDA version of its base image. For example, if your application's Docker image is built from the nvidia/cuda:12.2.0-base-ubuntu20.04 base image, the application uses CUDA Runtime API version 12.2.0.
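The notes above come together in a single Pod spec: build the application image from an official CUDA base image, then pin a driver version that is compatible with that CUDA Toolkit release. A minimal sketch follows; the application image name is a placeholder, and the version pairing should be verified against the NVIDIA CUDA Toolkit Release Notes rather than taken from this example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-compat-example
  labels:
    alibabacloud.com/compute-class: "gpu"
    # Pin a driver version compatible with the image's CUDA Toolkit.
    # Verify the pairing in the CUDA Toolkit Release Notes.
    alibabacloud.com/gpu-driver-version: "535.161.08"
spec:
  containers:
  - name: cuda-app
    # An image built FROM nvidia/cuda:12.2.0-base-ubuntu20.04 uses
    # CUDA Runtime API 12.2.0 (placeholder application image name).
    image: registry.example.com/my-cuda-app:12.2.0
```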
Specify a driver version for a Pod
When a Pod consumes GPU resources, ACS lets you specify its driver version by using a label. The format is as follows.
| Compute class | Field | Example |
| --- | --- | --- |
| GPU-accelerated | `metadata.labels["alibabacloud.com/gpu-driver-version"]` | `alibabacloud.com/gpu-driver-version: "535.161.08"` |
| GPU-HPN | `metadata.labels["alibabacloud.com/gpu-driver-version"]` | `alibabacloud.com/gpu-driver-version: "535.161.08"` |
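For instance, the label from the table above appears in a Pod manifest like this (the driver version value is taken from the example later in this topic; substitute a version supported by ACS):

```yaml
metadata:
  labels:
    # The driver version must be on the list supported by ACS.
    alibabacloud.com/gpu-driver-version: "535.161.08"
```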
GPU driver versions
When you specify a driver version for a Pod, ensure that the version is on the list of GPU driver versions supported by ACS.
Default GPU driver version for Pods
ACS applies a default GPU driver version to Pods based on predefined rules. If the default does not meet your needs, you can override it for specific Pod types by configuring the acs-profile ConfigMap in the kube-system namespace. For details, see Configure Selectors.
The following configuration sets the driver version to 1.5.0 for all Pods of the gpu-hpn Compute class in the cluster.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-profile
  namespace: kube-system
data:
  # Other system configurations remain unchanged
  selectors: |
    [
      {
        "name": "gpu-hpn-driver",
        "objectSelector": {
          "matchLabels": {
            "alibabacloud.com/compute-class": "gpu-hpn"
          }
        },
        "effect": {
          "annotations": {
            "alibabacloud.com/gpu-driver-version": "1.5.0"
          }
        }
      }
    ]
```
Example
1. Create a file named `gpu-pod-with-model-and-driver.yaml` with the following YAML content. This YAML defines a GPU-accelerated Pod that requests an example-model GPU and driver version 535.161.08.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-with-model-and-driver
  labels:
    # Set the compute class to gpu.
    alibabacloud.com/compute-class: "gpu"
    # Set the GPU model to example-model. The value is for reference only. Example: T4.
    alibabacloud.com/gpu-model-series: "example-model"
    # Set the driver version to 535.161.08.
    alibabacloud.com/gpu-driver-version: "535.161.08"
spec:
  containers:
  - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
    name: tensorflow-mnist
    command:
    - sleep
    - infinity
    resources:
      requests:
        cpu: 1
        memory: 1Gi
        nvidia.com/gpu: 1
      limits:
        cpu: 1
        memory: 1Gi
        nvidia.com/gpu: 1
```

2. Run the following command to deploy `gpu-pod-with-model-and-driver.yaml` to the cluster.

```shell
kubectl apply -f gpu-pod-with-model-and-driver.yaml
```

3. Run the following command to check the Pod status.

```shell
kubectl get pod
```

Expected output:

```
NAME                            READY   STATUS    RESTARTS   AGE
gpu-pod-with-model-and-driver   1/1     Running   0          87s
```

4. Run the following command to view the Pod's GPU information.

Note: The `/usr/bin/nvidia-smi` command in the following example is pre-packaged in the sample image.

```shell
kubectl exec -it gpu-pod-with-model-and-driver -- /usr/bin/nvidia-smi
```

Expected output:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI xxx.xxx.xx             Driver Version: 535.161.08   CUDA Version: xx.x     |
|-----------------------------------------+----------------------+----------------------+
...
|=========================================+======================+======================|
|   x  NVIDIA example-model           xx  | xxxxxxxx:xx:xx.x xxx |                    x |
| xxx  xxx        xx       xxx /  xxxx    |     xxxx /  xxx      | x%          xxxxxxxx |
|                                         |                      |                  xxx |
+-----------------------------------------+----------------------+----------------------+
```

The output shows that the GPU model is `example-model` and the driver version is `535.161.08`, which matches the labels configured in the Pod manifest.

Important: The preceding output is an example. Your actual output may vary depending on your environment.