Alibaba Cloud Container Compute Service (ACS) provides serverless computing power for containers. When you use GPU resources, ACS lets you declare GPU models and supported driver versions in pods. This capability significantly reduces infrastructure management and O&M costs. This topic describes how to specify a GPU model and a driver version when you create a pod.
GPU models
ACS supports multiple GPU models. You can use them with capacity reservations or request them on demand when you create a pod. The method that you use depends on the compute class.
High-performance networking GPU (GPU-HPN)
You can only request node reservations. After you purchase the reservations, you must associate each reservation with your cluster. Each reservation runs as an independent virtual node in the cluster. For more information, see GPU-HPN Capacity Reservation.
GPU instances
You can use GPU resources on demand or configure capacity reservations. After you create a pod, ACS automatically deducts the GPU resources from your capacity reservation.
To view the list of supported GPU models, submit a ticket.
Specify a GPU model for a pod
For GPU-HPN, you can only request node reservations. Each reservation runs as a virtual node in the cluster. The virtual node label includes the GPU model. You can use node affinity scheduling to schedule pods to that virtual node. For more information, see Schedule pods to GPU-HPN virtual nodes with attribute labels.
For GPU-accelerated pods, you must explicitly specify the GPU model in the labels and nodeSelector fields of the pod.
| Compute class | Protocol field | Example |
| --- | --- | --- |
| GPU-HPN | spec.nodeSelector | |
| GPU instances | metadata.labels[alibabacloud.com/gpu-model-series] | |
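To illustrate how the two protocol fields are used, the following fragments sketch both approaches. The model value example-model and the image name are placeholders, and the node label key used for GPU-HPN scheduling is an assumption; check the labels that your GPU-HPN virtual nodes actually carry before using it.

```yaml
# GPU instances: declare the GPU model as a pod label.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-model-demo
  labels:
    alibabacloud.com/compute-class: "gpu"
    alibabacloud.com/gpu-model-series: "example-model"  # placeholder model name
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1                  # placeholder image
---
# GPU-HPN: schedule the pod to a reserved virtual node via spec.nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-hpn-demo
spec:
  nodeSelector:
    alibabacloud.com/gpu-model-series: "example-model"  # assumed label key; verify on your virtual node
  containers:
  - name: app
    image: registry.example.com/app:v1                  # placeholder image
```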
Driver versions
GPU applications usually depend on the Compute Unified Device Architecture (CUDA), which is a parallel computing platform and programming model released by NVIDIA in 2007. The following figure shows the CUDA architecture. The driver API and runtime API in the CUDA software stack differ in the following ways.
Driver API: offers full functionality but is complex to use.
Runtime API: wraps part of the driver API and hides some driver initialization steps. It is easier to use.
The CUDA Driver API is included in the NVIDIA driver package. The CUDA Library and CUDA Runtime are included in the CUDA Toolkit package.

When you run GPU applications in an ACS cluster, note the following:
Use the CUDA base images provided by NVIDIA, which already include the CUDA Toolkit, to build your container image. Build your application image directly on top of a base image that matches the CUDA Toolkit version required by your application.
To specify a driver version when you create a pod, see Specify a driver version for a pod.
For information about the compatibility between the CUDA Toolkit and NVIDIA drivers, see the official NVIDIA document CUDA Toolkit Release Notes.
The CUDA runtime API version used by your application must match the version of the CUDA base image used to build its container image. For example, if your container image is built from the CUDA base image nvidia/cuda:12.2.0-base-ubuntu20.04, your application uses CUDA runtime API version 12.2.0.
Specify a driver version for a pod
ACS lets you specify a driver version using pod labels when your application uses GPU resources. Use the format shown in the following table.
| Compute class | Protocol field | Example |
| --- | --- | --- |
| GPU instances | metadata.labels[alibabacloud.com/gpu-driver-version] | |
| GPU-HPN | | |
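As a sketch of the label format, the following fragment pins the driver version for a GPU-instance pod. It assumes the 535.161.08 driver version used in the example later in this topic; the image name is a placeholder.

```yaml
# Pin the NVIDIA driver version for a GPU-instance pod via a pod label.
apiVersion: v1
kind: Pod
metadata:
  name: driver-version-demo
  labels:
    alibabacloud.com/compute-class: "gpu"
    alibabacloud.com/gpu-driver-version: "535.161.08"  # must be a version supported by ACS
spec:
  containers:
  - name: app
    image: registry.example.com/app:v1                 # placeholder image
```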
GPU driver versions
Ensure that the driver version that you specify is supported by ACS. For more information, see GPU driver version guide.
Default GPU driver version for pods
ACS lets you configure specific pod attributes using rules. If the default driver version does not meet your needs, add the following configuration to the kube-system/acs-profile ConfigMap to assign different GPU driver versions to specific types of GPU pods. For more information, see Configure selectors.
The following configuration sets the driver version to 1.5.0 for all GPU-HPN pods in the cluster.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-profile
  namespace: kube-system
data:
  # Keep other system configurations unchanged.
  selectors: |
    [
      {
        "name": "gpu-hpn-driver",
        "objectSelector": {
          "matchLabels": {
            "alibabacloud.com/compute-class": "gpu-hpn"
          }
        },
        "effect": {
          "labels": {
            "alibabacloud.com/gpu-driver-version": "1.5.0"
          }
        }
      }
    ]
```
Example
1. Create a file named gpu-pod-with-model-and-driver.yaml and add the following YAML content to the file. This file defines a pod that has the gpu compute class. The pod requests a GPU that has the example-model model and the 535.161.08 driver version.

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: gpu-pod-with-model-and-driver
     labels:
       # Set the compute class to gpu.
       alibabacloud.com/compute-class: "gpu"
       # Set the GPU model to example-model, such as T4.
       alibabacloud.com/gpu-model-series: "example-model"
       # Set the driver version to 535.161.08.
       alibabacloud.com/gpu-driver-version: "535.161.08"
   spec:
     containers:
     - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
       name: tensorflow-mnist
       command:
       - sleep
       - infinity
       resources:
         requests:
           cpu: 1
           memory: 1Gi
           nvidia.com/gpu: 1
         limits:
           cpu: 1
           memory: 1Gi
           nvidia.com/gpu: 1
   ```

2. Run the following command to deploy gpu-pod-with-model-and-driver.yaml to your cluster.

   ```shell
   kubectl apply -f gpu-pod-with-model-and-driver.yaml
   ```

3. Run the following command to check the pod status.

   ```shell
   kubectl get pod
   ```

   Expected output:

   ```
   NAME                            READY   STATUS    RESTARTS   AGE
   gpu-pod-with-model-and-driver   1/1     Running   0          87s
   ```

4. Run the following command to check the GPU information in the pod.

   Note: The /usr/bin/nvidia-smi command in the example is preconfigured in the sample container image.

   ```shell
   kubectl exec -it gpu-pod-with-model-and-driver -- /usr/bin/nvidia-smi
   ```

   Expected output:

   ```
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI xxx.xxx.xx             Driver Version: 535.161.08    CUDA Version: xx.x    |
   |-----------------------------------------+----------------------+----------------------+
   ...
   |=========================================+======================+======================|
   |   x  NVIDIA example-model           xx  | xxxxxxxx:xx:xx.x xxx |                    x |
   | xxx  xxx    xx           xxx /  xxxx    |      xxxx / xxx      | x%          xxxxxxxx |
   |                                         |                      |                  xxx |
   +-----------------------------------------+----------------------+----------------------+
   ```

   The output shows the GPU model example-model and driver version 535.161.08, which match the labels in the pod definition.

   Important: The output shown is for reference only. Actual results may vary depending on your environment.