Container Compute Service (ACS) provides serverless container compute power. When you configure GPU resources for pods, ACS lets you specify the GPU model and driver version, which significantly reduces infrastructure management and maintenance costs. This topic describes how to specify a GPU model and driver version when you create a pod.
GPU models
ACS supports a variety of GPU models. You can also configure capacity reservations to apply for GPU resources on demand. You can specify GPU models for pods of different compute classes as follows.
GPU-HPN
You can only apply for node reservations. After you apply for a node reservation, you must associate it with the cluster. Each reservation hosts pods as an independent virtual node in the cluster. For more information, see GPU-HPN capacity reservations.
GPU-accelerated
You can apply for GPU resources on demand or configure capacity reservations. After a pod is created, GPU resources are automatically deducted from the capacity reservation.
To view the supported GPU models, submit a ticket.
Specify a GPU model for a pod
For GPU-HPN pods, you can only apply for node reservations. Each reservation hosts pods as a virtual node in the cluster. The label of the virtual node contains the GPU model. You can configure node affinity scheduling to schedule pods to the virtual node. For more information, see Schedule pods to GPU-HPN virtual nodes with attribute labels.
You must explicitly specify the GPU model in the labels or nodeSelector parameter of a pod, depending on its compute class.
| Compute class | Parameter | Example |
| --- | --- | --- |
| GPU-HPN | spec.nodeSelector | |
| GPU-accelerated | metadata.labels[alibabacloud.com/gpu-model-series] | |
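For example, a GPU-accelerated pod declares the GPU model through a label, as sketched below. The model name example-model is a placeholder; use a model supported by ACS. For GPU-HPN pods, the spec.nodeSelector key depends on the labels of your virtual node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-model-demo
  labels:
    # Use the GPU-accelerated compute class.
    alibabacloud.com/compute-class: "gpu"
    # example-model is a placeholder; specify a GPU model supported by ACS.
    alibabacloud.com/gpu-model-series: "example-model"
spec:
  containers:
  - name: app
    image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
```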
Driver versions
GPU-heavy applications usually rely on Compute Unified Device Architecture (CUDA), a parallel computing platform and programming model released by NVIDIA in 2007. The following figure displays the architecture of CUDA. The driver API and the runtime API in the CUDA software stack differ in the following ways:
Driver API: provides a rich set of fine-grained, low-level features.
Runtime API: encapsulates part of the driver API and provides implicit driver initialization.
The CUDA driver API is included in the NVIDIA driver package, whereas the CUDA libraries and the CUDA runtime are included in the CUDA Toolkit package.

When you run GPU-heavy applications in an ACS cluster, take note of the following items:
Use the CUDA base images provided by NVIDIA to build your container images. The CUDA Toolkit is preinstalled in these base images, so you can select a base image based on the CUDA Toolkit version that your application requires.
To specify a driver version when you create a pod, see Specify a driver version for a pod.
For more information about the compatibility between the CUDA Toolkit and NVIDIA drivers, see CUDA Toolkit Release Notes.
The version of the CUDA runtime API used by an application is the same as the version of the CUDA base image used to build the Docker image. For example, if your Docker image is built from the CUDA base image nvidia/cuda:12.2.0-base-ubuntu20.04, the version of the CUDA runtime API used by your application is 12.2.0.
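As a sketch, a Dockerfile that pins the CUDA runtime to 12.2.0 by building on the corresponding base image might look like the following. The application files and start command are hypothetical placeholders.

```dockerfile
# The base image determines the CUDA runtime version (12.2.0 here).
FROM nvidia/cuda:12.2.0-base-ubuntu20.04

# Hypothetical application files; replace with your own.
COPY app/ /opt/app/

# Hypothetical start command; replace with your own.
CMD ["/opt/app/run.sh"]
```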
Specify a driver version for a pod
ACS lets you add pod labels to specify a driver version.
| Compute class | Parameter | Example |
| --- | --- | --- |
| GPU-accelerated | metadata.labels[alibabacloud.com/gpu-driver-version] | |
| GPU-HPN | | |
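For example, a GPU-accelerated pod can pin the driver version through its labels, as in the following minimal fragment. The version 535.161.08 is only illustrative; specify a driver version supported by ACS.

```yaml
metadata:
  labels:
    # Use the GPU-accelerated compute class.
    alibabacloud.com/compute-class: "gpu"
    # 535.161.08 is only an example; use a driver version supported by ACS.
    alibabacloud.com/gpu-driver-version: "535.161.08"
```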
GPU driver versions
Make sure that the specified driver version is supported by ACS. For more information about the driver versions for different GPU models, see GPU driver versions supported by ACS.
Default GPU driver versions for pods
ACS can apply specific properties to pods that match certain rules. If the default driver version does not meet your requirements, add the following configuration to the kube-system/acs-profile ConfigMap to set different GPU driver versions for specific types of GPU pods. For more information, see Configure Selectors.
The following configuration sets the driver version to 1.5.0 for all pods with the gpu-hpn compute class in the cluster.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: acs-profile
  namespace: kube-system
data:
  # Other system configurations remain unchanged.
  selectors: |
    [
      {
        "name": "gpu-hpn-driver",
        "objectSelector": {
          "matchLabels": {
            "alibabacloud.com/compute-class": "gpu-hpn"
          }
        },
        "effect": {
          "annotations": {
            "alibabacloud.com/gpu-driver-version": "1.5.0"
          }
        }
      }
    ]
```
Example
Create a file named gpu-pod-with-model-and-driver.yaml based on the following YAML content. The file creates a pod whose compute class is gpu. The pod applies for the example-model GPU model and the 535.161.08 driver version.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-with-model-and-driver
  labels:
    # Set the compute class to gpu.
    alibabacloud.com/compute-class: "gpu"
    # Set the GPU model to example-model. The value is for reference only.
    alibabacloud.com/gpu-model-series: "example-model"
    # Set the driver version to 535.161.08.
    alibabacloud.com/gpu-driver-version: "535.161.08"
spec:
  containers:
  - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
    name: tensorflow-mnist
    command:
    - sleep
    - infinity
    resources:
      requests:
        cpu: 1
        memory: 1Gi
        nvidia.com/gpu: 1
      limits:
        cpu: 1
        memory: 1Gi
        nvidia.com/gpu: 1
```
Run the following command to deploy the gpu-pod-with-model-and-driver.yaml file in the cluster.
```shell
kubectl apply -f gpu-pod-with-model-and-driver.yaml
```
Run the following command to query the status of the pod:
```shell
kubectl get pod
```
Expected output:
```
NAME                            READY   STATUS    RESTARTS   AGE
gpu-pod-with-model-and-driver   1/1     Running   0          87s
```
Run the following command to query the GPU information of the pod:
Note: /usr/bin/nvidia-smi is the nvidia-smi command, with its parameters, encapsulated in the sample container image.
```shell
kubectl exec -it gpu-pod-with-model-and-driver -- /usr/bin/nvidia-smi
```
Expected output:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI xxx.xxx.xx             Driver Version: 535.161.08    CUDA Version: xx.x    |
|-----------------------------------------+----------------------+----------------------+
...
|=========================================+======================+======================|
|   x  NVIDIA example-model           xx  | xxxxxxxx:xx:xx.x xxx |                    x |
| xxx  xxx         xx      xxx /  xxxx    |    xxxx / xxx        |      x%     xxxxxxxx |
|                                         |                      |                  xxx |
+-----------------------------------------+----------------------+----------------------+
```
The output indicates that the GPU model is example-model and the driver version is 535.161.08, which meets the expectation.
Important: The actual output may vary.