Container Compute Service (ACS) lets you declare a GPU model and NVIDIA driver version directly in your pod spec. Use this when your workload requires a specific GPU hardware type or a driver version different from the cluster default—without managing the underlying infrastructure.
Prerequisites
Before you begin, ensure that you have:
- An ACK cluster with ACS enabled
- kubectl configured to connect to your cluster
- The GPU model you need (to view supported GPU models, submit a ticket)
GPU models
ACS supports two ways to request GPU resources:
- On demand: GPU resources are allocated when the pod is scheduled.
- Capacity reservation: Resources are pre-reserved and automatically deducted from the reservation when the pod is created. For details, see Capacity reservations for GPU-accelerated pods.
Specify a GPU model
Set the alibabacloud.com/gpu-model-series label in the pod metadata to request a specific GPU model.
| Compute class | Label | Example value |
|---|---|---|
| GPU-accelerated | alibabacloud.com/gpu-model-series | T4 |
The following example shows how to set the label in a pod spec:
    metadata:
      labels:
        # Set the compute class to gpu.
        alibabacloud.com/compute-class: "gpu"
        # Set the GPU model. Replace example-model with the actual GPU model, such as T4.
        alibabacloud.com/gpu-model-series: "example-model"
Driver versions
ACS lets you specify an NVIDIA driver version at the pod level.
How to choose a driver version
GPU-heavy applications typically rely on Compute Unified Device Architecture (CUDA), a parallel computing platform and programming model released by NVIDIA in 2007. The CUDA runtime version in your container is determined by the CUDA base image you build from. For example, if your image is built from nvidia/cuda:12.2.0-base-ubuntu20.04, the CUDA runtime version is 12.2.0.
The CUDA software stack includes two APIs:
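- Driver API (in the NVIDIA driver package): a lower-level API that offers more features and finer control, but is more complex to use.
- Runtime API (in the CUDA Toolkit package): wraps parts of the Driver API and initializes the driver implicitly.

Because the Runtime API is built on the Driver API, choose a driver version that meets the minimum required by the CUDA runtime version in your image. For driver and toolkit compatibility requirements, see CUDA Toolkit Release Notes.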
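As a rough pre-flight check, the comparison between a candidate driver version and the minimum that your CUDA release needs can be sketched in shell. The function name `driver_meets_minimum` is hypothetical, `sort -V` assumes GNU coreutils or a compatible `sort`, and the minimum version you pass in has to come from the CUDA Toolkit Release Notes. The value 525.60.13 in the usage comment is only an illustrative input.

```shell
# Hypothetical helper: succeeds (exit status 0) when driver version $1 is
# greater than or equal to the minimum version $2.
driver_meets_minimum() {
  have=$1
  need=$2
  # sort -V orders version strings segment by segment; the first line is the lower one.
  lowest=$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)
  [ "$lowest" = "$need" ]
}

# Usage (the minimum is an assumed example, look up the real value for your CUDA release):
#   driver_meets_minimum 535.161.08 525.60.13 && echo "driver is new enough"
```

Returning an exit status rather than printing a message keeps the helper easy to combine with `&&` or `if` before you run `kubectl apply`.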
Supported driver versions
Make sure the driver version you specify is supported by ACS. The following table lists supported driver versions by GPU model.
| GPU model | Supported driver versions |
|---|---|
| 8th-gen GPU A | 550.90.07 (default) |
| 8th-gen GPU B | 550.90.07 (default), 535.161.08 |
| T4 | 535.161.08 (default), 525.105.17 |
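If you template pod specs in a script, a small lookup that mirrors the table above can catch an unsupported combination before the pod is submitted. This is only a sketch: the function names are hypothetical, and the version lists are copied from the table, so they need updating whenever ACS changes its supported versions.

```shell
# Hypothetical helper: prints the driver versions listed in the table above
# for a GPU model, default version first. Fails for an unknown model.
supported_drivers() {
  case "$1" in
    "8th-gen GPU A") echo "550.90.07" ;;
    "8th-gen GPU B") echo "550.90.07 535.161.08" ;;
    "T4")            echo "535.161.08 525.105.17" ;;
    *)               return 1 ;;
  esac
}

# Hypothetical helper: succeeds when driver version $2 is listed for GPU model $1.
is_supported_driver() {
  versions=$(supported_drivers "$1") || return 1
  for v in $versions; do
    [ "$v" = "$2" ] && return 0
  done
  return 1
}

# Usage:
#   is_supported_driver "T4" "525.105.17" || echo "unsupported driver for this model"
```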
Pod label vs cluster default
ACS provides two ways to configure driver versions. Choose based on your scope of control:
| Method | Scope | When to use |
|---|---|---|
| Pod label (alibabacloud.com/gpu-driver-version) | Single pod or workload | A specific workload needs a driver version different from the cluster default |
| acs-profile cluster default | Entire cluster | Standardize the driver version across all pods in the cluster |
To change the cluster-level default, see Use acs-profile to automatically inject pod configurations.
Specify a driver version
Set the alibabacloud.com/gpu-driver-version label in the pod metadata.
| Compute class | Label | Example value |
|---|---|---|
| GPU-accelerated | alibabacloud.com/gpu-driver-version | 535.161.08 |
The following example shows how to set both GPU model and driver version labels together:
    metadata:
      labels:
        # Set the compute class to gpu.
        alibabacloud.com/compute-class: "gpu"
        # Set the GPU model. Replace example-model with the actual GPU model, such as T4.
        alibabacloud.com/gpu-model-series: "example-model"
        # Set the driver version to 535.161.08.
        alibabacloud.com/gpu-driver-version: "535.161.08"
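Because both settings are ordinary Kubernetes labels, you can also filter and display pods by them with standard kubectl options. The wrapper below is a minimal sketch: `list_gpu_pods` is a hypothetical name, `-l` (label selector) and `-L` (label columns) are standard `kubectl get` flags, and the label keys are the ones described above.

```shell
# Hypothetical helper: lists pods that request the given GPU model and shows
# the model and driver version labels as extra columns.
# $1 is the GPU model value, for example T4.
list_gpu_pods() {
  kubectl get pods \
    -l "alibabacloud.com/compute-class=gpu,alibabacloud.com/gpu-model-series=$1" \
    -L alibabacloud.com/gpu-model-series,alibabacloud.com/gpu-driver-version
}

# Usage:
#   list_gpu_pods T4
```

An empty driver version column would only mean that the pod carries no explicit driver label; this sketch does not show which version a cluster-level default resolves to.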
Deploy a GPU pod with a specific model and driver version
The following steps create a Deployment that requests a GPU pod with a specified model and driver version, then verify the GPU configuration.
- Create a file named acs-pod-with-model-and-driver.yaml with the following content:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: acs-pod-with-model-and-driver
        namespace: default
        labels:
          app: acs-pod-with-model-and-driver
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: acs-pod-with-model-and-driver
        template:
          metadata:
            name: acs-pod-with-model-and-driver
            labels:
              app: acs-pod-with-model-and-driver
              # Specify ACS compute power.
              alibabacloud.com/acs: "true"
              # Set the compute class to gpu.
              alibabacloud.com/compute-class: "gpu"
              # <example-model> is a placeholder. Replace it with the actual GPU model, such as T4.
              alibabacloud.com/gpu-model-series: "<example-model>"
              # Set the driver version to 535.161.08.
              alibabacloud.com/gpu-driver-version: "535.161.08"
          spec:
            containers:
            - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
              name: tensorflow-mnist
              command:
              - sleep
              - infinity
              resources:
                requests:
                  cpu: 1
                  memory: 1Gi
                  nvidia.com/gpu: 1
                limits:
                  cpu: 1
                  memory: 1Gi
                  nvidia.com/gpu: 1

- Create the Deployment:

      kubectl apply -f acs-pod-with-model-and-driver.yaml

- Verify that the pod is running:

      kubectl get pod

  The expected output is similar to:

      NAME                                             READY   STATUS    RESTARTS   AGE
      acs-pod-with-model-and-driver-7b89cbf4cf-2w66p   1/1     Running   0          6m26s

- Query the GPU information of the pod. Replace the pod name with the name returned in the previous step.

  Note: /usr/bin/nvidia-smi is the path of the nvidia-smi command packaged in the sample container image.

  Important: The actual output varies. The values shown here are representative.

      kubectl exec -it acs-pod-with-model-and-driver-7b89cbf4cf-2w66p -- /usr/bin/nvidia-smi

  The expected output is similar to:

      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI xxx.xxx.xx             Driver Version: 535.161.08   CUDA Version: xx.x     |
      |-----------------------------------------+----------------------+----------------------+
      ...
      |=========================================+======================+======================|
      |   x  NVIDIA example-model           xx  | xxxxxxxx:xx:xx.x xxx |                    x |
      | xxx   xxx    xx              xxx / xxxx |            xxxx / xxx|      x%      xxxxxxxx|
      |                                         |                      |                  xxx |
      +-----------------------------------------+----------------------+----------------------+

  The output confirms that the GPU model is example-model and the driver version is 535.161.08.
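The manual check in the last step can also be scripted. The following is a minimal sketch under a few assumptions: the function names are hypothetical, the pod is selected through the `app` label from the Deployment above, and the driver version is read from the `Driver Version:` field of the `nvidia-smi` banner in the form shown in the expected output.

```shell
# Hypothetical helper: prints the name of the first pod created by the
# Deployment, selected by its app label.
first_pod_name() {
  kubectl get pod -l app=acs-pod-with-model-and-driver \
    -o jsonpath='{.items[0].metadata.name}'
}

# Hypothetical helper: reads nvidia-smi output on stdin and prints the value
# that follows "Driver Version:".
driver_version_from_smi() {
  sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeeds when the driver reported inside the pod
# matches the requested version $1.
verify_driver() {
  want=$1
  got=$(kubectl exec "$(first_pod_name)" -- /usr/bin/nvidia-smi | driver_version_from_smi)
  [ "$got" = "$want" ]
}

# Usage:
#   verify_driver 535.161.08 && echo "driver version matches the pod label"
```

Selecting the pod by label avoids copying the generated pod name, which changes each time the Deployment creates a new pod.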