Configure GPU Model and Driver Version for Serverless GPU Pods - ACK

Prerequisites

Before you begin, ensure that you have:

An ACK cluster with ACS enabled
kubectl configured to connect to your cluster
The name of the GPU model you want to request. To view the supported GPU models, submit a ticket.

GPU models

ACS supports two ways to request GPU resources:

On demand: GPU resources are allocated when the pod is scheduled.
Capacity reservation: Resources are pre-reserved and automatically deducted from the reservation when the pod is created. For details, see Capacity reservations for GPU-accelerated pods.

Specify a GPU model

Set the alibabacloud.com/gpu-model-series label in the pod metadata to request a specific GPU model.

Compute class	Label	Example value
GPU-accelerated	`alibabacloud.com/gpu-model-series`	`T4`

The following example shows how to set the label in a pod spec:

metadata:
  labels:
    # Set the compute class to gpu.
    alibabacloud.com/compute-class: "gpu"
    # Set the GPU model. Replace example-model with the actual GPU model, such as T4.
    alibabacloud.com/gpu-model-series: "example-model"

Driver versions

ACS lets you specify an NVIDIA driver version at the pod level.

How to choose a driver version

GPU-intensive applications typically rely on Compute Unified Device Architecture (CUDA), a parallel computing platform and programming model released by NVIDIA in 2007. The CUDA base image that you build from determines the CUDA runtime version in your container. For example, if your image is built from nvidia/cuda:12.2.0-base-ubuntu20.04, the CUDA runtime version is 12.2.0. Choose a driver version that meets the minimum driver requirement of that CUDA runtime version.

Note

The CUDA software stack includes two APIs:

Driver API (shipped in the NVIDIA driver package): a lower-level API that offers finer-grained control but is more complex to use.
Runtime API (shipped in the CUDA Toolkit package): wraps part of the driver API and initializes the driver implicitly. For driver and toolkit compatibility requirements, see CUDA Toolkit Release Notes.
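
As a rough illustration of reading the CUDA runtime version from the base image name, the following shell sketch extracts the version from an image reference. The function name cuda_version_from_image is hypothetical, and the sketch assumes the tag follows the nvidia/cuda naming pattern shown above (version first, then a hyphen). It does not handle references without a tag.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ACS or kubectl): derive the CUDA runtime
# version from an image reference such as nvidia/cuda:12.2.0-base-ubuntu20.04.
cuda_version_from_image() {
  local image="$1"
  # Drop everything up to the last colon, leaving the tag.
  local tag="${image##*:}"
  # Drop everything from the first hyphen onward, leaving the version.
  printf '%s\n' "${tag%%-*}"
}

# Example from the text above; prints 12.2.0
cuda_version_from_image "nvidia/cuda:12.2.0-base-ubuntu20.04"
```

Look up the printed version in the CUDA Toolkit Release Notes to find the minimum driver version it requires.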

Supported driver versions

Make sure the driver version you specify is supported by ACS. The following table lists supported driver versions by GPU model.

GPU model	Supported driver versions
8th-gen GPU A	550.90.07 (default)
8th-gen GPU B	550.90.07 (default), 535.161.08
T4	535.161.08 (default), 525.105.17
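
If you generate manifests from a script, a small pre-flight check can catch an unsupported combination before the pod is created. The following is a minimal sketch, not an ACS tool: the function names are hypothetical, and the version lists are copied by hand from the table above, so they must be updated whenever the table changes.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: print the driver versions listed for a GPU model.
# The data mirrors the table above and is an assumption of this sketch.
supported_drivers() {
  case "$1" in
    "8th-gen GPU A") echo "550.90.07" ;;
    "8th-gen GPU B") echo "550.90.07 535.161.08" ;;
    "T4")            echo "535.161.08 525.105.17" ;;
    *)               return 1 ;;  # unknown model
  esac
}

# Succeed only if the version is listed for the model.
is_driver_supported() {
  local model="$1" version="$2" drivers v
  drivers="$(supported_drivers "$model")" || return 1
  for v in $drivers; do
    if [ "$v" = "$version" ]; then
      return 0
    fi
  done
  return 1
}

if is_driver_supported "T4" "535.161.08"; then
  echo "supported"
else
  echo "not supported"
fi
```

The example call prints supported, because 535.161.08 is listed for T4 in the table.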

Pod label vs cluster default

ACS provides two ways to configure driver versions. Choose one based on the scope that you want to control:

Method	Scope	When to use
Pod label (`alibabacloud.com/gpu-driver-version`)	Single pod or workload	A specific workload needs a driver version different from the cluster default
acs-profile cluster default	Entire cluster	Standardize the driver version across all pods in the cluster

To change the cluster-level default, see Use acs-profile to automatically inject pod configurations.

Specify a driver version

Set the alibabacloud.com/gpu-driver-version label in the pod metadata.

Compute class	Label	Example value
GPU-accelerated	`alibabacloud.com/gpu-driver-version`	`535.161.08`

The following example shows how to set both GPU model and driver version labels together:

metadata:
  labels:
    # Set the compute class to gpu.
    alibabacloud.com/compute-class: "gpu"
    # Set the GPU model. Replace example-model with the actual GPU model, such as T4.
    alibabacloud.com/gpu-model-series: "example-model"
    # Set the driver version to 535.161.08.
    alibabacloud.com/gpu-driver-version: "535.161.08"
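
After pods with these labels exist, one way to review which model and driver version each pod requested is to print the labels as columns. The sketch below wraps standard kubectl flags (--selector and --label-columns) in a hypothetical function named show_gpu_labels. It assumes kubectl is already configured for the cluster and defaults to the default namespace.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: list GPU pods with their requested GPU model and
# driver version shown as extra columns. Only standard kubectl flags are used.
show_gpu_labels() {
  kubectl get pods --namespace "${1:-default}" \
    --selector alibabacloud.com/compute-class=gpu \
    --label-columns alibabacloud.com/gpu-model-series,alibabacloud.com/gpu-driver-version
}

# Usage (requires access to a cluster):
#   show_gpu_labels            # pods in the default namespace
#   show_gpu_labels my-team    # pods in a namespace named my-team (hypothetical)
```

A pod that omits the driver version label shows an empty value in that column.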

Deploy a GPU pod with a specific model and driver version

The following steps create a Deployment that requests a GPU pod with a specified model and driver version, then verify the GPU configuration.

Create a file named acs-pod-with-model-and-driver.yaml with the following content:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: acs-pod-with-model-and-driver
  namespace: default
  labels:
    app: acs-pod-with-model-and-driver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: acs-pod-with-model-and-driver
  template:
    metadata:
      name: acs-pod-with-model-and-driver
      labels:
        app: acs-pod-with-model-and-driver
        # Specify ACS compute power.
        alibabacloud.com/acs: "true"
        # Set the compute class to gpu.
        alibabacloud.com/compute-class: "gpu"
        # Set the GPU model. Replace example-model with the actual GPU model, such as T4.
        alibabacloud.com/gpu-model-series: "example-model"
        # Set the driver version to 535.161.08.
        alibabacloud.com/gpu-driver-version: "535.161.08"
    spec:
      containers:
      - image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
        name: tensorflow-mnist
        command:
        - sleep
        - infinity
        resources:
          requests:
            cpu: 1
            memory: 1Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 1
            memory: 1Gi
            nvidia.com/gpu: 1

Create the Deployment:

kubectl apply -f acs-pod-with-model-and-driver.yaml

Verify that the pod is running:

kubectl get pod

The expected output is similar to:

NAME                                             READY   STATUS    RESTARTS   AGE
acs-pod-with-model-and-driver-7b89cbf4cf-2w66p   1/1     Running   0          6m26s
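
The pod name suffix is generated, so it differs in every cluster. As a sketch of how to avoid copying it by hand, the following functions wait for the rollout and then look up the pod name by the app label from the manifest. The function names are hypothetical. The kubectl subcommands and flags are standard, and the namespace and label values are taken from the manifest above.

```shell
#!/usr/bin/env bash
# Name and app label taken from the manifest above.
DEPLOYMENT="acs-pod-with-model-and-driver"

# Block until the Deployment's pods are ready, or fail after the timeout
# (first argument, 300s if omitted).
wait_for_deployment() {
  kubectl rollout status "deployment/${DEPLOYMENT}" \
    --namespace default --timeout="${1:-300s}"
}

# Print the name of the first pod that carries the Deployment's app label.
first_pod_name() {
  kubectl get pods --namespace default \
    --selector "app=${DEPLOYMENT}" \
    --output jsonpath='{.items[0].metadata.name}'
}

# Usage (requires access to a cluster):
#   wait_for_deployment && POD="$(first_pod_name)" && echo "$POD"
```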

Query the GPU information of the pod. Replace the pod name in the command with the name returned in the previous step:

Note

/usr/bin/nvidia-smi is the path of the nvidia-smi command that is packaged in the sample container image.

Important

The actual output varies. The values shown here are representative.

kubectl exec -it acs-pod-with-model-and-driver-7b89cbf4cf-2w66p -- /usr/bin/nvidia-smi
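
To check only the driver version instead of reading the full table, nvidia-smi can print selected fields. The following sketch compares the reported driver version with the value set in the alibabacloud.com/gpu-driver-version label. The function name driver_matches is hypothetical, and the sketch assumes that /usr/bin/nvidia-smi in the image is the standard NVIDIA binary, which supports --query-gpu and --format.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if every GPU visible in the pod reports
# the expected driver version.
#   $1: pod name   $2: expected driver version, such as 535.161.08
driver_matches() {
  local pod="$1" expected="$2" actual
  # One line per GPU; sort -u collapses identical versions into one line.
  actual="$(kubectl exec "$pod" -- /usr/bin/nvidia-smi \
    --query-gpu=driver_version --format=csv,noheader | sort -u)"
  [ "$actual" = "$expected" ]
}

# Usage (requires access to a cluster; replace the pod name with your own):
#   driver_matches acs-pod-with-model-and-driver-7b89cbf4cf-2w66p 535.161.08 \
#     && echo "driver version matches"
```

If the command inside the pod fails, the captured output is empty, so the comparison also fails.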