Alibaba Cloud Container Service (ACS) supports GPU sharing on GPU-HPN nodes. This feature lets you run multiple pods on a single GPU device. In an exclusive GPU scheduling scenario, a pod must request an entire GPU. If a pod does not require the resources of a full GPU, resources are wasted. GPU sharing lets you request fine-grained heterogeneous computing power for your pods. GPU sharing also supports flexible requests and limits constraints for pods. This capability meets the resource isolation and sharing requirements of various application scenarios.
Introduction
This topic applies only to ACS clusters.
GPU sharing provides a more fine-grained resource description. It allows a single pod to request resources in increments smaller than one full GPU, such as 0.5 of a GPU's computing power. It does not support aggregated requests across multiple GPUs, such as requesting 0.5 of the computing power from two different GPUs at the same time.
The GPU sharing module maintains the driver version for pods that use GPU sharing. You cannot specify a driver version for an individual pod.
This feature is in public preview in the Ulanqab and Shanghai Finance Cloud regions. To use this feature in other regions, please submit a ticket.
When you use GPU sharing, pods do not directly access a specific GPU device. Instead, they interact with the device through the GPU sharing module. The GPU sharing module consists of a proxy module and a resource management module. The proxy module is integrated into the pod by default. It intercepts API calls related to the GPU device and forwards them to the backend resource module. The resource module runs the GPU instructions on the actual GPU device and limits GPU resource usage based on the pod's resource description.
The resource module for GPU sharing also consumes some CPU and memory resources, which are automatically reserved when the feature is enabled. For more information, see Node configuration.
Resource configuration and QoS
Shared GPU resources are described using Kubernetes requests/limits constraints. You can configure computing power and GPU memory as percentages. The feature also supports resource descriptions where limits are greater than requests. This may cause multiple pods to compete for GPU resources simultaneously. ACS defines a Quality of Service (QoS) for shared GPU resources. When multiple pods on a node use GPU resources simultaneously, the pods are queued and preemption may be triggered. The following is an example:
...
resources:
  requests: # Controls the number of pods that can be scheduled on the node.
    alibabacloud.com/gpu-core.percentage: 10 # The percentage of computing power that the pod requires.
    alibabacloud.com/gpu-memory.percentage: 10 # The percentage of GPU memory that the pod requires.
  limits: # Controls the upper limit of resources that can be used at runtime. For more information about the effects, see the configuration instructions.
    alibabacloud.com/gpu-core.percentage: 100 # The upper limit of computing power usage.
    alibabacloud.com/gpu-memory.percentage: 100 # The upper limit of GPU memory usage. Exceeding this limit causes a CUDA OOM error.
...
Similar to the process management mechanism of an operating system, the GPU sharing module classifies pods into three states: hibernation, ready, and running. The state transition process is shown in the following figure.
When a pod starts, it enters the hibernation state.
When the pod attempts to use GPU resources, it enters the ready state. The GPU sharing module then allocates GPU resources to the pod based on a priority policy.
After the pod is allocated GPU resources, it enters the running state.
If pods are still in the ready state after all resources are allocated, a preemption mechanism is triggered to ensure resource fairness among pods.
When a pod is preempted, the process that occupies the GPU resources is killed, and the pod returns to the hibernation state.
Queuing policy
Pods in the ready state are queued based on the First In, First Out (FIFO) policy. The GPU sharing module allocates resources to the pod that entered the ready state first. If current resources are insufficient, the preemption policy is triggered.
Preemption policy
When resources cannot meet the demands of a pod in the ready state, the GPU sharing module attempts to preempt other pods. First, it filters the running pods based on specific conditions. Then, it scores and sorts the eligible pods and preempts them one by one until the resource demands of the queued pod are met.
If none of the currently running pods meet the filter conditions, the pod in the ready state remains in the queue and waits for resources. The details are as follows.
Policy type | Description |
Filter policy | The currently running pod has continuously occupied GPU resources for 2 hours. This is customizable. For more information, see QoS configuration. |
Scoring policy | The duration for which a pod has continuously occupied GPU resources. Pods that have occupied resources for a longer time are preempted first. |
Resource sharing models
GPU sharing is based on a shared model and allows multiple pods to run on a single GPU card simultaneously. ACS currently supports the following sharing models:
Model name | Effect | GPU shared resource configuration | Queuing policy | Preemption policy | Scenarios |
share-pool | Treats all GPUs on a node as a share pool. A pod can use any physical GPU that has idle resources. | | FIFO | Allows custom configurations. | Notebook development scenarios. Combined with the request/limit configuration in resource QoS, this model supports off-peak GPU resource usage by multiple users. When resources are insufficient, QoS mechanisms such as queuing and preemption are triggered. For more information, see Example: Use the share-pool model for off-peak resource usage in Notebook scenarios. |
static | GPU slicing scenario. Assigns a fixed GPU device to a pod, which does not change during runtime. The scheduler prioritizes placing pods on the same GPU to avoid fragmentation. | Warning: If request is less than limit for GPU computing power or memory, resource competition occurs between pods. This can even cause pods to be killed due to an out-of-memory (OOM) error. | Not supported | Not supported | Small-scale AI applications where multiple pods share a GPU device to improve resource utilization. Due to the `request`==`limit` constraint, a pod can obtain GPU resources at any time during runtime without queuing. |
Example: Use the share-pool model for off-peak resource usage in Notebook scenarios
In Notebook development, applications typically do not occupy resources for long periods. You can use the share-pool model to allow pods to run on different GPU cards during off-peak hours. A pod enters the ready queue to wait for resources only when it requires resources.
The following is a use case for a Notebook scenario:
Pods A and B are configured with `requests=0.5` and `limits=0.5`. Pods C and D are configured with `requests=0.5` and `limits=1`. Based on the `requests` values, these pods can be scheduled to a single GPU-HPN node that has two GPUs.
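The following sketch shows how these two resource profiles could be written in the pods' container specs. Only the resources section is shown, and 0.5 of a GPU is expressed as a value of 50 in the percentage-based resource names.
# Pods A and B: requests equal limits, so they never compete for GPU resources.
resources:
  requests:
    alibabacloud.com/gpu-core.percentage: 50
    alibabacloud.com/gpu-memory.percentage: 50
  limits:
    alibabacloud.com/gpu-core.percentage: 50
    alibabacloud.com/gpu-memory.percentage: 50
---
# Pods C and D: limits exceed requests, so each pod can burst to a full GPU
# but may wait in the ready queue when GPU resources are contended.
resources:
  requests:
    alibabacloud.com/gpu-core.percentage: 50
    alibabacloud.com/gpu-memory.percentage: 50
  limits:
    alibabacloud.com/gpu-core.percentage: 100
    alibabacloud.com/gpu-memory.percentage: 100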
Time T1:
Pod A and Pod C are occupying resources. Pod B and Pod D are in the ready queue, waiting to be scheduled.
The GPU sharing module attempts to allocate resources to Pod D, which is at the head of the queue. However, GPU 0 has only 0.5 GPU of idle resources. Although Pod D's `request` is 0.5, which the available capacity can satisfy, its `limit` is 1. Running Pod A and Pod D on the same GPU would cause resource competition. Therefore, the GPU sharing module keeps Pod D in the queue.
Time T2 - Phase 1:
Pod C's task is complete, and it enters the hibernation queue.
After GPU 1 becomes idle, its resources are allocated to Pod D.
Time T2 - Phase 2:
Pod B is allocated resources. Because Pod B's `limit` is 0.5, it can run on GPU 0 simultaneously with Pod A without resource competition.
Example: Use GPU sharing
This example demonstrates how to use the GPU sharing feature. The procedure covers enabling the GPU sharing feature (share-pool) on a GPU-HPN node, submitting a pod that uses shared GPU resources, and then disabling the feature on the node.
Step 1: Add a label to the GPU-HPN node
View the GPU-HPN nodes.
Important: Before enabling this feature, you must delete any pods on the node that request exclusive GPU resources. You do not need to delete pods that request only CPU and memory resources.
kubectl get node -l alibabacloud.com/node-type=reserved
Expected output:
NAME                     STATUS   ROLES   AGE   VERSION
cn-wulanchabu-c.cr-xxx   Ready    agent   59d   v1.28.3-aliyun
Add the label alibabacloud.com/gpu-share-policy=share-pool to the node cn-wulanchabu-c.cr-xxx to enable the GPU sharing feature.
$ kubectl label node cn-wulanchabu-c.cr-xxx alibabacloud.com/gpu-share-policy=share-pool
Step 2: Check the enabling status of the node
Wait for the feature to be enabled on the node, and then check the status. The feature is enabled if the GPU shared resources appear in the capacity field and the GPUSharePolicyValid condition in the conditions field is True.
$ kubectl get node cn-wulanchabu-c.cr-xxx -o yaml
After the GPU sharing policy takes effect, the node status is updated. Expected output:
# The actual output may vary.
apiVersion: v1
kind: Node
spec:
  # ...
status:
  allocatable:
    # GPU shared resource description
    alibabacloud.com/gpu-core.percentage: "1600"
    alibabacloud.com/gpu-memory.percentage: "1600"
    # After the feature is enabled, CPU, memory, and storage resources are reserved for the GPU sharing module.
    cpu: "144"
    memory: 1640Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 4608Gi
  capacity:
    # GPU shared resource description
    alibabacloud.com/gpu-core.percentage: "1600"
    alibabacloud.com/gpu-memory.percentage: "1600"
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 6Ti
  conditions:
    # Indicates whether the GPU Share policy configuration is valid.
    - lastHeartbeatTime: "2025-01-07T04:13:04Z"
      lastTransitionTime: "2025-01-07T04:13:04Z"
      message: gpu share policy is valid.
      reason: Valid
      status: "True"
      type: GPUSharePolicyValid
    # Indicates the GPU Share policy that is in effect on the current node.
    - lastHeartbeatTime: "2025-01-07T04:13:04Z"
      lastTransitionTime: "2025-01-07T04:13:04Z"
      message: gpu share policy is share-pool.
      reason: share-pool
      status: "True"
      type: GPUSharePolicy
For more information about the configuration items for GPU shared resources, see Node configuration.
Step 3: Deploy a pod with GPU shared resource specifications
Create a file named gpu-share-demo.yaml. Configure it to use the same share-pool model as the node.
apiVersion: v1
kind: Pod
metadata:
  labels:
    alibabacloud.com/compute-class: "gpu-hpn"
    # Set the GPU sharing model for the pod to share-pool, which is the same as the node configuration.
    alibabacloud.com/gpu-share-policy: "share-pool" # static
  name: gpu-share-demo
  namespace: default
spec:
  containers:
    - name: demo
      image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/acs/stress:v1.0.4
      args:
        - '1000h'
      command:
        - sleep
      # Specify the GPU shared resources gpu-core.percentage and gpu-memory.percentage in the resource description.
      # For more information about the effects of request and limit, see the configuration instructions.
      resources:
        limits:
          cpu: '5'
          memory: 50Gi
          alibabacloud.com/gpu-core.percentage: 100
          alibabacloud.com/gpu-memory.percentage: 100
        requests:
          cpu: '5'
          memory: 50Gi
          alibabacloud.com/gpu-core.percentage: 10
          alibabacloud.com/gpu-memory.percentage: 10
Deploy the sample pod.
kubectl apply -f gpu-share-demo.yaml
Step 4: Check the GPU shared resource usage of the pod
Log on to the container to check the GPU shared resource usage of the pod.
kubectl exec -it gpu-share-demo -- /bin/bash
Use commands such as `nvidia-smi` to view the GPU resource allocation and usage of the container. The actual output may vary.
For pods of the share-pool type, the BusID field displays `Pending` when the pod is not using GPU resources.
The specific command depends on the GPU card type. For example, nvidia-smi corresponds to NVIDIA series GPU devices. For other card types, submit a ticket for assistance.
(Optional) Step 5: Disable the GPU sharing policy on the node
Before disabling the policy, you must delete any pods on the node that request GPU shared resources. You do not need to delete pods that request only CPU and memory resources.
Delete the pod that uses the GPU sharing feature.
$ kubectl delete pod gpu-share-demo
Disable the GPU sharing feature on the node.
$ kubectl label node cn-wulanchabu-c.cr-xxx alibabacloud.com/gpu-share-policy=none --overwrite
Check the policy configuration status of the node again.
$ kubectl get node cn-wulanchabu-c.cr-xxx -o yaml
Expected output:
apiVersion: v1
kind: Node
spec:
  # ...
status:
  allocatable:
    # After the feature is disabled, the reserved CPU and memory resources are restored to their initial values.
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 4608Gi
  capacity:
    cpu: "176"
    memory: 1800Gi
    nvidia.com/gpu: "16"
    ephemeral-storage: 6Ti
  conditions:
    # Indicates whether the GPU Share policy configuration is valid.
    - lastHeartbeatTime: "2025-01-07T04:13:04Z"
      lastTransitionTime: "2025-01-07T04:13:04Z"
      message: gpu share policy config is valid.
      reason: Valid
      status: "True"
      type: GPUSharePolicyValid
    # Indicates the GPU Share policy that is in effect on the current node.
    - lastHeartbeatTime: "2025-01-07T04:13:04Z"
      lastTransitionTime: "2025-01-07T04:13:04Z"
      message: gpu share policy is none.
      reason: none
      status: "False"
      type: GPUSharePolicy
Detailed configuration instructions
Node configuration
Enablement configuration
To enable GPU sharing, you can configure a label on the node. The details are as follows.
Configuration item | Description | Valid values | Example |
alibabacloud.com/gpu-share-policy | The GPU resource sharing policy. | share-pool, static, none. The value none disables GPU sharing on the node. | alibabacloud.com/gpu-share-policy=share-pool |
If pods that use exclusive GPUs already exist on the node, you must delete them before you enable the sharing policy.
If pods that use GPU shared resources already exist on the node, you cannot modify or disable the GPU sharing policy. You must delete these pods first.
You do not need to delete pods that request only CPU and memory resources.
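Before you enable, modify, or disable the sharing policy, you can check which pods on the node still request GPU resources. The following commands are a sketch that uses the example node from this topic and only standard kubectl options.
# List the pods that run on the node.
kubectl get pods --all-namespaces --field-selector spec.nodeName=cn-wulanchabu-c.cr-xxx

# Check whether any of these pods request exclusive GPUs or GPU shared resources.
kubectl get pods --all-namespaces --field-selector spec.nodeName=cn-wulanchabu-c.cr-xxx -o yaml | grep -E "nvidia.com/gpu|gpu-core.percentage|gpu-memory.percentage"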
QoS configuration
On GPU-HPN nodes, you can configure the Quality of Service (QoS) parameters for GPU sharing in the node annotations. Use the following format.
apiVersion: v1
kind: Node
...
metadata:
  annotations:
    alibabacloud.com/gpu-share-qos-config: '{"preemptEnabled": true, "podMaxDurationMinutes": 120, "reservedEphemeralStorage": "1.5Ti"}'
...
The following describes the details:
Parameter | Type | Valid values | Description |
preemptEnabled | Boolean | true, false | Applies only to the share-pool model. Specifies whether to enable preemption. The default value is true, which enables preemption. |
podMaxDurationMinutes | Int | An integer greater than 0. Unit: minutes. | Applies only to the share-pool model. A pod can be preempted only if it has occupied a GPU for longer than this time. The default value is 120, which is 2 hours. |
reservedEphemeralStorage | resource.Quantity | Greater than or equal to 0. The unit is in Kubernetes string format, such as 500Gi. | The reserved capacity for the node's local temporary storage. The default value is 1.5 TiB. |
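The annotation can also be applied with kubectl instead of editing the node object directly. The following is a sketch that disables preemption and keeps the other defaults; the node name is the example node used in this topic.
# Apply or update the QoS configuration annotation. --overwrite replaces an existing value.
kubectl annotate node cn-wulanchabu-c.cr-xxx \
  alibabacloud.com/gpu-share-qos-config='{"preemptEnabled": false, "podMaxDurationMinutes": 120, "reservedEphemeralStorage": "1.5Ti"}' \
  --overwrite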
View shared resources on a node
After the feature is enabled, the corresponding GPU shared resource names are added to the `allocatable` and `capacity` fields of the node. The basic resource overhead is deducted from the `allocatable` field. The resource names are described as follows.
Configuration item | Description | Calculation method |
alibabacloud.com/gpu-core.percentage | The computing power of the GPU shared resource, in percentage format. This field is added when the feature is enabled and deleted when the feature is disabled. | Number of devices × 100. For example, for a machine with 16 GPUs, the value is 1600. |
alibabacloud.com/gpu-memory.percentage | The GPU memory of the GPU shared resource, in percentage format. This field is added when the feature is enabled and deleted when the feature is disabled. | Number of devices × 100. For example, for a machine with 16 GPUs, the value is 1600. |
cpu | After the feature is enabled, the basic overhead is deducted from the allocatable field of the node. | Number of devices × 2. For example, for a machine with 16 GPUs, 32 cores are reserved. |
memory | After the feature is enabled, the basic overhead is deducted from the allocatable field of the node. | Number of devices × 10 GB. For example, for a machine with 16 GPUs, 160 GB is reserved. |
ephemeral-storage | After the feature is enabled, the basic overhead is deducted from the allocatable field of the node. | 1.5 TB of disk space per node. |
Configuration validity
Field | Value | Description |
type | GPUSharePolicyValid | Indicates whether the current GPU Share configuration is valid. |
status | "True", "False" | "True" indicates that the configuration is valid. "False" indicates that the configuration is invalid. |
reason | Valid, InvalidParameters, InvalidExistingPods, ResourceNotEnough | The machine-readable reason for the current status. |
message | - | A user-friendly message. |
lastHeartbeatTime, lastTransitionTime | UTC | The time when the condition was last updated. |
Current effective GPU sharing policy
Field | Value | Description |
type | GPUSharePolicy | Indicates the GPU sharing policy that is in effect on the current node. |
status | "True", "False" | "True" indicates that a GPU sharing policy is in effect. "False" indicates that GPU sharing is disabled. |
reason | none, share-pool, static | The GPU sharing policy that is currently in effect on the node. |
message | - | A user-friendly message. |
lastHeartbeatTime, lastTransitionTime | UTC | The time when the condition was last updated. |
If the node resources do not change as described above after you enable or disable the feature, the configuration modification has failed. You can check the validity condition message in the conditions field.
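Instead of scanning the full node YAML, you can read the two conditions directly. The following commands are a sketch that uses standard kubectl JSONPath filtering.
# Print the GPUSharePolicyValid condition (whether the configuration is valid).
kubectl get node cn-wulanchabu-c.cr-xxx -o jsonpath='{.status.conditions[?(@.type=="GPUSharePolicyValid")]}{"\n"}'

# Print the GPUSharePolicy condition (the policy that is currently in effect).
kubectl get node cn-wulanchabu-c.cr-xxx -o jsonpath='{.status.conditions[?(@.type=="GPUSharePolicy")]}{"\n"}'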
Pod configuration
After the feature is enabled, you can use it by configuring the GPU shared resource label in the pod.
apiVersion: v1
kind: Pod
metadata:
  labels:
    # Only the gpu-hpn compute class is supported.
    alibabacloud.com/compute-class: "gpu-hpn"
    # Set the GPU sharing model for the pod to share-pool, which is the same as the node configuration.
    alibabacloud.com/gpu-share-policy: "share-pool"
  name: gpu-share-demo
  namespace: default
spec:
  containers:
    - name: demo
      image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/acs/stress:v1.0.4
      args:
        - '1000h'
      command:
        - sleep
      resources:
        limits:
          cpu: '5'
          memory: 50Gi
          alibabacloud.com/gpu-core.percentage: 100
          alibabacloud.com/gpu-memory.percentage: 100
        requests:
          cpu: '5'
          memory: 50Gi
          alibabacloud.com/gpu-core.percentage: 10
          alibabacloud.com/gpu-memory.percentage: 10
The configuration items are described as follows:
Compute class
Configuration item | Value | Description |
metadata.labels.alibabacloud.com/compute-class | gpu-hpn | Only the gpu-hpn compute class is supported. |
GPU sharing policy
Configuration item | Type | Valid values | Description |
metadata.labels.alibabacloud.com/gpu-share-policy | String | share-pool, static | Specifies the GPU sharing model for the pod. Only nodes that match this model are considered for scheduling. |
Resource requirements
Configure GPU shared resources in the container's resource requests to describe the computing power and GPU memory requirements and limits. These settings control the number of pods that can be scheduled on a node. The number of pods on a node is also limited by other resource dimensions, such as CPU, memory, and the maximum number of pods.
Requirement category | Configuration item | Type | Valid values | Description |
requests | alibabacloud.com/gpu-core.percentage | Int | share-pool policy: [10, 100]; static policy: [10, 100) | The computing power percentage. This indicates the requested proportion of a single GPU's computing power. The minimum is 10%. |
requests | alibabacloud.com/gpu-memory.percentage | Int | share-pool policy: [10, 100]; static policy: [10, 100) | The GPU memory percentage. This indicates the requested proportion of a single GPU's memory. The minimum is 10%. |
limits | alibabacloud.com/gpu-core.percentage | Int | | The computing power percentage. This indicates the limit on the proportion of a single GPU's computing power that the pod can use. The minimum is 10%. |
limits | alibabacloud.com/gpu-memory.percentage | Int | | The GPU memory percentage. This indicates the limit on the proportion of a single GPU's memory that the pod can use. The minimum is 10%. |
Configuration constraints
In addition to the constraints on individual configuration items, the following constraints apply when a pod requests resources.
You must specify both GPU computing power and GPU memory (alibabacloud.com/gpu-core.percentage and alibabacloud.com/gpu-memory.percentage) in both requests and limits.
A pod can have at most one container that uses GPU shared resources. This is typically the main container. Other containers, such as sidecar containers, can request only non-GPU resources such as CPU and memory.
A container cannot request both exclusive GPU resources (such as nvidia.com/gpu) and GPU shared resources (alibabacloud.com/gpu-core.percentage, alibabacloud.com/gpu-memory.percentage).
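As a counterpart to the share-pool example in this topic, the following is a minimal sketch of a pod that uses the static model. It assumes the target node is labeled alibabacloud.com/gpu-share-policy=static; the pod name and resource values are illustrative. Setting request equal to limit keeps pods on the fixed GPU from competing with each other.
apiVersion: v1
kind: Pod
metadata:
  labels:
    alibabacloud.com/compute-class: "gpu-hpn"
    # Match nodes that run the static sharing model.
    alibabacloud.com/gpu-share-policy: "static"
  name: gpu-share-static-demo
  namespace: default
spec:
  containers:
    - name: demo
      image: registry-cn-wulanchabu-vpc.ack.aliyuncs.com/acs/stress:v1.0.4
      command:
        - sleep
      args:
        - '1000h'
      resources:
        # request == limit: the pod can obtain its GPU share at any time without queuing.
        requests:
          cpu: '5'
          memory: 50Gi
          alibabacloud.com/gpu-core.percentage: 50
          alibabacloud.com/gpu-memory.percentage: 50
        limits:
          cpu: '5'
          memory: 50Gi
          alibabacloud.com/gpu-core.percentage: 50
          alibabacloud.com/gpu-memory.percentage: 50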
FAQ
What happens to a pod in the ready queue if no GPU resources are available?
When a GPU sharing pod is waiting for resources, it periodically prints a message. The following is a sample message.
You have been waiting for ${1} seconds. Approximate position: ${2}
The ${1} parameter indicates the waiting time, and the ${2} parameter indicates the current position in the ready queue.
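Assuming the message is written to the container's standard output, you can watch for it with kubectl logs. A sketch using the example pod from this topic:
# Follow the container log and watch for the waiting message.
kubectl logs -f gpu-share-demo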
What are the pod monitoring metrics specific to the GPU sharing mode?
For pods that use GPU shared resources, you can use the following metrics to view their resource usage.
Metric | Description | Example |
DCGM_FI_POOLING_STATUS | Provided only in share-pool mode. Indicates the pod's status in the GPU sharing mode: hibernation, ready, or running. A value of 1 indicates that the pod is in the ready state and waiting for resources. | |
DCGM_FI_POOLING_POSITION | Provided only in share-pool mode. Indicates that the pod is waiting for resources in the ready queue. The value indicates the pod's position in the ready queue, starting from 1. This metric appears only when POOLING_STATUS=1. | |
How are GPU utilization metrics different when a pod uses GPU sharing?
Pods that use GPU sharing report the same set of GPU utilization metrics as other pods, but the labels and meanings of some metrics differ, as described below.
In the pod monitoring data provided by ACS, metrics such as GPU computing power utilization and GPU memory usage are absolute values based on the entire GPU card, which is the same as in the exclusive GPU scenario.
The GPU memory usage seen within a pod using commands such as `nvidia-smi` is an absolute value, which is the same as in the exclusive GPU scenario. However, the computing power utilization is a relative value, where the denominator is the pod's limit.
The device information, such as the ID number in the pod's GPU utilization metrics, corresponds to the actual ID on the node. The numbering does not always start from 0.
For the share-pool sharing model, the device number in the metrics may change because the pod elastically uses different GPU devices from the pool.
If GPU sharing is enabled on only some nodes in a cluster, how can I avoid scheduling conflicts with exclusive GPU pods?
The default scheduler in an ACS cluster automatically matches pod and node types to avoid scheduling conflicts.
If you use a custom scheduler, an exclusive GPU pod might be scheduled to a GPU sharing node because the node's capacity includes both GPU device resources and GPU shared resources. You can choose one of the following solutions:
Solution 1: Write a scheduler plugin that automatically detects the configuration labels and condition protocol of ACS nodes to filter out nodes of a mismatched type. For more information, see Scheduling Framework.
Solution 2: Use Kubernetes labels or taints. Add a label or taint to the nodes where GPU sharing is enabled. Then, configure different affinity policies for exclusive GPU pods and shared GPU pods. A minimal sketch of this approach follows.
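The following is a sketch of Solution 2. It assumes you add a custom label such as gpu-mode=share to the GPU sharing nodes (for example, kubectl label node cn-wulanchabu-c.cr-xxx gpu-mode=share); the label key and value are illustrative. Exclusive GPU pods then use node affinity to avoid those nodes.
# Fragment of an exclusive GPU pod spec; only the affinity-related fields are shown.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-mode
                operator: NotIn
                values:
                  - share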
What information is available when a GPU sharing pod is preempted?
For the share-pool sharing model, when preemption is triggered, the pod has an Event and a Condition. An Event is in an unstructured data format. To read structured data, you can retrieve it from the `reason` and `status` fields of the corresponding Condition. The details are as follows.
# Indicates that the GPU resources of the current pod were preempted. The name of the preempting pod is <new-pod-name>.
Warning GPUSharePreempted 5m15s gpushare GPU is preempted by <new-pod-name>.
# Indicates that the current pod preempted the GPU resources of another pod. The name of the preempted pod is <old-pod-name>.
Warning GPUSharePreempt 3m47s gpushare GPU is preempted from <old-pod-name>.
- type: Interruption.GPUShareReclaim # The condition type for a GPU sharing pod preemption.
  status: "True" # True indicates that the pod preempted another pod or was itself preempted.
  reason: GPUSharePreempt # GPUSharePreempt indicates that this pod preempted another pod. GPUSharePreempted indicates that this pod was preempted by another pod.
  message: GPU is preempted from <old-pod-name>. # A user-friendly message similar to the event.
  lastTransitionTime: "2025-04-22T08:12:09Z" # The time when the preemption occurred.
  lastProbeTime: "2025-04-22T08:12:09Z"
How can I run more pods on a node in a Notebook scenario?
For pods with GPU sharing enabled, ACS also lets you configure CPU and memory specifications where the `request` is less than the `limit`. This helps to fully utilize node resources.

Note that when the total `limit` of resources for pods submitted to a node exceeds the node's allocatable resources, the pods compete for CPU and memory. You can analyze this competition by reviewing the node's resource utilization data. For more information, see ACS GPU-HPN node-level monitoring metrics. For a pod, CPU resource competition is reflected in the pod's CPU steal time. Memory resource competition triggers a machine-wide out-of-memory (OOM) error, which causes some pods to be killed.

Plan your pod priorities and resource specifications based on your application's characteristics so that resource competition does not degrade pod service quality.
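For example, the container resources of a Notebook pod could be declared as follows. This is a sketch with illustrative values: each pod is guaranteed its requests at scheduling time and can burst up to its limits when the node has spare CPU and memory.
resources:
  requests:
    cpu: '2'        # Guaranteed amount used for scheduling decisions.
    memory: 16Gi
    alibabacloud.com/gpu-core.percentage: 50
    alibabacloud.com/gpu-memory.percentage: 50
  limits:
    cpu: '8'        # Upper bound; pods may compete for CPU when the node is saturated.
    memory: 32Gi
    alibabacloud.com/gpu-core.percentage: 100
    alibabacloud.com/gpu-memory.percentage: 100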