For GPU-intensive workloads such as AI model training, inference, and scientific computing, resource demand often fluctuates significantly. Given the high cost of GPU hardware, manually managing capacity can be inefficient. By creating a GPU node pool with auto scaling enabled, you can dynamically adjust the number of nodes based on real-time resource demand. This on-demand, elastic scheduling improves GPU utilization and reduces O&M costs.
Preparations
You have an ACK managed Pro cluster.
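Optionally, confirm that kubectl can reach the cluster before you start. This is a quick sketch and assumes you have already obtained the cluster's kubeconfig:

kubectl get nodes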
Step 1: Create a GPU node pool with auto scaling enabled
To ensure proper scheduling and resource isolation for your GPU workloads, create a dedicated GPU node pool and enable auto scaling. This allows the system to dynamically adjust computing resources based on workload changes.
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster to manage and click its name. In the left navigation pane, choose Nodes > Node Pools.
Click Create Node Pool and configure it with the following key settings. For more information, see Create and manage node pools.
Scaling Mode: Select Auto and set the Min. Instances and Max. Instances for the node pool.
Note: If the cluster does not have enough resources to schedule application pods, Container Service for Kubernetes (ACK) automatically scales out nodes within the configured minimum and maximum number of instances.
Instance Configuration Mode: Select Specify Instance Type.
Architecture: Select GPU-accelerated.
Instance Type: Select a suitable GPU-accelerated instance type for your workload, such as ecs.gn7i-c8g1.2xlarge (NVIDIA A10). Selecting multiple instance types improves the likelihood of a successful scale-up.
Taints: To prevent non-GPU workloads from being scheduled to GPU-accelerated nodes, add a taint to the node pool. For example:
Key: scaler
Value: gpu
Effect: NoSchedule

Node Labels: Add a unique label to the node pool, such as gpu-spec: NVIDIA-A10, to target it for your GPU applications.
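For reference, the taint and label configured above appear on each node provisioned from the node pool roughly as follows. This is a sketch of the relevant fields of the Node object, not something you need to configure manually:

metadata:
  labels:
    gpu-spec: NVIDIA-A10      # Node label added by the node pool, used for scheduling.
spec:
  taints:
  - key: scaler
    value: gpu
    effect: NoSchedule        # Keeps pods without a matching toleration off the node.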
Step 2: Configure the application for GPU scheduling
To schedule your application to the GPU node pool, you must modify its Deployment manifest to request GPU resources and specify the correct node affinity and tolerations.
Configure GPU resource requests.
In the container's resources section, request the number of GPUs required.

# ...
    spec:
      containers:
      - name: gpu-auto-scaler
        # ...
        resources:
          limits:
            nvidia.com/gpu: 1 # Request 1 GPU
# ...

Add node affinity.
Use nodeAffinity to ensure the pod is scheduled only onto nodes with the label you applied to the GPU node pool.

# ...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu-spec        # Match the label key set for the node pool.
                operator: In
                values:
                - NVIDIA-A10         # Match the label value set for the node pool.
# ...

Add tolerations.
Configure tolerations to match the node pool configuration. This ensures that the pod can be scheduled to GPU nodes with the corresponding taint.
# ...
    spec:
      tolerations:
      - key: "scaler"          # Match the taint key set for the node pool.
        operator: "Equal"
        value: "gpu"           # Match the taint value set for the node pool.
        effect: "NoSchedule"   # Match the taint effect set for the node pool.
# ...
Step 3: Deploy the application and validate node scaling
This example uses a Deployment to show how to verify the scaling behaviors.
Create a file named gpu-deployment.yaml. This Deployment requests two pod replicas, each requiring one GPU on a node labeled gpu-spec: NVIDIA-A10.
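A minimal manifest that combines the snippets from Step 2 might look like the following sketch. The Deployment name, container name, and image follow the example output later in this topic; the sleep command is a placeholder assumption and should be replaced with your actual GPU workload.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-auto-scaler
  labels:
    app: gpu-auto-scaler
spec:
  replicas: 2                              # Two replicas, one GPU each.
  selector:
    matchLabels:
      app: gpu-auto-scaler
  template:
    metadata:
      labels:
        app: gpu-auto-scaler
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu-spec              # Match the label key set for the node pool.
                operator: In
                values:
                - NVIDIA-A10               # Match the label value set for the node pool.
      tolerations:
      - key: "scaler"                      # Match the taint key set for the node pool.
        operator: "Equal"
        value: "gpu"                       # Match the taint value set for the node pool.
        effect: "NoSchedule"               # Match the taint effect set for the node pool.
      containers:
      - name: gpu-auto-scaler
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04   # Image from the example output below.
        command: ["sleep", "infinity"]     # Placeholder workload (assumption); replace with your GPU workload.
        resources:
          limits:
            nvidia.com/gpu: 1              # Request 1 GPU per pod.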
Deploy the application and trigger the initial scale-out.
Deploy the application.
kubectl apply -f gpu-deployment.yaml

Since there are no matching GPU nodes in the cluster, the pods will remain in the Pending state and trigger the cluster autoscaler to provision new nodes from the GPU node pool. This process may take several minutes.

Monitor the events of a pending pod to see the TriggeredScaleUp event.

kubectl describe pod <your-pod-name>

Expected output:
Events:
  Type    Reason            Age    From                Message
  ----    ------            ----   ----                -------
  Normal  Scheduled         6m8s   default-scheduler   Successfully assigned default/gpu-auto-scaler-565994fcf9-6nmz2 to cn-shanghai.10.XX.XX.244
  Normal  TriggeredScaleUp  8m32s  cluster-autoscaler  pod triggered scale-up: [{asg-uf646aomci1pkqya54y7 0->2 (max: 10)}]
  Normal  AllocIPSucceed    6m4s   terway-daemon       Alloc IP 10.XX.XX.245/16 took 4.505870999s
  Normal  Pulling           6m4s   kubelet             Pulling image "registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04"
  Normal  Pulled            6m2s   kubelet             Successfully pulled image "registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04" in 1.687s (1.687s including waiting). Image size: 29542023 bytes.
  Normal  Created           6m2s   kubelet             Created container: gpu-auto-scaler
  Normal  Started           6m2s   kubelet             Started container gpu-auto-scaler

Once the pods are running, list the nodes with the GPU label.
kubectl get nodes -l gpu-spec=NVIDIA-A10

You should see two new GPU nodes:
NAME                       STATUS   ROLES    AGE     VERSION
cn-shanghai.10.XX.XX.243   Ready    <none>   7m26s   v1.34.1-aliyun.1
cn-shanghai.10.XX.XX.244   Ready    <none>   7m25s   v1.34.1-aliyun.1
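Optionally, confirm that the node pool's label and taint were applied to a new node. For example, using one of the node names from the output above (the names in your cluster will differ):

kubectl describe node cn-shanghai.10.XX.XX.244 | grep -E "Taints|gpu-spec"

The output should include the gpu-spec=NVIDIA-A10 label and the scaler=gpu:NoSchedule taint.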
Verify automatic node scale-out.
Scale the Deployment to three replicas.
kubectl scale deployment gpu-auto-scaler --replicas=3

Run kubectl get pod. Two pods are running, and one new pod is in the Pending state due to insufficient resources. This triggers another node pool scale-out.

Wait a few minutes, then run kubectl get nodes -l gpu-spec=NVIDIA-A10 again. You should see that the number of nodes with the corresponding label in the cluster has increased to 3.
NAME                       STATUS   ROLES    AGE     VERSION
cn-shanghai.10.XX.XX.243   Ready    <none>   11m     v1.34.1-aliyun.1
cn-shanghai.10.XX.XX.244   Ready    <none>   11m     v1.34.1-aliyun.1
cn-shanghai.10.XX.XX.247   Ready    <none>   45s     v1.34.1-aliyun.1
Verify automatic node scale-in.
Scale the Deployment down to one replica.
kubectl scale deployment gpu-auto-scaler --replicas=1

The two extra pods are terminated, leaving two GPU nodes idle. After the scale-in delay is reached, the node scaling component automatically removes the idle nodes from the cluster to save costs.

Once the scale-in delay has elapsed, run kubectl get nodes -l gpu-spec=NVIDIA-A10 again. You should see that the number of nodes with the corresponding label in the cluster has been reduced to 1.
NAME                       STATUS   ROLES    AGE     VERSION
cn-shanghai.10.XX.XX.243   Ready    <none>   31m     v1.34.1-aliyun.1
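To watch the scale-in as it happens, you can keep a watch on the labeled nodes and observe them being removed (press Ctrl+C to stop):

kubectl get nodes -l gpu-spec=NVIDIA-A10 -w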
Apply in production environments
Cost optimization: GPU-accelerated instances are expensive, and scaled-out nodes are pay-as-you-go. Consider adding spot instances to your node pool configuration to significantly reduce costs. Always set a reasonable value for Max. Instances for your node pool to prevent unexpected cost overruns.
High availability: To avoid scale-out failures due to insufficient inventory in a single zone or for a single instance type, configure your node pool with vSwitches in multiple zones and select multiple GPU instance types.
Monitoring and alerting: Enable GPU monitoring for the cluster to gain insights into GPU utilization, health status, and workload performance. This will help you quickly diagnose issues and optimize resource allocation.
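For a quick spot check of GPU utilization, you can also run nvidia-smi inside a running GPU pod. This sketch assumes the GPU runtime or the container image makes nvidia-smi available in the pod:

kubectl exec -it <your-pod-name> -- nvidia-smi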
FAQ
Why did my GPU pod remain pending without triggering a scale-up?
Possible reasons include:
Incorrect affinity: Check whether the labels in the application's nodeAffinity configuration exactly match the labels of the node pool.
Mismatched resource requests: Ensure that the number of GPUs requested by the application (nvidia.com/gpu) does not exceed the maximum number that a single node can provide.
Autoscaler issues: Check the logs of the node scaling component for error messages. A log-collection sketch follows this list.
Node pool limits: Check whether the node pool has already reached its configured maximum number of instances.
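To collect the component logs, assuming the node scaling component runs as the cluster-autoscaler Deployment in the kube-system namespace (the component name may differ depending on the node scaling solution your cluster uses), commands like the following can help:

kubectl -n kube-system get deploy | grep -i autoscaler              # Locate the component.
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=200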
How can I use different types of GPUs in the same cluster?
Create multiple node pools, each with a different GPU instance type and a unique node label, such as gpu-spec: NVIDIA-A10 and gpu-spec: NVIDIA-L20. When you deploy applications, use nodeAffinity in your pod spec to schedule different workloads to the appropriate GPU type.
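For example, a workload intended for the L20 node pool could use an affinity like the following sketch, where the values entry must match the label you set on that node pool:

# Pod spec snippet for a workload that should run only on L20 nodes.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-spec
            operator: In
            values:
            - NVIDIA-L20    # Label value of the L20 node pool.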
How can I view the GPUs attached to a node?
After the node pool is created and nodes are provisioned, you can view the GPUs attached to a GPU-accelerated node by checking the node's capacity and allocatable resources, which include the nvidia.com/gpu count.
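For example, using a node name from the earlier output (the name in your cluster will differ):

kubectl describe node cn-shanghai.10.XX.XX.243 | grep -A 10 "Capacity"
# Or print only the GPU count:
kubectl get node cn-shanghai.10.XX.XX.243 -o jsonpath='{.status.capacity.nvidia\.com/gpu}'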