For GPU-intensive workloads such as AI model training, inference, and scientific computing, resource demand often fluctuates significantly. Given the high cost of GPU hardware, manually managing capacity can be inefficient. By creating a GPU node pool with auto scaling enabled, you can dynamically adjust the number of nodes based on real-time resource demand. This on-demand, elastic scheduling improves GPU utilization and reduces O&M costs.
Preparations
You have an ACK managed Pro cluster.
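Optionally, confirm that kubectl can reach the cluster before you start. This is a quick sketch and assumes you have already obtained the cluster's kubeconfig:

kubectl get nodes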
Step 1: Create a GPU node pool with auto scaling enabled
To ensure proper scheduling and resource isolation for your GPU workloads, create a dedicated GPU node pool and enable auto scaling. This allows the system to dynamically adjust computing resources based on workload changes.
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster to manage and click its name. In the left navigation pane, choose Nodes > Node Pools.
Click Create Node Pool and configure it with the following key settings. For more information, see Create and manage node pools.
Scaling Mode: Select Auto and set the Min. Instances and Max. Instances for the node pool.
Note: If the cluster does not have enough resources to schedule application pods, Container Service for Kubernetes (ACK) automatically scales out nodes within the configured minimum and maximum number of instances.
Instance Configuration Mode: Select Specify Instance Type.
Architecture: Select GPU-accelerated.
Instance Type: Select a suitable GPU-accelerated instance type for your workload, such as ecs.gn7i-c8g1.2xlarge (NVIDIA A10). Selecting multiple instance types improves the likelihood of a successful scale-up.
Taints: To prevent non-GPU workloads from being scheduled to GPU-accelerated nodes, add a taint to the node pool. For example:
Key: scaler
Value: gpu
Effect: NoSchedule

Node Labels: Add a unique label to the node pool, such as gpu-spec: NVIDIA-A10, to target it for your GPU applications.
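For reference, the taint and label configured above appear on each node provisioned from the node pool roughly as follows. This is a sketch of the relevant fields of the Node object, not something you need to configure manually:

metadata:
  labels:
    gpu-spec: NVIDIA-A10      # Node label added by the node pool, used for scheduling.
spec:
  taints:
  - key: scaler
    value: gpu
    effect: NoSchedule        # Keeps pods without a matching toleration off the node.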
Step 2: Configure the application for GPU scheduling
To schedule your application to the GPU node pool, you must modify its Deployment manifest to request GPU resources and specify the correct node affinity and tolerations.
Configure GPU resource requests.
In the container's resources section, request the number of GPUs required.

# ...
    spec:
      containers:
      - name: gpu-auto-scaler
        # ...
        resources:
          limits:
            nvidia.com/gpu: 1 # Request 1 GPU
# ...

Add node affinity.
Use nodeAffinity to ensure the pod is scheduled only onto nodes with the label you applied to the GPU node pool.

# ...
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu-spec        # Match the label key set for the node pool.
                operator: In
                values:
                - NVIDIA-A10         # Match the label value set for the node pool.
# ...

Add tolerations.
Configure tolerations to match the node pool configuration. This ensures that the pod can be scheduled to GPU nodes with the corresponding taint.
# ...
    spec:
      tolerations:
      - key: "scaler"          # Match the taint key set for the node pool.
        operator: "Equal"
        value: "gpu"           # Match the taint value set for the node pool.
        effect: "NoSchedule"   # Match the taint effect set for the node pool.
# ...
Step 3: Deploy the application and validate node scaling
This example uses a Deployment to show how to verify the scaling behaviors.
Create a file named gpu-deployment.yaml. This Deployment requests two pod replicas, each requiring one GPU on a node labeled gpu-spec: NVIDIA-A10.
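A minimal manifest that combines the snippets from Step 2 might look like the following sketch. The Deployment name, container name, and image follow the example output later in this topic; the sleep command is a placeholder assumption and should be replaced with your actual GPU workload.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-auto-scaler
  labels:
    app: gpu-auto-scaler
spec:
  replicas: 2                              # Two replicas, one GPU each.
  selector:
    matchLabels:
      app: gpu-auto-scaler
  template:
    metadata:
      labels:
        app: gpu-auto-scaler
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu-spec              # Match the label key set for the node pool.
                operator: In
                values:
                - NVIDIA-A10               # Match the label value set for the node pool.
      tolerations:
      - key: "scaler"                      # Match the taint key set for the node pool.
        operator: "Equal"
        value: "gpu"                       # Match the taint value set for the node pool.
        effect: "NoSchedule"               # Match the taint effect set for the node pool.
      containers:
      - name: gpu-auto-scaler
        image: registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04   # Image from the example output below.
        command: ["sleep", "infinity"]     # Placeholder workload (assumption); replace with your GPU workload.
        resources:
          limits:
            nvidia.com/gpu: 1              # Request 1 GPU per pod.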
Deploy the application and trigger the initial scale-out.
Deploy the application.
kubectl apply -f gpu-deployment.yaml

Since there are no matching GPU nodes in the cluster, the pods will remain in the Pending state and trigger the cluster autoscaler to provision new nodes from the GPU node pool. This process may take several minutes.

Monitor the events of a pending pod to see the TriggeredScaleUp event.

kubectl describe pod <your-pod-name>

Expected output:
Events:
  Type    Reason            Age    From                Message
  ----    ------            ----   ----                -------
  Normal  Scheduled         6m8s   default-scheduler   Successfully assigned default/gpu-auto-scaler-565994fcf9-6nmz2 to cn-shanghai.10.XX.XX.244
  Normal  TriggeredScaleUp  8m32s  cluster-autoscaler  pod triggered scale-up: [{asg-uf646aomci1pkqya54y7 0->2 (max: 10)}]
  Normal  AllocIPSucceed    6m4s   terway-daemon       Alloc IP 10.XX.XX.245/16 took 4.505870999s
  Normal  Pulling           6m4s   kubelet             Pulling image "registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04"
  Normal  Pulled            6m2s   kubelet             Successfully pulled image "registry-cn-hangzhou.ack.aliyuncs.com/dev/ubuntu:22.04" in 1.687s (1.687s including waiting). Image size: 29542023 bytes.
  Normal  Created           6m2s   kubelet             Created container: gpu-auto-scaler
  Normal  Started           6m2s   kubelet             Started container gpu-auto-scaler

Once the pods are running, list the nodes with the GPU label.
kubectl get nodes -l gpu-spec=NVIDIA-A10

You should see two new GPU nodes:
NAME                       STATUS   ROLES    AGE     VERSION
cn-shanghai.10.XX.XX.243   Ready    <none>   7m26s   v1.34.1-aliyun.1
cn-shanghai.10.XX.XX.244   Ready    <none>   7m25s   v1.34.1-aliyun.1
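Optionally, confirm that the node pool's label and taint were applied to a new node. For example, using one of the node names from the output above (the names in your cluster will differ):

kubectl describe node cn-shanghai.10.XX.XX.244 | grep -E "Taints|gpu-spec"

The output should include the gpu-spec=NVIDIA-A10 label and the scaler=gpu:NoSchedule taint.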
Verify automatic node scale-out.
Scale the Deployment to three replicas.
kubectl scale deployment gpu-auto-scaler --replicas=3

Run kubectl get pod. Two pods are running, and one new pod is in the Pending state due to insufficient resources. This triggers another node pool scale-out.

Wait a few minutes, then run kubectl get nodes -l gpu-spec=NVIDIA-A10 again. You should see that the number of nodes with the corresponding label in the cluster has increased to 3.
NAME                       STATUS   ROLES    AGE     VERSION
cn-shanghai.10.XX.XX.243   Ready    <none>   11m     v1.34.1-aliyun.1
cn-shanghai.10.XX.XX.244   Ready    <none>   11m     v1.34.1-aliyun.1
cn-shanghai.10.XX.XX.247   Ready    <none>   45s     v1.34.1-aliyun.1
Verify automatic node scale-in.
Scale the Deployment down to one replica.
kubectl scale deployment gpu-auto-scaler --replicas=1

The two extra pods are terminated, leaving two GPU nodes idle. After the scale-in delay is reached, the node scaling component automatically removes the idle nodes from the cluster to save costs.

Once the scale-in delay has elapsed, run kubectl get nodes -l gpu-spec=NVIDIA-A10 again. You should see that the number of nodes with the corresponding label in the cluster has been reduced to 1.
NAME                       STATUS   ROLES    AGE     VERSION
cn-shanghai.10.XX.XX.243   Ready    <none>   31m     v1.34.1-aliyun.1
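To watch the scale-in as it happens, you can keep a watch on the labeled nodes and observe them being removed (press Ctrl+C to stop):

kubectl get nodes -l gpu-spec=NVIDIA-A10 -w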
Apply in production environments
Cost optimization: GPU-accelerated instances are expensive, and scaled-out nodes are pay-as-you-go. Consider adding spot instances to your node pool configuration to significantly reduce costs. Always set a reasonable value for Max. Instances for your node pool to prevent unexpected cost overruns.
High availability: To avoid scale-out failures due to insufficient inventory in a single zone or for a single instance type, configure your node pool with vSwitches in multiple zones and select multiple GPU instance types.
Monitoring and alerting: Enable GPU monitoring for the cluster to gain insights into GPU utilization, health status, and workload performance. This will help you quickly diagnose issues and optimize resource allocation.
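For a quick spot check of GPU utilization, you can also run nvidia-smi inside a running GPU pod. This sketch assumes the GPU runtime or the container image makes nvidia-smi available in the pod:

kubectl exec -it <your-pod-name> -- nvidia-smi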
FAQ
Why did my GPU pod remain pending without triggering a scale-up?
Possible reasons include:
Incorrect affinity: Check whether the labels in the application's nodeAffinity configuration exactly match the labels of the node pool.
Mismatched resource requests: Ensure that the number of GPUs requested by the application (nvidia.com/gpu) does not exceed the maximum number that a single node can provide.
Autoscaler issues: Check the logs of the node scaling component for error messages. A log-collection sketch follows this list.
Node pool limits: Check whether the node pool has already reached its configured maximum number of instances.
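To collect the component logs, assuming the node scaling component runs as the cluster-autoscaler Deployment in the kube-system namespace (the component name may differ depending on the node scaling solution your cluster uses), commands like the following can help:

kubectl -n kube-system get deploy | grep -i autoscaler              # Locate the component.
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=200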
How can I use different types of GPUs in the same cluster?
Create multiple node pools, each with a different GPU instance type and a unique node label, such as gpu-spec: NVIDIA-A10 and gpu-spec: NVIDIA-L20. When you deploy applications, use nodeAffinity in your pod spec to schedule different workloads to the appropriate GPU type.
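For example, a workload intended for the L20 node pool could use an affinity like the following sketch, where the values entry must match the label you set on that node pool:

# Pod spec snippet for a workload that should run only on L20 nodes.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-spec
            operator: In
            values:
            - NVIDIA-L20    # Label value of the L20 node pool.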
How can I view the GPUs attached to a node?
After the node pool is created and nodes are provisioned, you can view the GPUs attached to a GPU-accelerated node by checking the node's capacity and allocatable resources, which include the nvidia.com/gpu count.
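For example, using a node name from the earlier output (the name in your cluster will differ):

kubectl describe node cn-shanghai.10.XX.XX.243 | grep -A 10 "Capacity"
# Or print only the GPU count:
kubectl get node cn-shanghai.10.XX.XX.243 -o jsonpath='{.status.capacity.nvidia\.com/gpu}'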