
Container Service for Kubernetes:Deploy and run GPU workloads

Last Updated: Mar 26, 2026

GPU nodes are expensive to keep running around the clock. With Auto Mode enabled on an ACK cluster, managed node pools provision GPU nodes only when workloads need them and release those nodes when they go idle — so you pay only for the GPU time your applications actually use. This is especially cost-effective for workloads with fluctuating demand, such as online inference.

Prerequisites

Before you begin, ensure that you have:

  • An ACK cluster with Auto Mode enabled.

  • A kubectl client configured to connect to the cluster.

Step 1: Create a managed node pool with GPU instances

Create a dedicated node pool for GPU workloads so GPU nodes stay isolated from general-purpose workloads. When a GPU workload is submitted, Auto Mode automatically provisions the required GPU nodes. When those nodes go idle and meet the scale-in criteria, Auto Mode releases them automatically.

  1. On the ACK Clusters page, click the name of your cluster. In the left navigation pane, choose Nodes > Node Pools.

  2. Click Create Node Pool and configure the following parameters. For all available options, see Create a node pool.

    • Configure Managed Node Pool: Select Auto Mode.

    • vSwitch: Select two or more vSwitches in different zones for high availability. During scaling, nodes are provisioned in the zones of the selected vSwitches based on the Scaling Policy.

    • Instance-related configurations: Set Instance Configuration Mode to Specify Instance Type. Set Architecture to GPU-accelerated. For Instance Type, select one or more GPU instance types, such as ecs.gn7i-c8g1.2xlarge (NVIDIA A10). Selecting multiple instance types improves the success rate of scale-out.

    • Taints: Add a taint to prevent non-GPU workloads from being scheduled onto these GPU nodes: Key nvidia.com/gpu, Value true, Effect NoSchedule.
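
Once the node pool is created, Auto Mode applies this taint to every node it provisions; you do not set it on individual nodes yourself. For reference only, the taint as it appears in a provisioned Node's spec looks like this:

```yaml
# Excerpt of a Node object from this pool (reference only, not applied manually)
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```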

Step 2: Configure GPU resource requests and taint tolerations

For the Pod to land on the GPU node pool and trigger auto-provisioning, declare its GPU requirements and taint toleration in the YAML manifest.

GPU resource request — in the container's resources field:

# ...
spec:
  containers:
  - name: gpu-automode
    resources:
      limits:
        nvidia.com/gpu: 1   # Request 1 GPU
# ...

Taint toleration — allows the Pod to be scheduled onto nodes with the nvidia.com/gpu taint:

# ...
spec:
  tolerations:
  - key: "nvidia.com/gpu"        # Matches the taint key set on the node pool
    operator: "Equal"
    value: "true"                # Matches the taint value
    effect: "NoSchedule"         # Matches the taint effect
# ...
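
Putting the two pieces together, a minimal Pod that triggers GPU node provisioning might look like the following sketch. The Pod name and image are placeholders; replace them with your own.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-automode   # placeholder name
spec:
  containers:
  - name: gpu-automode
    image: <your-gpu-image>   # placeholder: replace with your image
    resources:
      limits:
        nvidia.com/gpu: 1     # request 1 GPU; triggers scale-out in the GPU node pool
  tolerations:
  - key: "nvidia.com/gpu"     # matches the node pool taint
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```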

Step 3: Deploy the GPU workload and verify autoscaling

The following example uses a Stable Diffusion Web UI application to demonstrate the full deployment and autoscaling flow.

Deploy the workload

  1. Create a file named stable-diffusion.yaml with the following content. The manifest contains two parts:

    • Deployment: Runs the Stable Diffusion workload. The Pod requests one NVIDIA GPU and declares the corresponding taint toleration.

    • Service: Exposes the workload on port 7860 through the public IP address of a LoadBalancer Service.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: stable-diffusion
      name: stable-diffusion
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: stable-diffusion
      template:
        metadata:
          labels:
            app: stable-diffusion
        spec:
          containers:
          - args:
            - --listen
            command:
            - python3
            - launch.py
            image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu
            imagePullPolicy: IfNotPresent
            name: stable-diffusion
            ports:
            - containerPort: 7860
              protocol: TCP
            readinessProbe:
              tcpSocket:
                port: 7860
            resources:
              limits:
                nvidia.com/gpu: 1   # Request 1 GPU
              requests:
                cpu: "6"
                memory: 12Gi
          # Allow the Pod to be scheduled onto the GPU node pool
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Equal"
            value: "true"
            effect: "NoSchedule"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      annotations:
        # Expose the Service over the Internet
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: internet
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
      name: stable-diffusion-svc
      namespace: default
    spec:
      externalTrafficPolicy: Local
      ports:
      - port: 7860
        protocol: TCP
        targetPort: 7860
      selector:
        app: stable-diffusion   # Routes traffic to Pods with this label
      type: LoadBalancer
  2. Apply the manifest.

    kubectl apply -f stable-diffusion.yaml

Verify node auto-provisioning

Right after deployment, the Pod enters a Pending state because no GPU nodes exist yet. Auto Mode detects the unschedulable Pod and provisions one.

  1. Check the Pod status.

    kubectl get pod -l app=stable-diffusion
  2. Inspect the Pod events to confirm that provisioning was triggered.

    kubectl describe pod -l app=stable-diffusion

    In the Events section, a FailedScheduling warning appears first, followed by a ProvisionNode event, indicating that Auto Mode has started provisioning a GPU node.

    Events:
      Type     Reason            Age   From               Message
      ----     ------            ----  ----               -------
      Warning  FailedScheduling  15m   default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 Insufficient cpu, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
      Normal   ProvisionNode     16m   GOATScaler         Provision node asa-2ze2h0f4m5ctpd8kn4f1 in Zone: cn-beijing-k with InstanceType: ecs.gn7i-c8g1.2xlarge, Triggered time 2025-11-19 02:58:01.096
      Normal   AllocIPSucceed    12m   terway-daemon      Alloc IP 10.XX.XX.141/16 took 4.764400743s
      Normal   Pulling           12m   kubelet            Pulling image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu"
      Normal   Pulled            3m48s kubelet            Successfully pulled image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu" in 8m47.675s (8m47.675s including waiting). Image size: 11421866941 bytes.
      Normal   Created           3m42s kubelet            Created container: stable-diffusion
      Normal   Started           3m24s kubelet            Started container stable-diffusion
  3. Confirm the provisioned node is Ready.

    # Store the node name where the Pod is running
    NODE_NAME=$(kubectl get pod -l app=stable-diffusion -o jsonpath='{.items[0].spec.nodeName}')
    
    # Print the node name
    echo "Stable Diffusion is running on node: $NODE_NAME"
    
    # Verify the node is in Ready state
    kubectl get node $NODE_NAME

Access the Stable Diffusion UI

Wait a few minutes for the node to join the cluster and the Pod to start, then access the application.

  1. Get the public IP address of the Service.

    kubectl get svc stable-diffusion-svc

    Find the EXTERNAL-IP in the output.

    NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
    stable-diffusion-svc   LoadBalancer   192.XXX.XX.196   8.XXX.XX.68   7860:31302/TCP   18m
  2. Open a browser and navigate to http://<EXTERNAL-IP>:7860. If the Stable Diffusion Web UI loads, the workload is running on the GPU node.
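
As an optional command-line check (assuming kubectl can reach your cluster and the Service has finished provisioning its public IP), you can fetch the external IP and probe the port instead of opening a browser:

```shell
# Read the Service's public IP from its status
EXTERNAL_IP=$(kubectl get svc stable-diffusion-svc \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Probe the Web UI port; an HTTP status line in the output indicates the app is up
curl -sI --max-time 10 "http://${EXTERNAL_IP}:7860" | head -n 1
```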

Verify node scale-in

To test automatic scale-in, delete the Deployment so the GPU node goes idle.

  1. Delete the Deployment and the Service.

    kubectl delete deployment stable-diffusion
    kubectl delete service stable-diffusion-svc
  2. After the Defer Scale-in For period elapses (3 minutes by default in Auto Mode), the autoscaler removes the idle node. Query the node by its stored name to confirm.

    kubectl get node $NODE_NAME

    The expected output confirms the node has been released.

    Error from server (NotFound): nodes "<nodeName>" not found
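
Instead of polling manually, you can block until the Node object is deleted. The 10-minute timeout here is an arbitrary choice; adjust it to your scale-in settings.

```shell
# Exits successfully once the node has been removed from the cluster
kubectl wait --for=delete node/$NODE_NAME --timeout=10m
```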