
Container Service for Kubernetes:Deploy and run GPU workloads

Last Updated: Mar 26, 2026

GPU nodes are expensive to keep running around the clock. With Auto Mode enabled on an ACK cluster, managed node pools provision GPU nodes only when workloads need them and release those nodes when they go idle — so you pay only for the GPU time your applications actually use. This is especially cost-effective for workloads with fluctuating demand, such as online inference.

Prerequisites

Before you begin, ensure that you have:

  • An ACK cluster with Auto Mode enabled.

  • A kubectl client configured to connect to the cluster.

Step 1: Create a managed node pool with GPU instances

Create a dedicated node pool for GPU workloads so GPU nodes stay isolated from general-purpose workloads. When a GPU workload is submitted, Auto Mode automatically provisions the required GPU nodes. When those nodes go idle and meet the scale-in criteria, Auto Mode releases them automatically.

  1. On the ACK Clusters page, click the name of your cluster. In the left navigation pane, choose Nodes > Node Pools.

  2. Click Create Node Pool and configure the following parameters. For all available options, see Create a node pool.

    • Configure Managed Node Pool: Select Auto Mode.

    • vSwitch: Select two or more vSwitches in different zones for high availability. During scaling, nodes are provisioned in the zones of the selected vSwitches based on the Scaling Policy.

    • Instance-related configurations: Set Instance Configuration Mode to Specify Instance Type. Set Architecture to GPU-accelerated. For Instance Type, select one or more GPU instance types, such as ecs.gn7i-c8g1.2xlarge (NVIDIA A10). Selecting multiple instance types improves the success rate of scale-out.

    • Taints: Add a taint to prevent non-GPU workloads from being scheduled onto these GPU nodes: Key nvidia.com/gpu, Value true, Effect NoSchedule.
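
Once the node pool is created, Auto Mode applies this taint to every node it provisions; you do not set it on individual nodes yourself. For reference only, the taint as it appears in a provisioned Node's spec looks like this:

```yaml
# Excerpt of a Node object from this pool (reference only, not applied manually)
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```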

Step 2: Configure GPU resource requests and taint tolerations

For the Pod to land on the GPU node pool and trigger auto-provisioning, declare its GPU requirements and taint toleration in the YAML manifest.

GPU resource request — in the container's resources field:

# ...
spec:
  containers:
  - name: gpu-automode
    resources:
      limits:
        nvidia.com/gpu: 1   # Request 1 GPU
# ...

Taint toleration — allows the Pod to be scheduled onto nodes with the nvidia.com/gpu taint:

# ...
spec:
  tolerations:
  - key: "nvidia.com/gpu"        # Matches the taint key set on the node pool
    operator: "Equal"
    value: "true"                # Matches the taint value
    effect: "NoSchedule"         # Matches the taint effect
# ...
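
Putting the two pieces together, a minimal Pod that triggers GPU node provisioning might look like the following sketch. The Pod name and image are placeholders; replace them with your own.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-automode   # placeholder name
spec:
  containers:
  - name: gpu-automode
    image: <your-gpu-image>   # placeholder: replace with your image
    resources:
      limits:
        nvidia.com/gpu: 1     # request 1 GPU; triggers scale-out in the GPU node pool
  tolerations:
  - key: "nvidia.com/gpu"     # matches the node pool taint
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```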

Step 3: Deploy the GPU workload and verify autoscaling

The following example uses a Stable Diffusion Web UI application to demonstrate the full deployment and autoscaling flow.

Deploy the workload

  1. Create a file named stable-diffusion.yaml with the following content. The manifest contains two parts:

    • Deployment: Runs the Stable Diffusion workload. The Pod requests one NVIDIA GPU and declares the corresponding taint toleration.

    • Service: Exposes the workload on port 7860 through the public IP address of a LoadBalancer Service.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: stable-diffusion
      name: stable-diffusion
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: stable-diffusion
      template:
        metadata:
          labels:
            app: stable-diffusion
        spec:
          containers:
          - args:
            - --listen
            command:
            - python3
            - launch.py
            image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu
            imagePullPolicy: IfNotPresent
            name: stable-diffusion
            ports:
            - containerPort: 7860
              protocol: TCP
            readinessProbe:
              tcpSocket:
                port: 7860
            resources:
              limits:
                nvidia.com/gpu: 1   # Request 1 GPU
              requests:
                cpu: "6"
                memory: 12Gi
          # Allow the Pod to be scheduled onto the GPU node pool
          tolerations:
          - key: "nvidia.com/gpu"
            operator: "Equal"
            value: "true"
            effect: "NoSchedule"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      annotations:
        # Expose the Service over the Internet
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: internet
        service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
      name: stable-diffusion-svc
      namespace: default
    spec:
      externalTrafficPolicy: Local
      ports:
      - port: 7860
        protocol: TCP
        targetPort: 7860
      selector:
        app: stable-diffusion   # Routes traffic to Pods with this label
      type: LoadBalancer
  2. Apply the manifest.

    kubectl apply -f stable-diffusion.yaml

Verify node auto-provisioning

Right after deployment, the Pod enters a Pending state because no GPU nodes exist yet. Auto Mode detects the unschedulable Pod and provisions one.

  1. Check the Pod status.

    kubectl get pod -l app=stable-diffusion
  2. Inspect the Pod events to confirm that provisioning was triggered.

    kubectl describe pod -l app=stable-diffusion

    In the Events section, a FailedScheduling warning appears first, followed by a ProvisionNode event, indicating that Auto Mode has started provisioning a GPU node.

    Events:
      Type     Reason            Age   From               Message
      ----     ------            ----  ----               -------
      Warning  FailedScheduling  15m   default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 Insufficient cpu, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
      Normal   ProvisionNode     16m   GOATScaler         Provision node asa-2ze2h0f4m5ctpd8kn4f1 in Zone: cn-beijing-k with InstanceType: ecs.gn7i-c8g1.2xlarge, Triggered time 2025-11-19 02:58:01.096
      Normal   AllocIPSucceed    12m   terway-daemon      Alloc IP 10.XX.XX.141/16 took 4.764400743s
      Normal   Pulling           12m   kubelet            Pulling image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu"
      Normal   Pulled            3m48s kubelet            Successfully pulled image "yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion:v1.0.0-gpu" in 8m47.675s (8m47.675s including waiting). Image size: 11421866941 bytes.
      Normal   Created           3m42s kubelet            Created container: stable-diffusion
      Normal   Started           3m24s kubelet            Started container stable-diffusion
  3. Confirm the provisioned node is Ready.

    # Store the node name where the Pod is running
    NODE_NAME=$(kubectl get pod -l app=stable-diffusion -o jsonpath='{.items[0].spec.nodeName}')
    
    # Print the node name
    echo "Stable Diffusion is running on node: $NODE_NAME"
    
    # Verify the node is in Ready state
    kubectl get node $NODE_NAME

Access the Stable Diffusion UI

Wait a few minutes for the node to join the cluster and the Pod to start, then access the application.

  1. Get the public IP address of the Service.

    kubectl get svc stable-diffusion-svc

    Find the EXTERNAL-IP in the output.

    NAME                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
    stable-diffusion-svc   LoadBalancer   192.XXX.XX.196   8.XXX.XX.68   7860:31302/TCP   18m
  2. Open a browser and navigate to http://<EXTERNAL-IP>:7860. If the Stable Diffusion Web UI loads, the workload is running on the GPU node.
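
As an optional command-line check (assuming kubectl can reach your cluster and the Service has finished provisioning its public IP), you can fetch the external IP and probe the port instead of opening a browser:

```shell
# Read the Service's public IP from its status
EXTERNAL_IP=$(kubectl get svc stable-diffusion-svc \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Probe the Web UI port; an HTTP status line in the output indicates the app is up
curl -sI --max-time 10 "http://${EXTERNAL_IP}:7860" | head -n 1
```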

Verify node scale-in

To test automatic scale-in, delete the Deployment so the GPU node goes idle.

  1. Delete the Deployment and the Service.

    kubectl delete deployment stable-diffusion
    kubectl delete service stable-diffusion-svc
  2. After the Defer Scale-in For period elapses (3 minutes by default in Auto Mode), the autoscaler removes the idle node. Query the node by its stored name to confirm.

    kubectl get node $NODE_NAME

    The expected output confirms the node has been released.

    Error from server (NotFound): nodes "<nodeName>" not found
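
Instead of polling manually, you can block until the Node object is deleted. The 10-minute timeout here is an arbitrary choice; adjust it to your scale-in settings.

```shell
# Exits successfully once the node has been removed from the cluster
kubectl wait --for=delete node/$NODE_NAME --timeout=10m
```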