All Products
Search
Document Center

Container Service for Kubernetes:Multi-Cluster HPA Practice Using vLLM Custom Metrics

Last Updated:Mar 26, 2026

LLM online services that handle unpredictable traffic spikes benefit from a multi-cluster architecture. This tutorial shows how to deploy a vLLM inference service using an ACK One fleet and scale it across clusters with FederatedHPA (Federated Horizontal Pod Autoscaler).

How it works

In large-scale inference scenarios, LLM online services face sharp and unpredictable traffic spikes. A multi-cluster architecture is a common solution for handling peak traffic:

  • Users with on-premises data centers typically use a hybrid cloud architecture. During traffic peaks, they scale out workloads to cloud clusters.

  • Cloud users often deploy multiple clusters across different regions to avoid resource shortages in a single region.

The ACK One fleet manages resources across multiple clusters and supports cross-cluster scheduling and Auto Scaling. It supports both deployment models through three mechanisms:

  • Multi-cluster priority-based scheduling: Set scheduling priorities for clusters. Replicas go to high-priority clusters first. When resources run low, the fleet scales out to low-priority clusters and scales in by removing replicas from low-priority clusters first.

  • Inventory-aware intelligent scheduling: The fleet cluster works with GoatScaler in child clusters and uses ECS inventory to schedule replicas intelligently.

  • Unified centralized Auto Scaling: Create a FederatedHPA to scale workloads across clusters from a central point. The fleet cluster's Metrics Adapter collects and aggregates metrics from child clusters (which also support Prometheus Adapter), then scales workloads based on the aggregated metrics.

image.jpeg

This tutorial demonstrates the following scenario:

  • cluster1 is the high-priority cluster; cluster2 is the low-priority cluster.

  • When traffic increases, FederatedHPA triggers a scale-out event.

  • New replicas are scheduled to cluster2 based on available capacity.

  • After traffic subsides, FederatedHPA scales the deployment back to the minimum replica count.

Prerequisites

Before you begin, make sure you have:

Important

The multi-cluster Horizontal Pod Autoscaler (HPA) feature is in invitational preview. Contact us to apply for whitelist access before you use it.

Step 1: Configure metric collection in child clusters

Configure the ack-alibaba-cloud-metrics-adapter component in both ACK clusters to expose vLLM custom metrics.

  1. Log on to the Container Service Management Console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Helm.

  3. Locate ack-alibaba-cloud-metrics-adapter and click Update in the Actions column. In the prometheus.adapter.rules field, add the following metric collection rules:

    • vllm:num_requests_waiting: Number of requests waiting to be processed, exposed as num_requests_waiting

    • vllm:num_requests_running: Number of requests currently being processed, exposed as num_requests_running

    Variable Description
    <<.Series>> The metric series name matched by seriesQuery
    <<.LabelMatchers>> Label filters generated from the resource overrides (namespace, Pod name)
    <<.GroupBy>> The grouping labels for aggregation
    rules:
      - seriesQuery: 'vllm:num_requests_waiting'
        resources:
          overrides:
            kubernetes_namespace: {resource: "namespace"}
            kubernetes_pod_name: {resource: "pod"}
        name:
          matches: 'vllm:num_requests_waiting'
          as: 'num_requests_waiting'
        metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
      - seriesQuery: 'vllm:num_requests_running'
        resources:
          overrides:
            kubernetes_namespace: {resource: "namespace"}
            kubernetes_pod_name: {resource: "pod"}
        name:
          matches: 'vllm:num_requests_running'
          as: 'num_requests_running'
        metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

    The template variables in metricsQuery work as follows:

  4. Click OK.

  5. Verify that the custom metrics are registered correctly:

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/test/pods/qwen3-xxxxx/num_requests_waiting"
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/test/pods/qwen3-xxxxx/num_requests_running"

    If the output includes vLLM metric values, the configuration succeeded.

Step 2: Deploy the vLLM inference service in the fleet

Save the following YAML as deployment.yaml and run kubectl apply -f deployment.yaml.

The Deployment uses the Qwen/Qwen3-0.6B model downloaded from ModelScope, served on port 8000. Prometheus scrape annotations on the Pod enable automatic metric collection without a separate PodMonitor.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3
  namespace: test
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      annotations:
        # Enables Prometheus scraping — equivalent to a PodMonitor
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: qwen3
    spec:
      containers:
      # Serves Qwen3-0.6B downloaded from ModelScope
      - command:
        - sh
        - -c
        - export VLLM_USE_MODELSCOPE=True; vllm serve Qwen/Qwen3-0.6B --served-model-name
          qwen3-0.6b --port 8000 --trust-remote-code --tensor_parallel_size=1 --max-model-len
          2048 --gpu-memory-utilization 0.8
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-openai:v0.9.1
        imagePullPolicy: IfNotPresent
        name: vllm
        ports:
        - containerPort: 8000
          name: restful
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 8000
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Step 3: Configure multi-cluster distribution policy in the fleet

Save the following YAML as propagationpolicy.yaml and run kubectl apply -f propagationpolicy.yaml.

Key fields:

  • autoScaling.ecsProvision: Enables inventory-aware intelligent scheduling. The fleet provisions ECS instances automatically when existing GPU instances are insufficient.

  • clusterAffinities: Defines priority-based cluster groups. The scheduler fills groups in order—high-priority first, then low-priority when the high-priority group runs out of capacity.

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: vllm-deploy-pp
  namespace: test
spec:
  autoScaling:
    ecsProvision: true
  placement:
    clusterAffinities:
    - affinityName: high-priority
      clusterNames:
      - ${cluster1_id}
    - affinityName: low-priority
      clusterNames:
      - ${cluster2_id}
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        dynamicWeight: AvailableReplicas
  preserveResourcesOnDeletion: false
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    namespace: test
  - apiVersion: v1
    kind: Service
    namespace: test
  schedulerName: default-scheduler

Step 4: Create a FederatedHPA in the fleet

FederatedHPA monitors CPU, memory, custom metrics, and external metrics for Pods. This example uses the vLLM custom metrics num_requests_waiting and num_requests_running to trigger scaling.

Save the following YAML as federatedhpa.yaml and run kubectl apply -f federatedhpa.yaml.

apiVersion: autoscaling.one.alibabacloud.com/v1alpha1
kind: FederatedHPA
metadata:
  name: vllm-fhpa
  namespace: test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen3
  minReplicas: 1
  maxReplicas: 10
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 30
    scaleUp:
      stabilizationWindowSeconds: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: num_requests_waiting
        target:
          type: AverageValue
          averageValue: ${waiting_size} # Optimal value varies by GPU and model
    - type: Pods
      pods:
        metric:
          name: num_requests_running
        target:
          type: AverageValue
          averageValue: ${running_size} # Optimal value varies by GPU and model

Key fields in the metrics block:

Field Description
type: Pods Aggregates the metric across all Pods in the target Deployment and computes a per-Pod average
target.type: AverageValue Scales when the per-Pod average exceeds averageValue
averageValue The target threshold per Pod. The optimal value varies by GPU and model
behavior.scaleDown.stabilizationWindowSeconds How long FederatedHPA waits before scaling down after metrics drop. Set to 30 seconds in this example
behavior.scaleUp.stabilizationWindowSeconds How long FederatedHPA waits before scaling up after metrics rise. Set to 10 seconds in this example

Verify results

Verify priority-based scheduling

Scale out the workload manually:

kubectl scale deployment -n test qwen3 --replicas=2

Because cluster1 has only one GPU instance, the second replica is scheduled to cluster2 in the low-priority group.

Verify FederatedHPA

  1. Expose the service using a multi-cluster ALB, as described in Manage north-south traffic, and get the Ingress address.

  2. Run a load test against the service (replace the ALB address):

    hey -n 600 -c 60 -m POST -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Test it"}]}' \
      http://alb-xxxxxx.cn-hangzhou.alb.aliyuncsslb.com:8000/v1/chat/completions
  3. Check that num_requests_running rises and the replica count increases:

    Current Metrics:
      Pods:
        Current:
          Average Value:  0
        Metric:
          Name:  num_requests_waiting
        Type:      Pods
      Pods:
        Current:
          Average Value:  58
        Metric:
          Name:        num_requests_running
        Type:            Pods
    Current Replicas:  2
    Desired Replicas:  3
  4. After the load test ends, verify scale-down. The FederatedHPA is configured with a 30-second scale-down stabilization window. Wait about one minute, then run:

    kubectl get fhpa -n test vllm-fhpa

    The desired replica count should return to the minimum (1) once num_requests_running drops below the threshold. A healthy output looks similar to:

    NAME        REFERENCE          MINPODS   MAXPODS   REPLICAS   AGE
    vllm-fhpa   Deployment/qwen3   1         10        1          10m

Verify inventory-aware intelligent scheduling

After scaling the inference application to three replicas, the new replica starts successfully—even though cluster1 and cluster2 together have only two GPU instances. This confirms that inventory-aware intelligent scheduling provisioned the additional ECS instance automatically.

FAQ

kubectl get fhpa shows empty values in the REPLICAS column. Why?

The FederatedHPA did not match the target workload. Check the workload name and namespace in spec.scaleTargetRef.

kubectl get fhpa -o yaml returns a replica count error. Why?

The condition error reads the HPA was unable to compute the replica count: unable to obtain metric xxx. FederatedHPA cannot fetch metrics from child clusters.

  1. Confirm that ack-alibaba-cloud-metrics-adapter is installed in all child clusters.

  2. Confirm that the component parameters are correct in each child cluster. In the child cluster dashboard, run:

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

    A healthy response lists all registered custom metrics. If the output is empty, review the Helm application parameter settings and check whether the metrics rules match the actual vLLM metric names exposed by your Pods.

  3. Check the FederatedHPA status conditions directly:

    kubectl describe fhpa -n test vllm-fhpa

    Look for the ScalingActive condition in the output:

    Conditions:
      Type            Status  Reason            Message
      ----            ------  ------            -------
      AbleToScale     True    ReadyForNewScale  the last scale time was sufficiently old...
      ScalingActive   False   FailedGetScale    the HPA was unable to compute the replica count...

    ScalingActive: False with reason FailedGetScale or InvalidSelector indicates a metrics fetch failure or target mismatch. Review the adapter configuration and confirm that spec.scaleTargetRef points to the correct Deployment name and namespace.

Why can't I see metrics in child clusters?

Run the following command to check whether custom metrics are registered:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

If no metrics appear, review the Helm application parameter settings for ack-alibaba-cloud-metrics-adapter.

What's next