Container Service for Kubernetes: Use multi-cluster HPA based on vLLM custom metrics

Last Updated: Jan 09, 2026

Online services for large language models (LLMs) often require a multi-cluster architecture to manage significant and unpredictable traffic fluctuations. The multi-cluster solution provided by ACK One is ideal for this use case. This topic describes how to deploy a vLLM inference service on an ACK One Fleet and implement cross-cluster auto scaling by using a FederatedHPA, a multi-cluster Horizontal Pod Autoscaler (HPA).

How it works

A multi-cluster architecture is a common solution for managing the significant and unpredictable traffic fluctuations of large-scale online LLM services:

  • Users with on-premises data centers: Typically adopt a hybrid cloud architecture, using cloud-based clusters to scale out during peak business hours.

  • Users on the cloud: Tend to deploy multiple clusters in different regions to mitigate the risk of resource shortages in a single region.

ACK One Fleet addresses the requirements of both deployment models by providing unified resource management and cross-cluster capabilities, such as scheduling and auto scaling. The key features include:

  • Multi-cluster priority-based scheduling: Allows you to configure scheduling priorities for each cluster in the Fleet. Replicas are first scheduled to high-priority clusters. If resources in those clusters are insufficient, replicas are scaled out to low-priority clusters. During a scale-in, replicas in low-priority clusters are removed first.

  • Inventory-aware scheduling: The Fleet controller integrates with the GoatScaler add-on in each member cluster and considers real-time ECS inventory to make intelligent scheduling decisions for replicas.

  • Centralized auto scaling: By creating a FederatedHPA resource, you can manage auto scaling for a workload across all clusters from a central point. The Fleet's Metrics Adapter collects and aggregates metrics from each member cluster (which can also use prometheus-adapter). The workload is then scaled based on the aggregated metrics.
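As an illustrative sketch of this centralized decision (the actual controller logic is internal to the Fleet, and the function below is hypothetical), the FederatedHPA can be thought of as pooling the per-pod metric values reported by every member cluster and applying the standard HPA AverageValue formula to the pooled set:

```python
import math

def federated_desired(clusters, target_average):
    """Pool per-pod metric values from all member clusters and apply the
    standard HPA AverageValue formula:
        desired = ceil(currentReplicas * currentAverage / targetAverage)
    Since currentReplicas * currentAverage equals the pooled total, this
    reduces to ceil(total / targetAverage)."""
    total = sum(v for values in clusters.values() for v in values)
    return math.ceil(total / target_average)

# cluster1 reports two pods, cluster2 one pod (num_requests_running values).
print(federated_desired({"cluster1": [58, 62], "cluster2": [55]}, target_average=40))  # → 5
```

In practice, the minReplicas/maxReplicas bounds and the scale-up/scale-down stabilization windows of the FederatedHPA are applied on top of this raw value.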


Prerequisites

Important

Multi-cluster HPA is in invitational preview. To use this feature, submit a request to be added to the whitelist.

Procedure

Step 1: Configure metric collection in member clusters

Configure the parameters for ack-alibaba-cloud-metrics-adapter in both ACK member clusters.

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Applications > Helm.

  3. Find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column. Modify the metric collection configuration in prometheus.adapter.rules. For example:

    • vllm:num_requests_waiting: The number of requests waiting to be processed.

    • vllm:num_requests_running: The number of requests currently being processed.

    rules:
      - seriesQuery: 'vllm:num_requests_waiting'
        resources:
          overrides:
            kubernetes_namespace: {resource: "namespace"}
            kubernetes_pod_name: {resource: "pod"}
        name:
          matches: 'vllm:num_requests_waiting'
          as: 'num_requests_waiting'
        metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
      - seriesQuery: 'vllm:num_requests_running'
        resources:
          overrides:
            kubernetes_namespace: {resource: "namespace"}
            kubernetes_pod_name: {resource: "pod"}
        name:
          matches: 'vllm:num_requests_running'
          as: 'num_requests_running'
        metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
  4. After making your changes, click OK.

  5. Verify that the custom metrics are configured correctly.

    # View custom metrics
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/test/pods/qwen3-xxxxx/num_requests_waiting"
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/test/pods/qwen3-xxxxx/num_requests_running"

    The expected output includes the vLLM metrics, confirming that metric collection is configured correctly.
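For reference, a successful query returns a MetricValueList similar to the following (the pod name, timestamp, and value are illustrative):

```json
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "test",
        "name": "qwen3-xxxxx",
        "apiVersion": "/v1"
      },
      "metricName": "num_requests_waiting",
      "timestamp": "2026-01-09T08:00:00Z",
      "value": "0"
    }
  ]
}
```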

Step 2: Create a vLLM inference service in the Fleet

Save the following YAML template as deployment.yaml. Then, run the kubectl apply -f deployment.yaml command to deploy the inference service in the Fleet.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3
  namespace: test
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      annotations:
        # Scrapes pod metrics, similar to a PodMonitor.
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: qwen3
    spec:
      containers:
      # Use the qwen3-0.6b model downloaded from ModelScope.
      - command:
        - sh
        - -c
        - export VLLM_USE_MODELSCOPE=True; vllm serve Qwen/Qwen3-0.6B --served-model-name
          qwen3-0.6b --port 8000 --trust-remote-code --tensor_parallel_size=1 --max-model-len
          2048 --gpu-memory-utilization 0.8
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-openai:v0.9.1
        imagePullPolicy: IfNotPresent
        name: vllm
        ports:
        - containerPort: 8000
          name: restful
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 8000
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Step 3: Configure a propagation policy in the Fleet

Save the following YAML template as propagationpolicy.yaml. Then, run the kubectl apply -f propagationpolicy.yaml command to apply it.

  • autoScaling.ecsProvision: Enables inventory-aware scheduling. ECS instances are automatically scaled as needed.

  • clusterAffinities: Defines priority groups. During scheduling, replicas fill the affinity groups in priority order: higher-priority groups are used first.

apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: vllm-deploy-pp
  namespace: test
spec:
  autoScaling:
    ecsProvision: true
  placement:
    clusterAffinities:
    - affinityName: high-priority
      clusterNames:
      - ${cluster1_id}
    - affinityName: low-priority
      clusterNames:
      - ${cluster2_id}
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        dynamicWeight: AvailableReplicas
  preserveResourcesOnDeletion: false
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    namespace: test
  - apiVersion: v1
    kind: Service
    namespace: test
  schedulerName: default-scheduler
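With `replicaSchedulingType: Divided` and `dynamicWeight: AvailableReplicas`, replicas are split across clusters in proportion to how many replicas each cluster can still accommodate. The following hypothetical sketch illustrates the proportional division only; the real scheduler also accounts for priority groups and ECS inventory:

```python
def divide_replicas(total, available):
    """Split `total` replicas across clusters in proportion to each
    cluster's available capacity, using largest-remainder rounding."""
    capacity = sum(available.values())
    shares = {c: total * a / capacity for c, a in available.items()}
    result = {c: int(s) for c, s in shares.items()}
    remainder = total - sum(result.values())
    # Hand leftover replicas to the clusters with the largest fractional parts.
    for c in sorted(shares, key=lambda c: shares[c] - int(shares[c]), reverse=True)[:remainder]:
        result[c] += 1
    return result

# cluster1 can host 1 more replica, cluster2 can host 2 more.
print(divide_replicas(3, {"cluster1": 1, "cluster2": 2}))  # → {'cluster1': 1, 'cluster2': 2}
```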

Step 4: Create a FederatedHPA in the Fleet

A FederatedHPA can monitor pod CPU, memory, custom metrics, and external metrics. The following example uses the num_requests_waiting and num_requests_running custom metrics for auto scaling.

Save the following YAML template as federatedhpa.yaml. Then, run the kubectl apply -f federatedhpa.yaml command to apply it.

apiVersion: autoscaling.one.alibabacloud.com/v1alpha1
kind: FederatedHPA
metadata:
  name: vllm-fhpa
  namespace: test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen3
  minReplicas: 1
  maxReplicas: 10
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 30
    scaleUp:
      stabilizationWindowSeconds: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: num_requests_waiting
        target:
          type: AverageValue
          averageValue: ${waiting_size} # The optimal value varies depending on the GPU and model.
    - type: Pods
      pods:
        metric:
          name: num_requests_running
        target:
          type: AverageValue
          averageValue: ${running_size} # The optimal value varies depending on the GPU and model.
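When multiple metrics are configured, the HPA evaluates each metric independently and scales to the largest desired replica count. The following hypothetical sketch uses illustrative averages and an assumed target of 40 for num_requests_running:

```python
import math

def desired_from_metric(current_replicas, current_average, target_average):
    # Standard HPA formula for an AverageValue target.
    return math.ceil(current_replicas * current_average / target_average)

def desired_replicas(current_replicas, metrics):
    # Each entry is (currentAverage, targetAverage); the HPA takes the maximum.
    return max(desired_from_metric(current_replicas, cur, tgt) for cur, tgt in metrics)

# 2 replicas; num_requests_waiting averages 0 (target 5),
# num_requests_running averages 58 (target 40).
print(desired_replicas(2, [(0, 5), (58, 40)]))  # → 3
```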

Verify the results

  1. Verify priority-based scheduling

    Run the kubectl scale deployment -ntest qwen3 --replicas=2 command to scale out the workload. Because cluster1 has only one GPU-accelerated instance, the second replica is scheduled to cluster2, which is in the low-priority group.

  2. Verify the FederatedHPA

    1. Use a multi-cluster Application Load Balancer (ALB) to expose the service as described in Manage north-south traffic, and then obtain the Ingress address.

    2. Replace the ALB address in the following command. Then, run the command to load-test the service.

      hey -n 600 -c 60 -m POST -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Run a test"}]}' http://alb-xxxxxx.cn-hangzhou.alb.aliyuncsslb.com:8000/v1/chat/completions

      As the load test runs, num_requests_running increases significantly, and the FederatedHPA scales out the replicas:

        Current Metrics:
          Pods:
            Current:
              Average Value:  0
            Metric:
              Name:  num_requests_waiting
          Type:      Pods
          Pods:
            Current:
              Average Value:  58
            Metric:
              Name:        num_requests_running
          Type:            Pods
        Current Replicas:  2
        Desired Replicas:  3
  3. Verify inventory-aware scheduling

    After the inference application is scaled out to three replicas, the new replica is successfully scheduled and launched even though cluster1 and cluster2 have only two GPU-accelerated instances in total: the Fleet automatically provisions a new ECS instance for it. This confirms that the Fleet's inventory-aware scheduling is functioning correctly.

FAQ

Why is the REPLICAS column empty after I run kubectl get fhpa?

This can happen if the FederatedHPA does not match the target workload. Check the workload name and namespace configuration.

Why do I get an error after I run kubectl get fhpa -o yaml?

When you inspect a FederatedHPA resource, you see an error in the status.conditions field similar to this:

the HPA was unable to compute the replica count: unable to get metric xxx.

This error indicates that the FederatedHPA controller in the Fleet cluster failed to fetch the required metrics for the corresponding workload from one or more of the member clusters.

Follow these steps to diagnose the issue:

  1. Confirm that ack-alibaba-cloud-metrics-adapter is installed in all member clusters.

  2. Confirm that the parameters for ack-alibaba-cloud-metrics-adapter in each member cluster are correctly configured.

  3. A good way to verify the configuration is to try querying the metric directly from the Prometheus dashboard in the affected member cluster. If the query fails there, it indicates a problem with the metrics adapter's configuration or the metric source itself.

Why are the metrics not visible when queried in a member cluster?

Run the following command to check whether the metrics are registered. If the relevant metrics are missing, verify the parameter configuration in the Helm application.

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
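If the metrics are registered, the response lists them as pod-scoped resources. It should include entries similar to the following (illustrative):

```json
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "pods/num_requests_waiting",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": ["get"]
    },
    {
      "name": "pods/num_requests_running",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": ["get"]
    }
  ]
}
```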