Online services for large language models (LLMs) often require a multi-cluster architecture to handle significant and unpredictable traffic fluctuations. The multi-cluster solution provided by ACK One is ideal for this use case. This topic describes how to deploy a vLLM inference service on an ACK One Fleet and implement cross-cluster auto scaling by using a FederatedHPA, the multi-cluster counterpart of the Horizontal Pod Autoscaler (HPA).
How it works
A multi-cluster architecture is a common solution for managing the significant and unpredictable traffic fluctuations of large-scale online LLM services:
Users with on-premises data centers: Typically adopt a hybrid cloud architecture, using cloud-based clusters to scale out during peak business hours.
Users on the cloud: Tend to deploy multiple clusters in different regions to mitigate the risk of resource shortages in a single region.
ACK One Fleet addresses the requirements of both deployment models by providing unified resource management and cross-cluster capabilities, such as scheduling and auto scaling. The key features include:
Multi-cluster priority-based scheduling: Allows you to configure scheduling priorities for each cluster in the Fleet. Replicas are first scheduled to high-priority clusters. If resources in those clusters are insufficient, replicas are scaled out to low-priority clusters. During a scale-in, replicas in low-priority clusters are removed first.
Inventory-aware scheduling: The Fleet controller integrates with the GoatScaler add-on in each member cluster and considers real-time ECS inventory to make intelligent scheduling decisions for replicas.
Centralized auto scaling: By creating a FederatedHPA resource, you can manage auto scaling for a workload across all clusters from a central point. The Fleet's Metrics Adapter collects and aggregates metrics from each member cluster (which can also use prometheus-adapter). The workload is then scaled based on the aggregated metrics.

Prerequisites
You have two Container Service for Kubernetes (ACK) clusters. Each cluster has a GPU node pool and is initialized with one GPU-accelerated instance.
The two ACK clusters are associated with an ACK One Fleet, and have alibaba-cloud-metrics-adapter installed. For registered clusters, you must also install the open-source prometheus-adapter.
You have installed the kubectl amc plugin.
Multi-cluster HPA is in invitational preview. To use this feature, submit a request to be added to the whitelist.
Procedure
Step 1: Configure metric collection in member clusters
Configure the parameters for ack-alibaba-cloud-metrics-adapter in both ACK clusters.
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose .
Find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column. Modify the metric collection configuration in prometheus.adapter.rules to collect the following vLLM metrics:
vllm:num_requests_waiting: The number of requests waiting to be processed.
vllm:num_requests_running: The number of requests currently being processed.
For example:
rules:
- seriesQuery: 'vllm:num_requests_waiting'
  resources:
    overrides:
      kubernetes_namespace: {resource: "namespace"}
      kubernetes_pod_name: {resource: "pod"}
  name:
    matches: 'vllm:num_requests_waiting'
    as: 'num_requests_waiting'
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
- seriesQuery: 'vllm:num_requests_running'
  resources:
    overrides:
      kubernetes_namespace: {resource: "namespace"}
      kubernetes_pod_name: {resource: "pod"}
  name:
    matches: 'vllm:num_requests_running'
    as: 'num_requests_running'
  metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
After making your changes, click OK.
Verify that the custom metrics are configured correctly.
# View custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/test/pods/qwen3-xxxxx/num_requests_waiting"
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/test/pods/qwen3-xxxxx/num_requests_running"
The expected output includes the vLLM metrics, confirming that metric collection is configured correctly.
Step 2: Create a vLLM inference service in the Fleet
Save the following YAML template as deployment.yaml. Then, run the kubectl apply -f deployment.yaml command to deploy the inference service in the Fleet.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3
  namespace: test
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      annotations:
        # Scrapes pod metrics, similar to a PodMonitor.
        prometheus.io/path: /metrics
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
      labels:
        app: qwen3
    spec:
      containers:
      # Use the qwen3-0.6b model downloaded from ModelScope.
      - command:
        - sh
        - -c
        - export VLLM_USE_MODELSCOPE=True; vllm serve Qwen/Qwen3-0.6B --served-model-name qwen3-0.6b --port 8000 --trust-remote-code --tensor_parallel_size=1 --max-model-len 2048 --gpu-memory-utilization 0.8
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm-openai:v0.9.1
        imagePullPolicy: IfNotPresent
        name: vllm
        ports:
        - containerPort: 8000
          name: restful
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 8000
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            nvidia.com/gpu: "1"
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Step 3: Configure a propagation policy in the Fleet
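The propagation policy in this step also selects Service resources in the test namespace, although this guide does not define one. If you expose the vLLM Deployment with a Service, a minimal manifest might look like the following sketch (the Service name and port mapping are assumptions; adjust them to your environment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: qwen3        # Assumed name; align it with your Ingress backend.
  namespace: test
spec:
  selector:
    app: qwen3       # Matches the labels of the vLLM Deployment pods.
  ports:
  - name: restful
    port: 8000       # Port exposed by the vLLM container.
    targetPort: 8000
    protocol: TCP
```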
Save the following YAML template as propagationpolicy.yaml. Then, run the kubectl apply -f propagationpolicy.yaml command to apply it.
autoScaling.ecsProvision: Enables inventory-aware scheduling. ECS instances are automatically scaled as needed.
clusterAffinities: Defines priority groups. During scheduling, affinities are filled in order of priority.
apiVersion: policy.one.alibabacloud.com/v1alpha1
kind: PropagationPolicy
metadata:
  name: vllm-deploy-pp
  namespace: test
spec:
  autoScaling:
    ecsProvision: true
  placement:
    clusterAffinities:
    - affinityName: high-priority
      clusterNames:
      - ${cluster1_id}
    - affinityName: low-priority
      clusterNames:
      - ${cluster2_id}
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        dynamicWeight: AvailableReplicas
  preserveResourcesOnDeletion: false
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    namespace: test
  - apiVersion: v1
    kind: Service
    namespace: test
  schedulerName: default-scheduler

Step 4: Create a FederatedHPA in the Fleet
A FederatedHPA can monitor pod CPU, memory, custom metrics, and external metrics. The following example uses the num_requests_waiting and num_requests_running custom metrics for auto scaling.
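FederatedHPA metric definitions follow the Kubernetes HPA v2 schema. For example, to scale on pod CPU utilization instead of custom metrics, you could replace the metrics section with a Resource metric; this is a sketch, and the 70% target is an assumption you should tune for your workload:

```yaml
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70  # Assumed target; tune for your GPU and model.
```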
Save the following YAML template as federatedhpa.yaml. Then, run the kubectl apply -f federatedhpa.yaml command to apply it.
apiVersion: autoscaling.one.alibabacloud.com/v1alpha1
kind: FederatedHPA
metadata:
  name: vllm-fhpa
  namespace: test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen3
  minReplicas: 1
  maxReplicas: 10
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 30
    scaleUp:
      stabilizationWindowSeconds: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: num_requests_waiting
      target:
        type: AverageValue
        averageValue: ${waiting_size} # The optimal value varies depending on the GPU and model.
  - type: Pods
    pods:
      metric:
        name: num_requests_running
      target:
        type: AverageValue
        averageValue: ${running_size} # The optimal value varies depending on the GPU and model.

Verify the results
Verify priority-based scheduling
Run the kubectl scale deployment -ntest qwen3 --replicas=2 command to scale out the workload. Because cluster1 has only one GPU-accelerated instance, the second replica is scheduled to cluster2, which is in the low-priority group.
Verify the FederatedHPA
Use a multi-cluster Application Load Balancer (ALB) to expose the service as described in Manage north-south traffic, and then obtain the Ingress address.
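For reference, a minimal ALB Ingress for this service might look like the following single-cluster-style sketch. The Service name qwen3 and port 8000 are assumptions, and the multi-cluster setup requires additional configuration described in Manage north-south traffic:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: qwen3-ingress
  namespace: test
spec:
  ingressClassName: alb   # ACK ALB Ingress class.
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: qwen3   # Assumed Service name for the vLLM Deployment.
            port:
              number: 8000
```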
Replace the ALB address in the following command. Then, run the command to load-test the service.
hey -n 600 -c 60 -m POST -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Run a test"}]}' http://alb-xxxxxx.cn-hangzhou.alb.aliyuncsslb.com:8000/v1/chat/completions
You should see num_requests_running increase significantly, and the number of replicas will also scale out:
Current Metrics:
  Pods:
    Current:
      Average Value: 0
    Metric:
      Name: num_requests_waiting
    Type: Pods
  Pods:
    Current:
      Average Value: 58
    Metric:
      Name: num_requests_running
    Type: Pods
Current Replicas: 2
Desired Replicas: 3
Verify inventory-aware scheduling
After the inference application is scaled out to three replicas, the new replica is successfully scheduled and launched, even though cluster1 and cluster2 have only two GPU-accelerated instances in total. Because ecsProvision is enabled in the propagation policy, the Fleet provisions an additional ECS instance to host the third replica. This confirms that the Fleet's inventory-aware scheduling is functioning correctly.
FAQ
Why is the REPLICAS column empty after I run kubectl get fhpa?
This can happen if the FederatedHPA does not match the target workload. Check the workload name and namespace configuration.
Why do I get an error after I run kubectl get fhpa -o yaml?
When you inspect a FederatedHPA resource, you see an error in the status.conditions field similar to this:
the HPA was unable to compute the replica count: unable to get metric xxx.
This error indicates that the FederatedHPA controller in the Fleet cluster failed to fetch the required metrics for the corresponding workload from one or more of the member clusters.
Follow these steps to diagnose the issue:
Confirm that ack-alibaba-cloud-metrics-adapter is installed in all member clusters.
Confirm that the parameters for ack-alibaba-cloud-metrics-adapter in each member cluster are correctly configured. A good way to verify the configuration is to query the metric directly from the Prometheus dashboard in the affected member cluster. If the query fails there, it indicates a problem with the metrics adapter's configuration or the metric source itself.
Why are the metrics not visible when queried in a member cluster?
Run the following command to check whether the metrics are registered. If the relevant metrics are missing, verify the parameter configuration in the Helm application.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .