Configure auto scaling for single-node and multi-node inference - Container Service for Kubernetes

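When you manage LLM inference services, you need to handle highly dynamic load fluctuations during model inference. This topic combines custom metrics exposed by the inference framework with the Kubernetes Horizontal Pod Autoscaler (HPA) to automatically adjust the number of inference service Pods, which improves the quality and stability of the service.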

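Prerequisites

A single-node LLM inference service or a multi-node distributed inference service is deployed.
The Alibaba Cloud Prometheus monitoring component is deployed. For details, see Use Alibaba Cloud Prometheus monitoring.
The ack-alibaba-cloud-metrics-adapter component is deployed, and the AlibabaCloudMetricsAdapter.prometheus.url parameter was set to the endpoint of Alibaba Cloud Prometheus monitoring during deployment. For details, see Modify the ack-alibaba-cloud-metrics-adapter component configuration.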

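Configure metric collection

LLM inference services differ markedly from traditional microservices: a single inference takes far longer, and the resource bottleneck usually lies in GPU compute and GPU memory capacity. However, because of limitations in how GPU utilization and GPU memory usage are currently measured, these two metrics do not accurately reflect node load. Scaling decisions in this topic are therefore based on performance metrics that the inference engine itself exposes, such as request latency and queue depth.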

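Billing

After an LLM inference service is connected to Alibaba Cloud Prometheus monitoring, the related components automatically send its monitoring metrics to the Alibaba Cloud Prometheus service. These metrics are treated as custom metrics.

Custom metrics incur additional fees. The fees vary with factors such as cluster size, number of applications, and data volume. You can use usage queries to monitor and manage your resource usage.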

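Step 1: Collect inference engine metrics

If you have already configured Prometheus monitoring for the inference service by following Configure monitoring for LLM inference services, skip this step.

Create podmonitor.yaml.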


```
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: llm-serving-podmonitor
  namespace: default
  annotations:
    arms.prometheus.io/discovery: "true"
    arms.prometheus.io/resource: "arms"
spec:
  selector:
    matchExpressions:
    - key: alibabacloud.com/inference-workload
      operator: Exists
  namespaceSelector:
    any: true
  podMetricsEndpoints:
  - interval: 15s
    path: /metrics
    port: "http"
    relabelings:
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_name
      targetLabel: pod_name
    - action: replace
      sourceLabels:
      - __meta_kubernetes_namespace
      targetLabel: pod_namespace
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_label_rolebasedgroup_workloads_x_k8s_io_role
      regex: (.+)
      targetLabel: rbg_role
    # Allow to override workload-name with specific label
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_label_alibabacloud_com_inference_workload
      regex: (.+)
      targetLabel: workload_name
    - action: replace
      sourceLabels:
      - __meta_kubernetes_pod_label_alibabacloud_com_inference_backend
      regex: (.+)
      targetLabel: backend
```

Run the following command to create the PodMonitor.
```
kubectl apply -f ./podmonitor.yaml
```
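The PodMonitor scrapes the /metrics path of each matching Pod, which returns metrics in the Prometheus text format. The following is a minimal Python sketch of reading a queue-depth gauge out of such a response, useful as a sanity check on what the engine exposes. The sample text, its label name, and its values are made up for illustration, and the parser assumes sample lines carry no trailing timestamp.

```python
def read_gauge(metrics_text: str, metric_name: str) -> list[float]:
    """Return every sample value for metric_name in Prometheus text format."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like "<name>{labels} <value>" or "<name> <value>".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric_name:
            values.append(float(line.rsplit(" ", 1)[1]))
    return values


# Hypothetical scrape output; the label and the numbers are illustrative only.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="/models/Qwen3-32B"} 11.0
vllm:num_requests_running{model_name="/models/Qwen3-32B"} 20.0
"""

waiting = read_gauge(SAMPLE, "vllm:num_requests_waiting")
print(waiting)  # [11.0]
```

In a real cluster the text would come from the Pod's /metrics endpoint rather than from a string literal.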

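Step 2: Modify the `ack-alibaba-cloud-metrics-adapter` component configuration

Log on to the Container Service management console. In the left-side navigation pane, choose Clusters.
On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Applications > Helm.
On the Helm page, find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column.

In the Update Release panel, configure the following YAML, and then click OK. The metrics in the YAML are examples only; modify them to suit your needs.

For the list of vLLM metrics, see vLLM Metrics. For the list of SGLang metrics, see SGLang metrics. For the list of Dynamo metrics, see Dynamo Metrics.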


```
AlibabaCloudMetricsAdapter:

  prometheus:
    enabled: true    # Set to true to enable the Prometheus adapter.
    # Endpoint of Alibaba Cloud Prometheus monitoring.
    url: http://cn-beijing.arms.aliyuncs.com:9090/api/v1/prometheus/xxxx/xxxx/xxx/cn-beijing
    # If token authentication is enabled for Alibaba Cloud Prometheus, configure the Authorization prometheusHeader.
#    prometheusHeader:
#    - Authorization: xxxxxxx

    adapter:
      rules:
        default: false  # Default metric collection rules. Keeping this set to false is recommended.
        custom:

        # ** Example 1: vLLM **
        # vllm:num_requests_waiting: number of queued requests
        # Run the following command to check that the metric is collected:
        # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"
        - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: { resource: "namespace" }
              pod: { resource: "pod" }
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

        # vllm:num_requests_running: number of requests being processed
        # Run the following command to check that the metric is collected:
        # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_running"
        - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: { resource: "namespace" }
              pod: { resource: "pod" }
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

        # vllm:kv_cache_usage_perc: KV cache usage
        # Run the following command to check that the metric is collected:
        # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:kv_cache_usage_perc"
        - seriesQuery: 'vllm:kv_cache_usage_perc{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: { resource: "namespace" }
              pod: { resource: "pod" }
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

        # ** Example 2: SGLang **
        # sglang:num_queue_reqs: number of queued requests
        # Run the following command to check that the metric is collected:
        # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:num_queue_reqs"
        - seriesQuery: 'sglang:num_queue_reqs{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: { resource: "namespace" }
              pod: { resource: "pod" }
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

        # sglang:num_running_reqs: number of requests being processed
        # Run the following command to check that the metric is collected:
        # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:num_running_reqs"
        - seriesQuery: 'sglang:num_running_reqs{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: { resource: "namespace" }
              pod: { resource: "pod" }
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

        # sglang:token_usage: token usage in the system, which reflects KV cache utilization
        # Run the following command to check that the metric is collected:
        # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:token_usage"
        - seriesQuery: 'sglang:token_usage{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: { resource: "namespace" }
              pod: { resource: "pod" }
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

        # ** Example 3: Dynamo **
        # nv_llm_http_service_inflight_requests: number of requests being processed
        # Run the following command to check that the metric is collected:
        # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/nv_llm_http_service_inflight_requests"
        - seriesQuery: 'nv_llm_http_service_inflight_requests{namespace!="",pod!=""}'
          resources:
            overrides:
              namespace: { resource: "namespace" }
              pod: { resource: "pod" }
          metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```
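Each rule above uses the same metricsQuery template, whose three placeholders the adapter fills in when the HPA asks for a metric. The Python sketch below only imitates that substitution with plain string replacement to show roughly what PromQL results; it is not the adapter's implementation, and the namespace and Pod names are hypothetical.

```python
def render_metrics_query(template, series, label_matchers, group_by):
    """Fill in the three placeholders used by the rules above."""
    return (
        template.replace("<<.Series>>", series)
        .replace("<<.LabelMatchers>>", ",".join(label_matchers))
        .replace("<<.GroupBy>>", ",".join(group_by))
    )


TEMPLATE = "sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)"

query = render_metrics_query(
    TEMPLATE,
    series="vllm:num_requests_waiting",
    # Hypothetical namespace and Pod names, for illustration only.
    label_matchers=['namespace="default"', 'pod=~"vllm-inference-0|vllm-inference-1"'],
    group_by=["pod"],
)
print(query)
# sum(vllm:num_requests_waiting{namespace="default",pod=~"vllm-inference-0|vllm-inference-1"}) by (pod)
```

Grouping by pod yields one value per Pod, which is the shape that a Pods-type HPA metric expects.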

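Configure auto scaling

The parameters in the following scaling policies are for demonstration only. Set the actual values based on your business scenario, balancing resource cost against service SLOs.

Create hpa.yaml from one of the following YAML examples, depending on the inference framework that you use.

vLLM framework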

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: vllm-inference # Replace with the name of your vLLM inference service
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm:num_requests_waiting
      target:
        type: AverageValue # Pods metrics are compared against a per-Pod average
        averageValue: 5
```

SGLang framework

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: sgl-inference # Replace with the name of your SGLang inference service
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: sglang:num_queue_reqs
      target:
        type: AverageValue # Pods metrics are compared against a per-Pod average
        averageValue: 5
```

Run the following command to create the HPA object.

```
kubectl apply -f hpa.yaml
```
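Before load testing, it can help to confirm that the custom metrics API returns per-Pod values for the metric the HPA uses, for example with the kubectl get --raw commands listed in the adapter configuration. The Python sketch below parses a response of that kind. The JSON is a hypothetical, trimmed sample rather than captured output, and parse_quantity handles only plain numbers and the milli ("m") suffix.

```python
import json
from decimal import Decimal


def parse_quantity(quantity: str) -> Decimal:
    """Parse a Kubernetes quantity; only plain numbers and the 'm' suffix are handled."""
    if quantity.endswith("m"):  # milli-units, for example "2500m" is 2.5
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def per_pod_values(response_text: str) -> dict:
    """Map each Pod name in a MetricValueList response to its metric value."""
    body = json.loads(response_text)
    return {
        item["describedObject"]["name"]: parse_quantity(item["value"])
        for item in body["items"]
    }


# Hypothetical, trimmed response; Pod names and values are illustrative only.
SAMPLE_RESPONSE = """
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "items": [
    {"describedObject": {"kind": "Pod", "namespace": "default", "name": "vllm-inference-0"},
     "value": "11"},
    {"describedObject": {"kind": "Pod", "namespace": "default", "name": "vllm-inference-1"},
     "value": "2500m"}
  ]
}
"""

values = per_pod_values(SAMPLE_RESPONSE)
for pod, value in values.items():
    print(pod, value)
```

An empty items list usually means the metric is not yet being collected or the adapter rule does not match it.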

Use a benchmark tool to stress test the service.

For details about the benchmark tools and how to use them, see vLLM Benchmark and SGLang Benchmark.

Create the benchmark.yaml file.

For image, use the vLLM or SGLang container image that the inference service was deployed with. For example:
- vLLM container image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
- SGLang container image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104


```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: llm-benchmark
  name: llm-benchmark
spec:
  selector:
    matchLabels:
      app: llm-benchmark
  template:
    metadata:
      labels:
        app: llm-benchmark
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - command:
        - sh
        - -c
        - sleep inf
        image: # The vLLM or SGLang container image used to deploy the inference service
        imagePullPolicy: IfNotPresent
        name: llm-benchmark
        resources:
          limits:
            cpu: "8"
            memory: 40Gi
          requests:
            cpu: "8"
            memory: 40Gi
        volumeMounts:
        - mountPath: /models/Qwen3-32B
          name: llm-model
      volumes:
      - name: llm-model
        persistentVolumeClaim:
          claimName: llm-model
```

Run the following command to create the benchmark instance.
```
kubectl create -f benchmark.yaml
```

After the instance is running, run the following command inside it to start the stress test:

vLLM framework

```
python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
        --model /models/Qwen3-32B \
        --host inference-service \
        --port 8000 \
        --dataset-name random \
        --random-input-len 1500 \
        --random-output-len 100 \
        --random-range-ratio 1 \
        --num-prompts 400 \
        --max-concurrency 20
```

SGLang framework

```
python3 -m sglang.bench_serving --backend sglang \
        --model /models/Qwen3-32B \
        --host inference-service \
        --port 8000 \
        --dataset-name random \
        --random-input-len 1500 \
        --random-output-len 100 \
        --random-range-ratio 1 \
        --num-prompts 400 \
        --max-concurrency 20
```

During the stress test, open another terminal and run the following command to check the scaling activity of the service.

```
kubectl describe hpa llm-inference-hpa
```

In the expected output, the Events field records a SuccessfulRescale event, which indicates that the HPA scaled the inference service from 1 replica to 3 replicas based on the number of requests waiting in the queue.

```
Name:                                   llm-inference-hpa
Namespace:                              default
Labels:                                 <none>
Annotations:                            <none>
CreationTimestamp:                      Fri, 25 Jul 2025 11:29:20 +0800
Reference:                              StatefulSet/vllm-inference
Metrics:                                ( current / target )
  "vllm:num_requests_waiting" on pods:  11 / 5
Min replicas:                           1
Max replicas:                           3
StatefulSet pods:                       1 current / 3 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    SucceededRescale    the HPA controller was able to update the target scale to 3
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric vllm:num_requests_waiting
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type    Reason             Age   From                       Message
  ----    ------             ----  ----                       -------
  Normal  SuccessfulRescale  1s    horizontal-pod-autoscaler  New size: 3; reason: pods metric vllm:num_requests_waiting above target
```
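The jump from 1 to 3 replicas in the output above is consistent with the replica formula in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The Python sketch below is a simplified version of that calculation: it applies the default 10% tolerance and the min/max bounds, and it ignores not-ready Pods, missing metrics, and stabilization windows, so treat it as an approximation of the controller rather than its actual logic.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA calculation for a per-Pod average metric."""
    ratio = current_avg / target_avg
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured minReplicas and maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


# Values taken from the sample output: 1 replica, 11 waiting requests, target 5.
print(desired_replicas(1, 11, 5, min_replicas=1, max_replicas=3))  # 3
```

With these inputs the ratio is 2.2, so one replica rounds up to three, which also happens to equal maxReplicas.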
