
Container Service for Kubernetes: Configure an Auto Scaling Policy for a PD-Disaggregated Inference Service

Last updated: Dec 26, 2025

In a Prefill-Decode (PD) disaggregated LLM inference architecture, the resource requirements of the Prefill and Decode phases differ greatly, and traditional CPU/GPU utilization metrics cannot effectively guide auto scaling. Using the Dynamo architecture as an example, this topic describes how to use KEDA to configure an independent auto scaling policy for the Prefill role based on the backlog of the NATS message queue, so that resources are allocated on demand and service cost and performance are optimized.

Prerequisites

Limitations

  • The auto scaling solution in this topic applies only to the Prefill role in a PD-disaggregated architecture. Auto scaling for the Decode role requires a separate policy (GPU memory utilization is recommended for Decode).

  • The examples in this topic are based on the Dynamo inference architecture. If you use a different architecture, adjust the related configurations (such as the NATS stream name and consumer name) accordingly.

Procedure

For a PD-disaggregated inference service deployed through RoleBasedGroup (RBG), RBG supports scaling each role independently. This topic uses the Dynamo PD-disaggregated architecture as an example to demonstrate how to use KEDA (Kubernetes Event-driven Autoscaling) to configure a separate auto scaling policy for the Prefill role of a PD-disaggregated inference service.

In Dynamo's PD-disaggregated architecture, pending inference requests are pushed as messages to the dynamo_prefill_queue stream in the NATS message queue. Prefill instances act as consumers and pull messages from this queue at their own pace. The number of pending messages in the queue therefore effectively reflects the load on the Prefill role. The NATS JetStream scaler provided by KEDA can monitor the number of backlogged messages in this queue and trigger auto scaling accordingly, precisely adjusting the number of Prefill instances.

Before you apply this auto scaling policy in a production environment, we strongly recommend that you run thorough stress tests in a test environment to determine the lagThreshold (pending message threshold) and pollingInterval (polling interval) that best fit your workload. An unsuitable configuration may cause scale-out to lag behind demand and degrade performance, or cause over-scaling and waste resources.
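To build intuition for choosing lagThreshold, the following sketch reproduces the core replica calculation that the HPA performs for an external metric with an average-value target: the desired replica count is the total backlog divided by lagThreshold, rounded up and clamped to minReplicaCount/maxReplicaCount. This is a simplified illustration with made-up backlog numbers; the real HPA additionally applies a tolerance band and stabilization windows.

```python
import math

def desired_replicas(pending: int, lag_threshold: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Desired replica count for an external metric with an
    AverageValue target: ceil(total_lag / lagThreshold),
    clamped to [minReplicaCount, maxReplicaCount]."""
    desired = math.ceil(pending / lag_threshold) if pending > 0 else 0
    return max(min_replicas, min(max_replicas, desired))

# Illustrative values matching the demo policy used later in this topic
# (lagThreshold=5, minReplicaCount=1, maxReplicaCount=6):
print(desired_replicas(3, 5, 1, 6))   # light backlog -> stays at 1
print(desired_replicas(23, 5, 1, 6))  # ceil(23/5) = 5 replicas
print(desired_replicas(60, 5, 1, 6))  # capped at maxReplicaCount = 6
```

A smaller lagThreshold scales out earlier and more aggressively; a larger one tolerates a deeper backlog per Prefill instance before adding replicas.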

Step 1: Create a ScalingAdapter for the RBG role

To allow KEDA to independently control the replica count of a specific role in an RBG, enable ScalingAdapter for the target role when you create the RBG. A RoleBasedGroupScalingAdapter resource bound to that role is then created automatically.

  1. Create a file named rbg.yaml. The scalingAdapter: enable: true setting on lines 73-74 enables ScalingAdapter for the prefill role of the RBG.

    Example YAML:

    apiVersion: workloads.x-k8s.io/v1alpha1
    kind: RoleBasedGroup
    metadata:
      name: dynamo-pd
      namespace: default
    spec:
      roles:
        - name: processor
          replicas: 1
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve graphs.pd_disagg:Frontend -f ./configs/qwen3.yaml
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: #Address of the Dynamo Runtime image built in Step 2
                  name: processor
                  ports:
                    - containerPort: 8000
                      name: health
                      protocol: TCP
                    - containerPort: 9345
                      name: request
                      protocol: TCP
                    - containerPort: 443
                      name: api
                      protocol: TCP
                    - containerPort: 9347
                      name: metrics
                      protocol: TCP
                  readinessProbe:
                    initialDelaySeconds: 30
                    periodSeconds: 30
                    tcpSocket:
                      port: 8000
                  resources:
                    limits:
                      cpu: "8"
                      memory: 12Gi
                    requests:
                      cpu: "8"
                      memory: 12Gi
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
                    - mountPath: /workspace/examples/llm/graphs/pd_disagg.py
                      name: dynamo-configs
                      subPath: pd_disagg.py
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
        - name: prefill
          replicas: 2
          scalingAdapter:
            enable: true
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve components.prefill_worker:PrefillWorker -f ./configs/qwen3.yaml
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: #Address of the Dynamo Runtime image built in Step 2
                  name: prefill-worker
                  resources:
                    limits:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                    requests:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
        - name: decoder
          replicas: 1
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve components.worker:VllmWorker -f ./configs/qwen3.yaml --service-name VllmWorker
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: #Address of the Dynamo Runtime image built in Step 2
                  name: vllm-worker
                  resources:
                    limits:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                    requests:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: dynamo-service
    spec:
      type: ClusterIP
      ports:
        - port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        rolebasedgroup.workloads.x-k8s.io/name: dynamo-pd
        rolebasedgroup.workloads.x-k8s.io/role: processor
  2. Run the following command to create the resources.

    kubectl apply -f rbg.yaml
  3. When the RBG is created, the system automatically creates a custom resource of kind RoleBasedGroupScalingAdapter for each role that has ScalingAdapter enabled and binds it to that role. The RoleBasedGroupScalingAdapter implements the Scale subresource for the bound role.

    • Run the following command to view the RoleBasedGroupScalingAdapter that was automatically created for the role.

      kubectl get rolebasedgroupscalingadapter

      Expected output:

      NAME                  PHASE   REPLICAS
      dynamo-pd-prefill     Bound   2
    • Run the following command to check the status of the dynamo-pd-prefill ScalingAdapter.

      kubectl describe rolebasedgroupscalingadapter dynamo-pd-prefill

      In the expected output, Status.Phase should be Bound, which indicates that the ScalingAdapter is bound to the prefill role of the RBG.

      Name:         dynamo-pd-prefill
      Namespace:    default
      Labels:       <none>
      Annotations:  <none>
      API Version:  workloads.x-k8s.io/v1alpha1
      Kind:         RoleBasedGroupScalingAdapter
      Metadata:
        Creation Timestamp:  2025-07-25T06:10:37Z
        Generation:          2
        Owner References:
          API Version:           workloads.x-k8s.io/v1alpha1
          Block Owner Deletion:  true
          Kind:                  RoleBasedGroup
          Name:                  dynamo-pd
          UID:                   5dd61668-79f3-4197-a5db-b778ce460270
        Resource Version:        1157485
        UID:                     edbb8373-2b9c-4ad1-8b6b-d5dfff71e769
      Spec:
        Replicas:  2
        Scale Target Ref:
          Name:  dynamo-pd
          Role:  prefill
      Status:
        Phase:     Bound
        Replicas:  2
        Selector:  rolebasedgroup.workloads.x-k8s.io/name=dynamo-pd,rolebasedgroup.workloads.x-k8s.io/role=prefill
      Events:
        Type    Reason           Age   From                          Message
        ----    ------           ----  ----                          -------
        Normal  SuccessfulBound  25s   RoleBasedGroupScalingAdapter  Succeed to find scale target role [prefill] of rbg [dynamo-pd]

Step 2: Create a KEDA ScaledObject to monitor the message queue

Create a ScaledObject resource to define the scaling rules and associate it with the RoleBasedGroupScalingAdapter created in the previous step.

  1. Create a file named scaledobject.yaml with the following content. The configuration sets the scale target to the dynamo-pd-prefill ScalingAdapter and defines a trigger based on the number of backlogged messages in the NATS message queue.

    The parameter values in the following scaling policy are for demonstration only. Adjust them based on your actual workload.
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: dynamo-prefill-scaledobject
    spec:
      pollingInterval: 30 # For demo. Default: 30 seconds
      minReplicaCount: 1 # For demo. Default: 0
      maxReplicaCount: 6 # For demo. Default: 100
      scaleTargetRef:
        apiVersion: workloads.x-k8s.io/v1alpha1
        kind: RoleBasedGroupScalingAdapter
        name: dynamo-pd-prefill #Sets the scale target to the Prefill role of the RoleBasedGroup
      triggers:
      - type: nats-jetstream
        metadata:
          natsServerMonitoringEndpoint: "nats.default.svc.cluster.local:8222" #NATS monitoring endpoint
          account: "$G" #Default value when no NATS account is configured
          stream: "dynamo_prefill_queue" #Name of the PrefillQueue in Dynamo
          consumer: "worker-group" #Durable consumer name in Dynamo
          lagThreshold: "5" #Scaling threshold for the number of pending messages in the stream
          useHttps: "false" #Whether to use HTTPS
  2. Run the following command to create the resource.

    kubectl apply -f scaledobject.yaml
  3. Run the following command to check the status of the KEDA ScaledObject resource.

    kubectl describe so dynamo-prefill-scaledobject

    Expected output:

    Name:         dynamo-prefill-scaledobject
    Namespace:    default
    Labels:       scaledobject.keda.sh/name=dynamo-prefill-scaledobject
    Annotations:  <none>
    API Version:  keda.sh/v1alpha1
    Kind:         ScaledObject
    Metadata:
      ...
    Spec:
      Cooldown Period:    300
      Max Replica Count:  6
      Min Replica Count:  1
      Polling Interval:   30
      Scale Target Ref:
        API Version:  workloads.x-k8s.io/v1alpha1
        Kind:         RoleBasedGroupScalingAdapter
        Name:         dynamo-pd-prefill
      Triggers:
        Metadata:
          Account:                          $G
          Consumer:                         worker-group
          Lag Threshold:                    5
          Nats Server Monitoring Endpoint:  nats.default.svc.cluster.local:8222
          Stream:                           dynamo_prefill_queue
          Use Https:                        false
        Type:                               nats-jetstream
    Status:
      Conditions:
        Message:  ScaledObject is defined correctly and is ready for scaling
        Reason:   ScaledObjectReady
        Status:   True
        Type:     Ready
        Message:  Scaling is not performed because triggers are not active
        Reason:   ScalerNotActive
        Status:   False
        Type:     Active
        Status:   Unknown
        Type:     Fallback
      External Metric Names:
        s0-nats-jetstream-dynamo_prefill_queue
      Hpa Name:                keda-hpa-dynamo-prefill-scaledobject
      Original Replica Count:  1
      Scale Target GVKR:
        Group:            workloads.x-k8s.io
        Kind:             RoleBasedGroupScalingAdapter
        Resource:         rolebasedgroupscalingadapters
        Version:          v1alpha1
      Scale Target Kind:  workloads.x-k8s.io/v1alpha1.RoleBasedGroupScalingAdapter
    Events:
      Type    Reason              Age   From           Message
      ----    ------              ----  ----           -------
      Normal  KEDAScalersStarted  3s    keda-operator  Started scalers watch
      Normal  ScaledObjectReady   3s    keda-operator  ScaledObject is ready for scaling

    In the expected output, the Ready condition in Status.Conditions should be True.

    KEDA also automatically creates an HPA resource whose name is recorded in the Status.HpaName field. Run the following command to view it.

    kubectl get hpa keda-hpa-dynamo-prefill-scaledobject
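The nats-jetstream scaler evaluates the consumer's pending-message count exposed by the NATS HTTP monitoring endpoint (port 8222, the endpoint configured in natsServerMonitoringEndpoint above). As an illustration of the value KEDA reads, the following sketch extracts num_pending for a given stream and consumer from a /jsz?consumers=true response. The JSON payload here is a hand-written, simplified sample for demonstration, not real server output.

```python
import json

def consumer_pending(jsz: dict, stream: str, consumer: str) -> int:
    """Walk a /jsz?consumers=true response and return num_pending
    for the given stream/consumer pair (0 if not found)."""
    for account in jsz.get("account_details", []):
        for sd in account.get("stream_detail", []):
            if sd.get("name") != stream:
                continue
            for cd in sd.get("consumer_detail", []):
                if cd.get("name") == consumer:
                    return cd.get("num_pending", 0)
    return 0

# Simplified, hand-written sample of a /jsz?consumers=true response.
sample = json.loads("""
{
  "account_details": [{
    "name": "$G",
    "stream_detail": [{
      "name": "dynamo_prefill_queue",
      "consumer_detail": [{"name": "worker-group", "num_pending": 12}]
    }]
  }]
}
""")
print(consumer_pending(sample, "dynamo_prefill_queue", "worker-group"))  # 12
```

With lagThreshold set to 5, a num_pending of 12 would drive the HPA toward 3 Prefill replicas.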

Step 3: (Optional) Run a stress test and verify the scaling behavior

  1. Create a service instance for the stress test and use the benchmark tool to stress test the service.

    For details about the benchmark tool and how to use it, see vLLM Benchmark.
    1. Create a file named benchmark.yaml.

      Example YAML:

      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        labels:
          app: llm-benchmark
        name: llm-benchmark
      spec:
        selector:
          matchLabels:
            app: llm-benchmark
        template:
          metadata:
            labels:
              app: llm-benchmark
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            containers:
            - command:
              - sh
              - -c
              - sleep inf
              image: #Dynamo container image used to deploy the inference service
              imagePullPolicy: IfNotPresent
              name: llm-benchmark
              resources:
                limits:
                  cpu: "8"
                  memory: 40Gi
                requests:
                  cpu: "8"
                  memory: 40Gi
              volumeMounts:
              - mountPath: /models/Qwen3-32B
                name: llm-model
            volumes:
            - name: llm-model
              persistentVolumeClaim:
                claimName: llm-model
    2. Run the following command to create the stress test instance.

      kubectl create -f benchmark.yaml
    3. After the instance is up and running, run the following command in the instance to start the stress test:

      python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
              --backend openai-chat \
              --model /models/Qwen3-32B/ \
              --served-model-name qwen \
              --trust-remote-code \
              --dataset-name random \
              --random-input-len 1500 \
              --random-output-len 100 \
              --num-prompts 320 \
              --max-concurrency 32 \
              --host dynamo-service \
              --port 8000 \
              --endpoint /v1/chat/completions 
  2. During the stress test, open a new terminal and run the following command to observe the HPA scaling events.

    kubectl describe hpa keda-hpa-dynamo-prefill-scaledobject

    In the expected output, the Events field records SuccessfulRescale events, which indicate that KEDA has triggered scale-out based on the NATS queue backlog.

    Name:                               keda-hpa-dynamo-prefill-scaledobject
    Namespace:                          default
    Reference:                          RoleBasedGroupScalingAdapter/dynamo-pd-prefill
    Min replicas:                       1
    Max replicas:                       6
    RoleBasedGroupScalingAdapter pods:  6 current / 6 desired
    Events:
      Type     Reason             Age                   From                       Message
      ----     ------             ----                  ----                       -------
      Normal  SuccessfulRescale  2m1s  horizontal-pod-autoscaler  New size: 4; reason: external metric s0-nats-jetstream-dynamo_prefill_queue(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: dynamo-prefill-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
      Normal  SuccessfulRescale  106s  horizontal-pod-autoscaler  New size: 6; reason: external metric s0-nats-jetstream-dynamo_prefill_queue(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: dynamo-prefill-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
  3. You can also observe how the replica count of the RoleBasedGroupScalingAdapter changes.

    kubectl describe rolebasedgroupscalingadapter dynamo-pd-prefill

    In the expected output, the values of Spec.Replicas and Status.Replicas increase from the initial value to the scaled-out value (for example, 6).

    Name:         dynamo-pd-prefill
    Namespace:    default
    API Version:  workloads.x-k8s.io/v1alpha1
    Kind:         RoleBasedGroupScalingAdapter
    Metadata:
      Owner References:
        API Version:           workloads.x-k8s.io/v1alpha1
        Block Owner Deletion:  true
        Kind:                  RoleBasedGroup
        Name:                  dynamo-pd
    Spec:
      Replicas:  6
      Scale Target Ref:
        Name:  dynamo-pd
        Role:  prefill
    Status:
      Last Scale Time:  2025-08-04T02:08:10Z
      Phase:            Bound
      Replicas:         6
      Selector:         rolebasedgroup.workloads.x-k8s.io/name=dynamo-pd,rolebasedgroup.workloads.x-k8s.io/role=prefill
    Events:
      Type    Reason           Age    From                          Message
      ----    ------           ----   ----                          -------
      Normal  SuccessfulBound  6m9s   RoleBasedGroupScalingAdapter  Succeed to find scale target role [prefill] of rbg [dynamo-pd]
      Normal  SuccessfulScale  4m40s  RoleBasedGroupScalingAdapter  Succeed to scale target role [prefill] of rbg [dynamo-pd] from 1 to 4 replicas
      Normal  SuccessfulScale  4m25s  RoleBasedGroupScalingAdapter  Succeed to scale target role [prefill] of rbg [dynamo-pd] from 4 to 6 replicas