
Container Service for Kubernetes: Configure an Auto Scaling Policy for a PD-Disaggregated Inference Service

Last updated: Dec 26, 2025

In a Prefill-Decode (PD) disaggregated LLM inference architecture, the resource requirements of the Prefill and Decode phases differ greatly, and traditional CPU/GPU utilization metrics cannot effectively guide auto scaling. Using the Dynamo architecture as an example, this topic describes how to use KEDA to configure an independent auto scaling policy for the Prefill role based on the backlog of the NATS message queue, so that resources are allocated on demand and service cost and performance are optimized.

Prerequisites

Limitations

  • The auto scaling solution in this topic applies only to the Prefill role in a PD-disaggregated architecture. Auto scaling for the Decode role requires a separate policy (GPU memory utilization is recommended for Decode).

  • The examples in this topic are based on the Dynamo inference architecture. If you use a different architecture, adjust the related configurations (such as the NATS stream name and consumer name) accordingly.

Procedure

For a PD-disaggregated inference service deployed through RoleBasedGroup (RBG), RBG supports scaling each role independently. This topic uses the Dynamo PD-disaggregated architecture as an example to demonstrate how to use KEDA (Kubernetes Event-driven Autoscaling) to configure a separate auto scaling policy for the Prefill role of a PD-disaggregated inference service.

In Dynamo's PD-disaggregated architecture, pending inference requests are pushed as messages to the dynamo_prefill_queue stream in the NATS message queue. Prefill instances act as consumers and pull messages from this queue at their own pace. The number of pending messages in the queue therefore effectively reflects the load on the Prefill role. The NATS JetStream scaler provided by KEDA can monitor the number of backlogged messages in this queue and trigger auto scaling accordingly, precisely adjusting the number of Prefill instances.

Before you apply this auto scaling policy in a production environment, we strongly recommend that you run thorough stress tests in a test environment to determine the lagThreshold (pending message threshold) and pollingInterval (polling interval) that best fit your workload. An unsuitable configuration may cause scale-out to lag behind demand and degrade performance, or cause over-scaling and waste resources.
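To build intuition for choosing lagThreshold, the following sketch reproduces the core replica calculation that the HPA performs for an external metric with an average-value target: the desired replica count is the total backlog divided by lagThreshold, rounded up and clamped to minReplicaCount/maxReplicaCount. This is a simplified illustration with made-up backlog numbers; the real HPA additionally applies a tolerance band and stabilization windows.

```python
import math

def desired_replicas(pending: int, lag_threshold: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Desired replica count for an external metric with an
    AverageValue target: ceil(total_lag / lagThreshold),
    clamped to [minReplicaCount, maxReplicaCount]."""
    desired = math.ceil(pending / lag_threshold) if pending > 0 else 0
    return max(min_replicas, min(max_replicas, desired))

# Illustrative values matching the demo policy used later in this topic
# (lagThreshold=5, minReplicaCount=1, maxReplicaCount=6):
print(desired_replicas(3, 5, 1, 6))   # light backlog -> stays at 1
print(desired_replicas(23, 5, 1, 6))  # ceil(23/5) = 5 replicas
print(desired_replicas(60, 5, 1, 6))  # capped at maxReplicaCount = 6
```

A smaller lagThreshold scales out earlier and more aggressively; a larger one tolerates a deeper backlog per Prefill instance before adding replicas.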

Step 1: Create a ScalingAdapter for the RBG role

To allow KEDA to independently control the replica count of a specific role in an RBG, enable ScalingAdapter for the target role when you create the RBG. A RoleBasedGroupScalingAdapter resource bound to that role is then created automatically.

  1. Create a file named rbg.yaml. The scalingAdapter: enable: true setting on lines 73-74 enables ScalingAdapter for the prefill role of the RBG.

    Example YAML:

    apiVersion: workloads.x-k8s.io/v1alpha1
    kind: RoleBasedGroup
    metadata:
      name: dynamo-pd
      namespace: default
    spec:
      roles:
        - name: processor
          replicas: 1
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve graphs.pd_disagg:Frontend -f ./configs/qwen3.yaml
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: #Address of the Dynamo Runtime image built in Step 2
                  name: processor
                  ports:
                    - containerPort: 8000
                      name: health
                      protocol: TCP
                    - containerPort: 9345
                      name: request
                      protocol: TCP
                    - containerPort: 443
                      name: api
                      protocol: TCP
                    - containerPort: 9347
                      name: metrics
                      protocol: TCP
                  readinessProbe:
                    initialDelaySeconds: 30
                    periodSeconds: 30
                    tcpSocket:
                      port: 8000
                  resources:
                    limits:
                      cpu: "8"
                      memory: 12Gi
                    requests:
                      cpu: "8"
                      memory: 12Gi
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
                    - mountPath: /workspace/examples/llm/graphs/pd_disagg.py
                      name: dynamo-configs
                      subPath: pd_disagg.py
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
        - name: prefill
          replicas: 2
          scalingAdapter:
            enable: true
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve components.prefill_worker:PrefillWorker -f ./configs/qwen3.yaml
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: #Address of the Dynamo Runtime image built in Step 2
                  name: prefill-worker
                  resources:
                    limits:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                    requests:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
        - name: decoder
          replicas: 1
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve components.worker:VllmWorker -f ./configs/qwen3.yaml --service-name VllmWorker
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: #Address of the Dynamo Runtime image built in Step 2
                  name: vllm-worker
                  resources:
                    limits:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                    requests:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: dynamo-service
    spec:
      type: ClusterIP
      ports:
        - port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        rolebasedgroup.workloads.x-k8s.io/name: dynamo-pd
        rolebasedgroup.workloads.x-k8s.io/role: processor
  2. Run the following command to create the resources.

    kubectl apply -f rbg.yaml
  3. When the RBG is created, the system automatically creates a custom resource of kind RoleBasedGroupScalingAdapter for each role that has ScalingAdapter enabled and binds it to that role. The RoleBasedGroupScalingAdapter implements the Scale subresource for the bound role.

    • Run the following command to view the RoleBasedGroupScalingAdapter that was automatically created for the role.

      kubectl get rolebasedgroupscalingadapter

      Expected output:

      NAME                  PHASE   REPLICAS
      dynamo-pd-prefill     Bound   2
    • Run the following command to check the status of the dynamo-pd-prefill ScalingAdapter.

      kubectl describe rolebasedgroupscalingadapter dynamo-pd-prefill

      In the expected output, Status.Phase should be Bound, which indicates that the ScalingAdapter is bound to the prefill role of the RBG.

      Name:         dynamo-pd-prefill
      Namespace:    default
      Labels:       <none>
      Annotations:  <none>
      API Version:  workloads.x-k8s.io/v1alpha1
      Kind:         RoleBasedGroupScalingAdapter
      Metadata:
        Creation Timestamp:  2025-07-25T06:10:37Z
        Generation:          2
        Owner References:
          API Version:           workloads.x-k8s.io/v1alpha1
          Block Owner Deletion:  true
          Kind:                  RoleBasedGroup
          Name:                  dynamo-pd
          UID:                   5dd61668-79f3-4197-a5db-b778ce460270
        Resource Version:        1157485
        UID:                     edbb8373-2b9c-4ad1-8b6b-d5dfff71e769
      Spec:
        Replicas:  2
        Scale Target Ref:
          Name:  dynamo-pd
          Role:  prefill
      Status:
        Phase:     Bound
        Replicas:  2
        Selector:  rolebasedgroup.workloads.x-k8s.io/name=dynamo-pd,rolebasedgroup.workloads.x-k8s.io/role=prefill
      Events:
        Type    Reason           Age   From                          Message
        ----    ------           ----  ----                          -------
        Normal  SuccessfulBound  25s   RoleBasedGroupScalingAdapter  Succeed to find scale target role [prefill] of rbg [dynamo-pd]

Step 2: Create a KEDA ScaledObject to monitor the message queue

Create a ScaledObject resource to define the scaling rules and associate it with the RoleBasedGroupScalingAdapter created in the previous step.

  1. Create a file named scaledobject.yaml with the following content. The configuration sets the scale target to the dynamo-pd-prefill ScalingAdapter and defines a trigger based on the number of backlogged messages in the NATS message queue.

    The parameter values in the following scaling policy are for demonstration only. Adjust them based on your actual workload.
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: dynamo-prefill-scaledobject
    spec:
      pollingInterval: 30 # For demo. Default: 30 seconds
      minReplicaCount: 1 # For demo. Default: 0
      maxReplicaCount: 6 # For demo. Default: 100
      scaleTargetRef:
        apiVersion: workloads.x-k8s.io/v1alpha1
        kind: RoleBasedGroupScalingAdapter
        name: dynamo-pd-prefill #Sets the scale target to the Prefill role of the RoleBasedGroup
      triggers:
      - type: nats-jetstream
        metadata:
          natsServerMonitoringEndpoint: "nats.default.svc.cluster.local:8222" #NATS monitoring endpoint
          account: "$G" #Default value when no NATS account is configured
          stream: "dynamo_prefill_queue" #Name of the PrefillQueue in Dynamo
          consumer: "worker-group" #Durable consumer name in Dynamo
          lagThreshold: "5" #Scaling threshold for the number of pending messages in the stream
          useHttps: "false" #Whether to use HTTPS
  2. Run the following command to create the resource.

    kubectl apply -f scaledobject.yaml
  3. Run the following command to check the status of the KEDA ScaledObject resource.

    kubectl describe so dynamo-prefill-scaledobject

    Expected output:

    Name:         dynamo-prefill-scaledobject
    Namespace:    default
    Labels:       scaledobject.keda.sh/name=dynamo-prefill-scaledobject
    Annotations:  <none>
    API Version:  keda.sh/v1alpha1
    Kind:         ScaledObject
    Metadata:
      ...
    Spec:
      Cooldown Period:    300
      Max Replica Count:  6
      Min Replica Count:  1
      Polling Interval:   30
      Scale Target Ref:
        API Version:  workloads.x-k8s.io/v1alpha1
        Kind:         RoleBasedGroupScalingAdapter
        Name:         dynamo-pd-prefill
      Triggers:
        Metadata:
          Account:                          $G
          Consumer:                         worker-group
          Lag Threshold:                    5
          Nats Server Monitoring Endpoint:  nats.default.svc.cluster.local:8222
          Stream:                           dynamo_prefill_queue
          Use Https:                        false
        Type:                               nats-jetstream
    Status:
      Conditions:
        Message:  ScaledObject is defined correctly and is ready for scaling
        Reason:   ScaledObjectReady
        Status:   True
        Type:     Ready
        Message:  Scaling is not performed because triggers are not active
        Reason:   ScalerNotActive
        Status:   False
        Type:     Active
        Status:   Unknown
        Type:     Fallback
      External Metric Names:
        s0-nats-jetstream-dynamo_prefill_queue
      Hpa Name:                keda-hpa-dynamo-prefill-scaledobject
      Original Replica Count:  1
      Scale Target GVKR:
        Group:            workloads.x-k8s.io
        Kind:             RoleBasedGroupScalingAdapter
        Resource:         rolebasedgroupscalingadapters
        Version:          v1alpha1
      Scale Target Kind:  workloads.x-k8s.io/v1alpha1.RoleBasedGroupScalingAdapter
    Events:
      Type    Reason              Age   From           Message
      ----    ------              ----  ----           -------
      Normal  KEDAScalersStarted  3s    keda-operator  Started scalers watch
      Normal  ScaledObjectReady   3s    keda-operator  ScaledObject is ready for scaling

    In the expected output, the Ready condition in Status.Conditions should be True.

    KEDA also automatically creates an HPA resource whose name is recorded in the Status.HpaName field. Run the following command to view it.

    kubectl get hpa keda-hpa-dynamo-prefill-scaledobject
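The nats-jetstream scaler evaluates the consumer's pending-message count exposed by the NATS HTTP monitoring endpoint (port 8222, the endpoint configured in natsServerMonitoringEndpoint above). As an illustration of the value KEDA reads, the following sketch extracts num_pending for a given stream and consumer from a /jsz?consumers=true response. The JSON payload here is a hand-written, simplified sample for demonstration, not real server output.

```python
import json

def consumer_pending(jsz: dict, stream: str, consumer: str) -> int:
    """Walk a /jsz?consumers=true response and return num_pending
    for the given stream/consumer pair (0 if not found)."""
    for account in jsz.get("account_details", []):
        for sd in account.get("stream_detail", []):
            if sd.get("name") != stream:
                continue
            for cd in sd.get("consumer_detail", []):
                if cd.get("name") == consumer:
                    return cd.get("num_pending", 0)
    return 0

# Simplified, hand-written sample of a /jsz?consumers=true response.
sample = json.loads("""
{
  "account_details": [{
    "name": "$G",
    "stream_detail": [{
      "name": "dynamo_prefill_queue",
      "consumer_detail": [{"name": "worker-group", "num_pending": 12}]
    }]
  }]
}
""")
print(consumer_pending(sample, "dynamo_prefill_queue", "worker-group"))  # 12
```

With lagThreshold set to 5, a num_pending of 12 would drive the HPA toward 3 Prefill replicas.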

Step 3: (Optional) Run a stress test and verify the scaling behavior

  1. Create a service instance for the stress test and use the benchmark tool to stress test the service.

    For details about the benchmark tool and how to use it, see vLLM Benchmark.
    1. Create a file named benchmark.yaml.

      Example YAML:

      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        labels:
          app: llm-benchmark
        name: llm-benchmark
      spec:
        selector:
          matchLabels:
            app: llm-benchmark
        template:
          metadata:
            labels:
              app: llm-benchmark
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            containers:
            - command:
              - sh
              - -c
              - sleep inf
              image: #Dynamo container image used to deploy the inference service
              imagePullPolicy: IfNotPresent
              name: llm-benchmark
              resources:
                limits:
                  cpu: "8"
                  memory: 40Gi
                requests:
                  cpu: "8"
                  memory: 40Gi
              volumeMounts:
              - mountPath: /models/Qwen3-32B
                name: llm-model
            volumes:
            - name: llm-model
              persistentVolumeClaim:
                claimName: llm-model
    2. Run the following command to create the stress test instance.

      kubectl create -f benchmark.yaml
    3. After the instance is up and running, run the following command in the instance to start the stress test:

      python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
              --backend openai-chat \
              --model /models/Qwen3-32B/ \
              --served-model-name qwen \
              --trust-remote-code \
              --dataset-name random \
              --random-input-len 1500 \
              --random-output-len 100 \
              --num-prompts 320 \
              --max-concurrency 32 \
              --host dynamo-service \
              --port 8000 \
              --endpoint /v1/chat/completions 
  2. During the stress test, open a new terminal and run the following command to observe the HPA scaling events.

    kubectl describe hpa keda-hpa-dynamo-prefill-scaledobject

    In the expected output, the Events field records SuccessfulRescale events, which indicate that KEDA has triggered scale-out based on the NATS queue backlog.

    Name:                               keda-hpa-dynamo-prefill-scaledobject
    Namespace:                          default
    Reference:                          RoleBasedGroupScalingAdapter/dynamo-pd-prefill
    Min replicas:                       1
    Max replicas:                       6
    RoleBasedGroupScalingAdapter pods:  6 current / 6 desired
    Events:
      Type     Reason             Age                   From                       Message
      ----     ------             ----                  ----                       -------
      Normal  SuccessfulRescale  2m1s  horizontal-pod-autoscaler  New size: 4; reason: external metric s0-nats-jetstream-dynamo_prefill_queue(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: dynamo-prefill-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
      Normal  SuccessfulRescale  106s  horizontal-pod-autoscaler  New size: 6; reason: external metric s0-nats-jetstream-dynamo_prefill_queue(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: dynamo-prefill-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
  3. You can also observe how the replica count of the RoleBasedGroupScalingAdapter changes.

    kubectl describe rolebasedgroupscalingadapter dynamo-pd-prefill

    In the expected output, the values of Spec.Replicas and Status.Replicas increase from the initial value to the scaled-out value (for example, 6).

    Name:         dynamo-pd-prefill
    Namespace:    default
    API Version:  workloads.x-k8s.io/v1alpha1
    Kind:         RoleBasedGroupScalingAdapter
    Metadata:
      Owner References:
        API Version:           workloads.x-k8s.io/v1alpha1
        Block Owner Deletion:  true
        Kind:                  RoleBasedGroup
        Name:                  dynamo-pd
    Spec:
      Replicas:  6
      Scale Target Ref:
        Name:  dynamo-pd
        Role:  prefill
    Status:
      Last Scale Time:  2025-08-04T02:08:10Z
      Phase:            Bound
      Replicas:         6
      Selector:         rolebasedgroup.workloads.x-k8s.io/name=dynamo-pd,rolebasedgroup.workloads.x-k8s.io/role=prefill
    Events:
      Type    Reason           Age    From                          Message
      ----    ------           ----   ----                          -------
      Normal  SuccessfulBound  6m9s   RoleBasedGroupScalingAdapter  Succeed to find scale target role [prefill] of rbg [dynamo-pd]
      Normal  SuccessfulScale  4m40s  RoleBasedGroupScalingAdapter  Succeed to scale target role [prefill] of rbg [dynamo-pd] from 1 to 4 replicas
      Normal  SuccessfulScale  4m25s  RoleBasedGroupScalingAdapter  Succeed to scale target role [prefill] of rbg [dynamo-pd] from 4 to 6 replicas