基於請求隊列、KVCache以及LoRA感知等多維度智能路由與流量管理 - Alibaba Cloud Service Mesh

針對部署在Kubernetes叢集中的LLM推理服務，由於LLM推理流量和工作負載的特殊性，經典的負載平衡方法效果不佳，且無法從日誌和監控指標中擷取LLM推理相關資訊。本文將介紹如何在服務網格ASM中為vLLM推理服務聲明推理服務池和路由定義，以提升負載平衡效能，並實現推理流量的路由和可觀測性。

閱讀前提示

閱讀本文前，您需要瞭解：

如何在ACS叢集中使用GPU算力。
如何在ACK叢集中建立和使用GPU節點池或使用ACS GPU算力。

通過閱讀本文，您可以瞭解到：

大語言模型和vLLM的背景資訊。
使用傳統方式管理叢集內的LLM推理服務會出現的問題。
使用ASM管理叢集內的LLM推理服務的原理及實踐操作。

背景資訊

大語言模型LLM

大語言模型LLM（Large Language Model）指參數數量達到億層級的神經網路語言模型，例如GPT、通義千問和Llama。模型在超大規模的預訓練資料（預訓練資料類型多樣且覆蓋廣泛，包括大量網路文本、專業書籍和代碼等內容）上進行訓練得到，通常用於文本產生式任務，如補全、對話任務等。

在構建基於LLM的應用時，您可以通過兩種方式擷取LLM提供的文本產生能力：

可以使用類似OpenAI、阿里雲百鍊或Moonshot等平台提供的外部LLM API服務。
也可以基於開源或自研大模型、使用vLLM等推理服務架構構建LLM推理服務，並將LLM推理服務部署在Kubernetes叢集中。這種方式適用於希望自行控制LLM推理服務或對LLM的推理能力有較高定製化需求的情境。

vLLM

vLLM是一個高效易用流行的構建LLM推理服務的架構，支援包括通義千問在內的多種常見大語言模型。vLLM通過PagedAttention最佳化、動態批量推理（Continuous Batching）模型量化等最佳化技術，可以取得較好的大語言模型推理效率。

負載平衡與可觀測

ASM支援管理叢集內LLM推理服務流量，當在叢集內部署的服務為LLM推理服務時，可以通過InferencePool和InferenceModel CRD，聲明提供推理服務的工作負載集合以及服務所提供的模型名稱等資訊。通過這種方式，ASM可以為LLM推理服務提供針對LLM推理後端的負載平衡、流量路由和可觀測能力。

重要

目前僅支援基於vLLM實現的LLM推理服務。

傳統負載平衡

對於傳統的HTTP請求，經典負載平衡演算法可以將請求均勻地發送給不同的工作負載。然而，對於LLM推理服務來說，每個請求給後端帶來的負載是難以預測的。

在推理過程中，請求處理可分為預填充和解碼兩個階段：

預填充階段：對使用者的輸入進行編碼。
解碼階段：分為若干步驟，每個步驟都會對先前的輸入進行解碼、並輸出新的token（即LLM資料處理的基本單位、可粗略對應LLM推理輸出的每個單詞）。由於無法事先確定每個請求會輸出多少token（即對應多少次解碼步驟），如果將請求均勻發送到不同工作負載，將導致每個工作負載的實際工作量不一致，造成負載不均衡。

LLM負載平衡

ASM提供了基於LLM後端的負載平衡演算法。通過推理伺服器多個維度指標來評估推理伺服器內部狀態，並根據內部狀態對多個推理伺服器工作負載進行負載平衡。主要包括：

請求隊列長度：代表模型伺服器正在排隊等待處理的請求數量，該指標通過vLLM伺服器的vllm: num_requests_waiting指標擷取。對於模型伺服器來說，排隊的請求數量越少，新請求被及時處理的可能性越大。
GPU Cache利用率：代表模型伺服器用於緩衝推理中間結果的緩衝（KV Cache）利用率百分比。對於vLLM伺服器，緩衝利用率越低，代表GPU還有充足的空間將資源分派給新來的請求。該指標通過vLLM伺服器的vllm: gpu_cache_usage_perc指標擷取。

相對於經典的負載平衡演算法，此方式可以更好地保證多個推理服務工作負載的GPU負載一致性，顯著降低LLM推理請求第一個token的響應時延（ttft）、並提升LLM推理請求的輸送量。

傳統可觀測

LLM推理服務，目前普遍採用OpenAI的請求API格式進行互動。即請求中繼資料大多位於請求體中，如請求的模型名稱、請求最大token數等。而經典的請求路由能力和可觀測能力都是基於要求標頭和請求路徑等中繼資料，不涉及請求體的解析。因此無法滿足根據請求模型名稱進行流量分配，或觀測請求中的模型名稱以及請求產生token數等需求。

推理流量可觀測

對於LLM推理請求，ASM在訪問日誌和監控指標上都提供了針對LLM推理的增強能力。

在訪問日誌中可以提供請求的模型名稱、輸入token數量和輸出token數量欄位。
在請求監控指標中增加指代模型名稱的自訂維度。
支援新增代表工作負載Token消耗的監控指標。

前提條件

已建立帶有GPU節點池的ACK託管叢集或選擇了GPU算力推薦可用性區域的ACS叢集。
您也可以在ACK託管叢集中安裝ACK Virtual Node組件，以使用ACS GPU算力。
已添加叢集到ASM執行個體，且ASM執行個體版本為1.24及以上。
已建立入口網關並開啟8080HTTP連接埠。
（可選）已為default命名空間注入Sidecar。具體操作，請參見。
說明
若您不想體驗可觀測部分的實踐操作，可以不進行Sidecar注入。

情境實踐

以下以在叢集中部署基於vLLM的Llama2大模型為例，展示ASM管理叢集內LLM推理服務流量的實踐步驟。

步驟一：部署樣本推理服務

使用以下內容，建立vllm-service.yaml。

說明

本文使用的鏡像需要GPU顯存大於16GiB，T4卡型（16GiB顯存）的實際可用顯存不足以啟動此應用。因此ACK叢集卡型推薦使用A10，ACS叢集卡型推薦使用8代GPU B。具體對應型號請提交工單諮詢。

同時，由於LLM鏡像體積較大，建議您提前轉存到ACR，使用內網地址進行拉取。直接從公網拉取的速度取決於叢集EIP的頻寬配置，會有較長的等待時間。

ACS叢集

apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
      {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
      {% set messages = messages[1:] %}
    {% else %}
        {% set system_message = '' %}
    {% endif %}

    {% for message in messages %}
        {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
            {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
        {% endif %}

        {% if loop.index0 == 0 %}
            {% set content = system_message + message['content'] %}
        {% else %}
            {% set content = message['content'] %}
        {% endif %}
        {% if message['role'] == 'user' %}
            {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
        {% elif message['role'] == 'assistant' %}
            {{ ' ' + content | trim + ' ' + eos_token }}
        {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
        alibabacloud.com/compute-class: gpu  # 指定使用GPU算力
        alibabacloud.com/compute-qos: default
        alibabacloud.com/gpu-model-series: "example-model" # 指定GPU型號為example-model，請按實際情況填寫
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
          - "--model"
          - "/model/llama2"
          - "--tensor-parallel-size"
          - "1"
          - "--port"
          - "8000"
          - '--gpu_memory_utilization'
          - '0.8'
          - "--enable-lora"
          - "--max-loras"
          - "4"
          - "--max-cpu-loras"
          - "12"
          - "--lora-modules"
          - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
          - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
          - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
          - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
          - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
          - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
          - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
          - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
          - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
          - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
          - '--chat-template'
          - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 16
              memory: 64Gi
              nvidia.com/gpu: 1
            requests:
              cpu: 8
              memory: 30Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: chat-template
          configMap:
            name: chat-template

ACK叢集

apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2-7b-pool
spec:
  selector:
    app: vllm-llama2-7b-pool
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-template
data:
  llama-2-chat.jinja: |
    {% if messages[0]['role'] == 'system' %}
      {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
      {% set messages = messages[1:] %}
    {% else %}
        {% set system_message = '' %}
    {% endif %}

    {% for message in messages %}
        {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
            {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
        {% endif %}

        {% if loop.index0 == 0 %}
            {% set content = system_message + message['content'] %}
        {% else %}
            {% set content = message['content'] %}
        {% endif %}
        {% if message['role'] == 'user' %}
            {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
        {% elif message['role'] == 'assistant' %}
            {{ ' ' + content | trim + ' ' + eos_token }}
        {% endif %}
    {% endfor %}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-7b-pool
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama2-7b-pool
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '8000'
        prometheus.io/scrape: 'true'
      labels:
        app: vllm-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
          imagePullPolicy: IfNotPresent
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args:
          - "--model"
          - "/model/llama2"
          - "--tensor-parallel-size"
          - "1"
          - "--port"
          - "8000"
          - '--gpu_memory_utilization'
          - '0.8'
          - "--enable-lora"
          - "--max-loras"
          - "4"
          - "--max-cpu-loras"
          - "12"
          - "--lora-modules"
          - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
          - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
          - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
          - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
          - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
          - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
          - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
          - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
          - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
          - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
          - '--chat-template'
          - '/etc/vllm/llama-2-chat.jinja'
          env:
            - name: PORT
              value: "8000"
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 2400
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 6000
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
            - mountPath: /etc/vllm
              name: chat-template
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: chat-template
          configMap:
            name: chat-template

使用資料面叢集的kubeconfig，部署LLM推理服務。
```
kubectl apply -f vllm-service.yaml
```

步驟二：配置ASM網關規則

部署生效於ASM網關的網關規則，該規則表示開啟網關8080連接埠的監聽。

使用以下內容，建立gateway.yaml。

apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: llm-inference-gateway
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
    - hosts:
        - '*'
      port:
        name: http-service
        number: 8080
        protocol: HTTP

建立網關規則。
```
kubectl apply -f gateway.yaml
```

步驟三：配置LLM推理服務路由和負載平衡

說明

若您想體驗傳統的負載平衡與LLM負載平衡的效能對比，推薦先完成（可選）通過可觀測大盤與普通負載平衡方式進行效能對比中的步驟，再繼續執行後續操作。

使用ASM的kubeconfig，啟用LLM推理服務路由能力。

kubectl patch asmmeshconfig default --type=merge --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'

部署InferencePool資源。
InferencePool資源通過標籤選取器聲明一組在叢集中啟動並執行LLM推理服務工作負載，ASM會根據您建立的InferencePool來對LLM推理服務開啟vLLM負載平衡。
1. 使用以下內容，建立inferencepool.yaml。
```
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: vllm-llama2-7b-pool
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama2-7b-pool
```
  部分配置項說明如下：
  配置項
  說明
  .spec.targetPortNumber
  提供推理服務的Pod所暴露的連接埠。
  .spec.selector
  提供推理服務的Pod標籤。標籤key必須為app且值需為對應的Service名稱。
2. 使用資料面叢集的kubeconfig，建立InferencePool資源。
```
kubectl apply -f inferencepool.yaml
```

部署InferenceModel資源。

InferenceModel指定了InferencePool中具體模型的流量分發策略。

使用以下內容，建立inferencemodel.yaml。

apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: tweet-summary
  poolRef:
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: vllm-llama2-7b-pool
  targetModels:
  - name: tweet-summary
    weight: 100

部分配置項說明如下：

配置項	說明
.spec.modelName	用於匹配請求中的model參數。
.spec.targetModels	配置流量路由規則。上述樣本中是將要求標頭中包含model:tweet-summary的流量100%發送到運行了tweet-summary模型對應的Pod。

建立InferenceModel資源。
```
kubectl apply -f inferencemodel.yaml
```

建立LLMRoute資源。

為網關配置路由規則。此路由規則通過引用InferencePool資源的方式，指定ASM網關將8080連接埠上接收到的請求全部轉寄給樣本LLM推理服務。

使用以下內容，建立llmroute.yaml。

apiVersion: istio.alibabacloud.com/v1
kind: LLMRoute
metadata:  
  name: test-llm-route
spec:
  gateways: 
  - llm-inference-gateway
  host: test.com
  rules:
  - backendRefs:
    - backendRef:
        group: inference.networking.x-k8s.io
        kind: InferencePool
        name: vllm-llama2-7b-pool

建立LLMRoute資源。
```
kubectl apply -f llmroute.yaml
```

步驟四：驗證執行結果

多次執行以下命令，發起測試。

curl -H "host: test.com" ${ASM網關IP}:8080/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}' -v

預期輸出：

{"id":"cmpl-2fc9a351-d866-422b-b561-874a30843a6b","object":"text_completion","created":1736933141,"model":"tweet-summary","choices":[{"index":0,"text":", I'm a newbie to this forum. Write a summary of the article.\nWrite a summary of the article.\nWrite a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":2,"total_tokens":102,"completion_tokens":100,"prompt_tokens_details":null}}

（可選）步驟五：配置LLM服務可觀測指標與可觀測大盤

在使用InferencePool和InferenceMode資源聲明叢集中的LLM推理服務，並配置路由策略後，可以通過在日誌和監控指標兩方面查看LLM推理的可觀測能力。

開啟ASM的LLM流量可觀測能力，採集服務網格監控指標。

通過操作新增日誌欄位、新增指標以及新增指標維度增強LLM推理請求的可觀測性資訊。
完成配置後，ASM監控指標中會新增一個model維度。您可以通過將監控指標採集到可觀測監控Prometheus版，或者整合自建Prometheus實現網格監控進行指標採集。

ASM新增提供兩個指標來表示所有請求的輸入token數asm_llm_proxy_prompt_tokens和輸出token數asm_llm_proxy_completion_tokens。您可以為Prometheus增加以下規則來新增這些指標。

scrape_configs:
- job_name: asm-envoy-stats-llm
  scrape_interval: 30s
  scrape_timeout: 30s
  metrics_path: /stats/prometheus
  scheme: http
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_pod_container_port_name
    action: keep
    regex: .*-envoy-prom
  - source_labels:
    - __address__
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:15090
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels:
    - __meta_kubernetes_namespace
    action: replace
    target_label: namespace
  - source_labels:
    - __meta_kubernetes_pod_name
    action: replace
    target_label: pod_name
  metric_relabel_configs:
  - action: keep
    source_labels:
    - __name__
    regex: asm_llm_.*

採集vLLM服務監控指標。

ASM提供的LLM推理請求監控指標主要監控了外部LLM推理請求的吞吐情況。您可以為vLLM服務Pod增加Prometheus採集相關的註解，以採集vLLM服務暴露的監控指標，監控vLLM服務的內部狀態。

...
annotations:
  prometheus.io/path: /metrics # 指標暴露的HTTP Path。
  prometheus.io/port: "8000" # 指標暴露連接埠，即為vLLM Server的監聽連接埠。
  prometheus.io/scrape: "true" # 是否抓取當前Pod的指標。
...

通過Prometheus執行個體預設的服務發現機制採集vLLM服務相關指標。具體操作，請參見。

在vLLM服務提供的監控指標中，可以通過以下重點指標來直觀瞭解vLLM工作負載的內部狀態。

指標名稱	說明
vllm:gpu_cache_usage_perc	vllm的GPU緩衝使用百分比。vLLM啟動時，會儘可能多地預先佔有一塊GPU顯存，用於進行KV緩衝。對於vLLM伺服器，緩衝利用率越低，代表GPU還有充足的空間將資源分派給新來的請求。
vllm:request_queue_time_seconds_sum	請求在等待狀態排隊花費的時間。LLM推理請求在到達vLLM伺服器後、可能不會被立刻處理，而是需要等待被vLLM調度器調度運行預填充和解碼。
vllm:num_requests_running vllm:num_requests_waiting vllm:num_requests_swapped	正在運行推理、正在等待和被交換到記憶體的請求數量。可以用來評估vLLM服務當前的請求壓力。
vllm:avg_generation_throughput_toks_per_s vllm:avg_prompt_throughput_toks_per_s	每秒被預填充階段消耗的token以及解碼階段產生的token數量。
vllm:time_to_first_token_seconds_bucket	從請求發送到vLLM服務，到響應第一個token為止的時延水平。該指標通常代表了用戶端在輸出請求內容後得到首個響應所需的時間、是影響LLM使用者體驗的重要指標。

配置Grafana大盤檢測LLM推理服務。

您可以通過Grafana大盤來觀測基於vLLM部署的LLM推理服務：

通過基於ASM監控指標的面板觀測服務的請求速率和整體token吞吐；
通過基於vLLM監控指標的面板觀測推理工作負載的內部狀態。

請確保Grafana使用的資料來源Prometheus執行個體已經採集服務網格和vLLM的監控指標。

將以下內容匯入到Grafana，建立LLM推理服務的可觀測大盤。

展開查看JSON內容

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "description": "Monitoring vLLM Inference Server",
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 49,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 23,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "builder",
          "exemplar": false,
          "expr": "sum by(model) (rate(istio_requests_total{model!=\"unknown\"}[$__rate_interval]))",
          "instant": false,
          "interval": "",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Request Rate",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
      },
      "description": "",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green"
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      },
      "id": 20,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "editorMode": "code",
          "expr": "sum by(llmproxy_model) (rate(asm_llm_proxy_completion_tokens{}[$__rate_interval]))",
          "instant": false,
          "legendFormat": "generate tokens (from proxy)",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "editorMode": "code",
          "expr": "sum by(llmproxy_model) (rate(asm_llm_proxy_prompt_tokens{}[$__rate_interval]))",
          "hide": false,
          "instant": false,
          "legendFormat": "prompt tokens (from proxy)",
          "range": true,
          "refId": "B"
        }
      ],
      "title": "Tokens Rate",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "min": -1,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percentunit"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      },
      "id": 17,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "10.0.9",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "builder",
          "expr": "avg(vllm:gpu_cache_usage_perc)",
          "hide": false,
          "instant": false,
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Average gpu cache usage",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 8
      },
      "id": 18,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "10.0.9",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "expr": "avg(rate(vllm:request_queue_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval]))",
          "hide": false,
          "instant": false,
          "range": true,
          "refId": "C"
        }
      ],
      "title": "Average Queue Time",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "Percentage of used cache blocks by vLLM.",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percentunit"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 16
      },
      "id": 4,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "builder",
          "expr": "sum by(kubernetes_pod_name) (vllm:gpu_cache_usage_perc{model_name=\"$model_name\"})",
          "instant": false,
          "legendFormat": "GPU Cache Usage ({{kubernetes_pod_name}})",
          "range": true,
          "refId": "A"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "expr": "vllm:cpu_cache_usage_perc{model_name=\"$model_name\"}",
          "hide": false,
          "instant": false,
          "legendFormat": "CPU Cache Usage",
          "range": true,
          "refId": "B"
        }
      ],
      "title": "Cache Utilization",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "seconds",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 16
      },
      "id": 14,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "edx8memhpd9tsa"
          },
          "disableTextWrap": false,
          "editorMode": "code",
          "expr": "sum by(kubernetes_pod_name) (rate(vllm:request_queue_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval]))",
          "fullMetaSearch": false,
          "includeNullMetadata": true,
          "instant": false,
          "legendFormat": "__auto",
          "range": true,
          "refId": "A",
          "useBackend": false
        }
      ],
      "title": "Queue Time",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "P50, P90, P95, and P99 TTFT latency in seconds.",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 24
      },
      "id": 5,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P99",
          "range": true,
          "refId": "A",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P95",
          "range": true,
          "refId": "B",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P90",
          "range": true,
          "refId": "C",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P50",
          "range": true,
          "refId": "D",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "builder",
          "expr": "sum by(kubernetes_pod_name) (rate(vllm:time_to_first_token_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])) / sum by(kubernetes_pod_name) (rate(vllm:time_to_first_token_seconds_count{model_name=\"$model_name\"}[$__rate_interval]))",
          "hide": false,
          "instant": false,
          "legendFormat": "Average ({{kubernetes_pod_name}})",
          "range": true,
          "refId": "E"
        }
      ],
      "title": "Time To First Token Latency",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "Number of tokens processed per second",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 24
      },
      "id": 8,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "rate(vllm:prompt_tokens_total{model_name=\"$model_name\"}[$__rate_interval])",
          "fullMetaSearch": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "Prompt Tokens/Sec",
          "range": true,
          "refId": "A",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "code",
          "expr": "sum by(kubernetes_pod_name) (rate(vllm:generation_tokens_total{model_name=\"$model_name\"}[$__rate_interval]))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "Generation Tokens/Sec ({{kubernetes_pod_name}})",
          "range": true,
          "refId": "B",
          "useBackend": false
        }
      ],
      "title": "Token Throughput",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
      },
      "description": "End to end request latency measured in seconds.",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 32
      },
      "id": 9,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P99",
          "range": true,
          "refId": "A",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P95",
          "range": true,
          "refId": "B",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P90",
          "range": true,
          "refId": "C",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P50",
          "range": true,
          "refId": "D",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "editorMode": "code",
          "expr": "rate(vllm:e2e_request_latency_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])\n/\nrate(vllm:e2e_request_latency_seconds_count{model_name=\"$model_name\"}[$__rate_interval])",
          "hide": false,
          "instant": false,
          "legendFormat": "Average",
          "range": true,
          "refId": "E"
        }
      ],
      "title": "E2E Request Latency",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "Number of requests in RUNNING, WAITING, and SWAPPED state",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "none"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 32
      },
      "id": 3,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "vllm:num_requests_running{model_name=\"$model_name\"}",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": true,
          "instant": false,
          "legendFormat": "Num Running",
          "range": true,
          "refId": "A",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "vllm:num_requests_swapped{model_name=\"$model_name\"}",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": true,
          "instant": false,
          "legendFormat": "Num Swapped",
          "range": true,
          "refId": "B",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "exemplar": false,
          "expr": "sum by(kubernetes_pod_name) (vllm:num_requests_waiting{model_name=\"$model_name\"})",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": true,
          "instant": false,
          "legendFormat": "Num Waiting for {{kubernetes_pod_name}}",
          "range": true,
          "refId": "C",
          "useBackend": false
        }
      ],
      "title": "Scheduler State",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "Inter token latency in seconds.",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 40
      },
      "id": 10,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P99",
          "range": true,
          "refId": "A",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P95",
          "range": true,
          "refId": "B",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P90",
          "range": true,
          "refId": "C",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": false,
          "instant": false,
          "legendFormat": "P50",
          "range": true,
          "refId": "D",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "expr": "rate(vllm:time_per_output_token_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])\n/\nrate(vllm:time_per_output_token_seconds_count{model_name=\"$model_name\"}[$__rate_interval])",
          "hide": false,
          "instant": false,
          "legendFormat": "Mean",
          "range": true,
          "refId": "E"
        }
      ],
      "title": "Time Per Output Token Latency",
      "type": "timeseries"
    },
    {
      "datasource": {
        "default": false,
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": [
          {
            "__systemRef": "hideSeriesFrom",
            "matcher": {
              "id": "byNames",
              "options": {
                "mode": "exclude",
                "names": [
                  "Decode"
                ],
                "prefix": "All except:",
                "readOnly": true
              }
            },
            "properties": [
              {
                "id": "custom.hideFrom",
                "value": {
                  "legend": false,
                  "tooltip": false,
                  "viz": true
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 40
      },
      "id": 15,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "edx8memhpd9tsa"
          },
          "disableTextWrap": false,
          "editorMode": "code",
          "expr": "rate(vllm:request_prefill_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])",
          "fullMetaSearch": false,
          "includeNullMetadata": true,
          "instant": false,
          "legendFormat": "Prefill",
          "range": true,
          "refId": "A",
          "useBackend": false
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "expr": "rate(vllm:request_decode_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])",
          "hide": false,
          "instant": false,
          "legendFormat": "Decode",
          "range": true,
          "refId": "B"
        }
      ],
      "title": "Requests Prefill and Decode Time",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "Heatmap of request prompt length",
      "fieldConfig": {
        "defaults": {
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "scaleDistribution": {
              "type": "linear"
            }
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 48
      },
      "id": 12,
      "options": {
        "calculate": false,
        "cellGap": 1,
        "cellValues": {
          "unit": "none"
        },
        "color": {
          "exponent": 0.5,
          "fill": "dark-orange",
          "min": 0,
          "mode": "scheme",
          "reverse": false,
          "scale": "exponential",
          "scheme": "Spectral",
          "steps": 64
        },
        "exemplars": {
          "color": "rgba(255,0,255,0.7)"
        },
        "filterValues": {
          "le": 1e-9
        },
        "legend": {
          "show": true
        },
        "rowsFrame": {
          "layout": "auto",
          "value": "Request count"
        },
        "tooltip": {
          "mode": "single",
          "show": true,
          "showColorScale": false,
          "yHistogram": true
        },
        "yAxis": {
          "axisLabel": "Prompt Length",
          "axisPlacement": "left",
          "reverse": false,
          "unit": "none"
        }
      },
      "pluginVersion": "10.0.9",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "sum by(le) (increase(vllm:request_prompt_tokens_bucket{model_name=\"$model_name\"}[$__rate_interval]))",
          "format": "heatmap",
          "fullMetaSearch": false,
          "includeNullMetadata": true,
          "instant": false,
          "legendFormat": "{{le}}",
          "range": true,
          "refId": "A",
          "useBackend": false
        }
      ],
      "title": "Request Prompt Length",
      "type": "heatmap"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "Heatmap of request generation length",
      "fieldConfig": {
        "defaults": {
          "custom": {
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "scaleDistribution": {
              "type": "linear"
            }
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 48
      },
      "id": 13,
      "options": {
        "calculate": false,
        "cellGap": 1,
        "cellValues": {
          "unit": "none"
        },
        "color": {
          "exponent": 0.5,
          "fill": "dark-orange",
          "min": 0,
          "mode": "scheme",
          "reverse": false,
          "scale": "exponential",
          "scheme": "Spectral",
          "steps": 64
        },
        "exemplars": {
          "color": "rgba(255,0,255,0.7)"
        },
        "filterValues": {
          "le": 1e-9
        },
        "legend": {
          "show": true
        },
        "rowsFrame": {
          "layout": "auto",
          "value": "Request count"
        },
        "tooltip": {
          "mode": "single",
          "show": true,
          "showColorScale": false,
          "yHistogram": true
        },
        "yAxis": {
          "axisLabel": "Generation Length",
          "axisPlacement": "left",
          "reverse": false,
          "unit": "none"
        }
      },
      "pluginVersion": "10.0.9",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "sum by(le) (increase(vllm:request_generation_tokens_bucket{model_name=\"$model_name\"}[$__rate_interval]))",
          "format": "heatmap",
          "fullMetaSearch": false,
          "includeNullMetadata": true,
          "instant": false,
          "legendFormat": "{{le}}",
          "range": true,
          "refId": "A",
          "useBackend": false
        }
      ],
      "title": "Request Generation Length",
      "type": "heatmap"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "description": "Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.",
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 56
      },
      "id": 11,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "sum by(finished_reason) (increase(vllm:request_success_total{model_name=\"$model_name\"}[$__rate_interval]))",
          "fullMetaSearch": false,
          "includeNullMetadata": true,
          "instant": false,
          "interval": "",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A",
          "useBackend": false
        }
      ],
      "title": "Finish Reason",
      "type": "timeseries"
    },
    {
      "datasource": {
        "default": false,
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "palette-classic"
          },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 0,
            "gradientMode": "none",
            "hideFrom": {
              "legend": false,
              "tooltip": false,
              "viz": false
            },
            "lineInterpolation": "linear",
            "lineWidth": 1,
            "pointSize": 5,
            "scaleDistribution": {
              "type": "linear"
            },
            "showPoints": "auto",
            "spanNulls": false,
            "stacking": {
              "group": "A",
              "mode": "none"
            },
            "thresholdsStyle": {
              "mode": "off"
            }
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": [
          {
            "__systemRef": "hideSeriesFrom",
            "matcher": {
              "id": "byNames",
              "options": {
                "mode": "exclude",
                "names": [
                  "Tokens"
                ],
                "prefix": "All except:",
                "readOnly": true
              }
            },
            "properties": [
              {
                "id": "custom.hideFrom",
                "value": {
                  "legend": false,
                  "tooltip": false,
                  "viz": true
                }
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 56
      },
      "id": 16,
      "options": {
        "legend": {
          "calcs": [],
          "displayMode": "list",
          "placement": "bottom",
          "showLegend": true
        },
        "tooltip": {
          "mode": "single",
          "sort": "none"
        }
      },
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "edx8memhpd9tsa"
          },
          "disableTextWrap": false,
          "editorMode": "code",
          "expr": "rate(vllm:request_max_num_generation_tokens_sum{model_name=\"$model_name\"}[$__rate_interval])",
          "fullMetaSearch": false,
          "includeNullMetadata": true,
          "instant": false,
          "legendFormat": "Tokens",
          "range": true,
          "refId": "A",
          "useBackend": false
        }
      ],
      "title": "Max Generation Token in Sequence Group",
      "type": "timeseries"
    }
  ],
  "refresh": false,
  "schemaVersion": 38,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": [
      {
        "current": {
          "selected": true,
          "text": "prom-cec64713b1aab44d0b49236b6f54cd671",
          "value": "prom-cec64713b1aab44d0b49236b6f54cd671"
        },
        "hide": 0,
        "includeAll": false,
        "label": "datasource",
        "multi": false,
        "name": "DS_PROMETHEUS",
        "options": [],
        "query": "prometheus",
        "queryValue": "",
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "type": "datasource"
      },
      {
        "current": {
          "selected": false,
          "text": "/model/llama2",
          "value": "/model/llama2"
        },
        "datasource": {
          "type": "prometheus",
          "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
        },
        "definition": "label_values(model_name)",
        "hide": 0,
        "includeAll": false,
        "label": "model_name",
        "multi": false,
        "name": "model_name",
        "options": [],
        "query": {
          "query": "label_values(model_name)",
          "refId": "StandardVariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 0,
        "type": "query"
      }
    ]
  },
  "time": {
    "from": "2025-01-10T04:00:36.511Z",
    "to": "2025-01-10T04:18:26.639Z"
  },
  "timepicker": {},
  "timezone": "",
  "title": "vLLM",
  "uid": "b281712d-8bff-41ef-9f3f-71ad43c05e9c",
  "version": 10,
  "weekStart": ""
}

大盤效果如下：

（可選）通過可觀測大盤與普通負載平衡方式進行效能對比

基於可觀測大盤，可以直觀地對比使用LLM推理服務負載平衡和使用傳統最少請求負載平衡演算法的實際表現：包括vLLM服務緩衝使用率、請求排隊時間、token輸送量和ttft指標等。

說明

若您已經完成了步驟三，請先執行以下命令清除已經建立的資源。

kubectl delete inferencemodel --all
kubectl delete inferencepool --all
kubectl delete llmroute --all

反之，請在完成以下1、2步驟後，清除已經建立的虛擬服務，再繼續執行步驟三。

執行以下命令，建立一個虛擬服務為樣本LLM推理服務提供路由和傳統負載平衡。

kubectl apply -f- <<EOF
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: llm-vs
  namespace: default
spec:
  gateways:
    - default/llm-inference-gateway
  hosts:
    - '*'
  http:
    - name: any-host
      route:
        - destination:
            host: vllm-llama2-7b-pool.default.svc.cluster.local
            port:
              number: 8000
EOF

通過llmperf工具對LLM推理服務進行壓力測試。
通過Grafana大盤比對兩種路由和負載平衡方案。
下圖直觀展示了兩種路由和負載平衡方案的對比，可以發現使用LLM推理服務負載平衡時，LLM推理服務擁有更好的延遲、輸送量和緩衝利用率表現。

配置項	說明
.spec.targetPortNumber	提供推理服務的Pod所暴露的連接埠。
.spec.selector	提供推理服務的Pod標籤。標籤key必須為app且值需為對應的Service名稱。