All Products
Search
Document Center

Alibaba Cloud Service Mesh:Intelligent routing and traffic management based on request queue, KVCache, and LoRA awareness

Last Updated:May 30, 2025

This topic explains how to use Service Mesh (ASM) to enhance load balancing and traffic management for LLM inference services deployed in Kubernetes clusters. Due to the unique characteristics of LLM inference traffic and workloads, traditional load balancing methods may fall short. This topic walks you through the steps to define service pools and routing for vLLM inference services to improve performance and gain insight into inference traffic.

Reading tips

Before you begin, ensure you are familiar with:

By reading this topic, you will learn about:

  • The background of large language models and vLLM.

  • Challenges in managing LLM inference services in a cluster using conventional methods.

  • The concepts and practical steps for managing LLM inference services in a cluster using ASM.

Background

Large language model (LLM)

Large language models (LLMs) are neural network-based language models with billions of parameters, exemplified by GPT, Qwen, and Llama. These models are trained on diverse and extensive pre-training datasets, including web text, professional literature, and code, and are primarily used for text generation tasks such as completion and dialogue.

To leverage LLMs for building applications, you can:

  • Utilize external LLM API services from platforms like OpenAI, Alibaba Cloud Model Studio, or Moonshot.

  • Build your own LLM inference services using open-source or proprietary models and frameworks such as vLLM, and deploy them in a Kubernetes cluster. This approach is suitable for scenarios requiring control over the inference service or high customization of LLM inference capabilities.

vLLM

vLLM is a framework designed for efficient and user-friendly construction of LLM inference services. It supports various large language models, including Qwen, and optimizes LLM inference efficiency through techniques like PagedAttention, dynamic batch inference (Continuous Batching), and model quantization.

Load balancing and observability

ASM facilitates the management of LLM inference service traffic within a cluster. When deploying an LLM inference service, you can declare the workloads providing the service and the model nameS through InferencePool and InferenceModel Custom Resource Definitions (CRDs). ASM then provides load balancing, traffic routing, and observability for LLM inference services targeting LLM inference backends.

Important

Currently, only LLM inference services based on vLLM are supported.

Traditional load balancing

Classic load balancing algorithms distribute HTTP requests evenly across various workloads. However, with LLM inference services, the load each request imposes on the backend is unpredictable.

The inference process consists of two phases: prefill and decode:

  • Prefill stage: EncodES the user's input.

  • Decode stage: Involves multiple steps where each one decodes the previous input and generates a new token, the basic unit of LLM data processing. Since the number of tokens each request will produce is unknown beforehand, evenly distributing requests can lead to uneven actual workloads and cause load imbalances.

LLM load balancing

ASM offers a load balancing algorithm tailored for LLM backends. It evaluates the inference server's internal state using multi-dimensional metrics and balances the workload across multiple servers. Key metrics include the following:

  • Request queue length: Indicates the number of pending requests in the model server's queue, derived from the vllm: num_requests_waiting metric. A shorter queue suggests a higher likelihood of timely request processing.

  • GPU Cache utilization: Reflects the percentage of cache (KV Cache) used by the server to store intermediate inference results. Measured by the vllm: gpu_cache_usage_perc metric, lower utilization means more GPU resources are available for new requests.

This approach outperforms traditional algorithms by ensuring consistent GPU load across inference services, reducing the response latency for the first token of LLM requests (ttft), and enhancing throughput.

Traditional observability

LLM inference services typically interact using the request API format of OpenAI. Most request metadata, like the model name and maximum token count, is in the request body. Traditional routing and observability capabilities, based on request headers and paths, do not parse the request body and thus cannot accommodate traffic allocation or observation based on request model names or token counts.

Inference traffic observability

ASM enhances LLM inference capabilities in access logs and monitoring metrics for requests.

  • Access logs can now include fields for the request's model name, and the number of input and output tokens.

  • Monitoring metrics now feature a custom dimension for the model name.

  • New metrics are available to monitor the token consumption of workloads.

Prerequisites

  • Ensure you have either created an ACK managed cluster with a GPU node pool or selected an ACS cluster in a recommended zone for GPU computing power. For more information, see Create an ACK managed cluster and Create an ACS cluster.

    You can install the ACK Virtual Node component in your ACK managed cluster to utilize ACS GPU computing capabilities. For more information, see ACS GPU computing power in ACK.

  • A cluster is added to the ASM instance of v1.24 or later. For more information, see Add a cluster to an ASM instance.

  • An ingress gateway is created and enable the HTTP service on port 8080. For more information, see Create an ingress gateway.

  • (Optional) A Sidecar is injected into the default namespace. For more information, see Enable automatic sidecar proxy injection.

    Note

    You may skip the Sidecar injection if you do not want to explore the practical operations of observability.

Best practices

The following example demonstrates how to manage LLM inference service traffic in a cluster using ASM by deploying a Llama2 large model based on vLLM.

Step 1: Deploy a sample inference service

  1. Create a file named vllm-service.yaml with the content provided.

    Note

    The image discussed in this topic requires a GPU with more than 16 GiB of video memory. The T4 card type, which has 16 GiB of video memory, does not provide sufficient resources to launch this application. It is recommended to use the A10 card type for ACK clusters and the 8th generation GPU B for ACS clusters. For detailed model information, please submit a ticket for further assistance.

    Due to the large size of the LLM image, it is advisable to pre-store it in ACR and use the internal network address for pulling. Pulling directly from the public network may result in long wait times depending on the cluster EIP bandwidth configuration.

    ACS cluster

    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-llama2-7b-pool
    spec:
      selector:
        app: vllm-llama2-7b-pool
      ports:
      - protocol: TCP
        port: 8000
        targetPort: 8000
      type: ClusterIP
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: chat-template
    data:
      llama-2-chat.jinja: |
        {% if messages[0]['role'] == 'system' %}
          {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
          {% set messages = messages[1:] %}
        {% else %}
            {% set system_message = '' %}
        {% endif %}
    
        {% for message in messages %}
            {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
            {% endif %}
    
            {% if loop.index0 == 0 %}
                {% set content = system_message + message['content'] %}
            {% else %}
                {% set content = message['content'] %}
            {% endif %}
            {% if message['role'] == 'user' %}
                {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
            {% elif message['role'] == 'assistant' %}
                {{ ' ' + content | trim + ' ' + eos_token }}
            {% endif %}
        {% endfor %}
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama2-7b-pool
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: vllm-llama2-7b-pool
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: vllm-llama2-7b-pool
            alibabacloud.com/compute-class: gpu  # Specify a GPU computing power
            alibabacloud.com/compute-qos: default
            alibabacloud.com/gpu-model-series: "example-model" # Specify the GPU model as example-model. Fill in according to the actual situation.
        spec:
          containers:
            - name: lora
              image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
              imagePullPolicy: IfNotPresent
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - "--model"
              - "/model/llama2"
              - "--tensor-parallel-size"
              - "1"
              - "--port"
              - "8000"
              - '--gpu_memory_utilization'
              - '0.8'
              - "--enable-lora"
              - "--max-loras"
              - "4"
              - "--max-cpu-loras"
              - "12"
              - "--lora-modules"
              - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
              - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
              - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
              - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
              - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
              - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
              - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
              - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
              - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
              - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
              - '--chat-template'
              - '/etc/vllm/llama-2-chat.jinja'
              env:
                - name: PORT
                  value: "8000"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              livenessProbe:
                failureThreshold: 2400
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 6000
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  cpu: 16
                  memory: 64Gi
                  nvidia.com/gpu: 1
                requests:
                  cpu: 8
                  memory: 30Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
                - mountPath: /etc/vllm
                  name: chat-template
          restartPolicy: Always
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
            - name: chat-template
              configMap:
                name: chat-template

    ACK cluster

    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-llama2-7b-pool
    spec:
      selector:
        app: vllm-llama2-7b-pool
      ports:
      - protocol: TCP
        port: 8000
        targetPort: 8000
      type: ClusterIP
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: chat-template
    data:
      llama-2-chat.jinja: |
        {% if messages[0]['role'] == 'system' %}
          {% set system_message = '<<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' %}
          {% set messages = messages[1:] %}
        {% else %}
            {% set system_message = '' %}
        {% endif %}
    
        {% for message in messages %}
            {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
                {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
            {% endif %}
    
            {% if loop.index0 == 0 %}
                {% set content = system_message + message['content'] %}
            {% else %}
                {% set content = message['content'] %}
            {% endif %}
            {% if message['role'] == 'user' %}
                {{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}
            {% elif message['role'] == 'assistant' %}
                {{ ' ' + content | trim + ' ' + eos_token }}
            {% endif %}
        {% endfor %}
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama2-7b-pool
      namespace: default
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: vllm-llama2-7b-pool
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: '8000'
            prometheus.io/scrape: 'true'
          labels:
            app: vllm-llama2-7b-pool
        spec:
          containers:
            - name: lora
              image: "registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/llama2-with-lora:v0.2"
              imagePullPolicy: IfNotPresent
              command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
              args:
              - "--model"
              - "/model/llama2"
              - "--tensor-parallel-size"
              - "1"
              - "--port"
              - "8000"
              - '--gpu_memory_utilization'
              - '0.8'
              - "--enable-lora"
              - "--max-loras"
              - "4"
              - "--max-cpu-loras"
              - "12"
              - "--lora-modules"
              - 'sql-lora=/adapters/yard1/llama-2-7b-sql-lora-test_0'
              - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1'
              - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2'
              - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3'
              - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4'
              - 'tweet-summary=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0'
              - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1'
              - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2'
              - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3'
              - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4'
              - '--chat-template'
              - '/etc/vllm/llama-2-chat.jinja'
              env:
                - name: PORT
                  value: "8000"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              livenessProbe:
                failureThreshold: 2400
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 6000
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                initialDelaySeconds: 5
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
                - mountPath: /etc/vllm
                  name: chat-template
          restartPolicy: Always
          schedulerName: default-scheduler
          terminationGracePeriodSeconds: 30
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
            - name: chat-template
              configMap:
                name: chat-template
  2. Deploy the LLM inference service using the kubeconfig file of the cluster on the data plane.

    kubectl apply -f vllm-service.yaml

Step 2: Configure ASM gateway rules

Deploy gateway rules to enable port 8080 listening on the ASM gateway.

  1. Create a file named gateway.yaml with the following content.

    apiVersion: networking.istio.io/v1
    kind: Gateway
    metadata:
      name: llm-inference-gateway
      namespace: default
    spec:
      selector:
        istio: ingressgateway
      servers:
        - hosts:
            - '*'
          port:
            name: http-service
            number: 8080
            protocol: HTTP
  2. Create a gateway rule.

    kubectl apply -f gateway.yaml

Step 3: Configure routing and load balancing for LLM inference service

Note

To compare the performance of traditional load balancing with LLM load balancing, you should complete the steps in (Optional) Compare performance with traditional load balancing using an observability dashboard prior to proceeding with further operations.

  1. Enable routing for LLM inference service using the kubeconfig file of ASM.

    kubectl patch asmmeshconfig default --type=merge --patch='{"spec":{"gatewayAPIInferenceExtension":{"enabled":true}}}'
  2. Deploy the InferencePool resource.

    The InferencePool resource defines workloads for a set of LLM inference service in the cluster using a label selector. ASM will apply vLLM load balancing based on the created InferencePoo.

    1. Create a file named inferencepool.yaml with the following content.

      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: InferencePool
      metadata:
        name: vllm-llama2-7b-pool
      spec:
        targetPortNumber: 8000
        selector:
          app: vllm-llama2-7b-pool

      The following table explains some configuration items:

      Configuration item

      Description

      .spec.targetPortNumber

      The port exposed by the Pod that provides inference services.

      .spec.selector

      The Pod label that provides inference services. The label key must be app, and the value must be the corresponding Service name.

    2. Create an InferencePool resource using the kubeconfig file of the cluster on the data plane.

      kubectl apply -f inferencepool.yaml
  3. Deploy the InferenceModel resource.

    The InferenceModel specifies the traffic distribution policy for specific models within the InferencePool resource.

    1. Create a file named inferencemodel.yaml using the following content.

      apiVersion: inference.networking.x-k8s.io/v1alpha1
      kind: InferenceModel
      metadata:
        name: inferencemodel-sample
      spec:
        modelName: tweet-summary
        poolRef:
          group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama2-7b-pool
        targetModels:
        - name: tweet-summary
          weight: 100

      The following table explains some configuration items:

      Configuration Item

      Description

      .spec.modelName

      Used to match the model parameter in the request.

      .spec.targetModels

      Configure traffic routing rules. In the above example, traffic with the model: tweet-summary in the request header is 100% sent to the Pod running the tweet-summary model.

    2. Create an InferenceModel resource.

      kubectl apply -f inferencemodel.yaml
  4. Create the LLMRoute resource.

    Set up routing rules for the gateway to forward all requests received on port 8080 to the sample LLM inference service by referencing the InferencePool resource.

    1. Create a file named llmroute.yaml with the following content.

      apiVersion: istio.alibabacloud.com/v1
      kind: LLMRoute
      metadata:  
        name: test-llm-route
      spec:
        gateways: 
        - llm-inference-gateway
        host: test.com
        rules:
        - backendRefs:
          - backendRef:
              group: inference.networking.x-k8s.io
              kind: InferencePool
              name: vllm-llama2-7b-pool
    2. Deploy the LLMRoute resource.

      kubectl apply -f llmroute.yaml

Step 4: Verify

Run the following command multiple times to perform a test.

curl -H "host: test.com" ${ASM Gateway IP}:8080/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}' -v

Expected output:

{"id":"cmpl-2fc9a351-d866-422b-b561-874a30843a6b","object":"text_completion","created":1736933141,"model":"tweet-summary","choices":[{"index":0,"text":", I'm a newbie to this forum. Write a summary of the article.\nWrite a summary of the article.\nWrite a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary of the article. Write a summary","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":2,"total_tokens":102,"completion_tokens":100,"prompt_tokens_details":null}}

(Optional) Step 5: Configure observability metrics and dashboard for LLM services

After declaring LLM inference services in the cluster using InferencePool and InferenceMode resources and setting up routing policies, you can view the observability of the LLM inference service through logs and monitoring metrics.

  1. Enable LLM traffic observability feature on the ASM console to collect monitoring metrics.

    1. Improve the observability of LLM inference requests by incorporating additional log fields, metrics, and metric dimensions. For detailed configuration instructions, see Traffic observation: Efficiently manage LLM traffic using ASM.

    2. Upon configuration completion, a model dimension will be incorporated into the ASM monitoring metrics. You can gather these metrics by either utilizing Prometheus within the observability monitoring framework or integrating a self-hosted Prometheus for service mesh monitoring.

    3. ASM introduces two new metrics: asm_llm_proxy_prompt_tokens, representing the number of input tokens, and asm_llm_proxy_completion_tokens, representing the number of output tokens for all requests. You can incorporate these metrics by adding the following rules to Prometheus. For ins, see Other Prometheus service discovery configurations.

      scrape_configs:
      - job_name: asm-envoy-stats-llm
        scrape_interval: 30s
        scrape_timeout: 30s
        metrics_path: /stats/prometheus
        scheme: http
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels:
          - __meta_kubernetes_pod_container_port_name
          action: keep
          regex: .*-envoy-prom
        - source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:15090
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels:
          - __meta_kubernetes_namespace
          action: replace
          target_label: namespace
        - source_labels:
          - __meta_kubernetes_pod_name
          action: replace
          target_label: pod_name
        metric_relabel_configs:
        - action: keep
          source_labels:
          - __name__
          regex: asm_llm_.*
  2. Collect monitoring metrics for vLLM service.

    1. The monitoring metrics for LLM inference requests mainly involve the throughput of external LLM inference requests. Add the annotations for Prometheus collector to the vLLM service pod to collect metrics exposed by the vLLM service and monitor its internal state.

      ...
      annotations:
        prometheus.io/path: /metrics # The HTTP path to which the metrics is exposed.
        prometheus.io/port: "8000" # The port to which the metrics are exposed, which is the listening port of the vLLM Server.
        prometheus.io/scrape: "true" # Whether to scrape the metrics of the current pod.
      ...
    2. Retrieve metrics related to the vLLM service using Prometheus's default service discovery mechanism. For detailed instructions, see Default service discovery.

      Key metrics from the vLLM service provide insight into the internal state of the vLLM workload.

      Metric name

      Description

      vllm:gpu_cache_usage_perc

      The percentage of GPU cache usage of vLLM. When vLLM starts, it will pre-occupy as much GPU video memory as possible for KV cache. For the vLLM server, the lower the cache utilization, the more space the GPU has to allocate resources to new requests.

      vllm:request_queue_time_seconds_sum

      The time spent in the waiting state queue. After the LLM inference request arrives at the vLLM server, it may not be processed immediately, but needs to wait for the vLLM scheduler to schedule the prefill and decode.

      vllm:num_requests_running

      vllm:num_requests_waiting

      vllm:num_requests_swapped

      The number of requests running inference, waiting, and swapped to memory. It can be used to evaluate the current request pressure of the vLLM service.

      vllm:avg_generation_throughput_toks_per_s

      vllm:avg_prompt_throughput_toks_per_s

      The number of tokens consumed by the prefill stage and generated by the decode stage per second.

      vllm:time_to_first_token_seconds_bucket

      The latency level from the time the request is sent to the vLLM service to the time the first token is responded to. This metric usually represents the time it takes for the client to get the first response after outputting the request content and is an important metric affecting the LLM user experience.

  3. Configure a Grafana dashboard to monitor LLM inference services.

    Observe LLM inference services deployed with vLLM through the Grafana dashboard:

    • Monitor the request rate and token throughput using ASM monitoring metrics;

    • Assess the internal state of the workloads for the LLM inference services with vLLM monitoring metrics.

    You can create a data source (Prometheus instances) in the Grafana console. Ensure the monitoring metrics for ASM and vLLM has been collected by the Prometheus instances.

    To create an observability dashboard for LLM inference services, import the content provided below into Grafana.

    Expand to view JSON content

    {
      "annotations": {
        "list": [
          {
            "builtIn": 1,
            "datasource": {
              "type": "grafana",
              "uid": "-- Grafana --"
            },
            "enable": true,
            "hide": true,
            "iconColor": "rgba(0, 211, 255, 1)",
            "name": "Annotations & Alerts",
            "target": {
              "limit": 100,
              "matchAny": false,
              "tags": [],
              "type": "dashboard"
            },
            "type": "dashboard"
          }
        ]
      },
      "description": "Monitoring vLLM Inference Server",
      "editable": true,
      "fiscalYearStartMonth": 0,
      "graphTooltip": 0,
      "id": 49,
      "links": [],
      "liveNow": false,
      "panels": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 0
          },
          "id": 23,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "exemplar": false,
              "expr": "sum by(model) (rate(istio_requests_total{model!=\"unknown\"}[$__rate_interval]))",
              "instant": false,
              "interval": "",
              "legendFormat": "__auto",
              "range": true,
              "refId": "A"
            }
          ],
          "title": "Request Rate",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "description": "",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 0
          },
          "id": 20,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "editorMode": "code",
              "expr": "sum by(llmproxy_model) (rate(asm_llm_proxy_completion_tokens{}[$__rate_interval]))",
              "instant": false,
              "legendFormat": "generate tokens (from proxy)",
              "range": true,
              "refId": "A"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "editorMode": "code",
              "expr": "sum by(llmproxy_model) (rate(asm_llm_proxy_prompt_tokens{}[$__rate_interval]))",
              "hide": false,
              "instant": false,
              "legendFormat": "prompt tokens (from proxy)",
              "range": true,
              "refId": "B"
            }
          ],
          "title": "Tokens Rate",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "min": -1,
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "percentunit"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 8
          },
          "id": 17,
          "options": {
            "colorMode": "value",
            "graphMode": "area",
            "justifyMode": "auto",
            "orientation": "auto",
            "reduceOptions": {
              "calcs": [
                "mean"
              ],
              "fields": "",
              "values": false
            },
            "textMode": "auto"
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "expr": "avg(vllm:gpu_cache_usage_perc)",
              "hide": false,
              "instant": false,
              "range": true,
              "refId": "A"
            }
          ],
          "title": "Average gpu cache usage",
          "type": "stat"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 8
          },
          "id": 18,
          "options": {
            "colorMode": "value",
            "graphMode": "area",
            "justifyMode": "auto",
            "orientation": "auto",
            "reduceOptions": {
              "calcs": [
                "mean"
              ],
              "fields": "",
              "values": false
            },
            "textMode": "auto"
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "avg(rate(vllm:request_queue_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval]))",
              "hide": false,
              "instant": false,
              "range": true,
              "refId": "C"
            }
          ],
          "title": "Average Queue Time",
          "type": "stat"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Percentage of used cache blocks by vLLM.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "percentunit"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 16
          },
          "id": 4,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "expr": "sum by(kubernetes_pod_name) (vllm:gpu_cache_usage_perc{model_name=\"$model_name\"})",
              "instant": false,
              "legendFormat": "GPU Cache Usage ({{kubernetes_pod_name}})",
              "range": true,
              "refId": "A"
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "vllm:cpu_cache_usage_perc{model_name=\"$model_name\"}",
              "hide": false,
              "instant": false,
              "legendFormat": "CPU Cache Usage",
              "range": true,
              "refId": "B"
            }
          ],
          "title": "Cache Utilization",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "seconds",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 16
          },
          "id": 14,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "edx8memhpd9tsa"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "sum by(kubernetes_pod_name) (rate(vllm:request_queue_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval]))",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "__auto",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Queue Time",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "P50, P90, P95, and P99 TTFT latency in seconds.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 24
          },
          "id": 5,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P99",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P95",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P90",
              "range": true,
              "refId": "C",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P50",
              "range": true,
              "refId": "D",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "builder",
              "expr": "sum by(kubernetes_pod_name) (rate(vllm:time_to_first_token_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])) / sum by(kubernetes_pod_name) (rate(vllm:time_to_first_token_seconds_count{model_name=\"$model_name\"}[$__rate_interval]))",
              "hide": false,
              "instant": false,
              "legendFormat": "Average ({{kubernetes_pod_name}})",
              "range": true,
              "refId": "E"
            }
          ],
          "title": "Time To First Token Latency",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Number of tokens processed per second",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 24
          },
          "id": 8,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "rate(vllm:prompt_tokens_total{model_name=\"$model_name\"}[$__rate_interval])",
              "fullMetaSearch": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "Prompt Tokens/Sec",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "sum by(kubernetes_pod_name) (rate(vllm:generation_tokens_total{model_name=\"$model_name\"}[$__rate_interval]))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "Generation Tokens/Sec ({{kubernetes_pod_name}})",
              "range": true,
              "refId": "B",
              "useBackend": false
            }
          ],
          "title": "Token Throughput",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
          },
          "description": "End to end request latency measured in seconds.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 32
          },
          "id": 9,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P99",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P95",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P90",
              "range": true,
              "refId": "C",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P50",
              "range": true,
              "refId": "D",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
              },
              "editorMode": "code",
              "expr": "rate(vllm:e2e_request_latency_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])\n/\nrate(vllm:e2e_request_latency_seconds_count{model_name=\"$model_name\"}[$__rate_interval])",
              "hide": false,
              "instant": false,
              "legendFormat": "Average",
              "range": true,
              "refId": "E"
            }
          ],
          "title": "E2E Request Latency",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Number of requests in RUNNING, WAITING, and SWAPPED state",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "none"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 32
          },
          "id": 3,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "vllm:num_requests_running{model_name=\"$model_name\"}",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Num Running",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "vllm:num_requests_swapped{model_name=\"$model_name\"}",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Num Swapped",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "exemplar": false,
              "expr": "sum by(kubernetes_pod_name) (vllm:num_requests_waiting{model_name=\"$model_name\"})",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Num Waiting for {{kubernetes_pod_name}}",
              "range": true,
              "refId": "C",
              "useBackend": false
            }
          ],
          "title": "Scheduler State",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Inter token latency in seconds.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              },
              "unit": "s"
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 40
          },
          "id": 10,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.99, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P99",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P95",
              "range": true,
              "refId": "B",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P90",
              "range": true,
              "refId": "C",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket{model_name=\"$model_name\"}[$__rate_interval])))",
              "fullMetaSearch": false,
              "hide": false,
              "includeNullMetadata": false,
              "instant": false,
              "legendFormat": "P50",
              "range": true,
              "refId": "D",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "rate(vllm:time_per_output_token_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])\n/\nrate(vllm:time_per_output_token_seconds_count{model_name=\"$model_name\"}[$__rate_interval])",
              "hide": false,
              "instant": false,
              "legendFormat": "Mean",
              "range": true,
              "refId": "E"
            }
          ],
          "title": "Time Per Output Token Latency",
          "type": "timeseries"
        },
        {
          "datasource": {
            "default": false,
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": [
              {
                "__systemRef": "hideSeriesFrom",
                "matcher": {
                  "id": "byNames",
                  "options": {
                    "mode": "exclude",
                    "names": [
                      "Decode"
                    ],
                    "prefix": "All except:",
                    "readOnly": true
                  }
                },
                "properties": [
                  {
                    "id": "custom.hideFrom",
                    "value": {
                      "legend": false,
                      "tooltip": false,
                      "viz": true
                    }
                  }
                ]
              }
            ]
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 40
          },
          "id": 15,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "edx8memhpd9tsa"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "rate(vllm:request_prefill_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Prefill",
              "range": true,
              "refId": "A",
              "useBackend": false
            },
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "editorMode": "code",
              "expr": "rate(vllm:request_decode_time_seconds_sum{model_name=\"$model_name\"}[$__rate_interval])",
              "hide": false,
              "instant": false,
              "legendFormat": "Decode",
              "range": true,
              "refId": "B"
            }
          ],
          "title": "Requests Prefill and Decode Time",
          "type": "timeseries"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Heatmap of request prompt length",
          "fieldConfig": {
            "defaults": {
              "custom": {
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "scaleDistribution": {
                  "type": "linear"
                }
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 48
          },
          "id": 12,
          "options": {
            "calculate": false,
            "cellGap": 1,
            "cellValues": {
              "unit": "none"
            },
            "color": {
              "exponent": 0.5,
              "fill": "dark-orange",
              "min": 0,
              "mode": "scheme",
              "reverse": false,
              "scale": "exponential",
              "scheme": "Spectral",
              "steps": 64
            },
            "exemplars": {
              "color": "rgba(255,0,255,0.7)"
            },
            "filterValues": {
              "le": 1e-9
            },
            "legend": {
              "show": true
            },
            "rowsFrame": {
              "layout": "auto",
              "value": "Request count"
            },
            "tooltip": {
              "mode": "single",
              "show": true,
              "showColorScale": false,
              "yHistogram": true
            },
            "yAxis": {
              "axisLabel": "Prompt Length",
              "axisPlacement": "left",
              "reverse": false,
              "unit": "none"
            }
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "sum by(le) (increase(vllm:request_prompt_tokens_bucket{model_name=\"$model_name\"}[$__rate_interval]))",
              "format": "heatmap",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "{{le}}",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Request Prompt Length",
          "type": "heatmap"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Heatmap of request generation length",
          "fieldConfig": {
            "defaults": {
              "custom": {
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "scaleDistribution": {
                  "type": "linear"
                }
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 48
          },
          "id": 13,
          "options": {
            "calculate": false,
            "cellGap": 1,
            "cellValues": {
              "unit": "none"
            },
            "color": {
              "exponent": 0.5,
              "fill": "dark-orange",
              "min": 0,
              "mode": "scheme",
              "reverse": false,
              "scale": "exponential",
              "scheme": "Spectral",
              "steps": 64
            },
            "exemplars": {
              "color": "rgba(255,0,255,0.7)"
            },
            "filterValues": {
              "le": 1e-9
            },
            "legend": {
              "show": true
            },
            "rowsFrame": {
              "layout": "auto",
              "value": "Request count"
            },
            "tooltip": {
              "mode": "single",
              "show": true,
              "showColorScale": false,
              "yHistogram": true
            },
            "yAxis": {
              "axisLabel": "Generation Length",
              "axisPlacement": "left",
              "reverse": false,
              "unit": "none"
            }
          },
          "pluginVersion": "10.0.9",
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "sum by(le) (increase(vllm:request_generation_tokens_bucket{model_name=\"$model_name\"}[$__rate_interval]))",
              "format": "heatmap",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "{{le}}",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Request Generation Length",
          "type": "heatmap"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "description": "Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": []
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 0,
            "y": 56
          },
          "id": 11,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "${DS_PROMETHEUS}"
              },
              "disableTextWrap": false,
              "editorMode": "builder",
              "expr": "sum by(finished_reason) (increase(vllm:request_success_total{model_name=\"$model_name\"}[$__rate_interval]))",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "interval": "",
              "legendFormat": "__auto",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Finish Reason",
          "type": "timeseries"
        },
        {
          "datasource": {
            "default": false,
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "palette-classic"
              },
              "custom": {
                "axisCenteredZero": false,
                "axisColorMode": "text",
                "axisLabel": "",
                "axisPlacement": "auto",
                "barAlignment": 0,
                "drawStyle": "line",
                "fillOpacity": 0,
                "gradientMode": "none",
                "hideFrom": {
                  "legend": false,
                  "tooltip": false,
                  "viz": false
                },
                "lineInterpolation": "linear",
                "lineWidth": 1,
                "pointSize": 5,
                "scaleDistribution": {
                  "type": "linear"
                },
                "showPoints": "auto",
                "spanNulls": false,
                "stacking": {
                  "group": "A",
                  "mode": "none"
                },
                "thresholdsStyle": {
                  "mode": "off"
                }
              },
              "mappings": [],
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "green"
                  },
                  {
                    "color": "red",
                    "value": 80
                  }
                ]
              }
            },
            "overrides": [
              {
                "__systemRef": "hideSeriesFrom",
                "matcher": {
                  "id": "byNames",
                  "options": {
                    "mode": "exclude",
                    "names": [
                      "Tokens"
                    ],
                    "prefix": "All except:",
                    "readOnly": true
                  }
                },
                "properties": [
                  {
                    "id": "custom.hideFrom",
                    "value": {
                      "legend": false,
                      "tooltip": false,
                      "viz": true
                    }
                  }
                ]
              }
            ]
          },
          "gridPos": {
            "h": 8,
            "w": 12,
            "x": 12,
            "y": 56
          },
          "id": 16,
          "options": {
            "legend": {
              "calcs": [],
              "displayMode": "list",
              "placement": "bottom",
              "showLegend": true
            },
            "tooltip": {
              "mode": "single",
              "sort": "none"
            }
          },
          "targets": [
            {
              "datasource": {
                "type": "prometheus",
                "uid": "edx8memhpd9tsa"
              },
              "disableTextWrap": false,
              "editorMode": "code",
              "expr": "rate(vllm:request_max_num_generation_tokens_sum{model_name=\"$model_name\"}[$__rate_interval])",
              "fullMetaSearch": false,
              "includeNullMetadata": true,
              "instant": false,
              "legendFormat": "Tokens",
              "range": true,
              "refId": "A",
              "useBackend": false
            }
          ],
          "title": "Max Generation Token in Sequence Group",
          "type": "timeseries"
        }
      ],
      "refresh": false,
      "schemaVersion": 38,
      "style": "dark",
      "tags": [],
      "templating": {
        "list": [
          {
            "current": {
              "selected": true,
              "text": "prom-cec64713b1aab44d0b49236b6f54cd671",
              "value": "prom-cec64713b1aab44d0b49236b6f54cd671"
            },
            "hide": 0,
            "includeAll": false,
            "label": "datasource",
            "multi": false,
            "name": "DS_PROMETHEUS",
            "options": [],
            "query": "prometheus",
            "queryValue": "",
            "refresh": 1,
            "regex": "",
            "skipUrlSync": false,
            "type": "datasource"
          },
          {
            "current": {
              "selected": false,
              "text": "/model/llama2",
              "value": "/model/llama2"
            },
            "datasource": {
              "type": "prometheus",
              "uid": "prom-cec64713b1aab44d0b49236b6f54cd671"
            },
            "definition": "label_values(model_name)",
            "hide": 0,
            "includeAll": false,
            "label": "model_name",
            "multi": false,
            "name": "model_name",
            "options": [],
            "query": {
              "query": "label_values(model_name)",
              "refId": "StandardVariableQuery"
            },
            "refresh": 1,
            "regex": "",
            "skipUrlSync": false,
            "sort": 0,
            "type": "query"
          }
        ]
      },
      "time": {
        "from": "2025-01-10T04:00:36.511Z",
        "to": "2025-01-10T04:18:26.639Z"
      },
      "timepicker": {},
      "timezone": "",
      "title": "vLLM",
      "uid": "b281712d-8bff-41ef-9f3f-71ad43c05e9c",
      "version": 10,
      "weekStart": ""
    }

    You can see the dashboard similar to the following:

    image

(Optional) Compare performance with traditional load balancing using an observability dashboard

Using the observability dashboard, you can directly compare the performance of LLM inference service load balancing with traditional load balancing algorithms, including metrics like cache utilization, request queue time, token throughput, and ttft.

Note

Once you have completed Step 3, you can run the following command to purge the resources you created.

kubectl delete inferencemodel --all
kubectl delete inferencepool --all
kubectl delete llmroute --all

Otherwise, after completing steps 1 and 2, ensure you clear any created virtual services before proceeding to Step 3.

  1. Create a virtual service to provide routing and traditional load balancing for the sample LLM inference service by executing the following command.

    kubectl apply -f- <<EOF
    apiVersion: networking.istio.io/v1
    kind: VirtualService
    metadata:
      name: llm-vs
      namespace: default
    spec:
      gateways:
        - default/llm-inference-gateway
      hosts:
        - '*'
      http:
        - name: any-host
          route:
            - destination:
                host: vllm-llama2-7b-pool.default.svc.cluster.local
                port:
                  number: 8000
    EOF
  2. Perform stress testing on the LLM inference service using the llmperf tool.

  3. Analyze the two routing and load balancing policies through the Grafana dashboard.

    The comparison shows that LLM inference service load balancing provides better latency, throughput, and cache utilization.

    image