Container Service for Kubernetes: Configure monitoring for LLM inference services

Last Updated: Mar 26, 2026

In production, LLM inference services require observability at three levels: model-wide performance, per-pod behavior, and GPU utilization. This topic describes how to connect the ARMS monitoring integration, label your pods for metric collection, and navigate the built-in dashboard.

Prerequisites

Before you begin, ensure that you have:

Billing

Metrics collected from LLM inference services are sent to Managed Service for Prometheus as custom metrics, which incur additional charges. Costs depend on cluster size, number of applications, and data volume. To track consumption, use the usage query feature.

Step 1: Connect the monitoring integration

  1. Log on to the ARMS console.

  2. In the left navigation pane, click Integration Center. In the AI section, click the Cloud-Native AI Suite LLM Inference card.

  3. On the Cloud-Native AI Suite LLM Inference panel, select the target cluster.

    If the component is already installed, skip this step.
  4. In the Configuration Information section, configure the following parameters and click OK.

    | Parameter | Description | Default |
    | --- | --- | --- |
    | Access Name | A unique name for this monitoring integration. Optional. | |
    | Namespace | The namespaces from which to collect metrics. If left blank, metrics are collected from all namespaces that meet the criteria. Optional. | |
    | Pod Port | The name of the port on the inference service pod used for metric collection. | http |
    | Metric Collection Path | The HTTP path on the pod that exposes metrics in Prometheus format. | /metrics |
    | Collection Interval (seconds) | How often monitoring data is collected. | |
  5. After clicking OK, verify that the integration appears on the Integration Management page of the ARMS console.

For more information about Integration Center, see Integration guide.
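Before relying on the integration, it can help to confirm that the body served at the Metric Collection Path really is Prometheus text format. The following is a minimal sketch of such a check, not part of ARMS. The function name `parse_metric_samples` and the `demo_*` metric names are hypothetical; the real metric names depend on the inference engine.

```python
def parse_metric_samples(text: str) -> dict[str, float]:
    """Parse Prometheus text-format lines into {series: value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    An optional timestamp after the value is ignored.
    """
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The series (name plus optional {labels}) ends at the last closing
        # brace, or at the first space when the series has no labels.
        end = line.rindex("}") + 1 if "}" in line else line.index(" ")
        series, fields = line[:end], line[end:].split()
        samples[series] = float(fields[0])
    return samples


# Hypothetical sample of what a /metrics response body might look like.
SAMPLE = """\
# HELP demo_requests_running Illustrative gauge, not a real engine metric.
# TYPE demo_requests_running gauge
demo_requests_running{model_name="demo"} 3.0
demo_requests_waiting{model_name="demo"} 1.0
"""

samples = parse_metric_samples(SAMPLE)
print(samples)
```

In practice the text would come from an HTTP GET against the pod port and path configured above. A body that fails to parse this way is a hint that the path or port is wrong.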

Step 2: Enable metric collection on your inference service pods

Metric collection is not enabled by default. Without the following two labels, ARMS does not include a pod in its collection targets.

Add these labels to the pod spec in your deployment manifest:

spec:
  template:
    metadata:
      labels:
        alibabacloud.com/inference-workload: <workload_name>
        alibabacloud.com/inference-backend: <backend>

| Label | Purpose | Description |
| --- | --- | --- |
| alibabacloud.com/inference-workload | Identifies the inference service within a namespace | Set this to the name of the workload resource (such as a StatefulSet, Deployment, or RoleBasedGroup) that manages the pods. When present, ARMS adds the pod to its metric collection targets. |
| alibabacloud.com/inference-backend | Specifies the inference engine | Supported values: vllm (standalone or distributed vLLM), sglang (standalone or distributed SGLang), vllm-pd (vLLM with prefill/decode (PD) disaggregation), sglang-pd (SGLang with PD disaggregation) |
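Because a missing or misspelled label silently excludes a pod from collection, a small pre-deployment check can be useful. The sketch below validates a pod template's label dictionary against the two labels and the four backend values listed above. The function name `check_inference_labels` is hypothetical, and the check does not contact ARMS or the cluster.

```python
WORKLOAD_LABEL = "alibabacloud.com/inference-workload"
BACKEND_LABEL = "alibabacloud.com/inference-backend"
SUPPORTED_BACKENDS = {"vllm", "sglang", "vllm-pd", "sglang-pd"}


def check_inference_labels(labels: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means both labels look valid."""
    problems = []
    if not labels.get(WORKLOAD_LABEL):
        problems.append(f"missing label {WORKLOAD_LABEL}")
    backend = labels.get(BACKEND_LABEL)
    if backend is None:
        problems.append(f"missing label {BACKEND_LABEL}")
    elif backend not in SUPPORTED_BACKENDS:
        problems.append(f"unsupported backend {backend!r}")
    return problems


# Example: labels as they would appear under spec.template.metadata.labels.
good = {WORKLOAD_LABEL: "my-llm", BACKEND_LABEL: "vllm"}
bad = {WORKLOAD_LABEL: "my-llm", BACKEND_LABEL: "VLLM"}
print(check_inference_labels(good))
print(check_inference_labels(bad))
```

The comparison is deliberately case-sensitive, on the assumption that the label values must match the documented spellings exactly.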

For complete deployment examples that include these labels, see:

Step 3: View the monitoring dashboard

  1. Log on to the ACK console.

  2. In the left navigation pane, click Clusters.

  3. On the Clusters page, click the target ACK or Alibaba Cloud Container Compute Service (ACS) cluster. In the left navigation pane, choose Operations > Prometheus Monitoring.

  4. On the Prometheus Monitoring page, choose Others > LLM Inference Dashboard.

  5. Use the dashboard filters to select the namespace, workload_name, and model_name you want to inspect.

Metric sources

The dashboard aggregates metrics from the following inference engines:

Dashboard panel descriptions

The dashboard provides a hierarchical view of your service's performance. It assumes a Kubernetes workload deploys an inference service, where the service may contain multiple instances. Each instance consists of one or more pods, and each instance can serve one or more models — for example, a base model combined with LoRA adapters.
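The hierarchy described above can be pictured as a small data model. This is only an illustrative sketch; the class and field names (`Workload`, `Instance`, `pods_serving`) are hypothetical and do not correspond to any ARMS or Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One serving instance: one or more pods, serving one or more models."""
    pods: list[str]
    models: list[str]


@dataclass
class Workload:
    """A Kubernetes workload that deploys an inference service."""
    name: str
    instances: list[Instance] = field(default_factory=list)

    def pods_serving(self, model: str) -> list[str]:
        """All pods, across instances, that serve the given model."""
        return [
            pod
            for inst in self.instances
            if model in inst.models
            for pod in inst.pods
        ]


svc = Workload(
    name="my-llm",
    instances=[
        Instance(pods=["pod-0", "pod-1"], models=["base", "lora-a"]),
        Instance(pods=["pod-2"], models=["base"]),
    ],
)
print(svc.pods_serving("base"))
print(svc.pods_serving("lora-a"))
```

In this picture, a model-level panel aggregates over everything `pods_serving` returns for a model, while a pod-level panel reports each pod separately.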

The dashboard is organized into three sections:

Model-level

Aggregated metrics for a specific model across all its serving instances. Use these panels to assess overall model performance and health.

| Panel | Description | Inference engine |
| --- | --- | --- |
| QPS | Total requests processed per second across all service instances. | vllm, sglang |
| Request success rate | Percentage of successfully processed requests. | vllm |
| E2E latency | Average end-to-end request processing time. | vllm, sglang |
| Token throughput | Rate of input (prompt) and output (generation) tokens processed per second. | vllm, sglang |
| Token throughput per GPU | Average token throughput per GPU card for inputs and outputs. | vllm, sglang |
| Request prompt length | Distribution of input token length (average and quantiles). | vllm (average and quantiles), sglang (average only) |
| Request generation length | Distribution of output token length (average and quantiles). | vllm (average and quantiles), sglang (average only) |
| TTFT (Time To First Token) | Latency to generate the first output token (average and quantiles). | vllm, sglang |
| TPOT (Time Per Output Token) | Latency to generate each subsequent output token (average and quantiles). | vllm, sglang |
| KV cache hit ratio | Average KV cache hit ratio per instance. Effective only when prefix cache is enabled in the inference framework. | vllm, sglang |
| Request prompt length heatmap | Heatmap of input token length distribution. | vllm |
| Request generation length heatmap | Heatmap of output token length distribution. | vllm |
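Panels such as QPS and the latency averages are typically derived from cumulative Prometheus counters and histogram `_sum`/`_count` series. The exact queries behind this dashboard are not documented here, so the following is only a sketch of the usual arithmetic. The function names are hypothetical, and counter resets are not handled.

```python
from typing import Optional


def per_second_rate(count_start: float, count_end: float, window_seconds: float) -> float:
    """Average per-second increase of a cumulative counter over a window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (count_end - count_start) / window_seconds


def average_from_histogram(
    sum_start: float, sum_end: float, count_start: float, count_end: float
) -> Optional[float]:
    """Mean observation over a window, from a histogram's _sum and _count.

    Returns None when no observations were recorded in the window.
    """
    delta_count = count_end - count_start
    if delta_count <= 0:
        return None
    return (sum_end - sum_start) / delta_count


# 600 requests finished during a 60-second window.
qps = per_second_rate(1000, 1600, 60)
# Those requests added 30 seconds of total latency across 60 observations.
avg_latency = average_from_histogram(50.0, 80.0, 100, 160)
print(qps, avg_latency)
```

The same two functions cover token throughput (a rate over token counters) and average TTFT or TPOT (a mean over the corresponding histograms).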

Pod-level

Per-pod performance metrics. Use these panels to analyze load distribution and identify performance variation across pods.

| Panel | Description | Inference engine |
| --- | --- | --- |
| E2E request latency | Average request processing time per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Token throughput | Input and output tokens processed per second, per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Time To First Token latency | Latency to generate the first output token per pod (average and quantiles). | vllm, sglang, vllm-pd, sglang-pd |
| Time Per Output Token latency | Latency to generate subsequent output tokens per pod (average and quantiles). | vllm, sglang, vllm-pd, sglang-pd |
| KV cache utilization | Percentage of KV cache in use per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Scheduler state | Number of requests in Waiting, Running, or Swapped states per pod. When using sglang or sglang-pd, only Waiting and Running are supported. | vllm, sglang, vllm-pd, sglang-pd |
| Finish reason | Number of requests that finished for a specific reason during a monitoring period: abort (stopped before completion) or length (maximum output length reached). | vllm, vllm-pd |
| Queue time | Average time a request spends in the scheduler queue per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Requests prefill and decode time | Average time spent in the prefill and decode phases per pod. | vllm, vllm-pd |
| KV cache hit ratio | KV cache hit ratio per pod. Effective only when prefix cache is enabled in the inference framework. | vllm, sglang, vllm-pd, sglang-pd |
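Several panels report quantiles alongside averages. When latencies are exported as Prometheus histograms, a quantile is usually estimated by linear interpolation inside cumulative buckets, which is the approach PromQL's `histogram_quantile` takes. The sketch below reproduces that estimate. The function name and bucket values are hypothetical, and it is an assumption that the dashboard computes its quantiles this way.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with the +Inf bucket. The lower bound of the first
    bucket is assumed to be 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since no upper limit is known.
                return prev_bound
            # Interpolate linearly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical TTFT buckets in seconds: (upper bound, cumulative count).
ttft_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = estimate_quantile(0.5, ttft_buckets)
p99 = estimate_quantile(0.99, ttft_buckets)
print(p50, p99)
```

Because the estimate depends on bucket boundaries, a quantile shown on a panel is only as precise as the histogram buckets the engine exports.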

GPU stats (associated with pod)

GPU utilization metrics for each pod. Use these panels to understand how each inference service pod consumes GPU resources.

| Panel | Description | Inference engine |
| --- | --- | --- |
| Pods GPU Tensor Active | Average percentage of cycles that the Tensor (HMMA/IMMA) pipeline is active across each GPU. This is a time-averaged value, not an instantaneous reading. | vllm, sglang, vllm-pd, sglang-pd |
| Pods GPU Utilization | Average overall utilization of each GPU. | vllm, sglang, vllm-pd, sglang-pd |
| Pods GPU SM Active | Average utilization of Streaming Multiprocessors (SM) across each GPU. | vllm, sglang, vllm-pd, sglang-pd |
| Pods GPU Memory Copy Utilization | Average memory bandwidth utilization of each GPU. | vllm, sglang, vllm-pd, sglang-pd |
| Pods Used GPU Memory | Average amount of GPU memory in use per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Pods GPU DRAM Active | Frequency of memory instruction execution across each GPU during a sample period. | vllm, sglang, vllm-pd, sglang-pd |