In production, LLM inference services require observability at three levels: model-wide performance, per-pod behavior, and GPU utilization. This topic describes how to connect the ARMS monitoring integration, label your pods for metric collection, and navigate the built-in dashboard.
Prerequisites
Before you begin, ensure that you have:
Managed Service for Prometheus enabled in your Container Service for Kubernetes (ACK) cluster
Billing
Metrics collected from LLM inference services are sent to Managed Service for Prometheus as custom metrics, which incur additional charges. Costs depend on cluster size, number of applications, and data volume. To track consumption, use the usage query feature.
Step 1: Connect the monitoring integration
Log on to the ARMS console.
In the left navigation pane, click Integration Center. In the AI section, click the Cloud-Native AI Suite LLM Inference card.
On the Cloud-Native AI Suite LLM Inference panel, select the target cluster.
If the component is already installed, skip this step.
In the Configuration Information section, configure the following parameters and click OK.

| Parameter | Description | Default |
|---|---|---|
| Access Name | A unique name for this monitoring integration. Optional. | — |
| Namespace | The namespaces from which to collect metrics. If left blank, metrics are collected from all namespaces that meet the criteria. Optional. | — |
| Pod Port | The name of the port on the inference service pod used for metric collection. | http |
| Metric Collection Path | The HTTP path on the pod that exposes metrics in Prometheus format. | /metrics |
| Collection Interval (seconds) | How often monitoring data is collected. | — |

After clicking OK, verify that the integration appears on the Integration Management page of the ARMS console.
For more information about Integration Center, see Integration guide.
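Before relying on the integration, it can help to confirm that a pod really serves Prometheus-format text on the configured collection path. The following is a minimal sketch of one way to list the metric names in a scraped payload. The `metric_names` helper and the sample payload are illustrative assumptions, and the metric names an engine actually exposes depend on its version.

```python
def metric_names(exposition_text: str) -> set[str]:
    """Return the sample names found in Prometheus text-format output."""
    names = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="v"} value [timestamp]
        name = line.split("{", 1)[0].split()[0]
        names.add(name)
    return names


# Illustrative payload (assumed), shaped like what an engine might serve on /metrics.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3
vllm:num_requests_waiting{model_name="demo"} 1
"""

print(sorted(metric_names(SAMPLE)))
```

In practice the text would come from the pod itself, for example through `kubectl port-forward` and an HTTP GET on the collection path.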
Step 2: Enable metric collection on your inference service pods
Metric collection is not enabled by default. Without the following two labels, ARMS does not include a pod in its collection targets.
Add these labels to the pod spec in your deployment manifest:
    spec:
      template:
        metadata:
          labels:
            alibabacloud.com/inference-workload: <workload_name>
            alibabacloud.com/inference-backend: <backend>

| Label | Purpose | Description |
|---|---|---|
| alibabacloud.com/inference-workload | Identifies the inference service within a namespace | Set this to the name of the workload resource (such as a StatefulSet, Deployment, or RoleBasedGroup) that manages the pods. When present, ARMS adds the pod to its metric collection targets. |
| alibabacloud.com/inference-backend | Specifies the inference engine | Supported values: vllm (standalone or distributed vLLM), sglang (standalone or distributed SGLang), vllm-pd (vLLM with prefill/decode (PD) disaggregation), sglang-pd (SGLang with PD disaggregation) |
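Because a missing or misspelled label silently excludes a pod from collection, a small pre-deployment check can be useful. The sketch below assumes the manifest has already been parsed into a Python dict (for example from YAML). The `check_inference_labels` function and the `qwen-inference` workload name are hypothetical and not part of any ARMS or ACK tooling.

```python
SUPPORTED_BACKENDS = {"vllm", "sglang", "vllm-pd", "sglang-pd"}
WORKLOAD_LABEL = "alibabacloud.com/inference-workload"
BACKEND_LABEL = "alibabacloud.com/inference-backend"


def check_inference_labels(manifest: dict) -> list[str]:
    """Return a list of problems with the pod template labels (empty if none)."""
    labels = (
        manifest.get("spec", {})
        .get("template", {})
        .get("metadata", {})
        .get("labels", {})
    ) or {}
    problems = []
    if not labels.get(WORKLOAD_LABEL):
        problems.append(f"missing label {WORKLOAD_LABEL}")
    backend = labels.get(BACKEND_LABEL)
    if backend is None:
        problems.append(f"missing label {BACKEND_LABEL}")
    elif backend not in SUPPORTED_BACKENDS:
        problems.append(f"unsupported backend {backend!r}")
    return problems


# Hypothetical Deployment, already parsed into a dict.
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "qwen-inference"},
    "spec": {
        "template": {
            "metadata": {
                "labels": {
                    WORKLOAD_LABEL: "qwen-inference",
                    BACKEND_LABEL: "vllm",
                }
            }
        }
    },
}

print(check_inference_labels(deployment))  # []
```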
For complete deployment examples that include these labels, see:
Step 3: View the monitoring dashboard
Log on to the ACK console.
In the left navigation pane, click Clusters.
On the Clusters page, click the target ACK or Alibaba Cloud Container Compute Service (ACS) cluster. In the left navigation pane, choose Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, choose Others > LLM Inference Dashboard.
Use the dashboard filters to select the namespace, workload_name, and model_name you want to inspect.
Metric sources
The dashboard aggregates metrics from the following inference engines:
vLLM metrics: See the official vLLM metrics list.
SGLang metrics: See the official SGLang metrics list.
Dashboard panel descriptions
The dashboard provides a hierarchical view of your service's performance. It assumes a Kubernetes workload deploys an inference service, where the service may contain multiple instances. Each instance consists of one or more pods, and each instance can serve one or more models — for example, a base model combined with LoRA adapters.
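The hierarchy described above can be pictured as a small data model. This is only an illustrative sketch: the class names, field names, and sample pod and model names are assumptions, not structures that the dashboard or ARMS exposes.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One serving instance: one or more pods, serving one or more models."""
    pods: list[str]
    models: list[str]


@dataclass
class InferenceService:
    """The inference service deployed by a single Kubernetes workload."""
    workload_name: str
    instances: list[Instance] = field(default_factory=list)

    def pods_serving(self, model_name: str) -> list[str]:
        """All pods, across instances, that serve the given model."""
        return [
            pod
            for inst in self.instances
            if model_name in inst.models
            for pod in inst.pods
        ]


# Hypothetical names, for illustration only.
service = InferenceService(
    workload_name="demo-llm",
    instances=[
        Instance(pods=["demo-llm-0"], models=["base", "base-lora-a"]),
        Instance(pods=["demo-llm-1-0", "demo-llm-1-1"], models=["base"]),
    ],
)

print(service.pods_serving("base"))         # ['demo-llm-0', 'demo-llm-1-0', 'demo-llm-1-1']
print(service.pods_serving("base-lora-a"))  # ['demo-llm-0']
```

A model-level aggregate spans every pod returned for that model, whereas a pod-level figure describes a single entry in the list.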
The dashboard is organized into three sections:
Model-level
Aggregated metrics for a specific model across all its serving instances. Use these panels to assess overall model performance and health.

| Panel | Description | Inference engine |
|---|---|---|
| QPS | Total requests processed per second across all service instances. | vllm, sglang |
| Request success rate | Percentage of successfully processed requests. | vllm |
| E2E latency | Average end-to-end request processing time. | vllm, sglang |
| Token throughput | Rate of input (prompt) and output (generation) tokens processed per second. | vllm, sglang |
| Token throughput per GPU | Average token throughput per GPU card for inputs and outputs. | vllm, sglang |
| Request prompt length | Distribution of input token length (average and quantiles). | vllm (average and quantiles), sglang (average only) |
| Request generation length | Distribution of output token length (average and quantiles). | vllm (average and quantiles), sglang (average only) |
| TTFT (Time To First Token) | Latency to generate the first output token (average and quantiles). | vllm, sglang |
| TPOT (Time Per Output Token) | Latency to generate each subsequent output token (average and quantiles). | vllm, sglang |
| KV cache hit ratio | Average KV cache hit ratio per instance. Effective only when prefix cache is enabled in the inference framework. | vllm, sglang |
| Request prompt length heatmap | Heatmap of input token length distribution. | vllm |
| Request generation length heatmap | Heatmap of output token length distribution. | vllm |
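Rates and averages such as QPS, token throughput, and E2E latency are commonly derived from cumulative counters by comparing two scrapes. The dashboard's actual queries are not documented here, so the following is only a sketch of that general technique. The `Sample` shape and its field names are hypothetical, and counter resets (for example after a pod restart) are not handled.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """A snapshot of cumulative counters at one scrape (hypothetical shape)."""
    timestamp_s: float        # scrape time, in seconds
    request_count: float      # total finished requests so far
    latency_sum_s: float      # total end-to-end latency so far, in seconds
    generation_tokens: float  # total output tokens so far


def window_stats(start: Sample, end: Sample) -> dict[str, float]:
    """Derive per-second rates and the average latency between two scrapes."""
    elapsed = end.timestamp_s - start.timestamp_s
    if elapsed <= 0:
        raise ValueError("end sample must be later than start sample")
    requests = end.request_count - start.request_count
    latency = end.latency_sum_s - start.latency_sum_s
    tokens = end.generation_tokens - start.generation_tokens
    return {
        "qps": requests / elapsed,
        "output_tokens_per_s": tokens / elapsed,
        # Average latency is undefined when no request finished in the window.
        "avg_e2e_latency_s": latency / requests if requests else float("nan"),
    }


stats = window_stats(
    Sample(timestamp_s=0, request_count=100, latency_sum_s=250.0, generation_tokens=20_000),
    Sample(timestamp_s=60, request_count=220, latency_sum_s=610.0, generation_tokens=50_000),
)
print(stats)  # {'qps': 2.0, 'output_tokens_per_s': 500.0, 'avg_e2e_latency_s': 3.0}
```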
Pod-level
Per-pod performance metrics. Use these panels to analyze load distribution and identify performance variation across pods.

| Panel | Description | Inference engine |
|---|---|---|
| E2E request latency | Average request processing time per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Token throughput | Input and output tokens processed per second, per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Time To First Token latency | Latency to generate the first output token per pod (average and quantiles). | vllm, sglang, vllm-pd, sglang-pd |
| Time Per Output Token latency | Latency to generate subsequent output tokens per pod (average and quantiles). | vllm, sglang, vllm-pd, sglang-pd |
| KV cache utilization | Percentage of KV cache in use per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Scheduler state | Number of requests in Waiting, Running, or Swapped states per pod. When using sglang or sglang-pd, only Waiting and Running are supported. | vllm, sglang, vllm-pd, sglang-pd |
| Finish reason | Number of requests that finished for a specific reason during a monitoring period: abort (stopped before completion) or length (maximum output length reached). | vllm, vllm-pd |
| Queue time | Average time a request spends in the scheduler queue per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Requests prefill and decode time | Average time spent in the prefill and decode phases per pod. | vllm, vllm-pd |
| KV cache hit ratio | KV cache hit ratio per pod. Effective only when prefix cache is enabled in the inference framework. | vllm, sglang, vllm-pd, sglang-pd |
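Several panels report quantiles (for example for TTFT and TPOT). When latencies are recorded as Prometheus-style histograms with cumulative buckets, a quantile can be estimated by linear interpolation inside the bucket that contains the target rank, which is the idea behind PromQL's `histogram_quantile`. The function below is a simplified sketch of that idea with made-up bucket values. It is not the dashboard's query and omits several edge cases that Prometheus handles.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    The bucket list is expected to include a final +inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            # The +inf bucket has no finite width, so fall back to the
            # highest finite bound; an empty bucket cannot be interpolated.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up TTFT buckets in seconds: 10 requests <= 0.1 s, 60 <= 0.5 s, and so on.
ttft_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]

print(round(histogram_quantile(0.5, ttft_buckets), 3))   # 0.42
print(round(histogram_quantile(0.95, ttft_buckets), 3))  # 1.0
```

Because the result is interpolated, its accuracy depends on how finely the bucket boundaries are spaced around the quantile of interest.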
GPU stats (associated with pod)
GPU utilization metrics for each pod. Use these panels to understand how each inference service pod consumes GPU resources.

| Panel | Description | Inference engine |
|---|---|---|
| Pods GPU Tensor Active | Average percentage of cycles that the Tensor (HMMA/IMMA) pipeline is active across each GPU. This is a time-averaged value, not an instantaneous reading. | vllm, sglang, vllm-pd, sglang-pd |
| Pods GPU Utilization | Average overall utilization of each GPU. | vllm, sglang, vllm-pd, sglang-pd |
| Pods GPU SM Active | Average utilization of Streaming Multiprocessors (SM) across each GPU. | vllm, sglang, vllm-pd, sglang-pd |
| Pods GPU Memory Copy Utilization | Average memory bandwidth utilization of each GPU. | vllm, sglang, vllm-pd, sglang-pd |
| Pods Used GPU Memory | Average amount of GPU memory in use per pod. | vllm, sglang, vllm-pd, sglang-pd |
| Pods GPU DRAM Active | Frequency of memory instruction execution across each GPU during a sample period. | vllm, sglang, vllm-pd, sglang-pd |