Container Service for Kubernetes: Configure monitoring for LLM inference services

Last Updated: Sep 17, 2025

Observability is crucial for managing large language model (LLM) inference services in a production environment. By monitoring key performance metrics for the service, its pods, and the associated GPUs, you can effectively identify performance bottlenecks and diagnose failures. This topic describes how to configure monitoring for LLM inference services.

Prerequisites

Managed Service for Prometheus is enabled in your Container Service for Kubernetes (ACK) cluster.

Billing

When you enable monitoring for an LLM inference service, its metrics are reported to Managed Service for Prometheus as custom metrics.

Using custom metrics incurs additional charges. Costs vary based on factors such as your cluster size, the number of applications, and the data volume. You can monitor and manage your resource consumption by using the usage query feature.

Step 1: Integrate the LLM inference service monitoring component

  1. Log on to the ARMS console.

  2. In the navigation pane on the left, click Integration Center. In the AI section, click the Cloud-Native AI Suite LLM Inference card.

  3. On the Cloud-Native AI Suite LLM Inference panel, select the target cluster.

    If the component is already installed, skip this step.
  4. In the Configuration Information section, configure the parameters and click OK to connect the component.

    Access Name: A unique name for this LLM inference service monitoring integration. This parameter is optional.

    Namespace: The namespaces from which to collect metrics. This parameter is optional. If left empty, metrics are collected from all namespaces that meet the criteria.

    Pod Port: The name of the port on the LLM inference service pod that is used for metric collection. Default value: http. (See the port example after these steps.)

    Metric Collection Path: The HTTP path on the LLM inference service pod that exposes metrics in Prometheus format. Default value: /metrics.

    Collection Interval (seconds): The interval at which monitoring data is collected.

  5. You can view all integrated components on the Integration Management page of the ARMS console.

For details about the Integration Center, see Integration guide.
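
The default values of the Pod Port and Metric Collection Path parameters assume that the inference service pod exposes a container port named http and serves metrics in Prometheus format at the /metrics path on that port. The following snippet is a minimal sketch of this convention. The container name and the port number 8000 are placeholders; use the port on which your inference engine actually serves its metrics.

containers:
  - name: vllm              # placeholder container name
    ports:
      - name: http          # matches the default Pod Port value (http)
        containerPort: 8000 # example: the port on which the engine exposes /metrics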

Step 2: Deploy an inference service with metrics collection enabled

To enable metric collection for your LLM inference service, add the following labels to the pod template metadata in your workload manifest:

...
spec:
  template:
    metadata:
      labels:
        # A unique identifier for the inference service. Recommended: the name of the managing workload.
        alibabacloud.com/inference-workload: <workload_name>
        # The inference engine: vllm, sglang, vllm-pd, or sglang-pd.
        alibabacloud.com/inference-backend: <backend>

Label: alibabacloud.com/inference-workload
Purpose: A unique identifier for an inference service within a namespace.
Description: Recommended value: the name of the workload resource (such as a StatefulSet, Deployment, or RoleBasedGroup) that manages the pods. When this label is present, the pod is added to the ARMS metric collection targets.

Label: alibabacloud.com/inference-backend
Purpose: The inference engine used by the service.
Description: Supported values:

  • vllm: for standalone or distributed inference services that use vLLM.

  • sglang: for standalone or distributed inference services that use SGLang.

  • vllm-pd: for inference services that use vLLM with prefill/decode (PD) disaggregation.

  • sglang-pd: for inference services that use SGLang with PD disaggregation.

The preceding code snippet shows how to enable metric collection for an LLM inference service pod. For complete, end-to-end examples, see the deployment topics for your inference engine.
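
The following manifest is a minimal sketch of a Deployment for a standalone vLLM service with both labels set. The name qwen-vllm, the image, the model path, and the port are placeholders for illustration; replace them with the values used by your own service, and set alibabacloud.com/inference-backend to sglang, vllm-pd, or sglang-pd if you use a different engine or deployment mode.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm                         # placeholder workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
        # Use the workload name so that metrics are grouped per inference service.
        alibabacloud.com/inference-workload: qwen-vllm
        # Standalone vLLM service.
        alibabacloud.com/inference-backend: vllm
    spec:
      containers:
        - name: vllm
          image: <vllm_image>             # placeholder image
          command:                        # placeholder start command; adjust to how you launch the engine
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model=<model_path>        # placeholder model path
            - --port=8000
          ports:
            - name: http                  # matches the default Pod Port value in Step 1
              containerPort: 8000         # vLLM serves Prometheus metrics at /metrics on this port
          resources:
            limits:
              nvidia.com/gpu: 1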

Step 3: View the inference service monitoring dashboard

  1. Log on to the ACK console.

  2. In the navigation pane on the left, click Clusters.

  3. On the Clusters page, click the target ACK or Alibaba Cloud Container Compute Service (ACS) cluster. In the navigation pane on the left, select Operations > Prometheus Monitoring.

  4. On the Prometheus Monitoring page, select Others > LLM Inference Dashboard to view detailed performance data.

  5. Use the dashboard filters to select the namespace, workload_name, and model_name that you want to inspect. For a detailed explanation of each panel, see Dashboard panel descriptions.

Metric references

The monitoring dashboard aggregates metrics exposed by the inference engine (vLLM or SGLang) together with GPU metrics associated with the inference service pods.

Dashboard panel descriptions

The LLM inference service dashboard provides a hierarchical view of your service's performance. It assumes that an inference service is deployed as a Kubernetes workload. An inference service may contain multiple instances, and each instance may consist of one or more pods. Each inference service instance can provide LLM inference capabilities for one or more models, such as a base model combined with LoRA adapters.

The dashboard is organized into three main sections:

Model-Level

This section displays metrics for a specific model, aggregated across all inference service instances that serve it. Use these panels to assess the overall performance and health of the model service.

Pod-Level

This section breaks down performance metrics by individual pod. Use these panels to analyze the load distribution and identify performance variations between pods of the service.

GPU Stats (Associated with Pod)

This section provides detailed GPU utilization metrics for each pod. Use these panels to understand how each inference service pod is consuming GPU resources.

Detailed panel information

The following describes each panel in the dashboard and its compatibility with different inference backends:

Model-Level panels

QPS
Description: Total requests processed per second across all service instances.
Compatibility: vllm and sglang

Request Success Rate
Description: The percentage of successfully processed requests.
Compatibility: vllm

E2E Latency
Description: The average end-to-end request processing time.
Compatibility: vllm and sglang

Token Throughput
Description: The rate of input (prompt) and output (generation) tokens processed per second.
Compatibility: vllm and sglang

Token Throughput per GPU
Description: The average token throughput per GPU card for inputs (prompt) and outputs (generation).
Compatibility: vllm and sglang

Request Prompt Length
Description: The distribution (average and quantiles) of input token length.
Compatibility: vllm (average and quantiles) and sglang (average only)

Request Generation Length
Description: The distribution (average and quantiles) of output token length.
Compatibility: vllm (average and quantiles) and sglang (average only)

TTFT (Time To First Token)
Description: The latency to generate the first output token (average and quantiles).
Compatibility: vllm and sglang

TPOT (Time Per Output Token)
Description: The latency to generate each subsequent output token (average and quantiles).
Compatibility: vllm and sglang

KV Cache Hit Ratio
Description: The average KV cache hit ratio for each inference service instance. This metric is meaningful only when the prefix cache feature is enabled in the inference framework.
Compatibility: vllm and sglang

Request Prompt Length Heatmap
Description: A heatmap showing the distribution of input token lengths.
Compatibility: vllm

Request Generation Length Heatmap
Description: A heatmap showing the distribution of output token lengths.
Compatibility: vllm

Pod-Level panels

E2E Request Latency
Description: The average request processing time per pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Token Throughput
Description: The rate of input (prompt) and output (generation) tokens processed per second, per pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Time To First Token Latency
Description: The latency to generate the first output token per pod (average and quantiles).
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Time Per Output Token Latency
Description: The latency to generate each subsequent output token per pod (average and quantiles).
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

KV Cache Utilization
Description: The percentage of the KV cache currently in use, per pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Scheduler State
Description: The number of requests in the Waiting, Running, or Swapped state, per pod. When sglang or sglang-pd is used, only the Waiting and Running states are reported.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Finish Reason
Description: The number of requests that finished for a specific reason within a monitoring period. Reasons include:

  • abort: The request was aborted before it completed.

  • length: The maximum output length was reached.

Compatibility: vllm and vllm-pd

Queue Time
Description: The average time a request spends in the scheduler queue, per pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Requests Prefill and Decode Time
Description: The average time spent in the prefill and decode phases, per pod.
Compatibility: vllm and vllm-pd

KV Cache Hit Ratio
Description: The KV cache hit ratio for each inference service pod. This metric is meaningful only when the prefix cache feature is enabled in the inference framework.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

GPU Stats (Associated with Pod) panels

Pods GPU Tensor Active
Description: The average percentage of cycles during which the Tensor (HMMA/IMMA) pipeline is active on each GPU used by the inference service pod. This value is an average over a time interval, not an instantaneous value.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods GPU Utilization
Description: The average overall utilization of each GPU.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods GPU SM Active
Description: The average utilization of the streaming multiprocessors (SMs) on each GPU.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods GPU Memory Copy Utilization
Description: The average memory bandwidth utilization of each GPU.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods Used GPU Memory
Description: The average amount of GPU memory in use by each pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods GPU DRAM Active
Description: The frequency at which memory instructions are executed on each GPU during a sample period.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd