Container Service for Kubernetes: Configure monitoring for LLM inference services

Last Updated: Sep 17, 2025

Observability is crucial for managing large language model (LLM) inference services in a production environment. By monitoring key performance metrics for the service, its pods, and the associated GPUs, you can effectively identify performance bottlenecks and diagnose failures. This topic describes how to configure monitoring for LLM inference services.

Prerequisites

Managed Service for Prometheus is enabled in your Container Service for Kubernetes (ACK) cluster.

Billing

When you enable monitoring for an LLM inference service, its metrics are reported to Managed Service for Prometheus as custom metrics.

Using custom metrics incurs additional charges. Costs vary based on factors such as your cluster size, the number of applications, and the data volume. You can monitor and manage your resource consumption by using the usage query feature.

Step 1: Integrate the LLM inference service monitoring component

  1. Log on to the ARMS console.

  2. In the navigation pane on the left, click Integration Center. In the AI section, click the Cloud-Native AI Suite LLM Inference card.

  3. On the Cloud-Native AI Suite LLM Inference panel, select the target cluster.

    If the component is already installed, skip this step.
  4. In the Configuration Information section, configure the parameters and click OK to connect the component.

    Access Name: A unique name for this LLM inference service monitoring integration. This parameter is optional.

    Namespace: The namespaces from which to collect metrics. This parameter is optional. If left empty, metrics are collected from all namespaces that meet the criteria.

    Pod Port: The name of the port on the LLM inference service pod that is used for metric collection. Default value: http. (See the port example after these steps.)

    Metric Collection Path: The HTTP path on the LLM inference service pod that exposes metrics in Prometheus format. Default value: /metrics.

    Collection Interval (seconds): The interval at which monitoring data is collected.

  5. You can view all integrated components on the Integration Management page of the ARMS console.

For details about the Integration Center, see Integration guide.
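
The default values of the Pod Port and Metric Collection Path parameters assume that the inference service pod exposes a container port named http and serves metrics in Prometheus format at the /metrics path on that port. The following snippet is a minimal sketch of this convention. The container name and the port number 8000 are placeholders; use the port on which your inference engine actually serves its metrics.

containers:
  - name: vllm              # placeholder container name
    ports:
      - name: http          # matches the default Pod Port value (http)
        containerPort: 8000 # example: the port on which the engine exposes /metrics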

Step 2: Deploy an inference service with metrics collection enabled

To enable metric collection for your LLM inference service, add the following labels to the pod template metadata in your workload manifest:

...
spec:
  template:
    metadata:
      labels:
        # A unique identifier for the inference service. Recommended: the name of the managing workload.
        alibabacloud.com/inference-workload: <workload_name>
        # The inference engine: vllm, sglang, vllm-pd, or sglang-pd.
        alibabacloud.com/inference-backend: <backend>

Label: alibabacloud.com/inference-workload
Purpose: A unique identifier for an inference service within a namespace.
Description: Recommended value: the name of the workload resource (such as a StatefulSet, Deployment, or RoleBasedGroup) that manages the pods. When this label is present, the pod is added to the ARMS metric collection targets.

Label: alibabacloud.com/inference-backend
Purpose: The inference engine used by the service.
Description: Supported values:

  • vllm: for standalone or distributed inference services that use vLLM.

  • sglang: for standalone or distributed inference services that use SGLang.

  • vllm-pd: for inference services that use vLLM with prefill/decode (PD) disaggregation.

  • sglang-pd: for inference services that use SGLang with PD disaggregation.

The preceding code snippet shows how to enable metric collection for an LLM inference service pod. For complete, end-to-end examples, see the deployment topics for your inference engine.
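
The following manifest is a minimal sketch of a Deployment for a standalone vLLM service with both labels set. The name qwen-vllm, the image, the model path, and the port are placeholders for illustration; replace them with the values used by your own service, and set alibabacloud.com/inference-backend to sglang, vllm-pd, or sglang-pd if you use a different engine or deployment mode.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm                         # placeholder workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
        # Use the workload name so that metrics are grouped per inference service.
        alibabacloud.com/inference-workload: qwen-vllm
        # Standalone vLLM service.
        alibabacloud.com/inference-backend: vllm
    spec:
      containers:
        - name: vllm
          image: <vllm_image>             # placeholder image
          command:                        # placeholder start command; adjust to how you launch the engine
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model=<model_path>        # placeholder model path
            - --port=8000
          ports:
            - name: http                  # matches the default Pod Port value in Step 1
              containerPort: 8000         # vLLM serves Prometheus metrics at /metrics on this port
          resources:
            limits:
              nvidia.com/gpu: 1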

Step 3: View the inference service monitoring dashboard

  1. Log on to the ACK console.

  2. In the navigation pane on the left, click Clusters.

  3. On the Clusters page, click the target ACK or Alibaba Cloud Container Compute Service (ACS) cluster. In the navigation pane on the left, select Operations > Prometheus Monitoring.

  4. On the Prometheus Monitoring page, select Others > LLM Inference Dashboard to view detailed performance data.

  5. Use the dashboard filters to select the namespace, workload_name, and model_name that you want to inspect. For a detailed explanation of each panel, see Dashboard panel descriptions.

Metric references

The monitoring dashboard aggregates metrics exposed by the inference engine (vLLM or SGLang) together with GPU metrics associated with the inference service pods.

Dashboard panel descriptions

The LLM inference service dashboard provides a hierarchical view of your service's performance. It assumes that an inference service is deployed as a Kubernetes workload. An inference service may contain multiple instances, and each instance may consist of one or more pods. Each inference service instance can provide LLM inference capabilities for one or more models, such as a base model combined with LoRA adapters.

The dashboard is organized into three main sections:

Model-Level

This section displays metrics for a specific model, aggregated across all inference service instances that serve it. Use these panels to assess the overall performance and health of the model service.

Pod-Level

This section breaks down performance metrics by individual pod. Use these panels to analyze the load distribution and identify performance variations between pods of the service.

GPU Stats (Associated with Pod)

This section provides detailed GPU utilization metrics for each pod. Use these panels to understand how each inference service pod is consuming GPU resources.

Detailed panel information

The following describes each panel in the dashboard and its compatibility with different inference backends:

Model-Level panels

QPS
Description: Total requests processed per second across all service instances.
Compatibility: vllm and sglang

Request Success Rate
Description: The percentage of successfully processed requests.
Compatibility: vllm

E2E Latency
Description: The average end-to-end request processing time.
Compatibility: vllm and sglang

Token Throughput
Description: The rate of input (prompt) and output (generation) tokens processed per second.
Compatibility: vllm and sglang

Token Throughput per GPU
Description: The average token throughput per GPU card for inputs (prompt) and outputs (generation).
Compatibility: vllm and sglang

Request Prompt Length
Description: The distribution (average and quantiles) of input token length.
Compatibility: vllm (average and quantiles) and sglang (average only)

Request Generation Length
Description: The distribution (average and quantiles) of output token length.
Compatibility: vllm (average and quantiles) and sglang (average only)

TTFT (Time To First Token)
Description: The latency to generate the first output token (average and quantiles).
Compatibility: vllm and sglang

TPOT (Time Per Output Token)
Description: The latency to generate each subsequent output token (average and quantiles).
Compatibility: vllm and sglang

KV Cache Hit Ratio
Description: The average KV cache hit ratio for each inference service instance. This metric is meaningful only when the prefix cache feature is enabled in the inference framework.
Compatibility: vllm and sglang

Request Prompt Length Heatmap
Description: A heatmap showing the distribution of input token lengths.
Compatibility: vllm

Request Generation Length Heatmap
Description: A heatmap showing the distribution of output token lengths.
Compatibility: vllm

Pod-Level panels

E2E Request Latency
Description: The average request processing time per pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Token Throughput
Description: The rate of input (prompt) and output (generation) tokens processed per second, per pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Time To First Token Latency
Description: The latency to generate the first output token per pod (average and quantiles).
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Time Per Output Token Latency
Description: The latency to generate each subsequent output token per pod (average and quantiles).
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

KV Cache Utilization
Description: The percentage of the KV cache currently in use, per pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Scheduler State
Description: The number of requests in the Waiting, Running, or Swapped state, per pod. When sglang or sglang-pd is used, only the Waiting and Running states are reported.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Finish Reason
Description: The number of requests that finished for a specific reason within a monitoring period. Reasons include:

  • abort: The request was aborted before it completed.

  • length: The maximum output length was reached.

Compatibility: vllm and vllm-pd

Queue Time
Description: The average time a request spends in the scheduler queue, per pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Requests Prefill and Decode Time
Description: The average time spent in the prefill and decode phases, per pod.
Compatibility: vllm and vllm-pd

KV Cache Hit Ratio
Description: The KV cache hit ratio for each inference service pod. This metric is meaningful only when the prefix cache feature is enabled in the inference framework.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

GPU Stats (Associated with Pod) panels

Pods GPU Tensor Active
Description: The average percentage of cycles during which the Tensor (HMMA/IMMA) pipeline is active on each GPU used by the inference service pod. This value is an average over a time interval, not an instantaneous value.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods GPU Utilization
Description: The average overall utilization of each GPU.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods GPU SM Active
Description: The average utilization of the streaming multiprocessors (SMs) on each GPU.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods GPU Memory Copy Utilization
Description: The average memory bandwidth utilization of each GPU.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods Used GPU Memory
Description: The average amount of GPU memory in use by each pod.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd

Pods GPU DRAM Active
Description: The frequency at which memory instructions are executed on each GPU during a sample period.
Compatibility: vllm, sglang, vllm-pd, and sglang-pd