Observability is crucial for managing large language model (LLM) inference services in a production environment. By monitoring key performance metrics for the service, its pods, and the associated GPUs, you can effectively identify performance bottlenecks and diagnose failures. This topic describes how to configure monitoring for LLM inference services.
Prerequisites
Managed Service for Prometheus is enabled in your Container Service for Kubernetes (ACK) cluster.
Billing
When you enable monitoring for an LLM inference service, its metrics are reported to Managed Service for Prometheus as custom metrics.
Using custom metrics incurs additional charges. Costs vary based on factors such as your cluster size, the number of applications, and the volume of data. You can monitor and manage your resource consumption through usage queries.
Step 1: Access the LLM inference service monitoring dashboard
Log on to the ARMS console.
In the navigation pane on the left, click Integration Center. In the AI section, click the Cloud-Native AI Suite LLM Inference card.
On the Cloud-Native AI Suite LLM Inference panel, select the target cluster.
If the component is already installed, skip this step. Otherwise, in the Configuration Information section, configure the following parameters and click OK to connect the component.
| Parameter | Description |
| --- | --- |
| Access Name | A unique name for the current LLM inference service monitoring. This parameter is optional. |
| Namespace | The namespaces from which to collect metrics. This parameter is optional. If left empty, metrics are collected from all namespaces that meet the criteria. |
| Pod Port | The name of the port on the LLM inference service pod that is used for metric collection. Default value: http. |
| Metric Collection Path | The HTTP path on the LLM inference service pod that exposes metrics in Prometheus format. Default value: /metrics. |
| Collection Interval (seconds) | The interval at which monitoring data is collected. |
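The default values above assume that the inference service pod exposes a container port named http and serves Prometheus-format metrics at the /metrics path. As a rough illustration only (the container name, image, and port number below are placeholders, not values required by the integration), the corresponding part of a pod spec might look like this:

containers:
  - name: inference                # placeholder container name
    image: <your-inference-image>  # placeholder image
    ports:
      - name: http                 # must match the Pod Port name configured above
        containerPort: 8000        # example port that the inference engine listens on
    # The inference engine is expected to expose Prometheus metrics at the
    # configured Metric Collection Path (default: /metrics) on this port.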
You can view all integrated components on the Integration Management page of the ARMS console.
For details about the Integration Center, see Integration guide.
Step 2: Deploy an inference service with metrics collection enabled
To enable metrics collection for your LLM inference service, add the following labels to the pod template metadata in your workload manifest:
...
spec:
  template:
    metadata:
      labels:
        alibabacloud.com/inference-workload: <workload_name>
        alibabacloud.com/inference-backend: <backend>

| Label | Purpose | Description |
| --- | --- | --- |
| alibabacloud.com/inference-workload | A unique identifier for an inference service within a namespace. | Recommended value: the name of the workload resource (such as a StatefulSet, Deployment, or RoleBasedGroup) that manages the pods. When this label is present, the pod is added to the ARMS metric collection targets. |
| alibabacloud.com/inference-backend | The inference engine used by the service. | Supported values include vllm and sglang. |
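For illustration, the following minimal Deployment sketch combines these labels with a metrics port that matches the defaults from Step 1. The workload name, image, port number, and GPU request are placeholder assumptions; adapt them to your own inference engine and model.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm                                       # placeholder workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
        alibabacloud.com/inference-workload: qwen-vllm  # recommended: the workload name
        alibabacloud.com/inference-backend: vllm        # the inference engine in use
    spec:
      containers:
        - name: vllm
          image: <your-vllm-image>                      # placeholder image
          ports:
            - name: http                                # matches the default Pod Port name
              containerPort: 8000                       # example port; metrics are served at /metrics
          resources:
            limits:
              nvidia.com/gpu: 1                         # example GPU request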
The preceding code snippet shows how to enable metric collection for an LLM inference service pod. For complete deployment examples, see the following topics:
Step 3: View the inference service monitoring dashboard
Log on to the ACK console.
In the navigation pane on the left, click Clusters.
On the Clusters page, click the target ACK or Alibaba Cloud Container Compute Service (ACS) cluster. In the navigation pane on the left, choose Operations > Prometheus Monitoring.
On the Prometheus Monitoring page, open the LLM inference service dashboard to view detailed performance data.
Use the dashboard filters to select the namespace, workload_name, and model_name you wish to inspect. For a detailed explanation of each panel, see Dashboard panel descriptions.
Metric references
The monitoring dashboard aggregates metrics from the following sources:
vLLM metrics: See the official vLLM metrics list.
SGLang metrics: See the official SGLang metrics list.
Dashboard panel descriptions
The LLM inference service dashboard provides a hierarchical view of your service's performance. It assumes that an inference service is deployed by a Kubernetes workload, that a service may contain multiple instances, and that each instance may consist of one or more pods. Each inference service instance can serve one or more models, such as a base model combined with LoRA adapters.
The dashboard is organized into three main sections:
Model-Level
This section displays aggregated metrics for a specific model across all its serving inference services. Use these panels to assess the overall performance and health of the model service.

Pod-Level
This section breaks down performance metrics by individual pod. Use these panels to analyze the load distribution and identify performance variations between pods of the service.

GPU Stats (Associated with Pod)
This section provides detailed GPU utilization metrics for each pod. Use these panels to understand how each inference service pod is consuming GPU resources.

Detailed panel information
The following table describes each panel in the dashboard and its compatibility with different inference backends: