After you deploy a service by using Elastic Algorithm Service (EAS) of Platform for AI (PAI), you can view the service-related metrics on the Service Monitoring tab to learn about service calls and running status. This topic describes how to view service monitoring information and provides detailed descriptions of monitoring metrics.
View the service monitoring information
1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
2. Click the name of the target service, and then click the Service Monitoring tab.
3. View the service monitoring information.
Switch between dashboards
Dashboards are displayed based on service and instance dimensions. You can switch between dashboards.

- Service: the service dimension. The default service monitoring dashboard is named in the Service-<service_name> format, where <service_name> is the name of the EAS service.
- Instance: the instance dimension, which supports single-instance and multi-instance modes.
  - Single Instance: the monitoring dashboard for a single instance. You can switch between instances.
  - Multiple Instances: the monitoring dashboard for multiple instances. You can select multiple instances to compare their metrics.

Switch between time ranges
Click the time range selector on the right side of the Service Monitoring tab to switch the time range displayed on the dashboard.
Important: Minute-level metrics are retained for up to one month, and second-level metrics are retained for up to one hour.
Important: If you set the service tag to ServiceEngineType: vllm or ServiceEngineType: sglang, LLM-related metrics are displayed.
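As a minimal sketch, the tag can be added through the service configuration. The field names below follow the EAS JSON service config, and the service name is a made-up example; verify both against your actual deployment file.

```python
import json

# Hypothetical EAS service configuration that sets the
# ServiceEngineType label so LLM-related metrics are displayed.
service_config = {
    "name": "llm_demo",  # example service name, replace with your own
    "labels": {
        "ServiceEngineType": "vllm"  # or "sglang"
    },
}

print(json.dumps(service_config, indent=2))
```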
View the metrics
Service monitoring dashboard (minute-level)
The following table describes the metrics that you can view on the service monitoring dashboard.
Metric | Description |
QPS | The number of requests per second (QPS) for the service. The number of requests is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of requests sent to all instances. 1d offset indicates the QPS data from the same time on the previous day, which can be used for comparative analysis. |
Response | The number of responses received by the service within the specified time range. The number of responses is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of responses received by all instances. |
RT | The response time (RT) of requests, displayed as TPXX percentiles. For example, TP5 indicates that 5% of requests have a response time less than or equal to this value, and TP100 indicates that all requests have a response time less than or equal to this value. If the service contains multiple instances, TP100 indicates that all requests across all instances have a response time less than or equal to this metric value. For other TP metrics, TPXX indicates the average TPXX value across all instances. For example, TP5 is the average of the TP5 values of the individual instances. |
Daily Invoke | The number of daily calls to the service. The number of calls is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of daily calls to all instances of the service. |
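The TPXX percentiles above can be sketched as follows. The sample latencies are made up for illustration; the console computes these values from real request data.

```python
def tp(latencies_ms, percent):
    """Return the TP<percent> value: percent% of requests finish
    at or under this latency."""
    ordered = sorted(latencies_ms)
    # index of the smallest value that covers percent% of requests
    idx = max(0, int(round(percent / 100 * len(ordered))) - 1)
    return ordered[idx]

latencies = [12, 15, 18, 20, 22, 25, 30, 45, 60, 120]  # ms, illustrative
print(tp(latencies, 50))   # half of the requests finish at or under this value
print(tp(latencies, 100))  # the slowest request
```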
Single-instance monitoring dashboard (minute-level)
The following table describes the metrics that you can view on the single-instance dashboard.
Metric | Description |
QPS | The number of requests received by the instance per second. The number of requests is calculated separately by response code. |
RT | The response time of requests for the instance. |
Response | The total number of responses received by the instance within a specific time range. The number of responses is calculated separately by response code. |
Multi-instance monitoring dashboard
The following table describes the minute-level and second-level metrics that you can view on the multi-instance dashboard.
Minute-Level

Metric | Description |
Instance QPS | The number of requests received by each instance per second. The number of requests is calculated separately by response code. |
Instance RT | The average response time for each instance. |
Instance CPU | The number of CPU cores used by each instance. Unit: cores. |
Instance Memory -- RSS | The resident physical memory size of each instance. |
Instance Memory -- Cache | The cache size of each instance. |
Instance GPU | The GPU utilization of each instance. |
Instance GPU Memory | The GPU memory usage of each instance. |
Instance TCP Connections | The number of TCP connections of each instance. |

Second-Level

Important: Data is accurate to 5 seconds. Only data of the most recent hour is retained.

Metric | Description |
Instance QPS Fine | The number of requests received by each instance per second. The number of requests is calculated separately by response code. |
Instance RT Fine | The average response time of requests received by each instance. |
GPU monitoring dashboard
The following table describes the GPU-related metrics that you can view on both the service-level and instance-level dashboards. For service-level metrics, average values are calculated across all instances.
Metric | Description |
GPU Utilization | The GPU utilization of the service at a specific point in time. |
GPU Memory | The GPU memory usage and total memory of the service at a specific point in time. |
Memory Copy Utilization | The GPU memory copy utilization of the service at a specific point in time. |
GPU Memory Utilization | The GPU memory utilization of the service at a specific point in time. Calculation formula: Memory usage/Total memory. |
PCIe | The Peripheral Component Interconnect Express (PCIe) rate of the service at a specific point in time, measured by Data Center GPU Manager (DCGM). |
Memory Bandwidth | The GPU memory bandwidth of the service at a specific point in time. |
SM Utilization and Occupancy | Streaming Multiprocessor (SM)-related metrics of the service at a specific point in time. The SM is a core component of the GPU, responsible for executing and scheduling parallel computing tasks. |
Graphics Engine Utilization | The utilization of the GPU graphics engine of the service at a specific point in time. |
Pipe Active Ratio | The activity rate of the GPU compute pipeline of the service at a specific point in time. |
Tflops Usage | The tera floating-point operations per second (TFLOPS) of the GPU compute pipeline of the service at a specific point in time. |
DRAM Active Ratio | The activity rate of the GPU memory interface for data transmission or reception at a specific point in time. |
SM Clock | The SM clock frequency of the service at a specific point in time. |
GPU Temperature | The GPU temperature-related metrics of the service at a specific point in time. |
Power Usage | The GPU power usage of the service at a specific point in time. |
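The GPU Memory Utilization formula from the table (memory usage divided by total memory) can be sketched as follows. The memory values are illustrative.

```python
def gpu_memory_utilization(used_mib, total_mib):
    """GPU memory utilization = memory usage / total memory."""
    return used_mib / total_mib

# Example: 12 GiB in use on a 16 GiB GPU
print(f"{gpu_memory_utilization(12288, 16384):.2%}")
```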
vLLM monitoring dashboard
If a service has multiple instances, throughput is the sum of all instances, and latency is the average of all instances.
Metric | Description |
Requests Status | The total number of requests for the service at a specific point in time. |
Token Throughput | The number of input and generated tokens for all requests of the service at a specific point in time. |
Request Completion Status | Statistics on the completion status of all requests for the service at a specific point in time. |
Time To First Token | The time to first token (TTFT) latency for all requests of the service at a specific point in time. This is the time from when a request is received to when the first token is generated. |
Time Per Output Token | The time per output token (TPOT) latency for all requests of the service at a specific point in time. This is the average time required to generate each output token after the first token. |
E2E Request Latency | The end-to-end latency for all requests of the service at a specific point in time. This is the time from when a request is received to when all tokens are returned. |
Queue Time | The queue latency for all requests of the service at a specific point in time. This is the time a request waits in the queue before the engine processes it. |
Inference Time | The inference latency for all requests of the service at a specific point in time. This is the time the engine spends processing a request. |
Prefill Time | The prefill phase latency for all requests of the service at a specific point in time. This is the time the engine takes to process the input tokens of a request. |
Decode Time | The decode phase latency for all requests of the service at a specific point in time. This is the time the engine takes to generate the output tokens. |
Input Token Length | The number of input tokens processed by the service at a specific point in time. |
Output Token Length | The number of output tokens generated by the service at a specific point in time. |
Request Parameters (params_n & max_tokens) | The `n` and `max_tokens` parameters for all requests of the service at a specific point in time. |
GPU KV Cache Usage | The average GPU KV cache usage of the service at a specific point in time. |
CPU KV Cache Usage | The average CPU KV cache usage of the service at a specific point in time. |
Prefix Cache Hit Rate | The average prefix cache hit rate for all requests of the service at a specific point in time. |
HTTP Requests by Endpoint | The number of requests for the service at a specific point in time, grouped by request method, path, and response status code. |
HTTP Request Latency | The average latency for different request paths of the service at a specific point in time. |
Speculative Decoding Throughput | The number of speculative decodings for the service at a specific point in time. If the service has multiple instances, this metric is the average value of all instances. |
Speculative Decoding Efficiency | The speculative decoding performance of the service at a specific point in time. |
Token Acceptance by Position | The number of draft tokens accepted at different generation positions for the service at a specific point in time. If the service has multiple instances, this metric is the average value of all instances. |
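The latency metrics above are related for a single request: TPOT averages the per-token time after the first token, so end-to-end latency is roughly TTFT plus TPOT times the remaining token count. A minimal sketch with made-up numbers:

```python
def tpot(e2e_latency_s, ttft_s, output_tokens):
    """Average time per output token after the first token:
    (end-to-end latency - TTFT) / (output_tokens - 1)."""
    if output_tokens <= 1:
        return 0.0  # no tokens after the first one
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)

# Example: 0.5 s to first token, 10.5 s end to end, 101 tokens generated
print(tpot(10.5, 0.5, 101))  # 0.1 s per output token
```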
SGLang monitoring dashboard
If a service has multiple instances, throughput metrics show the total across all instances. Latency metrics show the average across all instances.
Metric | Description |
Requests Num | The total number of requests for the service at a given time. |
Token Throughput | The number of input and generated tokens for all service requests at a given time. |
Time To First Token | The time to first token (TTFT) for all service requests at a given time. TTFT measures the time from when a request is received to when the first token is generated. |
Time Per Output Token | The time per output token for all service requests at a given time. This metric measures the average time to generate each subsequent output token after the first one. |
E2E Request Latency | The end-to-end latency for all service requests at a given time. End-to-end latency measures the time from when a request is received to when all tokens are returned. |
Cache Hit Rate | The average prefix cache hit rate for all service requests at a given time. |
Used Tokens Num | The number of key-value (KV) cache tokens that the service uses at a given time. If the service has multiple instances, this metric shows the average value across all instances. |
Token Usage | The average KV cache token usage rate for the service at a given time. If the service has multiple instances, this metric shows the average value across all instances. |
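The multi-instance aggregation rule stated above (throughput metrics are summed across instances, latency metrics are averaged) can be sketched as follows. The per-instance values are illustrative.

```python
# Illustrative per-instance samples: tokens/s throughput and TTFT in seconds
instances = [
    {"token_throughput": 1200, "ttft_s": 0.40},
    {"token_throughput": 1100, "ttft_s": 0.50},
    {"token_throughput": 1300, "ttft_s": 0.45},
]

# Throughput: summed across all instances
total_throughput = sum(i["token_throughput"] for i in instances)

# Latency: averaged across all instances
avg_ttft = sum(i["ttft_s"] for i in instances) / len(instances)

print(total_throughput)    # service-level token throughput
print(round(avg_ttft, 2))  # service-level TTFT
```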
References

- After you enable service monitoring alerts, you can receive alert notifications when the service triggers alert rules.
- You can view EAS CloudMonitor events in the CloudMonitor console or by calling API operations, and use the events for O&M, auditing, or alert settings.
- You can configure a custom monitoring metric to perform auto scaling based on your business logic.