After you deploy a service by using Elastic Algorithm Service (EAS) of Platform for AI (PAI), you can view the service-related metrics on the Service Monitoring tab to learn about service calls and running status. This topic describes how to view service monitoring information and provides detailed descriptions of monitoring metrics.
View the service monitoring information
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
Click the icon in the Monitoring column of the service that you want to manage to go to the Monitoring tab.
View the service monitoring information.
Switch between dashboards
Dashboards are displayed based on service and instance dimensions. You can switch between dashboards.
Service: Service dimension. The default service monitoring dashboard is named in the Service-<service_name> format, where <service_name> specifies the name of the EAS service.
Instance: Instance dimension, which supports single-instance and multi-instance modes.
Single Instance: monitoring dashboard for a single instance. You can switch between instances.
Multiple Instance: monitoring dashboard for multiple instances. You can select multiple instances to compare their metrics.
Switch between time ranges
Click the time range selector on the right side of the Monitoring tab to switch the time range displayed on the dashboard.
Important: Minute-level metrics are retained for up to one month, and second-level metrics are retained for up to one hour.
View the metrics
Service monitoring dashboard (minute-level)
The following table describes the metrics that you can view on the service monitoring dashboard.
Metric | Description
QPS | The number of requests per second (QPS) for the service. The number of requests is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of requests sent to all instances.
Response | The number of responses received by the service within the specified time range. The number of responses is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of responses received by all instances.
RT | The response time (RT) of requests, reported as TP percentiles. For example, TP5 indicates that 5% of requests have a response time less than or equal to this value, and TP100 indicates that all requests have a response time less than or equal to this value. If the service contains multiple instances, TP100 indicates that all requests across all instances have a response time less than or equal to this metric value. For other TP metrics, TPXX indicates the average TPXX value across all instances. For example, TP5 indicates the average of the TP5 values from each individual instance. A computation sketch follows this table.
Daily Invoke | The number of daily calls to the service. The number of calls is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of daily calls to all instances of the service.
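The TP values above are percentiles of the response-time distribution. The following is a minimal sketch of how a TPXX value can be computed from a sample of response times; the sample data is an illustrative assumption, not output of EAS.

```python
import numpy as np

# Illustrative response times in milliseconds (assumed sample data).
response_times_ms = [12, 15, 18, 22, 25, 31, 40, 55, 80, 120]

# TPXX is the XX-th percentile of the response-time distribution:
# XX percent of requests completed within this time.
tp90 = np.percentile(response_times_ms, 90)
tp100 = max(response_times_ms)  # TP100 is simply the slowest request

print(f"TP90:  {tp90:.1f} ms")  # 90% of requests finished within this value
print(f"TP100: {tp100} ms")     # all requests finished within this value
```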
Single-instance monitoring dashboard (minute-level)
The following table describes the metrics that you can view on the single-instance dashboard.
Metric | Description
QPS | The number of requests received by the instance per second. The number of requests is calculated separately by response code. A computation sketch follows this table.
RT | The response time of requests for the instance.
Response | The total number of responses received by the instance within a specific time range. The number of responses is calculated separately by response code.
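To illustrate what "calculated separately by response code" means, the following sketch aggregates hypothetical access-log records into per-second request counts grouped by HTTP status code. The log format and values are assumptions for illustration; EAS computes this metric internally.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical access-log records: (timestamp, HTTP status code).
records = [
    ("2024-01-01 10:00:00", 200),
    ("2024-01-01 10:00:00", 200),
    ("2024-01-01 10:00:00", 500),
    ("2024-01-01 10:00:01", 200),
]

# Count requests per second, grouped separately by response code,
# mirroring how the QPS curves are split by status code.
qps = defaultdict(Counter)
for ts, code in records:
    second = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    qps[second][code] += 1

for second, by_code in sorted(qps.items()):
    for code, count in sorted(by_code.items()):
        print(f"{second}  status={code}  qps={count}")
```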
Multi-instance monitoring dashboard
The following table describes the minute-level and second-level metrics that you can view on the multi-instance dashboard.
Minute-Level
Metric | Description
Instance QPS | The number of requests received by each instance per second. The number of requests is calculated separately by response code.
Instance RT | The average response time for each instance.
Instance CPU | The number of CPU cores used by each instance. Unit: cores.
Instance Memory -- RSS | The size of the resident physical memory (RSS) for each instance.
Instance Memory -- Cache | The cache size for each instance.
Instance GPU | The GPU utilization of each instance.
Instance GPU Memory | The GPU memory usage of each instance.
Instance TCP Connections | The number of TCP connections for each instance.
Second-Level
Important: Data is accurate to 5 seconds. Only data of the most recent hour is retained.
Metric | Description
Instance QPS Fine | The number of requests received by each instance per second. The number of requests is calculated separately by response code.
Instance RT Fine | The average response time of requests received by each instance.
GPU monitoring dashboard
The following table describes the GPU-related metrics that you can view on both the service-level and instance-level dashboards. For service-level metrics, average values are calculated across all instances.
Metric | Description
GPU Utilization | The GPU utilization of the service at a specific point in time.
GPU Memory | The GPU memory usage and the total GPU memory of the service at a specific point in time.
Memory Copy Utilization | The GPU memory copy utilization of the service at a specific point in time.
GPU Memory Utilization | The GPU memory utilization of the service at a specific point in time. Calculation formula: Memory usage/Total memory. A sketch that reads the underlying counters follows this table.
PCIe | The Peripheral Component Interconnect Express (PCIe) rate of the service at a specific point in time, measured by Data Center GPU Manager (DCGM).
Memory Bandwidth | The GPU memory bandwidth of the service at a specific point in time.
SM Utilization and Occupancy | The Streaming Multiprocessor (SM)-related metrics of the service at a specific point in time. The SM is a core component of a GPU and is responsible for executing and scheduling parallel computing tasks.
Graphics Engine Utilization | The utilization of the GPU graphics engine of the service at a specific point in time.
Pipe Active Ratio | The activity rate of the GPU compute pipelines of the service at a specific point in time.
Tflops Usage | The tera floating-point operations per second (TFLOPS) of the GPU compute pipelines of the service at a specific point in time.
DRAM Active Ratio | The activity rate of the GPU memory interface for data transmission or reception at a specific point in time.
SM Clock | The SM clock frequency of the service at a specific point in time.
GPU Temperature | The GPU temperature-related metrics of the service at a specific point in time.
Power Usage | The GPU power usage of the service at a specific point in time.
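For readers who want to relate these dashboard metrics to raw GPU counters, the following sketch reads similar statistics (utilization, memory usage, temperature, power) directly from the NVIDIA driver through the nvidia-ml-py (pynvml) package. This is only an illustrative cross-check on a machine with an NVIDIA GPU; EAS collects its GPU metrics through DCGM, not through this script.

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on this machine

    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # NVML reports mW

    print(f"GPU utilization:        {util.gpu}%")
    print(f"GPU memory:             {mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB")
    # Same formula as the GPU Memory Utilization metric: memory usage / total memory.
    print(f"GPU memory utilization: {mem.used / mem.total:.1%}")
    print(f"GPU temperature:        {temp} °C")
    print(f"Power usage:            {power_w:.1f} W")
finally:
    pynvml.nvmlShutdown()
```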
vLLM monitoring dashboard
If the service has multiple instances, throughput-related metrics are summed across all instances, and latency-related metrics are averaged across all instances.
Metric | Description
Requests Num | The number of requests of the service at a specific point in time.
Token Throughput | The number of input and output tokens for all requests of the service at a specific point in time.
Time To First Token | The first token latency for all requests of the service at a specific point in time. This metric indicates the time from when a request is received to when the first token is generated. A sketch relating this metric to Time Per Output Token and E2E Request Latency follows this table.
Time Per Output Token | The per-token latency for all requests of the service at a specific point in time. This metric indicates the average time required to generate each output token after the first token.
E2E Request Latency | The end-to-end latency for all requests of the service at a specific point in time. This metric indicates the time from when a request is received to when all tokens are returned.
Request Params N | The average value of parameter N for all requests of the service at a specific point in time.
GPU Cache Usage | The average usage rate of the GPU KV cache of the service at a specific point in time.
CPU Cache Usage | The average usage rate of the CPU KV cache of the service at a specific point in time.
Prefix Cache Hit Rate | The average prefix cache hit rate for all requests of the service at a specific point in time.
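The three latency metrics are related in a simple way. The following sketch computes Time To First Token, Time Per Output Token, and end-to-end latency from the timestamps of a single streamed request; the timestamps and token count are assumed sample values, not output of vLLM or EAS.

```python
# Illustrative timestamps (seconds) for one streamed request (assumed values).
request_received = 0.00   # request arrives at the service
first_token_out = 0.35    # first output token is generated
last_token_out = 2.15     # final output token is returned
num_output_tokens = 46    # total output tokens in the response

# Time To First Token: wait before the first token appears.
ttft = first_token_out - request_received

# Time Per Output Token: average generation time per token after the first.
tpot = (last_token_out - first_token_out) / (num_output_tokens - 1)

# E2E Request Latency: from request arrival until all tokens are returned.
e2e = last_token_out - request_received

print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"TPOT: {tpot * 1000:.1f} ms/token")
print(f"E2E:  {e2e * 1000:.0f} ms")
```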
BladeLLM monitoring dashboard
If the service has multiple instances, throughput-related metrics are summed across all instances, and latency-related metrics are averaged across all instances.
Metric | Description
Token Throughput | The number of input and output tokens for all requests of the service at a specific point in time.
Prompt Length | The average number of prompt tokens for all requests of the service at a specific point in time.
Time To First Token | The first token latency for all requests of the service at a specific point in time. This metric indicates the time from when a request is received to when the first token is generated.
Time Per Output Token | The per-token latency for all requests of the service at a specific point in time. This metric indicates the average time required to generate each output token after the first token.
Decode Latency | The time required by the service to decode tokens at a specific point in time.
Ragged Latency | The time required for processing batches that contain both prefill and decode requests at a specific point in time.
Prefill Batch Size | The size of the prefill batches processed by the service at a specific point in time.
Decode Batch Size | The size of the decode batches processed by the service at a specific point in time.
GPU Block Usage | The average block utilization of the GPU KV cache for the service at a specific point in time.
Wait Queue Size | The number of requests of the service that are waiting in the queue to be scheduled at a specific point in time.
Scheduler Step Latency | The time required for scheduling all requests of the service at a specific point in time.
Worker Bubble | The average idle time of GPU workers for the service at a specific point in time.
Updated Tokens | The average time required by a service worker to generate a token at a specific point in time.
Chunk Util | The percentage of prefill tokens relative to the chunk size for the service at a specific point in time.
References
After you enable service monitoring alerts, you can receive alert notifications when the service triggers alert rules.
You can view EAS CloudMonitor events in the CloudMonitor console or by calling API operations, and use the events for O&M, auditing, or alert settings.
You can configure a custom monitoring metric to perform auto scaling based on your business logic.