
Platform for AI: View service monitoring and metric information

Last Updated: May 14, 2025

After you deploy a service by using Elastic Algorithm Service (EAS) of Platform for AI (PAI), you can view service-related metrics on the Monitoring tab to learn about service calls and running status. This topic describes how to view service monitoring information and provides detailed descriptions of the monitoring metrics.

View the service monitoring information

  1. Log on to the PAI console. Select a region at the top of the page, select the desired workspace, and click Enter Elastic Algorithm Service (EAS).

  2. Click the icon in the Monitoring column of the service that you want to manage to go to the Monitoring tab.

  3. View the service monitoring information.

    Switch between dashboards

    Dashboards are displayed based on service and instance dimensions. You can switch between dashboards.

    • Service: Service dimension. The default service monitoring dashboard is in the Service-<service_name> format. <service_name> specifies the name of the EAS service.

    • Instance: Instance dimension, which supports single-instance and multi-instance modes.

      • Single Instance: monitoring dashboard for a single instance. You can switch between instances.

      • Multiple Instance: monitoring dashboard for multiple instances. You can select multiple instances to compare their metrics.

    Switch between time ranges

    Use the time range selector on the right side of the Monitoring tab to switch the time range displayed on the dashboard.

    Important

    Minute-level metrics are retained for up to one month; second-level metrics are retained for up to one hour.

View the metrics

Service monitoring dashboard (minute-level)

The following metrics are available on the service monitoring dashboard.

QPS

The number of requests per second (QPS) for the service. The number of requests is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of requests sent to all instances.

Response

The number of responses returned by the service within the specified time range. The number of responses is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of responses returned by all instances.

RT

The response time (RT) of requests.

  • Avg: the average response time of all requests sent at a specific point in time.

  • TPXX: the response time within which XX% of requests are completed. This metric is calculated based on all requests sent at a specific point in time.

For example, TP5 indicates that 5% of requests have a response time less than or equal to this value. TP100 indicates that all requests have a response time less than or equal to this value.

If the service contains multiple instances, TP100 indicates that all requests across all instances have a response time less than or equal to this metric value. For other TP metrics, TPXX indicates the average TPXX value across all instances. For example, TP5 indicates the average of the TP5 values from each individual instance. A short sketch below shows how these percentiles are computed and aggregated.

Daily Invoke

The number of daily calls to the service. The number of calls is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of daily calls to all instances of the service.
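The percentile aggregation described for the RT metric can be summarized in a short Python sketch. This is only an illustration of the rules stated above, not EAS source code; the function names and sample latencies are hypothetical.

```python
# Illustrative TPXX computation and multi-instance aggregation.
import math

def tp(latencies_ms, xx):
    """Return the TPXX value: XX% of requests finish within this time."""
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: smallest value that covers XX% of requests.
    idx = max(0, math.ceil(len(ordered) * xx / 100) - 1)
    return ordered[idx]

def service_tp(per_instance_latencies, xx):
    """Aggregate TPXX across instances as described for multi-instance services."""
    if xx == 100:
        # TP100 covers every request on every instance, i.e. the global maximum.
        return max(max(l) for l in per_instance_latencies)
    # Other TPXX values are the average of each instance's own TPXX.
    values = [tp(l, xx) for l in per_instance_latencies]
    return sum(values) / len(values)

# Example: two instances with illustrative response times in milliseconds.
instances = [[12, 15, 18, 40, 95], [10, 11, 30, 33, 120]]
print(service_tp(instances, 90))   # average of per-instance TP90 values
print(service_tp(instances, 100))  # global maximum response time (120)
```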

More metrics (CPU | Memory | GPU | Network | Resources)

CPU metrics

CPU

The average number of CPU cores used by the service at a specific point in time. Unit: cores. If the service contains multiple instances, this metric indicates the average number of CPU cores used by all instances of the service.

CPU Utilization

The average CPU utilization of the service at a specific point in time. Calculation formula: CPU Utilization = Average number of used CPU cores / Maximum number of available CPU cores. If the service contains multiple instances, this metric indicates the average CPU utilization of all instances of the service. A worked example of the utilization formulas appears after this metric list.

CPU Total

The total number of CPU cores available for the service at a specific point in time. Calculation formula: Number of CPU cores available for a single instance × Number of instances.

Memory metrics

Memory

The average amount of memory used by the service at a specific point in time. If the service contains multiple instances, this metric indicates the average amount of memory used by all instances of the service.

  • RSS: the size of resident physical memory.

  • Cache: the cache size.

  • Total: the maximum physical memory size available for a single instance.

Memory Utilization

The average memory utilization of the service at a specific point in time. Calculation formula: Memory Utilization = RSS/Total. If the service contains multiple instances, this metric indicates the average memory utilization of all instances of the service.

GPU metrics

GPU Utilization

If the deployed service uses GPU resources, this metric indicates the average GPU utilization of the service at a specific point in time. If the service contains multiple instances, this metric indicates the average GPU utilization of all instances of the service.

GPU Memory

If the deployed service uses GPU resources, this metric indicates the GPU memory usage of the service at a specific point in time. If the service contains multiple instances, this metric indicates the average GPU memory usage of all instances of the service.

GPU Total

If the deployed service uses GPU resources, this metric indicates the total number of GPU cards available for the service at a specific point in time. If the service contains multiple instances, this metric indicates the total number of GPU cards of all instances of the service.

GPU Memory Utilization

If the deployed service uses GPU resources, this metric indicates the GPU memory utilization of the service at a specific point in time. If the service contains multiple instances, this metric indicates the average GPU memory utilization of all instances of the service.

Network metrics

Traffic

The amount of data received and sent by the service per second. Unit: bit/s. If the service contains multiple instances, this metric indicates the average amount of data received and sent by all instances of the service.

  • In: the amount of data received by the service.

  • Out: the amount of data sent by the service.

TCP Connections

The number of TCP connections.

Resource metrics

Replicas

The number of instances in different states for the service at a specific point in time, including Total, Pending, and Available.

Replicas By Resource

The number of instances of different resource types for the service at a specific point in time, including Total, Dedicated (dedicated resources), and Public (public resources).
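As a quick illustration of the utilization formulas above, the following Python sketch plugs hypothetical numbers into CPU Total, CPU Utilization, and Memory Utilization. All values are made up for the example.

```python
# Worked example of the dashboard's utilization formulas (hypothetical values).
instances = 3
cores_per_instance = 8          # CPU cores available per instance
avg_cores_used = 5.2            # average cores in use across instances

cpu_total = cores_per_instance * instances             # CPU Total = per-instance cores x instances
cpu_utilization = avg_cores_used / cores_per_instance  # CPU Utilization = used / available

rss_gib = 6.0                   # resident physical memory (RSS)
total_gib = 16.0                # maximum physical memory per instance
memory_utilization = rss_gib / total_gib               # Memory Utilization = RSS / Total

print(f"CPU Total: {cpu_total} cores")                  # 24 cores
print(f"CPU Utilization: {cpu_utilization:.0%}")        # 0.65 -> 65%
print(f"Memory Utilization: {memory_utilization:.0%}")  # 0.375 -> 38%
```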

Single-instance monitoring dashboard (minute-level)

The following metrics are available on the single-instance dashboard.

QPS

The number of requests received by the instance per second. The number of requests is calculated separately by response code.

RT

The response time of requests for the instance.

Response

The total number of responses returned by the instance within a specific time range. The number of responses is calculated separately by response code.

More metrics (CPU | Memory | GPU | Network)

CPU metrics

CPU

The number of CPU cores used by the instance. Unit: cores.

CPU Utilization

The average CPU utilization of the instance. Calculation formula: Average number of used CPU cores/Maximum number of available CPU cores.

Memory metrics

Memory

The memory usage of the instance.

  • RSS: the size of resident physical memory.

  • Cache: the cache size.

  • Total: the maximum physical memory size available for a single instance.

Memory Utilization

The average memory utilization of the instance at a specific point in time. Calculation formula: RSS/Total.

GPU metrics

GPU Utilization

The GPU utilization of the instance.

GPU Memory

The GPU memory usage of the instance.

GPU Memory Utilization

The GPU memory utilization of the instance.

Network metrics

Traffic

The amount of data received and sent by the instance per second. Unit: bit/s.

  • In: the amount of data received by the instance.

  • Out: the amount of data sent by the instance.

TCP Connections

The number of TCP connections.

Multi-instance monitoring dashboard

The following minute-level and second-level metrics are available on the multi-instance dashboard.

  • Minute-Level

    Instance QPS

    The number of requests received by each instance per second. The number of requests is calculated separately by response code.

    Instance RT

    The average response time for each instance.

    Instance CPU

    The number of CPU cores used by each instance. Unit: cores.

    Instance Memory -- RSS

    The size of resident physical memory for each instance.

    Instance Memory -- Cache

    The cache size for each instance.

    Instance GPU

    The GPU utilization for each instance.

    Instance GPU Memory

    The GPU memory usage for each instance.

    Instance TCP Connections

    The number of TCP connections for each instance.

  • Second-Level

    Important

    Data is accurate to 5 seconds. Only data of the most recent hour is retained. A sketch that downsamples second-level data to minute level follows this section.

    Instance QPS Fine

    The number of requests received by each instance per second. The number of requests is calculated separately by response code.

    Instance RT Fine

    The average response time of requests received by each instance.
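Because second-level data is retained for only one hour, it can be useful to downsample it for comparison with the minute-level dashboard. The following Python sketch shows one way to bucket 5-second samples into per-minute averages; the timestamps and QPS values are illustrative.

```python
# Downsample 5-second samples into per-minute averages (illustrative data).
from collections import defaultdict

samples = [  # (unix_timestamp, qps) pairs at 5-second resolution
    (1700000000, 42.0), (1700000005, 45.0), (1700000010, 39.0),
    (1700000060, 50.0), (1700000065, 48.0),
]

buckets = defaultdict(list)
for ts, value in samples:
    buckets[ts - ts % 60].append(value)  # align each sample to the start of its minute

minute_series = {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}
print(minute_series)  # one averaged point per minute
```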

GPU monitoring dashboard

The following GPU-related metrics are available on both the service-level and instance-level dashboards. For service-level metrics, average values are calculated across all instances. A sketch for spot-checking several of these metrics with NVML appears after the list.

GPU Utilization

The GPU utilization of the service at a specific point in time.

GPU Memory

The GPU memory usage and total memory of the service at a specific point in time.

  • Used: the GPU memory usage at a specific point in time.

  • Total: the total GPU memory at a specific point in time.

Memory Copy Utilization

The GPU memory copy utilization of the service at a specific point in time.

GPU Memory Utilization

The GPU memory utilization of the service at a specific point in time. Calculation formula: Memory usage/Total memory.

PCIe

The Peripheral Component Interconnect Express (PCIe) rate of the service at a specific point in time, measured by Data Center GPU Manager (DCGM).

  • PCIe Transmit: the PCIe transmission rate at a specific point in time.

  • PCIe Receive: the PCIe reception rate at a specific point in time.

Memory Bandwidth

The GPU memory bandwidth of the service at a specific point in time.

SM Utilization and Occupancy

Streaming Multiprocessor (SM)-related metrics of the service at a specific point in time. The SM is a core component of the GPU that executes and schedules parallel computing tasks.

  • SM Utilization: the SM utilization at a specific point in time.

  • SM Occupancy: the proportion of active warps residing on the SM at a specific point in time.

Graphics Engine Utilization

The utilization of the GPU graphics engine of the service at a specific point in time.

Pipe Active Ratio

The activity rate of the GPU compute pipeline of the service at a specific point in time.

  • Pipe Fp32 Active Ratio: the FP32 pipeline activity rate at a specific point in time.

  • Pipe Fp16 Active Ratio: the FP16 pipeline activity rate at a specific point in time.

  • Pipe Tensor Active Ratio: the Tensor pipeline activity rate at a specific point in time.

Tflops Usage

The throughput of the GPU compute pipelines of the service, in tera floating-point operations per second (TFLOPS), at a specific point in time.

  • FP32 Tflops Used: the TFLOPS of the FP32 pipeline at a specific point in time.

  • FP16 Tflops Used: the TFLOPS of the FP16 pipeline at a specific point in time.

  • Tensor Tflops Used: the TFLOPS of the Tensor pipeline at a specific point in time.

DRAM Active Ratio

The activity rate of the GPU memory interface for data transmission or reception at a specific point in time.

SM Clock

The SM clock frequency of the service at a specific point in time.

GPU Temperature

The GPU temperature-related metric of the service at a specific point in time.

  • GPU Temperature: the GPU temperature at a specific point in time.

  • GPU Slowdown Temperature: the temperature threshold at which the GPU automatically reduces its operating frequency to prevent overheating.

  • GPU Shutdown Temperature: the temperature threshold at which the system forcibly shuts down the GPU to prevent hardware damage or critical system failures caused by overheating.

Power Usage

The GPU power usage of the service at a specific point in time.
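If you want to cross-check a few of these readings from inside a GPU instance, the following Python sketch queries NVML (via the nvidia-ml-py package) for utilization, memory, temperature, SM clock, and power. The dashboard collects its data through DCGM, so treat this only as an approximate sanity check; it assumes GPU access and the package installed in the instance.

```python
# Spot-check GPU metrics with NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first visible GPU

util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # NVML reports milliwatts

print(f"GPU Utilization:        {util.gpu}%")
print(f"GPU Memory Utilization: {mem.used / mem.total:.0%}")  # Memory usage / Total memory
print(f"GPU Temperature:        {temp} C")
print(f"SM Clock:               {sm_clock} MHz")
print(f"Power Usage:            {power_w:.1f} W")

pynvml.nvmlShutdown()
```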

The following metrics relate to GPU health status and errors.

GPU Health Count

The number of healthy GPU cards of the service at a specific point in time.

GPU Lost Card Num

The number of unavailable GPU cards of the service at a specific point in time.

ECC Error Count

The number of Error Correction Code (ECC) errors of the service at a specific point in time. ECC is used to detect and correct errors in GPU memory that may occur during data transmission or storage processes.

  • Volatile SBE ECC Error: the number of single-bit volatile ECC errors detected in the service at a specific point in time.

  • Volatile DBE ECC Error: the number of double-bit volatile ECC errors detected in the service at a specific point in time.

  • Aggregate SBE ECC Error: the number of single-bit persistent ECC errors detected in the service at a specific point in time.

  • Aggregate DBE ECC Error: the number of double-bit persistent ECC errors detected in the service at a specific point in time.

  • Uncorrectable ECC Error: the number of uncorrectable ECC errors detected in the service at a specific point in time.

NVSwitch Error Count

The number of NVSwitch errors detected in the service at a specific point in time. NVSwitch provides high-bandwidth and low-latency communication channels and enables high-speed communication between multiple GPUs.

  • NVSwitch Fatal Error: the number of fatal NVSwitch errors detected in the service at a specific point in time.

  • NVSwitch Non-Fatal Error: the number of non-fatal NVSwitch errors detected in the service at a specific point in time.

Xid Error Count

The number of Xid errors detected in the service at a specific point in time. Xid errors are error codes reported by GPU drivers to indicate GPU runtime issues. Such errors are usually recorded in system logs as Xid codes, for example in the Linux dmesg log or the Windows Event Viewer. A sketch that extracts Xid codes from kernel log lines appears after this list.

  • Xid Error: the number of non-fatal Xid errors detected in the service at a specific point in time.

  • Fatal Xid Error: the number of fatal Xid errors detected in the service at a specific point in time.

Kernel Error Count

The number of non-Xid errors detected in the service at a specific point in time. Non-Xid errors refer to errors reported in kernel logs other than Xid errors.

Driver Hang

The number of GPU driver hangs detected for the service at a specific point in time.

Remap Status

The status of the GPU when the system attempts to remap a memory row for the service at a specific point in time.
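As noted for the Xid Error Count metric, Xid codes appear in kernel logs such as dmesg. The following Python sketch shows a simple way to extract them from log lines; the pattern and sample lines are illustrative and may not cover every driver version's log format.

```python
# Extract Xid error codes from kernel log lines (illustrative sample data).
import re

XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")

log_lines = [
    "[12345.678] NVRM: Xid (PCI:0000:3b:00): 31, pid=4242, name=python, ...",
    "[12350.001] some unrelated kernel message",
]

for line in log_lines:
    match = XID_PATTERN.search(line)
    if match:
        print(f"Xid {match.group('code')} on device {match.group('pci')}")
```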

vLLM monitoring dashboard

If the service has multiple instances, throughput-related metrics are summed across all instances, and latency-related metrics are averaged across all instances. A sketch that reads the underlying vLLM metrics follows the list.

Requests Num

The number of all requests of the service at a specific point in time.

  • Running: the number of requests running on the GPU at a specific point in time.

  • Waiting: the number of requests waiting to be processed at a specific point in time.

  • Swapped: the number of requests that are swapped out to the CPU at a specific point in time.

Token Throughput

The number of input and output tokens processed per second for all requests of the service at a specific point in time.

  • TPS_IN: the number of input tokens processed per second at a specific point in time.

  • TPS_OUT: the number of output tokens generated per second at a specific point in time.

Time To First Token

The first token latency for all requests of the service at a specific point in time, which indicates the time from when a request is received to when the first token is generated.

  • Avg: the average of first token latency for all requests at a specific point in time.

  • TPXX: the percentile value of first token latency for all requests at a specific point in time.

Time Per Output Token

The per-token latency for all requests of the service at a specific point in time, which indicates the average time required to generate each output token after the first token.

  • Avg: the average of per-token latency for all requests at a specific point in time.

  • TPXX: the percentile value of per-token latency for all requests at a specific point in time.

E2E Request Latency

The end-to-end latency for all requests of the service at a specific point in time, which indicates the time from when a request is received to when all tokens are returned.

  • Avg: the average of end-to-end latency for all requests at a specific point in time.

  • TPXX: the percentile value of end-to-end latency for all requests at a specific point in time.

Request Params N

The average value of the sampling parameter n (the number of output sequences generated per request) for all requests of the service at a specific point in time.

GPU Cache Usage

The average usage rate of the GPU KV cache of the service at a specific point in time.

CPU Cache Usage

The average usage rate of the CPU KV cache of the service at a specific point in time.

Prefix Cache Hit Rate

The average prefix cache hit rate for all requests of the service at a specific point in time.

  • GPU: the average hit rate of the GPU prefix cache for all requests at a specific point in time.

  • CPU: the average hit rate of the CPU prefix cache for all requests at a specific point in time.
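The dashboard metrics above are derived from vLLM's own Prometheus counters and gauges, which vLLM's OpenAI-compatible server exposes on a /metrics endpoint. The following Python sketch reads a few of them directly; the endpoint URL is a placeholder, and metric names can vary across vLLM versions.

```python
# Read selected vLLM Prometheus metrics from the server's /metrics endpoint.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # placeholder; adjust to your endpoint
WATCHED = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)

with urllib.request.urlopen(METRICS_URL) as resp:
    for line in resp.read().decode().splitlines():
        # Print only the gauge lines we care about, skipping HELP/TYPE comments.
        if line.startswith(WATCHED):
            print(line)
```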

BladeLLM monitoring dashboard

If the service has multiple instances, throughput-related metrics are summed across all instances, and latency-related metrics are averaged across all instances. A sketch that derives token throughput from cumulative counters follows the list.

Token Throughput

The number of input and output tokens processed per second for all requests of the service at a specific point in time.

  • TPS_IN: the number of input tokens processed per second at a specific point in time.

  • TPS_OUT: the number of output tokens generated per second at a specific point in time.

Prompt Length

The average number of prompt tokens for all requests of the service at a specific point in time.

Time To First Token

The first token latency for all requests of the service at a specific point in time, which indicates the time from when a request is received to when the first token is generated.

  • Avg: the average of first token latency for all requests at a specific point in time.

  • Min: the minimum first token latency for all requests at a specific point in time.

  • TPXX: the percentile value of first token latency for all requests at a specific point in time.

Time Per Output Token

The per-token latency for all requests of the service at a specific point in time, which indicates the average time required to generate each output token after the first token.

  • Avg: the average of per-token latency for all requests at a specific point in time.

  • Min: the minimum per-token latency for all requests at a specific point in time.

  • TPXX: the percentile value of per-token latency for all requests at a specific point in time.

Decode Latency

The time required by the service to decode tokens at a specific point in time.

Ragged Latency

The time required for processing batches that contain both prefill and decode requests at a specific point in time.

Prefill Batch Size

The size of the prefill batch processed by the service at a specific point in time.

Decode Batch Size

The size of the decode batch processed by the service at a specific point in time.

GPU Block Usage

The average block utilization of GPU KV cache for the service at a specific point in time.

Wait Queue Size

The number of requests waiting in the queue to be scheduled for the service at a specific point in time.

Scheduler Step Latency

The time required for scheduling all requests of the service at a specific point in time.

Worker Bubble

The average idle time of GPU workers for the service at a specific point in time.

Updated Tokens

The average time required by a service worker to generate a token at a specific point in time.

Chunk Util

The percentage of prefill tokens relative to the chunk size for the service at a specific point in time.
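Throughput values such as TPS_IN and TPS_OUT are rates derived from cumulative token counters. The following Python sketch shows the generic computation over a sampling interval; the counter values are hypothetical and no BladeLLM-specific metric names are assumed.

```python
# Derive a tokens-per-second rate from two samples of a cumulative counter.
def tokens_per_second(count_start, count_end, seconds_elapsed):
    """Rate of a monotonically increasing token counter over an interval."""
    return (count_end - count_start) / seconds_elapsed

# Two samples taken 60 seconds apart (hypothetical values).
tps_in = tokens_per_second(1_200_000, 1_260_000, 60)   # input tokens
tps_out = tokens_per_second(480_000, 516_000, 60)      # output tokens
print(f"TPS_IN: {tps_in:.0f} tokens/s, TPS_OUT: {tps_out:.0f} tokens/s")
```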
