Platform for AI: View service monitoring and metric information

Last Updated: Dec 15, 2025

After you deploy a service by using Elastic Algorithm Service (EAS) of Platform for AI (PAI), you can view service-related metrics on the Monitoring tab to understand how the service is called and how it is running. This topic describes how to view service monitoring information and provides detailed descriptions of the monitoring metrics.

View the service monitoring information

  1. Log on to the PAI console. Select a region in the top navigation bar. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. Click the name of the target service, and then click the Monitoring tab.

  3. View the service monitoring information.

    Switch between dashboards

    Dashboards are provided for the service dimension and the instance dimension. You can switch between them.

    • Service: Service dimension. The service monitoring dashboard is named in the Service-<service_name> format, where <service_name> is the name of the EAS service.

    • Instance: Instance dimension, which supports single-instance and multi-instance modes.

      • Single Instance: monitoring dashboard for a single instance. You can switch between instances.

      • Multiple Instance: monitoring dashboard for multiple instances. You can select multiple instances to compare their metrics.

    Switch between time ranges

    Click the time range selector on the right side of the Monitoring tab to switch the time range displayed on the dashboard.

    Important

    Minute-level metrics can be retained for up to one month, and second-level metrics can be retained for up to one hour.

    Important

    To display LLM-related metrics, set the service tag to ServiceEngineType: vllm or ServiceEngineType: sglang.
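
For example, tags can be attached when you create or update the service. The following sketch renders a minimal, hypothetical EAS service description with the tag set; the labels field name and the service name are assumptions, so check the EAS service configuration reference for the exact schema:

```python
import json

# Hypothetical EAS service description with the ServiceEngineType tag set so
# that LLM-related metrics are displayed. The "labels" field name is an
# assumption -- verify it against the EAS configuration reference.
service_config = {
    "name": "my_llm_service",          # hypothetical service name
    "labels": {
        "ServiceEngineType": "vllm",   # or "sglang"
    },
}
print(json.dumps(service_config, indent=2))
```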

View the metrics

Service monitoring dashboard (minute-level)

The following metrics are available on the service monitoring dashboard.

QPS

The number of requests per second (QPS) for the service. The number of requests is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of requests sent to all instances. 1d offset indicates the QPS data from the same time on the previous day, which can be used for comparative analysis.

Response

The number of responses returned by the service within the specified time range. The number of responses is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of responses returned by all instances.

RT

The response time (RT) of requests.

  • Avg: the average response time of all requests sent at a specific point in time.

  • TPXX: the response time within which XX% of requests are completed. This metric is calculated based on all requests sent at a specific point in time.

For example, TP5 indicates that 5% of requests have a response time less than or equal to this value. TP100 indicates that all requests have a response time less than or equal to this value.

If the service contains multiple instances, TP100 indicates that all requests across all instances have a response time less than or equal to this metric value. For other TP metrics, TPXX indicates the average TPXX value across all instances. For example, TP5 indicates the average of the TP5 values from each individual instance. For a worked example of these calculations, see the sketch after this list.

Daily Invoke

The number of daily calls to the service. The number of calls is calculated separately by response code. If the service contains multiple instances, this metric indicates the total number of daily calls to all instances of the service.
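
As referenced in the RT description above, the Avg and TPXX values reduce to simple percentile arithmetic. A minimal Python sketch with hypothetical response times, including the multi-instance averaging rule for TPXX:

```python
import math

def tpxx(rts, xx):
    """Smallest response time that covers xx% of the requests."""
    ordered = sorted(rts)
    idx = max(0, math.ceil(xx / 100 * len(ordered)) - 1)
    return ordered[idx]

# Hypothetical per-instance response times, in milliseconds.
instances = {
    "instance-0": [12, 15, 18, 40, 95],
    "instance-1": [10, 14, 22, 35, 120],
}

all_rts = [rt for rts in instances.values() for rt in rts]
avg = sum(all_rts) / len(all_rts)      # Avg over all requests
tp100 = max(all_rts)                   # TP100: maximum across all instances
# Other TPXX values: average of the per-instance percentiles.
tp80 = sum(tpxx(rts, 80) for rts in instances.values()) / len(instances)
print(f"Avg: {avg:.1f} ms, TP100: {tp100} ms, TP80 (instance average): {tp80:.1f} ms")
```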

More metrics (CPU | Memory | GPU | Network | Resources)

CPU

CPU

The average number of CPU cores used by the service at a specific point in time. Unit: cores. If the service contains multiple instances, this metric indicates the average number of CPU cores used by all instances of the service.

CPU Utilization

The average CPU utilization of the service at a specific point in time. Calculation formula: CPU Utilization = Average number of used CPU cores/Maximum number of available CPU cores. If the service contains multiple instances, this metric indicates the average CPU utilization of all instances of the service. For a worked numeric example, see the sketch after this list.

CPU Total

The total number of CPU cores available for the service at a specific point in time. Calculation formula: Number of CPU cores available for a single instance × Number of instances.

Memory

Memory

The average amount of memory used by the service at a specific point in time. If the service contains multiple instances, this metric indicates the average amount of memory used by all instances of the service.

  • RSS: the size of resident physical memory.

  • Cache: the cache size.

  • Total: the maximum physical memory size available for a single instance.

Memory Utilization

The average memory utilization of the service at a specific point in time. Calculation formula: Memory Utilization = RSS/Total. If the service contains multiple instances, this metric indicates the average memory utilization of all instances of the service.

GPU

GPU Utilization

If the deployed service uses GPU resources, this metric indicates the average GPU utilization of the service at a specific point in time. If the service contains multiple instances, this metric indicates the average GPU utilization of all instances of the service.

GPU Memory

If the deployed service uses GPU resources, this metric indicates the GPU memory usage of the service at a specific point in time. If the service contains multiple instances, this metric indicates the average GPU memory usage of all instances of the service.

GPU Total

If the deployed service uses GPU resources, this metric indicates the total number of GPU cards available for the service at a specific point in time. If the service contains multiple instances, this metric indicates the total number of GPU cards of all instances of the service.

GPU Memory Utilization

If the deployed service uses GPU resources, this metric indicates the GPU memory utilization of the service at a specific point in time. If the service contains multiple instances, this metric indicates the average GPU memory utilization of all instances of the service.

Network

Traffic

The amount of data received and sent by the service per second. Unit: bit/s. If the service contains multiple instances, this metric indicates the average amount of data received and sent by all instances of the service.

  • In: the amount of data received by the service.

  • Out: the amount of data sent by the service.

TCP Connections

The number of TCP connections.

Resources

Replicas

The number of instances in different states for the service at a specific point in time, including Total, Pending, and Available.

Replicas By Resource

The number of instances of different resource types for the service at a specific point in time, including Total, Dedicated (dedicated resources), and Public (public resources).
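
As referenced in the CPU Utilization description above, the utilization formulas reduce to simple ratios. A minimal numeric sketch, with all numbers made up:

```python
# Hypothetical service: 3 instances, 8 CPU cores and 32 GiB of memory each.
cores_per_instance = 8
used_cores = [3.2, 4.1, 2.7]                      # average used cores per instance

cpu_total = cores_per_instance * len(used_cores)  # CPU Total = cores per instance x instances
cpu_util = sum(u / cores_per_instance for u in used_cores) / len(used_cores)

total_mem = 32 * 2**30                            # bytes available per instance
rss = [10 * 2**30, 12 * 2**30, 9 * 2**30]         # resident memory (RSS) per instance
mem_util = sum(r / total_mem for r in rss) / len(rss)  # Memory Utilization = RSS/Total

print(f"CPU Total: {cpu_total} cores")
print(f"CPU Utilization: {cpu_util:.1%}")
print(f"Memory Utilization: {mem_util:.1%}")
```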

Single-instance monitoring dashboard (minute-level)

The following metrics are available on the single-instance dashboard.

QPS

The number of requests received by the instance per second. The number of requests is calculated separately by response code.

RT

The response time of requests for the instance.

Response

The total number of responses returned by the instance within a specific time range. The number of responses is calculated separately by response code.

More metrics (CPU | Memory | GPU | Network)

CPU

CPU

The number of CPU cores used by the instance. Unit: cores.

CPU Utilization

The average CPU utilization of the instance. Calculation formula: Average number of used CPU cores/Maximum number of available CPU cores.

Memory

Memory

The memory usage of the instance.

  • RSS: the size of resident physical memory.

  • Cache: the cache size.

  • Total: the maximum physical memory size available for a single instance.

Memory Utilization

The average memory utilization of the instance at a specific point in time. Calculation formula: RSS/Total.

GPU

GPU Utilization

The GPU utilization of the instance.

GPU Memory

The GPU memory usage of the instance.

GPU Memory Utilization

The GPU memory utilization of the instance.

Network

Traffic

The amount of data received and sent by the instance per second. Unit: bit/s.

  • In: the amount of data received by the instance.

  • Out: the amount of data sent by the instance.

TCP Connections

The number of TCP connections.

Multi-instance monitoring dashboard

The following minute-level and second-level metrics are available on the multi-instance dashboard.

  • Minute-Level

    Instance QPS

    The number of requests received by each instance per second. The number of requests is calculated separately by response code.

    Instance RT

    The average response time for each instance.

    Instance CPU

    The number of CPU cores used by each instance. Unit: cores.

    Instance Memory -- RSS

    The size of resident physical memory for each instance.

    Instance Memory -- Cache

    The cache size for each instance.

    Instance GPU

    The GPU utilization for each instance.

    Instance GPU Memory

    The GPU memory usage for each instance.

    Instance TCP Connections

    The number of TCP connections for each instance.

  • Second-Level

    Important

    Data has a granularity of 5 seconds. Only data from the most recent hour is retained.

    Instance QPS Fine

    The number of requests received by each instance per second. The number of requests is calculated separately by response code.

    Instance RT Fine

    The average response time of requests received by each instance.

GPU monitoring dashboard

The following GPU-related metrics are available on both the service-level and instance-level dashboards. For service-level metrics, average values are calculated across all instances.

GPU Utilization

The GPU utilization of the service at a specific point in time.

GPU Memory

The GPU memory usage and total memory of the service at a specific point in time.

  • Used: the GPU memory usage at a specific point in time.

  • Total: the total GPU memory at a specific point in time.

Memory Copy Utilization

The GPU memory copy utilization of the service at a specific point in time.

GPU Memory Utilization

The GPU memory utilization of the service at a specific point in time. Calculation formula: Memory usage/Total memory.

PCIe

The Peripheral Component Interconnect Express (PCIe) rate of the service at a specific point in time, measured by Data Center GPU Manager (DCGM).

  • PCIe Transmit: the PCIe transmission rate at a specific point in time.

  • PCIe Receive: the PCIe reception rate at a specific point in time.

Memory Bandwidth

The GPU memory bandwidth of the service at a specific point in time.

SM Utilization and Occupancy

Streaming multiprocessor (SM)-related metrics of the service at a specific point in time. The SM is a core component of a GPU that executes and schedules parallel computing tasks.

  • SM Utilization: the SM utilization at a specific point in time.

  • SM Occupancy: the proportion of active warps residing on the SM at a specific point in time.

Graphics Engine Utilization

The utilization of the GPU graphics engine of the service at a specific point in time.

Pipe Active Ratio

The activity rate of the GPU compute pipeline of the service at a specific point in time.

  • Pipe Fp32 Active Ratio: the FP32 pipeline activity rate at a specific point in time.

  • Pipe Fp16 Active Ratio: the FP16 pipeline activity rate at a specific point in time.

  • Pipe Tensor Active Ratio: the Tensor pipeline activity rate at a specific point in time.

Tflops Usage

The Tera floating-point operations per second (TFLOPS) of the GPU compute pipeline of the service at a specific point in time.

  • FP32 Tflops Used: the TFLOPS of the FP32 pipeline at a specific point in time.

  • FP16 Tflops Used: the TFLOPS of the FP16 pipeline at a specific point in time.

  • Tensor Tflops Used: the TFLOPS of the Tensor pipeline at a specific point in time.

DRAM Active Ratio

The activity rate of the GPU memory interface for data transmission or reception at a specific point in time.

SM Clock

The SM clock frequency of the service at a specific point in time.

GPU Temperature

GPU temperature-related metrics of the service at a specific point in time.

  • GPU Temperature: the GPU temperature at a specific point in time.

  • GPU Slowdown Temperature: the temperature threshold at which the GPU automatically reduces its operating frequency to prevent overheating.

  • GPU Shutdown Temperature: the temperature threshold at which the system forcibly shuts down the GPU to prevent hardware damage or critical system failures caused by overheating.

Power Usage

The GPU power usage of the service at a specific point in time.
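
The PCIe description above notes that these metrics are measured by DCGM. For a quick local spot check, roughly comparable values can be read through NVML. The following sketch uses the nvidia-ml-py (pynvml) bindings and assumes one visible GPU; it is an illustration, not how EAS itself collects the data:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu ~ GPU Utilization, .memory ~ Memory Copy Utilization
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used/.total ~ GPU Memory (bytes)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)  # MHz

print(f"GPU Utilization: {util.gpu}%  Memory Copy Utilization: {util.memory}%")
print(f"GPU Memory: {mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB")
print(f"Temperature: {temp} C  Power: {power_w:.0f} W  SM Clock: {sm_clock} MHz")

pynvml.nvmlShutdown()
```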

The following metrics describe GPU health status and error information.

GPU Health Count

The number of healthy GPU cards of the service at a specific point in time.

GPU Lost Card Num

The number of unavailable GPU cards of the service at a specific point in time.

ECC Error Count

The number of Error Correction Code (ECC) errors of the service at a specific point in time. ECC is used to detect and correct errors in GPU memory that may occur during data transmission or storage processes.

  • Volatile SBE ECC Error: the number of single-bit volatile ECC errors detected in the service at a specific point in time.

  • Volatile DBE ECC Error: the number of double-bit volatile ECC errors detected in the service at a specific point in time.

  • Aggregate SBE ECC Error: the number of single-bit persistent ECC errors detected in the service at a specific point in time.

  • Aggregate DBE ECC Error: the number of double-bit persistent ECC errors detected in the service at a specific point in time.

  • Uncorrectable ECC Error: the number of uncorrectable ECC errors detected in the service at a specific point in time.

NVSwitch Error Count

The number of NVSwitch errors detected in the service at a specific point in time. NVSwitch provides high-bandwidth and low-latency communication channels and enables high-speed communication between multiple GPUs.

  • NVSwitch Fatal Error: the number of fatal NVSwitch errors detected in the service at a specific point in time.

  • NVSwitch Non-Fatal Error: the number of non-fatal NVSwitch errors detected in the service at a specific point in time.

Xid Error Count

The number of Xid errors detected in the service at a specific point in time. Xid errors are error codes reported by the GPU driver to indicate GPU runtime issues. Such errors are usually recorded in system logs as Xid codes, for example in the dmesg log on Linux or the Event Viewer on Windows.

  • Xid Error: the number of non-fatal Xid errors detected in the service at a specific point in time.

  • Fatal Xid Error: the number of fatal Xid errors detected in the service at a specific point in time.

Kernel Error Count

The number of non-Xid errors detected in the service at a specific point in time. Non-Xid errors refer to errors reported in kernel logs other than Xid errors.

Driver Hang

The number of times the GPU driver of the service hangs (stops responding) at a specific point in time.

Remap Status

The status of the GPU when the system attempts to remap a memory row for the service at a specific point in time.
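
The Xid Error Count description above notes that Xid errors are recorded in kernel logs. A minimal sketch that scans dmesg output for NVIDIA Xid lines; the log format shown is typical but varies across driver versions:

```python
import re
import subprocess

# NVIDIA driver Xid lines typically look like:
#   NVRM: Xid (PCI:0000:3b:00): 31, pid=1234, ...
output = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in output.splitlines():
    match = re.search(r"NVRM: Xid \(([^)]+)\): (\d+)", line)
    if match:
        print(f"device={match.group(1)} xid={match.group(2)} :: {line.strip()}")
```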

vLLM monitoring dashboard

If a service has multiple instances, throughput metrics show the total across all instances, and latency metrics show the average across all instances.

Requests Status

The total number of requests for the service at a specific point in time.

  • Running: The number of requests running on the GPU at a specific point in time.

  • Waiting: The number of requests waiting to be processed at a specific point in time.

  • Swapped: The number of requests swapped to the CPU at a specific point in time.

Token Throughput

The input and output token throughput for all requests of the service at a specific point in time, in tokens per second.

  • TPS_IN: The number of input tokens processed per second.

  • TPS_OUT: The number of output tokens generated per second.

Request Completion Status

Statistics on the completion status of all requests for the service at a specific point in time.

  • preemptions: The request was preempted.

  • stop: The request completed successfully because the model output a stop token, such as <EOS>.

  • length: The request reached the maximum output token length.

  • abort: The request was aborted.

Time To First Token

The time to first token (TTFT) latency for all requests of the service at a specific point in time. This is the time from when a request is received to when the first token is generated.

  • Avg: The average TTFT latency for all requests at a specific point in time.

  • TPXX: The percentile values for the TTFT latency of all requests at a specific point in time.

Time Per Output Token

The time per output token (TPOT) latency for all requests of the service at a specific point in time. This is the average time required to generate each output token after the first token.

  • Avg: The average TPOT latency for all requests at a specific point in time.

  • TPXX: The percentile values for the TPOT latency of all requests at a specific point in time.

E2E Request Latency

The end-to-end latency for all requests of the service at a specific point in time. This is the time from when a request is received to when all tokens are returned.

  • Avg: The average end-to-end latency for all requests at a specific point in time.

  • TPXX: The percentile values for the end-to-end latency of all requests at a specific point in time.

Queue Time

The queue time latency for all requests of the service at a specific point in time. This is the time a request waits in the queue to be processed by the engine.

  • Avg: The average queue time latency for all requests at a specific point in time.

  • TPXX: The percentile values for the queue time latency of all requests at a specific point in time.

Inference Time

The inference latency for all requests of the service at a specific point in time. This is the time the engine takes to process a request.

  • Avg: The average inference latency for all requests at a specific point in time.

  • TPXX: The percentile values for the inference latency of all requests at a specific point in time.

Prefill Time

The prefill phase latency for all requests of the service at a specific point in time. This is the time the engine takes to process the input tokens of a request.

  • Avg: The average prefill latency for all requests at a specific point in time.

  • TPXX: The percentile values for the prefill latency of all requests at a specific point in time.

Decode Time

The decode phase latency for all requests of the service at a specific point in time. This is the time the engine takes to generate the output tokens.

  • Avg: The average decode latency for all requests at a specific point in time.

  • TPXX: The percentile values for the decode latency of all requests at a specific point in time.

Input Token Length

The number of input tokens processed by the service at a specific point in time.

  • Avg: The average input token length for all requests at a specific point in time.

  • TPXX: The percentile values for the input token length of all requests at a specific point in time.

Output Token Length

The number of output tokens generated by the service at a specific point in time.

  • Avg: The average output token length for all requests at a specific point in time.

  • TPXX: The percentile values for the output token length of all requests at a specific point in time.

Request Parameters (params_n & max_tokens)

The values of the n and max_tokens parameters for all requests of the service at a specific point in time.

  • Params_n: The average value of the n parameter for all requests at a specific point in time.

  • Params_max_tokens: The average value of the max_tokens parameter for all requests at a specific point in time.

GPU KV Cache Usage

The average GPU KV cache usage of the service at a specific point in time.

CPU KV Cache Usage

The average CPU KV cache usage of the service at a specific point in time.

Prefix Cache Hit Rate

The average prefix cache hit rate for all requests of the service at a specific point in time.

  • GPU: The average GPU prefix cache hit rate for all requests at a specific point in time.

  • CPU: The average CPU prefix cache hit rate for all requests at a specific point in time.

HTTP Requests by Endpoint

The number of requests for the service at a specific point in time, grouped by request method, path, and response status code.

HTTP Request Latency

The average latency for different request paths of the service at a specific point in time.

Speculative Decoding Throughput

The number of speculative decodings for the service at a specific point in time. If the service has multiple instances, this metric is the average value of all instances.

  • Drafts: The number of speculative decoding drafts generated at a specific point in time.

  • Draft Tokens: The number of draft tokens proposed at a specific point in time.

  • Accepted Tokens: The number of draft tokens accepted at a specific point in time.

  • Emitted Tokens: The number of tokens emitted to the output at a specific point in time.

Speculative Decoding Efficiency

The speculative decoding performance of the service at a specific point in time.

  • Draft Acceptance Rate: The average acceptance rate of draft tokens at a specific point in time.

  • Efficiency: The average efficiency of speculative decoding at a specific point in time.

Token Acceptance by Position

The number of draft tokens accepted at different generation positions for the service at a specific point in time. If the service has multiple instances, this metric is the average value of all instances.
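
The panels above correspond to metrics that the vLLM engine exposes on its Prometheus /metrics endpoint. A minimal sketch that reads a few of these series from a locally running vLLM server; the URL is an assumption:

```python
import urllib.request

URL = "http://localhost:8000/metrics"  # hypothetical local vLLM server

# A few series that back the dashboard panels above.
WANTED = (
    "vllm:num_requests_running",     # Requests Status -- Running
    "vllm:num_requests_waiting",     # Requests Status -- Waiting
    "vllm:gpu_cache_usage_perc",     # GPU KV Cache Usage
    "vllm:prompt_tokens_total",      # feeds TPS_IN
    "vllm:generation_tokens_total",  # feeds TPS_OUT
)

with urllib.request.urlopen(URL) as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith(WANTED):
            print(line)
```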

SGLang monitoring dashboard

If a service has multiple instances, throughput metrics show the total across all instances. Latency metrics show the average across all instances.

Requests Num

The total number of requests for the service at a given time.

  • Running: The number of requests running on the GPU at that time.

  • Waiting: The number of requests waiting for processing at that time.

Token Throughput

The input and output token throughput for all service requests at a given time, in tokens per second.

  • TPS_IN: The number of input tokens processed per second.

  • TPS_OUT: The number of output tokens generated per second.

Time To First Token

The time to first token (TTFT) for all service requests at a given time. TTFT measures the time from when a request is received to when the first token is generated.

  • Avg: The average TTFT for all requests at that time.

  • TPXX: The percentile values for the TTFT of all requests at that time.

Time Per Output Token

The time per output token for all service requests at a given time. This metric measures the average time to generate each subsequent output token after the first one.

  • Avg: The average time per output token for all requests at that time.

  • TPXX: The percentile values for the time per output token of all requests at that time.

E2E Request Latency

The end-to-end latency for all service requests at a given time. End-to-end latency measures the time from when a request is received to when all tokens are returned.

  • Avg: The average end-to-end latency for all requests at that time.

  • TPXX: The percentile values for the end-to-end latency of all requests at that time.

Cache Hit Rate

The average prefix cache hit rate for all service requests at a given time.

Used Tokens Num

The number of key-value (KV) cache tokens that the service uses at a given time. If the service has multiple instances, this metric shows the average value across all instances.

Token Usage

The average KV cache token usage rate for the service at a given time. If the service has multiple instances, this metric shows the average value across all instances.
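
TTFT, time per output token, and end-to-end latency as defined above can also be approximated from the client side by timing a streaming call. A sketch against an OpenAI-compatible endpoint; the URL and model name are placeholders, and each stream chunk is approximated as one token:

```python
import json
import time
import urllib.request

URL = "http://localhost:8000/v1/completions"  # hypothetical endpoint
payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 64, "stream": True}
req = urllib.request.Request(
    URL, data=json.dumps(payload).encode(), headers={"Content-Type": "application/json"}
)

start = time.monotonic()
first_token_at = None
chunks = 0
with urllib.request.urlopen(req) as resp:
    for raw in resp:                       # iterate over server-sent event lines
        line = raw.decode("utf-8", "replace").strip()
        if not line.startswith("data:") or line.endswith("[DONE]"):
            continue
        if first_token_at is None:
            first_token_at = time.monotonic()
        chunks += 1
end = time.monotonic()

ttft = first_token_at - start                        # Time To First Token
tpot = (end - first_token_at) / max(chunks - 1, 1)   # approx time per output token
print(f"TTFT: {ttft:.3f}s  ~TPOT: {tpot:.3f}s  E2E: {end - start:.3f}s")
```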
