Use the Application Monitoring Python agent to observe vLLM and SGLang inference engines.
Application Real-Time Monitoring Service (ARMS) currently supports observability only for the vLLM and SGLang frameworks.
Set up observability for PAI-EAS
Elastic Algorithm Service (EAS) is a PAI service for deploying and serving models online. To enable ARMS observability for a vLLM or SGLang model deployed on EAS, follow these steps.
Step 1: Prepare environment variables
export ARMS_APP_NAME=xxx # The name of the EAS application.
export ARMS_REGION_ID=xxx # The region ID for your Alibaba Cloud account.
export ARMS_LICENSE_KEY=xxx # The Alibaba Cloud license key.
Step 2: Modify the PAI-EAS run command
Log on to the PAI console. At the top of the page, select the target region, and then navigate to the target workspace.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS).
On the Inference Service tab, find the application for which you want to enable model observability, and then click Update in the Actions column.
Modify the run command.
The following example uses the DeepSeek-R1-Distill-Qwen-7B model.
Original vLLM command:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
Modified vLLM command with ARMS observability:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
The modifications include the following steps:
Configure the PyPI repository. You can adjust this as needed.
export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Install the agent with the installer. Replace cn-hangzhou with your actual region.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
Prefix the original startup command with the ARMS environment variables and the aliyun-instrument command. Replace the application name, license key, and region with your own values.
Original SGLang command:
python -m sglang.launch_server --model-path /model_dir
Modified SGLang command with ARMS observability:
export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir
The modifications include the following steps:
Configure the PyPI repository. You can adjust this as needed.
export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Install the agent with the installer. Replace cn-hangzhou with your actual region.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
Prefix the original startup command with the ARMS environment variables and the aliyun-instrument command. Replace the application name, license key, and region with your own values.
Click Update.
Set up observability in other environments
ARMS supports the official versions of vLLM (V0 and V1) and SGLang. Custom-modified versions are not supported. For more information about supported versions, see LLM (large language model) services.
ARMS collects two spans for non-streaming requests and three spans for streaming requests. The following table describes the supported scenarios.
Supported scenario | Data processing | Collected content | vLLM V0 | vLLM V1 | SGLang |
Chat or completion | Streaming | span | Supported | Supported | Supported |
Chat or completion | Streaming | Key metrics (TTFT/TPOT) | Supported | Supported | Supported |
Chat or completion | Non-streaming | span | Supported | Supported | Supported |
Chat or completion | Non-streaming | Key metrics (TTFT/TPOT) | Not applicable | Not applicable | Not applicable |
Embedding | | http | Not supported | Supported | Not supported |
Rerank | | http | Not supported | Supported | Not supported |
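TTFT and TPOT relate to the request latencies in a simple way: TPOT is the decode time spread over the tokens generated after the first one. A minimal Python sketch (the helper name is illustrative, not part of the ARMS agent API):

```python
def time_per_output_token(e2e_latency_s, ttft_s, output_tokens):
    """Approximate TPOT from end-to-end latency, TTFT, and output length.

    Illustrative helper, not an agent API: decode time divided by the
    number of tokens generated after the first one.
    """
    if output_tokens <= 1:
        return 0.0
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)

# A 4.2 s request whose first token arrived after 0.7 s and that produced
# 351 output tokens averages 10 ms per subsequent token.
print(round(time_per_output_token(4.2, 0.7, 351), 3))  # 0.01
```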
Span attributes
Attributes of the llm_request span:
Attribute | Description |
gen_ai.latency.e2e | End-to-end time |
gen_ai.latency.time_in_queue | Time in queue |
gen_ai.latency.time_in_scheduler | Scheduling time |
gen_ai.latency.time_to_first_token | Time to first token |
gen_ai.request.id | Request ID |
Metric descriptions
vLLM
Dimension descriptions
Dimension name | Dimension key | Example | Description |
Model name | modelName / model_name | qwen-7b, llama3-8b | Model name |
Engine index | engine_index | 0, 1, 2 | For V1 only, engine instance index |
Operation type | spanKind | LLM | LLM type operation |
Usage type | usageType | input, output | For Token-related metrics only, indicates the Token type |
End reason | finished_reason | stop | Reason for request termination |
Common metrics (V0/V1 shared)
Metric name | Metric | Metric type | Unit | Description |
Iterations | vllm_iter_count | Counter | None | Iteration count |
Successful requests | gen_ai_vllm_request_success | Counter | None | Number of successfully processed requests |
Time to first token | genai_llm_first_token_seconds | Counter | Seconds | Time to generate the first token |
Time per output token | gen_ai_server_time_per_output_token | Counter | Seconds | Time to generate each output token |
End-to-end request duration | gen_ai_server_request_duration | Counter | Seconds | Request end-to-end latency |
Token usage | llm_usage_tokens | Counter | None | Number of tokens used (distinguishes input/output) |
V0 system metrics
Metric name | Metric | Metric type | Unit | Description |
GPU cache usage | gpu_cache_usage_sys | Gauge | None | System GPU cache usage |
CPU cache usage | cpu_cache_usage_sys | Gauge | None | System CPU cache usage |
Number of running sequences | num_running_sys | Gauge | None | Number of currently running sequences |
Number of waiting sequences | num_waiting_sys | Gauge | None | Number of sequences waiting to be processed |
Swapped sequences | num_swapped_sys | Gauge | None | Number of swapped sequences |
V0 iteration metrics
Metric name | Metric | Metric type | Unit | Description |
Iteration prompt token count | num_prompt_tokens_iter | Counter | None | Number of prompt tokens in the current iteration |
Iteration generated token count | num_generation_tokens_iter | Counter | None | Number of generated tokens in the current iteration |
Iteration total token count | num_tokens_iter | Counter | None | Total number of tokens in the current iteration |
Iteration preemption count | num_preemption_iter | Counter | None | Number of preemptions in the current iteration |
V1 system metrics
Metric name | Metric | Metric type | Unit | Description |
Running requests | gen_ai_vllm_num_requests_running | Gauge | None | Number of requests in the model execution batch |
Number of pending requests | gen_ai_vllm_num_requests_waiting | Gauge | None | Number of requests waiting to be processed |
KV cache usage | gen_ai_vllm_kv_cache_usage_perc | Gauge | None | KV cache usage, range [0,1] |
Prefix cache queries | gen_ai_vllm_prefix_cache_queries | Counter | None | Number of prefix cache queries (counted by query tokens) |
Prefix cache hits | gen_ai_vllm_prefix_cache_hits | Counter | None | Number of prefix cache hits (counted by cached tokens) |
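Because both prefix-cache counters are expressed in tokens, dividing hits by queries gives a cache hit ratio. A small Python sketch (the function name is illustrative, not part of the agent):

```python
def prefix_cache_hit_rate(hits, queries):
    """Hit ratio from the two V1 prefix-cache token counters.

    Illustrative helper: cached tokens hit divided by tokens queried.
    """
    return hits / queries if queries else 0.0

# 12,800 cached tokens hit out of 51,200 queried tokens -> 25% hit rate
print(prefix_cache_hit_rate(12800, 51200))  # 0.25
```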
V1 iteration metrics
Metric name | Metric | Metric type | Unit | Description |
Preemptions | gen_ai_vllm_num_preemptions | Counter | None | Cumulative engine preemptions |
Prompt tokens | gen_ai_vllm_prompt_tokens | Counter | None | Number of prefill tokens processed |
Generated tokens | gen_ai_vllm_generation_tokens | Counter | None | Number of generated tokens processed |
Request parameter n | gen_ai_vllm_request_params_n | Counter | None | Value of request parameter n |
Request parameter max_tokens | gen_ai_vllm_request_params_max_tokens | Counter | None | Value of request parameter max_tokens |
V1 request latency metrics
Metric name | Metric | Metric type | Unit | Description |
Request queue time | gen_ai_vllm_request_queue_time_seconds | Counter | Seconds | Time spent by the request in the WAITING stage |
Request prefill time | gen_ai_vllm_request_prefill_time_seconds | Counter | Seconds | Time spent by the request in the PREFILL stage |
Request decode time | gen_ai_vllm_request_decode_time_seconds | Counter | Seconds | Time spent by the request in the DECODE stage |
Request inference time | gen_ai_vllm_request_inference_time_seconds | Counter | Seconds | Time spent by the request in the RUNNING stage |
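As a rough consistency check, the stage timings decompose a request's end-to-end latency: the RUNNING (inference) stage covers prefill plus decode, and queue time is added on top. A hedged sketch with illustrative numbers:

```python
# Illustrative stage timings for one request, in seconds (not real data).
queue, prefill, decode = 0.30, 0.45, 2.25

# RUNNING time roughly covers the PREFILL and DECODE stages.
inference = prefill + decode   # cf. gen_ai_vllm_request_inference_time_seconds
# End-to-end latency adds the time spent in the WAITING stage.
e2e = queue + inference        # cf. gen_ai_server_request_duration

print(round(inference, 2), round(e2e, 2))  # 2.7 3.0
```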
SGLang
Dimension descriptions
Dimension name | Dimension Key | Example | Description |
Model name | modelName / model_name | qwen-7b, deepseek-r1 | Model name |
Operation type | spanKind | LLM | LLM type operation |
Usage type | usageType | input, output | For Token-related metrics only |
Call type | callType | gen_ai | Default value is gen_ai |
RPC type | rpcType | 2100 | RPC type identifier |
System status metrics
Metric name | Metric | Metric type | Unit | Description |
Running requests | sglang_num_running_reqs | Counter | None | Number of requests currently running |
Queued requests | sglang_num_queue_reqs | Counter | None | Number of requests waiting to be processed in the queue |
Log count | sglang_log_count | Counter | None | Log count |
Token-related metrics
Metric name | Metric | Metric type | Unit | Description |
Used tokens | sglang_num_used_tokens | Counter | None | Number of tokens currently in use |
Token usage rate | sglang_token_usage | Counter | None | Token usage rate |
Total prompt tokens | prompt_tokens_total | Counter | None | Cumulative prompt tokens |
Total generated tokens | generation_tokens_total | Counter | None | Cumulative generated tokens |
Total cached tokens | gen_ai_sglang_cached_tokens_total | Counter | None | Number of cached prompt tokens |
Token usage | llm_usage_tokens | Counter | None | Number of tokens used (distinguishes input/output) |
Performance metrics
Metric name | Metric | Metric type | Unit | Description |
Generation throughput | sglang_gen_throughput | Counter | None | Number of tokens generated per second |
Time to first token | gen_ai_server_time_to_first_token | Counter | Seconds | Time to generate the first token |
Time per output token | gen_ai_server_time_per_output_token | Counter | Seconds | Time to generate each output token |
Inter-token latency | sglang_inter_token_latency_seconds | Counter | Seconds | Generation latency between tokens |
End-to-end request duration | gen_ai_server_request_duration | Counter | Seconds | Request end-to-end latency |
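For a single running request, generation throughput and inter-token latency are roughly reciprocal. A minimal sketch with an illustrative throughput value:

```python
# Illustrative value for sglang_gen_throughput (tokens per second).
gen_throughput_tps = 80.0

# For one request, average inter-token latency is about the reciprocal
# of generation throughput; batching makes the real relationship looser.
inter_token_latency_s = 1.0 / gen_throughput_tps
print(inter_token_latency_s)  # 0.0125
```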
Cache and speculative execution metrics
Metric name | Metric | Metric type | Unit | Description |
Cache hit ratio | gen_ai_sglang_cache_hit_rate | Counter | None | Cache hit ratio |
Speculative accept length | sglang_spec_accept_length | Counter | None | Length accepted by speculative decoding |
Request statistics metrics
Metric name | Metric | Metric type | Unit | Description |
Total requests | num_requests_total | Counter | None | Cumulative processed requests |
Configuration reference
Environment variable name | Description |
OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL | Observability granularity for the vLLM inference engine. 0: Records only request-level spans (llm_request span). 1: Also records spans for different inference stages (Wait/Prefill/Decode). 2: Also records detailed events for each generated token in the span event of the llm_request span. |
OTEL_SPAN_EVENT_COUNT_LIMIT | Maximum number of token generation events observed when OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL is set to 2. Default: 128. |
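For example, to record per-token events while raising the event cap, you might export both variables before the instrumented launch (values and the commented launch line are illustrative):

```shell
# Most detailed tracing level: request span, stage spans, per-token events.
export OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL=2
# Raise the per-span token-event cap from the default of 128.
export OTEL_SPAN_EVENT_COUNT_LIMIT=256
# Then start the instrumented engine, for example:
# ARMS_APP_NAME=xxx ARMS_LICENSE_KEY=xxx ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir
echo "$OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL $OTEL_SPAN_EVENT_COUNT_LIMIT"
```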