
Application Real-Time Monitoring Service: Observe vLLM/SGLang inference engines

Last Updated: Mar 19, 2026

This topic describes how to use the ARMS Application Monitoring agent for Python to observe the vLLM and SGLang inference engines.

Note

Application Real-Time Monitoring Service (ARMS) currently supports observability only for the vLLM and SGLang frameworks.

Set up observability for PAI-EAS

Elastic Algorithm Service (EAS) is a PAI service for deploying and serving models online. To enable ARMS observability for a vLLM or SGLang model deployed on EAS, follow these steps.

Step 1: Prepare environment variables

export ARMS_APP_NAME=xxx   # The name of the EAS application.
export ARMS_REGION_ID=xxx   # The region ID for your Alibaba Cloud account.
export ARMS_LICENSE_KEY=xxx   # The Alibaba Cloud license key.

Step 2: Modify the PAI-EAS run command

  1. Log on to the PAI console. At the top of the page, select the target region, and then navigate to the target workspace.

  2. In the navigation pane on the left, choose Model Deployment > Elastic Algorithm Service (EAS).

  3. On the Inference Service tab, find the application for which you want to enable model observability, and then click Update in the Actions column.

  4. Modify the run command.

    The following example uses the DeepSeek-R1-Distill-Qwen-7B model.

    Original vLLM command:

    gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

    Modified vLLM command with ARMS observability:

    gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

    The modified command adds the following steps before the original launch command:

    1. Configure the PyPI repository. You can adjust this as needed.

      export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;
    2. Download the agent installer.

      pip3 install aliyun-bootstrap;
    3. Use the installer to install the agent.

      Replace cn-hangzhou with your actual region.

      ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;

    Original SGLang command:

    python -m sglang.launch_server --model-path /model_dir

    Modified SGLang command with ARMS observability:

    export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir

    The modified command adds the following steps before the original launch command:

    1. Configure the PyPI repository. You can adjust this as needed.

      export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;
    2. Download the agent installer.

      pip3 install aliyun-bootstrap;
    3. Use the installer to install the agent.

      Replace cn-hangzhou with your actual region.

      ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
  5. Click Update.
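After the update completes, you can send one request to the service to confirm that traces appear in the ARMS console. The sketch below assumes the vLLM deployment from the example above; the endpoint URL and token are placeholders, substitute the invocation address and token shown on your EAS service's details page.

```shell
# Placeholders -- copy the real values from the EAS service details page.
EAS_ENDPOINT="http://your-service.cn-hangzhou.pai-eas.aliyuncs.com"
EAS_TOKEN="xxx"

# One non-streaming chat completion against the instrumented vLLM service.
# The request should show up as a trace in the ARMS console shortly afterwards.
curl -sS "${EAS_ENDPOINT}/v1/chat/completions" \
  -H "Authorization: ${EAS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": false
      }'
```

For a streaming request, set `"stream": true`; as described below, streaming requests produce an additional span.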

Set up observability in other environments

ARMS supports the official versions of vLLM (V0 and V1) and SGLang. Custom-modified versions are not supported. For more information about supported versions, see LLM (large language model) services.
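Outside of EAS, the same three installation steps apply on any host where the inference engine runs. The following is a minimal sketch assuming a self-managed vLLM installation with network access to the Alibaba Cloud PyPI mirror; the application name, license key, and region are placeholders.

```shell
# Optional: use the Alibaba Cloud PyPI mirror (adjust as needed).
export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple
export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com

# Download the agent installer, then install the agent for your region.
pip3 install aliyun-bootstrap
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install

# Launch the engine through aliyun-instrument so the agent is attached.
# Placeholder credentials -- use your own app name, license key, and region.
ARMS_APP_NAME=my-vllm-app \
ARMS_LICENSE_KEY=xxx \
ARMS_REGION_ID=cn-hangzhou \
aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000
```

For SGLang, replace the last line's launch command with `aliyun-instrument python -m sglang.launch_server --model-path /model_dir`.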

ARMS collects two spans for non-streaming requests and three spans for streaming requests. The following table describes the supported scenarios.

| Supported scenario | Data processing | vLLM V0 | vLLM V1 | SGLang |
| --- | --- | --- | --- | --- |
| Chat or completion | Streaming: span | http, input/output, llm_request (key metrics) | http, input/output | http, input/output, key metrics, reasoning |
| Chat or completion | Streaming: key metrics (TTFT/TPOT) | Supported | Supported | Supported |
| Chat or completion | Non-streaming: span | http, input/output | http, input/output | http, input/output |
| Chat or completion | Non-streaming: key metrics (TTFT/TPOT) | Not applicable | Not applicable | Not applicable |
| Embedding | http | Not supported | Supported | Not supported |
| Rerank | http | Not supported | Supported | Not supported |

Span attributes

Attributes of the llm_request span:

| Attribute | Description |
| --- | --- |
| gen_ai.latency.e2e | End-to-end time |
| gen_ai.latency.time_in_queue | Time in queue |
| gen_ai.latency.time_in_scheduler | Scheduling time |
| gen_ai.latency.time_to_first_token | Time to first token |
| gen_ai.request.id | Request ID |

Metric descriptions

vLLM

Dimension descriptions

| Dimension name | Dimension key | Example | Description |
| --- | --- | --- | --- |
| Model name | modelName / model_name | qwen-7b, llama3-8b | Name of the model |
| Engine index | engine_index | 0, 1, 2 | Engine instance index (V1 only) |
| Operation type | spanKind | LLM | LLM-type operation |
| Usage type | usageType | input, output | Token type (token-related metrics only) |
| End reason | finished_reason | stop | Reason for request termination |

Common metrics (V0/V1 shared)

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Iterations | vllm_iter_count | Counter | None | Iteration count |
| Successful requests | gen_ai_vllm_request_success | Counter | None | Number of successfully processed requests |
| Time to first token | genai_llm_first_token_seconds | Counter | Seconds | Time to generate the first token |
| Time per output token | gen_ai_server_time_per_output_token | Counter | Seconds | Time to generate each output token |
| End-to-end request duration | gen_ai_server_request_duration | Counter | Seconds | Request end-to-end latency |
| Token usage | llm_usage_tokens | Counter | None | Number of tokens used (input/output distinguished) |

V0 system metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| GPU cache usage | gpu_cache_usage_sys | Gauge | None | System GPU cache usage |
| CPU cache usage | cpu_cache_usage_sys | Gauge | None | System CPU cache usage |
| Running sequences | num_running_sys | Gauge | None | Number of currently running sequences |
| Waiting sequences | num_waiting_sys | Gauge | None | Number of sequences waiting to be processed |
| Swapped sequences | num_swapped_sys | Gauge | None | Number of swapped sequences |

V0 iteration metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Iteration prompt token count | num_prompt_tokens_iter | Counter | None | Number of prompt tokens in the current iteration |
| Iteration generated token count | num_generation_tokens_iter | Counter | None | Number of generated tokens in the current iteration |
| Iteration total token count | num_tokens_iter | Counter | None | Total number of tokens in the current iteration |
| Iteration preemption count | num_preemption_iter | Counter | None | Number of preemptions in the current iteration |

V1 system metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Running requests | gen_ai_vllm_num_requests_running | Gauge | None | Number of requests in the model execution batch |
| Pending requests | gen_ai_vllm_num_requests_waiting | Gauge | None | Number of requests waiting to be processed |
| KV cache usage | gen_ai_vllm_kv_cache_usage_perc | Gauge | None | KV cache usage, range [0,1] |
| Prefix cache queries | gen_ai_vllm_prefix_cache_queries | Counter | None | Number of prefix cache queries (counted by query tokens) |
| Prefix cache hits | gen_ai_vllm_prefix_cache_hits | Counter | None | Number of prefix cache hits (counted by cached tokens) |

V1 iteration metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Preemptions | gen_ai_vllm_num_preemptions | Counter | None | Cumulative engine preemptions |
| Prompt tokens | gen_ai_vllm_prompt_tokens | Counter | None | Number of prefill tokens processed |
| Generated tokens | gen_ai_vllm_generation_tokens | Counter | None | Number of generated tokens processed |
| Request parameter n | gen_ai_vllm_request_params_n | Counter | None | Value of request parameter n |
| Request parameter max_tokens | gen_ai_vllm_request_params_max_tokens | Counter | None | Value of request parameter max_tokens |

V1 request latency metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Request queue time | gen_ai_vllm_request_queue_time_seconds | Counter | Seconds | Time spent by the request in the WAITING stage |
| Request prefill time | gen_ai_vllm_request_prefill_time_seconds | Counter | Seconds | Time spent by the request in the PREFILL stage |
| Request decode time | gen_ai_vllm_request_decode_time_seconds | Counter | Seconds | Time spent by the request in the DECODE stage |
| Request inference time | gen_ai_vllm_request_inference_time_seconds | Counter | Seconds | Time spent by the request in the RUNNING stage |

SGLang

Dimension descriptions

| Dimension name | Dimension key | Example | Description |
| --- | --- | --- | --- |
| Model name | modelName / model_name | qwen-7b, deepseek-r1 | Name of the model |
| Operation type | spanKind | LLM | LLM-type operation |
| Usage type | usageType | input, output | Token type (token-related metrics only) |
| Call type | callType | gen_ai | Default value is gen_ai |
| RPC type | rpcType | 2100 | RPC type identifier |

System status metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Running requests | sglang_num_running_reqs | Counter | None | Number of requests currently running |
| Queued requests | sglang_num_queue_reqs | Counter | None | Number of requests waiting to be processed in the queue |
| Log count | sglang_log_count | Counter | None | Log count |

Token-related metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Used tokens | sglang_num_used_tokens | Counter | None | Number of tokens currently in use |
| Token usage rate | sglang_token_usage | Counter | None | Token usage rate |
| Total prompt tokens | prompt_tokens_total | Counter | None | Cumulative prompt tokens |
| Total generated tokens | generation_tokens_total | Counter | None | Cumulative generated tokens |
| Total cached tokens | gen_ai_sglang_cached_tokens_total | Counter | None | Number of cached prompt tokens |
| Token usage | llm_usage_tokens | Counter | None | Number of tokens used (input/output distinguished) |

Performance metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Generation throughput | sglang_gen_throughput | Counter | None | Number of tokens generated per second |
| Time to first token | gen_ai_server_time_to_first_token | Counter | Seconds | Time to generate the first token |
| Time per output token | gen_ai_server_time_per_output_token | Counter | Seconds | Time to generate each output token |
| Inter-token latency | sglang_inter_token_latency_seconds | Counter | Seconds | Generation latency between tokens |
| End-to-end request duration | gen_ai_server_request_duration | Counter | Seconds | Request end-to-end latency |

Cache and speculative execution metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Cache hit ratio | gen_ai_sglang_cache_hit_rate | Counter | None | Cache hit ratio |
| Speculative accept length | sglang_spec_accept_length | Counter | None | Length accepted by speculative decoding |

Request statistics metrics

| Metric name | Metric | Metric type | Unit | Description |
| --- | --- | --- | --- | --- |
| Total requests | num_requests_total | Counter | None | Cumulative processed requests |

Configuration reference

| Environment variable | Description |
| --- | --- |
| OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL | Observability granularity for the vLLM inference engine. 0: records only request-level spans (the llm_request span). 1: also records spans for the inference stages (Wait/Prefill/Decode). 2: also records a detailed event for each generated token as span events on the llm_request span. |
| OTEL_SPAN_EVENT_COUNT_LIMIT | Maximum number of token generation events recorded when OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL is set to 2. Default: 128. |
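The two variables above are set in the same environment as the launch command. The sketch below raises the tracing level to 2 and the per-span event cap to 256 for a vLLM service; the app name, license key, and region are placeholders.

```shell
# Record per-token span events (level 2) and allow up to 256 events per span.
export OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL=2
export OTEL_SPAN_EVENT_COUNT_LIMIT=256

# Placeholder credentials -- use your own app name, license key, and region.
ARMS_APP_NAME=my-vllm-app \
ARMS_LICENSE_KEY=xxx \
ARMS_REGION_ID=cn-hangzhou \
aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000
```

Level 2 adds one span event per generated token, so long completions can hit the event cap; raise OTEL_SPAN_EVENT_COUNT_LIMIT accordingly or stay at level 0 or 1 for high-throughput services.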