
Application Real-Time Monitoring Service: Observe vLLM/SGLang inference engines

Last Updated: Dec 01, 2025

The Application Monitoring Python agent includes a vLLM/SGLang plugin that lets you observe vLLM and SGLang inference engines.

Note

ARMS currently provides inference-engine observability only for the vLLM and SGLang frameworks.

Connect to PAI-EAS

Elastic Algorithm Service (EAS) is a PAI service for online inference. It provides a one-stop platform for model development, deployment, and usage. You can deploy model services to public or dedicated resource groups. EAS provides real-time responses to model loading and data requests on heterogeneous hardware, such as CPUs and GPUs.

Step 1: Prepare environment variables

export ARMS_APP_NAME=xxx   # The name of the EAS application.
export ARMS_REGION_ID=xxx   # The region ID for your Alibaba Cloud account.
export ARMS_LICENSE_KEY=xxx   # The Alibaba Cloud license key.
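
Before editing the run command, you can verify that all three variables are set. The `check_arms_env` helper below is our own optional sketch (not part of ARMS), assuming a bash shell:

```shell
# Optional helper (not part of ARMS): fail fast if a required variable is empty.
check_arms_env() {
  for v in ARMS_APP_NAME ARMS_REGION_ID ARMS_LICENSE_KEY; do
    # ${!v} is bash indirect expansion: the value of the variable named by $v.
    if [ -z "${!v}" ]; then
      echo "missing: $v"
      return 1
    fi
  done
  echo "ARMS environment OK"
}
```

You could then guard the launch with, for example, `check_arms_env && vllm serve ...`.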

Step 2: Modify the PAI-EAS run command

  1. Log on to the PAI console. At the top of the page, select the destination region, and then navigate to the target workspace.

  2. In the navigation pane on the left, choose Model Deployment > Elastic Algorithm Service (EAS).

  3. On the Inference Service tab, find the application for which you want to enable model observability, and then click Update in the Actions column.

  4. Modify the Run Command.

    This example shows how to connect to the DeepSeek-R1-Distill-Qwen-7B model.

    Original vLLM instruction:

    gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

    vLLM instruction for connecting to Application Monitoring:

    gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

    Description of the added parts:

    1. Configure the PyPI repository. You can adjust this as needed.

      pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;
    2. Download the agent installer.

      pip3 install aliyun-bootstrap;
    3. Use the installer to install the agent.

      Replace cn-hangzhou with your actual region.

      ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;

    Original SGLang command:

    python -m sglang.launch_server --model-path /model_dir

    SGLang instruction for connecting to Application Monitoring:

    pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir

    Description of the added parts:

    1. Configure the PyPI repository. You can adjust this as needed.

      pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;
    2. Download the agent installer.

      pip3 install aliyun-bootstrap;
    3. Use the installer to install the agent.

      Replace cn-hangzhou with your actual region.

      ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
  5. Click Update.

Connect models in general scenarios

ARMS currently supports only the official versions of vLLM (V0 and V1) and SGLang. User-modified versions are not supported. For more information about the supported versions, see Python libraries supported by Application Monitoring.

ARMS supports two scenarios: completion and chat. ARMS collects two spans for non-streaming requests and three spans for streaming requests.
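
To illustrate the two request shapes, here is a minimal sketch of an OpenAI-compatible `/v1/chat/completions` request body; only the `stream` flag differs between the streaming and non-streaming cases. The `chat_payload` helper name and the prompt are our own examples:

```python
import json

def chat_payload(model: str, prompt: str, stream: bool) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

# stream=False yields the non-streaming shape (two spans collected);
# stream=True yields the streaming shape (three spans collected).
body = chat_payload("DeepSeek-R1-Distill-Qwen-7B", "Hello, who are you?", stream=True)
print(json.dumps(body, indent=2))
```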

| Supported scenario | Data processing | Collected content | vLLM V0 | vLLM V1 | SGLang |
| --- | --- | --- | --- | --- | --- |
| Chat or completion | Streaming | Spans | http, input/output, llm_request (key metrics) | http, input/output | http, input/output, key metrics, reasoning |
| Chat or completion | Streaming | Key metrics (TTFT/TPOT) | Supported | Not supported | Supported |
| Chat or completion | Non-streaming | Spans | http, input/output | http, input/output | http, input/output |
| Chat or completion | Non-streaming | Key metrics (TTFT/TPOT) | Not applicable | Not applicable | Not applicable |
| Embedding | - | http | Not supported | Not supported | Not supported |
| Rerank | - | http | Not supported | Not supported | Not supported |

Descriptions of important spans and attributes

Attributes related to llm_request:

| Attribute | Description |
| --- | --- |
| gen_ai.latency.e2e | End-to-end time |
| gen_ai.latency.time_in_queue | Time in queue |
| gen_ai.latency.time_in_scheduler | Scheduling time |
| gen_ai.latency.time_to_first_token | Time to first token |
| gen_ai.request.id | Request ID |
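
To make the latency attributes concrete, here is an illustrative attribute set as it might appear on an llm_request span. All values are invented for illustration and we assume seconds as the unit; only the attribute keys come from the table above:

```python
# Invented example values (assumed to be in seconds); keys follow the table above.
span_attributes = {
    "gen_ai.request.id": "example-request-id",     # hypothetical request ID
    "gen_ai.latency.time_in_queue": 0.012,         # time in queue
    "gen_ai.latency.time_in_scheduler": 0.003,     # scheduling time
    "gen_ai.latency.time_to_first_token": 0.145,   # TTFT
    "gen_ai.latency.e2e": 1.820,                   # end-to-end time
}

# Sanity relationship: end-to-end time bounds the component latencies.
assert span_attributes["gen_ai.latency.e2e"] >= span_attributes["gen_ai.latency.time_to_first_token"]
print("component latencies are consistent with e2e")
```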