
Observe vLLM/SGLang inference engines

Last Updated: Dec 03, 2025

The Application Monitoring Python agent includes a new vLLM/SGLang plugin that provides observability for vLLM/SGLang inference engines.

Note

ARMS supports observability only for the vLLM and SGLang frameworks.

PAI-EAS integration

Elastic Algorithm Service (EAS) is the online inference service of Platform for AI (PAI). It provides a one-stop service for model development and deployment. You can deploy model services to public or dedicated resource groups and serve real-time model-loading and inference requests on heterogeneous hardware, such as CPUs and GPUs.

Step 1: Prepare environment variables

export ARMS_APP_NAME=xxx   # The name of the EAS application.
export ARMS_REGION_ID=xxx   # The region ID of your Alibaba Cloud account.
export ARMS_LICENSE_KEY=xxx   # The Alibaba Cloud license key.
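
For example, with the sample values that appear in the run commands later in this topic (replace them with your own values):

export ARMS_APP_NAME=qwq32
export ARMS_REGION_ID=cn-hangzhou
export ARMS_LICENSE_KEY=it0kjz0oxz@3115ad******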

Step 2: Modify the PAI-EAS run command

  1. Log on to the PAI console. In the top navigation bar, select the destination region and then navigate to the destination workspace.

  2. In the navigation pane on the left, choose Model Deployment > Elastic Algorithm Service (EAS).

  3. On the Inference Service tab, find the target application and click Update in the Actions column.

  4. Modify the run command.

    This example uses the DeepSeek-R1-Distill-Qwen-7B model.

    Original vLLM command:

    gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

    vLLM command integrated with Application Monitoring:

    gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

    Description of the added commands (a consolidated multi-line version follows this list):

    1. Configure the PyPI repository. Adjust this configuration as needed.

      pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;
    2. Download the agent installer.

      pip3 install aliyun-bootstrap;
    3. Install the agent using the installer.

      Replace cn-hangzhou with your actual region.

      ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
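
    For readability, the same integrated vLLM command can also be written as a multi-line script. Run in order, these lines are equivalent to the single-line version above; the app name, license key, and region are the sample values from this example and must be replaced with your own:

      gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l)
      # Configure the PyPI repository. Adjust as needed.
      pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/
      pip3 config set install.trusted-host mirrors.aliyun.com
      # Download the agent installer.
      pip3 install aliyun-bootstrap
      # Install the agent. Replace cn-hangzhou with your actual region.
      ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install
      # Launch vLLM through the agent with the ARMS variables set inline.
      ARMS_APP_NAME=qwq32 \
      ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** \
      ARMS_REGION_ID=cn-hangzhou \
      aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 \
        --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 \
        --max-model-len 32768 --tensor-parallel-size "$gpu_count" \
        --served-model-name DeepSeek-R1-Distill-Qwen-7B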

    Original SGLang command:

    python -m sglang.launch_server --model-path /model_dir

    SGLang command integrated with Application Monitoring:

    pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir

    The added commands are identical to those described for the vLLM command above: configure the PyPI repository, install the aliyun-bootstrap installer, install the agent for your region, and start the server through aliyun-instrument.
  5. Click Update.

Model integration for general scenarios

ARMS supports only official releases of vLLM (V0 and V1) and SGLang. Modified versions are not supported. For the list of supported versions, see Large Language Model (LLM) Service.
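
To check which engine version is installed in your image before integrating, you can query pip (a quick sanity check; the package names assume the standard vllm and sglang distributions):

pip3 show vllm     # prints the installed vLLM version, if present
pip3 show sglang   # prints the installed SGLang version, if present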

ARMS supports completion and chat scenarios. Two spans are collected for non-streaming requests, and three spans are collected for streaming requests.
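For example, assuming the vLLM service from the preceding section is reachable at localhost:8000, the following request against the OpenAI-compatible endpoint exercises the streaming chat scenario; set "stream": false to trigger the non-streaming path instead:

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "DeepSeek-R1-Distill-Qwen-7B", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
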

| Supported scenario | Data processing | Collected content | vLLM V0 | vLLM V1 | SGLang |
| --- | --- | --- | --- | --- | --- |
| Chat or completion | Streaming | Spans | http, input/output, llm_request (key metrics) | http, input/output | http, input/output, key metrics, reasoning |
| Chat or completion | Streaming | Key metrics (TTFT/TPOT) | Supported | Not supported | Supported |
| Chat or completion | Non-streaming | Spans | http, input/output | http, input/output | http, input/output |
| Chat or completion | Non-streaming | Key metrics (TTFT/TPOT) | Not applicable | Not applicable | Not applicable |
| Embedding | - | http | Not supported | Not supported | Not supported |
| Rerank | - | http | Not supported | Not supported | Not supported |

Important spans and attributes

Attributes on the llm_request span:

| Attribute | Description |
| --- | --- |
| gen_ai.latency.e2e | End-to-end latency |
| gen_ai.latency.time_in_queue | Time spent in the queue |
| gen_ai.latency.time_in_scheduler | Time spent in the scheduler |
| gen_ai.latency.time_to_first_token | Time to first token (TTFT) |
| gen_ai.request.id | Request ID |