Use the Application Monitoring Python agent to observe vLLM and SGLang inference engines.
Application Real-Time Monitoring Service (ARMS) currently supports observability only for the vLLM and SGLang frameworks.
Set up observability for PAI-EAS
Elastic Algorithm Service (EAS) is a PAI service for deploying and serving models online. To enable ARMS observability for a vLLM or SGLang model deployed on EAS, follow these steps.
Step 1: Prepare environment variables
export ARMS_APP_NAME=xxx # The name of the EAS application.
export ARMS_REGION_ID=xxx # The region ID for your Alibaba Cloud account.
export ARMS_LICENSE_KEY=xxx # The Alibaba Cloud license key.
Step 2: Modify the PAI-EAS run command
Log on to the PAI console. At the top of the page, select the target region, and then navigate to the target workspace.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS).
On the Inference Service tab, find the application for which you want to enable model observability, and then click Update in the Actions column.
Modify the run command.
The following example uses the DeepSeek-R1-Distill-Qwen-7B model.
Original vLLM command:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
Modified vLLM command with ARMS observability:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
The modifications include the following steps:
Configure the PyPI repository. You can adjust this as needed.
export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Install the agent with the installer. Replace cn-hangzhou with your actual region.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
Prefix the original startup command with the ARMS environment variables and the aliyun-instrument command. Replace the application name, license key, and region with your own values.
Original SGLang command:
python -m sglang.launch_server --model-path /model_dir
Modified SGLang command with ARMS observability:
export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir
The modifications include the following steps:
Configure the PyPI repository. You can adjust this as needed.
export PIP_INDEX_URL=http://mirrors.cloud.aliyuncs.com/pypi/simple; export PIP_TRUSTED_HOST=mirrors.cloud.aliyuncs.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Install the agent with the installer. Replace cn-hangzhou with your actual region.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
Prefix the original startup command with the ARMS environment variables and the aliyun-instrument command. Replace the application name, license key, and region with your own values.
Click Update.
Set up observability in other environments
ARMS supports the official versions of vLLM (V0 and V1) and SGLang. Custom-modified versions are not supported. For more information about supported versions, see LLM (large language model) services.
ARMS collects two spans for non-streaming requests and three spans for streaming requests. The following table describes the supported scenarios.
Supported scenario | Data processing | Collected content | vLLM V0 | vLLM V1 | SGLang |
Chat or completion | Streaming | span | Supported | Supported | Supported |
Chat or completion | Streaming | Key metrics (TTFT/TPOT) | Supported | Supported | Supported |
Chat or completion | Non-streaming | span | Supported | Supported | Supported |
Chat or completion | Non-streaming | Key metrics (TTFT/TPOT) | Not applicable | Not applicable | Not applicable |
Embedding | | http | Not supported | Supported | Not supported |
Rerank | | http | Not supported | Supported | Not supported |
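TTFT and TPOT relate to the request latencies in a simple way: TPOT is the decode time spread over the tokens generated after the first one. A minimal Python sketch (the helper name is illustrative, not part of the ARMS agent API):

```python
def time_per_output_token(e2e_latency_s, ttft_s, output_tokens):
    """Approximate TPOT from end-to-end latency, TTFT, and output length.

    Illustrative helper, not an agent API: decode time divided by the
    number of tokens generated after the first one.
    """
    if output_tokens <= 1:
        return 0.0
    return (e2e_latency_s - ttft_s) / (output_tokens - 1)

# A 4.2 s request whose first token arrived after 0.7 s and that produced
# 351 output tokens averages 10 ms per subsequent token.
print(round(time_per_output_token(4.2, 0.7, 351), 3))  # 0.01
```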
Span attributes
Attributes of the llm_request span:
Attribute | Description |
gen_ai.latency.e2e | End-to-end time |
gen_ai.latency.time_in_queue | Time in queue |
gen_ai.latency.time_in_scheduler | Scheduling time |
gen_ai.latency.time_to_first_token | Time to first token |
gen_ai.request.id | Request ID |
Metric descriptions
vLLM
Dimension descriptions
Dimension name | Dimension key | Example | Description |
Model name | modelName / model_name | qwen-7b, llama3-8b | Model name |
Engine index | engine_index | 0, 1, 2 | For V1 only, engine instance index |
Operation type | spanKind | LLM | LLM type operation |
Usage type | usageType | input, output | For Token-related metrics only, indicates the Token type |
End reason | finished_reason | stop | Reason for request termination |
Common metrics (V0/V1 shared)
Metric name | Metric | Metric type | Unit | Description |
Iterations | vllm_iter_count | Counter | None | Iteration count |
Successful requests | gen_ai_vllm_request_success | Counter | None | Number of successfully processed requests |
Time to first token | genai_llm_first_token_seconds | Counter | Seconds | Time to generate the first token |
Time per output token | gen_ai_server_time_per_output_token | Counter | Seconds | Time to generate each output token |
End-to-end request duration | gen_ai_server_request_duration | Counter | Seconds | Request end-to-end latency |
Token usage | llm_usage_tokens | Counter | None | Number of tokens used (distinguishes input/output) |
V0 system metrics
Metric name | Metric | Metric type | Unit | Description |
GPU cache usage | gpu_cache_usage_sys | Gauge | None | System GPU cache usage |
CPU cache usage | cpu_cache_usage_sys | Gauge | None | System CPU cache usage |
Number of running sequences | num_running_sys | Gauge | None | Number of currently running sequences |
Number of waiting sequences | num_waiting_sys | Gauge | None | Number of sequences waiting to be processed |
Swapped sequences | num_swapped_sys | Gauge | None | Number of swapped sequences |
V0 iteration metrics
Metric name | Metric | Metric type | Unit | Description |
Iteration prompt token count | num_prompt_tokens_iter | Counter | None | Number of prompt tokens in the current iteration |
Iteration generated token count | num_generation_tokens_iter | Counter | None | Number of generated tokens in the current iteration |
Iteration total token count | num_tokens_iter | Counter | None | Total number of tokens in the current iteration |
Iteration preemption count | num_preemption_iter | Counter | None | Number of preemptions in the current iteration |
V1 system metrics
Metric name | Metric | Metric type | Unit | Description |
Running requests | gen_ai_vllm_num_requests_running | Gauge | None | Number of requests in the model execution batch |
Number of pending requests | gen_ai_vllm_num_requests_waiting | Gauge | None | Number of requests waiting to be processed |
KV cache usage | gen_ai_vllm_kv_cache_usage_perc | Gauge | None | KV cache usage, range [0,1] |
Prefix cache queries | gen_ai_vllm_prefix_cache_queries | Counter | None | Number of prefix cache queries (counted by query tokens) |
Prefix cache hits | gen_ai_vllm_prefix_cache_hits | Counter | None | Number of prefix cache hits (counted by cached tokens) |
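Because both prefix-cache counters are expressed in tokens, dividing hits by queries gives a cache hit ratio. A small Python sketch (the function name is illustrative, not part of the agent):

```python
def prefix_cache_hit_rate(hits, queries):
    """Hit ratio from the two V1 prefix-cache token counters.

    Illustrative helper: cached tokens hit divided by tokens queried.
    """
    return hits / queries if queries else 0.0

# 12,800 cached tokens hit out of 51,200 queried tokens -> 25% hit rate
print(prefix_cache_hit_rate(12800, 51200))  # 0.25
```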
V1 iteration metrics
Metric name | Metric | Metric type | Unit | Description |
Preemptions | gen_ai_vllm_num_preemptions | Counter | None | Cumulative engine preemptions |
Prompt tokens | gen_ai_vllm_prompt_tokens | Counter | None | Number of prefill tokens processed |
Generated tokens | gen_ai_vllm_generation_tokens | Counter | None | Number of generated tokens processed |
Request parameter n | gen_ai_vllm_request_params_n | Counter | None | Value of request parameter n |
Request parameter max_tokens | gen_ai_vllm_request_params_max_tokens | Counter | None | Value of request parameter max_tokens |
V1 request latency metrics
Metric name | Metric | Metric type | Unit | Description |
Request queue time | gen_ai_vllm_request_queue_time_seconds | Counter | Seconds | Time spent by the request in the WAITING stage |
Request prefill time | gen_ai_vllm_request_prefill_time_seconds | Counter | Seconds | Time spent by the request in the PREFILL stage |
Request decode time | gen_ai_vllm_request_decode_time_seconds | Counter | Seconds | Time spent by the request in the DECODE stage |
Request inference time | gen_ai_vllm_request_inference_time_seconds | Counter | Seconds | Time spent by the request in the RUNNING stage |
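As a rough consistency check, the stage timings decompose a request's end-to-end latency: the RUNNING (inference) stage covers prefill plus decode, and queue time is added on top. A hedged sketch with illustrative numbers:

```python
# Illustrative stage timings for one request, in seconds (not real data).
queue, prefill, decode = 0.30, 0.45, 2.25

# RUNNING time roughly covers the PREFILL and DECODE stages.
inference = prefill + decode   # cf. gen_ai_vllm_request_inference_time_seconds
# End-to-end latency adds the time spent in the WAITING stage.
e2e = queue + inference        # cf. gen_ai_server_request_duration

print(round(inference, 2), round(e2e, 2))  # 2.7 3.0
```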
SGLang
Dimension descriptions
Dimension name | Dimension Key | Example | Description |
Model name | modelName / model_name | qwen-7b, deepseek-r1 | Model name |
Operation type | spanKind | LLM | LLM type operation |
Usage type | usageType | input, output | For Token-related metrics only |
Call type | callType | gen_ai | Default value is gen_ai |
RPC type | rpcType | 2100 | RPC type identifier |
System status metrics
Metric name | Metric | Metric type | Unit | Description |
Running requests | sglang_num_running_reqs | Counter | None | Number of requests currently running |
Queued requests | sglang_num_queue_reqs | Counter | None | Number of requests waiting to be processed in the queue |
Log count | sglang_log_count | Counter | None | Log count |
Token-related metrics
Metric name | Metric | Metric type | Unit | Description |
Used tokens | sglang_num_used_tokens | Counter | None | Number of tokens currently in use |
Token usage rate | sglang_token_usage | Counter | None | Token usage rate |
Total prompt tokens | prompt_tokens_total | Counter | None | Cumulative prompt tokens |
Total generated tokens | generation_tokens_total | Counter | None | Cumulative generated tokens |
Total cached tokens | gen_ai_sglang_cached_tokens_total | Counter | None | Number of cached prompt tokens |
Token usage | llm_usage_tokens | Counter | None | Number of tokens used (distinguishes input/output) |
Performance metrics
Metric name | Metric | Metric type | Unit | Description |
Generation throughput | sglang_gen_throughput | Counter | None | Number of tokens generated per second |
Time to first token | gen_ai_server_time_to_first_token | Counter | Seconds | Time to generate the first token |
Time per output token | gen_ai_server_time_per_output_token | Counter | Seconds | Time to generate each output token |
Inter-token latency | sglang_inter_token_latency_seconds | Counter | Seconds | Generation latency between tokens |
End-to-end request duration | gen_ai_server_request_duration | Counter | Seconds | Request end-to-end latency |
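For a single running request, generation throughput and inter-token latency are roughly reciprocal. A minimal sketch with an illustrative throughput value:

```python
# Illustrative value for sglang_gen_throughput (tokens per second).
gen_throughput_tps = 80.0

# For one request, average inter-token latency is about the reciprocal
# of generation throughput; batching makes the real relationship looser.
inter_token_latency_s = 1.0 / gen_throughput_tps
print(inter_token_latency_s)  # 0.0125
```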
Cache and speculative execution metrics
Metric name | Metric | Metric type | Unit | Description |
Cache hit ratio | gen_ai_sglang_cache_hit_rate | Counter | None | Cache hit ratio |
Speculative accept length | sglang_spec_accept_length | Counter | None | Length accepted by speculative decoding |
Request statistics metrics
Metric name | Metric | Metric type | Unit | Description |
Total requests | num_requests_total | Counter | None | Cumulative processed requests |
Configuration reference
Environment variable name | Description |
OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL | Observability granularity for the vLLM inference engine. 0: Records only request-level spans (llm_request span). 1: Also records spans for different inference stages (Wait/Prefill/Decode). 2: Also records detailed events for each generated token in the span event of the llm_request span. |
OTEL_SPAN_EVENT_COUNT_LIMIT | Maximum number of token generation events observed when OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL is set to 2. Default: 128. |
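For example, to record per-token events while raising the event cap, you might export both variables before the instrumented launch (values and the commented launch line are illustrative):

```shell
# Most detailed tracing level: request span, stage spans, per-token events.
export OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL=2
# Raise the per-span token-event cap from the default of 128.
export OTEL_SPAN_EVENT_COUNT_LIMIT=256
# Then start the instrumented engine, for example:
# ARMS_APP_NAME=xxx ARMS_LICENSE_KEY=xxx ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir
echo "$OTEL_INSTRUMENTATION_VLLM_TRACING_LEVEL $OTEL_SPAN_EVENT_COUNT_LIMIT"
```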