The Application Monitoring Python agent includes a new vLLM/SGLang plugin that provides observability for vLLM/SGLang inference engines.
ARMS supports observability only for the vLLM and SGLang frameworks.
PAI-EAS integration
Elastic Algorithm Service (EAS) is a PAI product for online inference. It provides a comprehensive service for model development and deployment. You can deploy model services to public or dedicated resource groups to load models and serve data requests in real time on heterogeneous hardware, such as CPUs and GPUs.
Step 1: Prepare environment variables
export ARMS_APP_NAME=xxx # The name of the EAS application.
export ARMS_REGION_ID=xxx # The region ID of your Alibaba Cloud account.
export ARMS_LICENSE_KEY=xxx # The Alibaba Cloud license key.
Step 2: Modify the PAI-EAS run command
Log on to the PAI console. In the top navigation bar, select the destination region and then navigate to the destination workspace.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS).
On the Inference Service tab, find the target application and click Update in the Actions column.
Modify the run command.
This example uses the DeepSeek-R1-Distill-Qwen-7B model.
Original vLLM command:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
vLLM command for integrating with Application Monitoring:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
Description of the added commands:
Configure the PyPI repository. Adjust this configuration as needed.
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Install the agent using the installer. Replace cn-hangzhou with your actual region ID.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
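The setup commands are chained into one line only because EAS expects a single run command. If you maintain a custom image for your service, the same setup can instead be baked in ahead of time so that the run command stays short. A minimal sketch, assuming the cn-hangzhou region (replace with your region ID):
# Pre-install the ARMS agent in the image (assumption: you build a custom image for EAS).
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip3 config set install.trusted-host mirrors.aliyun.com
pip3 install aliyun-bootstrap
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install
With the agent pre-installed, only the ARMS_APP_NAME, ARMS_LICENSE_KEY, and ARMS_REGION_ID variables and the aliyun-instrument prefix need to remain in the run command.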
Original SGLang command:
python -m sglang.launch_server --model-path /model_dir
SGLang command for integrating with Application Monitoring:
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir
Description of the added commands:
Configure the PyPI repository. Adjust this configuration as needed.
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Install the agent. Replace cn-hangzhou with your actual region ID.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
Click Update.
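After the update completes, you can verify the integration by sending a test request so that trace data is reported. The following is a minimal sketch of a non-streaming chat request; it assumes the OpenAI-compatible /v1/chat/completions route that vLLM and SGLang expose, and <EAS_ENDPOINT> and <EAS_TOKEN> are placeholders for the access address and token of your EAS service. The model value must match the served model name of your deployment.
curl -X POST "<EAS_ENDPOINT>/v1/chat/completions" \
  -H "Authorization: <EAS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1-Distill-Qwen-7B", "messages": [{"role": "user", "content": "Hello"}]}'
A successful call should produce trace data in the Application Monitoring console under the application name you configured in ARMS_APP_NAME.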
Model integration for general scenarios
ARMS supports only the official vLLM (V0 and V1) and SGLang versions. Modified versions are not supported. For a detailed list of supported versions, see Large Language Model (LLM) Service.
ARMS supports completion and chat scenarios. Two spans are collected for non-streaming requests, and three spans are collected for streaming requests.
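The difference is visible by toggling the stream parameter on the request. This sketch reuses the placeholder endpoint and token from the PAI-EAS example above and issues a streaming request, which yields three spans:
curl -X POST "<EAS_ENDPOINT>/v1/chat/completions" \
  -H "Authorization: <EAS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1-Distill-Qwen-7B", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'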
| Supported scenarios | Data processing | Collected content | vLLM V0 | vLLM V1 | SGLang |
| --- | --- | --- | --- | --- | --- |
| Chat or completion | Streaming | Span | Supported | Supported | Supported |
| Chat or completion | Streaming | Key metrics (TTFT/TPOT) | Supported | Not supported | Supported |
| Chat or completion | Non-streaming | Span | Supported | Supported | Supported |
| Chat or completion | Non-streaming | Key metrics (TTFT/TPOT) | Not applicable | Not applicable | Not applicable |
| Embedding | - | HTTP | Not supported | Not supported | Not supported |
| Rerank | - | HTTP | Not supported | Not supported | Not supported |
Important spans and attributes
The following attributes are related to the llm_request span:
| Attribute | Description |
| --- | --- |
| gen_ai.latency.e2e | End-to-end time |
| gen_ai.latency.time_in_queue | Time spent in the queue |
| gen_ai.latency.time_in_scheduler | Time spent in the scheduler |
| gen_ai.latency.time_to_first_token | Time to first token |
| gen_ai.request.id | Request ID |