The Application Monitoring Python agent includes a vLLM/SGLang plugin that lets you observe vLLM and SGLang inference engines.
ARMS currently supports observability only for the vLLM and SGLang frameworks.
Connect to PAI-EAS
Elastic Algorithm Service (EAS) is a PAI service for online inference. It provides a one-stop platform for model development, deployment, and usage. You can deploy model services to public or dedicated resource groups. EAS provides real-time responses to model loading and data requests on heterogeneous hardware, such as CPUs and GPUs.
Step 1: Prepare environment variables
export ARMS_APP_NAME=xxx # The name of the EAS application.
export ARMS_REGION_ID=xxx # The region ID for your Alibaba Cloud account.
export ARMS_LICENSE_KEY=xxx # The Alibaba Cloud license key.
Step 2: Modify the PAI-EAS run command
Log on to the PAI console. At the top of the page, select the destination region, and then navigate to the target workspace.
In the navigation pane on the left, choose Model Deployment > Elastic Algorithm Service (EAS).
On the Inference Service tab, find the application for which you want to enable model observability, and then click Update in the Actions column.
Modify the Run Command.
This example shows how to connect to the DeepSeek-R1-Distill-Qwen-7B model.
Original vLLM command:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
vLLM command for connecting to Application Monitoring:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
Description of the added parts:
Configure the PyPI repository. You can adjust this as needed.
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Use the installer to install the agent. Replace cn-hangzhou with your actual region.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
Start the service through aliyun-instrument instead of running vllm serve directly, with the ARMS_APP_NAME, ARMS_LICENSE_KEY, and ARMS_REGION_ID environment variables from Step 1 set on the command line.
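For readability, the same instrumented vLLM run command can be written as a multi-line script. This sketch is functionally equivalent to the one-liner above; the app name and license key remain the placeholder values from the example.
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l)
# Configure the PyPI mirror (optional; adjust as needed).
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip3 config set install.trusted-host mirrors.aliyun.com
# Download the agent installer, then install the agent for the target region.
pip3 install aliyun-bootstrap
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install
# Launch vLLM through aliyun-instrument so the agent attaches at startup.
ARMS_APP_NAME=qwq32 \
ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** \
ARMS_REGION_ID=cn-hangzhou \
aliyun-instrument vllm serve /model_dir \
  --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code \
  --gpu-memory-utilization 0.95 --max-model-len 32768 \
  --tensor-parallel-size $gpu_count \
  --served-model-name DeepSeek-R1-Distill-Qwen-7B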
Original SGLang command:
python -m sglang.launch_server --model-path /model_dir
SGLang command for connecting to Application Monitoring:
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir
Description of the added parts:
Configure the PyPI repository. You can adjust this as needed.
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Use the installer to install the agent. Replace cn-hangzhou with your actual region.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
Start the service through aliyun-instrument instead of running python -m sglang.launch_server directly, with the ARMS environment variables from Step 1 set on the command line.
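The SGLang one-liner expands the same way; this multi-line sketch is equivalent:
# PyPI mirror and agent installation, identical to the vLLM example.
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip3 config set install.trusted-host mirrors.aliyun.com
pip3 install aliyun-bootstrap
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install
# Launch SGLang through aliyun-instrument so the agent attaches at startup.
ARMS_APP_NAME=qwq32 \
ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** \
ARMS_REGION_ID=cn-hangzhou \
aliyun-instrument python -m sglang.launch_server --model-path /model_dir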
Click Update.
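After the update, you can optionally check that the service still answers requests. The sketch below is not part of the official procedure: it assumes the vLLM service exposes its OpenAI-compatible API through the EAS endpoint, and EAS_ENDPOINT and EAS_TOKEN are hypothetical placeholders for the endpoint and token shown in the EAS console.
# Hypothetical endpoint and token; copy the real values from the EAS console.
curl -sS "$EAS_ENDPOINT/v1/chat/completions" \
  -H "Authorization: $EAS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1-Distill-Qwen-7B", "messages": [{"role": "user", "content": "Hello"}]}'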
Connect models in general scenarios
ARMS currently supports only the official versions of vLLM (V0 and V1) and SGLang. User-modified versions are not supported. For more information about the supported versions, see Python libraries supported by Application Monitoring.
ARMS supports two scenarios: completion and chat. ARMS collects two spans for each non-streaming request and three spans for each streaming request (see the request sketch after the table below).
| Supported scenario | Data processing | Collected content | vLLM V0 | vLLM V1 | SGLang |
| --- | --- | --- | --- | --- | --- |
| Chat or completion | Streaming | span | Supported | Supported | Supported |
| Chat or completion | Streaming | Key metrics TTFT/TPOT | Supported | Not supported | Supported |
| Chat or completion | Non-streaming | span | Supported | Supported | Supported |
| Chat or completion | Non-streaming | Key metrics TTFT/TPOT | Not applicable | Not applicable | Not applicable |
| Embedding | - | http | Not supported | Not supported | Not supported |
| Rerank | - | http | Not supported | Not supported | Not supported |
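To see the two data-processing modes from the table in practice: a non-streaming request, such as the verification example in the previous section, produces two spans, while adding "stream": true to the request body switches to the streaming path and produces three spans. This sketch reuses the same hypothetical EAS_ENDPOINT and EAS_TOKEN placeholders as before.
# Streaming chat request: ARMS collects three spans for this call.
curl -sS -N "$EAS_ENDPOINT/v1/chat/completions" \
  -H "Authorization: $EAS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1-Distill-Qwen-7B", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'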
Descriptions of important spans and attributes
Attributes related to llm_request:
| Attribute | Description |
| --- | --- |
| gen_ai.latency.e2e | End-to-end time |
| gen_ai.latency.time_in_queue | Time in queue |
| gen_ai.latency.time_in_scheduler | Scheduling time |
| gen_ai.latency.time_to_first_token | Time to first token |
| gen_ai.request.id | Request ID |