The Application Monitoring Python agent includes a new vLLM/SGLang plugin that provides observability for vLLM/SGLang inference engines.
ARMS supports observability only for the vLLM and SGLang frameworks.
PAI-EAS integration
Elastic Algorithm Service (EAS) is a PAI product for online inference. It provides a comprehensive service for model development and deployment. You can deploy model services to public or dedicated resource groups to load models and serve data requests in real time on heterogeneous hardware, such as CPUs and GPUs.
Step 1: Prepare environment variables
export ARMS_APP_NAME=xxx # The name of the EAS application.
export ARMS_REGION_ID=xxx # The region ID of your Alibaba Cloud account.
export ARMS_LICENSE_KEY=xxx # The Alibaba Cloud license key.
Step 2: Modify the PAI-EAS run command
Log on to the PAI console. In the top navigation bar, select the destination region and then navigate to the destination workspace.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS).
On the Inference Service tab, find the target application and click Update in the Actions column.
Modify the run command.
This example uses the DeepSeek-R1-Distill-Qwen-7B model.
Original vLLM command:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
vLLM command for integrating with Application Monitoring:
gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B
Description of the added commands:
Configure the PyPI repository. Adjust this configuration as needed.
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Install the agent using the installer. Replace cn-hangzhou with your actual region ID.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
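The setup commands are chained into one line only because EAS expects a single run command. If you maintain a custom image for your service, the same setup can instead be baked in ahead of time so that the run command stays short. A minimal sketch, assuming the cn-hangzhou region (replace with your region ID):
# Pre-install the ARMS agent in the image (assumption: you build a custom image for EAS).
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip3 config set install.trusted-host mirrors.aliyun.com
pip3 install aliyun-bootstrap
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install
With the agent pre-installed, only the ARMS_APP_NAME, ARMS_LICENSE_KEY, and ARMS_REGION_ID variables and the aliyun-instrument prefix need to remain in the run command.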
Original SGLang command:
python -m sglang.launch_server --model-path /model_dir
SGLang command for integrating with Application Monitoring:
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir
Description of the added commands:
Configure the PyPI repository. Adjust this configuration as needed.
pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;
Download the agent installer.
pip3 install aliyun-bootstrap;
Install the agent. Replace cn-hangzhou with your actual region ID.
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
Click Update.
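After the update completes, you can verify the integration by sending a test request so that trace data is reported. The following is a minimal sketch of a non-streaming chat request; it assumes the OpenAI-compatible /v1/chat/completions route that vLLM and SGLang expose, and <EAS_ENDPOINT> and <EAS_TOKEN> are placeholders for the access address and token of your EAS service. The model value must match the served model name of your deployment.
curl -X POST "<EAS_ENDPOINT>/v1/chat/completions" \
  -H "Authorization: <EAS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1-Distill-Qwen-7B", "messages": [{"role": "user", "content": "Hello"}]}'
A successful call should produce trace data in the Application Monitoring console under the application name you configured in ARMS_APP_NAME.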
Model integration for general scenarios
ARMS supports only the official vLLM (V0 and V1) and SGLang versions. Modified versions are not supported. For a detailed list of supported versions, see Large Language Model (LLM) Service.
ARMS supports completion and chat scenarios. Two spans are collected for non-streaming requests, and three spans are collected for streaming requests.
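The difference is visible by toggling the stream parameter on the request. This sketch reuses the placeholder endpoint and token from the PAI-EAS example above and issues a streaming request, which yields three spans:
curl -X POST "<EAS_ENDPOINT>/v1/chat/completions" \
  -H "Authorization: <EAS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"model": "DeepSeek-R1-Distill-Qwen-7B", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'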
| Supported scenarios | Data processing | Collected content | vLLM V0 | vLLM V1 | SGLang |
| --- | --- | --- | --- | --- | --- |
| Chat or completion | Streaming | Span | Supported | Supported | Supported |
| Chat or completion | Streaming | Key metrics (TTFT/TPOT) | Supported | Not supported | Supported |
| Chat or completion | Non-streaming | Span | Supported | Supported | Supported |
| Chat or completion | Non-streaming | Key metrics (TTFT/TPOT) | Not applicable | Not applicable | Not applicable |
| Embedding | - | HTTP | Not supported | Not supported | Not supported |
| Rerank | - | HTTP | Not supported | Not supported | Not supported |
Important spans and attributes
The following attributes are related to the llm_request span:
| Attribute | Description |
| --- | --- |
| gen_ai.latency.e2e | End-to-end time |
| gen_ai.latency.time_in_queue | Time spent in the queue |
| gen_ai.latency.time_in_scheduler | Time spent in the scheduler |
| gen_ai.latency.time_to_first_token | Time to first token |
| gen_ai.request.id | Request ID |