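Observability for vLLM/SGLang inference engines - Application Real-Time Monitoring Service

The Python agent for Application Monitoring now includes a vLLM/SGLang plugin, which enables observability for the vLLM and SGLang inference engines.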

Note

ARMS currently supports observability only for the vLLM and SGLang frameworks.

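Integrate with PAI-EAS

Elastic Algorithm Service (EAS) is the online model serving component of PAI. It targets online inference as part of PAI's one-stop model development, deployment, and application workflow. EAS supports deploying model services in a public resource group or a dedicated resource group, and provides model loading and real-time responses to data requests on heterogeneous hardware (CPU and GPU).

Step 1: Prepare the environment variables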

export ARMS_APP_NAME=xxx   # Name of the EAS application.
export ARMS_REGION_ID=xxx   # Region ID of the corresponding Alibaba Cloud account.
export ARMS_LICENSE_KEY=xxx   # Alibaba Cloud license key.
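
Before changing the run command, it can help to confirm that all three variables are actually set. The following is a minimal Python sketch of such a check. The helper name `missing_arms_vars` is hypothetical and is not part of the ARMS agent, and treating `xxx` as an unfilled placeholder is an assumption of this sketch.

```python
import os

# The three variables prepared in this step.
REQUIRED_ARMS_VARS = ("ARMS_APP_NAME", "ARMS_REGION_ID", "ARMS_LICENSE_KEY")


def missing_arms_vars(env=None):
    """Return the required ARMS variables that are unset, empty, or still a placeholder."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_ARMS_VARS:
        value = env.get(name, "").strip()
        # Assumption of this sketch: "xxx" is the unfilled placeholder from the example.
        if not value or value == "xxx":
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_arms_vars()
    if problems:
        print("Missing or placeholder values: " + ", ".join(problems))
    else:
        print("All ARMS environment variables are set.")
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to exercise outside the container.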

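Step 2: Modify the PAI-EAS run command

Log on to the PAI console, select the target region at the top of the page, and then enter the target workspace.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS).
On the Inference Service tab, find the application that you want to connect to model observability, and then click Update in the Actions column.

Modify the run command.

The following example uses the DeepSeek-R1-Distill-Qwen-7B model.

Original vLLM command: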

gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

vLLM command with Application Monitoring enabled:

gpu_count=$(nvidia-smi --query-gpu=count --format=csv,noheader | wc -l);pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument vllm serve /model_dir --host 0.0.0.0 --port 8000 --root-path '/' --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 32768 --tensor-parallel-size $gpu_count --served-model-name DeepSeek-R1-Distill-Qwen-7B

Description of the added parts:

Configure the PyPI index. You can adjust this to match your environment.

pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;

Download the agent installer.
```
pip3 install aliyun-bootstrap;
```
Use the installer to install the agent.
Replace cn-hangzhou with the region you actually use.
```
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
```

Original SGLang command:

python -m sglang.launch_server --model-path /model_dir

SGLang command with Application Monitoring enabled:

pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com; pip3 install aliyun-bootstrap;ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;ARMS_APP_NAME=qwq32 ARMS_LICENSE_KEY=it0kjz0oxz@3115ad****** ARMS_REGION_ID=cn-hangzhou aliyun-instrument python -m sglang.launch_server --model-path /model_dir

Description of the added parts:

Configure the PyPI index. You can adjust this to match your environment.

pip3 config set global.index-url https://mirrors.aliyun.com/pypi/simple/ ; pip3 config set install.trusted-host mirrors.aliyun.com;

Download the agent installer.
```
pip3 install aliyun-bootstrap;
```
Use the installer to install the agent.
Replace cn-hangzhou with the region you actually use.
```
ARMS_REGION_ID=cn-hangzhou aliyun-bootstrap -a install;
```

Click Update.
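
The vLLM and SGLang commands above follow the same pattern: install `aliyun-bootstrap`, run `aliyun-bootstrap -a install`, then start the engine through `aliyun-instrument` with the three ARMS variables set. The sketch below composes that string for an arbitrary start command. The function `build_instrumented_command` is a hypothetical helper for illustration only, and the sketch leaves out the optional PyPI index configuration.

```python
import shlex


def build_instrumented_command(original_cmd, app_name, region_id, license_key):
    """Wrap an engine start command with the agent install and launch steps."""
    env_prefix = " ".join([
        "ARMS_APP_NAME=" + shlex.quote(app_name),
        "ARMS_LICENSE_KEY=" + shlex.quote(license_key),
        "ARMS_REGION_ID=" + shlex.quote(region_id),
    ])
    steps = [
        # Download the agent installer.
        "pip3 install aliyun-bootstrap",
        # Install the agent for the given region.
        "ARMS_REGION_ID=" + shlex.quote(region_id) + " aliyun-bootstrap -a install",
        # Start the engine through aliyun-instrument with the ARMS variables set.
        env_prefix + " aliyun-instrument " + original_cmd,
    ]
    return "; ".join(steps)


if __name__ == "__main__":
    print(build_instrumented_command(
        "python -m sglang.launch_server --model-path /model_dir",
        app_name="qwq32",
        region_id="cn-hangzhou",
        license_key="demo-key",  # placeholder value, not a real license key
    ))
```

Pass only the final engine start command as `original_cmd`. Any shell setup that precedes it, such as the `gpu_count=...` assignment in the vLLM example, should stay in front of the generated string.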

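Integrate models in general scenarios

ARMS currently supports only the officially released versions of vLLM (V0 and V1) and SGLang. For the supported version ranges, see LLM (large language model) services. Versions modified by users cannot be integrated.

ARMS supports two scenarios: completion and chat. A non-streaming request produces 2 spans, and a streaming request produces 3 spans.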

Supported scenario	Data processing	Collected content	vLLM V0	vLLM V1	SGLang
Chat or completion	Streaming	span	http input/output; llm_request: key metrics	http input/output	http input/output; key metrics; reasoning
	Streaming	key metrics TTFT/TPOT	Supported	Not supported	Supported
	Non-streaming	span	http input/output	http input/output	http input/output
	Non-streaming	key metrics TTFT/TPOT	Not applicable	Not applicable	Not applicable
Embedding		http	Not supported	Not supported	Not supported
Rerank		http	Not supported	Not supported	Not supported

Key spans and attributes

Attributes of llm_request:

Attribute	Description
gen_ai.latency.e2e	End-to-end latency
gen_ai.latency.time_in_queue	Time spent in the queue
gen_ai.latency.time_in_scheduler	Scheduling time
gen_ai.latency.time_to_first_token	Time to first token (TTFT)
gen_ai.request.id	Request ID
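
To illustrate how these attributes relate to each other, the sketch below derives two figures from one llm_request span. It assumes the attributes have been exported as a plain dictionary and that all latency values share one unit (seconds are used here). Neither assumption is confirmed by this document, and `summarize_llm_request` is a hypothetical helper, not an ARMS API.

```python
def summarize_llm_request(attributes):
    """Derive a few latency figures from llm_request span attributes."""
    e2e = attributes["gen_ai.latency.e2e"]
    ttft = attributes["gen_ai.latency.time_to_first_token"]
    # Queue time may be absent on some spans; default to zero in this sketch.
    queue = attributes.get("gen_ai.latency.time_in_queue", 0.0)
    return {
        "request_id": attributes.get("gen_ai.request.id"),
        # Time between the first token and the end of the request.
        "time_after_first_token": e2e - ttft,
        # Fraction of the end-to-end latency spent waiting in the queue.
        "queue_share_of_e2e": queue / e2e if e2e > 0 else 0.0,
    }


if __name__ == "__main__":
    example = {
        "gen_ai.request.id": "req-1",  # made-up sample values
        "gen_ai.latency.e2e": 2.0,
        "gen_ai.latency.time_in_queue": 0.5,
        "gen_ai.latency.time_to_first_token": 0.5,
    }
    print(summarize_llm_request(example))
```

A high queue share suggests that requests are waiting for capacity rather than spending their time in generation.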