
Container Compute Service: inference-nv-pytorch 25.06

Last Updated: Mar 26, 2026

This release updates vLLM to v0.9.0.1 and SGLang to v0.4.7, and introduces the deepgpu-comfyui plug-in for accelerated ComfyUI inference on L20 GPUs. Use this page to confirm what changed, verify that your environment meets the driver requirements, and run the quick start to validate the image.

What's new

Main features

  • vLLM is updated to v0.9.0.1.

  • SGLang is updated to v0.4.7.

  • The deepgpu-comfyui plug-in is introduced to accelerate ComfyUI inference for the Wan2.1 and FLUX models on L20 GPUs. Performance improves by 8%–40% compared with native PyTorch.

Bugs fixed

None.

System components

This release includes two image variants: one based on vLLM and one based on SGLang. The table below lists their image tags, target scenarios, and component versions.

| Item | vLLM image | SGLang image |
| --- | --- | --- |
| Image tag | 25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless | 25.06-sglang0.4.7-pytorch2.7-cu128-20250611-serverless |
| Scenario | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| NVIDIA driver requirement | >= 570 | >= 550 |
| Ubuntu | 24.04 | 22.04 |
| Python | 3.12 | 3.10 |
| Torch | 2.7.1+cu128 | 2.7.1+cu128 |
| CUDA | 12.8 | 12.8 |
| NCCL | 2.27.3 | 2.27.3 |
| accelerate | 1.7.0 | 1.7.0 |
| diffusers | 0.33.1 | 0.33.1 |
| deepgpu-comfyui | 1.1.5 | 1.1.5 |
| deepgpu-torch | 0.0.21+torch2.7.0cu128 | 0.0.21+torch2.7.0cu128 |
| flash_attn | 2.7.4.post1 | 2.7.4.post1 |
| flash_mla | - | 1.0.0+9edee0c |
| flashinfer-python | - | 0.2.6.post1 |
| imageio | 2.37.0 | 2.37.0 |
| imageio-ffmpeg | 0.6.0 | 0.6.0 |
| ray | 2.46.0 | 2.46.0 |
| transformers | 4.52.4 | 4.52.3 |
| vllm | 0.9.0.2.dev0+g5fbbfe9a4.d20250609 | - |
| sgl-kernel | - | 0.1.7 |
| sglang | - | 0.4.7 |
| xgrammar | 0.1.19 | 0.1.19 |
| triton | 3.3.1 | 3.3.1 |
| torchao | - | 0.9.0 |

Image access

Public images

Pull either image directly from the public registry:

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.06-sglang0.4.7-pytorch2.7-cu128-20250611-serverless

VPC images

For VPC access, use the following address pattern:

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace {region-id} with the region where your ACS is activated, such as cn-beijing or cn-wulanchabu. Replace {image:tag} with the image name and tag.
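For example, the address can be composed in a shell script before pulling. The region ID and image tag below are illustrative; substitute the values for your own environment:

```shell
# Compose the VPC registry address from a region ID and an image tag.
REGION_ID="cn-beijing"
IMAGE_TAG="inference-nv-pytorch:25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless"
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "$VPC_IMAGE"
# docker pull "$VPC_IMAGE"   # run this from a host inside the VPC
```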

Important

VPC image pulls are available only in the China (Beijing) region.

Important

Both images (25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless and 25.06-sglang0.4.7-pytorch2.7-cu128-20250611-serverless) are compatible with ACS services and FLUX multi-tenant services, but are not compatible with FLUX single-tenant services.

Driver requirements

Both images require CUDA 12.8. The minimum NVIDIA driver version differs by image:

Image Minimum NVIDIA driver version
vLLM image 570
SGLang image 550
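Before deploying, you can verify the host driver against the minimum for the image you plan to use. A minimal sketch, assuming a GNU userland with sort -V available (the driver version shown is illustrative; on a real host, query it with nvidia-smi):

```shell
# Return success if the installed driver version meets the minimum.
meets_min() {  # usage: meets_min <installed> <minimum>
    # sort -V orders version strings; the minimum must sort first (or equal).
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Illustrative value; on a real host, query the driver with:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
driver="570.124.06"
if meets_min "$driver" "570"; then
    echo "driver $driver OK for the vLLM image (>= 570)"
else
    echo "driver $driver too old for the vLLM image (>= 570 required)"
fi
```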

Quick start

The following example pulls the inference-nv-pytorch image with Docker and runs an inference test using the Qwen2.5-7B-Instruct model.

To use the inference-nv-pytorch image in ACS, select the image from the artifact center page when creating workloads, or specify it in a YAML file. For details, see:

  • Use ACS GPU compute power to deploy a model inference service from a DeepSeek distilled model

  • Use ACS GPU compute power to deploy a model inference service based on the DeepSeek full version

  • Use ACS GPU compute power to deploy a distributed model inference service based on the DeepSeek full version
  1. Pull the inference container image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download the Qwen2.5-7B-Instruct model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Start the container.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Run the inference test.

    1. Start the vLLM API server.

      python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    2. Send a test request from the client.

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
              "model": "/mnt/Qwen2.5-7B-Instruct",
              "messages": [
                  {"role": "system", "content": "You are a friendly AI assistant."},
                  {"role": "user", "content": "Please introduce deep learning."}
              ]
          }'

      For more information, see the vLLM documentation.
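The server replies with JSON in the OpenAI chat-completions format. One way to extract the assistant message without extra tooling is the Python json module; the sample response below is illustrative, not real model output:

```shell
# Illustrative response in the OpenAI chat-completions shape; a real one
# comes from the curl request above.
response='{"choices":[{"message":{"role":"assistant","content":"Deep learning is a branch of machine learning."}}]}'

# Extract the assistant message (avoids a jq dependency).
echo "$response" | python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```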

Known issues

  • The deepgpu-comfyui plug-in supports accelerated video generation based on the Wan2.1 model only on GN8IS instance types.