This release updates vLLM to v0.9.0.1 and SGLang to v0.4.7, and introduces the deepgpu-comfyui plug-in for accelerated ComfyUI inference on L20 GPUs. Use this page to confirm what changed, verify that your environment meets the driver requirements, and run the quick start to validate the image.
What's new
Main features
- vLLM is updated to v0.9.0.1.
- SGLang is updated to v0.4.7.
- The deepgpu-comfyui plug-in is introduced to accelerate ComfyUI services on L20 GPUs for Wan2.1 and FLUX model inference. Performance improves by 8%–40% compared to PyTorch.
Bugs fixed
None.
System components
This release includes two image variants: one based on vLLM and one based on SGLang. The table below lists their image tags, target scenarios, and component versions.
| | vLLM image | SGLang image |
|---|---|---|
| Image tag | 25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless | 25.06-sglang0.4.7-pytorch2.7-cu128-20250611-serverless |
| Scenario | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| NVIDIA driver requirement | >= 570 | >= 550 |
| Ubuntu | 24.04 | 22.04 |
| Python | 3.12 | 3.10 |
| Torch | 2.7.1+cu128 | 2.7.1+cu128 |
| CUDA | 12.8 | 12.8 |
| NCCL | 2.27.3 | 2.27.3 |
| accelerate | 1.7.0 | 1.7.0 |
| diffusers | 0.33.1 | 0.33.1 |
| deepgpu-comfyui | 1.1.5 | 1.1.5 |
| deepgpu-torch | 0.0.21+torch2.7.0cu128 | 0.0.21+torch2.7.0cu128 |
| flash_attn | 2.7.4.post1 | 2.7.4.post1 |
| flash_mla | — | 1.0.0+9edee0c |
| flashinfer-python | — | 0.2.6.post1 |
| imageio | 2.37.0 | 2.37.0 |
| imageio-ffmpeg | 0.6.0 | 0.6.0 |
| ray | 2.46.0 | 2.46.0 |
| transformers | 4.52.4 | 4.52.3 |
| vllm | 0.9.0.2.dev0+g5fbbfe9a4.d20250609 | — |
| sgl-kernel | — | 0.1.7 |
| sglang | — | 0.4.7 |
| xgrammar | 0.1.19 | 0.1.19 |
| triton | 3.3.1 | 3.3.1 |
| torchao | — | 0.9.0 |
Image access
Public images
Pull either image directly from the public registry:
- egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless
- egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.06-sglang0.4.7-pytorch2.7-cu128-20250611-serverless
VPC images
For VPC access, use the following address pattern:
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace {region-id} with the region where your ACS is activated, such as cn-beijing or cn-wulanchabu. Replace {image:tag} with the image name and tag.
VPC image pulls are available only in the China (Beijing) region.
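As an illustration of the address pattern above, the following sketch composes the VPC pull address for the vLLM image in cn-beijing (the region and tag values are taken from this page; the final `docker pull` is left commented out because it requires VPC access):

```shell
# Example: resolve the VPC registry address for the vLLM image.
REGION_ID="cn-beijing"
IMAGE_TAG="inference-nv-pytorch:25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless"
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${VPC_IMAGE}"
# docker pull "${VPC_IMAGE}"   # run this from a machine inside the VPC
```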
Both images (25.06-vllm0.9.0.1-pytorch2.7-cu128-20250609-serverless and 25.06-sglang0.4.7-pytorch2.7-cu128-20250611-serverless) are compatible with ACS services and FLUX multi-tenant services, but are not compatible with FLUX single-tenant services.
Driver requirements
Both images require CUDA 12.8. The minimum NVIDIA driver version differs by image:
| Image | Minimum NVIDIA driver version |
|---|---|
| vLLM image | 570 |
| SGLang image | 550 |
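As a quick local check before pulling an image, the host driver's major version can be compared against these minimums. This is a sketch: `driver_ok` is a hypothetical helper, and the version strings come from `nvidia-smi --query-gpu=driver_version --format=csv,noheader` on your host.

```shell
# driver_ok <driver-version> <minimum-major>: succeeds if the driver's
# major version meets the per-image minimum from the table above.
driver_ok() {
  major="${1%%.*}"          # keep only the major version, e.g. 570.124.06 -> 570
  [ "${major}" -ge "$2" ]
}

# Hypothetical installed versions for illustration:
driver_ok "570.124.06" 570 && echo "meets the vLLM image minimum (570)"
driver_ok "550.54.15" 550 && echo "meets the SGLang image minimum (550)"
```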
Quick start
The following example pulls the inference-nv-pytorch image with Docker and runs an inference test using the Qwen2.5-7B-Instruct model.
To use the inference-nv-pytorch image in ACS, select the image from the artifact center page when creating workloads, or specify it in a YAML file. For details, see:
- Use ACS GPU compute power to deploy a model inference service from a DeepSeek distilled model
- Use ACS GPU compute power to deploy a model inference service based on the DeepSeek full version
- Use ACS GPU compute power to deploy a distributed model inference service based on the DeepSeek full version
- Pull the inference container image.

  ```shell
  docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  ```

- Download the Qwen2.5-7B-Instruct model from ModelScope.

  ```shell
  pip install modelscope
  cd /mnt
  modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  ```

- Start the container.

  ```shell
  docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  ```

- Run the inference test.

  - Start the vLLM API server.

    ```shell
    python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    ```

  - Send a test request from the client.

    ```shell
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/mnt/Qwen2.5-7B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Please introduce deep learning."}
        ]
      }'
    ```

For more information about vLLM, see vLLM.
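The same test request can be sent from Python using only the standard library. This is a sketch: `build_chat_request` and `send_chat_request` are hypothetical helper names, and calling `send_chat_request` requires the vLLM server started in the previous step to be listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_request(model, messages):
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call."""
    return {"model": model, "messages": messages}


# The same payload as the curl example above.
payload = build_chat_request(
    "/mnt/Qwen2.5-7B-Instruct",
    [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Please introduce deep learning."},
    ],
)


def send_chat_request(body, url="http://localhost:8000/v1/chat/completions"):
    """POST the request body to the server and return the decoded response.

    Only call this while the vLLM API server from the previous step is running.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```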
Known issues
- The deepgpu-comfyui plug-in supports only GN8IS for accelerating video generation based on the Wanx model.