This release upgrades vLLM, SGLang, and deepgpu-comfyui, and resolves a multi-node inference bug in DeepSeek-R1 deployments.
## What's new
### Framework upgrades
| Framework | Version |
|---|---|
| vLLM | v0.9.2 |
| SGLang | v0.4.9.post1 |
| deepgpu-comfyui | v1.1.7 |
### Bug fix
In vLLM 0.9.2, running the DeepSeek-R1 model in a multi-node (dual-machine) configuration raised a PPMissingLayer error. This release pre-applies the fix from upstream PR #20665, so distributed inference on multi-node clusters works without manual patching.
## Image specifications
This release provides two image variants, both targeting Large Language Model (LLM) inference on PyTorch with CUDA 12.8.
| | vLLM image | SGLang image |
|---|---|---|
| Image tag | 25.07-vllm0.9.2-pytorch2.7-cu128-20250714-serverless | 25.07-sglang0.4.9-pytorch2.7-cu128-20250710-serverless |
| Use case | LLM inference | LLM inference |
| Framework | PyTorch | PyTorch |
| Driver requirement | NVIDIA Driver ≥570 | NVIDIA Driver ≥570 |
### System components — vLLM image
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.7.1+cu128 |
| CUDA | 12.8 |
| NCCL | 2.27.5 |
| accelerate | 1.8.1 |
| diffusers | 0.34.0 |
| deepgpu-comfyui | 1.1.7 |
| deepgpu-torch | 0.0.24+torch2.7.0cu128 |
| flash_attn | 2.8.1 |
| imageio | 2.37.0 |
| imageio-ffmpeg | 0.6.0 |
| ray | 2.47.1 |
| transformers | 4.53.1 |
| vllm | 0.9.3.dev0+ga5dd03c1e.d20250709 |
| xgrammar | 0.1.19 |
| triton | 3.3.1 |
### System components — SGLang image
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.7.1+cu128 |
| CUDA | 12.8 |
| NCCL | 2.27.5 |
| accelerate | 1.8.1 |
| diffusers | 0.34.0 |
| deepgpu-comfyui | 1.1.7 |
| deepgpu-torch | 0.0.24+torch2.7.0cu128 |
| flash_attn | 2.8.1 |
| flash_mla | 1.0.0+9edee0c |
| flashinfer-python | 0.2.7.post1 |
| imageio | 2.37.0 |
| imageio-ffmpeg | 0.6.0 |
| transformers | 4.53.0 |
| sgl-kernel | 0.2.4 |
| sglang | 0.4.9.post1 |
| xgrammar | 0.1.20 |
| triton | 3.3.1 |
| torchao | 0.9.0 |
## Image access
### Public images
Pull either image directly from the public registry:
```
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.07-vllm0.9.2-pytorch2.7-cu128-20250714-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.07-sglang0.4.9-pytorch2.7-cu128-20250710-serverless
```
### VPC images
For lower-latency pulls within a Virtual Private Cloud (VPC), use:
```
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
```

Replace `{region-id}` with the region where your Alibaba Cloud Container Compute Service (ACS) is activated (for example, cn-beijing or cn-wulanchabu), and `{image:tag}` with the image name and tag.
VPC image pulling is currently supported only in the China (Beijing) region.
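The `{region-id}`/`{image:tag}` substitution above can be sketched as a small helper. This is illustrative only; `vpc_image_ref` is a hypothetical function, not part of any provided tooling:

```python
def vpc_image_ref(region_id: str, image_and_tag: str) -> str:
    # Build a VPC registry reference of the form:
    #   acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"

print(vpc_image_ref(
    "cn-beijing",
    "inference-nv-pytorch:25.07-vllm0.9.2-pytorch2.7-cu128-20250714-serverless",
))
# -> acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.07-vllm0.9.2-pytorch2.7-cu128-20250714-serverless
```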
Both images are compatible with ACS clusters and Lingjun multi-tenant clusters. They are not supported on Lingjun single-tenant clusters.
### Driver requirement
CUDA 12.8 images require NVIDIA Driver 570 or later.
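One way to confirm a host meets this requirement is to compare the major version reported by `nvidia-smi` against 570. The helper below is a sketch, assuming a host with the NVIDIA driver installed; `meets_driver_requirement` and `installed_driver_version` are illustrative names, not part of the image:

```python
import re
import subprocess

MIN_DRIVER_MAJOR = 570  # CUDA 12.8 images require NVIDIA Driver >= 570


def meets_driver_requirement(version_string: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    # Driver versions look like "570.86.15"; compare the leading major component.
    match = re.match(r"(\d+)", version_string.strip())
    return match is not None and int(match.group(1)) >= minimum


def installed_driver_version() -> str:
    # Query the driver version from nvidia-smi (requires an NVIDIA GPU host).
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()[0]


# Example: meets_driver_requirement("570.86.15") -> True
```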
## Quick start
This example pulls the vLLM image, downloads the Qwen2.5-7B-Instruct model, and runs an inference test.
For ACS deployments, select the image from the Artifact Center in the console or specify it in your YAML configuration. See the following guides for end-to-end deployment instructions:
1. Pull the image.

   ```shell
   docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
   ```

2. Download the model from ModelScope.

   ```shell
   pip install modelscope
   cd /mnt
   modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
   ```

3. Launch the container.

   ```shell
   docker run -d -t --network=host --privileged --init --ipc=host \
     --ulimit memlock=-1 --ulimit stack=67108864 \
     -v /mnt/:/mnt/ \
     egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
   ```

4. Start the vLLM inference server.

   ```shell
   python3 -m vllm.entrypoints.openai.api_server \
     --model /mnt/Qwen2.5-7B-Instruct \
     --trust-remote-code --disable-custom-all-reduce \
     --tensor-parallel-size 1
   ```

5. Send a test request from the client.

   ```shell
   curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "/mnt/Qwen2.5-7B-Instruct",
       "messages": [
         {"role": "system", "content": "You are a friendly AI assistant."},
         {"role": "user", "content": "Please introduce deep learning."}
       ]
     }'
   ```

For more information, see the vLLM documentation.
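The same test request can also be sent from Python with only the standard library. This is a minimal sketch assuming the server from the previous step is listening on localhost:8000; the `chat_payload` and `send_chat` helpers are illustrative names, not part of the image:

```python
import json
import urllib.request


def chat_payload(model: str, user_message: str) -> dict:
    # Build the OpenAI-compatible chat-completions body used in the curl example.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a friendly AI assistant."},
            {"role": "user", "content": user_message},
        ],
    }


def send_chat(base_url: str, payload: dict) -> dict:
    # POST the JSON body to the vLLM OpenAI-compatible endpoint.
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    body = chat_payload("/mnt/Qwen2.5-7B-Instruct", "Please introduce deep learning.")
    # Requires the server started in the previous step:
    # reply = send_chat("http://localhost:8000", body)
    # print(reply["choices"][0]["message"]["content"])
    print(json.dumps(body, indent=2))
```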
## Known issues
The deepgpu-comfyui plugin for Wanx model video generation supports only gn8is instance types.