This page covers the changes in inference-nv-pytorch 25.09.
Announcements
The inference-nv-pytorch:25.09-vllm0.10.2-pytorch2.8-cu128-20250922-serverless and inference-nv-pytorch:25.09-sglang0.5.2-pytorch2.8-cu128-20250917-serverless images apply to ACS (Alibaba Cloud Container Service for Kubernetes) and Lingjun multi-tenant products only. They do not apply to Lingjun single-tenant products.
What's new
Framework upgrades
- PyTorch upgraded to 2.8.0
- vLLM upgraded to v0.10.2
- SGLang upgraded to v0.5.2
- deepgpu-comfyui upgraded to 1.2.1
- deepgpu-torch optimization component upgraded to 0.1.1+torch2.8.0cu128
Bug fixes
None.
Image contents
The following tables list the system components for each image tag.
vLLM image
Tag: 25.09-vllm0.10.2-pytorch2.8-cu128-20250922-serverless
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.8.0 |
| CUDA | 12.8 |
| Diffusers | 0.35.1 |
| deepgpu-comfyui | 1.2.1 |
| deepgpu-torch | 0.1.1+torch2.8.0cu128 |
| Flash Attention | 2.8.3 |
| flashinfer | 0.3.1 |
| imageio | 2.37.0 |
| imageio-ffmpeg | 0.6.0 |
| Ray | 2.49.1 |
| Transformers | 4.56.1 |
| Triton | 3.4.0 |
| vLLM | 0.10.2 |
| xFormers | 0.0.32.post1 |
| xFuser | 0.4.4 |
| XGrammar | 0.1.23 |
| ljperf | 0.1.0+477686c5 |
SGLang image
Tag: 25.09-sglang0.5.2-pytorch2.8-cu128-20250917-serverless
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.8.0 |
| CUDA | 12.8 |
| Decord | 0.6.0 |
| Diffusers | 0.35.1 |
| deepgpu-comfyui | 1.2.1 |
| deepgpu-torch | 0.1.1+torch2.8.0cu128 |
| Flash Attention | 2.8.3 |
| flash_mla | 1.0.0+261330b |
| flashinfer | 0.3.1 |
| imageio | 2.37.0 |
| imageio-ffmpeg | 0.6.0 |
| Transformers | 4.56.1 |
| sgl-kernel | 0.3.9 |
| SGLang | 0.5.2 |
| XGrammar | 0.1.24 |
| Triton | 3.4.0 |
| torchao | 0.9.0 |
| torchaudio | 2.8.0 |
| xFuser | 0.4.4 |
| ljperf | 0.1.0+477686c5 |
Assets
Public images
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.09-vllm0.10.2-pytorch2.8-cu128-20250922-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.09-sglang0.5.2-pytorch2.8-cu128-20250917-serverless
VPC images
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace the placeholders with the actual values:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | Region where your ACS is activated | cn-beijing, cn-wulanchabu |
| {image:tag} | Image name and tag | inference-nv-pytorch:25.09-vllm0.10.2-pytorch2.8-cu128-20250922-serverless |
Currently, you can pull VPC images only from the China (Beijing) region.
Driver requirements
NVIDIA Driver release >= 570
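Before pulling the image, you can verify that the host driver meets this requirement by parsing the version that nvidia-smi reports. The snippet below is a minimal sketch: the 570 threshold comes from this page, the `--query-gpu=driver_version` query is a standard nvidia-smi option, and the helper name is illustrative.

```python
import subprocess

MIN_DRIVER_MAJOR = 570  # minimum NVIDIA driver release required by this image


def driver_is_supported(version_string: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """Return True if a driver version such as '570.86.15' meets the minimum release."""
    major = int(version_string.strip().split(".")[0])
    return major >= minimum


if __name__ == "__main__":
    # Query the installed driver version (nvidia-smi prints one line per GPU).
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        print(line, "OK" if driver_is_supported(line) else "driver too old")
```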
Quick start
This example shows how to pull the inference-nv-pytorch image with Docker and test an inference service using the Qwen2.5-7B-Instruct model.
To use the inference-nv-pytorch image in ACS, select it on the Artifacts page when creating a workload in the console, or specify the image reference in a YAML file. For more information about building model inference services with ACS GPU computing power, see:
Build a DeepSeek distilled model inference service with ACS GPU computing power
Build a full-featured DeepSeek model inference service with ACS GPU computing power
Build a distributed full-featured DeepSeek inference service with ACS GPU computing power
Accelerate Wan2.1 video generation with DeepGPU
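As a sketch of the YAML path mentioned above, a workload manifest can reference the public image directly. The Deployment below is a hypothetical minimal example: the names, labels, and GPU resource setting are placeholders rather than values from this page; only the image reference is taken from the Assets section. Consult the linked ACS guides for the settings your product actually requires.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
        - name: vllm
          # Image reference from the Assets section of this page.
          image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.09-vllm0.10.2-pytorch2.8-cu128-20250922-serverless
          resources:
            limits:
              nvidia.com/gpu: 1   # placeholder; see the ACS guides for actual resource settings
```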
- Pull the inference container image.

  ```
  docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  ```

- Download the Qwen2.5-7B-Instruct model from ModelScope.

  ```
  pip install modelscope
  cd /mnt
  modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  ```

- Start the container.

  ```
  docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  ```
- Test conversational inference with vLLM.

  - Start the vLLM server.

    ```
    python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    ```

  - Send a test request.

    ```
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/mnt/Qwen2.5-7B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Tell me about deep learning."}
        ]
      }'
    ```

    For more information about vLLM, see the vLLM documentation.
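The same test request can be issued from Python instead of curl. The sketch below builds the OpenAI-compatible payload with the standard library and extracts the reply text from a response; the endpoint URL and model path match the curl example in the steps above, while the helper names are illustrative.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM server started in the previous step


def build_payload(user_message: str) -> dict:
    """Build an OpenAI-compatible chat completions request body."""
    return {
        "model": "/mnt/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a friendly AI assistant."},
            {"role": "user", "content": user_message},
        ],
    }


def extract_reply(response_body: dict) -> str:
    """Pull the assistant's text out of a chat completions response."""
    return response_body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload("Tell me about deep learning.")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(extract_reply(json.load(resp)))
```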
Known issues
The deepgpu-comfyui plugin for Wanx model video generation acceleration currently supports only GN8IS and G49E GPU types.