This release upgrades vLLM to v0.10.0 and SGLang to v0.4.10.post2. No bug fixes are included.
What's new
- vLLM upgraded to v0.10.0
- SGLang upgraded to v0.4.10.post2
Image variants
Two image variants are available: one built around vLLM and one around SGLang.
| Field | vLLM variant | SGLang variant |
|---|---|---|
| Tag | 25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless | 25.08-sglang0.4.10.post2-pytorch2.7-cu128-20250808-serverless |
| Use case | Large model inference | Large model inference |
| Framework | PyTorch | PyTorch |
| Minimum driver | NVIDIA Driver >= 570 | NVIDIA Driver >= 570 |
System components
vLLM image
| Package type | Package | Version |
|---|---|---|
| OS | Ubuntu | 24.04 |
| Runtime | Python | 3.12 |
| Runtime | Torch | 2.7.1+cu128 |
| Runtime | CUDA | 12.8 |
| Library | NCCL | 2.27.5 |
| Library | flash_attn | 2.8.2 |
| Library | triton | 3.3.1 |
| Library | xformers | 0.0.31 |
| Library | xfuser | 0.4.4 |
| Library | xgrammar | 0.1.21 |
| Library | ray | 2.48.0 |
| Library | transformers | 4.55.0 |
| Library | diffusers | 0.34.0 |
| Library | imageio | 2.37.0 |
| Library | imageio-ffmpeg | 0.6.0 |
| Library | vllm | 0.10.0 |
| DeepGPU | deepgpu-torch | 0.0.24+torch2.7.0cu128 |
| DeepGPU | deepgpu-comfyui | 1.1.7 |
SGLang image
| Package type | Package | Version |
|---|---|---|
| OS | Ubuntu | 24.04 |
| Runtime | Python | 3.12 |
| Runtime | Torch | 2.7.1+cu128 |
| Runtime | CUDA | 12.8 |
| Library | NCCL | 2.27.5 |
| Library | flash_attn | 2.8.2 |
| Library | flash_mla | 1.0.0+41b611f |
| Library | flashinfer-python | 0.2.9rc2 |
| Library | triton | 3.3.1 |
| Library | xgrammar | 0.1.22 |
| Library | torchao | 0.9.0 |
| Library | transformers | 4.54.1 |
| Library | diffusers | 0.34.0 |
| Library | imageio | 2.37.0 |
| Library | imageio-ffmpeg | 0.6.0 |
| Library | sgl-kernel | 0.2.8 |
| Library | sglang | 0.4.10.post2 |
Image registry
Internet images
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.08-sglang0.4.10.post2-pytorch2.7-cu128-20250808-serverless
VPC images
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace the placeholders with your actual values:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | Region where your ACS is activated | cn-beijing, cn-wulanchabu |
| {image:tag} | Image name and tag | inference-nv-pytorch:25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless |
VPC image pulls are currently supported only in the China (Beijing) region.
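As a worked example of the substitution described above, the following sketch assembles a pullable VPC image reference. The region and tag are the sample values from the table; adjust them to your own environment.

```shell
# Example only: substitute the placeholders from the table above to form a
# pullable VPC image reference. Values below are the documented samples.
region_id="cn-beijing"
image_tag="inference-nv-pytorch:25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless"
image_ref="acs-registry-vpc.${region_id}.cr.aliyuncs.com/egslingjun/${image_tag}"
echo "$image_ref"
# docker pull "$image_ref"   # run this from a node inside the VPC
```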
Both images are compatible with ACS products and Lingjun multi-tenant products. They are not compatible with Lingjun single-tenant products.
Driver requirements
NVIDIA Driver release >= 570
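A quick way to verify the requirement above is to compare the major driver version reported by nvidia-smi against 570. This is a minimal sketch, assuming nvidia-smi is on PATH; it does nothing on machines without an NVIDIA driver.

```shell
# check_driver succeeds when the major driver version is >= 570.
check_driver() {
  major=${1%%.*}
  [ "$major" -ge 570 ] 2>/dev/null
}

# Query the live driver; skipped silently on machines without nvidia-smi.
if v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1); then
  if check_driver "$v"; then
    echo "Driver $v meets the >= 570 requirement"
  else
    echo "Driver $v is too old; upgrade to 570 or later" >&2
  fi
fi
```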
Quick start
The following example pulls the vLLM image and runs an inference test using the Qwen2.5-7B-Instruct model.
To use inference-nv-pytorch images in ACS, select the image on the Artifacts page when creating a workload in the console, or specify the image reference in a YAML file. For step-by-step guides, see the ACS documentation.
Pull the container image.
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Download the Qwen2.5-7B-Instruct model from ModelScope.

pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

Start the container.

docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Start the vLLM inference service inside the container.

python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1

Send a test request from the client.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Tell me about deep learning."}
    ]
  }'

For more information about vLLM, see the vLLM documentation.
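The test request can also be scripted. This sketch builds the same chat-completions payload in a variable before posting it; note that the model field must match the path passed to --model when the server was started.

```shell
# Build the chat-completions payload from the quick-start example.
# The "model" value must equal the --model path used to launch the server.
model="/mnt/Qwen2.5-7B-Instruct"
payload=$(cat <<EOF
{
  "model": "$model",
  "messages": [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Tell me about deep learning."}
  ]
}
EOF
)
echo "$payload"
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```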
Known issues
The deepgpu-comfyui plug-in for Wanx model video generation currently supports only the GN8IS and G49E GPU instance types.