This topic provides the release notes for inference-nv-pytorch 25.08.
Main features and bug fixes
Main features
Upgraded vLLM to v0.10.0.
Upgraded SGLang to v0.4.10.post2.
Bug fixes
(None)
Contents
| Image | inference-nv-pytorch | inference-nv-pytorch |
| --- | --- | --- |
| Tag | 25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless | 25.08-sglang0.4.10.post2-pytorch2.7-cu128-20250808-serverless |
| Application scenario | Large model inference | Large model inference |
| Framework | PyTorch | PyTorch |
| Requirements | NVIDIA Driver release >= 570 | NVIDIA Driver release >= 570 |
| System components | vLLM 0.10.0, PyTorch 2.7, CUDA 12.8 | SGLang 0.4.10.post2, PyTorch 2.7, CUDA 12.8 |
Assets
Internet images
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.08-sglang0.4.10.post2-pytorch2.7-cu128-20250808-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} indicates the region where your ACS is activated, such as cn-beijing or cn-wulanchabu. {image:tag} indicates the name and tag of the image.
Currently, you can pull only images in the China (Beijing) region over a VPC.
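For example, assuming your ACS cluster is in the cn-beijing region and you want the vLLM variant, substituting into the pattern above would give a pull command like the following (the region and tag shown are illustrative substitutions):

```bash
docker pull acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless
```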
The inference-nv-pytorch:25.08-vllm0.10.0-pytorch2.7-cu128-20250811-serverless and inference-nv-pytorch:25.08-sglang0.4.10.post2-pytorch2.7-cu128-20250808-serverless images are applicable to ACS products and Lingjun multi-tenant products, but not to Lingjun single-tenant products.
Driver requirements
NVIDIA Driver release >= 570
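To verify that a node satisfies this requirement, you can query the installed driver version with nvidia-smi, for example:

```bash
# Print only the NVIDIA driver version; it should be 570 or later.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```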
Quick start
The following example shows how to pull the inference-nv-pytorch image using Docker and test the inference service with the Qwen2.5-7B-Instruct model.
To use the inference-nv-pytorch image in ACS, you can select the image on the Artifacts page when you create a workload in the console, or specify the image reference in a YAML file. For more information, see the following topics about building model inference services using ACS GPU computing power:
Build a DeepSeek distilled model inference service using ACS GPU computing power
Build a full-featured DeepSeek model inference service using ACS GPU computing power
Build a distributed full-featured DeepSeek inference service using ACS GPU computing power
Accelerate Wan2.1 video generation using DeepGPU
1. Pull the inference container image.

```bash
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
```

2. Download the open-source model in ModelScope format.

```bash
pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
```

3. Run the following command to start and enter the container.

```bash
docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
```

4. Test the vLLM inference and conversation feature.

Start the service:

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1
```

Run a test from the client:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Tell me about deep learning."}
    ]
  }'
```

For more information about how to use vLLM, see vLLM.
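Because vLLM's api_server exposes an OpenAI-compatible API, its model-list endpoint can also serve as a quick liveness check before running the chat test:

```bash
# List the models served by the vLLM server; a successful JSON response
# confirms the service is up and the model is loaded.
curl http://localhost:8000/v1/models
```

The SGLang image can be exercised in a similar way. The following is a minimal sketch, assuming the same model path inside the container; the port value is illustrative (30000 is SGLang's documented default), and you should consult the SGLang documentation for authoritative launch options:

```bash
# Start an SGLang server with the downloaded model (sketch; port assumed).
python3 -m sglang.launch_server --model-path /mnt/Qwen2.5-7B-Instruct --port 30000
```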
Known issues
The deepgpu-comfyui plug-in, which accelerates video generation for Wanx models, currently supports only GN8IS and G49E.