Container Compute Service: inference-nv-pytorch 25.10

Last Updated: Nov 01, 2025

This topic describes the release notes for inference-nv-pytorch 25.10.

Main features and bug fixes

Main features

  • Images are provided for two CUDA versions: CUDA 12.8 and CUDA 13.0.

    • The CUDA 12.8 image supports only the amd64 architecture.

    • The CUDA 13.0 image supports the amd64 and aarch64 architectures. It can be used with L20A/20C instance types (a multi-architecture pull example follows this list).

  • In the CUDA 12.8 image, deepgpu-comfyui is upgraded to 1.3.0, and the deepgpu-torch optimization component is upgraded to 0.1.6+torch2.8.0cu128.

  • In the CUDA 13.0 image, the PyTorch version is upgraded to 2.9.0.

  • In the CUDA 12.8 and CUDA 13.0 images, the vLLM version is upgraded to v0.11.0, and the SGLang version is upgraded to v0.5.4.
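
Because the CUDA 13.0 image is multi-architecture, the command below is a minimal sketch of pinning a specific platform at pull time with Docker's standard --platform flag; the tag is taken from the CUDA 13.0 assets listed later in this topic.

    # Explicitly pull the aarch64 variant of the CUDA 13.0 vLLM image:
    docker pull --platform linux/arm64 egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.9-cu130-20251028-serverless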

Bug fixes

None

Contents

inference-nv-pytorch

Tag: 25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless

  • Supported architectures: amd64
  • Scenarios: Large model inference
  • Framework: pytorch
  • Requirements: NVIDIA Driver release >= 570
  • System components:
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.8.0+cu128
    • CUDA 12.8
    • diffusers 0.35.2
    • deepgpu-comfyui 1.3.0
    • deepgpu-torch 0.1.6+torch2.8.0cu128
    • flash_attn 2.8.3
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • ray 2.50.1
    • transformers 4.57.1
    • triton 3.4.0
    • tokenizers 0.22.1
    • torchaudio 2.8.0+cu128
    • torchsde 0.2.6
    • torchvision 0.23.0+cu128
    • vllm 0.11.0
    • xfuser 0.4.4
    • xgrammar 0.1.25
    • ljperf 0.1.0+477686c5

Tag: 25.10-sglang0.5.4-pytorch2.8-cu128-20251027-serverless

  • Supported architectures: amd64
  • Scenarios: Large model inference
  • Framework: pytorch
  • Requirements: NVIDIA Driver release >= 570
  • System components:
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.8.0+cu128
    • CUDA 12.8
    • diffusers 0.35.2
    • decord 0.6.0
    • decord2 2.0.0
    • deepgpu-comfyui 1.3.0
    • deepgpu-torch 0.1.6+torch2.8.0cu128
    • flash_attn 2.8.3
    • flash_mla 1.0.0+1858932
    • flashinfer-python 0.4.1
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • transformers 4.57.1
    • sgl-kernel 0.3.16.post3
    • sglang 0.5.4
    • xgrammar 0.1.25
    • triton 3.4.0
    • torchao 0.9.0
    • torchaudio 2.8.0+cu128
    • torchsde 0.2.6
    • torchvision 0.23.0+cu128
    • xfuser 0.4.4
    • ljperf 0.1.0+477686c5

Tag: 25.10-vllm0.11.0-pytorch2.9-cu130-20251028-serverless

  • Supported architectures: amd64, aarch64
  • Scenarios: Large model inference
  • Framework: pytorch
  • Requirements: NVIDIA Driver release >= 580
  • System components (amd64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.1
    • diffusers 0.35.2
    • flash_attn 2.8.3
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • ray 2.50.1
    • transformers 4.57.1
    • triton 3.5.0
    • tokenizers 0.22.1
    • torchvision 0.24.0+cu130
    • vllm 0.11.0
    • xfuser 0.4.4
    • xgrammar 0.1.25
    • ljperf 0.1.0+d0e4a408
  • System components (aarch64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.1
    • diffusers 0.35.2
    • flash_attn 2.8.3
    • transformers 4.57.1
    • ray 2.50.1
    • vllm 0.11.0
    • triton 3.5.0
    • tokenizers 0.22.1
    • torchaudio 2.9.0
    • torchvision 0.24.0
    • xfuser 0.3
    • xgrammar 0.1.25
    • ljperf 0.1.0+477686c5

Tag: 25.10-sglang0.5.4-pytorch2.9-cu130-20251028-serverless

  • Supported architectures: amd64, aarch64
  • Scenarios: Large model inference
  • Framework: pytorch
  • Requirements: NVIDIA Driver release >= 580
  • System components (amd64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.1
    • diffusers 0.35.2
    • decord 0.6.0
    • decord2 2.0.0
    • flash_attn 2.8.3
    • flash_mla 1.0.0+1858932
    • flashinfer-python 0.4.1
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • transformers 4.57.1
    • sgl-kernel 0.3.16.post3
    • sglang 0.5.4
    • xgrammar 0.1.25
    • triton 3.5.0
    • torchao 0.9.0
    • torchaudio 2.9.0+cu130
    • torchvision 0.24.0+cu130
    • xfuser 0.4.4
    • ljperf 0.1.0+477686c5
  • System components (aarch64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.1
    • diffusers 0.35.2
    • decord2 2.0.0
    • flashinfer-python 0.4.1
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • transformers 4.57.1
    • sgl-kernel 0.3.16.post3
    • sglang 0.5.4
    • xgrammar 0.1.25
    • triton 3.5.0
    • torchao 0.9.0
    • torchaudio 2.9.0
    • torchvision 0.24.0
    • xfuser 0.4.4

Assets

Public images

CUDA 12.8 assets

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-sglang0.5.4-pytorch2.8-cu128-20251027-serverless

CUDA 13.0 assets

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.9-cu130-20251028-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-sglang0.5.4-pytorch2.9-cu130-20251028-serverless

VPC image

  • acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

    {region-id} indicates the region where your ACS is activated, such as cn-beijing or cn-wulanchabu.
    {image:tag} indicates the name and tag of the image.
Important

Currently, images can be pulled over a VPC only in the China (Beijing) region.
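
For example, with ACS activated in the China (Beijing) region, the CUDA 13.0 vLLM image listed above would be pulled over a VPC as follows (a sketch of the template substitution; use your own region ID and tag):

    docker pull acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.9-cu130-20251028-serverless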

Driver requirements

  • CUDA 12.8: NVIDIA Driver release >= 570

  • CUDA 13.0: NVIDIA Driver release >= 580
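
To confirm that a host meets these requirements, you can query the installed driver version with standard NVIDIA tooling, for example:

    nvidia-smi --query-gpu=driver_version --format=csv,noheader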

Quick start

This example shows how to pull the inference-nv-pytorch image using Docker and test the inference service with the Qwen2.5-7B-Instruct model.

Note

To use the inference-nv-pytorch image in ACS, select the image on the Artifacts page when you create a workload in the console, or specify the image reference in a YAML file. For more information, see the topics on building model inference services with ACS GPU compute power.

  1. Pull the inference container image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
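    # For example, replace [tag] with one of the tags from the Contents section:
    # docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless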
  2. Download the open source model in ModelScope format.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
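    # Note: the Qwen2.5-7B-Instruct weights are large (on the order of 15 GB);
    # make sure the download directory has enough free space.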
  3. Run the following commands to start the container and open a shell inside it.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    --name inference-test \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
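    # Open a shell in the running container ("inference-test" is the example
    # name assigned by --name above):
    docker exec -it inference-test /bin/bash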
  4. Test the vLLM conversational inference feature.

    1. Start the server.

      python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
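      # The server exposes an OpenAI-compatible API on port 8000 by default;
      # once it is up, `curl http://localhost:8000/v1/models` lists the served
      # models before you run the client test below.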
    2. Test on the client.

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",  
          "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Tell me about deep learning."}
          ]}'
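
      The same endpoint also accepts the standard OpenAI-compatible "stream" field for token-by-token streaming, which is common in interactive use; a minimal variation of the request above:

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",
          "stream": true,
          "messages": [
          {"role": "user", "content": "Tell me about deep learning."}
          ]}'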

      For more information about how to use vLLM, see the vLLM documentation.

Known issues

  • The deepgpu-comfyui plug-in for accelerating Wanx model video generation currently supports only GN8IS and G49E.