inference-nv-pytorch 25.10 Release Notes with Dual CUDA & vLLM 0.11.0 - Container Compute Service

Release notes for the inference-nv-pytorch 25.10 images.

Key features and bug fixes

Key features

Dual CUDA version support
Images are now available for two CUDA versions:
- The CUDA 12.8 image supports the amd64 architecture.
- The CUDA 13.0 image supports both amd64 and aarch64 architectures.
Core component upgrades
- For the CUDA 12.8 image:
  - deepgpu-comfyui has been upgraded to 1.3.0.
  - The deepgpu-torch optimization component has been upgraded to 0.1.6+torch2.8.0cu128.
- For the CUDA 13.0 image:
  - PyTorch has been upgraded to version 2.9.0.
- For both the CUDA 12.8 and CUDA 13.0 images:
  - vLLM has been upgraded to version 0.11.0.
  - SGLang has been upgraded to version 0.5.4.

Bug fixes

No bug fixes in this release.

inference-nv-pytorch
Tag	25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless	25.10-sglang0.5.4-pytorch2.8-cu128-20251027-serverless	25.10-vllm0.11.0-pytorch2.9-cu130-20251028-serverless	25.10-vllm0.11.0-pytorch2.9-cu130-20251028-serverless	25.10-sglang0.5.4-pytorch2.9-cu130-20251028-serverless	25.10-sglang0.5.4-pytorch2.9-cu130-20251028-serverless
Supported architectures	amd64	amd64	amd64	aarch64	amd64	aarch64
Use case	Large model inference	Large model inference	Large model inference	Large model inference	Large model inference	Large model inference
Framework	PyTorch	PyTorch	PyTorch	PyTorch	PyTorch	PyTorch
Requirements	NVIDIA driver release ≥ 570	NVIDIA driver release ≥ 570	NVIDIA driver release ≥ 580	NVIDIA driver release ≥ 580	NVIDIA driver release ≥ 580	NVIDIA driver release ≥ 580
System components	Ubuntu 24.04 Python 3.12 Torch 2.8.0+cu128 CUDA 12.8 diffusers 0.35.2 deepgpu-comfyui 1.3.0 deepgpu-torch 0.1.6+torch2.8.0cu128 flash_attn 2.8.3 imageio 2.37.0 imageio-ffmpeg 0.6.0 ray 2.50.1 transformers 4.57.1 triton 3.4.0 tokenizers 0.22.1 torchaudio 2.8.0+cu128 torchsde 0.2.6 torchvision 0.23.0+cu128 vllm 0.11.0 xfuser 0.4.4 xgrammar 0.1.25 ljperf 0.1.0+477686c5	Ubuntu 24.04 Python 3.12 Torch 2.8.0+cu128 CUDA 12.8 diffusers 0.35.2 decord 0.6.0 decord2 2.0.0 deepgpu-comfyui 1.3.0 deepgpu-torch 0.1.6+torch2.8.0cu128 flash_attn 2.8.3 flash_mla 1.0.0+1858932 flashinfer-python 0.4.1 imageio 2.37.0 imageio-ffmpeg 0.6.0 transformers 4.57.1 sgl-kernel 0.3.16.post3 sglang 0.5.4 xgrammar 0.1.25 triton 3.4.0 torchao 0.9.0 torchaudio 2.8.0+cu128 torchsde 0.2.6 torchvision 0.23.0+cu128 xfuser 0.4.4 ljperf 0.1.0+477686c5	Ubuntu 24.04 Python 3.12 Torch 2.9.0+cu130 CUDA 13.0.1 diffusers 0.35.2 flash_attn 2.8.3 imageio 2.37.0 imageio-ffmpeg 0.6.0 ray 2.50.1 transformers 4.57.1 triton 3.5.0 tokenizers 0.22.1 torchvision 0.24.0+cu130 vllm 0.11.0 xfuser 0.4.4 xgrammar 0.1.25 ljperf 0.1.0+d0e4a408	Ubuntu 24.04 Python 3.12 Torch 2.9.0+cu130 CUDA 13.0.1 diffusers 0.35.2 flash_attn 2.8.3 transformers 4.57.1 ray 2.50.1 vllm 0.11.0 triton 3.5.0 tokenizers 0.22.1 torchaudio 2.9.0 torchvision 0.24.0 xfuser 0.3 xgrammar 0.1.25 ljperf 0.1.0+477686c5	Ubuntu 24.04 Python 3.12 Torch 2.9.0+cu130 CUDA 13.0.1 diffusers 0.35.2 decord 0.6.0 decord2 2.0.0 flash_attn 2.8.3 flash_mla 1.0.0+1858932 flashinfer-python 0.4.1 imageio 2.37.0 imageio-ffmpeg 0.6.0 transformers 4.57.1 sgl-kernel 0.3.16.post3 sglang 0.5.4 xgrammar 0.1.25 triton 3.5.0 torchao 0.9.0 torchaudio 2.9.0+cu130 torchvision 0.24.0+cu130 xfuser 0.4.4 ljperf 0.1.0+477686c5	Ubuntu 24.04 Python 3.12 Torch 2.9.0+cu130 CUDA 13.0.1 diffusers 0.35.2 decord2 2.0.0 flashinfer-python 0.4.1 imageio 2.37.0 imageio-ffmpeg 0.6.0 transformers 4.57.1 sgl-kernel 0.3.16.post3 sglang 0.5.4 xgrammar 0.1.25 triton 3.5.0 torchao 0.9.0 torchaudio 2.9.0 torchvision 0.24.0 xfuser 0.4.4

Assets

Public images

CUDA 12.8

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-sglang0.5.4-pytorch2.8-cu128-20251027-serverless

CUDA 13.0

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.9-cu130-20251028-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-sglang0.5.4-pytorch2.9-cu130-20251028-serverless
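
The tags listed above appear to share one pattern: release, inference engine and its version, PyTorch version, CUDA version, build date, and a suffix. The following Python sketch splits a tag into those fields. The pattern is inferred only from the four tags in this topic and is not a documented naming contract; parse_image_tag and its field names are hypothetical.

```python
import re

# Pattern inferred from the four tags in this topic (an assumption,
# not an official naming specification).
_TAG_PATTERN = re.compile(
    r"^(?P<release>\d+\.\d+)"
    r"-(?P<engine>vllm|sglang)(?P<engine_version>[\d.]+)"
    r"-pytorch(?P<pytorch>[\d.]+)"
    r"-cu(?P<cuda_digits>\d+)"
    r"-(?P<build_date>\d{8})"
    r"-(?P<suffix>\w+)$"
)


def parse_image_tag(tag: str) -> dict:
    """Split an inference-nv-pytorch tag into its components."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    fields = match.groupdict()
    # Assumes the last digit is the minor version: "128" -> "12.8".
    digits = fields.pop("cuda_digits")
    fields["cuda"] = f"{digits[:-1]}.{digits[-1]}"
    return fields


info = parse_image_tag("25.10-sglang0.5.4-pytorch2.9-cu130-20251028-serverless")
print(info["engine"], info["engine_version"], info["cuda"])
```

A script can use the parsed cuda field to look up the matching driver requirement before pulling an image.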

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} indicates the region where your Alibaba Cloud Container Compute Service (ACS) is activated, such as cn-beijing or cn-wulanchabu.
{image:tag} indicates the name and tag of the image.

Important

Currently, only images in the China (Beijing) region can be pulled over a VPC.
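
As an illustration of how the VPC address template is filled in, the following Python sketch substitutes the two placeholders. The helper name vpc_image_ref is hypothetical; the example uses cn-beijing because the notice above states that only that region currently supports VPC pulls.

```python
def vpc_image_ref(region_id: str, image_and_tag: str) -> str:
    """Fill in the VPC address template from this topic."""
    if ":" not in image_and_tag:
        # The template expects both an image name and a tag.
        raise ValueError("expected '<image>:<tag>'")
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"


print(vpc_image_ref(
    "cn-beijing",
    "inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless",
))
```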

Driver requirements

CUDA 12.8: Requires NVIDIA driver version 570 or later.
CUDA 13.0: Requires NVIDIA driver version 580 or later.
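
The driver and architecture requirements in this topic can be combined into one lookup. The following Python sketch returns the CUDA variants that a host can run, given its NVIDIA driver major version and CPU architecture. The function name compatible_cuda_variants is hypothetical, and the sketch encodes only the constraints stated in these release notes.

```python
def compatible_cuda_variants(driver_major: int, arch: str) -> list:
    """Return the CUDA variants of the 25.10 images usable on a host."""
    # Requirements as stated in these release notes.
    requirements = {
        "12.8": {"min_driver": 570, "archs": {"amd64"}},
        "13.0": {"min_driver": 580, "archs": {"amd64", "aarch64"}},
    }
    return [
        cuda
        for cuda, req in requirements.items()
        if driver_major >= req["min_driver"] and arch in req["archs"]
    ]


print(compatible_cuda_variants(570, "amd64"))
print(compatible_cuda_variants(580, "aarch64"))
```

An empty list means that neither variant meets the stated requirements on that host, for example an aarch64 host with a driver earlier than 580.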

Quick start

The following example pulls the inference-nv-pytorch image using Docker and tests the inference service with the Qwen2.5-7B-Instruct model.

Note

To use this image in ACS, select it from the Artifact Center in the console when creating a workload, or specify the image reference in a YAML manifest. For more information, see the ACS topics about building model inference services with ACS GPU computing power.

Pull the image.

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Download the open-source model from ModelScope.

pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct

Start the container.

docker run -d -t --network=host --privileged --init --ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864  \
-v /mnt/:/mnt/ \
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]

Test the vLLM chat completion feature.

Start the server. Run this command inside the container that you started in the previous step.

python3 -m vllm.entrypoints.openai.api_server \
--model /mnt/Qwen2.5-7B-Instruct \
--trust-remote-code --disable-custom-all-reduce \
--tensor-parallel-size 1
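
Loading the model can take a while, so a request sent right after the server starts may be refused. The following Python sketch polls the server until it answers. It assumes that the server listens on port 8000 (the port used by the test request in this topic) and exposes the /health endpoint of the vLLM OpenAI-compatible server; wait_until_ready and its parameters are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def _http_probe(url: str) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not up yet.
        return False


def wait_until_ready(base_url="http://localhost:8000", timeout_s=600.0,
                     interval_s=5.0, probe=_http_probe) -> bool:
    """Poll base_url + '/health' until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url + "/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage (requires a running server):
# if not wait_until_ready():
#     raise SystemExit("vLLM server did not become ready in time")
```

The probe parameter exists only so that the polling loop can be exercised without a live server.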

Send a test request from the client.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "/mnt/Qwen2.5-7B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Tell me about deep learning."}
      ]
    }'
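
The same request can be sent from Python using only the standard library. The following sketch builds the identical JSON payload and reads the reply from choices[0].message.content, the field used by the OpenAI-compatible chat completions response format. The function names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same address as the curl example


def build_chat_request(model, user_text,
                       system_text="You are a friendly AI assistant.",
                       base_url=BASE_URL):
    """Build a POST request equivalent to the curl command."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, user_text, timeout_s=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, user_text)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server started in the previous step):
# print(chat("/mnt/Qwen2.5-7B-Instruct", "Tell me about deep learning."))
```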

For more information about vLLM, see the vLLM documentation.

Known issues

The deepgpu-comfyui plugin, which accelerates Wan model video generation, currently supports only the GN8IS and G49E instance types.
