Container Compute Service: inference-nv-pytorch 25.11

Last Updated: Mar 26, 2026

Version 25.11 introduces dual CUDA version support and upgrades key inference framework components across all images.

Key features and bug fixes

Key features

Dual CUDA version support

Two sets of images are now available, each targeting a different CUDA version:

  • CUDA 12.8 image: supports amd64 architecture. Requires NVIDIA driver 570 or later.

  • CUDA 13.0 image: supports both amd64 and aarch64 architectures. Requires NVIDIA driver 580 or later.
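The architecture and driver rules above can be sketched as a small shell helper. This is illustrative only: `pick_cuda_image` is not part of the product, and the `nvidia-smi` query in the comment is just one way to obtain the driver major version.

```shell
#!/bin/sh
# Hypothetical helper: pick the CUDA image family for this host from the
# NVIDIA driver major version and CPU architecture, per the rules above.
# pick_cuda_image DRIVER_MAJOR ARCH  ->  prints "cu128" or "cu130", or fails.
pick_cuda_image() {
    driver="$1"; arch="$2"
    if [ "$arch" = "aarch64" ]; then
        # Only the CUDA 13.0 image supports aarch64; it needs driver >= 580.
        [ "$driver" -ge 580 ] && { echo cu130; return 0; }
        return 1
    fi
    # amd64 (x86_64) can use either image; prefer the newest the driver supports.
    if [ "$driver" -ge 580 ]; then echo cu130
    elif [ "$driver" -ge 570 ]; then echo cu128
    else return 1
    fi
}

# On a real host the driver major version could come from, e.g.:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | cut -d. -f1
pick_cuda_image 570 x86_64   # -> cu128
pick_cuda_image 580 aarch64  # -> cu130
```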

Core component upgrades

PyTorch has been upgraded to 2.9.0 across all images. Additional upgrades by image:

Component        CUDA 12.8 image         CUDA 13.0 image
vLLM             0.11.1                  0.11.2
SGLang           0.5.5.post3             0.5.5.post3
deepgpu-comfyui  1.3.2                   Not included
deepgpu-torch    0.1.12+torch2.9.0cu128  Not included

Bug fixes

No bug fixes in this release.

Contents

The following table lists the image tags and their system components.

All four tags follow the pattern 25.11-{framework}{version}-pytorch2.9-{cuda}-{date}-serverless.
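For illustration, a tag can be assembled from the pattern's parts. The variable names below are local to the example:

```shell
# Assemble an image tag from the documented pattern:
#   25.11-{framework}{version}-pytorch2.9-{cuda}-{date}-serverless
release="25.11"
framework="vllm"
version="0.11.1"
cuda="cu128"
build_date="20251120"

tag="${release}-${framework}${version}-pytorch2.9-${cuda}-${build_date}-serverless"
echo "$tag"   # -> 25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless
```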

CUDA 12.8 images

Image tag: 25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless

Attribute Value
Supported architectures amd64
Use case Large model inference
Framework PyTorch
NVIDIA driver requirement ≥ 570

System components:

Component Version
Ubuntu 24.04
Python 3.12
Torch 2.9.0+cu128
CUDA 12.8
vLLM 0.11.1
diffusers 0.35.2
deepgpu-comfyui 1.3.2
deepgpu-torch 0.1.12+torch2.9.0cu128
flash_attn 2.8.3
flashinfer-python 0.5.2
imageio 2.37.2
imageio-ffmpeg 0.6.0
ray 2.51.1
transformers 4.57.1
triton 3.4.0
torchaudio 2.8.0+cu128
torchvision 0.24.0+cu128
xfuser 0.4.5
xgrammar 0.1.25
ljperf 0.1.0+477686c5

Image tag: 25.11-sglang0.5.5.post3-pytorch2.9-cu128-20251121-serverless

Attribute Value
Supported architectures amd64
Use case Large model inference
Framework PyTorch
NVIDIA driver requirement ≥ 570

System components:

Component Version
Ubuntu 24.04
Python 3.12
Torch 2.9.0+cu128
CUDA 12.8
SGLang 0.5.5.post3
sgl-kernel 0.3.17.post1
diffusers 0.35.2
decord 0.6.0
decord2 2.0.0
deepgpu-comfyui 1.3.2
deepgpu-torch 0.1.12+torch2.9.0cu128
flash_attn 2.8.3
flash_mla 1.0.0+1408756
flashinfer-python 0.5.2
imageio 2.37.2
imageio-ffmpeg 0.6.0
ray 2.51.1
transformers 4.57.1
triton 3.5.0
torchao 0.9.0
torchaudio 2.8.0+cu128
torchvision 0.24.0+cu128
xfuser 0.4.5
xgrammar 0.1.25
ljperf 0.1.0+477686c5

CUDA 13.0 images

Image tag: 25.11-vllm0.11.1-pytorch2.9-cu130-20251120-serverless

Attribute Value
Supported architectures amd64, aarch64
Use case Large model inference
Framework PyTorch
NVIDIA driver requirement ≥ 580

System components (amd64):

Component Version
Ubuntu 24.04
Python 3.12
Torch 2.9.0+cu130
CUDA 13.0.2
vLLM 0.11.2
diffusers 0.35.2
flash_attn 2.8.3
flashinfer-python 0.5.2
imageio 2.37.2
imageio-ffmpeg 0.6.0
ray 2.51.1
transformers 4.57.1
triton 3.5.0
torchaudio 2.9.0+cu130
torchvision 0.24.0+cu130
xfuser 0.4.5
xgrammar 0.1.25
ljperf 0.1.0+d0e4a408

System components (aarch64):

Component Version
Ubuntu 24.04
Python 3.12
Torch 2.9.0+cu130
CUDA 13.0.2
vLLM 0.11.1
diffusers 0.35.2
flash_attn 2.8.3
flashinfer-python 0.5.2
ray 2.51.1
transformers 4.57.1
triton 3.5.0
torchaudio 2.9.0
torchvision 0.24.0
xfuser 0.4.5
xgrammar 0.1.25

Image tag: 25.11-sglang0.5.5.post3-pytorch2.9-cu130-20251121-serverless

Attribute Value
Supported architectures amd64, aarch64
Use case Large model inference
Framework PyTorch
NVIDIA driver requirement ≥ 580

System components (amd64):

Component Version
Ubuntu 24.04
Python 3.12
Torch 2.9.0+cu130
CUDA 13.0.2
SGLang 0.5.5.post3
sgl-kernel 0.3.17.post1
diffusers 0.35.2
decord 0.6.0
decord2 2.0.0
flash_attn 2.8.3
flashinfer-python 0.5.2
imageio 2.37.2
imageio-ffmpeg 0.6.0
ray 2.51.1
transformers 4.57.1
triton 3.5.0
torchao 0.9.0
torchaudio 2.9.0
torchvision 0.24.0
xfuser 0.4.5
xgrammar 0.1.25
ljperf 0.1.0+d0e4a408

System components (aarch64):

Component Version
Ubuntu 24.04
Python 3.12
Torch 2.9.0+cu130
CUDA 13.0.2
SGLang 0.5.5.post3
sgl-kernel 0.3.17.post1
diffusers 0.35.2
decord2 2.0.0
flash_attn 2.8.3
flashinfer-python 0.5.2
imageio 2.37.2
imageio-ffmpeg 0.6.0
transformers 4.57.1
triton 3.5.0
torchao 0.9.0
torchaudio 2.9.0
torchvision 0.24.0
xfuser 0.4.5
xgrammar 0.1.25

Assets

Public images

CUDA 12.8

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-sglang0.5.5.post3-pytorch2.9-cu128-20251121-serverless

CUDA 13.0

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu130-20251120-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-sglang0.5.5.post3-pytorch2.9-cu130-20251121-serverless

VPC images

To speed up image pulls from within your virtual private cloud (VPC), use a region-specific VPC endpoint instead of the public registry.

Replace the public image URI format:

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}

With the VPC endpoint format:

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Where:

  • {region-id}: The ID of the region where your ACS service is deployed. Examples: cn-beijing, cn-wulanchabu.

  • {image:tag}: The name and tag of the target container image. Examples: inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless, training-nv-pytorch:25.10-serverless.
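The substitution can be scripted, for example with sed. The region ID and image URI below are just the examples listed above:

```shell
# Rewrite a public image URI into its VPC endpoint form.
# region_id is assumed to be cn-wulanchabu for this example.
public_uri="egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless"
region_id="cn-wulanchabu"

# Swap the public registry host for the region-specific VPC host;
# the /egslingjun/{image:tag} path is unchanged.
vpc_uri=$(echo "$public_uri" | \
    sed "s#^egslingjun-registry\.cn-wulanchabu\.cr\.aliyuncs\.com#acs-registry-vpc.${region_id}.cr.aliyuncs.com#")
echo "$vpc_uri"
```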

Note

These images are for standard ACS and the multi-tenant Lingjun environment only. Do not use them in a single-tenant Lingjun setup.

Driver requirements

CUDA version Minimum NVIDIA driver version
CUDA 12.8 570
CUDA 13.0 580

Quick start

The following example shows how to pull an inference-nv-pytorch image using Docker and run an inference service with the Qwen2.5-7B-Instruct model.

  1. Pull the image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download the model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Start the container.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Test vLLM inference.

    1. Start the vLLM API server.

      python3 -m vllm.entrypoints.openai.api_server \
        --model /mnt/Qwen2.5-7B-Instruct \
        --trust-remote-code --disable-custom-all-reduce \
        --tensor-parallel-size 1
    2. Send a test request.

      curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",
          "messages": [
            {"role": "system", "content": "You are a friendly AI assistant."},
            {"role": "user", "content": "Introduce deep learning."}
          ]
        }'

      For more information about how to use vLLM, see the vLLM documentation.
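The API server takes a while to load the model, so a request sent immediately after step 4.1 may be refused. The helper below is a sketch, not part of the image; it assumes the server exposes the OpenAI-compatible GET /v1/models endpoint on port 8000:

```shell
# Hypothetical readiness check: poll an HTTP endpoint until it answers,
# so the test request in step 4.2 is not sent before the model is loaded.
# wait_for_server URL MAX_TRIES  ->  returns 0 once the URL answers, 1 on timeout.
wait_for_server() {
    url="$1"; tries="$2"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
        if curl -sf --max-time 2 "$url" >/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example: poll for up to 60 seconds before sending the chat request.
# wait_for_server http://localhost:8000/v1/models 60 && echo "server is up"
```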

Known issues

  • The deepgpu-comfyui plugin for accelerating Wanx model video generation supports only the GN8IS, G49E, and G59 instance types.