
Container Compute Service: inference-nv-pytorch 25.11

Last Updated: Dec 25, 2025

This topic describes the release notes for inference-nv-pytorch 25.11.

Main features and bug fixes

Main features

  • Images are available for two CUDA versions: CUDA 12.8 and CUDA 13.0.

    • The CUDA 12.8 image supports only the amd64 architecture.

    • The CUDA 13.0 image supports the amd64 and aarch64 architectures. It can be used with L20A and L20C instance types.

  • PyTorch is upgraded to 2.9.0.

  • For the CUDA 12.8 image, deepgpu-comfyui is upgraded to 1.3.2 and the deepgpu-torch optimization component is upgraded to 0.1.12+torch2.9.0cu128.

  • For the CUDA 12.8 and CUDA 13.0 images, vLLM is upgraded to v0.11.2 and SGLang is upgraded to v0.5.5.post3.

Bug fixes

None

Contents

All four image variants share the image name inference-nv-pytorch, the pytorch framework, and the Large model inference application scenario.

25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless

  • Supported architecture: amd64
  • Requirements: NVIDIA Driver release >= 570
  • System components:
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu128
    • CUDA 12.8
    • diffusers 0.35.2
    • deepgpu-comfyui 1.3.2
    • deepgpu-torch 0.1.12+torch2.9.0cu128
    • flash_attn 2.8.3
    • flashinfer-python 0.5.2
    • imageio 2.37.2
    • imageio-ffmpeg 0.6.0
    • ray 2.51.1
    • transformers 4.57.1
    • triton 3.4.0
    • torchaudio 2.8.0+cu128
    • torchvision 0.24.0+cu128
    • vllm 0.11.1
    • xfuser 0.4.5
    • xgrammar 0.1.25
    • ljperf 0.1.0+477686c5

25.11-sglang0.5.5.post3-pytorch2.9-cu128-20251121-serverless

  • Supported architecture: amd64
  • Requirements: NVIDIA Driver release >= 570
  • System components:
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu128
    • CUDA 12.8
    • diffusers 0.35.2
    • decord 0.6.0
    • decord2 2.0.0
    • deepgpu-comfyui 1.3.2
    • deepgpu-torch 0.1.12+torch2.9.0cu128
    • flash_attn 2.8.3
    • flash_mla 1.0.0+1408756
    • flashinfer-python 0.5.2
    • imageio 2.37.2
    • imageio-ffmpeg 0.6.0
    • ray 2.51.1
    • transformers 4.57.1
    • sgl-kernel 0.3.17.post1
    • sglang 0.5.5.post3
    • xgrammar 0.1.25
    • triton 3.5.0
    • torchao 0.9.0
    • torchaudio 2.8.0+cu128
    • torchvision 0.24.0+cu128
    • xfuser 0.4.5
    • ljperf 0.1.0+477686c5

25.11-vllm0.11.1-pytorch2.9-cu130-20251120-serverless

  • Supported architectures: amd64 and aarch64
  • Requirements: NVIDIA Driver release >= 580
  • System components (amd64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.2
    • diffusers 0.35.2
    • flash_attn 2.8.3
    • flashinfer-python 0.5.2
    • imageio 2.37.2
    • imageio-ffmpeg 0.6.0
    • ray 2.51.1
    • transformers 4.57.1
    • triton 3.5.0
    • torchaudio 2.9.0+cu130
    • torchvision 0.24.0+cu130
    • vllm 0.11.2
    • xfuser 0.4.5
    • xgrammar 0.1.25
    • ljperf 0.1.0+d0e4a408
  • System components (aarch64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.2
    • diffusers 0.35.2
    • flash_attn 2.8.3
    • flashinfer-python 0.5.2
    • transformers 4.57.1
    • ray 2.51.1
    • vllm 0.11.1
    • triton 3.5.0
    • torchaudio 2.9.0
    • torchvision 0.24.0
    • xfuser 0.4.5
    • xgrammar 0.1.25

25.11-sglang0.5.5.post3-pytorch2.9-cu130-20251121-serverless

  • Supported architectures: amd64 and aarch64
  • Requirements: NVIDIA Driver release >= 580
  • System components (amd64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.2
    • diffusers 0.35.2
    • decord 0.6.0
    • decord2 2.0.0
    • flash_attn 2.8.3
    • flashinfer-python 0.5.2
    • imageio 2.37.2
    • imageio-ffmpeg 0.6.0
    • ray 2.51.1
    • transformers 4.57.1
    • sgl-kernel 0.3.17.post1
    • sglang 0.5.5.post3
    • xgrammar 0.1.25
    • triton 3.5.0
    • torchao 0.9.0
    • torchaudio 2.9.0
    • torchvision 0.24.0
    • xfuser 0.4.5
    • ljperf 0.1.0+d0e4a408
  • System components (aarch64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.2
    • diffusers 0.35.2
    • decord2 2.0.0
    • flash_attn 2.8.3
    • flashinfer-python 0.5.2
    • imageio 2.37.2
    • imageio-ffmpeg 0.6.0
    • transformers 4.57.1
    • sgl-kernel 0.3.17.post1
    • sglang 0.5.5.post3
    • xgrammar 0.1.25
    • triton 3.5.0
    • torchao 0.9.0
    • torchaudio 2.9.0
    • torchvision 0.24.0
    • xfuser 0.4.5

Assets

Public network images

CUDA 12.8 assets

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-sglang0.5.5.post3-pytorch2.9-cu128-20251121-serverless

CUDA 13.0 assets

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu130-20251120-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-sglang0.5.5.post3-pytorch2.9-cu130-20251121-serverless

VPC images

To quickly pull an ACS AI container image within a VPC, replace the specified AI container image asset URI egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.

  • {region-id}: The region ID of an available region for ACS products. Examples: cn-beijing and cn-wulanchabu.

  • {image:tag}: The name and tag of the AI container image. Examples: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
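As a sketch, the substitution above can be scripted with shell parameter expansion; the image tag and region ID below are example values taken from this page, so substitute your own:

```shell
# Derive the VPC pull URI from the public one by swapping the registry host.
# PUBLIC_URI and REGION_ID are example values; replace them with your own.
PUBLIC_URI="egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless"
REGION_ID="cn-wulanchabu"
# Strip everything up to and including "/egslingjun/" to recover {image:tag}.
IMAGE_TAG="${PUBLIC_URI#*/egslingjun/}"
VPC_URI="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "$VPC_URI"
# docker pull "$VPC_URI"   # run this from a host inside the VPC
```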

Note

These images are applicable to ACS products and Lingjun multi-tenant products. Do not use them in Lingjun single-tenant scenarios.

Driver requirements

  • CUDA 12.8: NVIDIA driver version >= 570

  • CUDA 13.0: NVIDIA driver version >= 580
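A quick way to verify this on a host is to compare the driver's major version against the requirement. A minimal sketch follows; the driver string is a hard-coded stand-in for real nvidia-smi output:

```shell
# On a real host, obtain the version with:
#   DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
DRIVER="570.124.06"    # example value for illustration
MAJOR="${DRIVER%%.*}"  # keep only the major version number
if [ "$MAJOR" -ge 580 ]; then
  echo "driver supports CUDA 12.8 and CUDA 13.0 images"
elif [ "$MAJOR" -ge 570 ]; then
  echo "driver supports CUDA 12.8 images only"
else
  echo "driver too old for this release"
fi
```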

Quick start

The following example shows how to pull the inference-nv-pytorch image using Docker and test the inference service with the Qwen2.5-7B-Instruct model.

Note

To use the inference-nv-pytorch image in ACS, select it on the Artifacts page when you create a workload in the console, or specify the image reference in a YAML file. For more information, see the topics on building a model inference service with ACS GPU computing power.

  1. Pull the inference container image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download the open source model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Run the following command to start the container in the background. You can then open a shell inside it with docker exec -it <container-id> bash.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864  \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Test the vLLM conversational inference feature.

    1. Start the server-side service.

      python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    2. Run a test on the client.

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",  
          "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Tell me about deep learning."}
          ]}'

      For more information about how to use vLLM, see the vLLM documentation.
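The endpoint returns a response in the OpenAI chat-completions shape, so the assistant reply sits at choices[0].message.content. As a sketch, it can be extracted like this; the hard-coded JSON stands in for a live server response:

```shell
# Sample response body in the OpenAI chat-completions shape (stand-in for a live reply).
RESPONSE='{"choices":[{"message":{"role":"assistant","content":"Deep learning trains multi-layer neural networks."}}]}'
# Extract choices[0].message.content; python3 is available inside the image.
REPLY=$(printf '%s' "$RESPONSE" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])')
echo "$REPLY"
```

Against a running server, pipe the curl output from the previous step into the same python3 one-liner instead of the sample string.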

Known issues

  • The deepgpu-comfyui plugin, which accelerates video generation for the Wanx model, supports only the GN8IS, G49E, and G59 instance types.

  • SGLang 0.5.5.post3 fails to run the DeepSeek-R1 model on L20A and L20C instances and returns the error `TypeError: Mismatched type on argument #17 when calling: trtllm_fp8_block_scale_moe`. For more information, see the related issue in the open source community.