
Container Compute Service: inference-nv-pytorch 25.10

Last Updated: Jan 06, 2026

This topic provides the release notes for inference-nv-pytorch version 25.10.

Key features and bug fixes

Key features

  • Dual CUDA version support
    Images are now provided for two CUDA versions:

    • The CUDA 12.8 image supports the amd64 architecture.

    • The CUDA 13.0 image supports both amd64 and aarch64 architectures.

  • Core component upgrades

    • For the CUDA 12.8 image:

      • deepgpu-comfyui has been upgraded to 1.3.0.

      • The deepgpu-torch optimization component has been upgraded to 0.1.6+torch2.8.0cu128.

    • For the CUDA 13.0 image:

      • PyTorch has been upgraded to version 2.9.0.

    • For both the CUDA 12.8 and CUDA 13.0 images:

      • vLLM has been upgraded to version 0.11.0.

      • SGLang has been upgraded to version 0.5.4.

Bug fixes

No bug fixes in this release.

Contents

inference-nv-pytorch

Tag: 25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless

  • Supported architectures: amd64
  • Use case: Large model inference
  • Framework: PyTorch
  • Requirements: NVIDIA driver release ≥ 570
  • System components:
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.8.0+cu128
    • CUDA 12.8
    • diffusers 0.35.2
    • deepgpu-comfyui 1.3.0
    • deepgpu-torch 0.1.6+torch2.8.0cu128
    • flash_attn 2.8.3
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • ray 2.50.1
    • transformers 4.57.1
    • triton 3.4.0
    • tokenizers 0.22.1
    • torchaudio 2.8.0+cu128
    • torchsde 0.2.6
    • torchvision 0.23.0+cu128
    • vllm 0.11.0
    • xfuser 0.4.4
    • xgrammar 0.1.25
    • ljperf 0.1.0+477686c5

Tag: 25.10-sglang0.5.4-pytorch2.8-cu128-20251027-serverless

  • Supported architectures: amd64
  • Use case: Large model inference
  • Framework: PyTorch
  • Requirements: NVIDIA driver release ≥ 570
  • System components:
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.8.0+cu128
    • CUDA 12.8
    • diffusers 0.35.2
    • decord 0.6.0
    • decord2 2.0.0
    • deepgpu-comfyui 1.3.0
    • deepgpu-torch 0.1.6+torch2.8.0cu128
    • flash_attn 2.8.3
    • flash_mla 1.0.0+1858932
    • flashinfer-python 0.4.1
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • transformers 4.57.1
    • sgl-kernel 0.3.16.post3
    • sglang 0.5.4
    • xgrammar 0.1.25
    • triton 3.4.0
    • torchao 0.9.0
    • torchaudio 2.8.0+cu128
    • torchsde 0.2.6
    • torchvision 0.23.0+cu128
    • xfuser 0.4.4
    • ljperf 0.1.0+477686c5

Tag: 25.10-vllm0.11.0-pytorch2.9-cu130-20251028-serverless

  • Supported architectures: amd64, aarch64
  • Use case: Large model inference
  • Framework: PyTorch
  • Requirements: NVIDIA driver release ≥ 580
  • System components (amd64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.1
    • diffusers 0.35.2
    • flash_attn 2.8.3
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • ray 2.50.1
    • transformers 4.57.1
    • triton 3.5.0
    • tokenizers 0.22.1
    • torchvision 0.24.0+cu130
    • vllm 0.11.0
    • xfuser 0.4.4
    • xgrammar 0.1.25
    • ljperf 0.1.0+d0e4a408
  • System components (aarch64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.1
    • diffusers 0.35.2
    • flash_attn 2.8.3
    • transformers 4.57.1
    • ray 2.50.1
    • vllm 0.11.0
    • triton 3.5.0
    • tokenizers 0.22.1
    • torchaudio 2.9.0
    • torchvision 0.24.0
    • xfuser 0.3
    • xgrammar 0.1.25
    • ljperf 0.1.0+477686c5

Tag: 25.10-sglang0.5.4-pytorch2.9-cu130-20251028-serverless

  • Supported architectures: amd64, aarch64
  • Use case: Large model inference
  • Framework: PyTorch
  • Requirements: NVIDIA driver release ≥ 580
  • System components (amd64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.1
    • diffusers 0.35.2
    • decord 0.6.0
    • decord2 2.0.0
    • flash_attn 2.8.3
    • flash_mla 1.0.0+1858932
    • flashinfer-python 0.4.1
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • transformers 4.57.1
    • sgl-kernel 0.3.16.post3
    • sglang 0.5.4
    • xgrammar 0.1.25
    • triton 3.5.0
    • torchao 0.9.0
    • torchaudio 2.9.0+cu130
    • torchvision 0.24.0+cu130
    • xfuser 0.4.4
    • ljperf 0.1.0+477686c5
  • System components (aarch64):
    • Ubuntu 24.04
    • Python 3.12
    • Torch 2.9.0+cu130
    • CUDA 13.0.1
    • diffusers 0.35.2
    • decord2 2.0.0
    • flashinfer-python 0.4.1
    • imageio 2.37.0
    • imageio-ffmpeg 0.6.0
    • transformers 4.57.1
    • sgl-kernel 0.3.16.post3
    • sglang 0.5.4
    • xgrammar 0.1.25
    • triton 3.5.0
    • torchao 0.9.0
    • torchaudio 2.9.0
    • torchvision 0.24.0
    • xfuser 0.4.4
Assets

Public images

CUDA 12.8

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-sglang0.5.4-pytorch2.8-cu128-20251027-serverless

CUDA 13.0

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.9-cu130-20251028-serverless

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.10-sglang0.5.4-pytorch2.9-cu130-20251028-serverless

VPC image

  • acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

    {region-id} indicates the region where your ACS is activated, such as cn-beijing and cn-wulanchabu.
    {image:tag} indicates the name and tag of the image.
Important

Currently, only images in the China (Beijing) region can be pulled over a VPC.
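As an illustration, the placeholders in the VPC address template can be filled in with shell variables. The region and tag below are examples taken from this page, not defaults:

```shell
# Build a VPC pull address from the template
# acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
region_id="cn-beijing"   # region where your ACS is activated
image_tag="inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless"
vpc_image="acs-registry-vpc.${region_id}.cr.aliyuncs.com/egslingjun/${image_tag}"
echo "$vpc_image"
# Pull it over the VPC (China (Beijing) region only at present):
#   docker pull "$vpc_image"
```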

Driver requirements

  • CUDA 12.8 images: Require NVIDIA driver version 570 or later.

  • CUDA 13.0 images: Require NVIDIA driver version 580 or later.
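Before pulling an image, you can check that the host driver meets the minimum. A minimal sketch, assuming the major version number alone decides compatibility; on a real host, `ver` would come from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, but a sample value is hard-coded here so the snippet is self-contained:

```shell
# Compare a driver version string against the CUDA 13.0 minimum (580).
# On a real host, obtain it with:
#   ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
ver="580.65.06"   # sample value for illustration
min=580           # minimum driver release for the CUDA 13.0 images
major=${ver%%.*}  # keep only the part before the first dot
if [ "$major" -ge "$min" ]; then
  echo "driver $ver meets the minimum ($min)"
else
  echo "driver $ver is too old; need >= $min"
fi
```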

Quick start

The following example shows how to pull the inference-nv-pytorch image using Docker and test the inference service with the Qwen2.5-7B-Instruct model.

Note

To use this image in ACS, select it from the Artifact Center in the console when creating a workload, or specify the image reference in a YAML manifest. For more information, see the topics about building model inference services with ACS GPU computing power.

  1. Pull the image.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download the open-source model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Run the following commands to start the container in the background and open a shell inside it.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
    # Open a shell in the container started above (-l selects the most recent container)
    docker exec -it $(docker ps -lq) bash
  4. Run an inference test against the vLLM chat feature.

    1. Start the server-side service.

      python3 -m vllm.entrypoints.openai.api_server \
      --model /mnt/Qwen2.5-7B-Instruct \
      --trust-remote-code --disable-custom-all-reduce \
      --tensor-parallel-size 1
    2. Run a test on the client.

      curl http://localhost:8000/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",  
          "messages": [
          {"role": "system", "content": "You are a friendly AI assistant."},
          {"role": "user", "content": "Tell me about deep learning."}
          ]}'

      For more information about how to use vLLM, see vLLM.
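The request body from the client test above can also be kept in a shell variable and validated locally before sending. A minimal sketch, assuming python3 is on the PATH; the final curl command is shown as a comment because it needs the server started in the previous step:

```shell
# Build the chat completion payload and check that it is valid JSON
payload=$(cat <<'EOF'
{
  "model": "/mnt/Qwen2.5-7B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Tell me about deep learning."}
  ]
}
EOF
)
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"
# Send it to the running server:
#   curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "$payload"
```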

Known issues

  • The deepgpu-comfyui plugin, which accelerates Wan model video generation, currently supports only the GN8IS and G49E instance types.