
Container Compute Service: inference-nv-pytorch 25.12

Last Updated: Mar 26, 2026

This release adds CUDA 13.0 support with aarch64 architecture coverage, and upgrades vLLM to v0.12.0 and SGLang to v0.5.6.post2 across both CUDA variants.

What's new

Dual CUDA version support

Starting with this release, images are published for two CUDA versions:

  • CUDA 12.8 — supports amd64 only

  • CUDA 13.0 — supports both amd64 and aarch64
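
When scripting image selection, the architecture rules above can be encoded directly. The sketch below is illustrative only (the `TAG_SUFFIX` variable and the mapping logic are not part of the product); the tag suffixes come from the Assets section of this page.

```shell
# Sketch: map the local CPU architecture to a compatible tag suffix.
ARCH=$(uname -m)          # x86_64 (amd64) or aarch64
case "$ARCH" in
  x86_64)
    # amd64 hosts can use either the cu128 or the cu130 images.
    TAG_SUFFIX="cu128-20251215-serverless"
    ;;
  aarch64)
    # aarch64 hosts must use the CUDA 13.0 images.
    TAG_SUFFIX="cu130-20251215-serverless"
    ;;
  *)
    echo "unsupported architecture: $ARCH" >&2
    exit 1
    ;;
esac
echo "pick a tag ending in: $TAG_SUFFIX"
```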

Core component upgrades

Component        Version                                      Images
vLLM             v0.12.0                                      CUDA 12.8 and 13.0
SGLang           v0.5.6.post2                                 CUDA 12.8 and 13.0
PyTorch          2.9.0 (vLLM images) / 2.9.1 (SGLang images)  CUDA 12.8 and 13.0
deepgpu-comfyui  1.3.2                                        CUDA 12.8 only
deepgpu-torch    0.1.12+torch2.9.0cu128                       CUDA 12.8 only

Bug fixes

No bug fixes in this release.

Image contents

All images use PyTorch as the framework and are designed for large model inference.

Image name: inference-nv-pytorch

All variants use the pytorch framework and target large model inference.

  • 25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless: amd64; requires NVIDIA Driver release ≥ 570

  • 25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless: amd64; requires NVIDIA Driver release ≥ 570

  • 25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless: amd64 and aarch64; requires NVIDIA Driver release ≥ 580

  • 25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless: amd64 and aarch64; requires NVIDIA Driver release ≥ 580
System components

vLLM image, CUDA 12.8 (amd64)

Base environment

  • Ubuntu 24.04

  • Python 3.12

  • CUDA 12.8

Inference frameworks

  • Torch 2.9.0+cu128

  • vllm 0.12.0

  • flash_attn 2.8.3

  • flashinfer-python 0.5.3

  • triton 3.5.0

  • xgrammar 0.1.27

  • xfuser 0.4.5

  • ray 2.52.1

  • transformers 4.57.3

  • diffusers 0.36.0

  • torchaudio 2.9.0+cu128

  • torchvision 0.24.0+cu128

  • imageio 2.37.2

  • imageio-ffmpeg 0.6.0

Alibaba Cloud components

  • deepgpu-comfyui 1.3.2

  • deepgpu-torch 0.1.12+torch2.9.0cu128

  • ljperf 0.1.0+477686c5

SGLang image, CUDA 12.8 (amd64)

Base environment

  • Ubuntu 24.04

  • Python 3.12

  • CUDA 12.8

Inference frameworks

  • Torch 2.9.1+cu128

  • sglang 0.5.6.post2

  • sgl-kernel 0.3.19

  • flash_attn 2.8.3

  • flash_mla 1.0.0+1408756

  • flashinfer-python 0.5.3

  • triton 3.5.1

  • xgrammar 0.1.27

  • xfuser 0.4.5

  • torchao 0.9.0

  • ray 2.52.1

  • transformers 4.57.1

  • diffusers 0.36.0

  • decord 0.6.0

  • decord2 2.0.0

  • torchaudio 2.9.1

  • torchvision 0.24.1

  • imageio 2.37.2

  • imageio-ffmpeg 0.6.0

Alibaba Cloud components

  • deepgpu-comfyui 1.3.2

  • deepgpu-torch 0.1.12+torch2.9.0cu128

  • ljperf 0.1.0+477686c5

vLLM image, CUDA 13.0 (amd64)

Base environment

  • Ubuntu 24.04

  • Python 3.12

  • CUDA 13.0.2

Inference frameworks

  • Torch 2.9.0+cu130

  • vllm 0.12.0

  • flash_attn 2.8.3

  • flashinfer-python 0.5.3

  • triton 3.5.0

  • xgrammar 0.1.27

  • xfuser 0.4.5

  • ray 2.52.1

  • transformers 4.57.3

  • diffusers 0.36.0

  • torchaudio 2.9.0+cu130

  • torchvision 0.24.0+cu130

  • imageio 2.37.2

  • imageio-ffmpeg 0.6.0

  • ljperf 0.1.0+d0e4a408

vLLM image, CUDA 13.0 (aarch64)

Base environment

  • Ubuntu 24.04

  • Python 3.12

  • CUDA 13.0.2

Inference frameworks

  • Torch 2.9.0+cu130

  • vllm 0.12.0

  • flash_attn 2.8.3

  • flashinfer-python 0.5.3

  • triton 3.5.0

  • xgrammar 0.1.27

  • xfuser 0.4.5

  • ray 2.53.0

  • transformers 4.57.1

  • diffusers 0.36.0

  • torchaudio 2.9.0

  • torchvision 0.24.0

SGLang image, CUDA 13.0 (amd64)

Base environment

  • Ubuntu 24.04

  • Python 3.12

  • CUDA 13.0.2

Inference frameworks

  • Torch 2.9.1+cu130

  • sglang 0.5.6.post2

  • sgl-kernel 0.3.19

  • flash_attn 2.8.3

  • flashinfer-python 0.5.3

  • triton 3.5.1

  • xgrammar 0.1.27

  • xfuser 0.4.5

  • torchao 0.9.0

  • ray 2.52.1

  • transformers 4.57.3

  • diffusers 0.36.0

  • decord 0.6.0

  • decord2 2.0.0

  • torchaudio 2.9.1

  • torchvision 0.24.1+cu130

  • imageio 2.37.2

  • imageio-ffmpeg 0.6.0

  • ljperf 0.1.0+d0e4a408

SGLang image, CUDA 13.0 (aarch64)

Base environment

  • Ubuntu 24.04

  • Python 3.12

  • CUDA 13.0.2

Inference frameworks

  • Torch 2.9.1+cu130

  • sglang 0.5.6.post2

  • sgl-kernel 0.3.19

  • flash_attn 2.8.3

  • flashinfer-python 0.5.3

  • triton 3.5.1

  • xgrammar 0.1.27

  • xfuser 0.4.5

  • torchao 0.9.0

  • transformers 4.57.1

  • diffusers 0.36.0

  • decord2 2.0.0

  • torchaudio 2.9.1

  • torchvision 0.24.1

  • imageio 2.37.2

  • imageio-ffmpeg 0.6.0

Assets

Public images

CUDA 12.8

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless

CUDA 13.0

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless

VPC images

To speed up image pulls from within your virtual private cloud (VPC), replace the standard registry hostname with a region-specific VPC endpoint.

Change the image path from:

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}

To:

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

  • {region-id}: the ID of the region where your ACS service is deployed, for example cn-beijing or cn-wulanchabu.

  • {image:tag}: the name and tag of the target AI container image, for example inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless.
VPC images are compatible with standard ACS products and multi-tenant Lingjun environments. Do not use them in single-tenant Lingjun environments.
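
The hostname substitution described above is a plain string rewrite, so it can be scripted. The snippet below is a sketch; the `REGION_ID` value is an example, and you should substitute the region your ACS service actually runs in.

```shell
# Sketch: derive the VPC pull address from a public image reference.
REGION_ID="cn-wulanchabu"   # example value; use your own region ID
PUBLIC_IMAGE="egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless"

# Strip the public hostname (everything before the first '/')
# and prepend the region-specific VPC endpoint.
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/${PUBLIC_IMAGE#*/}"
echo "$VPC_IMAGE"
```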

Driver requirements

CUDA version   Minimum NVIDIA Driver version
CUDA 12.8      570
CUDA 13.0      580
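
A pre-flight check against this table can be scripted. The functions below are a sketch (the function names are illustrative, not part of the product); on a GPU host the installed driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```shell
# Sketch: check the installed NVIDIA driver against the minimums above.
min_driver_for() {            # usage: min_driver_for <cuda-version>
  case "$1" in
    12.8) echo 570 ;;
    13.0) echo 580 ;;
    *) echo "unknown CUDA version: $1" >&2; return 1 ;;
  esac
}

driver_ok() {                 # usage: driver_ok <driver-major> <cuda-version>
  local min
  min=$(min_driver_for "$2") || return 1
  [ "$1" -ge "$min" ]
}

driver_ok 575 12.8 && echo "driver 575 is sufficient for CUDA 12.8"
driver_ok 575 13.0 || echo "driver 575 is too old for CUDA 13.0"
```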

Quick start

The following example pulls the inference-nv-pytorch image and runs a conversational inference test using the Qwen2.5-7B-Instruct model.

Note

To use this image in ACS, select it from the Artifact Center in the console when creating a workload, or specify the image reference in a YAML manifest. For step-by-step deployment guides, see the ACS documentation.

  1. Pull the image, replacing [tag] with one of the tags listed under Assets.

    docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  2. Download the Qwen2.5-7B-Instruct model from ModelScope.

    pip install modelscope
    cd /mnt
    modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
  3. Start the container.

    docker run -d -t --network=host --privileged --init --ipc=host \
    --ulimit memlock=-1 --ulimit stack=67108864  \
    -v /mnt/:/mnt/ \
    egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
  4. Start the vLLM inference server inside the container.

    python3 -m vllm.entrypoints.openai.api_server \
    --model /mnt/Qwen2.5-7B-Instruct \
    --trust-remote-code --disable-custom-all-reduce \
    --tensor-parallel-size 1
  5. Send a test request to the server.

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
        "model": "/mnt/Qwen2.5-7B-Instruct",
        "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Tell me about deep learning."}
        ]}'

    For more information about vLLM usage, see the vLLM documentation.

Known issues

  • Issue: The deepgpu-comfyui plugin for Wanx model video generation acceleration supports only the GN8IS, G49E, and G59 instance types.
    Affected scope: CUDA 12.8 images.
    Workaround: Use a GN8IS, G49E, or G59 instance when running Wanx model video generation workloads with deepgpu-comfyui.