This release adds CUDA 13.0 support with aarch64 architecture coverage, and upgrades vLLM to v0.12.0 and SGLang to v0.5.6.post2 across both CUDA variants.
What's new
Dual CUDA version support
Starting with this release, images are published for two CUDA versions:
- CUDA 12.8 — supports amd64 only
- CUDA 13.0 — supports both amd64 and aarch64
Core component upgrades
| Component | Version | Images |
|---|---|---|
| vLLM | v0.12.0 | CUDA 12.8 and 13.0 |
| SGLang | v0.5.6.post2 | CUDA 12.8 and 13.0 |
| PyTorch | 2.9.0 (vLLM images) / 2.9.1 (SGLang images) | CUDA 12.8 and 13.0 |
| deepgpu-comfyui | 1.3.2 | CUDA 12.8 only |
| deepgpu-torch | 0.1.12+torch2.9.0cu128 | CUDA 12.8 only |
Bug fixes
No bug fixes in this release.
Image contents
All images use PyTorch as the framework and are designed for large model inference.
The image name for all tags is inference-nv-pytorch.

| Tag | Supported architectures | Use case | Framework | Requirements | System components |
|---|---|---|---|---|---|
| 25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless | amd64 | Large model inference | pytorch | NVIDIA Driver release ≥ 570 | Base environment, inference frameworks, Alibaba Cloud components |
| 25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless | amd64 | Large model inference | pytorch | NVIDIA Driver release ≥ 570 | Base environment, inference frameworks, Alibaba Cloud components |
| 25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless | amd64, aarch64 | Large model inference | pytorch | NVIDIA Driver release ≥ 580 | Base environment, inference frameworks |
| 25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless | amd64, aarch64 | Large model inference | pytorch | NVIDIA Driver release ≥ 580 | Base environment, inference frameworks |
Assets
Public images
CUDA 12.8
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless
CUDA 13.0
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless
VPC images
To speed up image pulls from within your virtual private cloud (VPC), replace the standard registry hostname with a region-specific VPC endpoint.
Change the image path from:
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}
To:
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | The ID of the region where your ACS service is deployed | cn-beijing, cn-wulanchabu |
| {image:tag} | The name and tag of the target AI container image | inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless |
VPC images are compatible with standard ACS products and multi-tenant Lingjun environments. Do not use them in single-tenant Lingjun environments.
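The hostname substitution is mechanical, so it can be scripted. A minimal sketch, using an image reference and region from this page as example values:

```shell
# Sketch: derive the VPC pull address from a public image reference.
# PUBLIC_IMAGE and REGION_ID below are example values from this page.
PUBLIC_IMAGE="egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless"
REGION_ID="cn-wulanchabu"
# Strip the registry hostname (everything up to the first "/") and prepend the VPC endpoint.
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/${PUBLIC_IMAGE#*/}"
echo "$VPC_IMAGE"
```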
Driver requirements
| CUDA version | Minimum NVIDIA Driver version |
|---|---|
| CUDA 12.8 | 570 |
| CUDA 13.0 | 580 |
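The driver floor determines which image variant a host can run, so it is worth checking before choosing a tag. A small sketch, assuming the version string comes from `nvidia-smi`; the `supported_variants` helper and the `cu128`/`cu130` labels are illustrative names, not part of the images:

```shell
# Sketch: map an NVIDIA driver version string to the CUDA image variants it can run.
# On a real host, obtain the version with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
supported_variants() {
  local major=${1%%.*}          # keep only the major version number
  if [ "$major" -ge 580 ]; then
    echo "cu128 cu130"          # driver >= 580: both CUDA 12.8 and 13.0 images
  elif [ "$major" -ge 570 ]; then
    echo "cu128"                # driver >= 570: CUDA 12.8 images only
  else
    echo "none"                 # below the minimum for this release
  fi
}

supported_variants "570.86.10"  # prints "cu128"
```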
Quick start
The following example pulls the inference-nv-pytorch image and runs a conversational inference test using the Qwen2.5-7B-Instruct model.
To use this image in ACS, select it from the Artifact Center in the console when creating a workload, or specify the image reference in a YAML manifest. For step-by-step deployment guides, see the ACS documentation.
1. Pull the image.

   ```shell
   docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
   ```

2. Download the Qwen2.5-7B-Instruct model from ModelScope.

   ```shell
   pip install modelscope
   cd /mnt
   modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
   ```

3. Start the container.

   ```shell
   docker run -d -t --network=host --privileged --init --ipc=host \
     --ulimit memlock=-1 --ulimit stack=67108864 \
     -v /mnt/:/mnt/ \
     egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
   ```

4. Start the vLLM inference server inside the container.

   ```shell
   python3 -m vllm.entrypoints.openai.api_server \
     --model /mnt/Qwen2.5-7B-Instruct \
     --trust-remote-code --disable-custom-all-reduce \
     --tensor-parallel-size 1
   ```

5. Send a test request to the server.

   ```shell
   curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "/mnt/Qwen2.5-7B-Instruct",
       "messages": [
         {"role": "system", "content": "You are a friendly AI assistant."},
         {"role": "user", "content": "Tell me about deep learning."}
       ]
     }'
   ```

For more information about vLLM usage, see the vLLM documentation.
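The server returns OpenAI-style chat-completion JSON, so the assistant's reply can be extracted with a one-liner. A sketch using a sample payload for illustration; a live request would pipe the `curl` output into the same command:

```shell
# Sample response shape (illustrative); a real response comes from the request above.
response='{"choices":[{"message":{"role":"assistant","content":"Deep learning is a branch of machine learning."}}]}'
# Extract the assistant message content with Python's stdlib JSON parser.
echo "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```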
Known issues
| Issue | Affected scope | Workaround |
|---|---|---|
| The deepgpu-comfyui plugin for Wanx model video generation acceleration supports only the GN8IS, G49E, and G59 instance types. | CUDA 12.8 images | Use a GN8IS, G49E, or G59 instance when running Wanx model video generation workloads with deepgpu-comfyui. |