Version 25.11 introduces dual CUDA version support and upgrades key inference framework components across all images.
Key features and bug fixes
Key features
Dual CUDA version support
Two sets of images are now available, each targeting a different CUDA version:
- CUDA 12.8 image: supports the amd64 architecture. Requires NVIDIA driver 570 or later.
- CUDA 13.0 image: supports both amd64 and aarch64 architectures. Requires NVIDIA driver 580 or later.
Core component upgrades
PyTorch has been upgraded to 2.9.0 across all images. Additional upgrades by image:
| Component | CUDA 12.8 image | CUDA 13.0 image |
|---|---|---|
| vLLM | 0.11.1 | 0.11.2 |
| SGLang | 0.5.5.post3 | 0.5.5.post3 |
| deepgpu-comfyui | 1.3.2 | — |
| deepgpu-torch | 0.1.12+torch2.9.0cu128 | — |
Bug fixes
No bug fixes in this release.
Contents
The following table lists the image tags and their system components.
All four tags follow the pattern 25.11-{framework}{version}-pytorch2.9-{cuda}-{date}-serverless.
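As a sketch, the tag pattern above can be decomposed with a regular expression. The field names (release, framework, version, cuda, date) are illustrative labels for this document, not an official schema:

```python
import re

# Illustrative parser for the 25.11 image tag pattern:
# 25.11-{framework}{version}-pytorch2.9-{cuda}-{date}-serverless
TAG_RE = re.compile(
    r"^(?P<release>25\.11)-"
    r"(?P<framework>vllm|sglang)(?P<version>[\w.]+)-"
    r"pytorch2\.9-"
    r"(?P<cuda>cu\d+)-"
    r"(?P<date>\d{8})-serverless$"
)

def parse_tag(tag: str) -> dict:
    """Split an image tag into its named fields, or raise on a mismatch."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"unrecognized tag: {tag}")
    return m.groupdict()

fields = parse_tag("25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless")
```

For example, `fields["cuda"]` distinguishes the cu128 and cu130 image families when selecting a tag programmatically.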
CUDA 12.8 images
Image tag: 25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless
| Attribute | Value |
|---|---|
| Supported architectures | amd64 |
| Use case | Large model inference |
| Framework | PyTorch |
| NVIDIA driver requirement | ≥ 570 |
System components:
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.9.0+cu128 |
| CUDA | 12.8 |
| vLLM | 0.11.1 |
| diffusers | 0.35.2 |
| deepgpu-comfyui | 1.3.2 |
| deepgpu-torch | 0.1.12+torch2.9.0cu128 |
| flash_attn | 2.8.3 |
| flashinfer-python | 0.5.2 |
| imageio | 2.37.2 |
| imageio-ffmpeg | 0.6.0 |
| ray | 2.51.1 |
| transformers | 4.57.1 |
| triton | 3.4.0 |
| torchaudio | 2.8.0+cu128 |
| torchvision | 0.24.0+cu128 |
| xfuser | 0.4.5 |
| xgrammar | 0.1.25 |
| ljperf | 0.1.0+477686c5 |
Image tag: 25.11-sglang0.5.5.post3-pytorch2.9-cu128-20251121-serverless
| Attribute | Value |
|---|---|
| Supported architectures | amd64 |
| Use case | Large model inference |
| Framework | PyTorch |
| NVIDIA driver requirement | ≥ 570 |
System components:
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.9.0+cu128 |
| CUDA | 12.8 |
| SGLang | 0.5.5.post3 |
| sgl-kernel | 0.3.17.post1 |
| diffusers | 0.35.2 |
| decord | 0.6.0 |
| decord2 | 2.0.0 |
| deepgpu-comfyui | 1.3.2 |
| deepgpu-torch | 0.1.12+torch2.9.0cu128 |
| flash_attn | 2.8.3 |
| flash_mla | 1.0.0+1408756 |
| flashinfer-python | 0.5.2 |
| imageio | 2.37.2 |
| imageio-ffmpeg | 0.6.0 |
| ray | 2.51.1 |
| transformers | 4.57.1 |
| triton | 3.5.0 |
| torchao | 0.9.0 |
| torchaudio | 2.8.0+cu128 |
| torchvision | 0.24.0+cu128 |
| xfuser | 0.4.5 |
| xgrammar | 0.1.25 |
| ljperf | 0.1.0+477686c5 |
CUDA 13.0 images
Image tag: 25.11-vllm0.11.1-pytorch2.9-cu130-20251120-serverless
| Attribute | Value |
|---|---|
| Supported architectures | amd64, aarch64 |
| Use case | Large model inference |
| Framework | PyTorch |
| NVIDIA driver requirement | ≥ 580 |
System components (amd64):
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.9.0+cu130 |
| CUDA | 13.0.2 |
| vLLM | 0.11.2 |
| diffusers | 0.35.2 |
| flash_attn | 2.8.3 |
| flashinfer-python | 0.5.2 |
| imageio | 2.37.2 |
| imageio-ffmpeg | 0.6.0 |
| ray | 2.51.1 |
| transformers | 4.57.1 |
| triton | 3.5.0 |
| torchaudio | 2.9.0+cu130 |
| torchvision | 0.24.0+cu130 |
| xfuser | 0.4.5 |
| xgrammar | 0.1.25 |
| ljperf | 0.1.0+d0e4a408 |
System components (aarch64):
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.9.0+cu130 |
| CUDA | 13.0.2 |
| vLLM | 0.11.1 |
| diffusers | 0.35.2 |
| flash_attn | 2.8.3 |
| flashinfer-python | 0.5.2 |
| ray | 2.51.1 |
| transformers | 4.57.1 |
| triton | 3.5.0 |
| torchaudio | 2.9.0 |
| torchvision | 0.24.0 |
| xfuser | 0.4.5 |
| xgrammar | 0.1.25 |
Image tag: 25.11-sglang0.5.5.post3-pytorch2.9-cu130-20251121-serverless
| Attribute | Value |
|---|---|
| Supported architectures | amd64, aarch64 |
| Use case | Large model inference |
| Framework | PyTorch |
| NVIDIA driver requirement | ≥ 580 |
System components (amd64):
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.9.0+cu130 |
| CUDA | 13.0.2 |
| SGLang | 0.5.5.post3 |
| sgl-kernel | 0.3.17.post1 |
| diffusers | 0.35.2 |
| decord | 0.6.0 |
| decord2 | 2.0.0 |
| flash_attn | 2.8.3 |
| flashinfer-python | 0.5.2 |
| imageio | 2.37.2 |
| imageio-ffmpeg | 0.6.0 |
| ray | 2.51.1 |
| transformers | 4.57.1 |
| triton | 3.5.0 |
| torchao | 0.9.0 |
| torchaudio | 2.9.0 |
| torchvision | 0.24.0 |
| xfuser | 0.4.5 |
| xgrammar | 0.1.25 |
| ljperf | 0.1.0+d0e4a408 |
System components (aarch64):
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12 |
| Torch | 2.9.0+cu130 |
| CUDA | 13.0.2 |
| SGLang | 0.5.5.post3 |
| sgl-kernel | 0.3.17.post1 |
| diffusers | 0.35.2 |
| decord2 | 2.0.0 |
| flash_attn | 2.8.3 |
| flashinfer-python | 0.5.2 |
| imageio | 2.37.2 |
| imageio-ffmpeg | 0.6.0 |
| transformers | 4.57.1 |
| triton | 3.5.0 |
| torchao | 0.9.0 |
| torchaudio | 2.9.0 |
| torchvision | 0.24.0 |
| xfuser | 0.4.5 |
| xgrammar | 0.1.25 |
Assets
Public images
CUDA 12.8
- egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless
- egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-sglang0.5.5.post3-pytorch2.9-cu128-20251121-serverless
CUDA 13.0
- egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu130-20251120-serverless
- egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.11-sglang0.5.5.post3-pytorch2.9-cu130-20251121-serverless
VPC images
To speed up image pulls from within your virtual private cloud (VPC), use a region-specific VPC endpoint instead of the public registry.
Replace the public image URI format:
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}
With the VPC endpoint format:
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Where:
- {region-id}: The ID of the region where your ACS service is deployed. Examples: cn-beijing, cn-wulanchabu.
- {image:tag}: The name and tag of the target container image. Examples: inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless, training-nv-pytorch:25.10-serverless.
These images are for standard ACS and the multi-tenant Lingjun environment only. Do not use them in a single-tenant Lingjun setup.
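The host substitution described above can be sketched as a small helper. This is illustrative only; the function name to_vpc_uri is not part of any SDK:

```python
# Public registry host, per the Assets section above.
PUBLIC_REGISTRY = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com"

def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Rewrite a public image URI to its region-specific VPC endpoint."""
    if not public_uri.startswith(PUBLIC_REGISTRY + "/"):
        raise ValueError("not a public egslingjun registry URI")
    path = public_uri[len(PUBLIC_REGISTRY) + 1:]  # e.g. egslingjun/{image:tag}
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/{path}"

uri = to_vpc_uri(
    "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"
    "inference-nv-pytorch:25.11-vllm0.11.1-pytorch2.9-cu128-20251120-serverless",
    "cn-beijing",
)
```

Only the registry host changes; the repository path and tag are preserved as-is.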
Driver requirements
| CUDA version | Minimum NVIDIA driver version |
|---|---|
| CUDA 12.8 | 570 |
| CUDA 13.0 | 580 |
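Before pulling an image, you can check that the host driver meets the minimum in the table above. A minimal sketch, assuming nvidia-smi is on PATH; the helper names here are illustrative:

```python
import subprocess

# Minimum NVIDIA driver major version per CUDA version, from the table above.
MIN_DRIVER = {"12.8": 570, "13.0": 580}

def driver_major_version() -> int:
    """Query the installed NVIDIA driver version via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    # e.g. "570.86.15" -> 570
    return int(out.strip().splitlines()[0].split(".")[0])

def meets_minimum(driver_major: int, cuda_version: str) -> bool:
    """True if the driver major version satisfies the CUDA requirement."""
    return driver_major >= MIN_DRIVER[cuda_version]
```

For example, a host with driver 570 can run the CUDA 12.8 images but not the CUDA 13.0 images.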
Quick start
The following example shows how to pull an inference-nv-pytorch image using Docker and run an inference service with the Qwen2.5-7B-Instruct model.
To use this image in ACS, select it from Artifact Center in the console when creating a workload, or specify the image reference in a YAML manifest.
1. Pull the image.

   ```shell
   docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
   ```

2. Download the model from ModelScope.

   ```shell
   pip install modelscope
   cd /mnt
   modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
   ```

3. Start the container.

   ```shell
   docker run -d -t --network=host --privileged --init --ipc=host \
     --ulimit memlock=-1 --ulimit stack=67108864 \
     -v /mnt/:/mnt/ \
     egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
   ```

4. Test vLLM inference.

   1. Start the vLLM API server.

      ```shell
      python3 -m vllm.entrypoints.openai.api_server \
        --model /mnt/Qwen2.5-7B-Instruct \
        --trust-remote-code --disable-custom-all-reduce \
        --tensor-parallel-size 1
      ```

   2. Send a test request.

      ```shell
      curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "/mnt/Qwen2.5-7B-Instruct",
          "messages": [
            {"role": "system", "content": "You are a friendly AI assistant."},
            {"role": "user", "content": "Introduce deep learning."}
          ]
        }'
      ```

      For more information about how to use vLLM, see vLLM.
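The same test request can be sent from Python. This is a sketch using only the standard library; it assumes the vLLM server started above is listening on localhost:8000 (sending is wrapped in a function so nothing is transmitted until you call it):

```python
import json
import urllib.request

# Same payload as the curl test request above.
payload = {
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a friendly AI assistant."},
        {"role": "user", "content": "Introduce deep learning."},
    ],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

def send(request: urllib.request.Request) -> dict:
    """POST the chat-completion request and decode the JSON response."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling send(req) returns the parsed chat-completion response from the server.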
Known issues
- The deepgpu-comfyui plugin for accelerating Wanx model video generation supports only the GN8IS, G49E, and G59 instance types.