This release adds CUDA 13.0 support with aarch64 architecture coverage, and upgrades vLLM to v0.12.0 and SGLang to v0.5.6.post2 across both CUDA variants.
What's new
Dual CUDA version support
Starting with this release, images are published for two CUDA versions:
- CUDA 12.8 — supports amd64 only
- CUDA 13.0 — supports both amd64 and aarch64
Core component upgrades
| Component | Version | Images |
|---|---|---|
| vLLM | v0.12.0 | CUDA 12.8 and 13.0 |
| SGLang | v0.5.6.post2 | CUDA 12.8 and 13.0 |
| PyTorch | 2.9.0 (vLLM images) / 2.9.1 (SGLang images) | CUDA 12.8 and 13.0 |
| deepgpu-comfyui | 1.3.2 | CUDA 12.8 only |
| deepgpu-torch | 0.1.12+torch2.9.0cu128 | CUDA 12.8 only |
Bug fixes
No bug fixes in this release.
Image contents
All images use PyTorch as the framework and are designed for large model inference.
The image name for all tags is inference-nv-pytorch.

| Tag | Supported architectures | Use case | Framework | Requirements | System components |
|---|---|---|---|---|---|
| 25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless | amd64 | Large model inference | pytorch | NVIDIA Driver release ≥ 570 | Base environment, inference frameworks, Alibaba Cloud components |
| 25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless | amd64 | Large model inference | pytorch | NVIDIA Driver release ≥ 570 | Base environment, inference frameworks, Alibaba Cloud components |
| 25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless | amd64, aarch64 | Large model inference | pytorch | NVIDIA Driver release ≥ 580 | Base environment, inference frameworks |
| 25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless | amd64, aarch64 | Large model inference | pytorch | NVIDIA Driver release ≥ 580 | Base environment, inference frameworks |
Assets
Public images
CUDA 12.8
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless
CUDA 13.0
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless
VPC images
To speed up image pulls from within your virtual private cloud (VPC), replace the standard registry hostname with a region-specific VPC endpoint.
Change the image path from:
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}
To:
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | The ID of the region where your ACS service is deployed | cn-beijing, cn-wulanchabu |
| {image:tag} | The name and tag of the target AI container image | inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless |
VPC images are compatible with standard ACS products and multi-tenant Lingjun environments. Do not use them in single-tenant Lingjun environments.
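The hostname substitution is mechanical, so it can be scripted. A minimal sketch, using an image reference and region from this page as example values:

```shell
# Sketch: derive the VPC pull address from a public image reference.
# PUBLIC_IMAGE and REGION_ID below are example values from this page.
PUBLIC_IMAGE="egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless"
REGION_ID="cn-wulanchabu"
# Strip the registry hostname (everything up to the first "/") and prepend the VPC endpoint.
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/${PUBLIC_IMAGE#*/}"
echo "$VPC_IMAGE"
```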
Driver requirements
| CUDA version | Minimum NVIDIA Driver version |
|---|---|
| CUDA 12.8 | 570 |
| CUDA 13.0 | 580 |
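The driver floor determines which image variant a host can run, so it is worth checking before choosing a tag. A small sketch, assuming the version string comes from `nvidia-smi`; the `supported_variants` helper and the `cu128`/`cu130` labels are illustrative names, not part of the images:

```shell
# Sketch: map an NVIDIA driver version string to the CUDA image variants it can run.
# On a real host, obtain the version with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
supported_variants() {
  local major=${1%%.*}          # keep only the major version number
  if [ "$major" -ge 580 ]; then
    echo "cu128 cu130"          # driver >= 580: both CUDA 12.8 and 13.0 images
  elif [ "$major" -ge 570 ]; then
    echo "cu128"                # driver >= 570: CUDA 12.8 images only
  else
    echo "none"                 # below the minimum for this release
  fi
}

supported_variants "570.86.10"  # prints "cu128"
```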
Quick start
The following example pulls the inference-nv-pytorch image and runs a conversational inference test using the Qwen2.5-7B-Instruct model.
To use this image in ACS, select it from the Artifact Center in the console when creating a workload, or specify the image reference in a YAML manifest. For step-by-step deployment guides, see the ACS documentation.
1. Pull the image.

   ```shell
   docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
   ```

2. Download the Qwen2.5-7B-Instruct model from ModelScope.

   ```shell
   pip install modelscope
   cd /mnt
   modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
   ```

3. Start the container.

   ```shell
   docker run -d -t --network=host --privileged --init --ipc=host \
     --ulimit memlock=-1 --ulimit stack=67108864 \
     -v /mnt/:/mnt/ \
     egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
   ```

4. Start the vLLM inference server inside the container.

   ```shell
   python3 -m vllm.entrypoints.openai.api_server \
     --model /mnt/Qwen2.5-7B-Instruct \
     --trust-remote-code --disable-custom-all-reduce \
     --tensor-parallel-size 1
   ```

5. Send a test request to the server.

   ```shell
   curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "/mnt/Qwen2.5-7B-Instruct",
       "messages": [
         {"role": "system", "content": "You are a friendly AI assistant."},
         {"role": "user", "content": "Tell me about deep learning."}
       ]
     }'
   ```

For more information about vLLM usage, see the vLLM documentation.
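The server returns OpenAI-style chat-completion JSON, so the assistant's reply can be extracted with a one-liner. A sketch using a sample payload for illustration; a live request would pipe the `curl` output into the same command:

```shell
# Sample response shape (illustrative); a real response comes from the request above.
response='{"choices":[{"message":{"role":"assistant","content":"Deep learning is a branch of machine learning."}}]}'
# Extract the assistant message content with Python's stdlib JSON parser.
echo "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```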
Known issues
| Issue | Affected scope | Workaround |
|---|---|---|
| The deepgpu-comfyui plugin for Wanx model video generation acceleration supports only the GN8IS, G49E, and G59 instance types. | CUDA 12.8 images | Use a GN8IS, G49E, or G59 instance when running Wanx model video generation workloads with deepgpu-comfyui. |