This topic provides the release notes for inference-nv-pytorch version 25.12.
Key features and bug fixes
Key features
Dual CUDA version support
Images for two different CUDA versions are now provided:
The CUDA 12.8 image supports the amd64 architecture.
The CUDA 13.0 image supports both amd64 and aarch64 architectures.
Core component upgrades
The PyTorch version has been upgraded to 2.9.0 in the vLLM image and 2.9.1 in the SGLang image.
For the CUDA 12.8 image:
deepgpu-comfyui has been upgraded to 1.3.2.
The deepgpu-torch optimization component has been upgraded to 0.1.12+torch2.9.0cu128.
For both the CUDA 12.8 and CUDA 13.0 images:
vLLM has been upgraded to version v0.12.0.
SGLang has been upgraded to version v0.5.6.post2.
Bug fixes
No bug fixes in this release.
Contents
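Image name | inference-nv-pytorch | inference-nv-pytorch | inference-nv-pytorch | inference-nv-pytorch | inference-nv-pytorch | inference-nv-pytorch |
Tag | 25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless | 25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless | 25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless | 25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless | 25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless | 25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless |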
Supported architectures | amd64 | amd64 | amd64 | aarch64 | amd64 | aarch64 |
Use case | Large model inference | Large model inference | Large model inference | Large model inference | Large model inference | Large model inference |
Framework | pytorch | pytorch | pytorch | pytorch | pytorch | pytorch |
Requirements | NVIDIA Driver release ≥ 570 | NVIDIA Driver release ≥ 570 | NVIDIA Driver release ≥ 580 | NVIDIA Driver release ≥ 580 | NVIDIA Driver release ≥ 580 | NVIDIA Driver release ≥ 580 |
System components |
|
|
|
|
|
|
Assets
Public images
CUDA 12.8
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu128-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu128-20251215-serverless
CUDA 13.0
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-vllm0.12.0-pytorch2.9-cu130-20251215-serverless
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:25.12-sglang0.5.6.post2-pytorch2.9-cu130-20251215-serverless
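The four public image references above differ only in the inference engine and the CUDA variant. The following is a minimal sketch of a shell helper that assembles the reference from those two values. The function name `image_ref` and its argument names are hypothetical and are not part of any ACS tooling. The tag components are copied from the list above and apply to release 25.12 only.

```shell
# Hypothetical helper: build the public image reference for release 25.12.
# Usage: image_ref <engine> <cuda>
#   engine: vllm | sglang
#   cuda:   cu128 | cu130
image_ref() {
  local engine="$1" cuda="$2" engine_tag
  # Map the engine name to the versioned tag component used in this release.
  case "$engine" in
    vllm)   engine_tag="vllm0.12.0" ;;
    sglang) engine_tag="sglang0.5.6.post2" ;;
    *) echo "unknown engine: $engine" >&2; return 1 ;;
  esac
  # Only the two CUDA variants published for this release are accepted.
  case "$cuda" in
    cu128|cu130) ;;
    *) echo "unknown CUDA variant: $cuda" >&2; return 1 ;;
  esac
  printf '%s/inference-nv-pytorch:25.12-%s-pytorch2.9-%s-20251215-serverless\n' \
    "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun" \
    "$engine_tag" "$cuda"
}
```

For example, `docker pull "$(image_ref vllm cu128)"` would pull the vLLM image built for CUDA 12.8.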
VPC images
To speed up image pulls from within your virtual private cloud (VPC), replace the standard image asset URI with a region-specific VPC endpoint.
Change the image path from this format:
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}
To this format:
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id}: The ID of the region where your ACS service is deployed. Examples: cn-beijing, cn-wulanchabu.
{image:tag}: The name and tag of the target AI container image. Examples: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
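The path rewrite described above can be sketched as a small shell function. This is an illustration only: the function name `to_vpc_image` is hypothetical, and the sketch assumes the public reference always starts with the prefix shown in this topic.

```shell
# Hypothetical helper: rewrite a public image reference to its VPC form.
# Usage: to_vpc_image <public-image-ref> <region-id>
to_vpc_image() {
  local public_ref="$1" region_id="$2"
  local public_prefix="egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"
  # Reject references that do not start with the public registry prefix.
  case "$public_ref" in
    "$public_prefix"*) ;;
    *) echo "not a public egslingjun image: $public_ref" >&2; return 1 ;;
  esac
  # Keep only the {image:tag} part, then prepend the region-specific VPC endpoint.
  local image_tag="${public_ref#"$public_prefix"}"
  printf 'acs-registry-vpc.%s.cr.aliyuncs.com/egslingjun/%s\n' "$region_id" "$image_tag"
}
```

Calling the function with a public reference and a region ID such as cn-beijing prints the matching VPC reference, which can then be passed to `docker pull` or placed in a workload manifest.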
These images are suitable for standard ACS products and multi-tenant Lingjun environments. Do not use them in single-tenant Lingjun environments.
Driver requirements
CUDA 12.8: Requires NVIDIA driver version 570 or later.
CUDA 13.0: Requires NVIDIA driver version 580 or later.
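Combined with the architecture support listed earlier in this topic, the driver requirements determine which image variant a host can run. The sketch below encodes that decision in a shell function. The name `pick_cuda_variant` is hypothetical, and preferring cu130 whenever the driver allows it is an assumption made for this example, because an amd64 host with driver 580 or later can run either variant.

```shell
# Hypothetical helper: choose an image variant from driver version and architecture.
# Usage: pick_cuda_variant <driver-version> <arch>
#   driver-version: for example 580.65.06
#   arch:           amd64 | aarch64
pick_cuda_variant() {
  local driver_version="$1" arch="$2"
  # The major version is everything before the first dot.
  local major="${driver_version%%.*}"
  case "$major" in
    ''|*[!0-9]*) echo "cannot parse driver version: $driver_version" >&2; return 1 ;;
  esac
  case "$arch" in
    amd64|aarch64) ;;
    *) echo "unsupported architecture: $arch" >&2; return 1 ;;
  esac
  if [ "$major" -ge 580 ]; then
    # CUDA 13.0 images support both amd64 and aarch64.
    echo cu130
  elif [ "$major" -ge 570 ] && [ "$arch" = amd64 ]; then
    # CUDA 12.8 images support amd64 only.
    echo cu128
  else
    echo "no 25.12 image for driver $driver_version on $arch" >&2
    return 1
  fi
}
```

On a GPU host, the driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed as the first argument.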
Quick start
The following example shows how to pull the inference-nv-pytorch image using Docker and test the inference service with the Qwen2.5-7B-Instruct model.
To use this image in ACS, select it from the Artifact Center in the console when creating a workload, or specify the image reference in a YAML manifest. For more information, see the following topics about building model inference services with ACS GPU computing power:
Pull the image.
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
Download the open-source model from ModelScope.
pip install modelscope
cd /mnt
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
Run the following commands to start the container and enter its shell.
docker run -d -t --network=host --privileged --init --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /mnt/:/mnt/ \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:[tag]
Run an inference test to check the vLLM conversational inference feature.
Start the server.
python3 -m vllm.entrypoints.openai.api_server \
  --model /mnt/Qwen2.5-7B-Instruct \
  --trust-remote-code --disable-custom-all-reduce \
  --tensor-parallel-size 1
Test from the client side.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a friendly AI assistant."},
      {"role": "user", "content": "Tell me about deep learning."}
    ]}'
For more information about how to use vLLM, see vLLM.
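Loading the model takes some time, so a client request sent immediately after the server starts may be refused. The following sketch polls the server before the client test is run. It assumes the server started above listens on localhost:8000 and exposes the OpenAI-compatible /v1/models route. The function name `wait_for_server` and its default retry values are hypothetical.

```shell
# Hypothetical helper: poll the server until it answers or the attempts run out.
# Usage: wait_for_server <base-url> [attempts] [delay-seconds]
wait_for_server() {
  local base_url="$1" attempts="${2:-60}" delay="${3:-5}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failures; -s: silent; -o /dev/null: discard the body.
    if curl -fs -o /dev/null --max-time 5 "$base_url/v1/models"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "server at $base_url did not become ready" >&2
  return 1
}
```

A typical call is `wait_for_server http://localhost:8000`, followed by the chat completion request once the function returns successfully.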
Known issues
The deepgpu-comfyui plugin, which accelerates video generation for Wanx models, currently supports only the GN8IS, G49E, and G59 instance types.