This topic provides the release notes for training-nv-pytorch 25.12.
Main features and bug fixes
Main features
vLLM is upgraded to 0.12.0, and flashinfer-python is upgraded to 0.5.3.
Bug fixes
None.
Contents
Image name | training-nv-pytorch | |
Tag | 25.12-cu130-serverless | 25.12-cu128-serverless |
Scenarios | Training/Inference | |
Framework | PyTorch | |
Requirements | NVIDIA Driver release >= 580 | NVIDIA Driver release >= 575 |
Supported architectures | amd64 & aarch64 | amd64 |
Core components |
|
|
Assets
Public images
CUDA 13.0.2 (Driver >= 580, amd64 & aarch64)
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu130-serverless
CUDA 12.8 (Driver >= 575, amd64)
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu128-serverless
VPC images
To quickly pull ACS AI container images within a VPC, replace the specified AI container image URI egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.
{region-id}: The region ID of the ACS product. For more information, see Available regions. Examples: cn-beijing and cn-wulanchabu.
{image:tag}: The name and tag of the AI container image. Examples: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
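The URI substitution described above can be sketched as a small helper. This is an illustration only: the function name `to_vpc_uri` and the constant `PUBLIC_PREFIX` are hypothetical, and the sketch assumes the public URI always starts with the registry prefix quoted in this topic.

```python
# Registry prefix of the public images, as quoted in this topic.
PUBLIC_PREFIX = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"


def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Rewrite a public image URI to its VPC form for the given region ID."""
    if not public_uri.startswith(PUBLIC_PREFIX):
        raise ValueError(f"not a public ACS AI image URI: {public_uri}")
    # Everything after the prefix is the {image:tag} part and is kept unchanged.
    image_and_tag = public_uri[len(PUBLIC_PREFIX):]
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"


public = PUBLIC_PREFIX + "training-nv-pytorch:25.12-cu128-serverless"
print(to_vpc_uri(public, "cn-beijing"))
# prints acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu128-serverless
```

Rejecting URIs that do not carry the public prefix keeps the helper from silently producing a malformed VPC address.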
This image is suitable for ACS products and Lingjun multi-tenant products. This image is not suitable for Lingjun single-tenant products. Do not use it in a Lingjun single-tenant scenario.
Driver requirements
The 25.12 release is available in CUDA 12.8.0 and CUDA 13.0.2 variants, which require different driver versions. CUDA 13.0.2 requires NVIDIA driver version 580 or later. CUDA 12.8.0 requires NVIDIA driver version 575 or later. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
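The driver rule above can be expressed as a short lookup. This is a sketch under stated assumptions: the helper name `pick_tag` is hypothetical, only the major driver version is compared, and CPU architecture is ignored (the cu128 tag is amd64 only). A driver version string of this shape can be obtained with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
from typing import Optional

# (minimum driver major version, image tag), newest CUDA first.
# Values are taken from this release note.
TAG_REQUIREMENTS = [
    (580, "25.12-cu130-serverless"),  # CUDA 13.0.2
    (575, "25.12-cu128-serverless"),  # CUDA 12.8.0
]


def pick_tag(driver_version: str) -> Optional[str]:
    """Return the newest 25.12 tag the driver supports, or None if it is too old."""
    major = int(driver_version.split(".")[0])
    for min_major, tag in TAG_REQUIREMENTS:
        if major >= min_major:
            return tag
    return None


# The version strings below are illustrative inputs, not values from this topic.
print(pick_tag("580.65.06"))
print(pick_tag("575.57.08"))
print(pick_tag("570.124.06"))
```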
Key features and enhancements
PyTorch compiling optimization
The compilation optimization feature introduced in PyTorch 2.0 is well suited to small-scale training on a single GPU. However, LLM training requires GPU memory optimization and a distributed framework, such as FSDP or DeepSpeed. As a result, torch.compile() may bring no benefit to your training or may even degrade performance.
Controlling the communication granularity in the DeepSpeed framework helps the compiler obtain a complete compute graph for a wider scope of compiling optimization.
Optimized PyTorch:
The frontend of the PyTorch compiler is optimized so that compilation still proceeds when a graph break occurs in a compute graph.
Pattern matching and dynamic shape capabilities are enhanced to optimize the compiled code.
With the preceding optimizations, E2E throughput increases by 20% when an 8B LLM is trained.
GPU memory optimization for recomputation
We predict and analyze the GPU memory consumption of models by running performance tests on models deployed in different clusters or configured with different parameters, and by collecting system metrics such as GPU memory utilization. Based on the results, we recommend the optimal number of activation recomputation layers and integrate this recommendation into PyTorch, so that users can benefit from GPU memory optimization with little effort. Currently, this feature is available in the DeepSpeed framework.
E2E performance benefit evaluation
Using the cloud-native AI performance evaluation and analysis tool CNP, we conducted a comprehensive end-to-end performance comparison against a standard base image, based on mainstream open source models and framework configurations. We also performed ablation experiments to evaluate the contribution of each optimization component to the overall model training performance.
Image comparison: Base image and iteration evaluation

E2E performance contribution analysis of core GPU components
The following tests are based on version 25.12. They show an E2E performance evaluation and comparison for training on a multi-node GPU cluster. The comparison items include the following:
Base: NGC PyTorch Image
ACS AI Image: Base+ACCL: The image uses the ACCL communication library.
ACS AI Image: AC2+ACCL: This image uses AC2 BaseOS with no optimizations enabled.
ACS AI Image: AC2+ACCL+CompilerOpt: This image uses AC2 BaseOS with only the torch compile optimization enabled.
ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: This image uses AC2 BaseOS with both torch compile and selective gradient checkpoint optimizations enabled.

Quick start
The following examples show how to pull the training-nv-pytorch image using Docker.
To use the training-nv-pytorch image in ACS, you can select it from the Artifacts page in the console when you create a workload, or specify the image reference in a YAML file.
1. Select an image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Call the API to enable the compiler and recomputation for GPU memory optimization
Enable compilation optimization
Use the transformers Trainer API:
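A minimal sketch of this step, assuming the standard Hugging Face `TrainingArguments` field `torch_compile` is the switch meant here. The names `TRAINING_KWARGS` and `build_training_args` and the output path are placeholders for illustration, not values from this topic.

```python
# Keyword arguments for transformers.TrainingArguments.
# "./output" is a placeholder path, not a value from this release note.
TRAINING_KWARGS = {
    "output_dir": "./output",
    "torch_compile": True,  # ask Trainer to compile the model with torch.compile()
}


def build_training_args():
    """Create TrainingArguments with compilation enabled."""
    # Imported here so the settings above can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)


# Typical use inside a training script (model and dataset are defined elsewhere):
#   trainer = Trainer(model=model, args=build_training_args(), train_dataset=train_dataset)
#   trainer.train()
```

The remaining `TrainingArguments` fields (batch size, learning rate, DeepSpeed config path, and so on) are left at their defaults in this sketch and would be set as in any other Trainer script.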

Enable recomputation for GPU memory optimization
export CHECKPOINT_OPTIMIZATION=true
3. Start the container
The image includes a built-in model training tool named ljperf. The following steps show how to use this tool to start a container and run a training job.
LLM class
# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
4. Recommendations
This image contains modified versions of libraries such as PyTorch and DeepSpeed. Do not reinstall them.
Leave `zero_optimization.stage3_prefetch_bucket_size` in the DeepSpeed configuration empty or set it to `auto`.
The built-in environment variable NCCL_SOCKET_IFNAME in this image must be dynamically adjusted based on the scenario:
When a single pod requests 1, 2, 4, or 8 cards for a training or inference task, set NCCL_SOCKET_IFNAME=eth0. This is the default configuration in this image.
When a single pod requests all 16 cards on a machine for a training or inference task, you can use the High-Performance Network (HPN). Set NCCL_SOCKET_IFNAME=hpn0.
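The two configuration recommendations above can be sketched as follows. The helper name `nccl_ifname_for` is hypothetical, and the `"stage": 3` entry in the DeepSpeed fragment is an assumption implied by the `stage3_` prefix of the key; only `stage3_prefetch_bucket_size` comes from this topic.

```python
import json


def nccl_ifname_for(gpus_per_pod: int) -> str:
    """Return the NCCL_SOCKET_IFNAME value recommended above for a pod's GPU count."""
    if gpus_per_pod == 16:
        return "hpn0"  # whole machine requested: HPN can be used
    if gpus_per_pod in (1, 2, 4, 8):
        return "eth0"  # default configuration in this image
    raise ValueError(f"GPU count not covered by the recommendation: {gpus_per_pod}")


# DeepSpeed configuration fragment. "stage": 3 is an assumption (see lead-in);
# the prefetch bucket size is set to "auto" as recommended.
ds_config_fragment = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    }
}

print(f"export NCCL_SOCKET_IFNAME={nccl_ifname_for(16)}")
print(json.dumps(ds_config_fragment, indent=2))
```

The printed `export` line would be placed in the pod's startup command so that it takes effect before the training process initializes NCCL.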
Known issues
Compiling fa3 directly on the CUDA 13.0.2 image causes an error. This is a known community issue.