This page covers release notes for the training-nv-pytorch 25.12 image: new features, image contents, quick start instructions, and known issues.
What's new
New features
- vLLM upgraded to 0.12.0
- flashinfer-python upgraded to 0.5.3
Bug fixes
None.
Image contents
The following table lists the two image variants in this release, including CUDA versions, driver requirements, supported architectures, and pre-installed components.
| | Variant 1 | Variant 2 |
|---|---|---|
| Image name | training-nv-pytorch | training-nv-pytorch |
| Tag | 25.12-cu130-serverless | 25.12-cu128-serverless |
| Scenarios | Training/Inference | Training/Inference |
| Framework | PyTorch | PyTorch |
| Requirements | NVIDIA Driver release >= 580 | NVIDIA Driver release >= 575 |
| Supported architectures | amd64 & aarch64 | amd64 |
| Core components | | |
Image assets
Public images
- CUDA 13.0.2 (Driver >= 580, amd64 & aarch64):
  `egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu130-serverless`
- CUDA 12.8 (Driver >= 575, amd64):
  `egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu128-serverless`
VPC images
To pull ACS AI container images faster within a VPC, replace the image URI prefix:
- Replace: `egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}`
- With: `acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}`
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | Region ID of the ACS product. For a full list, see Available regions. | cn-beijing, cn-wulanchabu |
| {image:tag} | Image name and tag | training-nv-pytorch:25.12-cu130-serverless |
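The prefix substitution above can be sketched in a few lines of Python. The URI and region ID below are example values taken from the tables; `to_vpc_uri` is a hypothetical helper name:

```python
# Sketch: rewrite a public registry URI to its VPC-accelerated equivalent.
PUBLIC_PREFIX = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com"

def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Swap the public registry host for the per-region VPC endpoint."""
    vpc_prefix = f"acs-registry-vpc.{region_id}.cr.aliyuncs.com"
    return public_uri.replace(PUBLIC_PREFIX, vpc_prefix, 1)

uri = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu130-serverless"
print(to_vpc_uri(uri, "cn-wulanchabu"))
# acs-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu130-serverless
```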
This image is compatible with ACS products and Lingjun multi-tenant products. It is not compatible with Lingjun single-tenant products.
Driver requirements
The 25.12 release provides two CUDA variants with different driver requirements:
| Image tag | CUDA version | Minimum driver version |
|---|---|---|
| 25.12-cu130-serverless | CUDA 13.0.2 | NVIDIA Driver 580 |
| 25.12-cu128-serverless | CUDA 12.8.0 | NVIDIA Driver 575 |
For driver compatibility details, see:
- CUDA Application Compatibility: compatibility between CUDA versions and driver releases
- CUDA Compatibility and Upgrades: upgrade guidance and best practices
Key features and enhancements
PyTorch compiler optimization
torch.compile(), introduced in PyTorch 2.0, works well for single-GPU training. For LLM training on distributed frameworks such as Fully Sharded Data Parallel (FSDP) or DeepSpeed, the compiler cannot capture the full compute graph, which limits or negates its benefit.
This release addresses that limitation with two improvements:
- Finer communication granularity in DeepSpeed: by controlling the granularity of communication operations, the compiler can capture a wider compute-graph scope and apply more aggressive optimizations.
- Frontend compiler improvements: the PyTorch compiler frontend is updated to handle graph breaks without stopping compilation. Mode matching and dynamic-shape support are also enhanced.
These optimizations deliver a 20% end-to-end (E2E) throughput improvement when training an 8B LLM.
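As a minimal illustration of the compiler entry point (not of the distributed FSDP/DeepSpeed setup described above), `torch.compile` wraps an ordinary module. The toy MLP and shapes are placeholders, and `backend="eager"` is used only so the sketch runs anywhere; real training would use the default Inductor backend:

```python
import torch

# Toy model standing in for the real LLM.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 16),
)

# backend="eager" skips code generation so the sketch runs on any machine;
# production training would use the default inductor backend instead.
compiled = torch.compile(model, backend="eager")

x = torch.randn(4, 16)
out = compiled(x)
print(out.shape)  # torch.Size([4, 16])
```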
GPU memory optimization for recomputation
This release includes an automated activation recomputation tuner. It analyzes GPU memory consumption across model configurations and cluster deployments by collecting metrics such as GPU memory utilization. Based on the analysis, it determines the optimal number of activation recomputation layers and integrates the recommendation directly into PyTorch.
This feature is currently available in the DeepSpeed framework.
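The tuner itself ships with the image, but the mechanism it tunes is standard activation recomputation: checkpointed blocks discard their activations in the forward pass and recompute them during backward. A generic sketch with `torch.utils.checkpoint` (toy layer sizes, not the image's tuner API):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Recompute this block's activations in the backward pass instead of
# storing them, trading extra compute for lower GPU memory.
block = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 64),
)

x = torch.randn(8, 64, requires_grad=True)

# use_reentrant=False selects the recommended non-reentrant implementation.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()  # activations of `block` are recomputed here
print(x.grad.shape)  # torch.Size([8, 64])
```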
E2E performance evaluation
Performance was measured with CNP, a cloud-native AI performance evaluation and analysis tool. The evaluation compares E2E training throughput across the following configurations on a multi-node GPU cluster, using mainstream open-source models.
Image comparison: base image and iterative improvements
E2E performance contribution analysis of core GPU components
The following tests are based on version 25.12:
- Base: NGC PyTorch image
- ACS AI image (Base+ACCL): the NGC base plus ACCL (Alibaba Cloud Communication Library)
- ACS AI image (AC2+ACCL): AC2 BaseOS with ACCL and no optimizations enabled
- ACS AI image (AC2+ACCL+CompilerOpt): AC2 BaseOS with only the torch compile optimization enabled
- ACS AI image (AC2+ACCL+CompilerOpt+CkptOpt): AC2 BaseOS with both torch compile and selective gradient checkpointing enabled
Quick start
The following examples show how to pull and run the training-nv-pytorch image using Docker.
To use this image in ACS, select it from the Artifacts page in the console when creating a workload, or specify the image reference in a YAML file.
1. Pull the image
```shell
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
```

Replace `[tag]` with the target image tag, such as `25.12-cu130-serverless` or `25.12-cu128-serverless`.
2. Enable compiler and memory optimizations
Enable compiler optimization
Use the Hugging Face Transformers Trainer API:
Enable GPU memory optimization for recomputation
```shell
export CHECKPOINT_OPTIMIZATION=true
```
3. Start the container
The image includes a built-in model training tool, ljperf. The following example starts the container and runs a training job.
LLM workloads
```shell
# Start the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
```
4. Configuration recommendations
Follow these recommendations when using this image:
- Do not reinstall pre-bundled libraries such as PyTorch and DeepSpeed. The image is tuned with specific library versions; reinstalling them may break optimizations.
- Leave `zero_optimization.stage3_prefetch_bucket_size` in your DeepSpeed configuration blank, or set it to `auto`.
- Set `NCCL_SOCKET_IFNAME` based on the number of GPUs requested per pod:

  | GPU count per pod | Setting |
  |---|---|
  | 1, 2, 4, or 8 GPUs | `NCCL_SOCKET_IFNAME=eth0` (default) |
  | 16 GPUs (all GPUs on the machine, using High-Performance Network (HPN)) | `NCCL_SOCKET_IFNAME=hpn0` |
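The interface-selection rule in the table above can be expressed as a small helper; `choose_nccl_ifname` is a hypothetical name, not part of the image:

```python
import os

# Sketch of the NCCL_SOCKET_IFNAME rule from the table above.
def choose_nccl_ifname(gpus_per_pod: int) -> str:
    """Return the NCCL socket interface for a given per-pod GPU count."""
    if gpus_per_pod == 16:  # whole machine, High-Performance Network (HPN)
        return "hpn0"
    if gpus_per_pod in (1, 2, 4, 8):
        return "eth0"  # default
    raise ValueError(f"unsupported GPU count per pod: {gpus_per_pod}")

os.environ["NCCL_SOCKET_IFNAME"] = choose_nccl_ifname(8)
print(os.environ["NCCL_SOCKET_IFNAME"])  # eth0
```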
Known issues
Compiling fa3 on CUDA 13.0.2 fails
Condition: Using the CUDA 13.0.2 (25.12-cu130-serverless) image and compiling flash-attention 3 (fa3) directly inside the container.
Impact: The compilation fails with an error.
Workaround: This is a known community issue. Do not compile fa3 directly on the CUDA 13.0.2 image. Use the CUDA 12.8 (25.12-cu128-serverless) image if fa3 compilation is required.