training-nv-pytorch 25.04 is an ACS AI training container image based on NGC PyTorch 25.03, with Alibaba Cloud optimizations for large-scale LLM training and inference on GPU clusters.
What's new
Base image aligned with NGC 25.03. CUDA upgraded to 12.8.1, TransformerEngine upgraded to 2.1.
Triton adapted to 3.2.0, Accelerate upgraded to 1.6.0+ali, with corresponding version features and bug fixes integrated.
vLLM upgraded to 0.8.5, flashinfer-python upgraded to 0.2.5, Transformers upgraded to 4.51.2+ali, adding Qwen3 support.
Bug fixes: None
Announcements
The image includes modified PyTorch and DeepSpeed libraries. Do not reinstall these libraries.
In your DeepSpeed configuration, leave `zero_optimization.stage3_prefetch_bucket_size` blank or set it to `auto`.
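For example, a minimal ZeRO-3 configuration fragment in Python dict form. Only the `stage3_prefetch_bucket_size` setting comes from this document; the other keys are common ZeRO-3 settings shown for context.

```python
# Illustrative DeepSpeed ZeRO-3 config fragment (Python dict form).
# Only stage3_prefetch_bucket_size is mandated by this image's release
# notes; the surrounding keys are typical ZeRO-3 settings for context.
ds_config = {
    "train_batch_size": "auto",
    "zero_optimization": {
        "stage": 3,
        # Leave blank or set to "auto" when using this image:
        "stage3_prefetch_bucket_size": "auto",
    },
}
```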
Components
| Scenarios | Training, inference |
|---|---|
| Framework | PyTorch |
| NVIDIA driver | >= 570 |
Core components:
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| Torch | 2.6.0.7 |
| CUDA | 12.8.1 |
| ACCL-N | 2.23.4.12 |
| Triton | 3.2.0 |
| TransformerEngine | 2.1 |
| DeepSpeed | 0.15.4+ali |
| flash-attn | 2.7.2 |
| flashattn-hopper | 3.0.0b1 |
| Transformers | 4.51.2+ali |
| megatron-core | 0.9.0 |
| grouped_gemm | 1.1.4 |
| Accelerate | 1.6.0+ali |
| diffusers | 0.31.0 |
| openmim | 0.3.9 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| opencv-python-headless | 4.10.0.84 |
| ultralytics | 8.2.74 |
| timm | 1.0.13 |
| vLLM | 0.8.5+cu128 |
| flashinfer | 0.2.5 |
| pytorch-dynamic-profiler | 0.24.11 |
| perf | 5.4.30 |
| gdb | 15.0.50 |
| peft | 0.13.2 |
| ray | 2.43.0 |
Image assets
25.04
Public image:
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.04-serverless
VPC image:
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id}: the region where your ACS is activated, such as cn-beijing or cn-wulanchabu.
{image:tag}: the name and tag of the image.
Currently, you can pull VPC images only in the China (Beijing) region.
training-nv-pytorch:25.04-serverless is compatible with the ACS product form and the Lingjun multi-tenant product form. It is not compatible with the Lingjun single-tenant product form.
training-nv-pytorch:25.04 (no suffix) is for Lingjun single-tenant scenarios.
Driver requirements
This release is based on CUDA 12.8.1.012.
| Scenario | Minimum driver version |
|---|---|
| Standard GPUs | 570 or later |
| Data center GPUs (T4 and similar) | 470.57 (R470), 525.85 (R525), 535.86 (R535), or 545.23 (R545) |
The following driver branches are not forward-compatible with CUDA 12.8. Upgrade if you are on any of them: R418, R440, R450, R460, R510, R520, R530, R545, R555, R560.
For the full list of supported drivers, see CUDA application compatibility. For background on compatibility and upgrades, see CUDA compatibility and upgrades.
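The table above can be encoded as a small check. This is a hypothetical helper, not part of the image: standard GPUs need driver 570 or later, while data center GPUs may instead run one of the listed forward-compatible branch minimums. The branch cutoffs below are taken from this document.

```python
# Hypothetical helper encoding the driver-requirement table above.
# Standard GPUs: driver >= 570. Data center GPUs: additionally accept the
# forward-compatible branch minimums listed in this document.
FORWARD_COMPAT_MINIMUMS = {
    470: (470, 57),
    525: (525, 85),
    535: (535, 86),
    545: (545, 23),
}

def supports_cuda_12_8(driver_version: str, data_center: bool = False) -> bool:
    parts = [int(p) for p in driver_version.split(".")[:2]]
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else 0
    if major >= 570:
        return True
    if data_center and major in FORWARD_COMPAT_MINIMUMS:
        return (major, minor) >= FORWARD_COMPAT_MINIMUMS[major]
    return False
```

You could compare the result against the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader` on the host.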
Key features and enhancements
PyTorch compilation optimization
torch.compile(), introduced in PyTorch 2.0, works well for single-GPU training, but it provides limited or even negative benefit for LLM training, which requires GPU memory optimization and distributed frameworks such as FSDP or DeepSpeed.
This release addresses that gap with two compiler improvements:
Communication granularity control in DeepSpeed: tunes the communication granularity of the DeepSpeed framework, giving the compiler a wider computation-graph scope for optimization.
Improved PyTorch compiler frontend: keeps compilation proceeding even when graph breaks occur in a computation graph, with improved pattern matching and dynamic-shapes support.
After these optimizations, end-to-end throughput increases by 20% when training an 8B LLM.
GPU memory optimization for recomputation
This release analyzes GPU memory consumption across different cluster configurations and parameter settings, then derives the optimal number of activation recomputation layers and integrates them into PyTorch. This makes selective activation checkpointing available with minimal configuration. Currently supported in the DeepSpeed framework.
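The layer-selection idea can be sketched as a small search. Everything below is a hypothetical illustration with made-up numbers and function names; the image derives the actual values internally.

```python
# Hypothetical sketch of selective activation checkpointing selection:
# find the smallest number of layers whose activations are dropped and
# recomputed so that the estimated activation memory fits the budget.
# All numbers and names are illustrative; the image performs this
# analysis internally.
def min_recompute_layers(num_layers: int,
                         act_mem_per_layer_gb: float,
                         budget_gb: float) -> int:
    for k in range(num_layers + 1):  # k = layers recomputed in backward
        kept = num_layers - k       # layers whose activations stay resident
        if kept * act_mem_per_layer_gb <= budget_gb:
            return k
    return num_layers

# e.g. 32 layers at 1.5 GB of activations each under a 40 GB budget
```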
ACCL
ACCL is Alibaba Cloud's in-house High-Performance Network (HPN) communication library for Lingjun. ACCL-N, the GPU acceleration variant, is built on NCCL with full NCCL compatibility, additional bug fixes, and improved performance and stability.
End-to-end performance contribution analysis
The following tests use Golden-25.04 on multi-node GPU clusters, comparing the contribution of each optimization component to end-to-end throughput. The cloud-native AI performance tool CNP runs evaluations with mainstream open-source models and frameworks alongside standard base images.
Test configurations:
Base: NGC PyTorch image
ACS AI image: Base + ACCL: image with the ACCL communication library
ACS AI image: AC2 + ACCL: golden image with AC2 BaseOS, no optimizations enabled
ACS AI image: AC2 + ACCL + CompilerOpt: golden image with AC2 BaseOS, PyTorch compilation optimization enabled
ACS AI image: AC2 + ACCL + CompilerOpt + CkptOpt: golden image with AC2 BaseOS, both compilation optimization and selective activation checkpointing enabled

Quick start
The following steps use Docker to pull and run the training-nv-pytorch image.
To use training-nv-pytorch in ACS, pull the image from the artifact center page in the console when creating workloads, or specify the image in a YAML file.
Prerequisites
Before you begin, make sure that you have:
Docker installed on your host machine
An NVIDIA driver that meets the driver requirements
Access to the image registry (public image or VPC image, depending on your network)
1. Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Enable compilation optimization and GPU memory optimization for recomputation
Enable compilation optimization by using the Transformers Trainer API:

Enable GPU memory optimization for recomputation:
export CHECKPOINT_OPTIMIZATION=true
3. Launch a container and run a training task
The image includes a built-in model training tool, ljperf, for launching containers and running training tasks.
# Launch a container and log in.
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo.
ljperf benchmark --model deepspeed/llama3-8b
Known issues
After upgrading to PyTorch 2.6, the performance benefit of recomputation memory optimization for LLM-type models is lower than in previous releases. Optimization is ongoing.