Release notes for the training-nv-pytorch 25.05 image, including what's new, component versions, image tags, driver requirements, key features, and quick start instructions.
What's new
Features
The base image CUDA is upgraded to 12.9.0.
Bug fixes
PyTorch is upgraded to 2.6.0.7.post1, which fixes the profiler crash issue reported in the open-source community.
Image contents
| Scenario | Training/Inference |
|---|---|
| Framework | PyTorch |
| Requirements | NVIDIA driver release >= 575 |
Core components:
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| Torch | 2.6.0.7.post1 |
| CUDA | 12.9.0 |
| ACCL-N | 2.26.5.12 |
| triton | 3.2.0 |
| TransformerEngine | 2.1 |
| deepspeed | 0.15.4+ali |
| flash-attn | 2.7.2 |
| flashattn-hopper | 3.0.0b1 |
| transformers | 4.51.2+ali |
| megatron-core | 0.9.0 |
| grouped_gemm | 1.1.4 |
| accelerate | 1.6.0+ali |
| diffusers | 0.31.0 |
| openmim | 0.3.9 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| opencv-python-headless | 4.10.0.84 |
| ultralytics | 8.2.74 |
| timm | 1.0.13 |
| vLLM | 0.8.5+cu128 |
| flashinfer | 0.2.5 |
| pytorch-dynamic-profiler | 0.24.11 |
| perf | 5.4.30 |
| gdb | 15.0.50 |
| peft | 0.13.2 |
| ray | 2.46.0 |
Image tags
25.05
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.05-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace the placeholders with actual values:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | The region where your ACS is activated | cn-beijing, cn-wulanchabu |
| {image:tag} | The image name and tag | training-nv-pytorch:25.05 |
Currently, VPC image pulls are only available in the China (Beijing) region.
training-nv-pytorch:25.05-serverless is for ACS services and Lingjun multi-tenant services. Do not use it in Lingjun single-tenant scenarios. training-nv-pytorch:25.05 (without -serverless) is for Lingjun single-tenant scenarios.
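The placeholder substitution above can be sketched as a small shell snippet; the region and tag values are the examples from the table, so substitute your own:

```shell
# Construct the full VPC image reference from the placeholder pattern.
# REGION_ID and IMAGE_TAG use the example values from the table above;
# replace them with your own region and image tag.
REGION_ID="cn-beijing"
IMAGE_TAG="training-nv-pytorch:25.05"
IMAGE_REF="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${IMAGE_REF}"
# docker pull "${IMAGE_REF}"  # run this on a node inside the VPC
```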
Driver requirements
Release 25.05 is based on CUDA 12.9.0 and requires NVIDIA driver version 575 or higher.
Data center GPU exception: For data center GPUs such as T4, you can use driver versions 470.57 (R470+), 525.85 (R525+), 535.86 (R535+), or 545.23 (R545+).
Drivers that require upgrading: R418, R440, R450, R460, R510, R520, R530, R545, R555, and R560 are not forward compatible with CUDA 12.9 and must be upgraded.
For complete driver compatibility information, see the NVIDIA CUDA compatibility documentation.
Release 25.05 aligns with the NGC PyTorch 25.04 image version. NGC releases images at the end of each month, so Golden image development is based on the previous month's NGC version.
Key features
PyTorch compiling optimization
torch.compile(), introduced in PyTorch 2.0, works well for small-scale single-GPU training. For large language model (LLM) training, which requires GPU memory optimization and distributed frameworks like Fully Sharded Data Parallel (FSDP) or DeepSpeed, standard torch.compile() provides limited or negative benefit.
This release addresses that limitation with two optimizations:
Communication granularity control in DeepSpeed: Controlling communication granularity in the DeepSpeed framework gives the compiler access to a complete compute graph, enabling wider-scope compiling optimization.
PyTorch compiler frontend improvements: The frontend is optimized to ensure compilation continues even when a graph break occurs. Pattern matching and dynamic-shape capabilities are enhanced to optimize the compiled code.
Result: end-to-end (E2E) throughput increases by 20% when training an 8B LLM.
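For reference, the baseline upstream torch.compile() entry point that these optimizations build on looks like the following generic sketch (not the image-specific configuration; backend="eager" keeps it portable to machines without a host C compiler, whereas the default inductor backend is what you would normally use for speedups):

```python
import torch

# Generic upstream torch.compile() usage; the image's compiler
# optimizations apply on top of this entry point.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)
compiled = torch.compile(model, backend="eager")  # returns an optimized callable
out = compiled(torch.randn(8, 16))  # first call triggers graph capture
print(out.shape)  # torch.Size([8, 4])
```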
GPU memory optimization for activation recomputation
This feature profiles GPU memory consumption across models deployed in different clusters or with different parameters by running performance tests and collecting system metrics such as GPU memory utilization. Based on the results, it automatically determines the optimal number of activation recomputation layers and integrates the setting into PyTorch — providing GPU memory savings without manual tuning. Currently supported in the DeepSpeed framework.
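The underlying mechanism — recomputing activations for a chosen number of layers instead of storing them — can be sketched with stock torch.utils.checkpoint. The automatic selection of the layer count described above is specific to this image; here the count is hard-coded for illustration:

```python
import torch
from torch.utils.checkpoint import checkpoint

layers = torch.nn.ModuleList(torch.nn.Linear(32, 32) for _ in range(6))
num_recompute = 3  # in the image, this count is chosen automatically


def forward(x):
    for i, layer in enumerate(layers):
        if i < num_recompute:
            # Activations of these layers are not stored; they are
            # recomputed during backward, trading compute for GPU memory.
            x = checkpoint(layer, x, use_reentrant=False)
        else:
            x = layer(x)
    return x


out = forward(torch.randn(4, 32, requires_grad=True))
out.sum().backward()
print(out.shape)  # torch.Size([4, 32])
```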
ACCL
Alibaba Cloud Communication Library (ACCL) is an in-house High-Performance Network (HPN) communication library provided by Alibaba Cloud for Lingjun. ACCL-N is the GPU acceleration variant — an HPN library built on NVIDIA Collective Communications Library (NCCL) that is fully compatible with NCCL, fixes known NCCL bugs, and delivers higher performance and stability.
E2E performance assessment
The following tests use the cloud-native AI performance assessment and analysis tool CNP with mainstream open-source models and frameworks together with standard base images to analyze E2E performance. An ablation study is used to assess how each optimization component contributes to overall model training performance. Tests are run on Golden-25.05 on multi-node GPU clusters:
| Configuration | Description |
|---|---|
| Base | NGC PyTorch image |
| ACS AI Image: Base+ACCL | ACS AI image with the ACCL communication library |
| ACS AI Image: AC2+ACCL | Golden image on AC2 BaseOS, no optimizations enabled |
| ACS AI Image: AC2+ACCL+CompilerOpt | Golden image on AC2 BaseOS with torch compile optimization |
| ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt | Golden image on AC2 BaseOS with torch compile and selective gradient checkpoint optimization |

Quick start
The following example uses Docker to pull and run the training-nv-pytorch image.
To use the training-nv-pytorch image in ACS, pull it from the artifact center page in the console when creating workloads, or specify the image in a YAML file.
1. Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Enable optimizations
Enable compiling optimization
Use the transformers Trainer API:

Enable GPU memory optimization for activation recomputation
export CHECKPOINT_OPTIMIZATION=true
3. Launch a container
The image includes ljperf, a built-in model training tool for launching containers and running training tasks.
LLM example:
# Launch a container and log on to the container.
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo.
ljperf benchmark --model deepspeed/llama3-8b
Usage notes
This release modifies the PyTorch and DeepSpeed libraries. Do not reinstall them.
Leave zero_optimization.stage3_prefetch_bucket_size blank or set it to auto in your DeepSpeed configuration.
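For example, in the DeepSpeed JSON configuration (a minimal fragment; the surrounding keys are the standard ZeRO stage 3 layout and are shown only for context):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto"
  }
}
```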
Known issues
PyTorch is upgraded to 2.6 in this release. The performance benefit of activation recomputation memory optimization for LLM models is lower than in previous images. Optimization is ongoing.