This release updates core training and inference components, introduces PyTorch compiler optimizations that increase end-to-end (E2E) throughput by up to 20% in large language model (LLM) training, and upgrades ACCL-N for higher communication performance.
Important notices
- Do not reinstall PyTorch or DeepSpeed. This image includes customized versions of both libraries; reinstalling them from PyPI overwrites the optimizations.
- In your DeepSpeed configuration, set `zero_optimization.stage3_prefetch_bucket_size` to `auto` or leave it blank.
- The `25.03-serverless` image is not compatible with Lingjun single-tenant products.
- VPC image pulls are currently supported only in the China (Beijing) region.
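For reference, a minimal ZeRO-3 configuration fragment with the prefetch bucket size set as required might look like the following sketch (the `stage` field is illustrative; only `stage3_prefetch_bucket_size` is mandated by this release):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto"
  }
}
```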
What's new
| Component | Version |
|---|---|
| Base image | NGC 25.02 |
| PyTorch (Torch) | 2.6.0.7 |
| TransformerEngine (TE) | 2.1 |
| accelerate | 1.5.2 |
| ACCL-N | 2.23.4.12 |
| vLLM | 0.8.2.dev0 |
| ray | 2.44.0 |
| flashinfer | 0.2.3 |
| Transformers | 4.49.0+ali |
| flash-attn | 2.7.2 |
Bugs fixed
Upgraded vLLM to 0.8.2.dev0 to fix an illegal memory access error in Mixture of Experts (MoE) inference on H20 GPUs (#13693).
Image details
Applicable scenarios
| Attribute | Value |
|---|---|
| Applicable scenario | Training/inference |
| Framework | PyTorch |
| Minimum NVIDIA driver | 570 |
Core components
Training and inference frameworks
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| Torch | 2.6.0.7 |
| CUDA | 12.8.0 |
| ACCL-N | 2.23.4.12 |
| triton | 3.1.0 |
| TransformerEngine | 2.1 |
| deepspeed | 0.15.4+ali |
| flash-attn | 2.7.2 |
| flashattn-hopper | 3.0.0b1 |
| transformers | 4.49.0+ali |
| megatron-core | 0.9.0 |
| grouped_gemm | 1.1.4 |
| accelerate | 1.5.2 |
| peft | 0.13.2 |
| vllm | 0.8.2.dev0+g61c7a1b8.d20250325.cu128 |
| flashinfer | 0.2.3 |
| ray | 2.44.0 |
CV tools
| Component | Version |
|---|---|
| diffusers | 0.31.0 |
| timm | 1.0.13 |
| ultralytics | 8.2.74 |
| opencv-python-headless | 4.10.0.84 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| openmim | 0.3.9 |
Debugging and profiling
| Component | Version |
|---|---|
| pytorch-dynamic-profiler | 0.24.11 |
| perf | 5.4.30 |
| gdb | 15.0.50 |
Available images
Public image
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.03-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace {region-id} with the region where your ACS is activated, for example, cn-beijing. Replace {image:tag} with the name and tag of the image.
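As an illustration of the template above, the helper below (hypothetical, not part of the image) fills in the two placeholders to produce a pullable address:

```python
def vpc_image_uri(region_id: str, image_and_tag: str) -> str:
    """Build the VPC registry address from the documented template.

    region_id: region where your ACS is activated, e.g. "cn-beijing".
    image_and_tag: image name and tag, e.g. "training-nv-pytorch:25.03-serverless".
    """
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"

print(vpc_image_uri("cn-beijing", "training-nv-pytorch:25.03-serverless"))
```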
VPC image pulls are currently supported only in the China (Beijing) region.
Choose the right image
| Image tag | Use with |
|---|---|
| 25.03-serverless | ACS products and Lingjun multi-tenant products |
| 25.03 | Lingjun single-tenant scenarios |
The 25.03-serverless image is not compatible with Lingjun single-tenant products.
Driver requirements
This release is based on CUDA 12.8.0.38 and requires NVIDIA driver 570 or later.
Exception: on data center GPUs (such as T4), you can instead use a driver from any of the following forward-compatible branches.
| Driver branch | Minimum version |
|---|---|
| R470 | 470.57 |
| R525 | 525.85 |
| R535 | 535.86 |
| R545 | 545.23 |
Drivers that must be updated: R418, R440, R450, R460, R510, R520, R530, and R555 are not forward-compatible with CUDA 12.8. Update to a supported driver before using this image.
For details, see CUDA application compatibility and CUDA compatibility and updates.
Key features and enhancements
PyTorch compiler optimization
torch.compile() delivers clear throughput gains for single-GPU workloads, but distributed training (Fully Sharded Data Parallel (FSDP), DeepSpeed) historically prevented the compiler from seeing a complete computation graph, limiting or even negating those gains. This release addresses that with two optimizations:
-
Communication granularity control in DeepSpeed: exposes a complete computation graph to the compiler, enabling broader optimization scope.
-
Compiler frontend improvements: the PyTorch compiler frontend now handles graph breaks gracefully, and mode matching and dynamic shape handling are improved for better runtime performance.
Result: up to 20% higher E2E throughput in 8B LLM training.
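The entry point for these optimizations is the standard `torch.compile()` API. The sketch below is a generic illustration, not the release's custom path; it uses the `eager` backend so it runs without a GPU or Triton, whereas real training would use the default `inductor` backend:

```python
import torch


class TinyMLP(torch.nn.Module):
    """A minimal model to demonstrate compilation."""

    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, 16)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = TinyMLP()
# backend="eager" keeps the example portable; drop it to use inductor.
compiled = torch.compile(model, backend="eager")
x = torch.randn(4, 16)
out = compiled(x)  # numerically matches the eager model
```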
GPU memory optimization for recomputation
This release integrates automatic activation recomputation layer recommendations directly into PyTorch. The optimal number of recomputation layers is determined by running performance tests across different cluster configurations and collecting GPU memory utilization metrics — no manual tuning required.
Currently supported in the DeepSpeed framework.
ACCL communication library
ACCL-N is Alibaba Cloud's High-Performance Networking (HPN) communication library for Lingjun, built on NCCL with full NCCL API compatibility. ACCL-N 2.23.4.12 delivers higher throughput and stability than stock NCCL and includes additional bug fixes.
E2E performance benefit assessment
The cloud-native AI performance assessment tool CNP measures E2E training performance using mainstream open-source models and standard base images, with ablation study support to isolate the contribution of each optimization.
The following chart shows the cumulative E2E benefit of each optimization layer in version 25.03, measured on a multi-node GPU cluster:
-
Base: NGC PyTorch image (baseline)
-
Base + ACCL: ACCL-N substituted for NCCL
-
AC2 + ACCL: AC2 BaseOS, no additional optimizations
-
AC2 + ACCL + CompilerOpt: AC2 BaseOS with PyTorch compiler optimization
-
AC2 + ACCL + CompilerOpt + CkptOpt: AC2 BaseOS with both PyTorch compiler optimization and selective gradient checkpointing
Quick start
To use this image in ACS, pull it from the artifact center page of the console when creating workloads, or specify the image URI directly in your YAML file.
1. Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Enable optimizations
Compiler optimization
Enable compiler optimization through the transformers Trainer API.
GPU memory optimization for recomputation
export CHECKPOINT_OPTIMIZATION=true
3. Launch a container and run a training demo
The image includes ljperf, a built-in model training tool. The following example launches a container and runs an LLM training demo.
# Launch the container
docker run --rm -it --ipc=host --net=host --privileged \
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the LLM training demo
ljperf --action train --model_name deepspeed/llama3-8b
Known issues
None.