This topic describes the release notes for training-nv-pytorch 25.03.
Main features and bug fixes
Main features
The base image is updated to NGC 25.02.
PyTorch and related components are updated to 2.6.0.7, TE is updated to 2.1, and accelerate is updated to 1.5.2 to provide new features and bug fixes.
ACCL-N is updated to 2.23.4.12 to provide new features and bug fixes.
vLLM is updated to 0.8.2.dev0 and ray is updated to 2.44. flash-infer 0.2.3 is supported. Transformers is updated to 4.49.0+ali and flash_attn is updated to 2.7.2 to provide new features and bug fixes.
Bugs fixed
Upgraded vLLM to 0.8.2.dev0, which fixes the "Illegal memory access for MoE on H20" issue (#13693).
Content
Applicable scenario | Training/inference
Framework | PyTorch
Requirements | NVIDIA driver release >= 570
Core components | PyTorch 2.6.0.7, TE 2.1, accelerate 1.5.2, ACCL-N 2.23.4.12, vLLM 0.8.2.dev0, ray 2.44, flash-infer 0.2.3, Transformers 4.49.0+ali, flash_attn 2.7.2
Assets
25.03 public image
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.03-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} indicates the region where your ACS is activated, such as cn-beijing. {image:tag} indicates the name and tag of the image.
Currently, you can pull only images in the China (Beijing) region over a VPC.
The egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.03-serverless image is suitable for ACS products and Lingjun multi-tenant products, but not for Lingjun single-tenant products.
The egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.03 image is suitable for Lingjun single-tenant scenarios.
Driver requirements
The 25.03 release is aligned with the NGC PyTorch 25.02 image: NGC publishes images at the end of each month, so each Golden image update is based on the previous month's NGC version, and the Golden-gpu driver therefore meets the requirements of the corresponding NGC image version. This release is based on CUDA 12.8.0.38 and requires NVIDIA driver version 570 or later. However, if your workloads run on data center GPUs (such as T4 or any other data center GPU), you can use NVIDIA driver version 470.57 (or a later version of R470), 525.85 (or a later version of R525), 535.86 (or a later version of R535), or 545.23 (or a later version of R545).
The CUDA driver compatibility package supports only specific drivers. Therefore, you must upgrade from the R418, R440, R450, R460, R510, R520, R530, R545, and R555 drivers, because these drivers are not forward-compatible with CUDA 12.8. For more information about the supported drivers, see CUDA application compatibility and CUDA compatibility and updates.
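To confirm that the container sees a driver that satisfies these requirements, a quick check with the bundled PyTorch can be used (a minimal sketch; run it inside the container):
# Verify that PyTorch can initialize the GPUs under the installed driver.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA version PyTorch was built with:", torch.version.cuda)   # should report 12.8 for this release
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:", torch.cuda.get_device_name(i))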
Key features and enhancements
PyTorch compiling optimization
PyTorch 2.0 introduced compiling optimization (torch.compile()), which delivers clear benefits for small-scale computing on a single GPU. In LLM training, however, memory optimization and distributed frameworks such as FSDP and DeepSpeed are required, so torch.compile() alone often yields little benefit or may even degrade performance. This image therefore includes the following optimizations:
Control the communication granularity in the DeepSpeed framework to help the compiler obtain a complete compute graph and perform compiling optimization on a wider scope.
Optimized PyTorch:
Optimize the frontend of the PyTorch compiler so that compilation proceeds even when graph breaks occur in the compute graph.
Augment pattern matching and dynamic shape capabilities to improve the performance of the compiled code.
With these optimizations, the E2E throughput can be increased by 20% in 8B LLM training scenarios.
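For reference, PyTorch 2.x compiling optimization is enabled by wrapping a model with torch.compile(). The following is a minimal, generic sketch (a toy model on a CUDA device; it does not reflect the additional DeepSpeed-aware optimizations described above):
# Minimal torch.compile() usage (illustrative toy model).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
compiled_model = torch.compile(model)    # default TorchInductor backend

x = torch.randn(8, 1024, device="cuda")
y = compiled_model(x)                    # first call triggers compilation; later calls reuse the compiled graph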
GPU memory optimization for recomputation
We forecast and analyze the GPU memory consumption of models by running performance tests on models deployed in different clusters or configured with different parameters and by collecting system metrics such as GPU memory utilization. Based on the results, we recommend the optimal number of activation recomputation layers and integrate the recommendation into PyTorch, so that users can easily benefit from GPU memory optimization. Currently, this feature can be used only in the DeepSpeed framework.
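In practice, the feature is switched on with the CHECKPOINT_OPTIMIZATION environment variable described in the Quick start section below; the number of recomputation layers is then chosen automatically. A minimal sketch of setting it from Python (equivalent to exporting the variable in the shell) is:
# Enable the image's recomputation-based GPU memory optimization (DeepSpeed scenarios only).
import os
os.environ["CHECKPOINT_OPTIMIZATION"] = "true"   # set before the training framework initializes

# ... build the model, DeepSpeed configuration, and training loop as usual ...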
ACCL communication library
ACCL is an in-house HPN communication library provided by Alibaba Cloud for Lingjun. It provides ACCL-N for GPU acceleration scenarios. ACCL-N is an HPN library customized based on NCCL. It is completely compatible with NCCL and fixes some bugs in NCCL. ACCL-N also provides higher performance and stability.
E2E performance benefit assessment
With CNP, the cloud-native AI performance assessment and analysis tool, we use mainstream open source models and frameworks together with standard base images to analyze E2E performance. In addition, we use ablation studies to further assess how each optimization component benefits overall model training.
GPU core component E2E performance benefit analysis
The following E2E performance assessment is based on version 25.03 and a cluster that contains multiple GPU-accelerated nodes. The comparison items include:
Base: The NGC PyTorch image.
ACS AI Image: Base+ACCL: The NGC base image with ACCL as the communication library.
ACS AI Image: AC2+ACCL: The Golden image uses AC2 BaseOS, without any optimizations.
ACS AI Image: AC2+ACCL+CompilerOpt: The Golden image uses AC2 BaseOS, with only PyTorch compiling optimization enabled.
ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: The Golden image uses AC2 BaseOS, with both torch compile and selective gradient checkpoint optimizations enabled.

Quick start
The following example shows how to use Docker to pull the training-nv-pytorch image.
To use the training-nv-pytorch image in ACS, you must pull it from the artifact center page of the console where you create workloads or specify the image in a YAML file.
1. Select an image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Call the API to enable compiling optimization and GPU memory optimization for recomputation
Enable compiling optimization
Call the transformers Trainer API:
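A minimal sketch of what this can look like is shown below; torch_compile is a standard flag of transformers TrainingArguments, and the other values are placeholders:
# Enable PyTorch compiling optimization through the transformers Trainer API.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",            # placeholder output path
    per_device_train_batch_size=4,    # placeholder batch size
    torch_compile=True,               # wrap the model with torch.compile() during training
)
# Pass training_args to Trainer(...) together with your model and dataset, then call trainer.train().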

Enable GPU memory optimization for recomputation
export CHECKPOINT_OPTIMIZATION=true
3. Launch containers
The image provides a built-in model training tool named ljperf to demonstrate the procedure for launching containers and running training tasks.
LLM
# Launch and log on to the container.
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo.
ljperf --action train --model_name deepspeed/llama3-8b
4. Usage notes
Changes in the image involve the PyTorch and DeepSpeed libraries. Do not reinstall them.
Leave zero_optimization.stage3_prefetch_bucket_size in the DeepSpeed configuration empty or set it to auto.
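For example, a DeepSpeed ZeRO-3 configuration fragment that follows this note could look as shown below (only the relevant keys are shown; the other values are placeholders):
# DeepSpeed configuration fragment: leave stage3_prefetch_bucket_size unset or set it to "auto".
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",   # or omit this key entirely
    },
    # ... other DeepSpeed settings ...
}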
Known issues
None