This release updates core training and inference components, introduces PyTorch compiler optimizations that increase end-to-end (E2E) throughput by up to 20% in large language model (LLM) training, and upgrades ACCL-N for higher communication performance.
Important notices
- Do not reinstall PyTorch or DeepSpeed. This image includes customized versions of both libraries; reinstalling them from PyPI overwrites the optimizations.
- In your DeepSpeed configuration, set `zero_optimization.stage3_prefetch_bucket_size` to `auto` or leave it blank.
- The `25.03-serverless` image is not compatible with Lingjun single-tenant products.
- VPC image pulls are currently supported only in the China (Beijing) region.
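For reference, a minimal ZeRO-3 configuration fragment with the prefetch bucket size set as required might look like the following sketch (the `stage` field is illustrative; only `stage3_prefetch_bucket_size` is mandated by this release):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto"
  }
}
```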
What's new
| Component | Version |
|---|---|
| Base image | NGC 25.02 |
| PyTorch (Torch) | 2.6.0.7 |
| TransformerEngine (TE) | 2.1 |
| accelerate | 1.5.2 |
| ACCL-N | 2.23.4.12 |
| vLLM | 0.8.2.dev0 |
| ray | 2.44.0 |
| flashinfer | 0.2.3 |
| Transformers | 4.49.0+ali |
| flash-attn | 2.7.2 |
Bugs fixed
Upgraded vLLM to 0.8.2.dev0 to fix an illegal memory access error in Mixture of Experts (MoE) inference on H20 GPUs (#13693).
Image details
Applicable scenarios
| Attribute | Value |
|---|---|
| Applicable scenario | Training/inference |
| Framework | PyTorch |
| Minimum NVIDIA driver | 570 |
Core components
Training and inference frameworks
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| Torch | 2.6.0.7 |
| CUDA | 12.8.0 |
| ACCL-N | 2.23.4.12 |
| triton | 3.1.0 |
| TransformerEngine | 2.1 |
| deepspeed | 0.15.4+ali |
| flash-attn | 2.7.2 |
| flashattn-hopper | 3.0.0b1 |
| transformers | 4.49.0+ali |
| megatron-core | 0.9.0 |
| grouped_gemm | 1.1.4 |
| accelerate | 1.5.2 |
| peft | 0.13.2 |
| vllm | 0.8.2.dev0+g61c7a1b8.d20250325.cu128 |
| flashinfer | 0.2.3 |
| ray | 2.44.0 |
CV tools
| Component | Version |
|---|---|
| diffusers | 0.31.0 |
| timm | 1.0.13 |
| ultralytics | 8.2.74 |
| opencv-python-headless | 4.10.0.84 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| openmim | 0.3.9 |
Debugging and profiling
| Component | Version |
|---|---|
| pytorch-dynamic-profiler | 0.24.11 |
| perf | 5.4.30 |
| gdb | 15.0.50 |
Available images
Public image
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.03-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace {region-id} with the region where your ACS is activated, for example, cn-beijing. Replace {image:tag} with the name and tag of the image.
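As an illustration of the template above, the helper below (hypothetical, not part of the image) fills in the two placeholders to produce a pullable address:

```python
def vpc_image_uri(region_id: str, image_and_tag: str) -> str:
    """Build the VPC registry address from the documented template.

    region_id: region where your ACS is activated, e.g. "cn-beijing".
    image_and_tag: image name and tag, e.g. "training-nv-pytorch:25.03-serverless".
    """
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"

print(vpc_image_uri("cn-beijing", "training-nv-pytorch:25.03-serverless"))
```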
VPC image pulls are currently supported only in the China (Beijing) region.
Choose the right image
| Image tag | Use with |
|---|---|
| 25.03-serverless | ACS products and Lingjun multi-tenant products |
| 25.03 | Lingjun single-tenant scenarios |
The 25.03-serverless image is not compatible with Lingjun single-tenant products.
Driver requirements
This release is based on CUDA 12.8.0.38 and requires NVIDIA driver 570 or later.
Exception: on data center GPUs (such as T4), you can instead use a driver from any of the following forward-compatible branches.
| Driver branch | Minimum version |
|---|---|
| R470 | 470.57 |
| R525 | 525.85 |
| R535 | 535.86 |
| R545 | 545.23 |
Drivers that must be updated: R418, R440, R450, R460, R510, R520, R530, and R555 are not forward-compatible with CUDA 12.8. Update to a supported driver before using this image.
For details, see CUDA application compatibility and CUDA compatibility and updates.
Key features and enhancements
PyTorch compiler optimization
torch.compile() delivers clear throughput gains for single-GPU workloads, but distributed training (Fully Sharded Data Parallel (FSDP), DeepSpeed) historically prevented the compiler from seeing a complete computation graph, limiting or even negating those gains. This release addresses that with two optimizations:
-
Communication granularity control in DeepSpeed: exposes a complete computation graph to the compiler, enabling broader optimization scope.
-
Compiler frontend improvements: the PyTorch compiler frontend now handles graph breaks gracefully, and mode matching and dynamic shape handling are improved for better runtime performance.
Result: up to 20% higher E2E throughput in 8B LLM training.
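The entry point for these optimizations is the standard `torch.compile()` API. The sketch below is a generic illustration, not the release's custom path; it uses the `eager` backend so it runs without a GPU or Triton, whereas real training would use the default `inductor` backend:

```python
import torch


class TinyMLP(torch.nn.Module):
    """A minimal model to demonstrate compilation."""

    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, 16)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = TinyMLP()
# backend="eager" keeps the example portable; drop it to use inductor.
compiled = torch.compile(model, backend="eager")
x = torch.randn(4, 16)
out = compiled(x)  # numerically matches the eager model
```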
GPU memory optimization for recomputation
This release integrates automatic activation recomputation layer recommendations directly into PyTorch. The optimal number of recomputation layers is determined by running performance tests across different cluster configurations and collecting GPU memory utilization metrics — no manual tuning required.
Currently supported in the DeepSpeed framework.
ACCL communication library
ACCL-N is Alibaba Cloud's High-Performance Networking (HPN) communication library for Lingjun, built on NCCL with full NCCL API compatibility. ACCL-N 2.23.4.12 delivers higher throughput and stability than stock NCCL and includes additional bug fixes.
E2E performance benefit assessment
The cloud-native AI performance assessment tool CNP measures E2E training performance using mainstream open-source models and standard base images, with ablation study support to isolate the contribution of each optimization.
The following chart shows the cumulative E2E benefit of each optimization layer in version 25.03, measured on a multi-node GPU cluster:
-
Base: NGC PyTorch image (baseline)
-
Base + ACCL: ACCL-N substituted for NCCL
-
AC2 + ACCL: AC2 BaseOS, no additional optimizations
-
AC2 + ACCL + CompilerOpt: AC2 BaseOS with PyTorch compiler optimization
-
AC2 + ACCL + CompilerOpt + CkptOpt: AC2 BaseOS with both PyTorch compiler optimization and selective gradient checkpointing
Quick start
To use this image in ACS, pull it from the artifact center page of the console when creating workloads, or specify the image URI directly in your YAML file.
1. Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Enable optimizations
Compiler optimization
Enable compiler optimization through the transformers Trainer API.
GPU memory optimization for recomputation
export CHECKPOINT_OPTIMIZATION=true
3. Launch a container and run a training demo
The image includes ljperf, a built-in model training tool. The following example launches a container and runs an LLM training demo.
# Launch the container
docker run --rm -it --ipc=host --net=host --privileged \
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the LLM training demo
ljperf --action train --model_name deepspeed/llama3-8b
Known issues
None.