Container Compute Service: training-nv-pytorch 25.02

Last Updated: Apr 28, 2025

This topic describes the release notes for training-nv-pytorch 25.02.

Main Features and Bug Fixes

Main Features

  • The base image is updated to support NGC 25.01, CUDA is updated to 12.8.0, and cuDNN is updated to 9.7.0.66.

  • ACCL-N is updated to 2.23.4.11 and supports ACCL-Barex.

  • Transformers is updated to 4.48.3+ali, vLLM is updated to 0.7.2, and Ray is updated to 2.42.1 to support new features and fix bugs.

Bug Fixes

None

Contents

Use scenarios

Training/inference

Framework

PyTorch

Requirements

NVIDIA Driver release >= 570

Key components

  • Ubuntu 24.04

  • Python 3.12.7+gc

  • Torch 2.5.1.6.post2

  • CUDA 12.8.0

  • ACCL-N 2.23.4.11

  • triton 3.1.0

  • TransformerEngine 1.13.0

  • deepspeed 0.15.4+ali

  • flash-attn 2.5.8

  • flashattn-hopper 3.0.0b1

  • transformers 4.48.3+ali

  • megatron-core 0.9.0

  • grouped_gemm 1.1.4

  • accelerate 1.1.0

  • diffusers 0.31.0

  • openmim 0.3.9

  • mmengine 0.10.3

  • mmcv 2.1.0

  • mmdet 3.3.0

  • opencv-python-headless 4.10.0.84

  • ultralytics 8.2.74

  • timm 1.0.13

  • vllm 0.7.2

  • pytorch-dynamic-profiler 0.24.11

  • perf 5.4.30

  • gdb 15.0.50

  • peft 0.13.2

  • ray 2.42.1

Assets

Public image

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.02-serverless

VPC image

  • acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

    {region-id} indicates the region where your ACS is activated, such as cn-beijing.
    {image:tag} indicates the name and tag of the image.
Important

Currently, you can pull only images in the China (Beijing) region over a VPC.

Note
  • The egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.02-serverless image is suitable for ACS products and Lingjun multi-tenant products. It is not suitable for Lingjun single-tenant products.

  • The egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.02 image is suitable for Lingjun single-tenant scenarios.

Driver Requirements

  • The 25.02 release is based on the NGC PyTorch 25.01 image. Therefore, the Golden-gpu driver meets the NGC image version requirements. This release uses CUDA 12.8.0 and requires NVIDIA driver 570 or later. If you run it on a data center GPU, such as T4, you can use NVIDIA driver 470.57 (or R470), 525.85 (or R525), 535.86 (or R535), or 545.23 (or R545). A version check sketch follows this list.

  • CUDA is compatible only with specific driver versions. The R418, R440, R450, R460, R510, R520, R530, R545, and R555 drivers are not forward-compatible with CUDA 12.8 and must be updated. For more information, see CUDA application compatibility and CUDA compatibility and upgrade.
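
To verify the installed driver before you run the image, one option is to query NVML through the nvidia-ml-py (pynvml) bindings. This is a minimal sketch; pynvml is an assumption here and is not necessarily preinstalled in the image:

import pynvml

pynvml.nvmlInit()
version = pynvml.nvmlSystemGetDriverVersion()
# Older pynvml builds return bytes; newer nvidia-ml-py releases return str.
if isinstance(version, bytes):
    version = version.decode()
major = int(version.split(".")[0])
print(f"Driver {version}: {'OK for CUDA 12.8' if major >= 570 else 'update required'}")
pynvml.nvmlShutdown()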

Key Features and Enhancements

PyTorch compiling optimization

The compiling optimization feature introduced in PyTorch 2.0 is suitable for small-scale training on a single GPU. However, LLM training requires GPU memory optimization and a distributed framework, such as FSDP or DeepSpeed. In these scenarios, torch.compile() may provide little benefit or even degrade training performance. This image addresses the issue as follows:

  • Controlling the communication granularity in the DeepSpeed framework helps the compiler obtain a complete compute graph for a wider scope of compiling optimization.

  • Optimized PyTorch:

    • The frontend of the PyTorch compiler is optimized so that compilation continues even when a graph break occurs in a compute graph.

    • The pattern matching and dynamic shape capabilities are enhanced to optimize the compiled code.

After the preceding optimizations, E2E throughput increases by 20% when an 8B LLM is trained.
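
For reference, compiling optimization is exercised through the standard torch.compile() entry point; the frontend and pattern-matching improvements described above apply transparently once compilation is enabled. A minimal upstream-PyTorch sketch, not an image-specific API:

import torch

# Build a small model and compile it with the PyTorch 2.x compiler.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

compiled = torch.compile(model)           # default TorchInductor backend
x = torch.randn(8, 1024, device="cuda")
y = compiled(x)                           # the first call triggers compilation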

GPU memory optimization for recomputation

We forecast and analyze the GPU memory consumption of models by running performance tests on models deployed in different clusters or configured with different parameters, and by collecting system metrics such as GPU memory utilization. Based on the results, we suggest the optimal number of activation recomputation layers and integrate this suggestion into PyTorch, so users can easily benefit from GPU memory optimization. Currently, this feature is available in the DeepSpeed framework.
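
The mechanism behind this feature is standard activation recomputation: activations inside a checkpointed block are discarded during the forward pass and recomputed during backward, trading compute for GPU memory. A minimal upstream-PyTorch sketch of the concept (the image's automatic layer selection itself is enabled through the CHECKPOINT_OPTIMIZATION variable shown in the Quick Start):

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Activations inside `block` are not stored; they are recomputed in backward,
# lowering peak GPU memory at the cost of one extra forward pass per block.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()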

ACCL

ACCL is an in-house HPN communication library provided by Alibaba Cloud for Lingjun. For GPU acceleration scenarios, it provides ACCL-N, an HPN library customized based on NCCL. ACCL-N is fully compatible with NCCL, fixes several NCCL bugs, and provides higher performance and stability.
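
Because ACCL-N is NCCL-compatible, existing torch.distributed code that selects the nccl backend runs on top of it without modification. A minimal sketch, assuming the script is launched with torchrun (for example, torchrun --nproc_per_node=8 demo.py):

import torch
import torch.distributed as dist

# The standard "nccl" backend picks up the ACCL-N library shipped in the
# image; no application code changes are required.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())
t = torch.ones(1, device="cuda")
dist.all_reduce(t)                        # communication runs over ACCL-N
print(f"rank {rank}: {t.item()}")
dist.destroy_process_group()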

E2E performance benefit assessment

With the cloud-native AI performance assessment and analysis tool CNP, we can use mainstream open source models and frameworks together with standard base images to analyze E2E performance. In addition, we can use ablation studies to further assess how each optimization component benefits overall model training.

GPU core component E2E performance benefit analysis

The following test is based on 25.02. E2E performance is assessed and compared on multi-node GPU-accelerated clusters for the following configurations:

  1. Base: The NGC PyTorch image.

  2. ACS AI Image: Base+ACCL: The base image with ACCL enabled.

  3. ACS AI Image: AC2+ACCL: The Golden image uses AC2 BaseOS, without any optimizations.

  4. ACS AI Image: AC2+ACCL+CompilerOpt: The Golden image uses AC2 BaseOS, with only PyTorch compiling optimization enabled.

  5. ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: The Golden image uses AC2 BaseOS, with PyTorch compiling optimization and selective gradient checkpoint optimization enabled.

(Figure: E2E performance comparison of the five configurations above.)

Quick Start

The following example uses only Docker to pull the training-nv-pytorch image.

Note

To use the training-nv-pytorch image in ACS, you must select it from the artifact center page of the console when you create workloads, or specify the image in a YAML file.

1. Select an image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Call the API to enable compiling optimization and GPU memory optimization for recomputation

  • Enable compiling optimization

    Use the transformers Trainer API:

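    The original screenshot is not available; the following is a minimal sketch that assumes the standard transformers torch_compile flag in TrainingArguments (the image may additionally expose its own switches):

    from transformers import Trainer, TrainingArguments

    # torch_compile=True makes Trainer wrap the model with torch.compile()
    # before training, enabling the compiling optimization described above.
    args = TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=8,
        torch_compile=True,
    )
    trainer = Trainer(
        model=model,                  # your PreTrainedModel, defined elsewhere
        args=args,
        train_dataset=train_dataset,  # your dataset, defined elsewhere
    )
    trainer.train()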

  • Enable GPU memory optimization for recomputation

    export CHECKPOINT_OPTIMIZATION=true

3. Launch containers

The image provides a built-in model training tool named ljperf to demonstrate the procedure for launching containers and running training tasks.

LLM

# Launch a container and log on to the container.
docker run --rm -it --ipc=host --net=host  --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo.
ljperf benchmark --model deepspeed/llama3-8b 

4. Suggestions

  • The image contains modified PyTorch and DeepSpeed libraries. Do not reinstall them.

  • Leave zero_optimization.stage3_prefetch_bucket_size in the DeepSpeed configuration empty or set it to auto, as in the sketch after this list.
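
A minimal sketch of the corresponding configuration fragment, written as a Python dict of the kind accepted by the transformers Trainer's deepspeed argument (field names follow upstream DeepSpeed; the surrounding values are placeholders):

# ZeRO-3 fragment: "auto" (or omitting the key) lets the integration
# choose stage3_prefetch_bucket_size instead of hard-coding a value.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    },
    "train_micro_batch_size_per_gpu": "auto",
}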

Known Issues