training-nv-pytorch 25.09

This topic provides the release notes for training-nv-pytorch version 25.09.

Main features and bug fixes

Main features

PyTorch and its related components are upgraded to 2.8.0.
Transformers is upgraded to 4.56.1+ali. This version incorporates features and bug fixes from the corresponding open-source version.

Bug fixes

Fixed an error that occurred when torch.compile() was enabled for open-source Transformers on Qwen2-VL.

Scenarios	Training/Inference
Framework	PyTorch
Requirements	NVIDIA Driver release >= 575
Core components:
- Ubuntu: 24.04
- Python: 3.12.7+gc
- CUDA: 12.8
- perf: 5.4.30
- gdb: 15.0.50.20240403-git
- torch: 2.8.0.9+nv25.3
- triton: 3.4.0
- transformer_engine: 2.3.0+5de3e148
- deepspeed: 0.16.9+ali
- flash_attn: 2.8.3
- flash_attn_3: 3.0.0b1
- transformers: 4.56.1+ali
- grouped_gemm: 1.1.4
- accelerate: 1.7.0+ali
- diffusers: 0.34.0
- mmengine: 0.10.3
- mmcv: 2.1.0
- mmdet: 3.3.0
- opencv-python-headless: 4.11.0.86
- ultralytics: 8.3.96
- timm: 1.0.20
- vllm: 0.10.1.1
- flashinfer-python: 0.2.5
- pytorch-dynamic-profiler: 0.24.11
- peft: 0.16.0
- ray: 2.49.2
- megatron-core: 0.12.1

Assets

25.09

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.09-serverless

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} indicates the region where your ACS is activated, such as cn-beijing and cn-wulanchabu.
{image:tag} indicates the name and tag of the image.
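
As a rough illustration of how the two placeholders are filled in, the following Python sketch builds a VPC image reference from the template above. The function name `vpc_image_ref` is hypothetical, and the sketch assumes that the VPC registry uses the same image name and tag as the public image, which this topic does not state explicitly.

```python
def vpc_image_ref(region_id: str, image_tag: str) -> str:
    """Fill the VPC registry template with a region ID and an image:tag pair."""
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_tag}"


# Assumption: the VPC registry uses the same name and tag as the public image.
# cn-beijing is used because it is one of the example region IDs given above.
print(vpc_image_ref("cn-beijing", "training-nv-pytorch:25.09-serverless"))
```

The resulting string can be used wherever an image reference is expected, for example in a `docker pull` command or in a workload YAML file.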

Important

Currently, you can pull only images in the China (Beijing) region over a VPC.

Note

The egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.09-serverless image is suitable for ACS products and Lingjun multi-tenant products. This image is not suitable for Lingjun single-tenant products. Do not use it in Lingjun single-tenant scenarios.

Driver requirements

The 25.09 release is based on CUDA 12.8.0 and requires NVIDIA driver version 575 or later. However, if you are running on a data center GPU, such as a T4, you can use NVIDIA driver version 470.57 (or a later R470 version), 525.85 (or a later R525 version), 535.86 (or a later R535 version), or 545.23 (or a later R545 version).
The CUDA driver compatibility package supports only specific drivers. Therefore, you must upgrade any R418, R440, R450, R460, R510, R520, R530, R545, R555, or R560 drivers because they are not forward-compatible with CUDA 12.8. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
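
The general rule above (driver 575 or later) can be checked before a job is scheduled. The following Python sketch is one possible way to do this; the helper names are hypothetical, and it deliberately covers only the general rule, not the data center GPU compatibility branches (R470, R525, R535, R545). The version string itself can be read on a node with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
def driver_major(version: str) -> int:
    """Return the major branch of a driver version string such as '575.57.08'."""
    return int(version.strip().split(".")[0])


def meets_general_requirement(version: str, minimum_major: int = 575) -> bool:
    """Check only the general rule (driver 575 or later).

    The compatibility branches allowed for data center GPUs are not modelled.
    """
    return driver_major(version) >= minimum_major


# The version strings below are illustrative inputs, not values from this topic.
for candidate in ("575.57.08", "535.86.10"):
    print(candidate, meets_general_requirement(candidate))
```

A driver that fails this check may still be usable on a data center GPU through the compatibility package, so treat a negative result as a prompt to consult the compatibility documentation rather than as a hard failure.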

Key features and enhancements

PyTorch compiling optimization

The compilation optimization introduced in PyTorch 2.0 is well suited to small-scale training on a single GPU. LLM training, however, also requires GPU memory optimization and a distributed framework, such as FSDP or DeepSpeed. As a result, enabling torch.compile() out of the box may bring no benefit to such training, or may even degrade performance.

Controlling the communication granularity in the DeepSpeed framework helps the compiler obtain a complete compute graph for a wider scope of compiling optimization.
PyTorch optimizations:
- The frontend of the PyTorch compiler is optimized so that compilation still proceeds when a graph break occurs in a compute graph.
- The pattern matching and dynamic shape capabilities are enhanced to optimize the compiled code.

With the preceding optimizations, E2E throughput increases by 20% when an 8B LLM is trained.

GPU memory optimization for recomputation

We forecast and analyze the consumption of GPU memory of models by running performance tests on models deployed in different clusters or configured with different parameters and collecting system metrics, such as GPU memory utilization. Based on the results, we suggest the optimal number of activation recomputation layers and integrate it into PyTorch. This allows users to easily benefit from GPU memory optimization. Currently, this feature can be used in the DeepSpeed framework.

ACCL

ACCL is an in-house HPN communication library provided by Alibaba Cloud for Lingjun. It provides ACCL-N for GPU acceleration scenarios. ACCL-N is an HPN library customized based on NCCL. It is completely compatible with NCCL and fixes some bugs in NCCL. ACCL-N also provides higher performance and stability.

E2E performance evaluation

Using the cloud-native AI performance evaluation and analysis tool CNP, we conducted a comprehensive E2E performance comparison against standard base images using mainstream open-source models and framework configurations. We also performed ablation experiments to further evaluate the contribution of each optimization component to the overall model training performance.

Image and iteration comparison against the base image

E2E performance contribution analysis of core GPU components

The following tests, based on version 25.09, were conducted on a multi-node GPU cluster to evaluate and compare E2E training performance. The comparison items include the following:

Base: NGC PyTorch Image.
ACS AI Image: Base+ACCL: The image uses the ACCL communication library.
ACS AI Image: AC2+ACCL: The Golden image uses AC2 BaseOS with no optimizations enabled.
ACS AI Image: AC2+ACCL+CompilerOpt: The Golden image uses AC2 BaseOS with only the torch compile optimization enabled.
ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: The Golden image uses AC2 BaseOS with both torch compile and selective gradient checkpoint optimizations enabled.

Quick start

The following example shows how to pull the training-nv-pytorch image using Docker.

Note

To use the training-nv-pytorch image in ACS, select it from the Artifacts page when you create a workload in the console, or specify the image reference in a YAML file.

1. Select an image

```
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
```

2. Call APIs to enable compilation optimization and recomputation for GPU memory optimization

Enable compilation optimization
Use the Transformers Trainer API:
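
The original snippet for this step is not reproduced in this topic. As a minimal sketch, assuming the standard `torch_compile` field of Hugging Face `TrainingArguments` is what is meant, compilation could be requested as follows. The output directory and the helper name `build_training_args` are hypothetical.

```python
# Assumption: `torch_compile` is the standard Hugging Face TrainingArguments
# field; this topic does not state whether the customized Transformers build
# requires any additional setting.
TRAINING_KWARGS = {
    "output_dir": "./output",  # hypothetical output directory
    "torch_compile": True,     # ask Trainer to wrap the model with torch.compile()
}


def build_training_args():
    """Create TrainingArguments with compilation enabled (requires transformers)."""
    # Imported lazily so that this file can be loaded without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)
```

The returned object is then passed to `Trainer(args=...)` together with the model and datasets in the usual way.
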
Enable recomputation for GPU memory optimization
```
export CHECKPOINT_OPTIMIZATION=true
```
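
If a training job is launched from a Python script rather than from an interactive shell, the same variable can be set on the child process environment. The following is a sketch only: the helper names are hypothetical, and the command is the ljperf demo from step 3, which is available only inside the container.

```python
import os
import subprocess


def training_env(base=None):
    """Copy an environment and switch on the recomputation optimization."""
    env = dict(os.environ if base is None else base)
    # Variable name and value are taken from the export line above.
    env["CHECKPOINT_OPTIMIZATION"] = "true"
    return env


def run_benchmark():
    """Launch the ljperf demo with the variable set (works only inside the container)."""
    cmd = ["ljperf", "benchmark", "--model", "deepspeed/llama3-8b"]
    return subprocess.run(cmd, env=training_env(), check=True)
```

Copying the environment instead of mutating `os.environ` keeps the setting scoped to the launched job.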

3. Start the container

The image includes a built-in model training tool named ljperf. The following steps describe how to use this tool to start a container and run a training task.

LLM

```
# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
```

4. Suggestions

This image contains customized versions of libraries such as PyTorch and DeepSpeed. Do not reinstall these libraries.
In the DeepSpeed configuration, leave `zero_optimization.stage3_prefetch_bucket_size` empty or set it to `auto`.
The built-in environment variable NCCL_SOCKET_IFNAME in this image must be dynamically adjusted based on the scenario:
- When a single pod requests 1, 2, 4, or 8 cards for a training or inference task, set NCCL_SOCKET_IFNAME=eth0. This is the default configuration in this image.
- When a single pod requests all 16 cards of a machine for a training or inference task, you can use the High-Performance Network (HPN). In this case, set NCCL_SOCKET_IFNAME=hpn0.
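
The last two suggestions can be summarized in a short Python sketch. The function and constant names are hypothetical, and the `"stage": 3` entry is an assumption inferred from the `stage3_` prefix of the setting; only the `auto` value and the interface names come from this topic.

```python
import json


def nccl_socket_ifname(gpus_per_pod: int) -> str:
    """Pick the NCCL_SOCKET_IFNAME value for the number of cards a pod requests."""
    if gpus_per_pod in (1, 2, 4, 8):
        return "eth0"  # default configuration in this image
    if gpus_per_pod == 16:
        return "hpn0"  # whole machine requested, HPN can be used
    raise ValueError(f"unsupported card count: {gpus_per_pod}")


# DeepSpeed configuration fragment with the prefetch bucket size left to "auto".
DS_CONFIG_FRAGMENT = {
    "zero_optimization": {
        "stage": 3,  # assumption: the setting applies to ZeRO stage 3
        "stage3_prefetch_bucket_size": "auto",
    }
}

print(f"export NCCL_SOCKET_IFNAME={nccl_socket_ifname(16)}")
print(json.dumps(DS_CONFIG_FRAGMENT, indent=2))
```

Raising an error for other card counts is a design choice of this sketch; this topic only describes the 1, 2, 4, 8, and 16 card cases.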
