Container Compute Service: training-nv-pytorch 25.05

Last Updated: May 23, 2025

This topic provides the release notes for training-nv-pytorch 25.05.

Main features and bug fixes

Main features

  • The base image CUDA is upgraded to 12.9.0.

Bugs fixed

Content

Scenario: Training/Inference

Framework: PyTorch

Requirements: NVIDIA driver release >= 575

Core components

  • Ubuntu 24.04

  • Python 3.12.7+gc

  • Torch 2.6.0.7.post1

  • CUDA 12.9.0

  • ACCL-N 2.26.5.12

  • triton 3.2.0

  • TransformerEngine 2.1

  • deepspeed 0.15.4+ali

  • flash-attn 2.7.2

  • flashattn-hopper 3.0.0b1

  • transformers 4.51.2+ali

  • megatron-core 0.9.0

  • grouped_gemm 1.1.4

  • accelerate 1.6.0+ali

  • diffusers 0.31.0

  • openmim 0.3.9

  • mmengine 0.10.3

  • mmcv 2.1.0

  • mmdet 3.3.0

  • opencv-python-headless 4.10.0.84

  • ultralytics 8.2.74

  • timm 1.0.13

  • vllm 0.8.5+cu128

  • flashinfer 0.2.5

  • pytorch-dynamic-profiler 0.24.11

  • perf 5.4.30

  • gdb 15.0.50

  • peft 0.13.2

  • ray 2.46.0
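To confirm that a running container still matches the component list above, the pinned versions can be compared with what is installed. The following is a minimal sketch, assuming the distribution names match the component names in the list (this may not hold for every entry) and checking only a few pins. The helper names `installed_version` and `find_mismatches` are hypothetical, not part of the image.

```python
from importlib import metadata

# A few pins copied from the component list above; extend as needed.
# Assumption: each distribution name equals the component name shown in the list.
EXPECTED = {
    "triton": "3.2.0",
    "diffusers": "0.31.0",
    "peft": "0.13.2",
    "ray": "2.46.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose version differs from its pin to (expected, found)."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

The `lookup` parameter is injectable so that the comparison logic can be exercised without the packages being installed.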

Assets

25.05

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.05-serverless

VPC image

  • acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

    {region-id} indicates the region where your ACS is activated, such as cn-beijing and cn-wulanchabu.
    {image:tag} indicates the name and tag of the image.
Important

Currently, you can pull only images in the China (Beijing) region over a VPC.
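The VPC address template above can be filled in programmatically. The following sketch only substitutes the two placeholders. The function name `vpc_image_ref` is hypothetical, and the example assumes that the image name and tag in the VPC registry are the same as in the public registry.

```python
def vpc_image_ref(region_id: str, image_tag: str) -> str:
    """Fill the {region-id} and {image:tag} placeholders of the VPC image address."""
    if not region_id or not image_tag:
        raise ValueError("region_id and image_tag must be non-empty")
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_tag}"


# cn-beijing is used because it is the region currently supported for VPC pulls.
ref = vpc_image_ref("cn-beijing", "training-nv-pytorch:25.05-serverless")
print(ref)
```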

Note
  • The egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.05-serverless image is applicable to ACS services and Lingjun multi-tenant services. This image is not applicable to Lingjun single-tenant services. Do not use it in Lingjun single-tenant scenarios.

  • The egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.05 image is applicable to Lingjun single-tenant scenarios.

Driver requirements

  • The 25.05 release is aligned with the NGC PyTorch 25.04 image. Because NGC publishes images at the end of each month, each Golden image is built on the previous month's NGC version. The Golden GPU driver requirements therefore follow those of the corresponding NGC image. This release is based on CUDA 12.9.0 and requires NVIDIA driver release 575 or later. However, if you are running on a data center GPU (such as T4), you can use NVIDIA driver release 470.57 (or later R470), 525.85 (or later R525), 535.86 (or later R535), or 545.23 (or later R545).

  • The CUDA driver compatibility package only supports specific drivers. Therefore, users should upgrade from all R418, R440, R450, R460, R510, R520, R530, R545, R555, and R560 drivers, which are not forward compatible with CUDA 12.8. For a complete list of supported drivers, see the CUDA application compatibility topic. For more information, see CUDA compatibility and upgrades.
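The driver rules above can be expressed as a small check. This is a sketch that encodes only the branches and minimum versions named in the text. Branches that the text does not list are treated as unsupported, which may be stricter than the actual CUDA compatibility matrix. The function names are hypothetical.

```python
# Minimum minor version per driver branch for data center GPUs, as listed in the text.
DATA_CENTER_MINIMUMS = {470: 57, 525: 85, 535: 86, 545: 23}

# Branch required on any GPU for a CUDA 12.9.0 based image.
MIN_BRANCH = 575


def parse_driver(version: str):
    """Split a driver string such as '535.104.05' into (branch, minor)."""
    parts = version.strip().split(".")
    branch = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return branch, minor


def driver_is_supported(version: str, data_center_gpu: bool) -> bool:
    """Apply the driver rules from the text to one driver version string."""
    branch, minor = parse_driver(version)
    if branch >= MIN_BRANCH:
        return True
    if data_center_gpu and branch in DATA_CENTER_MINIMUMS:
        return minor >= DATA_CENTER_MINIMUMS[branch]
    # Assumption: anything not named in the text is reported as unsupported.
    return False
```

On a real host, the version string could be obtained with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to `driver_is_supported`.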

Key features and enhancements

PyTorch compiling optimization

The compiling optimization feature introduced in PyTorch 2.0 is suitable for small-scale training on one GPU. However, LLM training requires GPU memory optimization and a distributed framework, such as FSDP or DeepSpeed. Consequently, torch.compile() may bring no benefit to such training, or may even degrade performance.

  • Controlling the communication granularity in the DeepSpeed framework helps the compiler obtain a complete compute graph for a wider scope of compiling optimization.

  • Optimized PyTorch:

    • The frontend of the PyTorch compiler is optimized so that compilation still proceeds when a graph break occurs in a compute graph.

    • The pattern matching and dynamic shape capabilities are enhanced to optimize the compiled code.

After the preceding optimizations, the E2E throughput is increased by 20% when an 8B LLM is trained.
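A throughput gain does not translate one-to-one into time saved. For a fixed amount of training work, time is inversely proportional to throughput, so a 20% gain shortens the run by 1 - 1/1.2, or about 16.7%. The short sketch below shows the arithmetic. The function name `time_saving` is hypothetical.

```python
def time_saving(throughput_gain: float) -> float:
    """Fraction of wall-clock time saved for a fixed amount of training work.

    throughput_gain is relative, for example 0.20 for a 20% increase.
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1")
    return 1.0 - 1.0 / (1.0 + throughput_gain)


print(f"{time_saving(0.20):.1%}")  # prints 16.7%
```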

GPU memory optimization for recomputation

We forecast and analyze the consumption of GPU memory of models by running performance tests on models deployed in different clusters or configured with different parameters and collecting system metrics, such as GPU memory utilization. Based on the results, we suggest the optimal number of activation recomputation layers and integrate it into PyTorch. This allows users to easily benefit from GPU memory optimization. Currently, this feature can be used in the DeepSpeed framework.
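The idea of choosing a number of recomputation layers can be illustrated with a toy model. This is not the image's implementation: the linear memory estimate, the function name, and every parameter are assumptions made for illustration. The sketch picks the smallest number of recomputed layers whose estimated memory fits the budget, because each recomputed layer adds extra forward computation.

```python
from typing import Optional


def min_recompute_layers(
    num_layers: int,
    full_act_bytes: int,   # activation memory of a layer kept in full (hypothetical)
    ckpt_act_bytes: int,   # activation memory of a recomputed layer (hypothetical)
    fixed_bytes: int,      # weights, optimizer state, and other fixed usage (hypothetical)
    budget_bytes: int,     # GPU memory available to the job
) -> Optional[int]:
    """Return the smallest layer count to recompute that fits the budget, or None."""
    for k in range(num_layers + 1):
        estimate = (
            fixed_bytes
            + (num_layers - k) * full_act_bytes
            + k * ckpt_act_bytes
        )
        if estimate <= budget_bytes:
            return k
    return None  # even recomputing every layer does not fit
```

In practice the estimate would come from measured metrics, as the paragraph above describes, rather than from a fixed per-layer constant.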

ACCL

ACCL is an in-house HPN communication library provided by Alibaba Cloud for Lingjun. It provides ACCL-N for GPU acceleration scenarios. ACCL-N is an HPN library customized based on NCCL. It is completely compatible with NCCL and fixes some bugs in NCCL. ACCL-N also provides higher performance and stability.

E2E performance benefit assessment

With the cloud-native AI performance assessment and analysis tool CNP, we can use mainstream open source models and frameworks together with standard base images to analyze E2E performance. In addition, we can use ablation studies to further assess how much each optimization component contributes to overall model training performance.

Analysis of E2E training performance contribution of GPU core components

The following tests are based on Golden-25.05 and perform E2E performance testing and comparative analysis on multi-node GPU clusters. The comparison items include the following:

  1. Base: NGC PyTorch Image

  2. ACS AI Image: Base+ACCL: The image uses the ACCL communication library

  3. ACS AI Image: AC2+ACCL: The Golden image uses AC2 BaseOS without enabling any optimization

  4. ACS AI Image: AC2+ACCL+CompilerOpt: The Golden image uses AC2 BaseOS and only enables torch compile optimization

  5. ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: The Golden image uses AC2 BaseOS and enables both torch compile and selective gradient checkpoint optimization

[Figure: E2E training performance comparison of the five configurations listed above]

Quick Start

The following example uses only Docker to pull the training-nv-pytorch image.

Note

To use the training-nv-pytorch image in ACS, you must pull it from the artifact center page of the console where you create workloads or specify the image in a YAML file.

1. Select an image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Call the API to enable compiling optimization and GPU memory optimization for recomputation

  • Enable compiling optimization

    Use the transformers Trainer API:

    [Image: code example, not available]

  • Enable GPU memory optimization for recomputation

    export CHECKPOINT_OPTIMIZATION=true
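The two switches above might be combined in a training script as follows. This is a sketch under assumptions: `torch_compile` is a standard flag of `transformers.TrainingArguments`, but whether it is exactly what the original illustration showed is not known, and setting the environment variable from Python assumes the image reads it when training starts rather than at container launch. `output_dir` and `build_training_args` are hypothetical.

```python
import os

# GPU memory optimization for recomputation: in-process counterpart of
# `export CHECKPOINT_OPTIMIZATION=true`. Set it before training starts.
os.environ["CHECKPOINT_OPTIMIZATION"] = "true"

# Keyword arguments for transformers.TrainingArguments.
# torch_compile=True asks Trainer to wrap the model with torch.compile().
training_kwargs = {
    "output_dir": "./output",  # hypothetical path
    "torch_compile": True,
}


def build_training_args(kwargs):
    """Create TrainingArguments; the import is deferred until it is needed."""
    from transformers import TrainingArguments

    return TrainingArguments(**kwargs)
```

The result of `build_training_args(training_kwargs)` would then be passed to `Trainer(args=...)` together with the model and datasets.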

3. Launch containers

The image provides a built-in model training tool named ljperf to demonstrate the procedure for launching containers and running training tasks.

LLM

# Launch a container and log on to the container.
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo.
ljperf benchmark --model deepspeed/llama3-8b 

4. Suggestions

  • The image contains modified builds of the PyTorch and DeepSpeed libraries. Do not reinstall them.

  • Leave zero_optimization.stage3_prefetch_bucket_size in the DeepSpeed configuration empty or set it to auto.
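A DeepSpeed configuration that follows the second suggestion could look like the sketch below. Only the `stage3_prefetch_bucket_size` setting comes from the text. The ZeRO stage 3 choice is an assumption for illustration. Note that the "auto" placeholder is resolved by the Hugging Face Trainer integration; when DeepSpeed is driven directly, omitting the key is the safer reading of the suggestion.

```python
import json

ds_config = {
    "zero_optimization": {
        "stage": 3,  # assumption: the stage3_* keys apply to ZeRO stage 3
        # Per the suggestion above: leave this unset or use "auto".
        "stage3_prefetch_bucket_size": "auto",
    },
}

# Serialize to the JSON text that would be saved as the DeepSpeed config file.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```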

Known issues

  • The image upgrades PyTorch to 2.6. With this version, recomputation memory optimization brings a smaller performance benefit for LLMs than in previous images. Optimization work is ongoing.