PyTorch 25.05 GPU Training Upgrades with CUDA 12.9.0 - Container Compute Service

Release notes for the training-nv-pytorch 25.05 image, including what's new, component versions, image tags, driver requirements, key features, and quick start instructions.

What's new

Features

CUDA in the base image is upgraded to 12.9.0.

Bug fixes

PyTorch is upgraded to 2.6.0.7.post1, which fixes the profiler crash reported in the open-source community version.

Image contents

Scenario	Training/Inference
Framework	PyTorch
Requirements	NVIDIA driver release >= 575

Core components:

Component	Version
Ubuntu	24.04
Python	3.12.7+gc
Torch	2.6.0.7.post1
CUDA	12.9.0
ACCL-N	2.26.5.12
triton	3.2.0
TransformerEngine	2.1
deepspeed	0.15.4+ali
flash-attn	2.7.2
flashattn-hopper	3.0.0b1
transformers	4.51.2+ali
megatron-core	0.9.0
grouped_gemm	1.1.4
accelerate	1.6.0+ali
diffusers	0.31.0
openmim	0.3.9
mmengine	0.10.3
mmcv	2.1.0
mmdet	3.3.0
opencv-python-headless	4.10.0.84
ultralytics	8.2.74
timm	1.0.13
vLLM	0.8.5+cu128
flashinfer	0.2.5
pytorch-dynamic-profiler	0.24.11
perf	5.4.30
gdb	15.0.50
peft	0.13.2
ray	2.46.0
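To confirm that a running container matches the table above, the installed package versions can be compared against the expected ones. The following is a minimal sketch, not part of the image. It assumes that the pip distribution names match the component names in the table, and that the installed version strings are reported exactly as listed, which may not hold for every component. The names `EXPECTED`, `installed_version`, and `find_mismatches` are illustrative.

```python
from importlib import metadata

# Subset of the core components table. Distribution names are assumed
# to match the component names used in the table.
EXPECTED = {
    "torch": "2.6.0.7.post1",
    "triton": "3.2.0",
    "deepspeed": "0.15.4+ali",
    "transformers": "4.51.2+ali",
    "accelerate": "1.6.0+ali",
    "peft": "0.13.2",
    "ray": "2.46.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each mismatching package to an (expected, found) pair."""
    mismatches = {}
    for name, want in expected.items():
        got = lookup(name)
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


if __name__ == "__main__":
    for name, (want, got) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {got}")
```

An empty result from `find_mismatches(EXPECTED)` means that every listed package is present at the expected version.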

Image tags

25.05

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.05-serverless

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace the placeholders with actual values:

Placeholder	Description	Example
`{region-id}`	The region where your ACS is activated	`cn-beijing`, `cn-wulanchabu`
`{image:tag}`	The image name and tag	`training-nv-pytorch:25.05`
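As an illustration of the substitution described above, the VPC address can be assembled from the two placeholder values. This is a minimal sketch. The function name `vpc_image_address` is hypothetical, and the validation it performs is an assumption about what counts as a sensible input.

```python
def vpc_image_address(region_id, image_tag):
    """Fill in the {region-id} and {image:tag} placeholders of the VPC address."""
    if not region_id:
        raise ValueError("region_id must not be empty")
    if ":" not in image_tag:
        raise ValueError("image_tag must have the form name:tag")
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_tag}"


print(vpc_image_address("cn-beijing", "training-nv-pytorch:25.05"))
# prints acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.05
```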

Important

Currently, VPC image pulls are only available in the China (Beijing) region.

Note

training-nv-pytorch:25.05-serverless is for ACS services and Lingjun multi-tenant services. Do not use it in Lingjun single-tenant scenarios.
training-nv-pytorch:25.05 (without -serverless) is for Lingjun single-tenant scenarios.

Driver requirements

Release 25.05 is based on CUDA 12.9.0 and requires NVIDIA driver version 575 or higher.

Data center GPU exception: For data center GPUs such as T4, you can instead use driver release 470.57 (or later in R470), 525.85 (or later in R525), 535.86 (or later in R535), or 545.23 (or later in R545).

Drivers that require upgrading: R418, R440, R450, R460, R510, R520, R530, R545, R555, and R560 are not forward compatible with CUDA 12.9 and must be upgraded.
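The rules above can be expressed as a small check. This is a minimal sketch that follows a literal reading of the requirements in this section: driver branch 575 or higher is always accepted, and the four listed branches are accepted only for data center GPUs at or above the stated minimum. The names `driver_ok`, `MIN_DRIVER_BRANCH`, and `DATA_CENTER_MINIMUMS` are illustrative and are not part of any NVIDIA or Alibaba Cloud tool.

```python
MIN_DRIVER_BRANCH = 575

# Branch -> minimum minor version accepted on data center GPUs.
DATA_CENTER_MINIMUMS = {470: 57, 525: 85, 535: 86, 545: 23}


def driver_ok(version, data_center_gpu=False):
    """Return True if a driver version string such as '535.104.05' is acceptable."""
    parts = [int(p) for p in version.split(".")]
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else 0
    if major >= MIN_DRIVER_BRANCH:
        return True
    if data_center_gpu and major in DATA_CENTER_MINIMUMS:
        return minor >= DATA_CENTER_MINIMUMS[major]
    return False


print(driver_ok("575.51.03"))                         # True
print(driver_ok("535.104.05", data_center_gpu=True))  # True
print(driver_ok("535.104.05"))                        # False
```

The installed driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to the function as a string.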

For complete driver compatibility information, see the NVIDIA CUDA compatibility documentation.

Note

Release 25.05 aligns with the NGC PyTorch 25.04 image version. NGC releases images at the end of each month, so Golden image development is based on the previous month's NGC version.

Key features

PyTorch compiling optimization

torch.compile(), introduced in PyTorch 2.0, works well for small-scale single-GPU training. For large language model (LLM) training, which requires GPU memory optimization and distributed frameworks like Fully Sharded Data Parallel (FSDP) or DeepSpeed, standard torch.compile() provides limited or negative benefit.

This release addresses that limitation with two optimizations:

Communication granularity control in DeepSpeed: Controlling communication granularity in the DeepSpeed framework gives the compiler access to a complete compute graph, enabling wider-scope compiling optimization.
PyTorch compiler frontend improvements: The frontend is optimized so that compilation still succeeds when a graph break occurs. Pattern matching and dynamic shape capabilities are enhanced to optimize the compiled code.

Result: end-to-end (E2E) throughput increases by 20% when training an 8B LLM.

GPU memory optimization for activation recomputation

This feature profiles GPU memory consumption for models deployed on different clusters or configured with different parameters. It runs performance tests and collects system metrics such as GPU memory utilization. Based on the results, it automatically determines the optimal number of activation recomputation layers and integrates the setting into PyTorch, which provides GPU memory savings without manual tuning. This feature is currently supported in the DeepSpeed framework.

ACCL

Alibaba Cloud Communication Library (ACCL) is an in-house High-Performance Network (HPN) communication library provided by Alibaba Cloud for Lingjun. ACCL-N is the GPU acceleration variant: an HPN library built on NVIDIA Collective Communications Library (NCCL). It is fully compatible with NCCL, fixes known NCCL bugs, and delivers higher performance and stability.

E2E performance assessment

The following tests use CNP, a cloud-native AI performance assessment and analysis tool, to analyze E2E performance with mainstream open-source models and frameworks on standard base images. An ablation study assesses how much each optimization component contributes to overall model training performance. The tests run Golden-25.05 on multi-node GPU clusters with the following configurations:

Configuration	Description
Base	NGC PyTorch image
ACS AI Image: Base+ACCL	ACS AI image with the ACCL communication library
ACS AI Image: AC2+ACCL	Golden image on AC2 BaseOS, no optimizations enabled
ACS AI Image: AC2+ACCL+CompilerOpt	Golden image on AC2 BaseOS with torch compile optimization
ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt	Golden image on AC2 BaseOS with torch compile and selective gradient checkpoint optimization

Quick start

The following example uses Docker to pull and run the training-nv-pytorch image.

Note

To use the training-nv-pytorch image in ACS, pull it from the artifact center page in the console when creating workloads, or specify the image in a YAML file.

1. Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Enable optimizations

Enable compiling optimization

Use the transformers Trainer API:
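The original snippet is not reproduced in this text. The following is a minimal sketch, assuming that the optimization is switched on through the standard `torch_compile` flag of `TrainingArguments` in the transformers library. The helper names `compile_enabled_args` and `build_trainer` and the output directory are illustrative. The model and dataset are supplied by your own training script.

```python
def compile_enabled_args(output_dir="./output"):
    """Keyword arguments for TrainingArguments with torch.compile switched on."""
    return {
        "output_dir": output_dir,
        # Standard transformers flag that wraps the model with torch.compile().
        "torch_compile": True,
    }


def build_trainer(model, train_dataset, output_dir="./output"):
    """Create a Trainer whose TrainingArguments enable torch.compile."""
    # Imported lazily so that this sketch can be imported without transformers.
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(**compile_enabled_args(output_dir))
    return Trainer(model=model, args=args, train_dataset=train_dataset)
```

Calling `build_trainer(model, train_dataset).train()` then runs training with compilation enabled.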

Enable GPU memory optimization for activation recomputation

export CHECKPOINT_OPTIMIZATION=true

3. Launch a container

The image includes ljperf, a built-in model training tool for launching containers and running training tasks.

LLM example:

# Launch a container and log on to the container.
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo.
ljperf benchmark --model deepspeed/llama3-8b

Usage notes

This release modifies the PyTorch and DeepSpeed libraries. Do not reinstall them.
Leave zero_optimization.stage3_prefetch_bucket_size blank or set it to auto in your DeepSpeed configuration.
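As an illustration of the second note, a DeepSpeed configuration can be generated with the prefetch bucket size set to `auto`. This is a minimal sketch that assumes ZeRO stage 3 and a configuration consumed through the transformers DeepSpeed integration, which resolves `auto` values at startup. The file name `ds_config.json` and the remaining keys are illustrative.

```python
import json

# Minimal ZeRO stage 3 configuration. Only stage3_prefetch_bucket_size
# comes from the usage note; the other keys are illustrative.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    },
    "train_micro_batch_size_per_gpu": "auto",
}

config_text = json.dumps(ds_config, indent=2)

if __name__ == "__main__":
    # Write the configuration next to the training script.
    with open("ds_config.json", "w", encoding="utf-8") as f:
        f.write(config_text)
```

Omitting the `stage3_prefetch_bucket_size` key entirely corresponds to leaving the value blank.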

Known issues

PyTorch is upgraded to 2.6 in this release. As a result, the performance gain from GPU memory optimization for activation recomputation is lower for LLMs than in previous images. Optimization is ongoing.