training-nv-pytorch 25.04 is an ACS AI training container image based on NGC PyTorch 25.03, with Alibaba Cloud optimizations for large-scale LLM training and inference on GPU clusters.
What's new
Base image aligned with NGC 25.03. CUDA upgraded to 12.8.1, TransformerEngine upgraded to 2.1.
Triton adapted to 3.2.0, Accelerate upgraded to 1.6.0+ali, with corresponding version features and bug fixes integrated.
vLLM upgraded to 0.8.5, flashinfer-python upgraded to 0.2.5, Transformers upgraded to 4.51.2+ali, adding Qwen3 support.
Bug fixes: None
Announcements
The image includes modified PyTorch and DeepSpeed libraries. Do not reinstall these libraries.
In your DeepSpeed configuration, leave `zero_optimization.stage3_prefetch_bucket_size` blank or set it to `auto`.
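For example, a minimal ZeRO-3 configuration fragment in Python dict form. Only the `stage3_prefetch_bucket_size` setting comes from this document; the other keys are common ZeRO-3 settings shown for context.

```python
# Illustrative DeepSpeed ZeRO-3 config fragment (Python dict form).
# Only stage3_prefetch_bucket_size is mandated by this image's release
# notes; the surrounding keys are typical ZeRO-3 settings for context.
ds_config = {
    "train_batch_size": "auto",
    "zero_optimization": {
        "stage": 3,
        # Leave blank or set to "auto" when using this image:
        "stage3_prefetch_bucket_size": "auto",
    },
}
```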
Components
| Scenarios | Training, inference |
|---|---|
| Framework | PyTorch |
| NVIDIA driver | >= 570 |
Core components:
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| Torch | 2.6.0.7 |
| CUDA | 12.8.1 |
| ACCL-N | 2.23.4.12 |
| Triton | 3.2.0 |
| TransformerEngine | 2.1 |
| DeepSpeed | 0.15.4+ali |
| flash-attn | 2.7.2 |
| flashattn-hopper | 3.0.0b1 |
| Transformers | 4.51.2+ali |
| megatron-core | 0.9.0 |
| grouped_gemm | 1.1.4 |
| Accelerate | 1.6.0+ali |
| diffusers | 0.31.0 |
| openmim | 0.3.9 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| opencv-python-headless | 4.10.0.84 |
| ultralytics | 8.2.74 |
| timm | 1.0.13 |
| vLLM | 0.8.5+cu128 |
| flashinfer | 0.2.5 |
| pytorch-dynamic-profiler | 0.24.11 |
| perf | 5.4.30 |
| gdb | 15.0.50 |
| peft | 0.13.2 |
| ray | 2.43.0 |
Image assets
25.04
Public image:
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.04-serverless
VPC image:
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id}: the region where your ACS is activated, such as cn-beijing or cn-wulanchabu.
{image:tag}: the name and tag of the image.
Currently, you can pull VPC images only in the China (Beijing) region.
training-nv-pytorch:25.04-serverless is compatible with the ACS product form and the Lingjun multi-tenant product form. It is not compatible with the Lingjun single-tenant product form.
training-nv-pytorch:25.04 (no suffix) is for Lingjun single-tenant scenarios.
Driver requirements
This release is based on CUDA 12.8.1.012.
| Scenario | Minimum driver version |
|---|---|
| Standard GPUs | 570 or later |
| Data center GPUs (T4 and similar) | 470.57 (R470), 525.85 (R525), 535.86 (R535), or 545.23 (R545) |
The following driver branches are not forward-compatible with CUDA 12.8. Upgrade if you are on any of them: R418, R440, R450, R460, R510, R520, R530, R545, R555, R560.
For the full list of supported drivers, see CUDA application compatibility. For background on compatibility and upgrades, see CUDA compatibility and upgrades.
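The table above can be encoded as a small check. This is a hypothetical helper, not part of the image: standard GPUs need driver 570 or later, while data center GPUs may instead run one of the listed forward-compatible branch minimums. The branch cutoffs below are taken from this document.

```python
# Hypothetical helper encoding the driver-requirement table above.
# Standard GPUs: driver >= 570. Data center GPUs: additionally accept the
# forward-compatible branch minimums listed in this document.
FORWARD_COMPAT_MINIMUMS = {
    470: (470, 57),
    525: (525, 85),
    535: (535, 86),
    545: (545, 23),
}

def supports_cuda_12_8(driver_version: str, data_center: bool = False) -> bool:
    parts = [int(p) for p in driver_version.split(".")[:2]]
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else 0
    if major >= 570:
        return True
    if data_center and major in FORWARD_COMPAT_MINIMUMS:
        return (major, minor) >= FORWARD_COMPAT_MINIMUMS[major]
    return False
```

You could compare the result against the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader` on the host.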
Key features and enhancements
PyTorch compilation optimization
torch.compile(), introduced in PyTorch 2.0, works well for single-GPU training, but it provides limited or even negative benefit for LLM training, which requires GPU memory optimization and distributed frameworks such as FSDP or DeepSpeed.
This release addresses that gap with two compiler improvements:
Communication granularity control in DeepSpeed: tunes the communication granularity of the DeepSpeed framework, giving the compiler a wider computation-graph scope for optimization.
Improved PyTorch compiler frontend: keeps compilation proceeding even when graph breaks occur in a computation graph, with improved pattern matching and dynamic-shapes support.
After these optimizations, end-to-end throughput increases by 20% when training an 8B LLM.
GPU memory optimization for recomputation
This release analyzes GPU memory consumption across different cluster configurations and parameter settings, then derives the optimal number of activation recomputation layers and integrates them into PyTorch. This makes selective activation checkpointing available with minimal configuration. Currently supported in the DeepSpeed framework.
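The layer-selection idea can be sketched as a small search. Everything below is a hypothetical illustration with made-up numbers and function names; the image derives the actual values internally.

```python
# Hypothetical sketch of selective activation checkpointing selection:
# find the smallest number of layers whose activations are dropped and
# recomputed so that the estimated activation memory fits the budget.
# All numbers and names are illustrative; the image performs this
# analysis internally.
def min_recompute_layers(num_layers: int,
                         act_mem_per_layer_gb: float,
                         budget_gb: float) -> int:
    for k in range(num_layers + 1):  # k = layers recomputed in backward
        kept = num_layers - k       # layers whose activations stay resident
        if kept * act_mem_per_layer_gb <= budget_gb:
            return k
    return num_layers

# e.g. 32 layers at 1.5 GB of activations each under a 40 GB budget
```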
ACCL
ACCL is Alibaba Cloud's in-house High-Performance Network (HPN) communication library for Lingjun. ACCL-N, the GPU acceleration variant, is built on NCCL with full NCCL compatibility, additional bug fixes, and improved performance and stability.
End-to-end performance contribution analysis
The following tests use Golden-25.04 on multi-node GPU clusters, comparing the contribution of each optimization component to end-to-end throughput. The cloud-native AI performance tool CNP runs evaluations with mainstream open-source models and frameworks alongside standard base images.
Test configurations:
Base: NGC PyTorch image
ACS AI image: Base + ACCL: image with the ACCL communication library
ACS AI image: AC2 + ACCL: golden image with AC2 BaseOS, no optimizations enabled
ACS AI image: AC2 + ACCL + CompilerOpt: golden image with AC2 BaseOS, PyTorch compilation optimization enabled
ACS AI image: AC2 + ACCL + CompilerOpt + CkptOpt: golden image with AC2 BaseOS, both compilation optimization and selective activation checkpointing enabled

Quick start
The following steps use Docker to pull and run the training-nv-pytorch image.
To use training-nv-pytorch in ACS, pull the image from the artifact center page in the console when creating workloads, or specify the image in a YAML file.
Prerequisites
Before you begin, make sure that you have:
Docker installed on your host machine
An NVIDIA driver that meets the driver requirements
Access to the image registry (public image or VPC image, depending on your network)
1. Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Enable compilation optimization and GPU memory optimization for recomputation
Enable compilation optimization by using the Transformers Trainer API:

Enable GPU memory optimization for recomputation:
export CHECKPOINT_OPTIMIZATION=true
3. Launch a container and run a training task
The image includes a built-in model training tool, ljperf, for launching containers and running training tasks.
# Launch a container and log in.
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo.
ljperf benchmark --model deepspeed/llama3-8b
Known issues
After upgrading to PyTorch 2.6, the performance benefit of recomputation memory optimization for LLM-type models is lower than in previous releases. Optimization is ongoing.