
Container Compute Service: training-nv-pytorch 25.02

Last Updated: Mar 26, 2026

This release updates the base image to NGC PyTorch 25.01, upgrades CUDA to 12.8.0 and cuDNN to 9.7.0.66, and delivers PyTorch compiling optimization and GPU memory optimization for recomputation.

Announcements

NVIDIA driver upgrade required for CUDA 12.8. If you are running any of the following driver versions, upgrade before using this image: R418, R440, R450, R460, R510, R520, R530, R545, or R555. These versions are not forward-compatible with CUDA 12.8. See CUDA application compatibility for details.

What's new

Component      25.02
Base image     NGC PyTorch 25.01
CUDA           12.8.0
cuDNN          9.7.0.66
ACCL-N         2.23.4.11 (adds ACCL-Barex support)
Transformers   4.48.3+ali
vLLM           0.7.2
Ray            2.42.1

Bug fixes: None.

Key components

Component Version
Ubuntu 24.04
Python 3.12.7+gc
Torch 2.5.1.6.post2
CUDA 12.8.0
cuDNN 9.7.0.66
ACCL-N 2.23.4.11
triton 3.1.0
TransformerEngine 1.13.0
deepspeed 0.15.4+ali
flash-attn 2.5.8
flashattn-hopper 3.0.0b1
transformers 4.48.3+ali
megatron-core 0.9.0
grouped_gemm 1.1.4
accelerate 1.1.0
diffusers 0.31.0
openmim 0.3.9
mmengine 0.10.3
mmcv 2.1.0
mmdet 3.3.0
opencv-python-headless 4.10.0.84
ultralytics 8.2.74
timm 1.0.13
vllm 0.7.2
pytorch-dynamic-profiler 0.24.11
perf 5.4.30
gdb 15.0.50
peft 0.13.2
ray 2.42.1

Use scenarios: Training / inference
Framework: PyTorch

Driver requirements

This release is based on CUDA 12.8.0 and requires NVIDIA driver 570 or later.

If you are running on a data center GPU (such as T4), the following driver versions are also supported:

  • 470.57 (R470)

  • 525.85 (R525)

  • 535.86 (R535)

  • 545.23 (R545)

Drivers that require an upgrade: Users should upgrade from R418, R440, R450, R460, R510, R520, R530, R545, and R555, which are not forward-compatible with CUDA 12.8. For more information, see CUDA application compatibility and CUDA compatibility and upgrades.

The Golden-gpu driver meets the driver requirement for this NGC image version.

Image assets

Public image

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.02-serverless

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace {region-id} with the region where your ACS is activated (for example, cn-beijing), and {image:tag} with the image name and tag.
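For example, assuming ACS is activated in cn-beijing and you want the serverless image above, the VPC pull command can be assembled from the template like this (the region and tag are example values; substitute your own):

```shell
# Example values only; replace with your region and the tag you need.
REGION_ID="cn-beijing"
IMAGE_TAG="training-nv-pytorch:25.02-serverless"

# Assemble the VPC registry address from the template above.
VPC_IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "docker pull ${VPC_IMAGE}"
```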

Important

VPC image pulls are currently only supported in the China (Beijing) region.

Image compatibility

Image tag                               Suitable for
training-nv-pytorch:25.02-serverless    ACS products and Lingjun multi-tenant products
training-nv-pytorch:25.02               Lingjun single-tenant scenarios

The 25.02-serverless image is not suitable for Lingjun single-tenant products.

Key features and enhancements

PyTorch compiling optimization

torch.compile(), introduced in PyTorch 2.0, is effective for single-GPU training but provides limited or negative benefit for large language model (LLM) training, which depends on GPU memory optimization and distributed frameworks such as Fully Sharded Data Parallel (FSDP) or DeepSpeed.

This release improves torch.compile() for distributed LLM training through two optimizations:

  • Communication granularity control in DeepSpeed: Controlling communication granularity gives the compiler a complete compute graph, enabling wider compiling optimization.

  • Frontend improvements: The PyTorch compiler frontend now continues compiling even when a graph break occurs, with enhanced pattern matching and dynamic shape capabilities.

Result: 20% higher end-to-end throughput when training an 8B LLM.
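As a minimal illustration of the underlying mechanism (not of the distributed optimizations above, which are built into the image), torch.compile() wraps an ordinary function and captures its compute graph. The sketch below uses the debug-oriented "eager" backend so it runs anywhere; real training uses the default "inductor" backend, which generates optimized fused kernels:

```python
import torch

def train_step(x, w):
    # Toy forward pass; torch.compile traces this into a single graph.
    return torch.relu(x @ w).sum()

# backend="eager" only exercises graph capture and keeps the sketch portable;
# the default backend="inductor" is what produces the actual speedups.
compiled_step = torch.compile(train_step, backend="eager")

x = torch.randn(4, 8)
w = torch.randn(8, 8)

# The compiled function must match eager execution numerically.
assert torch.allclose(train_step(x, w), compiled_step(x, w))
```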

GPU memory optimization for recomputation

Based on performance tests across different clusters and parameter configurations, this release integrates the optimal number of activation recomputation layers directly into PyTorch. Enable it with a single environment variable — no manual tuning required.

This feature is currently available in the DeepSpeed framework only.
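For context, activation recomputation (gradient checkpointing) trades compute for memory: checkpointed layers do not store their intermediate activations during the forward pass and recompute them during backward. The sketch below shows the stock PyTorch primitive that this feature tunes automatically; the layer and tensor sizes are illustrative only:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(16, 16)
x = torch.randn(2, 16, requires_grad=True)

# Checkpointed forward: activations inside `layer` are not stored, but
# recomputed during the backward pass. use_reentrant=False selects the
# recommended non-reentrant implementation.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()

# Gradients flow back to the input despite the discarded activations.
assert x.grad is not None and x.grad.shape == x.shape
```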

ACCL

ACCL (Alibaba Cloud Communication Library) is an in-house high-performance networking (HPN) communication library for Lingjun. ACCL-N, the GPU acceleration variant, is built on top of NCCL (NVIDIA Collective Communications Library) and is fully compatible with it, while fixing known NCCL bugs and delivering higher performance and stability.

This release updates ACCL-N to 2.23.4.11 and adds support for ACCL-Barex.

End-to-end performance assessment

The following benchmark compares end-to-end performance across five configuration levels for multi-node GPU-accelerated clusters. All results are based on 25.02, using mainstream open-source models and frameworks through the CNP (Cloud-Native Performance) tool.

Configuration                                   Description
Base                                            NGC PyTorch image
ACS AI Image: Base+ACCL                         NGC base image with ACCL
ACS AI Image: AC2+ACCL                          Golden image with AC2 BaseOS, no optimizations
ACS AI Image: AC2+ACCL+CompilerOpt              Golden image with AC2 BaseOS and PyTorch compiling optimization
ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt      Golden image with AC2 BaseOS, PyTorch compiling optimization, and selective gradient checkpoint optimization

[Figure: Performance benchmark chart showing end-to-end throughput across the five configuration levels]

Quick start

The following steps use Docker to pull and run the training-nv-pytorch image.

To use this image in ACS, pull it from the artifact center page in the console or specify the image in a YAML file — not with the Docker CLI.

Step 1: Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

Step 2: Enable optional optimizations

Compiling optimization — use the Transformers Trainer API:

[Screenshot: Enabling compiling optimization via the Transformers Trainer API]

GPU memory optimization for recomputation:

export CHECKPOINT_OPTIMIZATION=true

Step 3: Launch a container and run a training task

The image includes ljperf, a built-in model training tool. The following example launches a container and runs an LLM training demo:

# Launch the container.
docker run --rm -it --ipc=host --net=host --privileged \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo.
ljperf benchmark --model deepspeed/llama3-8b

Usage notes

  • Do not reinstall PyTorch or DeepSpeed: the image ships modified builds of these libraries, and reinstalling them removes the optimizations.

  • In your DeepSpeed configuration, leave zero_optimization.stage3_prefetch_bucket_size blank or set it to auto.
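For example, a DeepSpeed ZeRO-3 configuration fragment that follows this guidance might look like the following; the surrounding keys are typical DeepSpeed settings shown only for context:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto"
  }
}
```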

Known issues