
Container Compute Service: training-nv-pytorch 25.03

Last Updated: Mar 26, 2026

This release updates core training and inference components, introduces PyTorch compiler optimizations that increase end-to-end (E2E) throughput by up to 20% in large language model (LLM) training, and upgrades ACCL-N for higher communication performance.

Important notices

  • Do not reinstall PyTorch or DeepSpeed. This image includes customized versions of both libraries; reinstalling them from PyPI overwrites the optimizations.

  • In your DeepSpeed configuration, set zero_optimization.stage3_prefetch_bucket_size to auto or leave it blank.

  • The 25.03-serverless image is not compatible with Lingjun single-tenant products.

  • VPC image pulls are currently supported only in the China (Beijing) region.
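
The DeepSpeed setting mentioned above corresponds to the following fragment of a DeepSpeed JSON configuration (a minimal sketch; the surrounding fields and their values are illustrative, not a complete config):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto"
  }
}
```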

What's new

Component               Version
Base image              NGC 25.02
PyTorch (Torch)         2.6.0.7
TransformerEngine (TE)  2.1
accelerate              1.5.2
ACCL-N                  2.23.4.12
vLLM                    0.8.2.dev0
ray                     2.44.0
flashinfer              0.2.3
Transformers            4.49.0+ali
flash-attn              2.7.2

Bugs fixed

Upgraded vLLM to 0.8.2.dev0, which fixes an illegal memory access in Mixture of Experts (MoE) kernels on H20 GPUs (#13693).

Image details

Applicable scenarios

Attribute              Value
Applicable scenario    Training/inference
Framework              PyTorch
Minimum NVIDIA driver  570

Core components

Training and inference frameworks

Component          Version
Ubuntu             24.04
Python             3.12.7+gc
Torch              2.6.0.7
CUDA               12.8.0
ACCL-N             2.23.4.12
triton             3.1.0
TransformerEngine  2.1
deepspeed          0.15.4+ali
flash-attn         2.7.2
flashattn-hopper   3.0.0b1
transformers       4.49.0+ali
megatron-core      0.9.0
grouped_gemm       1.1.4
accelerate         1.5.2
peft               0.13.2
vllm               0.8.2.dev0+g61c7a1b8.d20250325.cu128
flashinfer         0.2.3
ray                2.44.0

CV tools

Component               Version
diffusers               0.31.0
timm                    1.0.13
ultralytics             8.2.74
opencv-python-headless  4.10.0.84
mmengine                0.10.3
mmcv                    2.1.0
mmdet                   3.3.0
openmim                 0.3.9

Debugging and profiling

Component                 Version
pytorch-dynamic-profiler  0.24.11
perf                      5.4.30
gdb                       15.0.50

Available images

Public image

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.03-serverless

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace {region-id} with the region where your ACS is activated, for example, cn-beijing. Replace {image:tag} with the name and tag of the image.
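
For example, filling in the template for the China (Beijing) region and this release's serverless tag (a sketch; the docker pull itself is commented out because it requires Docker and VPC network access):

```shell
# Example values; substitute your own region ID and image tag.
region_id="cn-beijing"
image="training-nv-pytorch:25.03-serverless"

# Assemble the full VPC registry URI from the template.
registry="acs-registry-vpc.${region_id}.cr.aliyuncs.com/egslingjun/${image}"
echo "${registry}"

# Then pull the image:
# docker pull "${registry}"
```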

Important

VPC image pulls are currently supported only in the China (Beijing) region.

Choose the right image

Image tag         Use with
25.03-serverless  ACS products and Lingjun multi-tenant products
25.03             Lingjun single-tenant scenarios
Note

The 25.03-serverless image is not compatible with Lingjun single-tenant products.

Driver requirements

This release is based on CUDA 12.8.0.38 and requires NVIDIA driver 570 or later.

Exception for data center GPUs (such as T4): you can use any of the following driver versions instead.

Driver branch  Minimum version
R470           470.57
R525           525.85
R535           535.86
R545           545.23

Drivers that must be updated: R418, R440, R450, R460, R510, R520, R530, R545, and R555 are not forward-compatible with CUDA 12.8. Update to a supported driver before using this image.
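
As an illustrative helper (hypothetical, not part of the image), the driver rules above can be expressed programmatically:

```python
# Branch-specific minimum driver versions for data center GPUs,
# taken from the table above.
BRANCH_MINIMUMS = {470: (470, 57), 525: (525, 85), 535: (535, 86), 545: (545, 23)}

def driver_supported(version: str, data_center_gpu: bool = False) -> bool:
    """Return True if an NVIDIA driver version satisfies this image's requirements."""
    parts = tuple(int(p) for p in version.split("."))
    if parts[0] >= 570:          # general requirement: driver 570 or later
        return True
    if data_center_gpu:          # exception branches for data center GPUs (e.g. T4)
        minimum = BRANCH_MINIMUMS.get(parts[0])
        return minimum is not None and parts >= minimum
    return False

print(driver_supported("570.86.15"))                         # True
print(driver_supported("535.86", data_center_gpu=True))      # True
print(driver_supported("550.90.07", data_center_gpu=True))   # False: R550 not listed
```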

For details, see CUDA application compatibility and CUDA compatibility and updates.

Key features and enhancements

PyTorch compiler optimization

torch.compile() delivers clear throughput gains for single-GPU workloads, but distributed training frameworks such as Fully Sharded Data Parallel (FSDP) and DeepSpeed have historically prevented the compiler from seeing a complete computation graph, limiting or even negating those gains. This release addresses that with two optimizations:

  • Communication granularity control in DeepSpeed: exposes a complete computation graph to the compiler, enabling broader optimization scope.

  • Compiler frontend improvements: the PyTorch compiler frontend now handles graph breaks gracefully, and mode matching and dynamic shape handling are improved for better runtime performance.

Result: up to 20% higher E2E throughput in 8B LLM training.

GPU memory optimization for recomputation

This release integrates automatic activation recomputation layer recommendations directly into PyTorch. The optimal number of recomputation layers is determined by running performance tests across different cluster configurations and collecting GPU memory utilization metrics; no manual tuning is required.

Currently supported in the DeepSpeed framework.

ACCL communication library

ACCL-N is Alibaba Cloud's High-Performance Networking (HPN) communication library for Lingjun, built on NCCL with full NCCL API compatibility. ACCL-N 2.23.4.12 delivers higher throughput and stability than stock NCCL and includes additional bug fixes.
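
Because the API surface is unchanged, standard NCCL environment variables and tooling apply as-is under ACCL-N; for example, enabling collective-communication logging (these are stock NCCL variables, not ACCL-specific ones):

```shell
# Standard NCCL debug settings work unchanged with ACCL-N.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
echo "${NCCL_DEBUG}"
```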

E2E performance benefit assessment

The cloud-native AI performance assessment tool CNP measures E2E training performance using mainstream open-source models and standard base images, with ablation study support to isolate the contribution of each optimization.

The following chart shows the cumulative E2E benefit of each optimization layer in version 25.03, measured on a multi-node GPU cluster:

  1. Base: NGC PyTorch image (baseline)

  2. Base + ACCL: ACCL-N substituted for NCCL

  3. AC2 + ACCL: AC2 BaseOS, no additional optimizations

  4. AC2 + ACCL + CompilerOpt: AC2 BaseOS with PyTorch compiler optimization

  5. AC2 + ACCL + CompilerOpt + CkptOpt: AC2 BaseOS with both PyTorch compiler optimization and selective gradient checkpointing

(Chart: cumulative E2E throughput gains for the five configurations above.)

Quick start

Note

To use this image in ACS, pull it from the artifact center page of the console when creating workloads, or specify the image URI directly in your YAML file.

1. Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Enable optimizations

Compiler optimization

Call the transformers Trainer API to enable compiler optimization.
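
A minimal configuration sketch, assuming a standard transformers Trainer workflow: torch_compile is the stock TrainingArguments flag that wraps the model with torch.compile(); any additional switches specific to this image's customized build are not shown here.

```python
from transformers import TrainingArguments

# torch_compile=True asks Trainer to wrap the model with torch.compile()
# before training (available in transformers >= 4.27).
args = TrainingArguments(
    output_dir="./output",
    torch_compile=True,
)
# Pass `args` to Trainer(model=..., args=args, train_dataset=...) as usual.
```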

GPU memory optimization for recomputation

export CHECKPOINT_OPTIMIZATION=true

3. Launch a container and run a training demo

The image includes ljperf, a built-in model training tool. The following example launches a container and runs an LLM training demo.

# Launch the container
docker run --rm -it --ipc=host --net=host --privileged \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the LLM training demo
ljperf --action train --model_name deepspeed/llama3-8b

Known issues

None.