
Container Compute Service: training-nv-pytorch 25.09

Last Updated: Mar 26, 2026

Release notes for training-nv-pytorch version 25.09. This release upgrades PyTorch to 2.8.0 and Transformers to 4.56.1+ali, and delivers two new optimizations—compile-level throughput improvements and automatic activation recomputation—that reduce the effort required for LLM training on Alibaba Cloud GPU infrastructure.

What's new

Updated components

  • PyTorch and its related components are upgraded to 2.8.0.

  • Transformers is upgraded to 4.56.1+ali, incorporating features and bug fixes from the corresponding open-source version.

Bug fixes

  • Fixed an error that occurred when torch.compile() was enabled with open-source Transformers on the Qwen2-VL model.

Key features and enhancements

PyTorch compile optimization

torch.compile(), introduced in PyTorch 2.0, works well for small-scale single-GPU training but provides limited or negative benefit for large language model (LLM) training, which requires GPU memory optimization and a distributed framework such as Fully Sharded Data Parallel (FSDP) or DeepSpeed.

This release addresses that gap with two compiler improvements:

  • DeepSpeed communication granularity control: Controlling communication granularity in the DeepSpeed framework gives the compiler a complete compute graph, enabling optimization across a wider scope.

  • PyTorch compiler frontend improvements: The frontend is optimized to continue compiling when a graph break occurs. Mode matching and dynamic shape capabilities are also enhanced.

Together, these optimizations deliver a 20% increase in end-to-end (E2E) throughput when training an 8B LLM.
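These compiler improvements build on the standard torch.compile() entry point introduced in PyTorch 2.0. As a minimal sketch of that baseline mechanism (using the lightweight "eager" backend so the example runs on CPU-only machines; the release-specific optimizations are applied automatically inside the image and are not shown here):

```python
import torch

class TinyMLP(torch.nn.Module):
    """Toy stand-in for a transformer block."""
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP()
# backend="eager" skips code generation so this sketch runs anywhere;
# real training would use the default inductor backend.
compiled = torch.compile(model, backend="eager")
out = compiled(torch.randn(8, 16))
print(tuple(out.shape))  # (8, 4)
```

In real LLM training the compiled module is wrapped by the distributed framework (FSDP or DeepSpeed), which is exactly where the graph-break and communication-granularity improvements above apply.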

GPU memory optimization for activation recomputation

This release analyzes and forecasts GPU memory consumption for models deployed across different clusters and parameter configurations, collecting system metrics such as GPU memory utilization. Based on the results, it selects the optimal number of activation recomputation layers and applies that setting inside PyTorch, delivering GPU memory savings without manual tuning. This feature is currently available only in the DeepSpeed framework.
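Activation recomputation trades compute for memory: instead of keeping every intermediate activation alive for the backward pass, selected layers are re-run during backpropagation. The release automates choosing how many layers to recompute; the underlying mechanism can be sketched with PyTorch's generic checkpoint utility (an illustration only, not the image's internal implementation):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer1 = torch.nn.Linear(16, 16)
layer2 = torch.nn.Linear(16, 4)

x = torch.randn(8, 16, requires_grad=True)
# Activations inside the checkpointed function are discarded after the
# forward pass and recomputed during backward, saving memory at the cost
# of extra compute.
h = checkpoint(lambda t: torch.relu(layer1(t)), x, use_reentrant=False)
loss = layer2(h).sum()
loss.backward()
print(x.grad is not None)  # True
```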

ACCL

ACCL is Alibaba Cloud's in-house High-Performance Network (HPN) communication library for Lingjun. It includes ACCL-N, an HPN library built on top of NVIDIA Collective Communications Library (NCCL) that is fully compatible with NCCL, fixes several NCCL bugs, and delivers higher performance and stability.

E2E performance evaluation

The following performance comparison was conducted on a multi-node GPU cluster using the cloud-native AI performance evaluation and analysis tool CNP. The baseline is the NGC PyTorch image.

Image and iteration comparison against the base image


E2E performance contribution by component

The tests measure the cumulative impact of each optimization layer:

Configuration                                   Description
Base                                            NGC PyTorch image
ACS AI image: Base+ACCL                         Adds the ACCL communication library
ACS AI image: AC2+ACCL                          Uses AC2 BaseOS with no additional optimizations
ACS AI image: AC2+ACCL+CompilerOpt              Adds torch.compile() optimization
ACS AI image: AC2+ACCL+CompilerOpt+CkptOpt      Adds both compile optimization and selective gradient checkpointing

System requirements

Item             Details
Scenarios        Training / Inference
Framework        PyTorch
NVIDIA driver    Release 575 or later

Driver compatibility

The 25.09 release is based on CUDA 12.8.0 and requires NVIDIA driver version 575 or later. For data center GPUs such as the T4, the following driver versions are also supported:

  • 470.57 or later (R470)

  • 525.85 or later (R525)

  • 535.86 or later (R535)

The following driver versions are not forward-compatible with CUDA 12.8 and must be upgraded before using this image: R418, R440, R450, R460, R510, R520, R530, R545, R555, and R560. For the complete list of supported drivers, see CUDA Application Compatibility. For upgrade guidance, see CUDA Compatibility and Upgrades.

Core components

Component                   Version
Ubuntu                      24.04
Python                      3.12.7+gc
CUDA                        12.8
perf                        5.4.30
gdb                         15.0.50.20240403-git
torch                       2.8.0.9+nv25.3
triton                      3.4.0
transformer_engine          2.3.0+5de3e148
deepspeed                   0.16.9+ali
flash_attn                  2.8.3
flash_attn_3                3.0.0b1
transformers                4.56.1+ali
grouped_gemm                1.1.4
accelerate                  1.7.0+ali
diffusers                   0.34.0
mmengine                    0.10.3
mmcv                        2.1.0
mmdet                       3.3.0
opencv-python-headless      4.11.0.86
ultralytics                 8.3.96
timm                        1.0.20
vllm                        0.10.1.1
flashinfer-python           0.2.5
pytorch-dynamic-profiler    0.24.11
peft                        0.16.0
ray                         2.49.2
megatron-core               0.12.1

Image assets

Public image

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.09-serverless
Note

This image is suitable for ACS products and Lingjun multi-tenant products. Do not use it in Lingjun single-tenant scenarios.

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace the placeholders with the following values:

Placeholder     Description                           Example
{region-id}     Region where your ACS is activated    cn-beijing, cn-wulanchabu
{image:tag}     Image name and tag                    See the public image above
Important

Currently, you can pull VPC images only in the China (Beijing) region.
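Substituting the example values above, the full VPC image reference for the China (Beijing) region can be assembled like this (the tag is taken from the public image listed earlier):

```shell
REGION_ID="cn-beijing"
IMAGE_TAG="training-nv-pytorch:25.09-serverless"
IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${IMAGE}"
# Then pull it from inside the VPC:
# docker pull "${IMAGE}"
```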

Quick start

The following example shows how to pull and run the training-nv-pytorch image using Docker.

Note

To use this image in ACS, select it from the Artifacts page when creating a workload in the console, or specify the image reference in a YAML file—do not use Docker directly.

Step 1: Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

Step 2: Enable compile and recomputation optimizations

Enable compile optimization

Enable compile optimization through the Transformers Trainer API when launching your training job.

Enable activation recomputation for GPU memory optimization

export CHECKPOINT_OPTIMIZATION=true

Step 3: Start the container and run a training task

The image includes a built-in model training tool named ljperf. The following example starts a container and runs an LLM training task:

# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
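To apply the Step 2 environment variable inside the container, it can be passed on the docker run command line. This sketch assumes ljperf is on the container's PATH, as the interactive example above suggests:

```shell
# Run the training demo non-interactively, with activation recomputation
# optimization switched on inside the container.
docker run --rm --ipc=host --net=host --privileged \
  -e CHECKPOINT_OPTIMIZATION=true \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag] \
  ljperf benchmark --model deepspeed/llama3-8b
```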

Usage notes

  • Do not reinstall the customized versions of libraries bundled in this image, such as PyTorch and DeepSpeed. Reinstalling them overwrites the Alibaba Cloud optimizations.

  • In your DeepSpeed configuration, leave zero_optimization.stage3_prefetch_bucket_size blank or set it to auto.

  • Set NCCL_SOCKET_IFNAME based on the number of GPUs requested per pod:

    GPUs per pod                 NCCL_SOCKET_IFNAME value
    1, 2, 4, or 8                eth0 (default)
    16 (full node, using HPN)    hpn0
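The DeepSpeed note above can be illustrated with a minimal ZeRO stage 3 configuration fragment. Only the stage3_prefetch_bucket_size key comes from this document; the other values are hypothetical placeholders:

```python
import json

# Keep stage3_prefetch_bucket_size as "auto" (or omit it) so the image
# can choose the bucket size itself, per the usage note above.
ds_config = {
    "train_batch_size": 32,  # hypothetical
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    },
}
print(json.dumps(ds_config, indent=2))
```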