Container Compute Service: training-nv-pytorch 25.06

Last Updated: Aug 11, 2025

This topic outlines the release notes for training-nv-pytorch 25.06.

Main features and bug fixes

Updated frameworks

  • PyTorch and related components upgraded to V2.7.1.8

  • Triton upgraded to V3.3.0

  • vLLM compatibility upgraded to 0.9.1

  • Added support for NVIDIA's Blackwell GPU architecture, enabling forward-looking development on next-generation hardware

Bug fix

  • Resolved degraded video random access memory (VRAM) optimization efficiency in legacy container images by upgrading PyTorch to V2.7.1.8.

Image details

Scenario: Training/Inference

Framework: PyTorch

Driver requirement: NVIDIA Driver ≥ 575 (see below for data center GPU compatibility)

Core components:

  • Ubuntu 24.04

  • Python 3.12.7+gc

  • Torch 2.7.1.8+nv25.3

  • CUDA 12.8.0

  • ACCL-N 2.23.4.12

  • triton 3.3.0

  • TransformerEngine 2.3.0+5de3e14

  • deepspeed 0.16.9+ali

  • flash-attn 2.7.2

  • flashattn-hopper 3.0.0b1

  • transformers 4.51.2+ali

  • megatron-core 0.12.1

  • grouped_gemm 1.1.4

  • accelerate 1.7.0+ali

  • diffusers 0.31.0

  • mmengine 0.10.3

  • mmcv 2.1.0

  • mmdet 3.3.0

  • opencv-python-headless 4.10.0.84

  • ultralytics 8.3.96

  • timm 1.0.15

  • vllm 0.9.1

  • flashinfer-python 0.2.5

  • pytorch-dynamic-profiler 0.24.11

  • perf 5.4.30

  • gdb 15.0.50

  • peft 0.13.2

  • ray 2.47.1

Available images

V25.06

  • Public image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.06-serverless

VPC image

  • acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

    {region-id}: The region where your Alibaba Cloud Container Compute Service (ACS) is activated (examples: cn-beijing, cn-wulanchabu).
    {image:tag}: The image name and tag.

Important

VPC image pulling is currently supported only in the China (Beijing) region.
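
For example, assuming the China (Beijing) region and the public image tag shown above, the resolved VPC address would be acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.06-serverless.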

Note

This image is suitable for ACS clusters and Lingjun multi-tenant clusters; it is not supported on Lingjun single-tenant clusters.

Driver requirements

  • The V25.06 release is based on CUDA 12.8.0 and requires NVIDIA Driver 575 or later. For data center GPUs (such as T4), the following older driver branches are also compatible through CUDA forward compatibility:

    • 470.57+ (R470 branch)

    • 525.85+ (R525 branch)

    • 535.86+ (R535 branch)

    • 545.23+ (R545 branch)

  • Important: The CUDA forward compatibility package supports only specific driver branches. Users on incompatible branches (R418, R440, R450, R460, R510, R520, R530, R555, R560) must upgrade, because those branches lack forward compatibility with CUDA 12.8. For full details, see "CUDA Compatibility" and "CUDA Compatibility and Upgrades".

Key features and enhancements

PyTorch compilation optimization

While torch.compile() delivers strong performance gains in single-GPU scenarios, its benefits are limited in large-scale LLM training because distributed frameworks such as FSDP and DeepSpeed introduce graph breaks that fragment the compiled graph.

  • To unlock broader compiler optimizations, we:

    • Optimized communication granularity within DeepSpeed, exposing larger, more coherent computation graphs to the compiler.

    • Enhanced the compiler frontend to handle arbitrary graph breaks.

    • Improved pattern matching and dynamic shape support for stable compiled performance.

Result: Consistent ~20% end-to-end (E2E) throughput improvement in 8B-parameter LLM training.
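
To illustrate what a graph break is (generic PyTorch behavior, not the image-specific frontend changes described above), the short sketch below uses torch._dynamo.explain, a diagnostic helper available in recent PyTorch releases, on a toy function; the function and tensor shapes are arbitrary placeholders.

# Toy illustration of a graph break: the data-dependent Python branch (via
# .item()) splits the function into multiple compiled graphs, which is the
# kind of fragmentation the compiler-frontend work above targets.
import torch
import torch._dynamo

def step(x):
    y = torch.relu(x @ x)
    if y.sum().item() > 0:   # data-dependent control flow forces a graph break
        y = y + 1
    return y

explanation = torch._dynamo.explain(step)(torch.randn(8, 8))
print(explanation.graph_count, "graph(s),", explanation.graph_break_count, "graph break(s)")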

Gradient checkpointing optimization

Through extensive benchmarking across models, cluster configurations, and system metrics (including memory utilization), we developed a predictive model to identify optimal activation recomputation layers. This optimization is now natively integrated into PyTorch and supported in DeepSpeed, enabling low-effort adoption of advanced memory optimization techniques.
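
For context, the underlying technique is standard PyTorch activation recomputation (gradient checkpointing). The sketch below is a generic illustration that wraps selected layers with torch.utils.checkpoint; the toy model and the every-other-layer policy are arbitrary placeholders, whereas the optimization described above selects the layers to recompute automatically.

# Generic activation recomputation sketch; model and layer-selection policy
# are placeholders for illustration only.
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Toy residual MLP block."""
    def __init__(self, dim=256):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

class ToyModel(torch.nn.Module):
    def __init__(self, n_layers=8, dim=256):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(n_layers))

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if i % 2 == 0:
                # Recompute this block's activations in the backward pass instead
                # of storing them; deciding which layers are worth recomputing is
                # what the predictive model described above automates.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

model = ToyModel()
out = model(torch.randn(4, 16, 256))
out.sum().backward()  # memory saved on checkpointed blocks, at the cost of recompute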

ACCL

The Alibaba Cloud Communication Library (ACCL) is a suite of high-performance networking (HPN) libraries designed for Lingjun.

One of its key components is ACCL-N, a GPU-accelerated communication library customized from the NVIDIA Collective Communications Library (NCCL). While maintaining full API compatibility with NCCL, ACCL-N provides several enhancements:

  • Improved performance: Delivers significantly higher throughput and greater stability, especially in large-scale, multi-node training environments.

  • Enhanced stability: Includes targeted bug fixes not yet available in the standard NCCL versions.
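
Because ACCL-N keeps full API compatibility with NCCL, standard torch.distributed code needs no changes in this image. The sketch below is a generic all-reduce sanity check, assuming the image's ACCL-N build is picked up as the NCCL backend; the file name and launch command (for example, torchrun --nproc_per_node=8 allreduce_check.py) are illustrative.

# Generic distributed sanity check; requesting the "nccl" backend inside this
# image is expected to use ACCL-N, since it is API-compatible with NCCL.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # sums the tensor across all ranks
    if dist.get_rank() == 0:
        print(f"all_reduce result: {x.item()} (world size {dist.get_world_size()})")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()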

E2E performance gain evaluation

Using the Cloud Native Platform (CNP) AI performance analysis tool, we conducted comprehensive E2E comparisons against standard base images (such as NGC PyTorch). Tests used mainstream open-source models and frameworks, with ablation studies to quantify each optimization's contribution.

Test configuration (multi-node GPU clusters)

  1. Baseline: NGC PyTorch image.

  2. ACS AI Image (Base + ACCL): base image with the ACCL communication library.

  3. ACS AI Image (AC2 + ACCL): golden image with AC2 BaseOS (no optimizations).

  4. ACS AI Image (AC2 + ACCL + CompilerOpt): AC2 BaseOS with torch.compile optimization.

  5. ACS AI Image (AC2 + ACCL + CompilerOpt + CkptOpt): AC2 BaseOS with both torch.compile and selective gradient checkpointing.

(Figure: E2E performance comparison for the configurations above.)

Quick start

This example uses Docker to pull and run the training-nv-pytorch image.

Note

When deploying in ACS, select the image from the Artifact Center in the console, or specify it in your YAML configuration.

1. Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Enable compiler and memory optimization

  • Compilation optimization with the Transformers Trainer API (an example sketch follows this list)

  • Enable gradient checkpointing optimization

    export CHECKPOINT_OPTIMIZATION=true
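
The following is a minimal sketch of the first item above: enabling torch.compile through the Transformers Trainer API via the torch_compile training argument. The tiny model and toy dataset are placeholders used only to keep the sketch self-contained; replace them with your own workload.

# Minimal sketch: enable torch.compile through the Transformers Trainer API.
# The model and dataset below are toy placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

class ToyDataset(torch.utils.data.Dataset):
    """Tiny constant dataset so the sketch runs end to end."""
    def __init__(self, tokenizer, n=32):
        enc = tokenizer("hello world", return_tensors="pt")
        self.item = {k: v.squeeze(0) for k, v in enc.items()}
        self.item["labels"] = self.item["input_ids"].clone()
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        return self.item

model_name = "sshleifer/tiny-gpt2"  # placeholder test model, not part of the image
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    torch_compile=True,   # Trainer wraps the model with torch.compile()
    report_to=[],         # keep the toy run free of experiment trackers
)

Trainer(model=model, args=args, train_dataset=ToyDataset(tokenizer)).train()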

3. Launch the container

The image includes a built-in training tool: ljperf.

LLM training example

# Start the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b

4. Usage recommendations

  • Do not reinstall PyTorch, DeepSpeed, or related libraries; the image includes pre-optimized binaries.

  • In the DeepSpeed configuration, leave zero_optimization.stage3_prefetch_bucket_size empty or set it to auto (see the config sketch after this list).

  • The image pre-sets NCCL_SOCKET_IFNAME:

    • When a single pod requests 1, 2, 4, or 8 GPUs for training or inference tasks, use NCCL_SOCKET_IFNAME=eth0. This is the image's default configuration.

    • For training on 16-GPU nodes, manually set NCCL_SOCKET_IFNAME=hpn0 to leverage the high-performance network (HPN).
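
For the stage3_prefetch_bucket_size recommendation above, the following is a minimal sketch of the relevant fragment of a ZeRO-3 config. Only the bucket-size setting comes from this guide; every other value is an illustrative placeholder, and "auto" values are resolved by the Transformers Trainer DeepSpeed integration.

# Illustrative ZeRO-3 config fragment (Python dict form; the same keys can live
# in a ds_config.json file). Only stage3_prefetch_bucket_size reflects the
# recommendation above; the other values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Leave this unset or set it to "auto", as recommended above.
        "stage3_prefetch_bucket_size": "auto",
    },
}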

Known issues

None reported.