
Container Compute Service: training-nv-pytorch 25.08

Last Updated: Mar 26, 2026

Version 25.08 of the training-nv-pytorch image upgrades three core libraries and delivers end-to-end compiler and memory optimizations for large language model (LLM) training on Alibaba Cloud ACS and Lingjun GPU clusters.

What's new

Upgraded components:

  • transformers upgraded to 4.53.3+ali

  • vLLM upgraded to 0.10.0

  • Ray upgraded to 2.48.0

Bug fixes: None

Contents

Application scenario

Training/Inference

Framework

PyTorch

Requirements

NVIDIA Driver release >= 575

Core components

  • Ubuntu: 24.04

  • Python: 3.12.7+gc

  • CUDA: 12.8

  • perf: 5.4.30

  • gdb: 15.0.50.20240403-git

  • torch: 2.7.1.8+nv25.3

  • triton: 3.3.0

  • transformer_engine: 2.3.0+5de3e14

  • deepspeed: 0.16.9+ali

  • flash_attn: 2.7.2

  • flashattn-hopper: 3.0.0b1

  • transformers: 4.53.3+ali

  • grouped_gemm: 1.1.4

  • accelerate: 1.7.0+ali

  • diffusers: 0.34.0

  • mmengine: 0.10.3

  • mmcv: 2.1.0

  • mmdet: 3.3.0

  • opencv-python-headless: 4.11.0.86

  • ultralytics: 8.3.96

  • timm: 1.0.19

  • vllm: 0.10.0

  • flashinfer-python: 0.2.5

  • pytorch-dynamic-profiler: 0.24.11

  • peft: 0.16.0

  • ray: 2.48.0

  • accl-n: 2.27.5.14

  • megatron-core: 0.12.1

Assets

25.08

Public image:

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.08-serverless

VPC image:

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace the placeholders with actual values:

Placeholder    Description                          Example
{region-id}    The region where ACS is activated    cn-beijing, cn-wulanchabu
{image:tag}    The image name and tag               training-nv-pytorch:25.08-serverless
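The substitution can be sketched as a tiny helper; the function name is illustrative, not part of any tooling:

```python
def vpc_image_ref(region_id: str, image_and_tag: str) -> str:
    """Build a VPC image reference from the template above."""
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"

# Example using the values from the table:
print(vpc_image_ref("cn-beijing", "training-nv-pytorch:25.08-serverless"))
# acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.08-serverless
```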
Important

Currently, VPC image pulls are supported only in the China (Beijing) region.

Note

The egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.08-serverless image is designed for ACS and Lingjun multi-tenant products. Do not use this image in Lingjun single-tenant scenarios.

Driver requirements

Version 25.08 is based on CUDA 12.8.0 and requires NVIDIA driver version 575 or later.

Exception for data center GPUs (e.g., T4): You can use driver version 470.57 (R470 or later), 525.85 (R525 or later), 535.86 (R535 or later), or 545.23 (R545 or later).

Drivers that require upgrading: R418, R440, R450, R460, R510, R520, R530, R545, R555, and R560 are not forward-compatible with CUDA 12.8 and must be upgraded. For the complete list of supported drivers and compatibility details, see CUDA application compatibility and CUDA compatibility and upgrades.
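Before pulling the image, you can confirm that the installed driver meets the requirement. A minimal check, assuming the version string comes from `nvidia-smi --query-gpu=driver_version --format=csv,noheader` (the helper below is illustrative, not part of the image):

```python
def driver_ok(version: str, minimum: int = 575) -> bool:
    """True if the driver's major version meets the CUDA 12.8 requirement (>= 575)."""
    return int(version.split(".")[0]) >= minimum

print(driver_ok("575.57.08"))  # True: meets the requirement
print(driver_ok("560.35.03"))  # False: must be upgraded
```

Note that this simple check does not cover the data center GPU exception above, where older R470/R525/R535/R545 drivers remain usable.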

Key features and enhancements

PyTorch compiling optimization

torch.compile(), introduced in PyTorch 2.0, improves performance for single-GPU training. For LLM training, it provides limited or negative benefit because distributed frameworks like Fully Sharded Data Parallel (FSDP) or DeepSpeed interrupt the compiler's view of the compute graph.

Version 25.08 addresses this with two targeted optimizations:

  • Communication granularity control in DeepSpeed: The compiler can now see a complete compute graph across a wider scope, enabling more effective optimization.

  • PyTorch compiler frontend improvements: The frontend now handles graph breaks without aborting compilation, and enhanced pattern matching and dynamic shape support generate more efficient compiled code.

Result: End-to-end (E2E) throughput for 8B LLM training increases by 20%.

To enable compilation optimization, use the transformers Trainer API.
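A minimal sketch, assuming a recent transformers release where TrainingArguments exposes the torch_compile flag; the output path and batch size are placeholders, and model/dataset wiring is omitted:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./llm-ckpt",        # placeholder path
    torch_compile=True,             # route training steps through torch.compile()
    per_device_train_batch_size=4,  # placeholder batch size
)
# Trainer(model=model, args=args, train_dataset=train_ds).train()
```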

GPU memory optimization for recomputation

For LLM training, activation recomputation reduces GPU memory pressure by recomputing intermediate activations during the backward pass instead of storing them. Choosing the right number of recomputation layers requires careful tuning across cluster configurations and model parameters.

Version 25.08 automates this decision. The optimization layer runs performance tests, collects GPU memory utilization metrics across different cluster and parameter configurations, and derives the optimal number of activation recomputation layers. This value is integrated directly into PyTorch, so gradient checkpointing is applied without manual tuning.

To enable recomputation GPU memory optimization:

export CHECKPOINT_OPTIMIZATION=true
Note

This feature is currently available in the DeepSpeed framework only.
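The tuning problem this feature automates can be illustrated with a toy calculation; the numbers and the helper below are hypothetical, not the image's actual algorithm:

```python
def recompute_layers_needed(num_layers: int, act_gb_per_layer: float, budget_gb: float) -> int:
    """Smallest number of layers to recompute so stored activations fit the memory budget."""
    for k in range(num_layers + 1):
        if (num_layers - k) * act_gb_per_layer <= budget_gb:
            return k
    return num_layers

# 32 layers at 1.5 GB of activations each, with 24 GB of memory left for activations:
print(recompute_layers_needed(32, 1.5, 24))  # 16 layers recomputed, 16 stored
```

The optimization layer derives this number from measured GPU memory utilization rather than a fixed per-layer estimate, so no manual tuning is needed.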

ACCL

ACCL is Alibaba Cloud's in-house High-Performance Network (HPN) communication library for Lingjun. ACCL-N is the GPU acceleration variant, built on NCCL with full compatibility and additional bug fixes, delivering higher performance and stability than standard NCCL.

End-to-end performance gain evaluation

The following evaluation was conducted using CNP, a cloud-native AI performance evaluation and analysis tool, on a multi-node GPU cluster. It compares training throughput across five configurations using mainstream open-source models and framework settings:

  1. Base: NGC PyTorch image (baseline)

  2. ACS AI Image: Base+ACCL: Adds the ACCL communication library

  3. ACS AI Image: AC2+ACCL: AC2 BaseOS with no additional optimizations

  4. ACS AI Image: AC2+ACCL+CompilerOpt: AC2 BaseOS with torch compile optimization enabled

  5. ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: AC2 BaseOS with both torch compile and selective gradient checkpointing enabled

Figure: E2E training throughput comparison across the five configurations.

Quick start

The following steps show how to pull and run the training-nv-pytorch image using Docker.

Note

To use the training-nv-pytorch image in ACS, select it from the Artifacts page when creating a workload in the console, or specify the image reference in a YAML file.

Step 1: Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

Step 2: Enable compiler and recomputation GPU memory optimization

Enable compilation optimization

Use the transformers Trainer API, as described in PyTorch compiling optimization.

Enable recomputation GPU memory optimization

export CHECKPOINT_OPTIMIZATION=true

Step 3: Start the container and run a training task

The image includes a built-in model training tool, ljperf. The following commands start the container and run a training demo for an LLM workload:

# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b

Step 4: Usage notes

  • Do not reinstall PyTorch, DeepSpeed, or related libraries included in the image.

  • In your DeepSpeed configuration, leave zero_optimization.stage3_prefetch_bucket_size blank or set it to auto.

  • Set the NCCL_SOCKET_IFNAME environment variable based on the number of GPUs per pod:

    GPUs per pod         Setting                    Notes
    1, 2, 4, or 8        NCCL_SOCKET_IFNAME=eth0    Default in this image
    16 (full machine)    NCCL_SOCKET_IFNAME=hpn0    Enables High-Performance Network (HPN)
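The DeepSpeed note above can be expressed as a minimal ZeRO-3 config sketch; the other fields shown are placeholders, and the point is that stage3_prefetch_bucket_size is left as "auto" rather than set to a fixed value:

```python
# Minimal DeepSpeed ZeRO-3 config fragment honoring the usage note:
# stage3_prefetch_bucket_size is left as "auto" so the framework picks the value.
ds_config = {
    "train_batch_size": "auto",
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    },
}
```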

Known issues

None