Version 25.08 of the training-nv-pytorch image upgrades three core libraries and delivers end-to-end compiler and memory optimizations for large language model (LLM) training on Alibaba Cloud ACS and Lingjun GPU clusters.
What's new
Upgraded components:
- transformers upgraded to 4.53.3+ali
- vLLM upgraded to 0.10.0
- Ray upgraded to 2.48.0
Bug fixes: None
Contents
| Item | Details |
|---|---|
| Application scenario | Training/Inference |
| Framework | PyTorch |
| Requirements | NVIDIA Driver release >= 575 |
| Core components | |
Assets
25.08
Public image:
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.08-serverless
VPC image:
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace the placeholders with actual values:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | The region where ACS is activated | cn-beijing, cn-wulanchabu |
| {image:tag} | The image name and tag | training-nv-pytorch:25.08-serverless |
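As a worked example, substituting the placeholders for the China (Beijing) region yields the following address (the variable names here are illustrative, not part of the product):

```shell
# Assemble the VPC registry address for cn-beijing
REGION_ID=cn-beijing
IMAGE_TAG=training-nv-pytorch:25.08-serverless
echo "acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
# → acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.08-serverless
```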
Currently, VPC image pulls are supported only in the China (Beijing) region.
The egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.08-serverless image is designed for ACS and Lingjun multi-tenant products. Do not use this image in Lingjun single-tenant scenarios.
Driver requirements
Version 25.08 is based on CUDA 12.8.0 and requires NVIDIA driver version 575 or later.
Exception for data center GPUs (e.g., T4): You can use driver version 470.57 (R470 or later), 525.85 (R525 or later), 535.86 (R535 or later), or 545.23 (R545 or later).
Drivers that require upgrading: R418, R440, R450, R460, R510, R520, R530, R545, R555, and R560 are not forward-compatible with CUDA 12.8 and must be upgraded. For the complete list of supported drivers and compatibility details, see CUDA application compatibility and CUDA compatibility and upgrades.
Key features and enhancements
PyTorch compiling optimization
torch.compile(), introduced in PyTorch 2.0, improves performance for single-GPU training. For LLM training, it provides limited or negative benefit because distributed frameworks like Fully Sharded Data Parallel (FSDP) or DeepSpeed interrupt the compiler's view of the compute graph.
Version 25.08 addresses this with two targeted optimizations:
- Communication granularity control in DeepSpeed: The compiler can now see a complete compute graph across a wider scope, enabling more effective optimization.
- PyTorch compiler frontend improvements: The frontend now handles graph breaks without stopping compilation, and enhanced pattern matching and dynamic shape support generate more efficient compiled code.
Result: End-to-end (E2E) throughput for 8B LLM training increases by 20%.
To enable compilation optimization, use the transformers Trainer API.
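The release notes do not show the exact snippet; as an assumption, stock transformers enables this through `TrainingArguments(torch_compile=True)`, which wraps the model in `torch.compile()`. The underlying mechanism can be sketched directly (using the lightweight `eager` backend so the sketch runs without a GPU or build toolchain; real training uses the default inductor backend):

```python
import torch

def mlp(x, w1, w2):
    # A two-layer MLP; torch.compile captures this function as a graph
    return torch.relu(x @ w1) @ w2

# backend="eager" exercises graph capture (TorchDynamo) without codegen,
# so this illustration runs anywhere; it is not a performance claim.
compiled = torch.compile(mlp, backend="eager")

x, w1, w2 = torch.randn(4, 8), torch.randn(8, 16), torch.randn(16, 2)
out = compiled(x, w1, w2)
print(out.shape)  # torch.Size([4, 2])
```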
GPU memory optimization for recomputation
For LLM training, activation recomputation reduces GPU memory pressure by recomputing intermediate activations during the backward pass instead of storing them. Choosing the right number of recomputation layers requires careful tuning across cluster configurations and model parameters.
Version 25.08 automates this decision. The optimization layer runs performance tests, collects GPU memory utilization metrics across different cluster and parameter configurations, and derives the optimal number of activation recomputation layers. This value is integrated directly into PyTorch, so gradient checkpointing is applied without manual tuning.
To enable recomputation GPU memory optimization:
export CHECKPOINT_OPTIMIZATION=true
This feature is currently available in the DeepSpeed framework only.
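The automated layer selection is internal to the image, but the mechanism it tunes can be illustrated with PyTorch's own `torch.utils.checkpoint`: checkpointed layers discard their intermediate activations in the forward pass and recompute them during backward. In this hand-written sketch (all names are illustrative), `n_ckpt` plays the role of the recomputation layer count that 25.08 derives automatically:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

class Model(nn.Module):
    def __init__(self, dim=64, n_layers=4, n_ckpt=2):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))
        self.n_ckpt = n_ckpt  # how many layers recompute activations instead of storing them

    def forward(self, x):
        for i, blk in enumerate(self.blocks):
            if i < self.n_ckpt:
                # Activations inside blk are recomputed during backward
                x = checkpoint(blk, x, use_reentrant=False)
            else:
                x = blk(x)
        return x

model = Model()
x = torch.randn(8, 64, requires_grad=True)
loss = model(x).sum()
loss.backward()  # checkpointed blocks re-run their forward here
print(x.grad.shape)  # torch.Size([8, 64])
```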
ACCL
ACCL is Alibaba Cloud's in-house High-Performance Network (HPN) communication library for Lingjun. ACCL-N is the GPU acceleration variant, built on NCCL with full compatibility and additional bug fixes, delivering higher performance and stability than standard NCCL.
End-to-end performance gain evaluation
The following evaluation was conducted using CNP, a cloud-native AI performance evaluation and analysis tool, on a multi-node GPU cluster. It compares training throughput across five configurations using mainstream open-source models and framework settings:
- Base: NGC PyTorch image (baseline)
- ACS AI Image: Base+ACCL: Adds the ACCL communication library
- ACS AI Image: AC2+ACCL: AC2 BaseOS with no additional optimizations
- ACS AI Image: AC2+ACCL+CompilerOpt: AC2 BaseOS with torch compile optimization enabled
- ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: AC2 BaseOS with both torch compile and selective gradient checkpointing enabled
Quick start
The following steps show how to pull and run the training-nv-pytorch image using Docker.
To use the training-nv-pytorch image in ACS, select it from the Artifacts page when creating a workload in the console, or specify the image reference in a YAML file.
Step 1: Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
Step 2: Enable compiler and recomputation GPU memory optimization
Enable compilation optimization
Use the transformers Trainer API, as described in Key features and enhancements above.
Enable recomputation GPU memory optimization
export CHECKPOINT_OPTIMIZATION=true
Step 3: Start the container and run a training task
The image includes a built-in model training tool, ljperf. The following commands start the container and run a training demo for an LLM workload:
# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
Step 4: Usage notes
- Do not reinstall PyTorch, DeepSpeed, or related libraries included in the image.
- In your DeepSpeed configuration, leave `zero_optimization.stage3_prefetch_bucket_size` blank or set it to `auto`.
- Set the `NCCL_SOCKET_IFNAME` environment variable based on the number of GPUs per pod:

| GPUs per pod | Setting | Notes |
|---|---|---|
| 1, 2, 4, or 8 | NCCL_SOCKET_IFNAME=eth0 | Default in this image |
| 16 (full machine) | NCCL_SOCKET_IFNAME=hpn0 | Enables High-Performance Network (HPN) |
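A minimal DeepSpeed config fragment reflecting the bucket-size note above; the `"stage": 3` line is an assumption (the `stage3_prefetch_bucket_size` key only applies under ZeRO stage 3), and the rest of a real configuration is omitted:

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto"
  }
}
```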
Known issues
None