This topic provides the release notes for training-nv-pytorch version 25.10.
Key features and bug fixes
Key features
Multi-architecture support:
The image now supports both amd64 and aarch64 architectures.
Core component upgrades:
Megatron-Core has been upgraded to version 0.14.0.
Transformer Engine has been upgraded to version 2.4.
vLLM has been upgraded to version 0.11.0.
These upgrades incorporate the latest features from their respective communities.
Bug fixes
No bug fixes in this release.
Contents
| Architecture | aarch64 | amd64 |
| --- | --- | --- |
| Use case | Training / Inference | Training / Inference |
| Framework | PyTorch | PyTorch |
| Requirements | NVIDIA Driver release ≥ 575 | NVIDIA Driver release ≥ 575 |
| Core components | Ubuntu: 24.04<br>CUDA: 12.8<br>Python: 3.12.7+gc<br>torch: 2.8.0.9+nv25.3<br>accelerate: 1.7.0+ali<br>deepspeed: 0.16.9+ali<br>diffusers: 0.34.0<br>flash_attn: 2.8.3<br>flash_attn_3: 3.0.0b1<br>flashinfer-python: 0.2.5<br>gdb: 15.0.50.20240403-git<br>grouped_gemm: 1.1.4<br>megatron-core: 0.14.0<br>mmcv: 2.1.0<br>mmdet: 3.3.0<br>mmengine: 0.10.3<br>opencv-python-headless: 4.11.0.86<br>peft: 0.16.0<br>pytorch-dynamic-profiler: 0.24.11<br>pytorch-triton: 3.4.0<br>ray: 2.50.1<br>timm: 1.0.20<br>transformer_engine: 2.4.0+3cd6870c<br>transformers: 4.56.1+ali<br>ultralytics: 8.3.96<br>vllm: 0.11.0 | Ubuntu: 24.04<br>CUDA: 12.8<br>Python: 3.12.7+gc<br>torch: 2.8.0.9+nv25.3<br>accelerate: 1.7.0+ali<br>deepspeed: 0.16.9+ali<br>diffusers: 0.34.0<br>flash_attn: 2.8.3<br>flash_attn_3: 3.0.0b1<br>flashinfer-python: 0.2.5<br>gdb: 15.0.50.20240403-git<br>grouped_gemm: 1.1.4<br>megatron-core: 0.14.0<br>mmcv: 2.1.0<br>mmdet: 3.3.0<br>mmengine: 0.10.3<br>opencv-python-headless: 4.11.0.86<br>peft: 0.16.0<br>perf: 5.4.30<br>pytorch-dynamic-profiler: 0.24.11<br>ray: 2.50.1<br>timm: 1.0.20<br>transformer_engine: 2.4.0+3cd6870c<br>transformers: 4.56.1+ali<br>triton: 3.4.0<br>ultralytics: 8.3.96<br>vllm: 0.11.0 |
Assets
25.10
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.10-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} indicates the region where your ACS is activated, such as cn-beijing and cn-wulanchabu. {image:tag} indicates the name and tag of the image.
Currently, only images in the China (Beijing) region can be pulled over a VPC.
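Filling in the template above can be sketched as follows (illustrative only; the region ID and tag are placeholders that you substitute with your own values):

```python
# Illustrative substitution of the VPC image address template.
region_id = "cn-beijing"  # region where your ACS is activated
image_tag = "training-nv-pytorch:25.10-serverless"  # image name and tag
vpc_image = f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_tag}"
print(vpc_image)
```

The resulting address is what you pass to `docker pull` or reference in a workload manifest.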
This image is suitable for the standard ACS product and the multi-tenant Lingjun environment. It is not suitable for the single-tenant Lingjun environment.
Driver requirements
The 25.10 release is based on CUDA 12.8.0 and requires NVIDIA driver version 575 or later.
However, if you are running on a data center GPU (for example, a T4), you can instead use one of the following NVIDIA driver versions:
470.57 (or a later R470 release)
525.85 (or a later R525 release)
535.86 (or a later R535 release)
545.23 (or a later R545 release)
The CUDA driver's forward compatibility package supports only specific driver series. If you are using an R418, R440, R450, R460, R510, R520, R530, R555, or R560 driver, you must upgrade, because these series are not forward-compatible with CUDA 12.8.
For a complete list of supported drivers, see CUDA application compatibility. For more information, see CUDA compatibility and upgrades.
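The rules above can be sketched as a small version check (illustrative only; it compares the driver's major branch and ignores the minimum minor version required within each legacy branch):

```python
def driver_supports_cuda_12_8(driver_version: str) -> bool:
    """Return True if the driver can run CUDA 12.8 applications.

    Either the driver is release 575 or later, or it belongs to one of the
    legacy branches covered by the CUDA forward-compatibility package on
    data center GPUs (R470, R525, R535, R545).
    """
    major = int(driver_version.split(".")[0])
    forward_compatible_branches = {470, 525, 535, 545}
    return major >= 575 or major in forward_compatible_branches
```

For example, `driver_supports_cuda_12_8("560.35")` returns False because R560 is not a forward-compatible branch.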
Key features and enhancements
PyTorch compiling optimization
The compile optimization introduced in PyTorch 2.0 works well for small-scale training on a single GPU. However, LLM training requires GPU memory optimization and a distributed framework, such as FSDP or DeepSpeed, so torch.compile() may provide little benefit in such training, or even degrade performance.
Controlling the communication granularity in the DeepSpeed framework helps the compiler obtain a complete compute graph, widening the scope of compile optimization.
Optimized PyTorch:
The PyTorch compiler frontend is optimized so that compilation proceeds even when graph breaks occur in a compute graph.
The pattern matching and dynamic shape capabilities are enhanced to generate better compiled code.
With these optimizations, E2E throughput increases by 20% when training an 8B LLM.
GPU memory optimization for recomputation
We forecast and analyze model GPU memory consumption by running performance tests on models deployed in different clusters or configured with different parameters, and by collecting system metrics such as GPU memory utilization. Based on the results, we suggest the optimal number of activation recomputation layers and integrate that suggestion into PyTorch, so users can benefit from GPU memory optimization with no extra effort. Currently, this feature is available in the DeepSpeed framework.
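If you prefer to hand-tune recomputation instead of relying on the suggested value, standard DeepSpeed exposes activation checkpointing options in its config. The fragment below is a sketch of that general shape; the numeric values are illustrative, not recommended settings:

```python
# Illustrative DeepSpeed config fragment for activation recomputation.
ds_config = {
    "zero_optimization": {
        "stage": 3,
    },
    "activation_checkpointing": {
        "partition_activations": True,        # shard activations across GPUs
        "contiguous_memory_optimization": True,
        "number_checkpoints": 4,              # layers to recompute (illustrative)
    },
}
```

In practice, the automatic feature described above chooses the recomputation depth for you, so manual tuning is only needed for special cases.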
End-to-end performance evaluation
Using the cloud-native AI performance benchmarking tool CNP, we conducted a comprehensive end-to-end (E2E) performance comparison of this image against a standard base image, using mainstream open-source models and framework configurations. Additionally, through a series of ablation studies, we evaluated the performance contribution of each individual optimization component to the overall model training performance.
Image performance evaluation: Baseline comparison and iterative analysis

E2E performance contribution of core GPU components
The following tests are based on version 25.10. We conducted E2E training performance evaluation and comparative analysis on a multi-node GPU cluster. The configurations compared include:
Base: The standard NGC PyTorch Image.
ACS AI Image (AC2): The Golden image using the AC2 BaseOS with no optimizations enabled.
ACS AI Image (AC2 + CompilerOpt): The Golden image using the AC2 BaseOS with only the torch.compile optimization enabled.
ACS AI Image (AC2 + CompilerOpt + CkptOpt): The Golden image using the AC2 BaseOS with both torch.compile and selective gradient checkpoint optimizations enabled.

Quick start
The following example shows how to pull the training-nv-pytorch image using Docker.
To use this image in ACS, select it from the Artifact Center in the console when creating a workload, or specify the image reference in a YAML manifest.
1. Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Enable optimizations (compiler + gradient checkpointing)
To enable the compile optimization using the transformers Trainer API:
To enable the gradient checkpointing optimization for memory:
export CHECKPOINT_OPTIMIZATION=true
3. Start the container
The image includes ljperf, a built-in model training tool. The following steps show how to use it to start a container and run a training job.
LLMs
# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
4. Recommendations
The image includes modifications to core libraries such as PyTorch and DeepSpeed. Do not reinstall these packages, as doing so may break the optimizations.
In your DeepSpeed configuration, leave the zero_optimization.stage3_prefetch_bucket_size parameter empty or set it to auto.
The NCCL_SOCKET_IFNAME environment variable is pre-configured in this image and must be adjusted based on your use case:
For single-pod training/inference tasks using 1, 2, 4, or 8 GPUs, set NCCL_SOCKET_IFNAME=eth0. This is the default configuration in this image.
For single-pod training/inference tasks using all 16 GPUs on a full node, set NCCL_SOCKET_IFNAME=hpn0. This setting lets you use the High-Performance Network (HPN).
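The interface-selection rule above can be expressed as a small helper (hypothetical; the function name is ours, and the interface names mirror the guidance in this section):

```python
import os


def nccl_socket_ifname(num_gpus: int) -> str:
    """Pick the NCCL network interface per the recommendations above."""
    if num_gpus == 16:
        return "hpn0"  # full node: use the High-Performance Network
    return "eth0"      # 1, 2, 4, or 8 GPUs: default interface


# Example: a single-pod job on 8 GPUs keeps the default interface.
os.environ["NCCL_SOCKET_IFNAME"] = nccl_socket_ifname(8)
```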