This topic provides the release notes for training-nv-pytorch version 25.10.
Key features and bug fixes
Key features
Multi-architecture support:
The image now supports both amd64 and aarch64 architectures.
Core component upgrades:
Megatron-Core has been upgraded to version 0.14.0.
Transformer Engine has been upgraded to version 2.4.
vLLM has been upgraded to version 0.11.0.
These upgrades incorporate the latest features from their respective communities.
Bug fixes
No bug fixes in this release.
Contents
| Architecture | aarch64 | amd64 |
| --- | --- | --- |
| Use case | Training / Inference | Training / Inference |
| Framework | PyTorch | PyTorch |
| Requirements | NVIDIA Driver release ≥ 575 | NVIDIA Driver release ≥ 575 |
| Core components | Ubuntu: 24.04<br>CUDA: 12.8<br>Python: 3.12.7+gc<br>torch: 2.8.0.9+nv25.3<br>accelerate: 1.7.0+ali<br>deepspeed: 0.16.9+ali<br>diffusers: 0.34.0<br>flash_attn: 2.8.3<br>flash_attn_3: 3.0.0b1<br>flashinfer-python: 0.2.5<br>gdb: 15.0.50.20240403-git<br>grouped_gemm: 1.1.4<br>megatron-core: 0.14.0<br>mmcv: 2.1.0<br>mmdet: 3.3.0<br>mmengine: 0.10.3<br>opencv-python-headless: 4.11.0.86<br>peft: 0.16.0<br>pytorch-dynamic-profiler: 0.24.11<br>pytorch-triton: 3.4.0<br>ray: 2.50.1<br>timm: 1.0.20<br>transformer_engine: 2.4.0+3cd6870c<br>transformers: 4.56.1+ali<br>ultralytics: 8.3.96<br>vllm: 0.11.0 | Ubuntu: 24.04<br>CUDA: 12.8<br>Python: 3.12.7+gc<br>torch: 2.8.0.9+nv25.3<br>accelerate: 1.7.0+ali<br>deepspeed: 0.16.9+ali<br>diffusers: 0.34.0<br>flash_attn: 2.8.3<br>flash_attn_3: 3.0.0b1<br>flashinfer-python: 0.2.5<br>gdb: 15.0.50.20240403-git<br>grouped_gemm: 1.1.4<br>megatron-core: 0.14.0<br>mmcv: 2.1.0<br>mmdet: 3.3.0<br>mmengine: 0.10.3<br>opencv-python-headless: 4.11.0.86<br>peft: 0.16.0<br>perf: 5.4.30<br>pytorch-dynamic-profiler: 0.24.11<br>ray: 2.50.1<br>timm: 1.0.20<br>transformer_engine: 2.4.0+3cd6870c<br>transformers: 4.56.1+ali<br>triton: 3.4.0<br>ultralytics: 8.3.96<br>vllm: 0.11.0 |
Assets
25.10
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.10-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
{region-id} indicates the region where your ACS is activated, such as cn-beijing and cn-wulanchabu. {image:tag} indicates the name and tag of the image.
Currently, only images in the China (Beijing) region can be pulled over a VPC.
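Filling in the template above can be sketched as follows (illustrative only; the region ID and tag are placeholders that you substitute with your own values):

```python
# Illustrative substitution of the VPC image address template.
region_id = "cn-beijing"  # region where your ACS is activated
image_tag = "training-nv-pytorch:25.10-serverless"  # image name and tag
vpc_image = f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_tag}"
print(vpc_image)
```

The resulting address is what you pass to `docker pull` or reference in a workload manifest.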
This image is suitable for the standard ACS product and the multi-tenant Lingjun environment. It is not suitable for the single-tenant Lingjun environment.
Driver requirements
The 25.10 release is based on CUDA 12.8.0 and requires NVIDIA driver version 575 or later.
However, if you are running on a data center GPU (for example, a T4), you can instead use one of the following NVIDIA driver versions:
470.57 (or a later R470 release)
525.85 (or a later R525 release)
535.86 (or a later R535 release)
545.23 (or a later R545 release)
The CUDA driver's forward compatibility package supports only specific driver series. If you are using an R418, R440, R450, R460, R510, R520, R530, R555, or R560 driver, you must upgrade, because these series are not forward-compatible with CUDA 12.8.
For a complete list of supported drivers, see CUDA application compatibility. For more information, see CUDA compatibility and upgrades.
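The rules above can be sketched as a small version check (illustrative only; it compares the driver's major branch and ignores the minimum minor version required within each legacy branch):

```python
def driver_supports_cuda_12_8(driver_version: str) -> bool:
    """Return True if the driver can run CUDA 12.8 applications.

    Either the driver is release 575 or later, or it belongs to one of the
    legacy branches covered by the CUDA forward-compatibility package on
    data center GPUs (R470, R525, R535, R545).
    """
    major = int(driver_version.split(".")[0])
    forward_compatible_branches = {470, 525, 535, 545}
    return major >= 575 or major in forward_compatible_branches
```

For example, `driver_supports_cuda_12_8("560.35")` returns False because R560 is not a forward-compatible branch.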
Key features and enhancements
PyTorch compiling optimization
The compile optimization introduced in PyTorch 2.0 works well for small-scale training on a single GPU. However, LLM training requires GPU memory optimization and a distributed framework, such as FSDP or DeepSpeed, so torch.compile() may provide little benefit in such training, or even degrade performance.
Controlling the communication granularity in the DeepSpeed framework helps the compiler obtain a complete compute graph, widening the scope of compile optimization.
Optimized PyTorch:
The PyTorch compiler frontend is optimized so that compilation proceeds even when graph breaks occur in a compute graph.
The pattern matching and dynamic shape capabilities are enhanced to generate better compiled code.
With these optimizations, E2E throughput increases by 20% when training an 8B LLM.
GPU memory optimization for recomputation
We forecast and analyze model GPU memory consumption by running performance tests on models deployed in different clusters or configured with different parameters, and by collecting system metrics such as GPU memory utilization. Based on the results, we suggest the optimal number of activation recomputation layers and integrate that suggestion into PyTorch, so users can benefit from GPU memory optimization with no extra effort. Currently, this feature is available in the DeepSpeed framework.
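If you prefer to hand-tune recomputation instead of relying on the suggested value, standard DeepSpeed exposes activation checkpointing options in its config. The fragment below is a sketch of that general shape; the numeric values are illustrative, not recommended settings:

```python
# Illustrative DeepSpeed config fragment for activation recomputation.
ds_config = {
    "zero_optimization": {
        "stage": 3,
    },
    "activation_checkpointing": {
        "partition_activations": True,        # shard activations across GPUs
        "contiguous_memory_optimization": True,
        "number_checkpoints": 4,              # layers to recompute (illustrative)
    },
}
```

In practice, the automatic feature described above chooses the recomputation depth for you, so manual tuning is only needed for special cases.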
End-to-end performance evaluation
Using the cloud-native AI performance benchmarking tool CNP, we conducted a comprehensive end-to-end (E2E) performance comparison of this image against a standard base image, using mainstream open-source models and framework configurations. Additionally, through a series of ablation studies, we evaluated the performance contribution of each individual optimization component to the overall model training performance.
Image performance evaluation: Baseline comparison and iterative analysis

E2E performance contribution of core GPU components
The following tests are based on version 25.10. We conducted E2E training performance evaluation and comparative analysis on a multi-node GPU cluster. The configurations compared include:
Base: The standard NGC PyTorch Image.
ACS AI Image (AC2): The Golden image using the AC2 BaseOS with no optimizations enabled.
ACS AI Image (AC2 + CompilerOpt): The Golden image using the AC2 BaseOS with only the torch.compile optimization enabled.
ACS AI Image (AC2 + CompilerOpt + CkptOpt): The Golden image using the AC2 BaseOS with both torch.compile and selective gradient checkpoint optimizations enabled.

Quick start
The following example shows how to pull the training-nv-pytorch image using Docker.
To use this image in ACS, select it from the Artifact Center in the console when creating a workload, or specify the image reference in a YAML manifest.
1. Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Enable optimizations (compiler + gradient checkpointing)
To enable the compile optimization using the transformers Trainer API:
To enable the gradient checkpointing optimization for memory:
export CHECKPOINT_OPTIMIZATION=true
3. Start the container
The image includes ljperf, a built-in model training tool. The following steps show how to use it to start a container and run a training job.
LLMs
# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
4. Recommendations
The image includes modifications to core libraries such as PyTorch and DeepSpeed. Do not reinstall these packages, as doing so may break the optimizations.
In your DeepSpeed configuration, leave the zero_optimization.stage3_prefetch_bucket_size parameter empty or set it to auto.
The NCCL_SOCKET_IFNAME environment variable is pre-configured in this image and must be adjusted based on your use case:
For single-pod training/inference tasks using 1, 2, 4, or 8 GPUs, set NCCL_SOCKET_IFNAME=eth0. This is the default configuration in this image.
For single-pod training/inference tasks using all 16 GPUs on a full node, set NCCL_SOCKET_IFNAME=hpn0. This setting lets you use the High-Performance Network (HPN).
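The interface-selection rule above can be expressed as a small helper (hypothetical; the function name is ours, and the interface names mirror the guidance in this section):

```python
import os


def nccl_socket_ifname(num_gpus: int) -> str:
    """Pick the NCCL network interface per the recommendations above."""
    if num_gpus == 16:
        return "hpn0"  # full node: use the High-Performance Network
    return "eth0"      # 1, 2, 4, or 8 GPUs: default interface


# Example: a single-pod job on 8 GPUs keeps the default interface.
os.environ["NCCL_SOCKET_IFNAME"] = nccl_socket_ifname(8)
```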