training-nv-pytorch 25.12 - Container Compute Service - Alibaba Cloud Documentation Center

This topic provides the release notes for training-nv-pytorch 25.12.

Main features and bug fixes

Main features

vLLM is upgraded to 0.12.0, and flashinfer-python is upgraded to 0.5.3.

Bug fixes

None.

Image name	training-nv-pytorch
Tag	25.12-cu130-serverless	25.12-cu128-serverless
Scenarios	Training/Inference
Framework	PyTorch
Requirements	NVIDIA Driver release >= 580	NVIDIA Driver release >= 575
Supported architectures	amd64 & aarch64	amd64
Core components	25.12-cu130-serverless	25.12-cu128-serverless
Ubuntu	24.04	24.04
Python	3.12.7+gc	3.12.7+gc
CUDA	13.0	12.8
perf	5.4.30	5.4.30
gdb	15.0.50.20240403-git	15.0.50.20240403-git
torch	2.9.0+ali.10.nv25.10	2.8.0.9+nv25.3
triton	3.5.0	3.4.0
transformer_engine	2.9.0+70f53666	2.9.0+70f53666
deepspeed	0.18.1+ali	0.18.1+ali
flash_attn	2.8.3	2.8.3
flash_attn_3	not found	3.0.0b1
transformers	4.57.1+ali	4.57.1+ali
grouped_gemm	1.1.4	1.1.4
accelerate	1.11.0+ali	1.11.0+ali
diffusers	0.34.0	0.34.0
mmengine	0.10.3	0.10.3
mmcv	2.1.0	2.1.0
mmdet	3.3.0	3.3.0
opencv-python-headless	4.11.0.86	4.11.0.86
ultralytics	8.3.96	8.3.96
timm	1.0.22	1.0.22
vllm	0.12.0+cu130	0.12.0+cu128
flashinfer-python	0.5.3	0.5.3
pytorch-dynamic-profiler	0.24.11	0.24.11
peft	0.16.0	0.16.0
ray	2.52.1	2.52.1
megatron-core	0.14.0	0.14.0

Assets

Public images

CUDA 13.0.2 (Driver >= 580, amd64 & aarch64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu130-serverless

CUDA 12.8 (Driver >= 575, amd64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu128-serverless

VPC images

To pull ACS AI container images quickly from within a VPC, replace the public image URI egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with the VPC image URI acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.

{region-id}: The region ID of the ACS product. For more information, see Available regions. Examples: cn-beijing and cn-wulanchabu.
{image:tag}: The name and tag of the AI container image. Examples: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
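
The URI substitution described above can be sketched as a small helper. This is an illustrative sketch only: the function name `to_vpc_uri` and the constant `PUBLIC_REGISTRY` are hypothetical, and the region ID passed in is assumed to be one where ACS is available.

```python
PUBLIC_REGISTRY = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com"


def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Rewrite a public ACS AI image URI to its VPC form for the given region."""
    prefix = PUBLIC_REGISTRY + "/egslingjun/"
    if not public_uri.startswith(prefix):
        raise ValueError(f"not a public ACS AI image URI: {public_uri}")
    image_and_tag = public_uri[len(prefix):]  # the {image:tag} part
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"


print(to_vpc_uri(
    "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"
    "training-nv-pytorch:25.12-cu128-serverless",
    "cn-beijing",
))
# prints acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu128-serverless
```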

Note

This image is suitable for ACS products and Lingjun multi-tenant products. This image is not suitable for Lingjun single-tenant products. Do not use it in a Lingjun single-tenant scenario.

Driver requirements

The 25.12 release is available in two CUDA variants, and the variant you can use depends on your driver version. CUDA 13.0.2 requires NVIDIA driver release 580 or later. CUDA 12.8.0 requires NVIDIA driver release 575 or later. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
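
As a rough illustration of these requirements, the following sketch lists the tags that a node can run, given its driver release and CPU architecture. The minimum driver releases and architectures are copied from the table at the top of this topic. The names `REQUIREMENTS` and `compatible_tags` are hypothetical, and the helper assumes that you pass the major driver release number (for example, 580).

```python
# tag -> (minimum NVIDIA driver release, supported architectures), taken from this topic
REQUIREMENTS = {
    "25.12-cu130-serverless": (580, {"amd64", "aarch64"}),
    "25.12-cu128-serverless": (575, {"amd64"}),
}


def compatible_tags(driver_release: int, arch: str) -> list[str]:
    """Return the image tags whose driver and architecture requirements are met."""
    return [
        tag
        for tag, (min_driver, archs) in REQUIREMENTS.items()
        if driver_release >= min_driver and arch in archs
    ]


print(compatible_tags(575, "amd64"))    # ['25.12-cu128-serverless']
print(compatible_tags(580, "aarch64"))  # ['25.12-cu130-serverless']
```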

Key features and enhancements

PyTorch compiling optimization

The compilation optimization feature introduced in PyTorch 2.0 is suitable for small-scale training on a single GPU. However, LLM training requires GPU memory optimization and a distributed framework, such as FSDP or DeepSpeed. As a result, torch.compile() may bring no benefit to your training or may even degrade performance.

Controlling the communication granularity in the DeepSpeed framework helps the compiler capture a complete compute graph, which widens the scope of compilation optimization.
Optimized PyTorch:
- The frontend of the PyTorch compiler is optimized so that compilation still succeeds when a graph break occurs in a compute graph.
- The pattern matching and dynamic shape capabilities are enhanced to improve the compiled code.

With the preceding optimizations, the end-to-end (E2E) throughput increases by 20% when an 8B LLM is trained.

GPU memory optimization for recomputation

We predict and analyze the GPU memory consumption of models by running performance tests on models deployed in different clusters or configured with different parameters, and by collecting system metrics such as GPU memory utilization. Based on the results, we recommend the optimal number of activation recomputation layers and integrate this recommendation into PyTorch, so that users can benefit from GPU memory optimization with minimal effort. Currently, this feature is available in the DeepSpeed framework.

E2E performance benefit evaluation

Using the cloud-native AI performance evaluation and analysis tool CNP, we ran a comprehensive end-to-end performance comparison against a standard base image, using mainstream open source models and framework configurations. We also performed ablation experiments to evaluate how much each optimization component contributes to the overall model training performance.

Image comparison: Base image and iteration evaluation

E2E performance contribution analysis of core GPU components

The following tests are based on version 25.12. They show an E2E performance evaluation and comparison for training on a multi-node GPU cluster. The comparison items include the following:

Base: NGC PyTorch Image
ACS AI Image: Base+ACCL: The image uses the ACCL communication library.
ACS AI Image: AC2+ACCL: This image uses AC2 BaseOS with no optimizations enabled.
ACS AI Image: AC2+ACCL+CompilerOpt: This image uses AC2 BaseOS with only the torch compile optimization enabled.
ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: This image uses AC2 BaseOS with both torch compile and selective gradient checkpoint optimizations enabled.

Quick start

The following examples show how to pull the training-nv-pytorch image using Docker.

Note

To use the training-nv-pytorch image in ACS, you can select it from the Artifacts page in the console when you create a workload, or specify the image reference in a YAML file.

1. Select an image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Call the API to enable the compiler and recomputation for GPU memory optimization

Enable compilation optimization
Use the transformers Trainer API:
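
A minimal sketch of this step, assuming that the standard `torch_compile` flag of `transformers.TrainingArguments` is the switch intended here. The helper names and the `./output` path are hypothetical placeholders.

```python
def build_training_kwargs(output_dir: str) -> dict:
    """Keyword arguments for transformers.TrainingArguments with compilation enabled."""
    return {
        "output_dir": output_dir,  # hypothetical output path
        "torch_compile": True,     # ask Trainer to wrap the model with torch.compile()
    }


def make_training_args(output_dir: str = "./output"):
    # Imported lazily so that the helper above works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(**build_training_kwargs(output_dir))
```

Pass the object returned by `make_training_args()` as the `args` parameter of `Trainer`, together with your own model and datasets.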
Enable recomputation for GPU memory optimization
```
export CHECKPOINT_OPTIMIZATION=true
```

3. Start the container

The image includes a built-in model training tool named ljperf. The following steps show how to use this tool to start a container and run a training job.

LLM class

# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b

4. Recommendations

The image contains modified versions of libraries such as PyTorch and DeepSpeed. Do not reinstall them.
Leave `zero_optimization.stage3_prefetch_bucket_size` in the DeepSpeed configuration empty or set it to `auto`.
Adjust the built-in environment variable NCCL_SOCKET_IFNAME in this image to match the scenario:
- When a single pod requests 1, 2, 4, or 8 GPUs for a training or inference task, set NCCL_SOCKET_IFNAME=eth0. This is the default configuration in this image.
- When a single pod requests all 16 GPUs on a machine for a training or inference task, you can use the High-Performance Network (HPN). Set NCCL_SOCKET_IFNAME=hpn0.
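
The DeepSpeed and NCCL recommendations above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, ZeRO stage 3 is assumed because the key is specific to that stage, the configuration shown is only a fragment of a full DeepSpeed configuration, and GPU counts that this topic does not mention are rejected rather than guessed.

```python
import json
import os


def nccl_socket_ifname(gpus_per_pod: int) -> str:
    """Pick the NCCL network interface for a pod, following the rules in this topic."""
    if gpus_per_pod in (1, 2, 4, 8):
        return "eth0"  # default configuration in this image
    if gpus_per_pod == 16:
        return "hpn0"  # all 16 GPUs on a machine: High-Performance Network
    raise ValueError(f"GPU count not covered by this topic: {gpus_per_pod}")


def deepspeed_zero3_fragment() -> dict:
    """DeepSpeed configuration fragment with the prefetch bucket size set to auto."""
    return {
        "zero_optimization": {
            "stage": 3,  # assumption: ZeRO stage 3
            "stage3_prefetch_bucket_size": "auto",
        }
    }


# Set the variable before any distributed (NCCL) initialization happens.
os.environ["NCCL_SOCKET_IFNAME"] = nccl_socket_ifname(8)
print(json.dumps(deepspeed_zero3_fragment(), indent=2))
```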

Known issues

Compiling flash_attn_3 (fa3) directly on the CUDA 13.0.2 image fails with an error. This is a known community issue.
