Container Compute Service: training-nv-pytorch 25.12

Last Updated: Dec 26, 2025

This topic provides the release notes for training-nv-pytorch 25.12.

Main features and bug fixes

Main features

  • vLLM is upgraded to 0.12.0, and flashinfer-python is upgraded to 0.5.3.

Bug fixes

None.

Contents

Image name

training-nv-pytorch

Tag

25.12-cu130-serverless

25.12-cu128-serverless

Scenarios

Training/Inference

Framework

PyTorch

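Requirements

  • 25.12-cu130-serverless: NVIDIA Driver release >= 580

  • 25.12-cu128-serverless: NVIDIA Driver release >= 575

Supported architectures

  • 25.12-cu130-serverless: amd64 & aarch64

  • 25.12-cu128-serverless: amd64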

Core components (25.12-cu130-serverless)

  • Ubuntu : 24.04

  • Python : 3.12.7+gc

  • CUDA : 13.0

  • perf : 5.4.30

  • gdb : 15.0.50.20240403-git

  • torch : 2.9.0+ali.10.nv25.10

  • triton : 3.5.0

  • transformer_engine : 2.9.0+70f53666

  • deepspeed : 0.18.1+ali

  • flash_attn : 2.8.3

  • flash_attn_3 : not installed (see Known issues)

  • transformers : 4.57.1+ali

  • grouped_gemm : 1.1.4

  • accelerate : 1.11.0+ali

  • diffusers : 0.34.0

  • mmengine : 0.10.3

  • mmcv : 2.1.0

  • mmdet : 3.3.0

  • opencv-python-headless : 4.11.0.86

  • ultralytics : 8.3.96

  • timm : 1.0.22

  • vllm : 0.12.0+cu130

  • flashinfer-python : 0.5.3

  • pytorch-dynamic-profiler : 0.24.11

  • peft : 0.16.0

  • ray : 2.52.1

  • megatron-core : 0.14.0

Core components (25.12-cu128-serverless)

  • Ubuntu : 24.04

  • Python : 3.12.7+gc

  • CUDA : 12.8

  • perf : 5.4.30

  • gdb : 15.0.50.20240403-git

  • torch : 2.8.0.9+nv25.3

  • triton : 3.4.0

  • transformer_engine : 2.9.0+70f53666

  • deepspeed : 0.18.1+ali

  • flash_attn : 2.8.3

  • flash_attn_3 : 3.0.0b1

  • transformers : 4.57.1+ali

  • grouped_gemm : 1.1.4

  • accelerate : 1.11.0+ali

  • diffusers : 0.34.0

  • mmengine : 0.10.3

  • mmcv : 2.1.0

  • mmdet : 3.3.0

  • opencv-python-headless : 4.11.0.86

  • ultralytics : 8.3.96

  • timm : 1.0.22

  • vllm : 0.12.0+cu128

  • flashinfer-python : 0.5.3

  • pytorch-dynamic-profiler : 0.24.11

  • peft : 0.16.0

  • ray : 2.52.1

  • megatron-core : 0.14.0

Assets

Public images

CUDA 13.0.2 (Driver >= 580, amd64 & aarch64)

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu130-serverless

CUDA 12.8 (Driver >= 575, amd64)

  • egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu128-serverless

VPC images

To quickly pull ACS AI container images within a VPC, replace the specified AI container image URI egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.

  • {region-id}: The region ID of the ACS product. For more information, see Available regions. Examples: cn-beijing and cn-wulanchabu.

  • {image:tag}: The name and tag of the AI container image. Examples: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
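The URI replacement described above can be sketched as a small helper. This is a minimal illustration, not an official tool: the function name `to_vpc_uri` and the constant `PUBLIC_PREFIX` are hypothetical, and only the two URI patterns come from this topic.

```python
# Public registry prefix, as given in this topic.
PUBLIC_PREFIX = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"


def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Rewrite a public image URI to its VPC form for the given region.

    Hypothetical helper: it only performs the string substitution
    described in this topic and does not check that the region exists.
    """
    if not public_uri.startswith(PUBLIC_PREFIX):
        raise ValueError(f"not a public ACS AI image URI: {public_uri}")
    image_and_tag = public_uri[len(PUBLIC_PREFIX):]
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"


public = PUBLIC_PREFIX + "training-nv-pytorch:25.12-cu130-serverless"
print(to_vpc_uri(public, "cn-beijing"))
# acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu130-serverless
```

The helper rejects URIs that do not start with the public prefix, so a typo fails early instead of producing a URI that cannot be pulled.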

Note

This image is suitable for ACS products and Lingjun multi-tenant products. It is not suitable for Lingjun single-tenant products, so do not use it in a Lingjun single-tenant scenario.

Driver requirements

  • The 25.12 release provides images based on CUDA 12.8.0 and CUDA 13.0.2. Which image you can use depends on the driver version. CUDA 13.0.2 requires NVIDIA driver release 580 or later. CUDA 12.8.0 requires NVIDIA driver release 575 or later. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
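The driver and architecture requirements can be turned into a simple tag selector. The sketch below is illustrative only: `pick_tag` is a hypothetical helper, the driver string is assumed to be in the dotted form that `nvidia-smi --query-gpu=driver_version --format=csv,noheader` reports, and preferring the CUDA 13.0.2 image when both images qualify is an arbitrary choice made for this example.

```python
def pick_tag(driver_version: str, arch: str = "amd64") -> str:
    """Return a 25.12 image tag that the given driver and CPU architecture can run.

    Hypothetical helper based on the requirements in this topic:
      - cu130: driver >= 580, amd64 or aarch64
      - cu128: driver >= 575, amd64 only
    """
    major = int(driver_version.split(".")[0])
    if major >= 580 and arch in ("amd64", "aarch64"):
        return "25.12-cu130-serverless"
    if major >= 575 and arch == "amd64":
        return "25.12-cu128-serverless"
    raise ValueError(
        f"no 25.12 image supports driver {driver_version} on {arch}"
    )


print(pick_tag("580.65.06"))   # 25.12-cu130-serverless
print(pick_tag("575.57.08"))   # 25.12-cu128-serverless
```

A driver older than 575, or an aarch64 host with a driver older than 580, raises `ValueError` because neither image lists it as supported.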

Key features and enhancements

PyTorch compiling optimization

The compilation optimization introduced in PyTorch 2.0, torch.compile(), works well for small-scale training on a single GPU. LLM training, however, also relies on GPU memory optimization and a distributed framework such as FSDP or DeepSpeed. In that setting, torch.compile() may bring no benefit to training, or may even slow it down. This image addresses the problem in the following ways:

  • Controlling the communication granularity in the DeepSpeed framework lets the compiler capture a complete compute graph, which widens the scope of compilation optimization.

  • Optimized PyTorch:

    • The frontend of the PyTorch compiler is optimized so that compilation still proceeds when a graph break occurs in a compute graph.

    • The pattern matching and dynamic shape capabilities are enhanced to optimize the compiled code.

With these optimizations, E2E throughput increases by 20% when training an 8B LLM.

GPU memory optimization for recomputation

We run performance tests on models deployed in different clusters or configured with different parameters, and collect system metrics such as GPU memory utilization, to forecast and analyze each model's GPU memory consumption. Based on the results, we recommend the optimal number of activation recomputation layers and integrate this recommendation into PyTorch, so that users can benefit from GPU memory optimization with little effort. Currently, this feature is available in the DeepSpeed framework.
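The trade-off behind choosing a number of recomputation layers can be illustrated with a toy model. This is not the algorithm used in the image, which this topic does not describe. The linear memory model, the function name, and every number below are assumptions invented for illustration: each recomputed layer keeps only a small checkpoint instead of its full activations, and the goal is the smallest number of recomputed layers that fits the memory budget, because recomputation costs extra compute.

```python
from typing import Optional


def min_recompute_layers(
    num_layers: int,
    fixed_gb: float,           # weights, gradients, optimizer state (assumed constant)
    act_gb_per_layer: float,   # activations kept by a layer that is NOT recomputed
    ckpt_gb_per_layer: float,  # checkpoint kept by a layer that IS recomputed
    budget_gb: float,
) -> Optional[int]:
    """Return the fewest recomputed layers whose estimated peak memory fits the budget.

    Toy linear model (an assumption, not the image's method):
      peak = fixed + act * (num_layers - k) + ckpt * k
    Returns None when even recomputing every layer does not fit.
    """
    for k in range(num_layers + 1):
        peak = fixed_gb + act_gb_per_layer * (num_layers - k) + ckpt_gb_per_layer * k
        if peak <= budget_gb:
            return k
    return None


# Made-up example values: 32 layers, 40 GB fixed state, 80 GB budget.
print(min_recompute_layers(32, 40.0, 1.5, 0.1, 80.0))  # 6
```

With these made-up inputs, recomputing 5 layers is estimated at 81 GB and recomputing 6 layers at 79.6 GB, so 6 is the smallest count that fits.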

E2E performance benefit evaluation

Using CNP, a cloud-native AI performance evaluation and analysis tool, we ran a comprehensive end-to-end performance comparison against a standard base image, using mainstream open source models and framework configurations. We also ran ablation experiments to measure how much each optimization component contributes to overall model training performance.

Image comparison: Base image and iteration evaluation

[Figure: comparison of the base image and successive image iterations]

E2E performance contribution analysis of core GPU components

The following tests are based on version 25.12. They show an E2E performance evaluation and comparison for training on a multi-node GPU cluster. The comparison items include the following:

  1. Base: NGC PyTorch Image

  2. ACS AI Image: Base+ACCL: The image uses the ACCL communication library.

  3. ACS AI Image: AC2+ACCL: This image uses AC2 BaseOS with no optimizations enabled.

  4. ACS AI Image: AC2+ACCL+CompilerOpt: This image uses AC2 BaseOS with only the torch compile optimization enabled.

  5. ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: This image uses AC2 BaseOS with both torch compile and selective gradient checkpoint optimizations enabled.

[Figure: E2E performance comparison of the five configurations listed above]

Quick start

The following examples show how to pull the training-nv-pytorch image using Docker.

Note

To use the training-nv-pytorch image in ACS, you can select it from the Artifacts page in the console when you create a workload, or specify the image reference in a YAML file.

1. Select an image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Enable compilation optimization and recomputation-based GPU memory optimization

  • Enable compilation optimization

    Use the transformers Trainer API:

    [Screenshot: enabling compilation optimization through the transformers Trainer API]

  • Enable recomputation for GPU memory optimization

    export CHECKPOINT_OPTIMIZATION=true
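The two switches above might look as follows in Python. This is a hedged sketch: the original page shows the Trainer setting only as a screenshot, so the use of the `torch_compile` argument of Hugging Face `TrainingArguments` is an assumption about what that screenshot contained. The helper names and the `./output` directory are hypothetical. Only the `CHECKPOINT_OPTIMIZATION=true` variable comes directly from this topic.

```python
import os


def build_training_kwargs(output_dir: str = "./output") -> dict:
    """Keyword arguments for transformers.TrainingArguments with compilation enabled.

    Assumption: torch_compile=True is the Trainer-level switch meant by
    "Enable compilation optimization" in this topic.
    """
    return {"output_dir": output_dir, "torch_compile": True}


def enable_recompute_optimization(env=os.environ) -> None:
    """Python equivalent of `export CHECKPOINT_OPTIMIZATION=true`."""
    env["CHECKPOINT_OPTIMIZATION"] = "true"


def make_training_arguments(output_dir: str = "./output"):
    """Build TrainingArguments inside the container, where transformers is installed."""
    from transformers import TrainingArguments  # imported lazily on purpose

    return TrainingArguments(**build_training_kwargs(output_dir))


# Set the variable before any training code runs, then build the arguments.
enable_recompute_optimization()
print(build_training_kwargs())
```

This topic does not state when the image reads `CHECKPOINT_OPTIMIZATION`, so setting it before the training process imports PyTorch, or exporting it in the shell as shown above, is the conservative choice.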

3. Start the container

The image includes a built-in model training tool named ljperf. The following steps start a container and then use ljperf to run a training job.

LLM class

# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b

4. Recommendations

  • The optimizations in this image are implemented in libraries such as PyTorch and DeepSpeed. Do not reinstall these libraries.

  • In the DeepSpeed configuration, leave `zero_optimization.stage3_prefetch_bucket_size` unset or set it to `auto`.

  • Adjust the built-in environment variable NCCL_SOCKET_IFNAME to match the scenario:

    • When a single pod requests 1, 2, 4, or 8 GPUs for a training or inference task, set NCCL_SOCKET_IFNAME=eth0. This is the default configuration in this image.

    • When a single pod requests all 16 GPUs on a machine for a training or inference task, you can use the High-Performance Network (HPN). In this case, set NCCL_SOCKET_IFNAME=hpn0.
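The DeepSpeed and NCCL recommendations above can be sketched in a few lines. The helper name `nccl_socket_ifname` is hypothetical, and `"stage": 3` is an assumption made because `stage3_prefetch_bucket_size` is a ZeRO stage 3 setting. The `auto` value and the interface names come from this topic.

```python
import json


def nccl_socket_ifname(gpus_per_pod: int) -> str:
    """Map the number of GPUs requested by one pod to the recommended interface."""
    if gpus_per_pod in (1, 2, 4, 8):
        return "eth0"  # default configuration in the image
    if gpus_per_pod == 16:
        return "hpn0"  # whole machine: High-Performance Network
    raise ValueError(f"no recommendation for {gpus_per_pod} GPUs per pod")


# Minimal DeepSpeed configuration fragment following the recommendation.
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,  # assumption: ZeRO stage 3 is in use
        "stage3_prefetch_bucket_size": "auto",
    }
}

print(json.dumps(deepspeed_config, indent=2))
print("NCCL_SOCKET_IFNAME=" + nccl_socket_ifname(16))  # NCCL_SOCKET_IFNAME=hpn0
```

Omitting the `stage3_prefetch_bucket_size` key entirely also follows the recommendation, since the setting can be left unset.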

Known issues

Compiling flash_attn_3 (fa3) directly on the CUDA 13.0.2 image fails with an error. This is a known community issue.