
Container Compute Service: training-nv-pytorch 25.12

Last Updated: Mar 26, 2026

This page covers release notes for the training-nv-pytorch 25.12 image: new features, image contents, quick start instructions, and known issues.

What's new

New features

  • vLLM upgraded to 0.12.0

  • flashinfer-python upgraded to 0.5.3

Bug fixes

None.

Image contents

The following table lists the two image variants in this release, including CUDA versions, driver requirements, supported architectures, and pre-installed components.

| Attribute | 25.12-cu130-serverless | 25.12-cu128-serverless |
| --- | --- | --- |
| Image name | training-nv-pytorch | training-nv-pytorch |
| Scenarios | Training/Inference | Training/Inference |
| Framework | PyTorch | PyTorch |
| Requirements | NVIDIA Driver release >= 580 | NVIDIA Driver release >= 575 |
| Supported architectures | amd64 & aarch64 | amd64 |

Core components (25.12-cu130-serverless)

  • Ubuntu: 24.04
  • Python: 3.12.7+gc
  • CUDA: 13.0
  • perf: 5.4.30
  • gdb: 15.0.50.20240403-git
  • torch: 2.9.0+ali.10.nv25.10
  • triton: 3.5.0
  • transformer_engine: 2.9.0+70f53666
  • deepspeed: 0.18.1+ali
  • flash_attn: 2.8.3
  • flash_attn_3: not included
  • transformers: 4.57.1+ali
  • grouped_gemm: 1.1.4
  • accelerate: 1.11.0+ali
  • diffusers: 0.34.0
  • mmengine: 0.10.3
  • mmcv: 2.1.0
  • mmdet: 3.3.0
  • opencv-python-headless: 4.11.0.86
  • ultralytics: 8.3.96
  • timm: 1.0.22
  • vllm: 0.12.0+cu130
  • flashinfer-python: 0.5.3
  • pytorch-dynamic-profiler: 0.24.11
  • peft: 0.16.0
  • ray: 2.52.1
  • megatron-core: 0.14.0

Core components (25.12-cu128-serverless)

  • Ubuntu: 24.04
  • Python: 3.12.7+gc
  • CUDA: 12.8
  • perf: 5.4.30
  • gdb: 15.0.50.20240403-git
  • torch: 2.8.0.9+nv25.3
  • triton: 3.4.0
  • transformer_engine: 2.9.0+70f53666
  • deepspeed: 0.18.1+ali
  • flash_attn: 2.8.3
  • flash_attn_3: 3.0.0b1
  • transformers: 4.57.1+ali
  • grouped_gemm: 1.1.4
  • accelerate: 1.11.0+ali
  • diffusers: 0.34.0
  • mmengine: 0.10.3
  • mmcv: 2.1.0
  • mmdet: 3.3.0
  • opencv-python-headless: 4.11.0.86
  • ultralytics: 8.3.96
  • timm: 1.0.22
  • vllm: 0.12.0+cu128
  • flashinfer-python: 0.5.3
  • pytorch-dynamic-profiler: 0.24.11
  • peft: 0.16.0
  • ray: 2.52.1
  • megatron-core: 0.14.0

Image assets

Public images

CUDA 13.0.2 (Driver >= 580, amd64 & aarch64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu130-serverless

CUDA 12.8 (Driver >= 575, amd64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu128-serverless

VPC images

To pull ACS AI container images faster within a VPC, replace the image URI prefix:

  • Replace: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}

  • With: acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

| Placeholder | Description | Example |
| --- | --- | --- |
| {region-id} | Region ID of the ACS product. For a full list, see Available regions. | cn-beijing, cn-wulanchabu |
| {image:tag} | Image name and tag | training-nv-pytorch:25.12-cu130-serverless |

This image is compatible with ACS products and Lingjun multi-tenant products. It is not compatible with Lingjun single-tenant products.
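The prefix substitution described above can be sketched as a small helper. This is an illustrative snippet, not an official tool; the function name `to_vpc_uri` is hypothetical, and `cn-beijing` is just one example region:

```python
# Hypothetical helper: rewrite a public egslingjun image URI to its VPC
# equivalent by swapping the registry prefix, as described above.
PUBLIC_PREFIX = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"

def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Return the VPC registry URI for a given public image URI and region."""
    if not public_uri.startswith(PUBLIC_PREFIX):
        raise ValueError("not a public egslingjun image URI")
    image_and_tag = public_uri[len(PUBLIC_PREFIX):]
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"

print(to_vpc_uri(
    PUBLIC_PREFIX + "training-nv-pytorch:25.12-cu130-serverless",
    "cn-beijing",
))
# → acs-registry-vpc.cn-beijing.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.12-cu130-serverless
```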

Driver requirements

The 25.12 release provides two CUDA variants with different driver requirements:

| Image tag | CUDA version | Minimum driver version |
| --- | --- | --- |
| 25.12-cu130-serverless | CUDA 13.0.2 | NVIDIA Driver 580 |
| 25.12-cu128-serverless | CUDA 12.8.0 | NVIDIA Driver 575 |

For driver compatibility details, see:

Key features and enhancements

PyTorch compiler optimization

torch.compile(), introduced in PyTorch 2.0, works well for single-GPU training. For LLM training on distributed frameworks such as Fully Sharded Data Parallel (FSDP) or DeepSpeed, the compiler cannot capture the full compute graph, which limits or negates its benefit.

This release addresses that limitation with two improvements:

  • Finer communication granularity in DeepSpeed: By controlling the granularity of communication operations, the compiler can obtain a wider compute graph scope and apply more aggressive optimization.

  • Frontend compiler improvements: The PyTorch compiler frontend is updated to handle graph breaks without stopping compilation. Mode matching and dynamic shape support are also enhanced.

These optimizations deliver a 20% end-to-end (E2E) throughput improvement when training an 8B LLM.

GPU memory optimization for recomputation

This release includes an automated activation recomputation tuner. It analyzes GPU memory consumption across model configurations and cluster deployments by collecting metrics such as GPU memory utilization. Based on the analysis, it determines the optimal number of activation recomputation layers and integrates the recommendation directly into PyTorch.

This feature is currently available in the DeepSpeed framework.

E2E performance evaluation

Performance was measured using CNP (Cloud-Native AI Performance evaluation and analysis tool). The evaluation compares E2E training throughput across the following configurations on a multi-node GPU cluster, using mainstream open-source models.

Image comparison: base image and iterative improvements

(Figure: E2E training throughput comparison between the base image and iteratively improved images.)

E2E performance contribution analysis of core GPU components

The following tests are based on version 25.12:

  1. Base: NGC PyTorch image

  2. ACS AI image: Base+ACCL: Uses ACCL (Alibaba Cloud Communication Library)

  3. ACS AI image: AC2+ACCL: Uses AC2 BaseOS with no optimizations enabled

  4. ACS AI image: AC2+ACCL+CompilerOpt: Uses AC2 BaseOS with only the torch compile optimization enabled

  5. ACS AI image: AC2+ACCL+CompilerOpt+CkptOpt: Uses AC2 BaseOS with both torch compile and selective gradient checkpointing enabled

(Figure: E2E performance contribution of each core component, comparing configurations 1-5 above.)

Quick start

The following examples show how to pull and run the training-nv-pytorch image using Docker.

To use this image in ACS, select it from the Artifacts page in the console when creating a workload, or specify the image reference in a YAML file.

1. Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

Replace [tag] with the target image tag, such as 25.12-cu130-serverless or 25.12-cu128-serverless.

2. Enable compiler and memory optimizations

Enable compiler optimization

Use the Hugging Face Transformers Trainer API:

(Figure: example code enabling compiler optimization through the Hugging Face Transformers Trainer API.)

Enable GPU memory optimization for recomputation

export CHECKPOINT_OPTIMIZATION=true

3. Start the container

The image includes a built-in model training tool, ljperf. The following example starts the container and runs a training job.

LLM workloads

# Start the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b

4. Configuration recommendations

Follow these recommendations when using this image:

  • Do not reinstall pre-bundled libraries such as PyTorch and DeepSpeed. The image is tuned with specific library versions; reinstalling them may break optimizations.

  • Leave zero_optimization.stage3_prefetch_bucket_size in your DeepSpeed configuration blank or set it to auto.

  • Set NCCL_SOCKET_IFNAME based on the number of GPUs requested per pod:

    | GPU count per pod | Setting |
    | --- | --- |
    | 1, 2, 4, or 8 GPUs | NCCL_SOCKET_IFNAME=eth0 (default) |
    | 16 GPUs (all GPUs on the machine, using High-Performance Network (HPN)) | NCCL_SOCKET_IFNAME=hpn0 |
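For the stage3_prefetch_bucket_size recommendation above, a minimal DeepSpeed config fragment looks like the following (the field names are standard DeepSpeed ZeRO keys; surrounding values are illustrative):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto"
  }
}
```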
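The NCCL interface selection above can be expressed as a small helper. This is an illustrative sketch; the function name `nccl_socket_ifname` is hypothetical, and the eth0/hpn0 mapping comes directly from the recommendation table:

```python
# Hypothetical helper: pick NCCL_SOCKET_IFNAME from the per-pod GPU count,
# following the recommendation table (1/2/4/8 GPUs -> eth0, 16 GPUs -> hpn0).
def nccl_socket_ifname(gpus_per_pod: int) -> str:
    if gpus_per_pod in (1, 2, 4, 8):
        return "eth0"  # default interface
    if gpus_per_pod == 16:
        return "hpn0"  # all GPUs on the machine, High-Performance Network
    raise ValueError(f"unsupported GPU count per pod: {gpus_per_pod}")

print(nccl_socket_ifname(8))   # → eth0
print(nccl_socket_ifname(16))  # → hpn0
```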

Known issues

Compiling fa3 on CUDA 13.0.2 fails

Condition: Using the CUDA 13.0.2 (25.12-cu130-serverless) image and compiling flash-attention 3 (fa3) directly inside the container.

Impact: The compilation fails with an error.

Workaround: This is a known community issue. Do not compile fa3 directly on the CUDA 13.0.2 image. Use the CUDA 12.8 (25.12-cu128-serverless) image if fa3 compilation is required.