
Container Compute Service: training-nv-pytorch 25.11

Last Updated: Mar 26, 2026

Release 25.11 adds multi-architecture support, upgrades PyTorch to 2.9, and delivers a 20% end-to-end throughput improvement for 8B LLM training.

What's new

Multi-architecture support

The image now supports both amd64 and aarch64 architectures, with two CUDA variants:

  • CUDA 13.0.2 — requires NVIDIA driver ≥ 580; supports amd64 and aarch64

  • CUDA 12.8 — requires NVIDIA driver ≥ 575; supports amd64 only
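
To pick a variant, compare the host's driver major version against the thresholds above. A minimal sketch (the nvidia-smi query is standard; the helper name is ours):

```shell
# pick_tag: map an NVIDIA driver version string to the matching image tag.
# Thresholds come from this release's driver requirements.
pick_tag() {
  major=${1%%.*}          # keep only the major version, e.g. "580.65.06" -> 580
  if [ "$major" -ge 580 ]; then
    echo "25.11-cu130-serverless"
  elif [ "$major" -ge 575 ]; then
    echo "25.11-cu128-serverless"
  else
    echo "driver $1 is too old for this release" >&2
    return 1
  fi
}

# On a real host, query the live driver version:
#   pick_tag "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
pick_tag "580.65.06"
```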

Core component upgrades

Component Version Notes
PyTorch 2.9
Transformers 4.57.1
DeepSpeed 0.18.1
TransformerEngine 2.8 Adds Qwen3-VL support
vLLM 0.11.2

This release contains no bug fixes.

Image contents

The image ships in two CUDA variants. Choose the variant that matches your NVIDIA driver version.

CUDA 13.0 image (amd64 and aarch64)

Requirements: NVIDIA driver ≥ 580

Component Version
Ubuntu 24.04
Python 3.12.7+gc
CUDA 13.0
torch 2.9.0+ali.10.nv25.10
triton 3.5.0
transformer_engine 2.9.0+70f53666
deepspeed 0.18.1+ali
flash_attn 2.8.3
transformers 4.57.1+ali
grouped_gemm 1.1.4
accelerate 1.11.0+ali
diffusers 0.34.0
mmengine 0.10.3
mmcv 2.1.0
mmdet 3.3.0
opencv-python-headless 4.11.0.86
ultralytics 8.3.96
timm 1.0.22
vllm 0.11.2+cu130
flashinfer-python 0.5.2
pytorch-dynamic-profiler 0.24.11
peft 0.16.0
ray 2.52.0
megatron-core 0.14.0
perf 5.4.30
gdb 15.0.50

CUDA 12.8 image (amd64 only)

Requirements: NVIDIA driver ≥ 575

Component Version
Ubuntu 24.04
Python 3.12.7+gc
CUDA 12.8
torch 2.8.0.9+nv25.3
triton 3.4.0
transformer_engine 2.9.0
deepspeed 0.18.1+ali
flash_attn 2.8.3
flash_attn_3 3.0.0b1
transformers 4.57.1+ali
grouped_gemm 1.1.4
accelerate 1.11.0+ali
diffusers 0.34.0
mmengine 0.10.3
mmcv 2.1.0
mmdet 3.3.0
opencv-python-headless 4.11.0.86
ultralytics 8.3.96
timm 1.0.22
vllm 0.11.2
flashinfer-python 0.5.2
pytorch-dynamic-profiler 0.24.11
peft 0.16.0
ray 2.52.0
megatron-core 0.14.0
perf 5.4.30
gdb 15.0.50

Assets

Public images

CUDA 13.0.2 (driver ≥ 580, amd64 and aarch64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.11-cu130-serverless

CUDA 12.8 (driver ≥ 575, amd64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.11-cu128-serverless

VPC images

To speed up image pulls from within your virtual private cloud (VPC), replace the standard registry hostname with a region-specific VPC endpoint.

Standard format:

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/<image:tag>

VPC format:

acs-registry-vpc.<region-id>.cr.aliyuncs.com/egslingjun/<image:tag>

Replace the placeholders with actual values:

Placeholder Description Example
<region-id> The ID of the region where your ACS service is deployed cn-beijing, cn-wulanchabu
<image:tag> The image name and tag training-nv-pytorch:25.11-cu130-serverless
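
The hostname swap can also be scripted. In this sketch, cn-beijing is an example region ID; substitute your own:

```shell
# Rewrite a standard registry reference to its VPC endpoint by replacing
# everything before the first "/" with the region-specific VPC hostname.
img="egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.11-cu130-serverless"
vpc_img=$(echo "$img" | sed 's#^[^/]*/#acs-registry-vpc.cn-beijing.cr.aliyuncs.com/#')
echo "$vpc_img"
```
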
Note

This image is compatible with standard ACS and the multi-tenant Lingjun environment. It is not compatible with the single-tenant Lingjun environment.

Driver requirements

CUDA version Minimum NVIDIA driver
CUDA 13.0.2 580
CUDA 12.8.0 575

For a complete driver compatibility matrix, see CUDA Application Compatibility. For information on managing driver versions, see CUDA Compatibility and Upgrades.

Key features and enhancements

PyTorch compiler optimization

torch.compile(), introduced in PyTorch 2.0, works well for single-GPU small-scale training. For LLM training, however, it provides limited benefit or can hurt performance because LLM training relies on GPU memory optimization and distributed frameworks such as Fully Sharded Data Parallel (FSDP) or DeepSpeed.
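
For reference, the stock torch.compile entry point simply wraps a function or module. A minimal sketch, using the eager backend so it runs without a GPU or compiler toolchain:

```python
import torch

# torch.compile can be used as a decorator; the "eager" backend skips
# code generation, so this runs anywhere PyTorch is installed.
@torch.compile(backend="eager")
def scale_and_shift(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0 + 1.0

out = scale_and_shift(torch.tensor([1.0, 2.0]))
```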

This release addresses that limitation with two optimizations:

  • Communication granularity control in DeepSpeed — the compiler now receives a complete compute graph over a wider scope, enabling broader compiler optimizations.

  • PyTorch compiler frontend improvements — compilation continues even when a graph break occurs, and improved pattern matching and dynamic-shape support yield better-optimized compiled code.

After these optimizations, end-to-end (E2E) throughput increases by 20% when training an 8B LLM.

GPU memory optimization for recomputation

This release integrates automatic recommendation of the optimal number of activation recomputation layers directly into PyTorch. The recommendation is based on profiling models across different cluster configurations and parameter settings, collecting GPU memory utilization metrics, and analyzing the results. This feature is currently supported in the DeepSpeed framework.
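
The layer-count recommendation itself is internal to this image, but the underlying mechanism is standard activation checkpointing, sketched here with PyTorch's stock utility:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(4, 4)
x = torch.randn(2, 4, requires_grad=True)

# Discard the layer's activations in the forward pass and recompute
# them during backward, trading extra compute for lower GPU memory.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```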

End-to-end performance evaluation

Using CNP, a cloud-native AI performance benchmarking tool, we ran a comprehensive E2E performance comparison of this image against a standard base image, using mainstream open-source models and framework configurations.

Image performance evaluation: baseline comparison and iterative analysis


E2E performance contribution of core GPU components

The following ablation study is based on Golden-25.11. All tests ran on a multi-node GPU cluster. The configurations compared are:

  1. Base — the standard NGC PyTorch image

  2. ACS AI Image (Base + ACCL) — the base image with the ACCL communication library

  3. ACS AI Image (AC2 + ACCL) — the Golden image using the AC2 BaseOS, no optimizations enabled

  4. ACS AI Image (AC2 + ACCL + CompilerOpt) — the Golden image with only torch.compile optimization enabled

  5. ACS AI Image (AC2 + ACCL + CompilerOpt + CkptOpt) — the Golden image with both torch.compile and selective gradient checkpointing enabled


Quick start

To use this image in ACS, select it from Artifact Center when creating a workload, or specify the image reference in a YAML manifest.
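
A minimal Kubernetes-style pod sketch, assuming ACS accepts standard pod specs; every field except the image reference is illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-training          # illustrative name
spec:
  containers:
    - name: trainer
      image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.11-cu130-serverless
      command: ["sleep", "infinity"]   # replace with your training command
```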

1. Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:<tag>

Replace <tag> with the target image tag, such as 25.11-cu130-serverless or 25.11-cu128-serverless.

2. Enable optimizations

Compiler optimization — enable through the Transformers Trainer API.

Gradient checkpointing optimization — set the following environment variable before starting the container:

export CHECKPOINT_OPTIMIZATION=true

3. Start the container and run a training job

The image includes ljperf, a built-in model training and benchmarking tool.

LLMs

# Start the container and enter its shell
docker run --rm -it --ipc=host --net=host --privileged \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:<tag>

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b

Configuration notes

This image includes custom modifications to PyTorch, DeepSpeed, and other core libraries. Reinstalling these packages from their upstream sources will overwrite those modifications and disable the performance optimizations.

DeepSpeed configuration:

Leave zero_optimization.stage3_prefetch_bucket_size blank or set it to auto.
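
For example, in a DeepSpeed config file (other fields illustrative):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto"
  }
}
```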

NCCL network interface:

Set NCCL_SOCKET_IFNAME based on your GPU count:

GPU count per pod NCCL_SOCKET_IFNAME value Notes
1, 2, 4, or 8 GPUs eth0 Default in this image
16 GPUs (full node) hpn0 Enables the High-Performance Network (HPN)
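
The mapping above can be scripted in a job launcher; the helper name is ours:

```shell
# ifname_for_gpus: choose NCCL_SOCKET_IFNAME from the pod's GPU count,
# following the table above (hpn0 only for a full 16-GPU node).
ifname_for_gpus() {
  if [ "$1" -ge 16 ]; then echo hpn0; else echo eth0; fi
}

export NCCL_SOCKET_IFNAME=$(ifname_for_gpus 8)
```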

Known issues

Compiling flash_attn_3 (fa3) from source in the CUDA 13.0.2 image fails with an error. This is a known issue in the upstream community.