This release upgrades PyTorch to 2.7.1.8, extends vLLM compatibility to 0.9.1, adds Blackwell GPU support, and delivers ~20% end-to-end throughput improvement in 8B-parameter LLM training through compiler and gradient checkpointing optimizations.
Announcements
- Do not reinstall PyTorch, DeepSpeed, or related libraries. The image ships pre-optimized binaries. Reinstalling these packages overwrites the optimized builds and may degrade performance.
- This image is compatible with Alibaba Cloud Container Compute Service (ACS) clusters and Lingjun multi-tenant clusters, but is not supported on Lingjun single-tenant clusters.
What's new
Updated frameworks
- PyTorch and related components upgraded to V2.7.1.8
- Triton compiler upgraded to V3.3.0
- vLLM compatibility extended to 0.9.1
- Added support for NVIDIA's Blackwell GPU architecture
Bug fix
Upgrading PyTorch to V2.7.1.8 resolves degraded VRAM (video random access memory) optimization efficiency in legacy container images.
Image details
| Attribute | Details |
|---|---|
| Scenario | Training/Inference |
| Framework | PyTorch |
| Driver requirement | NVIDIA Driver ≥ 575 (see Driver requirements for data center GPU compatibility) |
Core components
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| Torch | 2.7.1.8+nv25.3 |
| CUDA | 12.8.0 |
| ACCL-N | 2.23.4.12 |
| triton | 3.3.0 |
| TransformerEngine | 2.3.0+5de3e14 |
| deepspeed | 0.16.9+ali |
| flash-attn | 2.7.2 |
| flashattn-hopper | 3.0.0b1 |
| transformers | 4.51.2+ali |
| megatron-core | 0.12.1 |
| grouped_gemm | 1.1.4 |
| accelerate | 1.7.0+ali |
| diffusers | 0.31.0 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| opencv-python-headless | 4.10.0.84 |
| ultralytics | 8.3.96 |
| timm | 1.0.15 |
| vLLM | 0.9.1 |
| flashinfer-python | 0.2.5 |
| pytorch-dynamic-profiler | 0.24.11 |
| perf | 5.4.30 |
| gdb | 15.0.50 |
| peft | 0.13.2 |
| ray | 2.47.1 |
Available images
V25.06
Public image
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.06-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace the placeholders:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | The region where ACS is activated | cn-beijing, cn-wulanchabu |
| {image:tag} | The image name and tag | training-nv-pytorch:25.06-serverless |
VPC image pulling is currently supported only in the China (Beijing) region.
Driver requirements
V25.06 is based on CUDA 12.8.0 and requires NVIDIA Driver 575 or later.
For data center GPUs (such as T4), the following driver branches are also compatible:
- 470.57+ (R470 branch)
- 525.85+ (R525 branch)
- 535.86+ (R535 branch)
- 545.23+ (R545 branch)
The CUDA driver compatibility package supports only the branches listed above. If your driver is on an incompatible branch (R418, R440, R450, R460, R510, R520, R530, R555, or R560), upgrade your driver before using this image; those branches lack forward compatibility with CUDA 12.8. For details, see CUDA compatibility and CUDA compatibility and upgrades.
Key features and enhancements
PyTorch compilation optimization
torch.compile() delivers strong performance gains in single-GPU scenarios, but its impact is limited in large-scale LLM training because distributed frameworks like FSDP and DeepSpeed introduce frequent graph breaks that constrain the compiler.
To address this, three optimizations are applied:
- DeepSpeed communication granularity: Optimized to expose larger, more coherent computation graphs to the compiler.
- Compiler frontend: Enhanced to handle arbitrary graph breaks.
- Pattern matching and dynamic shape support: Improved for stable compiled performance across varied workloads.
Result: ~20% end-to-end (E2E) throughput improvement in 8B-parameter LLM training.
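To see what a graph break looks like in practice, here is an illustrative sketch (not the image's internal compiler patches): a data-dependent `.item()` call forces TorchDynamo to split the forward pass into separate compiled graphs, which is the kind of fragmentation the optimizations above reduce.

```python
import torch

class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        x = self.linear(x)
        # .item() needs a concrete runtime value, so Dynamo inserts a
        # graph break here and compiles the code after it separately.
        if x.sum().item() > 0:
            x = torch.relu(x)
        return x

# backend="eager" skips code generation so this sketch runs anywhere;
# real training would use the default inductor backend.
compiled = torch.compile(TinyBlock(), backend="eager")
out = compiled(torch.randn(2, 8))
print(out.shape)
```

Fewer and better-placed breaks mean larger graphs for the compiler to optimize, which is where the throughput gain comes from.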
Gradient checkpointing optimization
Through extensive benchmarking across models, cluster configurations, and system metrics (including memory utilization), a predictive model identifies the optimal activation recomputation layers for each workload. This optimization is natively integrated into PyTorch and supported in DeepSpeed, so you can adopt advanced memory optimization with minimal configuration changes.
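For reference, selective activation recomputation can be expressed with standard PyTorch APIs. The sketch below simply checkpoints every other layer; the image's optimization instead picks the layers to recompute automatically from its predictive model.

```python
import torch
from torch.utils.checkpoint import checkpoint

layers = torch.nn.ModuleList(torch.nn.Linear(16, 16) for _ in range(4))

def forward(x):
    for i, layer in enumerate(layers):
        if i % 2 == 0:
            # Activations of this layer are discarded after the forward
            # pass and recomputed during backward, trading compute for VRAM.
            x = checkpoint(layer, x, use_reentrant=False)
        else:
            x = layer(x)
    return x

x = torch.randn(2, 16, requires_grad=True)
forward(x).sum().backward()
print(x.grad.shape)
```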
E2E performance gain evaluation
Using the Cloud Native Platform (CNP) AI performance analysis tool, comprehensive end-to-end comparisons were run against standard base images (such as NGC PyTorch), using mainstream open-source models and frameworks with ablation studies to quantify each optimization's contribution.
Test configuration (multi-node GPU clusters)
| Test case | Configuration |
|---|---|
| 1. Baseline | NGC PyTorch image |
| 2. ACS AI image: Base + ACCL | Base image with ACCL communication library |
| 3. ACS AI image: AC2 + ACCL | Golden image with AC2 BaseOS (no optimizations) |
| 4. ACS AI image: AC2 + ACCL + CompilerOpt | AC2 BaseOS with torch.compile optimization |
| 5. ACS AI image: AC2 + ACCL + CompilerOpt + CkptOpt | AC2 BaseOS with both torch.compile and selective gradient checkpointing |
Quick start
This example uses Docker to pull and run the training-nv-pytorch image.
For ACS clusters, select the image from the Artifact Center in the console or specify it in your YAML configuration instead of using Docker pull.
1. Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Enable compiler and memory optimization
Compilation optimization with Transformers Trainer API
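A hypothetical sketch of enabling compilation through the Trainer API: TrainingArguments exposes a `torch_compile` flag (transformers 4.27+) that wraps the model with torch.compile() before training. The model and dataset wiring is elided here.

```python
# Arguments for TrainingArguments; only torch_compile is the point of
# this sketch, the other fields are illustrative.
compile_args = dict(
    output_dir="./out",
    torch_compile=True,           # enable the compilation optimization
    gradient_checkpointing=True,  # standard HF activation recomputation
)
# from transformers import Trainer, TrainingArguments
# trainer = Trainer(model=model, args=TrainingArguments(**compile_args),
#                   train_dataset=train_ds)
# trainer.train()
print(compile_args["torch_compile"])
```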
Gradient checkpointing optimization
export CHECKPOINT_OPTIMIZATION=true
3. Launch the container
The image includes a built-in training tool: ljperf.
LLM training example
# Start the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
Configuration notes
- In the DeepSpeed configuration, leave `zero_optimization.stage3_prefetch_bucket_size` blank or set it to `auto`.
- The image pre-sets `NCCL_SOCKET_IFNAME` based on pod size:
  - 1/2/4/8 GPUs per pod (training or inference): `NCCL_SOCKET_IFNAME=eth0` is the default.
  - 16-GPU node training: set `NCCL_SOCKET_IFNAME=hpn0` manually to use HPN.
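The DeepSpeed note above corresponds to a config fragment like the following hypothetical minimal ZeRO-3 configuration; only stage3_prefetch_bucket_size is the point here, the other fields are illustrative.

```python
# Minimal ZeRO-3 config sketch: "auto" (or omitting the key) lets
# DeepSpeed size the prefetch bucket itself.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    },
}
print(ds_config["zero_optimization"]["stage3_prefetch_bucket_size"])
```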
Known issues
None.