
Container Compute Service: training-nv-pytorch 25.06

Last Updated: Mar 26, 2026

This release upgrades PyTorch to 2.7.1.8, extends vLLM compatibility to 0.9.1, adds Blackwell GPU support, and delivers ~20% end-to-end throughput improvement in 8B-parameter LLM training through compiler and gradient checkpointing optimizations.

Announcements

  • Do not reinstall PyTorch, DeepSpeed, or related libraries. The image ships pre-optimized binaries. Reinstalling these packages overwrites the optimized builds and may degrade performance.

  • This image is compatible with Alibaba Cloud Container Compute Service (ACS) clusters and Lingjun multi-tenant clusters, but is not supported on Lingjun single-tenant clusters.

What's new

Updated frameworks

  • PyTorch and related components upgraded to V2.7.1.8

  • Triton compiler upgraded to V3.3.0 (listed as triton under Core components)

  • vLLM compatibility extended to 0.9.1

  • Added support for NVIDIA's Blackwell GPU architecture

Bug fix

Upgrading PyTorch to V2.7.1.8 fixes an issue in earlier container images where VRAM (video random access memory) optimization efficiency was degraded.

Image details

Scenario: Training/Inference
Framework: PyTorch
Driver requirement: NVIDIA Driver ≥ 575 (see Driver requirements for data center GPU compatibility)

Core components

Component Version
Ubuntu 24.04
Python 3.12.7+gc
Torch 2.7.1.8+nv25.3
CUDA 12.8.0
ACCL-N 2.23.4.12
triton 3.3.0
TransformerEngine 2.3.0+5de3e14
deepspeed 0.16.9+ali
flash-attn 2.7.2
flashattn-hopper 3.0.0b1
transformers 4.51.2+ali
megatron-core 0.12.1
grouped_gemm 1.1.4
accelerate 1.7.0+ali
diffusers 0.31.0
mmengine 0.10.3
mmcv 2.1.0
mmdet 3.3.0
opencv-python-headless 4.10.0.84
ultralytics 8.3.96
timm 1.0.15
vLLM 0.9.1
flashinfer-python 0.2.5
pytorch-dynamic-profiler 0.24.11
perf 5.4.30
gdb 15.0.50
peft 0.13.2
ray 2.47.1

Available images

V25.06

Public image

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.06-serverless

VPC image

acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace the placeholders:

  • {region-id}: the region where ACS is activated (for example, cn-beijing or cn-wulanchabu)

  • {image:tag}: the image name and tag (for example, training-nv-pytorch:25.06-serverless)

Important

VPC image pulling is currently supported only in the China (Beijing) region.
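
With the placeholders filled in for the China (Beijing) region and this release's tag (both taken from the table above), assembling the VPC image address looks like:

```shell
# Substitute the two placeholders and assemble the VPC registry address.
REGION_ID="cn-beijing"
IMAGE_TAG="training-nv-pytorch:25.06-serverless"
IMAGE="acs-registry-vpc.${REGION_ID}.cr.aliyuncs.com/egslingjun/${IMAGE_TAG}"
echo "${IMAGE}"
# docker pull "${IMAGE}"   # run from inside a VPC in the China (Beijing) region
```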

Driver requirements

V25.06 is based on CUDA 12.8.0 and requires NVIDIA Driver 575 or later.

For data center GPUs (such as T4), the following driver branches are also compatible:

  • 470.57+ (R470 branch)

  • 525.85+ (R525 branch)

  • 535.86+ (R535 branch)

  • 545.23+ (R545 branch)

Important

The CUDA driver compatibility package supports only the branches listed above. If your driver is on an incompatible branch (R418, R440, R450, R460, R510, R520, R530, R555, or R560), upgrade your driver before using this image; those branches lack forward compatibility with CUDA 12.8. For details, see CUDA compatibility and CUDA compatibility and upgrades.

Key features and enhancements

PyTorch compilation optimization

torch.compile() delivers strong performance gains in single-GPU scenarios, but its impact is limited in large-scale LLM training because distributed frameworks like FSDP and DeepSpeed introduce frequent graph breaks that constrain the compiler.

To address this, three optimizations are applied:

  • DeepSpeed communication granularity: Optimized to expose larger, more coherent computation graphs to the compiler.

  • Compiler frontend: Enhanced to handle arbitrary graph breaks.

  • Pattern matching and dynamic shape support: Improved for stable compiled performance across varied workloads.

Result: ~20% end-to-end (E2E) throughput improvement in 8B-parameter LLM training.
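For orientation, generic torch.compile usage looks like the sketch below (this is stock PyTorch API, not the image's internal compiler changes; the eager backend is used here only to keep the sketch free of codegen dependencies, and the model is a placeholder):

```python
import torch

# Wrap a model to get a compiled callable with the same signature. The image's
# DeepSpeed-granularity and compiler-frontend optimizations are internal;
# nothing extra is needed at the call site.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
compiled = torch.compile(model, backend="eager")  # default backend is inductor
out = compiled(torch.randn(8, 64))
```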

Gradient checkpointing optimization

Through extensive benchmarking across models, cluster configurations, and system metrics (including memory utilization), a predictive model identifies the optimal activation recomputation layers for each workload. This optimization is natively integrated into PyTorch and supported in DeepSpeed, so you can adopt advanced memory optimization with minimal configuration changes.
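The stock PyTorch primitive underneath activation recomputation is sketched below; in this image the predictive model selects which layers to recompute automatically, so this manual form is shown only for context (the layer and shapes are placeholders):

```python
import torch
from torch.utils.checkpoint import checkpoint

# checkpoint() discards the wrapped layer's activations in the forward pass and
# recomputes them during backward, trading compute for memory.
layer = torch.nn.Linear(32, 32)
x = torch.randn(4, 32, requires_grad=True)
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()  # activations are recomputed here
```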

E2E performance gain evaluation

Using the Cloud Native Platform (CNP) AI performance analysis tool, comprehensive end-to-end comparisons were run against standard base images (such as NGC PyTorch), using mainstream open-source models and frameworks with ablation studies to quantify each optimization's contribution.

Test configuration (multi-node GPU clusters)

1. Baseline: NGC PyTorch image
2. ACS AI image (Base + ACCL): base image with the ACCL communication library
3. ACS AI image (AC2 + ACCL): golden image with AC2 BaseOS (no optimizations)
4. ACS AI image (AC2 + ACCL + CompilerOpt): AC2 BaseOS with the torch.compile optimization
5. ACS AI image (AC2 + ACCL + CompilerOpt + CkptOpt): AC2 BaseOS with both torch.compile and selective gradient checkpointing
(Figure: end-to-end throughput comparison across the five test cases.)

Quick start

This example uses Docker to pull and run the training-nv-pytorch image.

For ACS clusters, select the image from the Artifact Center in the console or specify it in your YAML configuration instead of using Docker pull.
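A minimal YAML sketch for the cluster path, assuming a standard Kubernetes-style pod spec (the pod name is hypothetical, and ACS-specific resource and scheduling settings are omitted; the image address and demo command come from this page):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-demo          # hypothetical pod name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.06-serverless
      command: ["ljperf", "benchmark", "--model", "deepspeed/llama3-8b"]
```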

1. Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Enable compiler and memory optimization

Compilation optimization with Transformers Trainer API

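A sketch of what enabling compilation through the Transformers Trainer API typically looks like (assumes the Hugging Face transformers package; torch_compile is a standard TrainingArguments flag, while output_dir and the batch size are placeholder values):

```python
from transformers import TrainingArguments

# torch_compile=True asks Trainer to wrap the model with torch.compile before
# training; everything else is a routine Trainer setup.
args = TrainingArguments(
    output_dir="./out",                 # placeholder path
    torch_compile=True,                 # enable torch.compile in the training loop
    per_device_train_batch_size=8,      # placeholder value
)
```

Passing these arguments to Trainer then proceeds as usual; the compiled model is handled internally.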

Gradient checkpointing optimization

export CHECKPOINT_OPTIMIZATION=true

3. Launch the container

The image includes a built-in training tool: ljperf.

LLM training example

# Start the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b

Configuration notes

  • In the DeepSpeed configuration, leave zero_optimization.stage3_prefetch_bucket_size blank or set it to auto.

  • The image pre-sets NCCL_SOCKET_IFNAME based on pod size:

    • 1/2/4/8 GPUs per pod (training or inference): NCCL_SOCKET_IFNAME=eth0 — this is the default.

    • 16-GPU node training: Set NCCL_SOCKET_IFNAME=hpn0 manually to use HPN.
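
The DeepSpeed note above can be expressed as a config fragment; a minimal sketch (train_batch_size is a placeholder, and the surrounding keys follow the standard DeepSpeed JSON schema):

```python
import json

# Minimal DeepSpeed ZeRO-3 config sketch: leave stage3_prefetch_bucket_size as
# "auto" (or omit the key entirely) so the runtime chooses the bucket size.
ds_config = {
    "train_batch_size": 32,  # placeholder value
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    },
}
print(json.dumps(ds_config, indent=2))
```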

Known issues

None.