This release upgrades PyTorch to 2.7.1.8, extends vLLM compatibility to 0.9.1, adds Blackwell GPU support, and delivers ~20% end-to-end throughput improvement in 8B-parameter LLM training through compiler and gradient checkpointing optimizations.
Announcements
- Do not reinstall PyTorch, DeepSpeed, or related libraries. The image ships pre-optimized binaries. Reinstalling these packages overwrites the optimized builds and may degrade performance.
- This image is compatible with Alibaba Cloud Container Compute Service (ACS) clusters and Lingjun multi-tenant clusters, but is not supported on Lingjun single-tenant clusters.
What's new
Updated frameworks
- PyTorch and related components upgraded to V2.7.1.8
- Triton compiler upgraded to V3.3.0
- vLLM compatibility extended to 0.9.1
- Added support for NVIDIA's Blackwell GPU architecture
Bug fix
Upgrading PyTorch to V2.7.1.8 resolves degraded VRAM (video random access memory) optimization efficiency in legacy container images.
Image details
| Attribute | Details |
|---|---|
| Scenario | Training/Inference |
| Framework | PyTorch |
| Driver requirement | NVIDIA Driver ≥ 575 (see Driver requirements for data center GPU compatibility) |
Core components
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| Torch | 2.7.1.8+nv25.3 |
| CUDA | 12.8.0 |
| ACCL-N | 2.23.4.12 |
| triton | 3.3.0 |
| TransformerEngine | 2.3.0+5de3e14 |
| deepspeed | 0.16.9+ali |
| flash-attn | 2.7.2 |
| flashattn-hopper | 3.0.0b1 |
| transformers | 4.51.2+ali |
| megatron-core | 0.12.1 |
| grouped_gemm | 1.1.4 |
| accelerate | 1.7.0+ali |
| diffusers | 0.31.0 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| opencv-python-headless | 4.10.0.84 |
| ultralytics | 8.3.96 |
| timm | 1.0.15 |
| vLLM | 0.9.1 |
| flashinfer-python | 0.2.5 |
| pytorch-dynamic-profiler | 0.24.11 |
| perf | 5.4.30 |
| gdb | 15.0.50 |
| peft | 0.13.2 |
| ray | 2.47.1 |
Available images
V25.06
Public image
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:25.06-serverless
VPC image
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
Replace the placeholders:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | The region where ACS is activated | cn-beijing, cn-wulanchabu |
| {image:tag} | The image name and tag | training-nv-pytorch:25.06-serverless |
VPC image pulling is currently supported only in the China (Beijing) region.
Driver requirements
V25.06 is based on CUDA 12.8.0 and requires NVIDIA Driver 575 or later.
For data center GPUs (such as T4), the following driver branches are also compatible:
- 470.57+ (R470 branch)
- 525.85+ (R525 branch)
- 535.86+ (R535 branch)
- 545.23+ (R545 branch)
The CUDA driver compatibility package supports only the branches listed above. If your driver is on an incompatible branch (R418, R440, R450, R460, R510, R520, R530, R555, or R560), upgrade your driver before using this image; those branches lack forward compatibility with CUDA 12.8. For details, see CUDA compatibility and CUDA compatibility and upgrades.
Key features and enhancements
PyTorch compilation optimization
torch.compile() delivers strong performance gains in single-GPU scenarios, but its impact is limited in large-scale LLM training because distributed frameworks like FSDP and DeepSpeed introduce frequent graph breaks that constrain the compiler.
To address this, three optimizations are applied:
- DeepSpeed communication granularity: Optimized to expose larger, more coherent computation graphs to the compiler.
- Compiler frontend: Enhanced to handle arbitrary graph breaks.
- Pattern matching and dynamic shape support: Improved for stable compiled performance across varied workloads.
Result: ~20% end-to-end (E2E) throughput improvement in 8B-parameter LLM training.
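To see what a graph break looks like in practice, here is an illustrative sketch (not the image's internal compiler patches): a data-dependent `.item()` call forces TorchDynamo to split the forward pass into separate compiled graphs, which is the kind of fragmentation the optimizations above reduce.

```python
import torch

class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        x = self.linear(x)
        # .item() needs a concrete runtime value, so Dynamo inserts a
        # graph break here and compiles the code after it separately.
        if x.sum().item() > 0:
            x = torch.relu(x)
        return x

# backend="eager" skips code generation so this sketch runs anywhere;
# real training would use the default inductor backend.
compiled = torch.compile(TinyBlock(), backend="eager")
out = compiled(torch.randn(2, 8))
print(out.shape)
```

Fewer and better-placed breaks mean larger graphs for the compiler to optimize, which is where the throughput gain comes from.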
Gradient checkpointing optimization
Through extensive benchmarking across models, cluster configurations, and system metrics (including memory utilization), a predictive model identifies the optimal activation recomputation layers for each workload. This optimization is natively integrated into PyTorch and supported in DeepSpeed, so you can adopt advanced memory optimization with minimal configuration changes.
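For reference, selective activation recomputation can be expressed with standard PyTorch APIs. The sketch below simply checkpoints every other layer; the image's optimization instead picks the layers to recompute automatically from its predictive model.

```python
import torch
from torch.utils.checkpoint import checkpoint

layers = torch.nn.ModuleList(torch.nn.Linear(16, 16) for _ in range(4))

def forward(x):
    for i, layer in enumerate(layers):
        if i % 2 == 0:
            # Activations of this layer are discarded after the forward
            # pass and recomputed during backward, trading compute for VRAM.
            x = checkpoint(layer, x, use_reentrant=False)
        else:
            x = layer(x)
    return x

x = torch.randn(2, 16, requires_grad=True)
forward(x).sum().backward()
print(x.grad.shape)
```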
E2E performance gain evaluation
Using the Cloud Native Platform (CNP) AI performance analysis tool, comprehensive end-to-end comparisons were run against standard base images (such as NGC PyTorch), using mainstream open-source models and frameworks with ablation studies to quantify each optimization's contribution.
Test configuration (multi-node GPU clusters)
| Test case | Configuration |
|---|---|
| 1. Baseline | NGC PyTorch image |
| 2. ACS AI image: Base + ACCL | Base image with ACCL communication library |
| 3. ACS AI image: AC2 + ACCL | Golden image with AC2 BaseOS (no optimizations) |
| 4. ACS AI image: AC2 + ACCL + CompilerOpt | AC2 BaseOS with torch.compile optimization |
| 5. ACS AI image: AC2 + ACCL + CompilerOpt + CkptOpt | AC2 BaseOS with both torch.compile and selective gradient checkpointing |
Quick start
This example uses Docker to pull and run the training-nv-pytorch image.
For ACS clusters, select the image from the Artifact Center in the console or specify it in your YAML configuration instead of using Docker pull.
1. Pull the image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Enable compiler and memory optimization
Compilation optimization with Transformers Trainer API
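A hypothetical sketch of enabling compilation through the Trainer API: TrainingArguments exposes a `torch_compile` flag (transformers 4.27+) that wraps the model with torch.compile() before training. The model and dataset wiring is elided here.

```python
# Arguments for TrainingArguments; only torch_compile is the point of
# this sketch, the other fields are illustrative.
compile_args = dict(
    output_dir="./out",
    torch_compile=True,           # enable the compilation optimization
    gradient_checkpointing=True,  # standard HF activation recomputation
)
# from transformers import Trainer, TrainingArguments
# trainer = Trainer(model=model, args=TrainingArguments(**compile_args),
#                   train_dataset=train_ds)
# trainer.train()
print(compile_args["torch_compile"])
```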
Gradient checkpointing optimization
export CHECKPOINT_OPTIMIZATION=true
3. Launch the container
The image includes a built-in training tool: ljperf.
LLM training example
# Start the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
Configuration notes
- In the DeepSpeed configuration, leave `zero_optimization.stage3_prefetch_bucket_size` blank or set it to `auto`.
- The image pre-sets `NCCL_SOCKET_IFNAME` based on pod size:
  - 1/2/4/8 GPUs per pod (training or inference): `NCCL_SOCKET_IFNAME=eth0` is the default.
  - 16-GPU node training: set `NCCL_SOCKET_IFNAME=hpn0` manually to use HPN.
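The DeepSpeed note above corresponds to a config fragment like the following hypothetical minimal ZeRO-3 configuration; only stage3_prefetch_bucket_size is the point here, the other fields are illustrative.

```python
# Minimal ZeRO-3 config sketch: "auto" (or omitting the key) lets
# DeepSpeed size the prefetch bucket itself.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    },
}
print(ds_config["zero_optimization"]["stage3_prefetch_bucket_size"])
```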
Known issues
None.