training-nv-pytorch 26.03 - Container Compute Service - Alibaba Cloud Documentation Center

These release notes cover the updates for the training-nv-pytorch 26.03 image.

Main features and bug fixes

Main features

Upgraded torch to version 2.10.
Upgraded vllm to version 0.17.0.
Upgraded megatron-core to version 0.16.0.
Upgraded deepspeed to version 0.18.8.
Upgraded transformer_engine to version 2.12.

Bug fixes

No bug fixes in this release.

Image name	training-nv-pytorch
Tag	26.03-cu130-serverless	26.03-cu128-serverless
Use case	Training and inference
Framework	PyTorch
Requirements	NVIDIA Driver release ≥ 580	NVIDIA Driver release ≥ 575
Supported architectures	amd64 and aarch64	amd64
Core components	Ubuntu: 24.04 Python: 3.12.7+gc CUDA: 13.0 perf: 5.4.30 gdb: 15.1 torch: 2.10.0+ali.10.nv25.10 triton: 3.6.0 transformer_engine: 2.12.0+5671fd36 deepspeed: 0.18.8+ali flash_attn: 2.8.3 transformers: 4.57.6+ali grouped_gemm: 1.1.4 accelerate: 1.11.0+ali diffusers: 0.34.0 mmengine: 0.10.3 mmcv: 2.1.0 mmdet: 3.3.0 opencv-python-headless: 4.11.0.86 ultralytics: 8.3.96 timm: 1.0.26 vllm: 0.17.0+cu130 flashinfer-python: 0.6.4 pytorch-dynamic-profiler: 0.24.11 peft: 0.16.0 ray: 2.54.1 megatron-core: 0.16.0	Ubuntu: 24.04 Python: 3.12.7+gc CUDA: 12.8 perf: 5.4.30 gdb: 15.1 torch: 2.10.0+ali.10.nv25.3.pgo triton: 3.6.0 transformer_engine: 2.12.0+5671fd36 deepspeed: 0.18.8+ali flash_attn: 2.8.3 flash_attn_3: 3.0.0b1 transformers: 4.57.6+ali grouped_gemm: 1.1.4 accelerate: 1.11.0+ali diffusers: 0.34.0 mmengine: 0.10.3 mmcv: 2.1.0 mmdet: 3.3.0 opencv-python-headless: 4.11.0.86 ultralytics: 8.3.96 timm: 1.0.26 vllm: 0.17.0+cu128 flashinfer-python: 0.6.4 pytorch-dynamic-profiler: 0.24.11 peft: 0.16.0 ray: 2.54.1 megatron-core: 0.16.0

Assets

Public images

CUDA 13.0.2 (Driver ≥ 580, amd64 and aarch64)

registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.03-cu130-serverless

CUDA 12.8 (Driver ≥ 575, amd64)

registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.03-cu128-serverless

VPC images

Replace the public network registry host in your image URI with the region-specific VPC endpoint.

URI component	Public network	IN-VPC
Registry host	`egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com`	`acs-registry-vpc.{region-id}.cr.aliyuncs.com`
Repository	`egslingjun`	`egslingjun` (unchanged)
Image and tag	`{image:tag}`	`{image:tag}` (unchanged)

Replace {region-id} with the ID of the region where your ACS service runs. For example:

Region	Region ID
China (Beijing)	`cn-beijing`
China (Ulanqab)	`cn-wulanchabu`

For the full list of supported regions, see Regions.

Example {image:tag} values:

inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless
training-nv-pytorch:25.10-serverless

Note

This image is suitable for Alibaba Cloud Container Compute Service (ACS) and multi-tenant Lingjun environments. It is not intended for use in single-tenant Lingjun environments.

Driver requirements

The 26.03 release supports CUDA 12.8.0 and CUDA 13.0.2, depending on the driver version. CUDA 13.0.2 requires NVIDIA Driver version 580 or later, and CUDA 12.8.0 requires NVIDIA Driver version 575 or later. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Key features and enhancements

PyTorch compiling optimization

torch.compile(), introduced in PyTorch 2.0, is effective for single-GPU training but provides limited or negative benefit for large language model (LLM) training, which depends on GPU memory optimization and distributed frameworks such as Fully Sharded Data Parallel (FSDP) or DeepSpeed.

This release improves torch.compile() for distributed LLM training through two optimizations:

Communication granularity control in DeepSpeed: Controlling communication granularity gives the compiler a complete compute graph, enabling wider compiling optimization.
Frontend improvements: The PyTorch compiler frontend now compiles even when a graph break occurs, with enhanced mode matching and dynamic shape capabilities.

Result: ~20% higher end-to-end throughput in 8B-parameter LLM training.

GPU memory optimization for recomputation

Based on performance tests across different clusters and parameter configurations, this release integrates the optimal number of activation recomputation layers directly into PyTorch. Enable it with a single environment variable—no manual tuning required.

This feature is available in the DeepSpeed framework.

End-to-end performance evaluation

Using the cloud-native AI performance analysis tool CNP, we conducted a comprehensive end-to-end performance comparison against a standard base image with mainstream open-source models and framework configurations. We also performed ablation studies to evaluate the contribution of each optimized component to the overall model training performance.

Image comparison: Base image vs. iterative evaluation

End-to-end performance contribution analysis of core GPU components

The following tests evaluate and compare end-to-end training performance on a multi-node GPU cluster using this image release. The comparison items include:

Base: NGC PyTorch image
ACS AI Image (Base + ACCL): NGC PyTorch image using the Alibaba Cloud Communication Library (ACCL).
ACS AI Image (AC2 + ACCL): Golden image using AC2 Base OS with no optimizations enabled.
ACS AI Image (AC2 + ACCL + CompilerOpt): Golden image using AC2 Base OS with only the torch compile optimization enabled.
ACS AI Image (AC2 + ACCL + CompilerOpt + CkptOpt): Golden image using AC2 Base OS with both torch compile and selective gradient checkpointing optimizations enabled.

Quick start

The following example shows how to pull the training-nv-pytorch image using Docker.

Note

To use the training-nv-pytorch image in ACS, select it from the Artifacts page when you create a workload in the console, or specify the image reference in a YAML file.

Step 1: Select an image

docker pull registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

Step 2: Enable compiler and memory optimizations

Enable compilation optimization
Use the transformers Trainer API:
Enable selective gradient checkpointing
```
export CHECKPOINT_OPTIMIZATION=true
```

Step 3: Start the container

The image includes the built-in model training tool ljperf. The following steps describe how to start the container and run a training task.

LLM workloads

# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b

Step 4: Usage recommendations

The image contains modifications to libraries such as PyTorch and DeepSpeed. Do not reinstall them.
In the DeepSpeed configuration, leave the zero_optimization.stage3_prefetch_bucket_size parameter empty or set it to auto.

Known issues

No known issues in this release.

Container Compute Service:training-nv-pytorch 26.03