This release upgrades DeepSpeed and transformers for distributed large language model (LLM) training, and vLLM and flashinfer-python for inference workloads.
What's new
| Component | Previous | 26.02 |
|---|---|---|
| DeepSpeed | — | 0.18.5 |
| transformers | — | 4.57.6 |
| vLLM | — | 0.15.0 |
| flashinfer-python | — | 0.6.1 |
Bug fixes
No bug fixes in this release.
Image variants
Two image tags are available, differing in CUDA version and supported architectures.
| Tag | CUDA | NVIDIA driver | Supported architectures |
|---|---|---|---|
| 26.02-cu130-serverless | 13.0 | >= 580 | amd64 and aarch64 |
| 26.02-cu128-serverless | 12.8 | >= 575 | amd64 |
Both variants support training and inference workloads and are built on the PyTorch framework.
Driver requirements
The 26.02 release supports CUDA 12.8.0 and CUDA 13.0.2 on different NVIDIA driver versions:
- CUDA 13.0.2: NVIDIA driver version 580 or later
- CUDA 12.8.0: NVIDIA driver version 575 or later
For the full compatibility matrix, see CUDA application compatibility and CUDA compatibility and upgrades.
Core components
CUDA 13.0 (26.02-cu130-serverless)
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| CUDA | 13.0 |
| torch | 2.9.0+ali.10.nv25.10 |
| triton | 3.5.0 |
| transformer_engine | 2.11.0+c188b533 |
| DeepSpeed | 0.18.5+ali |
| flash_attn | 2.8.3 |
| transformers | 4.57.6+ali |
| grouped_gemm | 1.1.4 |
| accelerate | 1.11.0+ali |
| vLLM | 0.15.0+cu130 |
| flashinfer-python | 0.6.1 |
| peft | 0.16.0 |
| ray | 2.53.0 |
| megatron-core | 0.15.0 |
| pytorch-dynamic-profiler | 0.24.11 |
| diffusers | 0.34.0 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| opencv-python-headless | 4.11.0.86 |
| ultralytics | 8.3.96 |
| timm | 1.0.24 |
| perf | 5.4.30 |
| gdb | 15.0.50.20240403-git |
CUDA 12.8 (26.02-cu128-serverless)
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| CUDA | 12.8 |
| torch | 2.9.0+ali.10.nv25.3 |
| triton | 3.5.0 |
| transformer_engine | 2.10.0+769ed778 |
| DeepSpeed | 0.18.5+ali |
| flash_attn | 2.8.3 |
| flash_attn_3 | 3.0.0b1 |
| transformers | 4.57.6+ali |
| grouped_gemm | 1.1.4 |
| accelerate | 1.11.0+ali |
| vLLM | 0.15.0+cu128 |
| flashinfer-python | 0.6.1 |
| peft | 0.16.0 |
| ray | 2.53.0 |
| megatron-core | 0.15.0 |
| pytorch-dynamic-profiler | 0.24.11 |
| diffusers | 0.34.0 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| opencv-python-headless | 4.11.0.86 |
| ultralytics | 8.3.96 |
| timm | 1.0.24 |
| perf | 5.4.30 |
| gdb | 15.0.50.20240403-git |
flash_attn_3 (3.0.0b1) is included only in the CUDA 12.8 variant.
Key enhancements
PyTorch compiler optimization
torch.compile(), introduced in PyTorch 2.0, provides limited benefit in LLM distributed training because Fully Sharded Data Parallel (FSDP) and DeepSpeed require GPU memory optimizations that can break the compiler's compute graph. This release addresses that through two improvements:
- Communication granularity control in DeepSpeed: Controlling communication granularity inside the DeepSpeed framework lets the compiler capture a complete compute graph, enabling broader compiler optimization.
- PyTorch compiler frontend improvements: The compiler frontend now compiles successfully even when graph breaks occur, and pattern matching and dynamic shape support are enhanced.
Together, these optimizations increase end-to-end (E2E) throughput by 20% when training an 8B LLM.
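Graph breaks can be inspected with stock PyTorch tooling. The following is a minimal sketch (illustrative only; it demonstrates the general mechanism, not the image's internal changes) that uses `torch._dynamo.explain` to count breaks caused by data-dependent control flow:

```python
import torch
import torch._dynamo as dynamo

class Toy(torch.nn.Module):
    def forward(self, x):
        x = torch.relu(x @ x.T)
        # Data-dependent Python control flow typically forces a graph break,
        # splitting the compiled graph into fragments.
        if x.sum() > 0:
            x = x + 1
        return x

explanation = dynamo.explain(Toy())(torch.randn(8, 8))
print(explanation.graph_break_count)  # non-zero when the graph is split
```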
GPU memory optimization for recomputation
Determining the right number of activation recomputation layers typically requires manual tuning across different cluster configurations. This release automates that process: the image forecasts and analyzes GPU memory consumption by running performance tests across different cluster configurations, then integrates the optimal number of activation recomputation layers directly into PyTorch.
This feature is currently available in the DeepSpeed framework.
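For context on the knob being tuned, here is a minimal hand-written sketch of selective activation recomputation using `torch.utils.checkpoint` (the layer stack and `n_ckpt` are hypothetical; the image selects this value automatically):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical 4-layer stack; n_ckpt is the number of layers whose
# activations are recomputed instead of stored.
layers = torch.nn.ModuleList(torch.nn.Linear(256, 256) for _ in range(4))

def forward(x, n_ckpt=2):
    for i, layer in enumerate(layers):
        if i < n_ckpt:
            # Discard activations here and recompute them during backward,
            # trading extra compute for lower peak GPU memory.
            x = checkpoint(layer, x, use_reentrant=False)
        else:
            x = layer(x)
    return x

forward(torch.randn(8, 256, requires_grad=True)).sum().backward()
```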
End-to-end performance evaluation
Performance is measured using CNP, a cloud-native AI performance evaluation tool, against mainstream open-source models and framework configurations. Ablation experiments isolate the contribution of each optimization.
Comparison against base image across releases
GPU core component contribution to E2E performance
The following tests run on multi-node GPU clusters using the 26.02 image. Each configuration adds one optimization over the previous:
- Base: NGC PyTorch image
- ACS AI Image: Base + ACCL: ACCL communication library added
- ACS AI Image: AC2 + ACCL: AC2 BaseOS, no further optimizations
- ACS AI Image: AC2 + ACCL + CompilerOpt: `torch.compile` enabled
- ACS AI Image: AC2 + ACCL + CompilerOpt + CkptOpt: `torch.compile` and selective gradient checkpointing both enabled
Image availability
Public network
CUDA 13.0 (NVIDIA driver >= 580, amd64 and aarch64):

```
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.02-cu130-serverless
```

CUDA 12.8 (NVIDIA driver >= 575, amd64):

```
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.02-cu128-serverless
```
VPC
To pull images faster within a virtual private cloud (VPC), replace the registry hostname:
```
# Public network
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}

# VPC
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
```
Replace the placeholders:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | ID of the ACS service region | cn-beijing, cn-wulanchabu |
| {image:tag} | Image name and tag | inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless or training-nv-pytorch:25.10-serverless |
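For example, with `{region-id}` set to `cn-wulanchabu` and this release's CUDA 13.0 training image, the VPC address becomes `acs-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.02-cu130-serverless`.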
This image supports the ACS product form and the Lingjun multi-tenant product form. It does not support the Lingjun single-tenant product form.
Quick start
To use this image in ACS, select it from the Artifacts page in the Workload Creation interface, or specify the image reference in a YAML file.
1. Pull the image
```
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
```
2. Enable optimizations
Enable compiler optimization
Use the transformers Trainer API, as in the sketch below.
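A minimal sketch using the standard `torch_compile` flag of `TrainingArguments` (the `output_dir` value and the model/dataset wiring are placeholders for your own setup):

```python
from transformers import Trainer, TrainingArguments

# torch_compile=True asks the Trainer to wrap the model with torch.compile().
args = TrainingArguments(
    output_dir="./output",  # placeholder
    torch_compile=True,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```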
Enable activation recomputation for GPU memory optimization
```
export CHECKPOINT_OPTIMIZATION=true
```
3. Start the container and run a training job
The image includes ljperf, a model training tool.
```
# Start the container
docker run --rm -it --ipc=host --net=host --privileged \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run a training demo (LLM)
ljperf benchmark --model deepspeed/llama3-8b
```
Usage notes
- Do not reinstall libraries bundled in this image, such as PyTorch and DeepSpeed.
- In your DeepSpeed configuration, leave `zero_optimization.stage3_prefetch_bucket_size` blank or set it to `auto`, as in the sketch after this list.
- The image sets `NCCL_SOCKET_IFNAME` by default. Adjust based on your GPU topology:
  - 1, 2, 4, or 8 GPUs per pod: set `NCCL_SOCKET_IFNAME=eth0` (default)
  - 16 GPUs per node with HPN high-performance networking: set `NCCL_SOCKET_IFNAME=hpn0`
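A minimal ZeRO-3 configuration fragment illustrating the `stage3_prefetch_bucket_size` note above (the other keys and values are placeholders):

```python
# Hypothetical DeepSpeed ZeRO-3 config fragment; only the
# "stage3_prefetch_bucket_size" entry reflects the note above.
ds_config = {
    "train_batch_size": 32,  # placeholder
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",  # or omit the key entirely
    },
}
```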
Known issues
No known issues in this release.