Container Compute Service: training-nv-pytorch 26.03

Last Updated: Apr 09, 2026

These release notes cover the updates for the training-nv-pytorch 26.03 image.

Main features and bug fixes

Main features

  • Upgraded torch to version 2.10.

  • Upgraded vllm to version 0.17.0.

  • Upgraded megatron-core to version 0.16.0.

  • Upgraded deepspeed to version 0.18.8.

  • Upgraded transformer_engine to version 2.12.

Bug fixes

No bug fixes in this release.

Contents

Image name

training-nv-pytorch

Tag

26.03-cu130-serverless

26.03-cu128-serverless

Use case

Training and inference

Framework

PyTorch

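Requirements

  • 26.03-cu130-serverless: NVIDIA Driver release >= 580

  • 26.03-cu128-serverless: NVIDIA Driver release >= 575

Supported architectures

  • 26.03-cu130-serverless: amd64 and aarch64

  • 26.03-cu128-serverless: amd64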

Core components

26.03-cu130-serverless (CUDA 13.0)

  • Ubuntu: 24.04

  • Python: 3.12.7+gc

  • CUDA: 13.0

  • perf: 5.4.30

  • gdb: 15.1

  • torch: 2.10.0+ali.10.nv25.10

  • triton: 3.6.0

  • transformer_engine: 2.12.0+5671fd36

  • deepspeed: 0.18.8+ali

  • flash_attn: 2.8.3

  • transformers: 4.57.6+ali

  • grouped_gemm: 1.1.4

  • accelerate: 1.11.0+ali

  • diffusers: 0.34.0

  • mmengine: 0.10.3

  • mmcv: 2.1.0

  • mmdet: 3.3.0

  • opencv-python-headless: 4.11.0.86

  • ultralytics: 8.3.96

  • timm: 1.0.26

  • vllm: 0.17.0+cu130

  • flashinfer-python: 0.6.4

  • pytorch-dynamic-profiler: 0.24.11

  • peft: 0.16.0

  • ray: 2.54.1

  • megatron-core: 0.16.0

26.03-cu128-serverless (CUDA 12.8)

  • Ubuntu: 24.04

  • Python: 3.12.7+gc

  • CUDA: 12.8

  • perf: 5.4.30

  • gdb: 15.1

  • torch: 2.10.0+ali.10.nv25.3.pgo

  • triton: 3.6.0

  • transformer_engine: 2.12.0+5671fd36

  • deepspeed: 0.18.8+ali

  • flash_attn: 2.8.3

  • flash_attn_3: 3.0.0b1

  • transformers: 4.57.6+ali

  • grouped_gemm: 1.1.4

  • accelerate: 1.11.0+ali

  • diffusers: 0.34.0

  • mmengine: 0.10.3

  • mmcv: 2.1.0

  • mmdet: 3.3.0

  • opencv-python-headless: 4.11.0.86

  • ultralytics: 8.3.96

  • timm: 1.0.26

  • vllm: 0.17.0+cu128

  • flashinfer-python: 0.6.4

  • pytorch-dynamic-profiler: 0.24.11

  • peft: 0.16.0

  • ray: 2.54.1

  • megatron-core: 0.16.0

Assets

Public images

CUDA 13.0.2 (Driver >= 580, amd64 and aarch64)

  • registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.03-cu130-serverless

CUDA 12.8 (Driver >= 575, amd64)

  • registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.03-cu128-serverless

VPC images

To pull ACS AI container images faster from within a VPC, replace the public registry host in the image URI with the VPC registry host: change registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} to registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.

  • {region-id}: The region ID of one of the ACS available regions, such as cn-beijing or cn-wulanchabu.

  • {image:tag}: The name and tag of the AI container image, such as inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless or training-nv-pytorch:25.10-serverless.
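
The substitution above can be sketched in Python. This is a minimal illustration, not part of the image: the function name to_vpc_image_uri and the constant PUBLIC_REGISTRY are hypothetical, and the only rule applied is the host replacement described above.

```python
# Public registry host used by the image URIs in this document.
PUBLIC_REGISTRY = "registry.cn-wulanchabu.cr.aliyuncs.com"


def to_vpc_image_uri(public_uri: str, region_id: str) -> str:
    """Swap the public registry host for the VPC registry host of region_id."""
    prefix = PUBLIC_REGISTRY + "/"
    if not public_uri.startswith(prefix):
        raise ValueError(f"not a public ACS image URI: {public_uri}")
    # Everything after the host (namespace, image name, tag) is kept unchanged.
    remainder = public_uri[len(prefix):]
    return f"registry-vpc.{region_id}.cr.aliyuncs.com/{remainder}"
```

For example, calling to_vpc_image_uri with the public URI of the 26.03-cu128-serverless image and the region ID cn-beijing returns the same namespace, image name, and tag under registry-vpc.cn-beijing.cr.aliyuncs.com.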

Note

This image is suitable for ACS and multi-tenant Lingjun products. It is not intended for use with single-tenant Lingjun products.

Driver requirements

  • The 26.03 release ships in two variants. The CUDA 13.0.2 variant (tag 26.03-cu130-serverless) requires NVIDIA Driver version 580 or later, and the CUDA 12.8.0 variant (tag 26.03-cu128-serverless) requires NVIDIA Driver version 575 or later. Choose the variant that matches the driver installed on your nodes. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
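
The choice of tag can be expressed as a small lookup. The sketch below is illustrative only: select_tag and VARIANTS are hypothetical names, the thresholds and architectures are the ones listed in this document, and the driver version is assumed to be a dotted string whose first component is the major version.

```python
from typing import Optional

# (tag, minimum driver major version, supported architectures), newest CUDA first.
VARIANTS = [
    ("26.03-cu130-serverless", 580, {"amd64", "aarch64"}),
    ("26.03-cu128-serverless", 575, {"amd64"}),
]


def select_tag(driver_version: str, arch: str = "amd64") -> Optional[str]:
    """Return the newest-CUDA tag that the driver and architecture satisfy, or None."""
    driver_major = int(driver_version.split(".")[0])
    for tag, min_driver, archs in VARIANTS:
        if driver_major >= min_driver and arch in archs:
            return tag
    return None
```

A return value of None means that neither variant of this release fits the node, for example an aarch64 node with a driver older than 580.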

Key features and enhancements

PyTorch compilation optimization

torch.compile(), introduced in PyTorch 2.0, is effective for single-GPU training but provides limited or negative benefit for large language model (LLM) training, which depends on GPU memory optimization and distributed frameworks such as Fully Sharded Data Parallel (FSDP) or DeepSpeed.

This release improves torch.compile() for distributed LLM training through two optimizations:

  • Communication granularity control in DeepSpeed: Controlling communication granularity exposes a more complete compute graph to the compiler, which enables broader compilation optimizations.

  • Frontend improvements: The PyTorch compiler frontend now continues compiling when a graph break occurs, and its pattern matching and dynamic shape support are enhanced.

Result: 20% higher end-to-end throughput when training an 8B LLM.

GPU memory optimization for recomputation

Based on performance tests across different clusters and parameter configurations, this release integrates the optimal number of activation recomputation layers directly into PyTorch. Enable it with a single environment variable; no manual tuning is required.

This feature is currently available in the DeepSpeed framework only.

End-to-end performance evaluation

Using the cloud-native AI performance analysis tool CNP, we conducted a comprehensive end-to-end performance comparison against a standard base image with mainstream open-source models and framework configurations. We also performed ablation studies to evaluate the contribution of each optimized component to the overall model training performance.

Image comparison: Base image vs. iterative evaluation

[Figure: base image vs. iterative evaluation]

E2E performance contribution analysis of core GPU components

The following tests evaluate and compare end-to-end training performance on a multi-node GPU cluster using this image release. The comparison items include:

  1. Base: NGC PyTorch image

  2. ACS AI Image (Base + ACCL): The image uses the ACCL communication library.

  3. ACS AI Image (AC2 + ACCL): The image uses AC2 BaseOS with no optimizations enabled.

  4. ACS AI Image (AC2 + ACCL + CompilerOpt): The image uses AC2 BaseOS with only the torch compile optimization enabled.

  5. ACS AI Image (AC2 + ACCL + CompilerOpt + CkptOpt): The image uses AC2 BaseOS with both torch compile and selective gradient checkpointing optimizations enabled.

[Figure: E2E performance contribution of core GPU components]

Quick start

The following example shows how to pull the training-nv-pytorch image using Docker.

Note

To use the training-nv-pytorch image in ACS, select it from the Artifacts page when you create a workload in the console, or specify the image reference in a YAML file.

Step 1: Select an image

docker pull registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

Step 2: Enable compiler and memory optimizations

  • Enable compilation optimization

    Use the transformers Trainer API:

    [Figure: compilation optimization settings in the transformers Trainer API]

  • Enable selective gradient checkpointing

    export CHECKPOINT_OPTIMIZATION=true
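
A minimal Python sketch of both switches follows. It rests on two assumptions that this document does not confirm: that compilation is enabled through the standard torch_compile argument of transformers.TrainingArguments, and that CHECKPOINT_OPTIMIZATION only needs to be present in the environment of the training process. The helper names are hypothetical.

```python
import os
from typing import Mapping, Optional


def compile_training_kwargs(output_dir: str) -> dict:
    """Keyword arguments for transformers.TrainingArguments with torch.compile on."""
    return {
        "output_dir": output_dir,
        # Assumption: the standard Trainer switch is what enables compilation.
        "torch_compile": True,
    }


def checkpoint_optimized_env(base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Copy of an environment with CHECKPOINT_OPTIMIZATION=true set."""
    # Copy so that the caller's mapping (or os.environ) is not modified here.
    env = dict(os.environ if base_env is None else base_env)
    env["CHECKPOINT_OPTIMIZATION"] = "true"
    return env
```

In a training script inside the container, the first result would be passed as TrainingArguments(**compile_training_kwargs("./output")), and the second can be handed to subprocess.run(..., env=...) when the training command is launched from Python instead of a shell.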

Step 3: Start the container

The image includes the built-in model training tool ljperf. The following steps describe how to start the container and run a training task.

LLM workloads

# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b 

Step 4: Usage recommendations

  • The image contains modifications to libraries such as PyTorch and DeepSpeed. Do not reinstall them.

  • In the DeepSpeed configuration, leave the zero_optimization.stage3_prefetch_bucket_size parameter empty or set it to auto.
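
The second recommendation can be checked before a job is submitted. The sketch below is a hypothetical helper, not a DeepSpeed API: it takes the DeepSpeed configuration as a Python dict and treats "empty" as the key being absent or set to None, which is an interpretation of the wording above.

```python
def prefetch_bucket_size_ok(ds_config: dict) -> bool:
    """True if stage3_prefetch_bucket_size is unset or "auto"."""
    zero = ds_config.get("zero_optimization", {})
    value = zero.get("stage3_prefetch_bucket_size")
    return value is None or value == "auto"


# Example configuration fragment that follows the recommendation.
recommended = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    }
}

# Example fragment that pins an explicit size and therefore does not.
pinned = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 500000000,
    }
}
```

Here prefetch_bucket_size_ok(recommended) is True and prefetch_bucket_size_ok(pinned) is False, so a launcher script can refuse or warn on the pinned form.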

Known issues

No known issues at this time.