These release notes cover the updates for the training-nv-pytorch 26.03 image.
Main features and bug fixes
Main features
Upgraded torch to version 2.10.
Upgraded vllm to version 0.17.0.
Upgraded megatron-core to version 0.16.0.
Upgraded deepspeed to version 0.18.8.
Upgraded transformer_engine to version 2.12.
Bug fixes
No bug fixes in this release.
Contents
Image name | training-nv-pytorch | |
Tag | 26.03-cu130-serverless | 26.03-cu128-serverless |
Use case | Training and inference | |
Framework | PyTorch | |
Requirements | NVIDIA Driver release >= 580 | NVIDIA Driver release >= 575 |
Supported architectures | amd64 and aarch64 | amd64 |
Core components |
|
|
Assets
Public images
CUDA 13.0.2 (Driver >= 580, amd64 and aarch64)
registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.03-cu130-serverless
CUDA 12.8 (Driver >= 575, amd64)
registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.03-cu128-serverless
VPC images
To quickly pull ACS AI container images within a VPC, replace the public registry in the image URI registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with the VPC registry registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.
{region-id}: The region ID of one of the ACS available regions, such ascn-beijingorcn-wulanchabu.{image:tag}: The name and tag of the AI container image, such asinference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverlessortraining-nv-pytorch:25.10-serverless.
This image is suitable for ACS and multi-tenant Lingjun products. It is not intended for use with single-tenant Lingjun products.
Driver requirements
The 26.03 release supports CUDA 12.8.0 and CUDA 13.0.2, depending on the driver version. CUDA 13.0.2 requires NVIDIA Driver version 580 or later, and CUDA 12.8.0 requires NVIDIA Driver version 575 or later. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
Key features and enhancements
PyTorch compiling optimization
torch.compile(), introduced in PyTorch 2.0, is effective for single-GPU training but provides limited or negative benefit for large language model (LLM) training, which depends on GPU memory optimization and distributed frameworks such as Fully Sharded Data Parallel (FSDP) or DeepSpeed.
This release improves torch.compile() for distributed LLM training through two optimizations:
-
Communication granularity control in DeepSpeed: Controlling communication granularity gives the compiler a complete compute graph, enabling wider compiling optimization.
-
Frontend improvements: The PyTorch compiler frontend now compiles even when a graph break occurs, with enhanced mode matching and dynamic shape capabilities.
Result: 20% higher end-to-end throughput when training an 8B LLM.
GPU memory optimization for recomputation
Based on performance tests across different clusters and parameter configurations, this release integrates the optimal number of activation recomputation layers directly into PyTorch. Enable it with a single environment variable — no manual tuning required.
This feature is currently available in the DeepSpeed framework only.
End-to-end performance evaluation
Using the cloud-native AI performance analysis tool CNP, we conducted a comprehensive end-to-end performance comparison against a standard base image with mainstream open-source models and framework configurations. We also performed ablation studies to evaluate the contribution of each optimized component to the overall model training performance.
Image comparison: Base image vs. iterative evaluation

E2E performance contribution analysis of core GPU components
The following tests evaluate and compare end-to-end training performance on a multi-node GPU cluster using this image release. The comparison items include:
Base: NGC PyTorch image
ACS AI Image (Base + ACCL): The image uses the ACCL communication library.
ACS AI Image (AC2 + ACCL): The image uses AC2 BaseOS with no optimizations enabled.
ACS AI Image (AC2 + ACCL + CompilerOpt): The image uses AC2 BaseOS with only the torch compile optimization enabled.
ACS AI Image (AC2 + ACCL + CompilerOpt + CkptOpt): The image uses AC2 BaseOS with both torch compile and selective gradient checkpointing optimizations enabled.

Quick start
The following example shows how to pull the training-nv-pytorch image using Docker.
To use the training-nv-pytorch image in ACS, select it from the Artifacts page when you create a workload in the console, or specify the image reference in a YAML file.
Step 1: Select an image
docker pull registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]Step 2: Enable compiler and memory optimizations
Enable compilation optimization
Use the transformers Trainer API:

Enable selective gradient checkpointing
export CHECKPOINT_OPTIMIZATION=true
Step 3: Start the container
The image includes the built-in model training tool ljperf. The following steps describe how to start the container and run a training task.
LLM workloads
# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b Step 4: Usage recommendations
The image contains modifications to libraries such as PyTorch and DeepSpeed. Do not reinstall them.
In the DeepSpeed configuration, leave the
zero_optimization.stage3_prefetch_bucket_sizeparameter empty or set it toauto.
Known issues
No known issues at this time.