This topic describes the release notes for training-nv-pytorch version 26.01.
Main features and bug fixes
Main features
The built-in training component megatron-core is upgraded to 0.15.0, the inference component vLLM is upgraded to 0.13.0, and flashinfer-python is upgraded to 0.5.3.
health_check is upgraded to be compatible with shuttle 1.5.3.
Bug fixes
None.
Contents
Image name | training-nv-pytorch | training-nv-pytorch |
Tag | 26.01-cu130-serverless | 26.01-cu128-serverless |
Scenarios | Training/Inference | Training/Inference |
Framework | PyTorch | PyTorch |
Requirements | NVIDIA Driver release >= 580 | NVIDIA Driver release >= 575 |
Supported Architectures | amd64 & aarch64 | amd64 |
Core components |
|
|
Assets
Public network images
CUDA 13.0.2 (Driver >= 580, amd64 & aarch64)
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.01-cu130-serverless
CUDA 12.8.0 (Driver >= 575, amd64)
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.01-cu128-serverless
VPC images
To quickly pull an ACS AI container image within a VPC, replace the public asset URI egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.
{region-id}: The ID of the region where ACS is available. For more information, see Regions and zones. Examples: cn-beijing and cn-wulanchabu.
{image:tag}: The name and tag of the AI container image. Examples: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
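The URI substitution described here can be sketched as a small helper. This is only an illustration: the function name `to_vpc_uri` and the constant `PUBLIC_PREFIX` are hypothetical, and the region ID is one of the examples given in this section.

```python
# Public registry prefix used by the asset URIs in these release notes.
PUBLIC_PREFIX = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"


def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Rewrite a public asset URI to its VPC form for the given region.

    Hypothetical helper: it only performs the string substitution
    described in the text and does not contact any registry.
    """
    if not public_uri.startswith(PUBLIC_PREFIX):
        raise ValueError(f"not a public asset URI: {public_uri}")
    image_and_tag = public_uri[len(PUBLIC_PREFIX):]
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"


public = PUBLIC_PREFIX + "training-nv-pytorch:26.01-cu130-serverless"
print(to_vpc_uri(public, "cn-beijing"))
```

The printed URI can then be passed to `docker pull` or used as the image reference in a YAML file.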
This image is for ACS and Lingjun multi-tenant products. Do not use this image with Lingjun single-tenant products.
Driver requirements
The 26.01 release supports CUDA 12.8.0 and CUDA 13.0.2 with different driver versions. CUDA 13.0.2 requires NVIDIA driver version 580 or later. CUDA 12.8.0 requires NVIDIA driver version 575 or later. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
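As a rough pre-flight check, the driver requirement can be expressed as a lookup from image tag to minimum driver major version. This is a sketch under assumptions: the helper name `driver_is_supported` is hypothetical, and the driver version strings in the example are illustrative values rather than tested drivers.

```python
# Minimum NVIDIA driver major version per image tag, as stated in the
# driver requirements of these release notes.
MIN_DRIVER_MAJOR = {
    "26.01-cu130-serverless": 580,
    "26.01-cu128-serverless": 575,
}


def driver_is_supported(tag: str, driver_version: str) -> bool:
    """Return True if the driver's major version meets the tag's minimum.

    driver_version is a dotted string such as "580.65.06"
    (illustrative value); only the major component is compared.
    """
    major = int(driver_version.split(".")[0])
    return major >= MIN_DRIVER_MAJOR[tag]


print(driver_is_supported("26.01-cu130-serverless", "580.65.06"))
print(driver_is_supported("26.01-cu130-serverless", "575.57.08"))
```

On a GPU node, the installed driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to the function.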
Key features and enhancements
PyTorch compiling optimization
The compilation optimization introduced in PyTorch 2.0 works well for small-scale training on a single GPU. LLM training, however, requires GPU memory optimization and a distributed framework such as FSDP or DeepSpeed. As a result, torch.compile() may bring no benefit to your training, or may even slow it down.
Controlling the communication granularity in the DeepSpeed framework helps the compiler obtain a complete compute graph for a wider scope of compiling optimization.
Optimized PyTorch:
The frontend of the PyTorch compiler is optimized so that compilation still proceeds when a graph break occurs in the compute graph.
The pattern matching and dynamic shape capabilities are enhanced to optimize the compiled code.
With these optimizations, end-to-end (E2E) throughput increases by 20% when training an 8B LLM.
GPU memory optimization for recomputation
We predict and analyze the GPU memory consumption of models by running performance tests on models deployed in different clusters or configured with different parameters, and by collecting system metrics such as GPU memory utilization. Based on the results, we recommend the optimal number of activation recomputation layers and integrate it into PyTorch, so that users can benefit from GPU memory optimization with little effort. Currently, this feature is available in the DeepSpeed framework.
E2E performance benefit evaluation
Using the cloud-native AI performance evaluation and analysis tool CNP, we ran a comprehensive E2E performance comparison against a standard base image, using mainstream open source models and framework configurations. We also performed ablation experiments to evaluate how much each optimization component contributes to overall model training performance.
Image comparison: Base image and iteration evaluation

E2E performance contribution analysis of core GPU components
The following tests are based on version 26.01. They involve E2E training performance evaluation and comparative analysis on a multi-node GPU cluster. The comparison items include the following:
Base: NGC PyTorch Image
ACS AI Image: Base+ACCL: The image uses the ACCL communication library.
ACS AI Image: AC2+ACCL: The golden image uses AC2 BaseOS with no optimizations enabled.
ACS AI Image: AC2+ACCL+CompilerOpt: The golden image uses AC2 BaseOS with only the torch compile optimization enabled.
ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: The golden image uses AC2 BaseOS with both torch compile and selective gradient checkpoint optimizations enabled.

Quick start
The following examples show how to pull the training-nv-pytorch image using Docker.
To use the training-nv-pytorch image in ACS, select it from the Artifact Center page on the Create Workload interface in the console, or specify the image reference in a YAML file.
1. Select an image
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
2. Call the API to enable the compiler and recomputation for GPU memory optimization
Enable compilation optimization
Use the transformers Trainer API:
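A minimal sketch of what this could look like, assuming the standard `torch_compile` argument of `transformers.TrainingArguments` is the switch intended here. The helper names and the output directory are hypothetical placeholders, and whether the image needs any further setting is not stated in these notes.

```python
def compile_kwargs(output_dir: str = "./output") -> dict:
    """Keyword arguments for TrainingArguments with compilation enabled.

    torch_compile is a TrainingArguments field; output_dir is a
    placeholder path chosen for this example.
    """
    return {"output_dir": output_dir, "torch_compile": True}


def build_training_arguments(output_dir: str = "./output"):
    """Create TrainingArguments with torch.compile enabled.

    transformers is imported lazily so that this module can be loaded
    on machines where the library is not installed.
    """
    from transformers import TrainingArguments

    return TrainingArguments(**compile_kwargs(output_dir))


print(compile_kwargs())
```

Inside the container, the result of `build_training_arguments()` would be passed as the `args` parameter of `transformers.Trainer` together with your model and dataset.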

Enable recomputation for GPU memory optimization
export CHECKPOINT_OPTIMIZATION=true
3. Start the container
The image has a built-in model training tool, ljperf. The following steps use this tool to show how to start a container and run a training job.
LLM models
# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b
4. Recommendations
This image includes changes to libraries such as PyTorch and DeepSpeed. Do not reinstall them.
In the DeepSpeed configuration, leave zero_optimization.stage3_prefetch_bucket_size empty or set it to `auto`.
The NCCL_SOCKET_IFNAME environment variable built into this image needs to be adjusted based on the scenario:
When a single pod requests 1, 2, 4, or 8 cards for a training or inference task, set NCCL_SOCKET_IFNAME=eth0. This is the default configuration in this image.
When a single pod requests all 16 cards of a machine for a training or inference task (you can use the HPN high-speed network in this case), set NCCL_SOCKET_IFNAME=hpn0.
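The two configuration recommendations above can be sketched in a few lines. The function name `nccl_socket_ifname` and the variable `ds_config_fragment` are hypothetical, and rejecting card counts other than those listed is a choice made for this example, not documented behavior of the image.

```python
import json


def nccl_socket_ifname(cards_per_pod: int) -> str:
    """Pick the NCCL_SOCKET_IFNAME value for a pod's card count.

    Follows the rule in the recommendations: eth0 for 1, 2, 4, or 8
    cards, hpn0 when the pod requests all 16 cards of a machine.
    Other counts are rejected (an assumption of this sketch).
    """
    if cards_per_pod == 16:
        return "hpn0"
    if cards_per_pod in (1, 2, 4, 8):
        return "eth0"
    raise ValueError(f"unsupported card count: {cards_per_pod}")


# DeepSpeed config fragment with the prefetch bucket size set to "auto".
ds_config_fragment = {
    "zero_optimization": {"stage3_prefetch_bucket_size": "auto"}
}

print(f"export NCCL_SOCKET_IFNAME={nccl_socket_ifname(16)}")
print(json.dumps(ds_config_fragment, indent=2))
```

The printed export line can be run in the container shell before launching the job, and the JSON fragment can be merged into an existing DeepSpeed configuration file.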
Known issues
None.