training-nv-pytorch 26.01 - Container Compute Service - Alibaba Cloud Documentation Center

This topic contains the release notes for training-nv-pytorch version 26.01.

Main features and bug fixes

Main features

The built-in training component megatron-core is upgraded to 0.15.0, the inference component vLLM is upgraded to 0.13.0, and flashinfer-python is upgraded to 0.5.3.
health_check is upgraded to be compatible with shuttle 1.5.3.

Bug fixes

None.

Image name	training-nv-pytorch
Tag	26.01-cu130-serverless	26.01-cu128-serverless
Scenarios	Training/Inference
Framework	PyTorch
Requirements	NVIDIA Driver release >= 580	NVIDIA Driver release >= 575
Supported Architectures	amd64 & aarch64	amd64
Core components	26.01-cu130-serverless	26.01-cu128-serverless
Ubuntu	24.04	24.04
Python	3.12.7+gc	3.12.7+gc
CUDA	13.0	12.8
perf	5.4.30	5.4.30
gdb	15.0.50.20240403-git	15.0.50.20240403-git
torch	2.9.0+ali.10.nv25.10	2.9.0+ali.10.nv25.3
triton	3.5.0	3.5.0
transformer_engine	2.10.0+769ed778	2.10.0+769ed778
deepspeed	0.18.1+ali	0.18.1+ali
flash_attn	2.8.3	2.8.3
flash_attn_3	not listed	3.0.0b1
transformers	4.57.1+ali	4.57.1+ali
grouped_gemm	1.1.4	1.1.4
accelerate	1.11.0+ali	1.11.0+ali
diffusers	0.34.0	0.34.0
mmengine	0.10.3	0.10.3
mmcv	2.1.0	2.1.0
mmdet	3.3.0	3.3.0
opencv-python-headless	4.11.0.86	4.11.0.86
ultralytics	8.3.96	8.3.96
timm	1.0.24	1.0.24
vllm	0.13.0+cu130	0.13.0+cu128
flashinfer-python	0.5.3	0.5.3
pytorch-dynamic-profiler	0.24.11	0.24.11
peft	0.16.0	0.16.0
ray	2.53.0	2.53.0
megatron-core	0.15.0	0.15.0

Assets

Public network images

CUDA 13.0.2 (Driver >= 580, amd64 & aarch64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.01-cu130-serverless

CUDA 12.8 (Driver >= 575, amd64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.01-cu128-serverless

VPC images

To quickly pull an ACS AI container image within a VPC, replace the specified AI container image Asset URI egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag} with acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}.

{region-id}: The ID of the region where ACS is available. For more information, see Regions and zones. Examples: cn-beijing and cn-wulanchabu.
{image:tag}: The name and tag of the AI container image. Examples: inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless and training-nv-pytorch:25.10-serverless.
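
The substitution described above can be sketched as a small helper. This is only an illustration: the function name `to_vpc_uri` and the constant `PUBLIC_PREFIX` are hypothetical, and the helper assumes the public URI always starts with the registry prefix shown in this topic.

```python
# Public registry prefix as it appears in this topic.
PUBLIC_PREFIX = "egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/"


def to_vpc_uri(public_uri: str, region_id: str) -> str:
    """Rewrite a public image URI into its VPC form for the given region.

    Hypothetical helper: it only swaps the registry host and keeps the
    {image:tag} part unchanged.
    """
    if not public_uri.startswith(PUBLIC_PREFIX):
        raise ValueError(f"unexpected registry prefix in {public_uri!r}")
    image_and_tag = public_uri[len(PUBLIC_PREFIX):]
    return f"acs-registry-vpc.{region_id}.cr.aliyuncs.com/egslingjun/{image_and_tag}"


if __name__ == "__main__":
    print(to_vpc_uri(PUBLIC_PREFIX + "training-nv-pytorch:26.01-cu130-serverless", "cn-beijing"))
```

The region ID is passed in explicitly because the VPC registry host depends on the region where the workload runs.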

Note

This image is for ACS and Lingjun multi-tenant products. Do not use this image with Lingjun single-tenant products.

Driver requirements

The 26.01 release supports CUDA 12.8.0 and CUDA 13.0.2 with different driver versions. CUDA 13.0.2 requires NVIDIA driver version 580 or later. CUDA 12.8.0 requires NVIDIA driver version 575 or later. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
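
As a rough illustration of these requirements, the following sketch picks an image tag from the driver's major version and the CPU architecture. The function name `select_tag` is hypothetical, and preferring the CUDA 13.0 image when both tags are eligible is an assumption, not a recommendation from this topic.

```python
def select_tag(driver_major: int, arch: str = "amd64") -> str:
    """Pick a 26.01 image tag from the NVIDIA driver major version and CPU arch.

    Hypothetical helper based on the stated requirements:
    - cu130 needs driver >= 580 and supports amd64 and aarch64
    - cu128 needs driver >= 575 and supports amd64 only
    """
    if arch not in ("amd64", "aarch64"):
        raise ValueError(f"unsupported architecture: {arch}")
    # Assumption: prefer the newer CUDA image when the driver allows it.
    if driver_major >= 580:
        return "26.01-cu130-serverless"
    if driver_major >= 575 and arch == "amd64":
        return "26.01-cu128-serverless"
    raise ValueError(f"no 26.01 tag matches driver {driver_major} on {arch}")
```

For example, `select_tag(575, "amd64")` returns the cu128 tag, while `select_tag(575, "aarch64")` raises an error because only the cu130 image lists aarch64 support.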

Key features and enhancements

PyTorch compiling optimization

The compilation optimization feature introduced in PyTorch 2.0 is suitable for small-scale training on a single GPU. However, LLM training requires GPU memory optimization and a distributed framework, such as FSDP or DeepSpeed. As a result, torch.compile() may bring no benefit to your training, or may even reduce performance.

Controlling the communication granularity in the DeepSpeed framework helps the compiler obtain a complete compute graph for a wider scope of compiling optimization.
Optimized PyTorch:
- The frontend of the PyTorch compiler is optimized so that compilation still succeeds when a graph break occurs in a compute graph.
- The pattern matching and dynamic shape capabilities are enhanced to optimize the compiled code.

With the preceding optimizations, E2E throughput increases by 20% when training an 8B LLM.

GPU memory optimization for recomputation

We forecast and analyze the GPU memory consumption of models by running performance tests on models deployed in different clusters or configured with different parameters, and by collecting system metrics such as GPU memory utilization. Based on the results, we recommend the optimal number of activation recomputation layers and integrate this recommendation into PyTorch, so that users can benefit from GPU memory optimization with little effort. Currently, this feature is available in the DeepSpeed framework.

E2E performance benefit evaluation

Using the cloud-native AI performance evaluation and analysis tool CNP, we ran a comprehensive E2E performance comparison against a standard base image, with mainstream open source models and framework configurations. We also ran ablation experiments to measure how much each optimization component contributes to overall model training performance.

Image comparison: Base image and iteration evaluation

E2E performance contribution analysis of core GPU components

The following tests are based on version 26.01. They involve E2E training performance evaluation and comparative analysis on a multi-node GPU cluster. The comparison items include the following:

Base: NGC PyTorch Image
ACS AI Image: Base+ACCL: The image uses the ACCL communication library.
ACS AI Image: AC2+ACCL: The golden image uses AC2 BaseOS with no optimizations enabled.
ACS AI Image: AC2+ACCL+CompilerOpt: The golden image uses AC2 BaseOS with only the torch compile optimization enabled.
ACS AI Image: AC2+ACCL+CompilerOpt+CkptOpt: The golden image uses AC2 BaseOS with both torch compile and selective gradient checkpoint optimizations enabled.

Quick start

The following examples show how to pull the training-nv-pytorch image using Docker.

Note

To use the training-nv-pytorch image in ACS, select it from the Artifact Center page on the Create Workload interface in the console, or specify the image reference in a YAML file.

1. Select an image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Call the API to enable the compiler and recomputation for GPU memory optimization

Enable compilation optimization
Use the transformers Trainer API:
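
A minimal sketch of this step, assuming that the standard `torch_compile` argument of `transformers.TrainingArguments` is the intended switch. The helper names and the `output_dir` value are hypothetical.

```python
def compile_enabled_kwargs(output_dir: str = "./output") -> dict:
    """Keyword arguments that request torch.compile through the Trainer API."""
    # torch_compile=True asks the Trainer to compile the model with torch.compile().
    return {"output_dir": output_dir, "torch_compile": True}


def build_training_args(output_dir: str = "./output"):
    """Create TrainingArguments with compilation enabled."""
    # Imported lazily so that the helper above works without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(**compile_enabled_kwargs(output_dir))
```

The object returned by `build_training_args()` is then passed to `Trainer(args=...)` together with your model and dataset.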
Enable recomputation for GPU memory optimization
```
export CHECKPOINT_OPTIMIZATION=true
```

3. Start the container

The image has a built-in model training tool, ljperf. The following steps use this tool to show how to start a container and run a training job.

LLM models

# Start and enter the container
docker run --rm -it --ipc=host --net=host --privileged egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run the training demo
ljperf benchmark --model deepspeed/llama3-8b

4. Recommendations

This image includes modified versions of libraries such as PyTorch and DeepSpeed. Do not reinstall them.
In the DeepSpeed configuration, leave zero_optimization.stage3_prefetch_bucket_size empty or set it to `auto`.
Adjust the NCCL_SOCKET_IFNAME environment variable built into this image according to the scenario:
- When a single pod requests 1, 2, 4, or 8 GPUs for a training or inference task, set NCCL_SOCKET_IFNAME=eth0. This is the default configuration in this image.
- When a single pod requests all 16 GPUs of a machine for a training or inference task (in this case, you can use the HPN high-speed network), set NCCL_SOCKET_IFNAME=hpn0.
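
The DeepSpeed and NCCL recommendations above can be sketched as follows. Both function names are hypothetical, and using ZeRO stage 3 in the configuration fragment is an assumption made only because the parameter name carries the `stage3_` prefix.

```python
import json


def nccl_socket_ifname(gpus_per_pod: int) -> str:
    """Return the NCCL_SOCKET_IFNAME value recommended for a pod's GPU count."""
    if gpus_per_pod in (1, 2, 4, 8):
        return "eth0"  # default configuration in this image
    if gpus_per_pod == 16:
        return "hpn0"  # whole machine; the HPN high-speed network is available
    raise ValueError(f"no recommendation for {gpus_per_pod} GPUs per pod")


def deepspeed_zero_section() -> dict:
    """DeepSpeed config fragment with the prefetch bucket size set to auto."""
    return {
        "zero_optimization": {
            "stage": 3,  # assumption: stage 3, matching the stage3_ prefix
            "stage3_prefetch_bucket_size": "auto",
        }
    }


if __name__ == "__main__":
    print(f"export NCCL_SOCKET_IFNAME={nccl_socket_ifname(16)}")
    print(json.dumps(deepspeed_zero_section(), indent=2))
```

Omitting the `stage3_prefetch_bucket_size` key entirely corresponds to the other option mentioned above, leaving the value empty.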

Known issues