
Container Compute Service: training-nv-pytorch 26.02

Last Updated: Mar 26, 2026

This release upgrades DeepSpeed and transformers for distributed large language model (LLM) training, and vLLM and flashinfer-python for inference workloads.

What's new

Component 26.02
DeepSpeed 0.18.5
transformers 4.57.6
vLLM 0.15.0
flashinfer-python 0.6.1

Bug fixes

No bug fixes in this release.

Image variants

Two image tags are available, differing in CUDA version and supported architectures.

Tag CUDA NVIDIA driver Supported architectures
26.02-cu130-serverless 13.0 >= 580 amd64 and aarch64
26.02-cu128-serverless 12.8 >= 575 amd64

Both variants support training and inference workloads and are built on the PyTorch framework.

Driver requirements

The 26.02 release supports CUDA 12.8.0 and CUDA 13.0.2 on different NVIDIA driver versions:

  • CUDA 13.0.2: NVIDIA driver version 580 or later

  • CUDA 12.8.0: NVIDIA driver version 575 or later

For the full compatibility matrix, see CUDA application compatibility and CUDA compatibility and upgrades.
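
To confirm which variant you are running, you can check from inside the container. This is a minimal sketch; the printed values depend on the tag you pulled:

import torch

# The CUDA runtime the torch build targets: expect 13.0 for the cu130 tag
# and 12.8 for the cu128 tag.
print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("GPU visible:", torch.cuda.is_available())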

Core components

CUDA 13.0 (26.02-cu130-serverless)

Component Version
Ubuntu 24.04
Python 3.12.7+gc
CUDA 13.0
torch 2.9.0+ali.10.nv25.10
triton 3.5.0
transformer_engine 2.11.0+c188b533
DeepSpeed 0.18.5+ali
flash_attn 2.8.3
transformers 4.57.6+ali
grouped_gemm 1.1.4
accelerate 1.11.0+ali
vLLM 0.15.0+cu130
flashinfer-python 0.6.1
peft 0.16.0
ray 2.53.0
megatron-core 0.15.0
pytorch-dynamic-profiler 0.24.11
diffusers 0.34.0
mmengine 0.10.3
mmcv 2.1.0
mmdet 3.3.0
opencv-python-headless 4.11.0.86
ultralytics 8.3.96
timm 1.0.24
perf 5.4.30
gdb 15.0.50.20240403-git

CUDA 12.8 (26.02-cu128-serverless)

Component Version
Ubuntu 24.04
Python 3.12.7+gc
CUDA 12.8
torch 2.9.0+ali.10.nv25.3
triton 3.5.0
transformer_engine 2.10.0+769ed778
DeepSpeed 0.18.5+ali
flash_attn 2.8.3
flash_attn_3 3.0.0b1
transformers 4.57.6+ali
grouped_gemm 1.1.4
accelerate 1.11.0+ali
vLLM 0.15.0+cu128
flashinfer-python 0.6.1
peft 0.16.0
ray 2.53.0
megatron-core 0.15.0
pytorch-dynamic-profiler 0.24.11
diffusers 0.34.0
mmengine 0.10.3
mmcv 2.1.0
mmdet 3.3.0
opencv-python-headless 4.11.0.86
ultralytics 8.3.96
timm 1.0.24
perf 5.4.30
gdb 15.0.50.20240403-git
Note

flash_attn_3 (3.0.0b1) is included only in the CUDA 12.8 variant.

Key enhancements

PyTorch compiler optimization

torch.compile(), introduced in PyTorch 2.0, provides limited benefit in LLM distributed training because Fully Sharded Data Parallel (FSDP) and DeepSpeed require GPU memory optimizations that can break the compiler's compute graph. This release addresses that through two improvements:

  • Communication granularity control in DeepSpeed: Controlling communication granularity within the DeepSpeed framework lets the compiler obtain a complete compute graph, enabling broader compiler optimization.

  • PyTorch compiler frontend improvements: The compiler frontend is optimized to compile successfully even when graph breaks occur. Mode matching and dynamic shape capabilities are also enhanced.

Together, these optimizations increase end-to-end (E2E) throughput by 20% when training an 8B LLM.
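
As an illustration of the compiler path these improvements target, the following minimal torch.compile sketch compiles a toy model with graph breaks tolerated; the model and input shapes are placeholders, not the distributed training setup measured above:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

# fullgraph=False allows compilation to proceed even if graph breaks occur,
# falling back to eager execution around the break points.
compiled_model = torch.compile(model, fullgraph=False)

x = torch.randn(8, 1024, device="cuda")
out = compiled_model(x)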

GPU memory optimization for recomputation

Determining the right number of activation recomputation layers typically requires manual tuning across different cluster configurations. This release automates that process: the image forecasts and analyzes GPU memory consumption by running performance tests across different cluster configurations, then integrates the optimal number of activation recomputation layers directly into PyTorch.

This feature is currently available in the DeepSpeed framework.
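
For comparison, the manual tuning that this feature replaces looks roughly like the sketch below, where the number of recomputed layers is a hand-picked value per cluster configuration (layers and num_recompute_layers are illustrative names, not an API of the image):

from torch.utils.checkpoint import checkpoint

def forward_with_recompute(layers, x, num_recompute_layers):
    # Recompute activations for the first num_recompute_layers layers to save
    # GPU memory; run the remaining layers normally. Picking this count per
    # cluster configuration is the manual step the image now automates.
    for i, layer in enumerate(layers):
        if i < num_recompute_layers:
            x = checkpoint(layer, x, use_reentrant=False)
        else:
            x = layer(x)
    return x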

End-to-end performance evaluation

Performance is measured using CNP, a cloud-native AI performance evaluation tool, against mainstream open-source models and framework configurations. Ablation experiments isolate the contribution of each optimization.

Comparison against base image across releases

(Figure: E2E training throughput of the 26.02 image compared with the base image across releases.)

GPU core component contribution to E2E performance

The following tests run on multi-node GPU clusters using the 26.02 image. Each configuration adds one optimization over the previous:

  1. Base: NGC PyTorch image

  2. ACS AI Image: Base + ACCL: ACCL communication library added

  3. ACS AI Image: AC2 + ACCL: base switched to the AC2 BaseOS, no further optimizations

  4. ACS AI Image: AC2 + ACCL + CompilerOpt: torch.compile enabled

  5. ACS AI Image: AC2 + ACCL + CompilerOpt + CkptOpt: torch.compile and selective gradient checkpointing both enabled

(Figure: incremental E2E performance contribution of each configuration listed above.)

Image availability

Public network

CUDA 13.0 (NVIDIA driver >= 580, amd64 and aarch64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.02-cu130-serverless

CUDA 12.8 (NVIDIA driver >= 575, amd64)

egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.02-cu128-serverless

VPC

To pull images faster within a virtual private cloud (VPC), replace the registry hostname:

# Public network
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}

# VPC
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}

Replace the placeholders:

Placeholder Description Example
{region-id} ID of the ACS service region cn-beijing, cn-wulanchabu
{image:tag} Image name and tag training-nv-pytorch:26.02-cu130-serverless or training-nv-pytorch:26.02-cu128-serverless
Note

This image supports the ACS product form and the Lingjun multi-tenant product form. It does not support the Lingjun single-tenant product form.

Quick start

Note

To use this image in ACS, select it from the Artifacts page in the Workload Creation interface, or specify the image reference in a YAML file.

1. Pull the image

docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

2. Enable optimizations

Enable compiler optimization

Use the transformers Trainer API:

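A minimal sketch of turning on the compiler through TrainingArguments; model and train_dataset stand in for your own fine-tuning objects and are not provided by the image:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    torch_compile=True,  # enable torch.compile via the Trainer API
)

# model and train_dataset are placeholders for your own model and dataset.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()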

Enable activation recomputation for GPU memory optimization

export CHECKPOINT_OPTIMIZATION=true

3. Start the container and run a training job

The image includes ljperf, a model training tool.

# Start the container
docker run --rm -it --ipc=host --net=host --privileged \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run a training demo (LLM)
ljperf benchmark --model deepspeed/llama3-8b

Usage notes

  • Do not reinstall libraries bundled in this image, such as PyTorch and DeepSpeed.

  • In your DeepSpeed configuration, leave zero_optimization.stage3_prefetch_bucket_size blank or set it to auto (a configuration sketch follows this list).

  • The image sets NCCL_SOCKET_IFNAME by default. Adjust based on your GPU topology:

    • 1, 2, 4, or 8 GPUs per pod: set NCCL_SOCKET_IFNAME=eth0 (default)

    • 16 GPUs per node with HPN high-performance networking: set NCCL_SOCKET_IFNAME=hpn0
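
As referenced in the DeepSpeed note above, a minimal ZeRO-3 configuration sketch that leaves the prefetch bucket size on auto; the other fields are illustrative defaults, not values prescribed by this image:

# Pass this dict to the Trainer via TrainingArguments(deepspeed=ds_config),
# or write it out as a JSON file. Only stage3_prefetch_bucket_size is the point here.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",
    },
    "bf16": {"enabled": True},
}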

Known issues

No known issues in this release.