This release upgrades DeepSpeed and transformers for distributed large language model (LLM) training, and vLLM and flashinfer-python for inference workloads.
What's new
| Component | Previous | 26.02 |
|---|---|---|
| DeepSpeed | — | 0.18.5 |
| transformers | — | 4.57.6 |
| vLLM | — | 0.15.0 |
| flashinfer-python | — | 0.6.1 |
Bug fixes
No bug fixes in this release.
Image variants
Two image tags are available, differing in CUDA version and supported architectures.
| Tag | CUDA | NVIDIA driver | Supported architectures |
|---|---|---|---|
| 26.02-cu130-serverless | 13.0 | >= 580 | amd64 and aarch64 |
| 26.02-cu128-serverless | 12.8 | >= 575 | amd64 |
Both variants support training and inference workloads and are built on the PyTorch framework.
Driver requirements
The 26.02 release supports CUDA 12.8.0 and CUDA 13.0.2 on different NVIDIA driver versions:
- CUDA 13.0.2: NVIDIA driver version 580 or later
- CUDA 12.8.0: NVIDIA driver version 575 or later
For the full compatibility matrix, see CUDA application compatibility and CUDA compatibility and upgrades.
Core components
CUDA 13.0 (26.02-cu130-serverless)
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| CUDA | 13.0 |
| torch | 2.9.0+ali.10.nv25.10 |
| triton | 3.5.0 |
| transformer_engine | 2.11.0+c188b533 |
| DeepSpeed | 0.18.5+ali |
| flash_attn | 2.8.3 |
| transformers | 4.57.6+ali |
| grouped_gemm | 1.1.4 |
| accelerate | 1.11.0+ali |
| vLLM | 0.15.0+cu130 |
| flashinfer-python | 0.6.1 |
| peft | 0.16.0 |
| ray | 2.53.0 |
| megatron-core | 0.15.0 |
| pytorch-dynamic-profiler | 0.24.11 |
| diffusers | 0.34.0 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| opencv-python-headless | 4.11.0.86 |
| ultralytics | 8.3.96 |
| timm | 1.0.24 |
| perf | 5.4.30 |
| gdb | 15.0.50.20240403-git |
CUDA 12.8 (26.02-cu128-serverless)
| Component | Version |
|---|---|
| Ubuntu | 24.04 |
| Python | 3.12.7+gc |
| CUDA | 12.8 |
| torch | 2.9.0+ali.10.nv25.3 |
| triton | 3.5.0 |
| transformer_engine | 2.10.0+769ed778 |
| DeepSpeed | 0.18.5+ali |
| flash_attn | 2.8.3 |
| flash_attn_3 | 3.0.0b1 |
| transformers | 4.57.6+ali |
| grouped_gemm | 1.1.4 |
| accelerate | 1.11.0+ali |
| vLLM | 0.15.0+cu128 |
| flashinfer-python | 0.6.1 |
| peft | 0.16.0 |
| ray | 2.53.0 |
| megatron-core | 0.15.0 |
| pytorch-dynamic-profiler | 0.24.11 |
| diffusers | 0.34.0 |
| mmengine | 0.10.3 |
| mmcv | 2.1.0 |
| mmdet | 3.3.0 |
| opencv-python-headless | 4.11.0.86 |
| ultralytics | 8.3.96 |
| timm | 1.0.24 |
| perf | 5.4.30 |
| gdb | 15.0.50.20240403-git |
flash_attn_3 (3.0.0b1) is included only in the CUDA 12.8 variant.
Key enhancements
PyTorch compiler optimization
torch.compile(), introduced in PyTorch 2.0, provides limited benefit in LLM distributed training because Fully Sharded Data Parallel (FSDP) and DeepSpeed require GPU memory optimizations that can break the compiler's compute graph. This release addresses that through two improvements:
- Communication granularity control in DeepSpeed: Controlling communication granularity inside the DeepSpeed framework lets the compiler capture a complete compute graph, enabling broader compiler optimization.
- PyTorch compiler frontend improvements: The compiler frontend now compiles successfully even when graph breaks occur, and pattern matching and dynamic shape support are enhanced.
Together, these optimizations increase end-to-end (E2E) throughput by 20% when training an 8B LLM.
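Graph breaks can be inspected with stock PyTorch tooling. The following is a minimal sketch (illustrative only; it demonstrates the general mechanism, not the image's internal changes) that uses `torch._dynamo.explain` to count breaks caused by data-dependent control flow:

```python
import torch
import torch._dynamo as dynamo

class Toy(torch.nn.Module):
    def forward(self, x):
        x = torch.relu(x @ x.T)
        # Data-dependent Python control flow typically forces a graph break,
        # splitting the compiled graph into fragments.
        if x.sum() > 0:
            x = x + 1
        return x

explanation = dynamo.explain(Toy())(torch.randn(8, 8))
print(explanation.graph_break_count)  # non-zero when the graph is split
```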
GPU memory optimization for recomputation
Determining the right number of activation recomputation layers typically requires manual tuning across different cluster configurations. This release automates that process: the image forecasts and analyzes GPU memory consumption by running performance tests across different cluster configurations, then integrates the optimal number of activation recomputation layers directly into PyTorch.
This feature is currently available in the DeepSpeed framework.
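For context on the knob being tuned, here is a minimal hand-written sketch of selective activation recomputation using `torch.utils.checkpoint` (the layer stack and `n_ckpt` are hypothetical; the image selects this value automatically):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical 4-layer stack; n_ckpt is the number of layers whose
# activations are recomputed instead of stored.
layers = torch.nn.ModuleList(torch.nn.Linear(256, 256) for _ in range(4))

def forward(x, n_ckpt=2):
    for i, layer in enumerate(layers):
        if i < n_ckpt:
            # Discard activations here and recompute them during backward,
            # trading extra compute for lower peak GPU memory.
            x = checkpoint(layer, x, use_reentrant=False)
        else:
            x = layer(x)
    return x

forward(torch.randn(8, 256, requires_grad=True)).sum().backward()
```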
End-to-end performance evaluation
Performance is measured using CNP, a cloud-native AI performance evaluation tool, against mainstream open-source models and framework configurations. Ablation experiments isolate the contribution of each optimization.
Comparison against base image across releases
GPU core component contribution to E2E performance
The following tests run on multi-node GPU clusters using the 26.02 image. Each configuration adds one optimization over the previous:
- Base: NGC PyTorch image
- ACS AI Image: Base + ACCL: ACCL communication library added
- ACS AI Image: AC2 + ACCL: AC2 BaseOS, no further optimizations
- ACS AI Image: AC2 + ACCL + CompilerOpt: `torch.compile` enabled
- ACS AI Image: AC2 + ACCL + CompilerOpt + CkptOpt: `torch.compile` and selective gradient checkpointing both enabled
Image availability
Public network
CUDA 13.0 (NVIDIA driver >= 580, amd64 and aarch64):

```
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.02-cu130-serverless
```

CUDA 12.8 (NVIDIA driver >= 575, amd64):

```
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.02-cu128-serverless
```
VPC
To pull images faster within a virtual private cloud (VPC), replace the registry hostname:
```
# Public network
egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/{image:tag}

# VPC
acs-registry-vpc.{region-id}.cr.aliyuncs.com/egslingjun/{image:tag}
```
Replace the placeholders:
| Placeholder | Description | Example |
|---|---|---|
| {region-id} | ID of the ACS service region | cn-beijing, cn-wulanchabu |
| {image:tag} | Image name and tag | inference-nv-pytorch:25.10-vllm0.11.0-pytorch2.8-cu128-20251028-serverless or training-nv-pytorch:25.10-serverless |
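For example, with `{region-id}` set to `cn-wulanchabu` and this release's CUDA 13.0 training image, the VPC address becomes `acs-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:26.02-cu130-serverless`.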
This image supports the ACS product form and the Lingjun multi-tenant product form. It does not support the Lingjun single-tenant product form.
Quick start
To use this image in ACS, select it from the Artifacts page in the Workload Creation interface, or specify the image reference in a YAML file.
1. Pull the image
```
docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]
```
2. Enable optimizations
Enable compiler optimization
Use the transformers Trainer API, as in the sketch below.
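A minimal sketch using the standard `torch_compile` flag of `TrainingArguments` (the `output_dir` value and the model/dataset wiring are placeholders for your own setup):

```python
from transformers import Trainer, TrainingArguments

# torch_compile=True asks the Trainer to wrap the model with torch.compile().
args = TrainingArguments(
    output_dir="./output",  # placeholder
    torch_compile=True,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```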
Enable activation recomputation for GPU memory optimization
```
export CHECKPOINT_OPTIMIZATION=true
```
3. Start the container and run a training job
The image includes ljperf, a model training tool.
```
# Start the container
docker run --rm -it --ipc=host --net=host --privileged \
  egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/training-nv-pytorch:[tag]

# Run a training demo (LLM)
ljperf benchmark --model deepspeed/llama3-8b
```
Usage notes
- Do not reinstall libraries bundled in this image, such as PyTorch and DeepSpeed.
- In your DeepSpeed configuration, leave `zero_optimization.stage3_prefetch_bucket_size` blank or set it to `auto`, as in the sketch after this list.
- The image sets `NCCL_SOCKET_IFNAME` by default. Adjust based on your GPU topology:
  - 1, 2, 4, or 8 GPUs per pod: set `NCCL_SOCKET_IFNAME=eth0` (default)
  - 16 GPUs per node with HPN high-performance networking: set `NCCL_SOCKET_IFNAME=hpn0`
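A minimal ZeRO-3 configuration fragment illustrating the `stage3_prefetch_bucket_size` note above (the other keys and values are placeholders):

```python
# Hypothetical DeepSpeed ZeRO-3 config fragment; only the
# "stage3_prefetch_bucket_size" entry reflects the note above.
ds_config = {
    "train_batch_size": 32,  # placeholder
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": "auto",  # or omit the key entirely
    },
}
```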
Known issues
No known issues in this release.