Platform For AI: Reference: Performance benchmarks

Last Updated: Jun 20, 2026

This document reports performance benchmarks for PyTorch Transformer model training optimized with Pai-Megatron-Patch.

Background information

We conducted all experiments on an Alibaba Cloud ECS instance with the following specifications: instance type ecs.gn6e-c12g1.12xlarge with a 48-core CPU, 368 GiB of memory, and four NVIDIA V100 GPUs. The environment used the Ubuntu 18.04 64-bit operating system with the image ID ubuntu_18_04_x64_20G_alibase_20211227.vhd and a peak bandwidth of 100 Mbps. We ran the nvidia-smi command to confirm that driver version 440.64.00 and CUDA 10.2 were ready, and that all four Tesla V100-SXM2 GPUs were idle.

| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2  |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   32C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   30C    P0    39W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   31C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
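The idle check described above can also be scripted. The following is a minimal sketch, not part of the original benchmark procedure: it parses text in the same layout as the nvidia-smi table and reports whether every GPU shows 0 MiB used and 0% utilization. The names `SAMPLE`, `gpu_usage`, and `all_idle` are hypothetical, and the regular expression assumes the table layout shown for driver 440.64.00.

```python
import re

# Sample rows in the same layout as the nvidia-smi output above (two GPUs shown).
SAMPLE = """\
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   32C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
"""

# Matches the "used / total" memory column followed by the utilization column.
USAGE_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%")


def gpu_usage(text):
    """Return one (used_mib, total_mib, util_percent) tuple per GPU row."""
    return [tuple(int(g) for g in m.groups()) for m in USAGE_RE.finditer(text)]


def all_idle(text):
    """True if at least one GPU is listed and all report 0 MiB used and 0% utilization."""
    rows = gpu_usage(text)
    return bool(rows) and all(used == 0 and util == 0 for used, _, util in rows)


print(gpu_usage(SAMPLE))
print(all_idle(SAMPLE))
```

To apply the same check to a live machine, the text could be obtained with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and passed to `all_idle`.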

Mixed-precision training

Experiment setup: Pre-training an English Hugging Face BERT model.

  • num-layers 12

  • hidden-size 768

  • num-attention-heads 12

  • num-params 110,106,428

  • local-rank 4

  • seq-length 512

  • micro-batch-size 16

  • global-batch-size 64
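The batch-size settings above are related by the usual Megatron-style identity: global batch size = micro batch size × data-parallel size × gradient-accumulation steps. The sketch below checks that identity, assuming that `local-rank 4` means four GPU processes running pure data parallelism (the source does not state this explicitly). The function name `grad_accum_steps` is hypothetical.

```python
def grad_accum_steps(global_batch, micro_batch, data_parallel_size):
    """Number of micro-batches each rank accumulates before one optimizer step."""
    samples_per_pass = micro_batch * data_parallel_size
    if global_batch % samples_per_pass != 0:
        raise ValueError(
            "global batch size must be a multiple of "
            "micro batch size x data-parallel size"
        )
    return global_batch // samples_per_pass


# Settings from the list above: micro-batch-size 16, global-batch-size 64,
# and (assumed) 4 data-parallel ranks, one per GPU.
print(grad_accum_steps(global_batch=64, micro_batch=16, data_parallel_size=4))
```

Under that assumption the result is 1, meaning each optimizer step consumes exactly one micro-batch per GPU with no gradient accumulation.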

| Solution                  | Throughput (samples/s) | Peak memory (MB) |
|---------------------------|------------------------|------------------|
| Single-precision training | 103.07 +/- 1.03        | 17,025           |
| Mixed-precision training  | 178.15 +/- 2.10        | 12,698           |
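To summarize these two rows as a speedup with an error estimate, the following sketch propagates the reported spreads through the throughput ratio. It assumes that each "+/-" value is a standard deviation and that the two measurements are independent; the source does not say how the spreads were computed, so the resulting error bar is indicative only. The function name `ratio_with_error` is hypothetical.

```python
import math


def ratio_with_error(a, sigma_a, b, sigma_b):
    """Return a/b and its first-order propagated uncertainty (independent errors)."""
    ratio = a / b
    sigma = ratio * math.sqrt((sigma_a / a) ** 2 + (sigma_b / b) ** 2)
    return ratio, sigma


# Mixed-precision vs. single-precision throughput from the table above.
speedup, err = ratio_with_error(178.15, 2.10, 103.07, 1.03)

# Relative reduction in peak memory (17,025 MB down to 12,698 MB).
memory_saving = 1 - 12698 / 17025

print(f"speedup: {speedup:.2f} +/- {err:.2f}")        # speedup: 1.73 +/- 0.03
print(f"peak memory reduction: {memory_saving:.1%}")  # peak memory reduction: 25.4%
```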

GPU memory optimization: Model state partitioning

Experiment setup: Pre-training an English Megatron GPT model.

  • num-layers 24

  • hidden-size 2048

  • num-attention-heads 32

  • num-params 1,313,722,368 (1.3 billion)

  • local-rank 4

  • seq-length 1024

  • micro-batch-size 1

  • global-batch-size 4
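The parameter count listed above can be reproduced from the other settings with a standard GPT-style accounting. This is a sketch under assumptions the source does not state: a GPT-2 vocabulary of 50,257 tokens padded to a multiple of 128, learned position embeddings, and an output layer that shares weights with the token embedding. The names `pad_vocab` and `gpt_param_count` are hypothetical.

```python
def pad_vocab(vocab_size, multiple=128):
    """Round the vocabulary size up to the next multiple (assumed padding rule)."""
    return -(-vocab_size // multiple) * multiple


def gpt_param_count(num_layers, hidden, seq_len, padded_vocab):
    """Count parameters of a GPT-style decoder with tied input/output embeddings."""
    # Per layer: QKV (3h^2 + 3h), attention output (h^2 + h),
    # MLP (4h^2 + 4h and 4h^2 + h), and two LayerNorms (2h each).
    per_layer = 12 * hidden**2 + 13 * hidden
    final_layernorm = 2 * hidden
    # Token embeddings plus learned position embeddings.
    embeddings = (padded_vocab + seq_len) * hidden
    return num_layers * per_layer + final_layernorm + embeddings


vocab = pad_vocab(50257)  # assumed GPT-2 vocabulary, padded to 50,304
total = gpt_param_count(num_layers=24, hidden=2048, seq_len=1024, padded_vocab=vocab)
print(f"{total:,}")  # 1,313,722,368
```

Under these assumptions the result equals the num-params value in the list, which suggests, but does not confirm, that the benchmark used this configuration.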

Without optimizations, training this model on a 32 GB GPU fails with an out-of-memory (OOM) error, as the traceback below shows. For example, the Adam optimizer states alone require about 16 GB of GPU memory.

  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/language_model.py", line 351, in forward
    encoder_output = self.encoder(encoder_input,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/transformer.py", line 703, in forward
    hidden_states = layer(hidden_states,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/transformer.py", line 441, in forward
    self.self_attention(layernorm_output,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/transformer.py", line 264, in forward
    matmul_result = torch.baddbmm(
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 1; 31.75 GiB total capacity; 28.56 GiB already allocated; 84.00 MiB free; 30.19 GiB reserved in total by PyTorch)
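The 16 GB figure quoted for the Adam optimizer states can be approximated with simple arithmetic. The sketch below assumes one common accounting: 12 bytes per parameter, made up of an FP32 copy of the weights plus FP32 momentum and variance. The source does not say which accounting it used, so treat this as an estimate.

```python
NUM_PARAMS = 1_313_722_368  # num-params from the experiment setup

# Assumed Adam state per parameter, all in FP32 (4 bytes each):
# weight copy + momentum + variance.
BYTES_PER_PARAM = 4 + 4 + 4

adam_bytes = NUM_PARAMS * BYTES_PER_PARAM

print(f"{adam_bytes / 1e9:.2f} GB")     # 15.76 GB (decimal gigabytes)
print(f"{adam_bytes / 2**30:.2f} GiB")  # 14.68 GiB (binary gibibytes)
```

On top of this, the weights, gradients, and activations also need memory, which is consistent with the process running out of space on a GPU with 31.75 GiB of capacity.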

| Solution                                                       | Throughput (samples/s) | Peak memory (MB) |
|----------------------------------------------------------------|------------------------|------------------|
| No optimization                                                | OOM                    | OOM              |
| Mixed-precision training                                       | 9.57 +/- 0.26          | 25,061           |
| Mixed-precision training + Optimizer State Sharding (OSS)      | 6.02 +/- 0.06          | 22,077           |
| Mixed-precision training + OSS / Sharded Data Parallel (SDP)   | 7.01 +/- 0.07          | 17,113           |
| Mixed-precision training + Fully Sharded Data Parallel (FSDP)  | NA                     | NA               |
| Mixed-precision training + ZeRO Stage 1                        | 12.88 +/- 0.10         | 15,709           |
| Mixed-precision training + ZeRO Stage 2                        | 10.27 +/- 0.08         | 15,693           |
| Mixed-precision training + ZeRO Stage 3                        | NA                     | NA               |
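For intuition about why partitioning the model state lowers memory use, the sketch below applies the per-GPU accounting commonly used for ZeRO with mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer states, with each ZeRO stage sharding one more of these across the GPUs. It is an estimate of model-state memory only. Measured peak memory also includes activations, temporary buffers, and allocator overhead, so these numbers are not expected to match the table. The function name `model_state_bytes_per_gpu` is hypothetical.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Estimate per-GPU model-state bytes for ZeRO stage 0 (none) through 3."""
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus     # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus     # Stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus    # Stage 3 also shards the weights
    return params + grads + optim


NUM_PARAMS = 1_313_722_368
for stage in range(4):
    gb = model_state_bytes_per_gpu(NUM_PARAMS, num_gpus=4, stage=stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:.2f} GB of model state per GPU")
```

With four GPUs this gives roughly 21.0, 9.2, 7.2, and 5.3 GB for stages 0 through 3.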

3D hybrid parallelism

Experiment setup: Pre-training an English Megatron GPT model.

  • num-layers 24

  • hidden-size 2048

  • num-attention-heads 32

  • num-params 1,313,722,368 (1.3 billion)

  • local-rank 4

  • seq-length 1024

  • micro-batch-size 1

  • global-batch-size 4

With mixed-precision training enabled:

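| Tensor parallelism | Pipeline parallelism | Throughput (samples/s) | Peak memory (MB) |
|--------------------|----------------------|------------------------|------------------|
| 1                  | 1                    | 9.63 +/- 0.29          | 25,061           |
| 2                  | 1                    | 7.59 +/- 0.14          | 11,300           |
| 4                  | 1                    | 6.16 +/- 0.06          | 5,673            |
| 1                  | 2                    | 8.46 +/- 0.17          | 12,375           |
| 1                  | 4                    | 8.03 +/- 0.12          | 8,141            |
| 2                  | 2                    | 7.37 +/- 0.11          | 6,211            |
| 4                  | 4                    | 6.24 +/- 0.08          | 5,673            |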
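In 3D hybrid parallelism, the tensor-parallel and pipeline-parallel degrees together determine how many GPUs remain for data parallelism: world size = tensor parallel × pipeline parallel × data parallel. The sketch below shows that relationship for a four-GPU machine. The function name `data_parallel_size` is hypothetical, and the check it performs is a general consistency rule, not something taken from Pai-Megatron-Patch.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """Return the data-parallel degree left after model parallelism is assigned."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"tensor x pipeline = {model_parallel} does not divide "
            f"world size {world_size}"
        )
    return world_size // model_parallel


# A few (tensor, pipeline) combinations on 4 GPUs.
for tp, pp in [(1, 1), (2, 1), (4, 1), (1, 2), (1, 4), (2, 2)]:
    dp = data_parallel_size(world_size=4, tensor_parallel=tp, pipeline_parallel=pp)
    print(f"TP={tp} PP={pp} -> DP={dp}")
```

A combination whose product does not divide the number of GPUs raises `ValueError` in this sketch.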

ONNX Runtime graph optimization

Experiment setup: Fine-tuning an English Hugging Face BERT model.

  • num-layers 12

  • hidden-size 768

  • num-attention-heads 12

  • num-params 110,106,428

  • local-rank 4

  • seq-length 512

  • micro-batch-size 16

  • global-batch-size 64

Compared with single-precision training, ONNX Runtime graph optimization alone improves throughput by about 15.7% (from 479.15 to 554.24 samples/s).

| Solution                                | Throughput (samples/s) | Peak memory (MB) |
|-----------------------------------------|------------------------|------------------|
| Single-precision training               | 479.15 +/- 1.67        | 2,112            |
| Mixed-precision training                | 589.66 +/- 4.79        | 2,127            |
| ONNX Runtime graph optimization         | 554.24 +/- 1.98        | 2,430            |
| ONNX Runtime + Mixed-precision training | 614.70 +/- 8.69        | 2,289            |
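The relative throughput gains in this table can be recomputed from the mean values. The sketch below does so with single-precision training as the baseline. It uses only the means and ignores the reported spreads, and the variable names are hypothetical.

```python
# Mean throughput (samples/s) from the table above.
throughput = {
    "Single-precision training": 479.15,
    "Mixed-precision training": 589.66,
    "ONNX Runtime graph optimization": 554.24,
    "ONNX Runtime + Mixed-precision training": 614.70,
}

baseline = throughput["Single-precision training"]

# Percentage improvement of each solution over the baseline.
gains = {name: (value / baseline - 1) * 100 for name, value in throughput.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```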