Optimize PyTorch Transformer model training with Pai-Megatron-Patch

This document shows how to use Pai-Megatron-Patch to optimize PyTorch Transformer model training.

Background information

We conducted all experiments on an Alibaba Cloud ECS instance with the following specifications: instance type ecs.gn6e-c12g1.12xlarge with a 48-core CPU, 368 GiB of memory, and four NVIDIA V100 GPUs. The environment used the Ubuntu 18.04 64-bit operating system with the image ID ubuntu_18_04_x64_20G_alibase_20211227.vhd and a peak bandwidth of 100 Mbps. We ran the nvidia-smi command to confirm that driver version 440.64.00 and CUDA 10.2 were ready, and that all four Tesla V100-SXM2 GPUs were idle.

| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2  |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   32C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   30C    P0    39W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   31C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Mixed-precision training

Experiment setup: Pre-training an English Hugging Face BERT model.

num-layers 12
hidden-size 768
num-attention-heads 12
num-params 110,106,428
local-rank 4
seq-length 512
micro-batch-size 16
global-batch-size 64

Solution	Throughput (samples/s)	Peak memory (MB)
Single-precision training	103.07 +/- 1.03	17,025
Mixed-precision training	178.15 +/- 2.10	12,698
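The batch-size settings in the setup above follow the usual data-parallel relationship: the global batch equals the micro-batch times the number of data-parallel replicas times the number of gradient accumulation steps. The sketch below works through that arithmetic. It assumes that the four GPUs act as four data-parallel replicas, and the function name `accumulation_steps` is hypothetical, not part of Pai-Megatron-Patch.

```python
def accumulation_steps(global_batch: int, micro_batch: int, data_parallel: int) -> int:
    """Return the gradient accumulation steps implied by the batch settings."""
    samples_per_step = micro_batch * data_parallel
    if global_batch % samples_per_step != 0:
        raise ValueError(
            "global batch must be a multiple of micro_batch * data_parallel"
        )
    return global_batch // samples_per_step


# Values from the experiment setup; 4 replicas is an assumption (one per GPU).
steps = accumulation_steps(global_batch=64, micro_batch=16, data_parallel=4)
print(f"gradient accumulation steps: {steps}")
```

Under that assumption, each optimizer step consumes one micro-batch per GPU, with no gradient accumulation.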

GPU memory optimization: Model state partitioning

Experiment setup: Pre-training an English Megatron GPT model.

num-layers 24
hidden-size 2048
num-attention-heads 32
num-params 1,313,722,368 (1.3 billion)
local-rank 4
seq-length 1024
micro-batch-size 1
global-batch-size 4

Without optimizations, this model does not fit on a 32 GB GPU, and training fails with an out-of-memory (OOM) error, as the following traceback shows. For example, the Adam optimizer states alone require about 16 GB of GPU memory.

  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/language_model.py", line 351, in forward
    encoder_output = self.encoder(encoder_input,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/transformer.py", line 703, in forward
    hidden_states = layer(hidden_states,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/transformer.py", line 441, in forward
    self.self_attention(layernorm_output,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/transformer.py", line 264, in forward
    matmul_result = torch.baddbmm(
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 1; 31.75 GiB total capacity; 28.56 GiB already allocated; 84.00 MiB free; 30.19 GiB reserved in total by PyTorch)
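
The 16 GB figure for the optimizer state is consistent with a common accounting for mixed-precision Adam, which keeps an FP32 copy of the weights plus FP32 momentum and variance, for 12 bytes per parameter. The sketch below works through that arithmetic for the parameter count listed in the setup. The per-parameter byte counts are assumptions taken from that accounting, not values reported here, and activations and temporary buffers are ignored.

```python
NUM_PARAMS = 1_313_722_368  # num-params from the experiment setup

# Assumed bytes per parameter (mixed-precision Adam accounting, not measured).
BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADS = 2
BYTES_ADAM_STATE = 12  # FP32 master weights + momentum + variance, 4 bytes each


def gigabytes(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_bytes / 1e9


optimizer_gb = gigabytes(NUM_PARAMS * BYTES_ADAM_STATE)
model_state_gb = gigabytes(
    NUM_PARAMS * (BYTES_FP16_WEIGHTS + BYTES_FP16_GRADS + BYTES_ADAM_STATE)
)

print(f"optimizer state: {optimizer_gb:.2f} GB")    # 15.76 GB
print(f"total model state: {model_state_gb:.2f} GB")  # 21.02 GB
```

Under these assumptions, the model state alone takes roughly two thirds of a 32 GB GPU before any activations are stored.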

Solution	Throughput (samples/s)	Peak memory (MB)
No optimization	OOM	OOM
Mixed-precision training	9.57 +/- 0.26	25,061
Mixed-precision training + Optimizer State Sharding (OSS)	6.02 +/- 0.06	22,077
Mixed-precision training + OSS / Sharded Data Parallel (SDP)	7.01 +/- 0.07	17,113
Mixed-precision training + Fully Sharded Data Parallel (FSDP)	NA	NA
Mixed-precision training + ZeRO Stage 1	12.88 +/- 0.10	15,709
Mixed-precision training + ZeRO Stage 2	10.27 +/- 0.08	15,693
Mixed-precision training + ZeRO Stage 3	NA	NA
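
As a rough guide to what each ZeRO stage partitions, the sketch below estimates per-GPU model-state memory with the per-stage accounting from the ZeRO design: stage 1 shards the optimizer state, stage 2 also shards the gradients, and stage 3 also shards the parameters. It assumes 2 bytes each for FP16 weights and gradients, 12 bytes of Adam state per parameter, and four data-parallel GPUs. The function name `zero_model_state_gb` is hypothetical. The measured peak memory in the table also includes activations and framework buffers, so these estimates are not expected to match it.

```python
NUM_PARAMS = 1_313_722_368  # num-params from the experiment setup
NUM_GPUS = 4                # assumed data-parallel degree (one replica per GPU)


def zero_model_state_gb(num_params: int, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU model-state memory in GB for ZeRO stage 0 to 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2.0 * num_params    # FP16 parameters
    grads = 2.0 * num_params      # FP16 gradients
    optimizer = 12.0 * num_params # FP32 master weights, momentum, variance
    if stage >= 1:
        optimizer /= num_gpus     # stage 1: shard optimizer state
    if stage >= 2:
        grads /= num_gpus         # stage 2: also shard gradients
    if stage >= 3:
        weights /= num_gpus       # stage 3: also shard parameters
    return (weights + grads + optimizer) / 1e9


for stage in range(4):
    estimate = zero_model_state_gb(NUM_PARAMS, NUM_GPUS, stage)
    print(f"ZeRO stage {stage}: {estimate:.2f} GB per GPU")
```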

3D hybrid parallelism

Experiment setup: Pre-training an English Megatron GPT model.

num-layers 24
hidden-size 2048
num-attention-heads 32
num-params 1,313,722,368 (1.3 billion)
local-rank 4
seq-length 1024
micro-batch-size 1
global-batch-size 4

With mixed-precision training enabled:

Tensor parallelism	Pipeline parallelism	Throughput (samples/s)	Peak memory (MB)
1	1	9.63 +/- 0.29	25,061
2	1	7.59 +/- 0.14	11,300
4	1	6.16 +/- 0.06	5,673
1	2	8.46 +/- 0.17	12,375
1	4	8.03 +/- 0.12	8,141
2	2	7.37 +/- 0.11	6,211
4	4	6.24 +/- 0.08	5,673
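
In Megatron-style hybrid parallelism, the GPUs that remain after tensor and pipeline parallelism form the data-parallel dimension, so the data-parallel size is the world size divided by the product of the tensor and pipeline sizes. The sketch below computes that value and checks the usual divisibility constraints (attention heads by tensor size, layers by pipeline size). The function `data_parallel_size` and its checks are illustrative assumptions, not Pai-Megatron-Patch code.

```python
def data_parallel_size(
    world_size: int,
    tensor_parallel: int,
    pipeline_parallel: int,
    num_layers: int = 24,  # num-layers from the experiment setup
    num_heads: int = 32,   # num-attention-heads from the experiment setup
) -> int:
    """Return the data-parallel size left over after model parallelism."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tensor * pipeline size")
    if num_heads % tensor_parallel != 0:
        raise ValueError("attention heads must be divisible by tensor size")
    if num_layers % pipeline_parallel != 0:
        raise ValueError("layers must be divisible by pipeline size")
    return world_size // model_parallel


# Layouts that fit on four GPUs.
for tp, pp in [(1, 1), (2, 1), (4, 1), (1, 2), (1, 4), (2, 2)]:
    dp = data_parallel_size(world_size=4, tensor_parallel=tp, pipeline_parallel=pp)
    print(f"tensor={tp} pipeline={pp} -> data-parallel size {dp}")
```

A layout whose tensor and pipeline sizes multiply to more than the number of GPUs raises `ValueError` in this sketch.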

ONNX Runtime graph optimization

Experiment setup: Fine-tuning an English Hugging Face BERT model.

num-layers 12
hidden-size 768
num-attention-heads 12
num-params 110,106,428
local-rank 4
seq-length 512
micro-batch-size 16
global-batch-size 64

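Compared with single-precision training, ONNX Runtime graph optimization alone improves throughput by about 15.7% (from 479.15 to 554.24 samples/s). Combining it with mixed-precision training gives the highest throughput of the four configurations.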

Solution	Throughput (samples/s)	Peak memory (MB)
Single-precision training	479.15 +/- 1.67	2,112
Mixed-precision training	589.66 +/- 4.79	2,127
ONNX Runtime graph optimization	554.24 +/- 1.98	2,430
ONNX Runtime + Mixed-precision training	614.70 +/- 8.69	2,289
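
The relative gains in the table above can be recomputed from the mean throughput values. The short sketch below does so against the single-precision baseline. The helper name `gain_percent` is hypothetical, and the error margins are ignored.

```python
BASELINE = 479.15  # single-precision throughput in samples/s, from the table

# Mean throughput values in samples/s, copied from the table.
throughput = {
    "Mixed-precision training": 589.66,
    "ONNX Runtime graph optimization": 554.24,
    "ONNX Runtime + Mixed-precision training": 614.70,
}


def gain_percent(value: float, baseline: float = BASELINE) -> float:
    """Return the throughput gain over the baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


for name, value in throughput.items():
    print(f"{name}: {gain_percent(value):+.1f}%")
```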