This document shows how to use Pai-Megatron-Patch to optimize PyTorch Transformer model training.
Background information
We conducted all experiments on an Alibaba Cloud ECS instance with the following specifications: instance type ecs.gn6e-c12g1.12xlarge with a 48-core CPU, 368 GiB of memory, and four NVIDIA V100 GPUs. The environment used the Ubuntu 18.04 64-bit operating system with the image ID ubuntu_18_04_x64_20G_alibase_20211227.vhd and a peak bandwidth of 100 Mbps. We ran the nvidia-smi command to confirm that driver version 440.64.00 and CUDA 10.2 were ready, and that all four Tesla V100-SXM2 GPUs were idle.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   32C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   30C    P0    39W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   31C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Mixed-precision training
Experiment setup: Pre-training an English Hugging Face BERT model.
- num-layers: 12
- hidden-size: 768
- num-attention-heads: 12
- num-params: 110,106,428
- local-rank: 4
- seq-length: 512
- micro-batch-size: 16
- global-batch-size: 64
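The batch settings above fit the usual Megatron-style relation: global batch size = micro batch size x data-parallel size x gradient-accumulation steps. The sketch below assumes that relation, and assumes that each of the four local ranks acts as one data-parallel replica. The function name is illustrative and is not part of Pai-Megatron-Patch.

```python
def accumulation_steps(global_batch: int, micro_batch: int, data_parallel: int) -> int:
    """Number of micro-batches each replica processes per optimizer step."""
    samples_per_pass = micro_batch * data_parallel
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must be a multiple of micro_batch * data_parallel")
    return global_batch // samples_per_pass


# Setup above: 64 = 16 * 4 * 1, so no gradient accumulation is needed
# (assuming 4 data-parallel replicas, one per local rank).
steps = accumulation_steps(global_batch=64, micro_batch=16, data_parallel=4)
print(steps)
```

If the replicas were reduced, for example by using some GPUs for model parallelism, the same global batch size would require more accumulation steps per update.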
| Solution | Throughput (samples/s) | Peak memory (MB) |
| --- | --- | --- |
| Single-precision training | 103.07 +/- 1.03 | 17,025 |
| Mixed-precision training | 178.15 +/- 2.10 | 12,698 |
GPU memory optimization: Model state partitioning
Experiment setup: Pre-training an English Megatron GPT model.
- num-layers: 24
- hidden-size: 2048
- num-attention-heads: 32
- num-params: 1,313,722,368 (1.3 billion)
- local-rank: 4
- seq-length: 1024
- micro-batch-size: 1
- global-batch-size: 4
The Adam optimizer state alone requires about 16 GB of GPU memory for this model. Without optimizations, the model does not fit on a 32 GB GPU, and training fails with the following out-of-memory (OOM) error:
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/language_model.py", line 351, in forward
    encoder_output = self.encoder(encoder_input,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/transformer.py", line 703, in forward
    hidden_states = layer(hidden_states,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/transformer.py", line 441, in forward
    self.self_attention(layernorm_output,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/megatron/model/transformer.py", line 264, in forward
    matmul_result = torch.baddbmm(
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 1; 31.75 GiB total capacity; 28.56 GiB already allocated; 84.00 MiB free; 30.19 GiB reserved in total by PyTorch)
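One accounting that lands close to the 16 GB figure for the Adam optimizer state is the common mixed-precision layout: an FP32 master copy of the weights plus FP32 momentum and variance, which is 12 bytes per parameter. This is an assumption about how the number was derived, since the source does not give the breakdown. The sketch below only does the arithmetic.

```python
NUM_PARAMS = 1_313_722_368  # num-params from the experiment setup

# Assumed layout (not stated in the source):
# 4 bytes FP32 master weights + 4 bytes Adam momentum + 4 bytes Adam variance.
BYTES_PER_PARAM = 4 + 4 + 4


def optimizer_state_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Optimizer state size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


print(f"{optimizer_state_gb(NUM_PARAMS):.2f} GB")
```

Under these assumptions the optimizer state comes to a little under 16 GB before any weights, gradients, or activations are counted, which fits an OOM on a 32 GB device.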
| Solution | Throughput (samples/s) | Peak memory (MB) |
| --- | --- | --- |
| No optimization | OOM | OOM |
| Mixed-precision training | 9.57 +/- 0.26 | 25,061 |
| Mixed-precision training + Optimizer State Sharding (OSS) | 6.02 +/- 0.06 | 22,077 |
| Mixed-precision training + OSS / Sharded Data Parallel (SDP) | 7.01 +/- 0.07 | 17,113 |
| Mixed-precision training + Fully Sharded Data Parallel (FSDP) | NA | NA |
| Mixed-precision training + ZeRO Stage 1 | 12.88 +/- 0.10 | 15,709 |
| Mixed-precision training + ZeRO Stage 2 | 10.27 +/- 0.08 | 15,693 |
| Mixed-precision training + ZeRO Stage 3 | NA | NA |
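For orientation, the standard ZeRO accounting gives a rough lower bound on per-GPU model-state memory at each stage. It assumes 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for FP32 optimizer state, with stage 1 partitioning the optimizer state, stage 2 also the gradients, and stage 3 also the weights. The sketch below applies that accounting to this model on four GPUs. It covers model states only, so the measured peaks in the table, which also include activations and temporary buffers, are expected to be higher.

```python
def zero_model_state_gb(num_params: int, num_gpus: int, stage: int) -> float:
    """Per-GPU model-state memory in GB under the ZeRO accounting.

    Bytes per parameter: 2 (FP16 weights) + 2 (FP16 gradients) + 12 (optimizer state).
    stage 0 means plain data parallelism with nothing partitioned.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus    # stage 1: partition optimizer state
    if stage >= 2:
        grads /= num_gpus    # stage 2: also partition gradients
    if stage >= 3:
        weights /= num_gpus  # stage 3: also partition weights
    return num_params * (weights + grads + optim) / 1e9


for stage in range(4):
    print(stage, round(zero_model_state_gb(1_313_722_368, 4, stage), 2))
```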
3D hybrid parallelism
Experiment setup: Pre-training an English Megatron GPT model.
- num-layers: 24
- hidden-size: 2048
- num-attention-heads: 32
- num-params: 1,313,722,368 (1.3 billion)
- local-rank: 4
- seq-length: 1024
- micro-batch-size: 1
- global-batch-size: 4
With mixed-precision training enabled:
| Tensor parallelism | Pipeline parallelism | Throughput (samples/s) | Peak memory (MB) |
| --- | --- | --- | --- |
| 1 | 1 | 9.63 +/- 0.29 | 25,061 |
| 2 | 1 | 7.59 +/- 0.14 | 11,300 |
| 4 | 1 | 6.16 +/- 0.06 | 5,673 |
| 1 | 2 | 8.46 +/- 0.17 | 12,375 |
| 1 | 4 | 8.03 +/- 0.12 | 8,141 |
| 2 | 2 | 7.37 +/- 0.11 | 6,211 |
| 4 | 4 | 6.24 +/- 0.08 | 5,673 |
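In Megatron-style 3D parallelism, the ranks of a job are factored as world size = tensor-parallel size x pipeline-parallel size x data-parallel size. The sketch below derives the data-parallel size from the other two. It assumes the job consists only of the four local ranks listed in the setup, and the helper name is illustrative rather than a Pai-Megatron-Patch API.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Data-parallel replicas left after tensor and pipeline parallelism take their ranks."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"tensor x pipeline = {model_parallel}"
        )
    return world_size // model_parallel


# Assuming a 4-rank job:
print(data_parallel_size(4, 1, 1))  # pure data parallelism
print(data_parallel_size(4, 2, 1))  # 2-way tensor parallelism
print(data_parallel_size(4, 2, 2))  # tensor and pipeline parallelism combined
```

A smaller data-parallel size means each replica handles more of the global batch per step, which is one reason throughput can fall even as peak memory per GPU shrinks.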
ONNX Runtime graph optimization
Experiment setup: Fine-tuning an English Hugging Face BERT model.
- num-layers: 12
- hidden-size: 768
- num-attention-heads: 12
- num-params: 110,106,428
- local-rank: 4
- seq-length: 512
- micro-batch-size: 16
- global-batch-size: 64
Compared with single-precision training, ONNX Runtime graph optimization alone improves throughput by about 15.7% (from 479.15 to 554.24 samples/s).
| Solution | Throughput (samples/s) | Peak memory (MB) |
| --- | --- | --- |
| Single-precision training | 479.15 +/- 1.67 | 2,112 |
| Mixed-precision training | 589.66 +/- 4.79 | 2,127 |
| ONNX Runtime graph optimization | 554.24 +/- 1.98 | 2,430 |
| ONNX Runtime + Mixed-precision training | 614.70 +/- 8.69 | 2,289 |
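The relative gains can be recomputed from the mean throughputs in the table. The short sketch below does so against the single-precision baseline. It uses the reported means only and ignores the +/- spread, so small differences between configurations should be read with that in mind.

```python
BASELINE = 479.15  # single-precision throughput, samples/s

throughput = {
    "Mixed-precision training": 589.66,
    "ONNX Runtime graph optimization": 554.24,
    "ONNX Runtime + Mixed-precision training": 614.70,
}


def gain_percent(value: float, baseline: float = BASELINE) -> float:
    """Throughput improvement over the baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


for name, value in throughput.items():
    print(f"{name}: {gain_percent(value):+.1f}%")
```

For ONNX Runtime graph optimization alone this gives about 15.67%, and combining it with mixed-precision training gives the largest gain of the four configurations.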