Accelerate PyTorch Transformer model training using PAI-Rapidformer black box (CLI-based) or white box (code template) methods.
Prerequisites
- The Rapidformer runtime image is installed. For more information, see Install a PAI-Megatron-Patch runtime image.
- You are familiar with the Rapidformer training parameter settings. For more information, see Parameter settings guide.
- You are familiar with the Rapidformer API operations. For more information, see Rapidformer API.
Acceleration methods
Rapidformer supports two acceleration methods:
- Black box acceleration
  Rapidformer provides a CLI that accelerates model training through simple configuration, without writing any code. Register the data and model first.
- White box acceleration
  Rapidformer provides code templates for customization. Customize the data or model in a template and pass the script to the Rapidformer CLI using `--user-script`. Examples:
  - White box acceleration using the Data/Model Hub
  - Fully custom acceleration without the Data/Model Hub (more flexible), such as White box acceleration: Megatron model pre-training based on the Pretrainer code template
Black box: Hugging Face fine-tuning
- Register your dataset with Hugging Face, or use an existing dataset, and pass it to Rapidformer using `--dataset-name`. For more information, see Register a Hugging Face dataset and Query the list of existing Hugging Face datasets.
- Register your model with Hugging Face, or use an existing model, and pass it to Rapidformer using `--pretrained-model-name-or-path`. For more information, see Register a Hugging Face model and Query the list of existing Hugging Face models.
- Configure the Rapidformer CLI to start training.
  ```shell
  #!/bin/bash
  export CUDA_VISIBLE_DEVICES=4,5,6,7
  export MASTER_ADDR=localhost
  export MASTER_PORT=6010
  export NNODES=1
  export NODE_RANK=0

  rapidformer --task sequence_classification \                    # Task name
              --pretrained-model-name-or-path 'bert-base-cased' \ # Registered model name
              --data-path glue \                                  # Registered data path name
              --data-name mrpc \                                  # Registered data file name
              --epochs 3 \                                        # Number of training epochs
              --micro-batch-size 16 \                             # Batch size on each GPU
              --global-batch-size 64 \                            # Total batch size for distributed training
              --lr 2e-5 \                                         # Learning rate
              --lr-decay-style linear \                           # Learning rate decay policy
              --lr-warmup-iters 100 \                             # Number of learning rate warmup steps
              --weight-decay 1e-2 \                               # Weight decay coefficient
              --clip-grad 1.0 \                                   # Gradient clipping coefficient
              --seed 42 \                                         # Random seed
              --mixed-precision \                                 # Enable mixed-precision training
              --onnx-runtime-training \                           # Enable computational graph optimization
              --zero-1-memory-optimization                        # Enable optimizer state partitioning
  ```

  For more information about each parameter, see Parameter settings guide.
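The relationship between `--micro-batch-size` and `--global-batch-size` determines how many gradient-accumulation steps each rank performs per optimizer update. A minimal sketch of the arithmetic (the helper name is hypothetical, not part of the Rapidformer CLI):

```python
def accumulation_steps(global_batch, micro_batch, num_gpus):
    """Gradient-accumulation steps per optimizer update.

    global_batch must be divisible by micro_batch * num_gpus.
    """
    per_step = micro_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError("global batch not divisible by micro_batch * num_gpus")
    return global_batch // per_step

# With the configuration above (4 GPUs, micro 16, global 64):
print(accumulation_steps(64, 16, 4))   # -> 1, i.e. no accumulation needed
print(accumulation_steps(128, 16, 4))  # -> 2
```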
Black box: Hugging Face pre-training
- Create an mmap-type dataset for pre-training.
  See Megatron data processing script for details. Example command to create an mmap dataset:

  ```shell
  python preprocess_data.py \
         --input book_wiki_owtv2_small.json \
         --output-prefix gpt_small \
         --vocab gpt2-vocab.json \
         --dataset-impl mmap \
         --tokenizer-type GPT2BPETokenizer \
         --merge-file gpt2-merges.txt \
         --append-eod
  ```
- Register your model with Hugging Face, or use an existing model, and pass it to Rapidformer using `--pretrained-model-name-or-path`. For more information, see Register a Hugging Face model and Query the list of existing Hugging Face models.
- Configure the Rapidformer CLI to start training.

  ```shell
  #!/bin/bash
  export CUDA_VISIBLE_DEVICES=4,5,6,7
  export MASTER_ADDR=localhost
  export MASTER_PORT=6010
  export NNODES=1
  export NODE_RANK=0

  rapidformer --task pretraining \
              --pretrained-model-name-or-path 'bert-base-uncased' \
              --num-layers 12 \
              --hidden-size 768 \
              --num-attention-heads 12 \
              --micro-batch-size 16 \
              --global-batch-size 128 \        # Enable gradient accumulation
              --seq-length 512 \
              --tokenizer-type BertWordPieceLowerCase \
              --max-position-embeddings 512 \
              --train-iters 100 \
              --data-path book_wiki_owtv2_small_text_sentence \
              --vocab-file bert-en-uncased-vocab.txt \
              --data-impl mmap \
              --split 980,20 \
              --lr 1e-3 \
              --lr-decay-style linear \
              --min-lr 0.0 \
              --lr-decay-iters 2000 \
              --weight-decay 1e-2 \
              --clip-grad 1.0 \
              --lr-warmup-fraction .01 \
              --mixed-precision \              # Enable mixed-precision training
              --onnx-runtime-training \        # Enable computational graph optimization
              --fsdp-memory-optimization       # Enable model state partitioning
  ```

  For more information about each parameter, see Parameter settings guide.
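The `--input` file consumed by `preprocess_data.py` above is expected to be "loose JSON": one JSON object per line, each with a `text` field (the sample file name and content here are illustrative):

```python
import json

# Write a tiny illustrative corpus in the one-object-per-line ("loose JSON") layout.
samples = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Pre-training corpora are stored one document per line."},
]
with open("book_wiki_sample.json", "w") as f:
    for doc in samples:
        f.write(json.dumps(doc) + "\n")

# Each line parses independently, which is what the preprocessing script relies on.
with open("book_wiki_sample.json") as f:
    docs = [json.loads(line)["text"] for line in f]
print(len(docs))  # -> 2
```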
White box: Hugging Face fine-tuning with Finetuner template
The Finetuner code template from Rapidformer lets you quickly create Hugging Face fine-tuning jobs. It contains four functions:
- `train_valid_test_datasets_provider`: creates the datasets
- `model_optimizer_lr_scheduler_provider`: constructs the model, optimizer, and learning rate scheduler
- `run_forward_step`: defines the forward pass logic
- `run_compute_metrics`: calculates training and evaluation precision
See Rapidformer API for details on these functions. Brief introduction to inputs and outputs:
```python
class MyFintuner(Finetuner):

    def __init__(self, engine):
        super().__init__(engine=engine)

    # Create the training/validation/test datasets
    # Input: none
    # Output: three dataset objects and one collate function
    def train_valid_test_datasets_provider(self):
        return train_dataset, valid_dataset, test_dataset, collate_fn

    # Create the model/optimizer/learning rate scheduler
    # Input: none
    # Output: three objects
    def model_optimizer_lr_scheduler_provider(self):
        return model, optimizer, lr_scheduler

    # Write the forward logic
    # Input: batch or iterator, model
    # Output: loss
    def run_forward_step(self, batch_or_iterator, model):
        return loss

    # Write the validation-set evaluation logic (fine-tuning only)
    # Input: model, validation data loader
    # Output: metric object
    def run_compute_metrics(self, model, eval_dataloader):
        return metric
```
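`run_compute_metrics` aggregates predictions across the whole validation loader. Independent of the metric object used in the full example later, the accuracy it reports reduces to comparing gathered predictions against references; a framework-free sketch:

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction matches the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # -> 0.75
```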
After understanding the custom code template, prepare the dataset and model as shown in Black box: Hugging Face fine-tuning. Then perform these steps:
- Import the Rapidformer and Hugging Face interfaces.

  ```python
  import torch
  from transformers import AutoConfig, AutoTokenizer, BertForSequenceClassification
  from datasets import load_dataset, load_metric

  from rapidformer import RapidformerEngine
  from rapidformer import get_args
  from rapidformer import get_logger
  from rapidformer import get_timers
  from rapidformer import Finetuner
  from rapidformer import Pretrainer
  from rapidformer import build_train_valid_test_datasets_for_huggingface
  ```
- Complete the four functions in the code template.
  ```python
  class MyFintuner(Finetuner):

      def __init__(self, engine):
          super().__init__(engine=engine)

      def train_valid_test_datasets_provider(self):
          args = get_args()
          tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

          def tokenize_function(examples):
              # max_length=None => use the model max length (it's actually the default)
              outputs = tokenizer(examples["sentence1"], examples["sentence2"],
                                  truncation=True, max_length=None)
              return outputs

          datasets = load_dataset(args.dataset_path, args.dataset_name)
          # Apply the method we just defined to all the examples in all the splits of the dataset
          tokenized_datasets = datasets.map(
              tokenize_function,
              batched=True,
              remove_columns=["idx", "sentence1", "sentence2"],
          )
          tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
          train_dataset = tokenized_datasets["train"]
          valid_dataset = tokenized_datasets["validation"]
          test_dataset = tokenized_datasets["test"]

          def collate_fn(examples):
              return tokenizer.pad(examples, padding="longest", return_tensors="pt")

          return train_dataset, valid_dataset, test_dataset, collate_fn

      def model_optimizer_lr_scheduler_provider(self):
          args = get_args()
          model = BertForSequenceClassification.from_pretrained(args.load)
          return model, None, None

      def run_forward_step(self, batch, model):
          output_tensor = model(**batch)
          return output_tensor.loss

      # After each epoch, run the metric on the eval dataset
      def run_compute_metrics(self, model, eval_dataloader):
          args = get_args()
          model = model[0]
          metric = load_metric(args.dataset_path, args.dataset_name)
          for step, batch in enumerate(eval_dataloader):
              with torch.no_grad():
                  outputs = model(**batch)
              predictions = outputs.logits.argmax(dim=-1)
              metric.add_batch(
                  predictions=self.gather(predictions),
                  references=self.gather(batch["labels"]),
              )
          eval_metric = metric.compute()
          return eval_metric
  ```
- Initialize the Rapidformer engine, create a trainer object, call `trainer.train()`, and save the script as rapidformer_finetune_huggingface_bert_trainer.py.

  ```python
  engine = RapidformerEngine()
  trainer = MyFintuner(engine=engine)
  trainer.train()
  ```
Prepare a startup script based on the CLI. Set
--user-scripttorapidformer_finetune_hugging_face_bert_trainer.pyand set the acceleration switches.#!/bin/bash export CUDA_VISIBLE_DEVICES=4,5,6,7 export MASTER_ADDR=localhost export MASTER_PORT=6010 export NNODES=1 export NODE_RANK=0 rapidformer --user-script rapidformer_finetune_huggingface_bert_trainer.py --task sequence_classification \ --pretrained-model-name-or-path 'bert-base-cased' \ --data-path glue \ --data-name mrpc \ --epochs 3 \ --micro-batch-size 16 \ --global-batch-size 16 \ --lr 2e-5 \ --lr-decay-style linear \ --lr-warmup-iters 100 \ --weight-decay 1e-2 \ --clip-grad 1.0 \ --mixed-precision # Enable mixed-precision training --zero-3-memory-optimization \ # Enable model state partitioning --onnx-runtime-training \ # Enable computational graph optimization
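The `collate_fn` in the template above pads each batch to its longest member (`padding="longest"`) rather than to a fixed maximum length. A framework-free sketch of that dynamic padding, assuming pad-token id 0 (the token ids shown are illustrative):

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token-id lists to the longest sequence in the batch."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # attention_mask marks real tokens (1) vs padding (0).
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7592, 102], [101, 102]])
print(batch["input_ids"])       # -> [[101, 7592, 102], [101, 102, 0]]
print(batch["attention_mask"])  # -> [[1, 1, 1], [1, 1, 0]]
```

Padding to the batch maximum instead of a global maximum keeps short batches small, which is why dynamic padding is the common choice for fine-tuning.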
White box: Hugging Face pre-training with Pretrainer template
The Pretrainer code template from Rapidformer lets you quickly create Hugging Face pre-training jobs. It contains these functions:
- `train_valid_test_datasets_provider`: creates the datasets
- `model_optimizer_lr_scheduler_provider`: constructs the model, optimizer, and learning rate scheduler
- `run_forward_step`: defines the forward pass logic
See Rapidformer API for details on these functions. For inputs and outputs, see White box: Hugging Face fine-tuning with Finetuner template.
After understanding the custom code template, prepare the dataset and model as shown in Black box: Hugging Face fine-tuning. Then perform these steps:
- Import the Rapidformer and Hugging Face interfaces.
  Note: Pre-training uses an iterator to read data, so import mpu for data parallelism.

  ```python
  import torch
  from megatron import mpu
  from transformers import BertConfig, AutoModelForPreTraining
  from rapidformer import RapidformerEngine, get_args, PreTrainer
  from rapidformer import build_train_valid_test_datasets_for_huggingface
  ```
- Inherit PreTrainer and complete the pre-training code.

  ```python
  class MyBertPreTrainer(PreTrainer):

      def __init__(self, engine):
          super().__init__(engine=engine)

      def train_valid_test_datasets_provider(self, train_val_test_num_samples):
          args = get_args()
          train_ds, valid_ds, test_ds = build_train_valid_test_datasets_for_huggingface(
              data_prefix=args.data_path,
              data_impl=args.data_impl,
              splits_string=args.split,
              train_valid_test_num_samples=train_val_test_num_samples,
              max_seq_length=args.seq_length,
              masked_lm_prob=args.mask_prob,
              short_seq_prob=args.short_seq_prob,
              seed=args.seed,
              skip_warmup=(not args.mmap_warmup),
              binary_head=True)
          return train_ds, valid_ds, test_ds

      def model_optimizer_lr_scheduler_provider(self):
          args = get_args()
          model = AutoModelForPreTraining.from_pretrained(args.pretrained_model_name_or_path)
          return model, None, None

      def run_forward_step(self, data_iterator, model):
          # Items and their type.
          keys = ['input_ids', 'attention_mask', 'token_type_ids',
                  'labels', 'next_sentence_label']
          datatype = torch.int64

          # Broadcast data.
          if data_iterator is not None:
              data = next(data_iterator)
          else:
              data = None
          data_b = mpu.broadcast_data(keys, data, datatype)

          input_ids = data_b['input_ids'].long()
          attention_mask = data_b['attention_mask'].long()
          token_type_ids = data_b['token_type_ids'].long()
          labels = data_b['labels'].long()
          next_sentence_label = data_b['next_sentence_label'].long()

          output_tensor = model(input_ids=input_ids,
                                attention_mask=attention_mask,
                                token_type_ids=token_type_ids,
                                labels=labels,
                                next_sentence_label=next_sentence_label)
          return output_tensor['loss']
  ```
- Initialize the Rapidformer engine, create a trainer object, call `trainer.train()`, and save the script as rapidformer_pretrain_huggingface_bert_trainer.py.

  ```python
  engine = RapidformerEngine()
  trainer = MyBertPreTrainer(engine=engine)
  trainer.train()
  ```
- Prepare a startup script based on the CLI and set the acceleration switches.

  ```shell
  #!/bin/bash
  export CUDA_VISIBLE_DEVICES=4,5,6,7
  export MASTER_ADDR=localhost
  export MASTER_PORT=6010
  export NNODES=1
  export NODE_RANK=0

  DATA_PATH=book_wiki_owtv2_small_text_sentence

  rapidformer --user-script rapidformer_pretrain_huggingface_bert_trainer.py \
              --pretrained-model-name-or-path 'bert-base-uncased' \
              --num-layers 12 \
              --hidden-size 768 \
              --num-attention-heads 12 \
              --micro-batch-size 16 \
              --global-batch-size 64 \
              --seq-length 512 \
              --tokenizer-type BertWordPieceLowerCase \
              --max-position-embeddings 512 \
              --train-iters 100 \
              --data-path $DATA_PATH \
              --vocab-file bert-en-uncased-vocab.txt \
              --data-impl mmap \               # Enable data acceleration
              --split 980,20 \
              --lr 1e-3 \
              --lr-decay-style linear \
              --weight-decay 1e-2 \
              --clip-grad 1.0 \
              --lr-warmup-fraction .01 \
              --zero-3-memory-optimization \   # Enable model state partitioning
              --onnx-runtime-training \        # Enable computational graph optimization
              --mixed-precision                # Enable mixed-precision training
  ```
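`--split 980,20` weights the train and validation partitions of the preprocessed dataset. A simplified sketch of how such a split string can be resolved into sample counts (the actual Megatron implementation also handles a test share and boundary indices; this version assumes only the listed weights):

```python
def resolve_split(split_string, total_samples):
    """Turn a comma-separated weight string like '980,20' into sample counts."""
    weights = [float(w) for w in split_string.split(",")]
    norm = sum(weights)
    counts = [int(total_samples * w / norm) for w in weights]
    # Assign any rounding remainder to the first (train) partition.
    counts[0] += total_samples - sum(counts)
    return counts

print(resolve_split("980,20", 10000))  # -> [9800, 200]
```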
White box: Hugging Face fine-tuning with custom Trainer
For programs with custom Trainers, Rapidformer provides limited acceleration: the Apex optimizer, model state partitioning, and computational graph optimization. Mixed-precision training requires extensive modifications, so use the template-based methods described earlier for better results. This section shows how to apply intrusive acceleration to typical Hugging Face fine-tuning code.
Hugging Face fine-tuning code example:
```python
import torch
from datasets import load_dataset, load_metric
from torch.utils.data import DataLoader
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    BertForSequenceClassification,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")

def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
    return outputs

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["idx", "sentence1", "sentence2"],
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)
optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_iters,
    num_training_steps=args.train_iters
)

train_dataloader = DataLoader(tokenized_datasets["train"])
eval_dataloader = DataLoader(tokenized_datasets["train"])

device = torch.device("cuda", args.local_rank)

for epoch in range(args.epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    for step, batch in enumerate(eval_dataloader):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        metric.add_batch(
            predictions=predictions,
            references=batch["labels"])

    eval_metric = metric.compute()
    print("epoch {}: {}".format(epoch, eval_metric))
```
This code has issues: no data parallelism support, slow optimizer, and no mixed-precision training. The following steps modify this code using Rapidformer APIs.
- Add data parallelism support.
  Create a finetuner object, then call `finetuner.build_data_loader` to create a data loader. This loader supports data parallelism and automatically sends data to the GPU. Remove `batch.to(device)` from the original code.

  ```diff
  + from rapidformer import RapidformerEngine
  + engine = RapidformerEngine()
  + finetuner = Finetuner(engine=engine)

  - train_dataloader = DataLoader(tokenized_datasets["train"])
  - eval_dataloader = DataLoader(tokenized_datasets["train"])
  + train_dataloader = finetuner.build_data_loader(tokenized_datasets["train"])
  + eval_dataloader = finetuner.build_data_loader(tokenized_datasets["validation"])
  ```
- Use the Apex optimizer on top of data parallelism.
  Replace the original optimizer with the faster Apex FusedAdam. Remove the original optimizer, use FusedAdam from Rapidformer, and call `engine.compose` to encapsulate the model, optimizer, and learning rate scheduler.

  ```diff
  + from rapidformer import RapidformerEngine
  + engine = RapidformerEngine()
  + finetuner = Finetuner(engine=engine)

  - optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)
  - lr_scheduler = get_linear_schedule_with_warmup(
  -     optimizer=optimizer,
  -     num_warmup_steps=args.lr_warmup_iters,
  -     num_training_steps=args.train_iters
  - )

  + lr_scheduler = partial(
  +     get_linear_schedule_with_warmup,
  +     num_warmup_steps=args.lr_warmup_iters,
  +     num_training_steps=args.train_iters
  + )
  + model, optimizer, lr_scheduler = engine.compose(model_obj=model, lr_scheduler_fn=lr_scheduler)
  ```

  Note: Combining the Apex optimizer and mixed precision with data parallelism is complex. Mixed-precision training involves switching the model to fp16 and loss scaling, and modifying frontend programs that have no trainer is error-prone. Use a Trainer-based solution instead: Rapidformer's Finetuner integrates data parallelism, Apex, PyTorch mixed-precision training, Megatron optimizer mixed-precision training, and VRAM optimization from FairScale and DeepSpeed.
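The `partial(...)` passed to `engine.compose` is a scheduler factory rather than a scheduler instance: the engine must build (and possibly wrap) the optimizer before the scheduler can be bound to it. A small standalone illustration of the pattern, with placeholder names standing in for the real scheduler and optimizer:

```python
from functools import partial

def make_scheduler(optimizer, num_warmup_steps, num_training_steps):
    # Stand-in for get_linear_schedule_with_warmup: just records its arguments.
    return {"optimizer": optimizer, "warmup": num_warmup_steps, "total": num_training_steps}

# Bind everything except the optimizer up front.
scheduler_fn = partial(make_scheduler, num_warmup_steps=100, num_training_steps=1000)

# Later, once the engine has created the (possibly wrapped) optimizer:
optimizer = "fused_adam_instance"  # placeholder object
lr_scheduler = scheduler_fn(optimizer)
print(lr_scheduler["warmup"])  # -> 100
```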
White box: Megatron pre-training with Pretrainer template
Building on the White box: Hugging Face fine-tuning with custom Trainer method, you can bypass the Data and Model Hubs entirely for more flexibility: write custom data-creation logic in train_valid_test_datasets_provider, a custom model in model_optimizer_lr_scheduler_provider, and custom forward logic in run_forward_step.
- Create an mmap-type dataset for pre-training.
  See Megatron data processing script for details. Example command to create an mmap dataset:

  ```shell
  python preprocess_data.py \
         --input /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/book_wiki_owtv2_small.json \
         --output-prefix /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/gpt_small \
         --vocab gpt2-vocab.json \
         --dataset-impl mmap \
         --tokenizer-type GPT2BPETokenizer \
         --merge-file gpt2-merges.txt \
         --append-eod
  ```
- Inherit PreTrainer and complete the data custom function `train_valid_test_datasets_provider` in the pre-training code.
  Write custom logic that creates the train, validation, and test datasets without relying on third-party libraries. Each dataset should inherit from `torch.utils.data.Dataset`.

  ```python
  from rapidformer import RapidformerEngine, get_args, PreTrainer

  class MyGPTPreTrainer(PreTrainer):

      def __init__(self, engine):
          super().__init__(engine=engine)

      def train_valid_test_datasets_provider(self, train_val_test_num_samples):
          args = get_args()
          # build_train_valid_test_datasets is your own dataset-construction function.
          train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
              data_prefix=args.data_path,
              data_impl=args.data_impl,
              splits_string=args.split,
              train_valid_test_num_samples=train_val_test_num_samples,
              seq_length=args.seq_length,
              seed=args.seed,
              skip_warmup=(not args.mmap_warmup))
          return train_ds, valid_ds, test_ds
  ```
- Inherit PreTrainer and complete the model custom function `model_optimizer_lr_scheduler_provider` in the pre-training code.
  Write custom logic that creates the model object without relying on third-party libraries. The model should inherit from `torch.nn.Module`.

  ```python
  from rapidformer import RapidformerEngine, get_args, PreTrainer
  from yourmodel import GPTModel

  class MyGPTPreTrainer(PreTrainer):

      def __init__(self, engine):
          super().__init__(engine=engine)

      def model_optimizer_lr_scheduler_provider(self):
          model = GPTModel()
          return model, None, None
  ```
- Inherit PreTrainer and complete the forward custom function `run_forward_step` in the pre-training code.

  ```python
  import torch

  from megatron import get_tokenizer, mpu
  from megatron.utils import get_ltor_masks_and_position_ids
  from rapidformer import RapidformerEngine, get_args, PreTrainer

  class MyGPTPreTrainer(PreTrainer):

      def __init__(self, engine):
          super().__init__(engine=engine)

      def run_forward_step(self, data_iterator, model):
          """Forward step."""
          args = get_args()
          tokenizer = get_tokenizer()

          # Items and their type.
          keys = ['text']
          datatype = torch.int64

          # Broadcast data.
          if data_iterator is not None:
              data = next(data_iterator)
          else:
              data = None
          data_b = mpu.broadcast_data(keys, data, datatype)

          # Unpack.
          tokens_ = data_b['text'].long()
          labels = tokens_[:, 1:].contiguous()
          tokens = tokens_[:, :-1].contiguous()

          # Get the masks and position ids.
          attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
              tokens,
              tokenizer.eod,
              args.reset_position_ids,
              args.reset_attention_mask,
              args.eod_mask_loss)

          output_tensor = model(tokens, position_ids, attention_mask, labels=labels)
          losses = output_tensor.float()
          loss_mask = loss_mask.view(-1).float()
          loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
          return loss
  ```
- Initialize the Rapidformer engine, create a trainer object, call `trainer.train()`, and save the script as rapidformer_pretrain_megatron_gpt_trainer.py.

  ```python
  engine = RapidformerEngine()
  trainer = MyGPTPreTrainer(engine=engine)
  trainer.train()
  ```
- Prepare a startup script and set the acceleration switches.

  ```shell
  #!/bin/bash
  export CUDA_VISIBLE_DEVICES=4,5,6,7
  export MASTER_ADDR=localhost
  export MASTER_PORT=6010
  export NNODES=1
  export NODE_RANK=0

  DATA_PATH=book_wiki_owtv2_small_text_sentence
  PRETRAINED_CHECKPOINT=

  rapidformer --user-script rapidformer_pretrain_megatron_gpt_trainer.py \
              --tensor-model-parallel-size 2 \   # Enable operator splitting optimization
              --pipeline-model-parallel-size 2 \ # Enable pipeline parallelism optimization
              --num-layers 12 \
              --hidden-size 768 \
              --num-attention-heads 12 \
              --micro-batch-size 16 \
              --global-batch-size 128 \          # Enable gradient accumulation optimization
              --seq-length 512 \
              --tokenizer-type GPT2BPETokenizer \
              --max-position-embeddings 512 \
              --train-iters 100 \
              --data-path $DATA_PATH \
              --vocab-file gpt2-vocab.json \
              --merge-file gpt2-merges.txt \
              --data-impl mmap \                 # Enable data acceleration
              --split 980,20 \
              --lr 1e-3 \
              --lr-decay-style linear \
              --weight-decay 1e-2 \
              --clip-grad 1.0 \
              --lr-warmup-fraction .01 \
              --log-interval 1 \
              --zero-2-memory-optimization \     # Enable model state partitioning
              --checkpoint-activations \         # Enable gradient checkpointing
              --mixed-precision                  # Enable mixed-precision training
  ```
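The token-level loss reduction in `run_forward_step` above (`torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()`) averages the loss only over non-masked positions, so padding and end-of-document tokens do not dilute it. A framework-free sketch of the same reduction, with illustrative numbers:

```python
def masked_mean(losses, loss_mask):
    """Average per-token losses over positions where loss_mask is 1."""
    masked_total = sum(l * m for l, m in zip(losses, loss_mask))
    n = sum(loss_mask)
    if n == 0:
        raise ValueError("loss mask selects no tokens")
    return masked_total / n

# Masked-out positions (mask 0) contribute nothing: (2.0 + 4.0) / 2 tokens.
print(masked_mean([2.0, 4.0, 6.0, 8.0], [1, 1, 0, 0]))  # -> 3.0
```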