
Platform For AI: Use case: Accelerate Transformer model training

Last Updated: Mar 10, 2026

This topic describes how to accelerate PyTorch Transformer model training with PAI-Rapidformer, using either the black-box (CLI-based) or the white-box (code template) method.

Prerequisites

Acceleration methods

Rapidformer supports two acceleration methods:

Black box: Hugging Face fine-tuning

  1. Register your dataset with Hugging Face, or use an existing dataset. Pass it to Rapidformer using the --data-path and --data-name parameters.

  2. Register your model with Hugging Face, or use an existing model. Pass it to Rapidformer using --pretrained-model-name-or-path.

  3. Configure the Rapidformer CLI to start training.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    rapidformer --task sequence_classification \ # Task name
                --pretrained-model-name-or-path 'bert-base-cased' \  # Registered model name
                --data-path glue \                      # Registered data path name
                --data-name mrpc \                      # Registered data file name
                --epochs 3 \                               # Number of training epochs
                --micro-batch-size 16 \                    # Batch size on each GPU
                --global-batch-size 64 \                   # Total batch size for distributed training
                --lr 2e-5 \                                # Learning rate
                --lr-decay-style linear \                  # Learning rate decay policy
                --lr-warmup-iters 100 \                    # Number of learning rate warmup steps
                --weight-decay 1e-2 \                      # Weight decay coefficient
                --clip-grad 1.0 \                          # Gradient clip coefficient
                --seed 42 \                                # Random seed
                --mixed-precision \                        # Enable mixed-precision training
                --onnx-runtime-training \                  # Enable computational graph optimization
                --zero-1-memory-optimization               # Enable optimizer state partitioning

    For more information about each parameter, see Parameter settings guide.
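
    The batch settings interact as follows: with 4 visible GPUs and no model parallelism, the data-parallel size is 4, so --global-batch-size 64 with --micro-batch-size 16 means each optimizer step consumes exactly one micro-batch per GPU. A back-of-the-envelope check of this arithmetic (illustrative only, not Rapidformer code):

    num_gpus = 4              # CUDA_VISIBLE_DEVICES=4,5,6,7
    micro_batch_size = 16     # --micro-batch-size
    global_batch_size = 64    # --global-batch-size
    # Gradient-accumulation steps implied by the configuration:
    accum_steps = global_batch_size // (micro_batch_size * num_gpus)  # = 1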

Black box: Hugging Face pre-training

  1. Create an mmap-type dataset for pre-training.

    See Megatron data processing script for details. Example command to create an mmap dataset:

    python preprocess_data.py \
      --input book_wiki_owtv2_small.json  \
      --output-prefix gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
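
    The input file is expected to be in loose JSON format, one document per line; by default preprocess_data.py reads the text field of each line (configurable with --json-keys). A minimal book_wiki_owtv2_small.json could look like this (contents are hypothetical):

    {"text": "First document. Plain text that can span many sentences."}
    {"text": "Second document ..."}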
  2. Register your model with Hugging Face, or use an existing model. Pass it to Rapidformer using --pretrained-model-name-or-path.

  3. Configure the Rapidformer CLI to start training.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    rapidformer --task pretraining \
           --pretrained-model-name-or-path 'bert-base-uncased' \
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 128 \               # Enable gradient accumulation
           --seq-length 512 \
           --tokenizer-type BertWordPieceLowerCase \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path book_wiki_owtv2_small_text_sentence \
           --vocab-file bert-en-uncased-vocab.txt  \
           --data-impl mmap \
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --min-lr 0.0 \
           --lr-decay-iters 2000 \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --mixed-precision \                    # Enable mixed-precision training
           --onnx-runtime-training \              # Enable computational graph optimization
           --fsdp-memory-optimization             # Enable model state partitioning

    For more information about each parameter, see Parameter settings guide.

White box: Hugging Face fine-tuning with Finetuner template

The Rapidformer Finetuner code template lets you quickly create Hugging Face fine-tuning jobs. It contains four functions:

  • train_valid_test_datasets_provider to create datasets

  • model_optimizer_lr_scheduler_provider to construct the model, optimizer, and learning rate scheduler

  • run_forward_step to define forward operation logic

  • run_compute_metrics to compute evaluation metrics

See Rapidformer API for details on these functions. The following skeleton summarizes their inputs and outputs:

class MyFinetuner(Finetuner):

    def __init__(self, engine):
        super().__init__(engine=engine)

    # Get the training/validation/test datasets
    # Input: None
    # Output: Three dataset objects and one collate function
    def train_valid_test_datasets_provider(self):

        return train_dataset, valid_dataset, test_dataset, collate_fn

    # Create the model/optimizer/learning rate scheduler
    # Input: None
    # Output: Three objects
    def model_optimizer_lr_scheduler_provider(self):

        return model, optimizer, lr_scheduler

    # Write the forward logic
    # Input: batch or iterator, model
    # Output: loss
    def run_forward_step(self, batch_or_iterator, model):
        return loss

    # Write the validation set evaluation logic, dedicated for fine-tuning
    # Input: model, validation set data loader
    # Output: metric object
    def run_compute_metrics(self, model, eval_dataloader):
        return metric

After understanding the custom code template, prepare the dataset and model as shown in Black box: Hugging Face fine-tuning. Then perform these steps:

  1. Import the Rapidformer and Hugging Face interfaces.

    import torch
    from transformers import AutoConfig, AutoTokenizer, BertForSequenceClassification
    from datasets import load_dataset, load_metric
    from rapidformer import RapidformerEngine
    from rapidformer import get_args
    from rapidformer import get_logger
    from rapidformer import get_timers
    from rapidformer import Finetuner
    from rapidformer import Pretrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface
  2. Complete the four functions in the code template.

    class MyFinetuner(Finetuner):
        def __init__(self, engine):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self):
            args = get_args()
            tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    
            def tokenize_function(examples):
                # max_length=None => use the model max length (it's actually the default)
                outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
                return outputs
    
            datasets = load_dataset(args.dataset_path, args.dataset_name)
            # Apply the method we just defined to all the examples in all the splits of the dataset
            tokenized_datasets = datasets.map(
                tokenize_function,
                batched=True,
                remove_columns=["idx", "sentence1", "sentence2"],
            )
            tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    
            train_dataset = tokenized_datasets["train"]
            valid_dataset = tokenized_datasets['validation']
            test_dataset = tokenized_datasets['test']
    
            def collate_fn(examples):
                return tokenizer.pad(examples, padding="longest", return_tensors="pt")
    
            return train_dataset, valid_dataset, test_dataset, collate_fn
    
        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = BertForSequenceClassification.from_pretrained(args.load)
            return model, None, None
    
        def run_forward_step(self, batch, model):
            output_tensor = model(**batch)
            return output_tensor.loss
    
        # after each epoch run metric on eval dataset
        def run_compute_metrics(self, model, eval_dataloader):
            args = get_args()
            model = model[0]
            metric = load_metric(args.dataset_path, args.dataset_name)
            for step, batch in enumerate(eval_dataloader):
                with torch.no_grad():
                    outputs = model(**batch)
                predictions = outputs.logits.argmax(dim=-1)
    
                metric.add_batch(
                    predictions=self.gather(predictions),
                    references=self.gather(batch["labels"]),
                )
    
            eval_metric = metric.compute()
            return eval_metric
  3. Initialize the Rapidformer engine, create a trainer object, call train(), and save the file as rapidformer_finetune_huggingface_bert_trainer.py.

    engine = RapidformerEngine()
    trainer = MyFintuner(engine=engine)
    trainer.train()
  4. Prepare a startup script based on the CLI. Set --user-script to rapidformer_finetune_hugging_face_bert_trainer.py and set the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    rapidformer --user-script rapidformer_finetune_huggingface_bert_trainer.py \
                --task sequence_classification \
                --pretrained-model-name-or-path 'bert-base-cased' \
                --data-path glue \
                --data-name mrpc \
                --epochs 3 \
                --micro-batch-size 16 \
                --global-batch-size 16 \
                --lr 2e-5 \
                --lr-decay-style linear \
                --lr-warmup-iters 100 \
                --weight-decay 1e-2 \
                --clip-grad 1.0 \
                --mixed-precision \                               # Enable mixed-precision training
                --zero-3-memory-optimization \                    # Enable model state partitioning
                --onnx-runtime-training                           # Enable computational graph optimization

White box: Hugging Face pre-training with Pretrainer template

The Rapidformer Pretrainer code template lets you quickly create Hugging Face pre-training jobs. It contains these functions:

  • train_valid_test_datasets_provider to create datasets

  • model_optimizer_lr_scheduler_provider to construct the model, optimizer, and learning rate scheduler

  • run_forward_step to define forward pass logic

See Rapidformer API for details on these functions. For inputs and outputs, see White box: Hugging Face fine-tuning with Finetuner template.

After understanding the custom code template, prepare the dataset and model as shown in Black box: Hugging Face fine-tuning. Then perform these steps:

  1. Import the Rapidformer and Hugging Face interfaces.

    Note

    Pre-training uses an iterator to read data, so import mpu for data parallelism.

    import torch
    from megatron import mpu
    from transformers import AutoModelForPreTraining
    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface
  2. Inherit from PreTrainer and complete the pre-training code.

    class MyBertPreTrainer(PreTrainer):
    
        def __init__(self,engine):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets_for_huggingface(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                max_seq_length=args.seq_length,
                masked_lm_prob=args.mask_prob,
                short_seq_prob=args.short_seq_prob,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup),
                binary_head=True)
    
            return train_ds, valid_ds, test_ds
    
        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = AutoModelForPreTraining.from_pretrained(args.pretrained_model_name_or_path)
            return model, None, None
    
        def run_forward_step(self, data_iterator, model):
            # Items and their type.
            keys = ['input_ids', 'attention_mask', 'token_type_ids', 'labels', 'next_sentence_label']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
            input_ids = data_b['input_ids'].long()
            attention_mask = data_b['attention_mask'].long()
            token_type_ids = data_b['token_type_ids'].long()
            labels = data_b['labels'].long()
            next_sentence_label = data_b['next_sentence_label'].long()
            output_tensor = model(input_ids=input_ids, attention_mask=attention_mask,
                                  token_type_ids=token_type_ids, labels=labels, next_sentence_label=next_sentence_label)
    
            return output_tensor['loss']
  3. Initialize the Rapidformer engine, create a trainer object, call train(), and save the file as rapidformer_pretrain_huggingface_bert_trainer.py.

    engine = RapidformerEngine()
    trainer = MyBertPreTrainer(engine=engine)
    trainer.train()
  4. Prepare a startup script based on the CLI and set the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    DATA_PATH=book_wiki_owtv2_small_text_sentence
    
    rapidformer --user-script rapidformer_pretrain_huggingface_bert_trainer.py \
           --pretrained-model-name-or-path 'bert-base-uncased' \
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 64 \
           --seq-length 512 \
           --tokenizer-type BertWordPieceLowerCase \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path $DATA_PATH \
           --vocab-file bert-en-uncased-vocab.txt  \
           --data-impl mmap \                               # Enable data acceleration
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --zero-3-memory-optimization \                    # Enable model state partitioning
           --onnx-runtime-training \                         # Enable computational graph optimization
           --mixed-precision                                 # Enable mixed-precision training

White box: Hugging Face fine-tuning with custom Trainer

For programs with custom Trainers, Rapidformer provides limited acceleration: the Apex optimizer, model state partitioning, and computational graph optimization. Enabling mixed-precision training requires extensive modifications, so for best results use the template-based methods described earlier. This section shows how to apply intrusive acceleration to typical Hugging Face fine-tuning code.

Hugging Face fine-tuning code example:

import torch
from datasets import load_dataset, load_metric
from torch.utils.data import DataLoader
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    BertForSequenceClassification,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")

def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
    return outputs

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["idx", "sentence1", "sentence2"],
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_iters,
    num_training_steps=args.train_iters
)

device = torch.device("cuda", args.local_rank)

for epoch in range(args.epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    for step, batch in enumerate(eval_dataloader):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            metric.add_batch(
                    predictions=engine.gather(predictions),
                    references=engine.gather(batch["labels"]))

    eval_metric = metric.compute()
    print("epoch {}: {}".format(epoch, eval_metric))

This code has issues: no data parallelism support, slow optimizer, and no mixed-precision training. The following steps modify this code using Rapidformer APIs.

  1. Add data parallelism support.

    Create a finetuner object, then call finetuner.build_data_loader to create a data loader. This loader supports data parallelism and automatically moves each batch to the GPU, so remove batch.to(device) from the original code. A conceptual sketch of such a loader follows the diff below.

    + from rapidformer import RapidformerEngine
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)
    
    - train_dataloader = DataLoader(tokenized_datasets["train"])
    - eval_dataloader = DataLoader(tokenized_datasets["train"])
    
    + train_dataloader = finetuner.build_data_loader(tokenized_datasets["train"])
    + eval_dataloader = finetuner.build_data_loader(tokenized_datasets["validation"])
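
    For intuition only (this is not the Rapidformer implementation), a data-parallel loader of this kind is typically a regular DataLoader wrapped around a torch DistributedSampler, with device placement handled in the training loop:

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def build_data_loader_sketch(dataset, batch_size, collate_fn=None):
        # Assumes torch.distributed has been initialized; each rank then
        # iterates over a disjoint shard of the dataset.
        sampler = DistributedSampler(dataset)
        return DataLoader(dataset, batch_size=batch_size,
                          sampler=sampler, collate_fn=collate_fn)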
  2. Use the Apex optimizer on top of data parallelism.

    Replace the native optimizer with the faster Apex FusedAdam supplied by Rapidformer: remove the original optimizer and learning rate scheduler construction, wrap the scheduler factory in functools.partial, and call engine.compose to wire up the model, optimizer, and learning rate scheduler.

    + from rapidformer import RapidformerEngine
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)
    
    - optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)
    - lr_scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
        num_warmup_steps=args.lr_warmup_iters,
        num_training_steps=args.train_iters
    )
    
    
    + from functools import partial
    + lr_scheduler = partial(
            get_linear_schedule_with_warmup,
            num_warmup_steps=args.lr_warmup_iters,
            num_training_steps=args.train_iters
        )
    
    + model, optimizer, lr_scheduler = engine.compose(model_obj=model,
          lr_scheduler_fn=lr_scheduler)
    Note

    Combining the Apex optimizer and mixed precision with data parallelism is complex: mixed-precision training involves switching the model to fp16 and scaling the loss, and making such modifications to a frontend program that has no trainer is error-prone. Prefer a Trainer-based solution: Rapidformer's Finetuner integrates data parallelism, Apex, PyTorch mixed-precision training, Megatron mixed-precision optimizers, and the VRAM optimizations from FairScale and DeepSpeed.

White box: Megatron pre-training with Pretrainer template

After understanding the preceding white-box methods, you can bypass the Data Hub and Model Hub entirely for maximum flexibility: write custom dataset creation logic in train_valid_test_datasets_provider, a custom model in model_optimizer_lr_scheduler_provider, and custom forward logic in run_forward_step.

  1. Create an mmap-type dataset for pre-training.

    See Megatron data processing script for details. Example command to create an mmap dataset:

    python preprocess_data.py \
      --input /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/book_wiki_owtv2_small.json  \
      --output-prefix /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
  2. Inherit from PreTrainer and implement the custom dataset function train_valid_test_datasets_provider in your pre-training code.

    Write your own logic to create the train, validation, and test datasets without relying on third-party libraries. Each dataset should inherit from torch.utils.data.Dataset; a minimal sketch follows the example below.

    from rapidformer import RapidformerEngine, get_args, PreTrainer
    # build_train_valid_test_datasets below stands for your own
    # dataset-construction logic, e.g. adapted from Megatron's data pipeline.
    
    class MegatronGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                seq_length=args.seq_length,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup))
    
            return train_ds, valid_ds, test_ds
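
    As a minimal illustration of such custom logic (the class and fields here are hypothetical, not part of Rapidformer), a dataset only needs __len__ and __getitem__:

    from torch.utils.data import Dataset

    class MyTokenDataset(Dataset):
        """Hypothetical example: serves fixed-length token id sequences."""
        def __init__(self, token_ids, seq_length):
            self.token_ids = token_ids      # 1-D tensor of token ids
            self.seq_length = seq_length

        def __len__(self):
            return len(self.token_ids) // self.seq_length

        def __getitem__(self, idx):
            start = idx * self.seq_length
            chunk = self.token_ids[start:start + self.seq_length]
            # 'text' matches the key consumed by run_forward_step below
            return {'text': chunk}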
  3. Inherit from PreTrainer and implement the custom model function model_optimizer_lr_scheduler_provider in your pre-training code.

    Write your own logic to create the model object without relying on third-party libraries. The model should inherit from torch.nn.Module; a minimal sketch follows the example below.

    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from yourmodel import GPTModel
    
    class MegatronGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def model_optimizer_lr_scheduler_provider(self):
            model = GPTModel()
            return model, None, None
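
    For illustration only (GPTModel above is a placeholder for your own implementation), a compatible model is any torch.nn.Module whose forward accepts what run_forward_step in the next step passes in; position ids and the attention mask are ignored here for brevity:

    import torch.nn as nn

    class TinyGPTModel(nn.Module):
        """Hypothetical stand-in for a real GPT implementation."""
        def __init__(self, vocab_size=50257, hidden_size=768,
                     num_layers=12, num_heads=12):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            layer = nn.TransformerEncoderLayer(hidden_size, num_heads,
                                               batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers)
            self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

        def forward(self, tokens, position_ids=None, attention_mask=None,
                    labels=None):
            logits = self.lm_head(self.blocks(self.embed(tokens)))
            if labels is not None:
                # run_forward_step applies the loss mask itself, so return
                # per-token losses rather than a scalar.
                loss = nn.functional.cross_entropy(
                    logits.view(-1, logits.size(-1)), labels.view(-1),
                    reduction='none')
                return loss.view(labels.shape)
            return logits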
  4. Inherit from PreTrainer and implement the custom forward function run_forward_step in your pre-training code.

    import torch
    from megatron import mpu, get_tokenizer
    from megatron.utils import get_ltor_masks_and_position_ids
    from rapidformer import RapidformerEngine, get_args, PreTrainer
    
    class MyGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
    
        def run_forward_step(self, data_iterator, model):
            """Forward step."""
            args = get_args()
    
            tokenizer = get_tokenizer()
    
            # Items and their type.
            keys = ['text']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
    
            # Unpack.
            tokens_ = data_b['text'].long()
            labels = tokens_[:, 1:].contiguous()
            tokens = tokens_[:, :-1].contiguous()
    
            # Get the masks and position ids.
            attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
                tokens,
                tokenizer.eod,
                args.reset_position_ids,
                args.reset_attention_mask,
                args.eod_mask_loss)
    
            output_tensor = model(tokens, position_ids, attention_mask,
                                  labels=labels)
    
            losses = output_tensor.float()
            loss_mask = loss_mask.view(-1).float()
            loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
    
            return loss
  5. Initialize the Rapidformer engine, create a trainer object, call train(), and save the file as rapidformer_pretrain_megatron_gpt_trainer.py.

    engine = RapidformerEngine()
    trainer = MyGPTPreTrainer(engine=engine)
    trainer.train()
  6. Prepare a startup script and set the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    DATA_PATH=book_wiki_owtv2_small_text_sentence
    PRETRAINED_CHECKPOINT=
    
    rapidformer --user-script rapidformer_pretrain_megatron_gpt_trainer.py \
           --tensor-model-parallel-size 2 \          # Enable operator splitting optimization
           --pipeline-model-parallel-size 2 \        # Enable pipeline parallelism optimization
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 128 \                  # Enable gradient accumulation optimization
           --seq-length 512 \
           --tokenizer-type GPT2BPETokenizer \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path $DATA_PATH \
           --vocab-file gpt2-vocab.json \
           --merge-file gpt2-merges.txt \
           --data-impl mmap \                         # Enable data acceleration
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --log-interval 1 \
           --zero-2-memory-optimization \              # Enable model state partitioning
           --checkpoint-activations \                  # Enable gradient checkpointing
           --mixed-precision                           # Enable mixed-precision training
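
    The parallel layout must match the number of visible GPUs: tensor-model-parallel size 2 times pipeline-model-parallel size 2 consumes all 4 GPUs for one model replica, leaving a data-parallel size of 1, so --global-batch-size 128 with --micro-batch-size 16 implies 8 gradient-accumulation steps. A quick sanity check of this arithmetic (illustrative only):

    num_gpus = 4                               # CUDA_VISIBLE_DEVICES=4,5,6,7
    tensor_parallel, pipeline_parallel = 2, 2  # model-parallel sizes above
    data_parallel = num_gpus // (tensor_parallel * pipeline_parallel)  # = 1
    accum_steps = 128 // (16 * data_parallel)  # global / (micro * dp) = 8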