
Platform For AI: Use case: Accelerate Transformer model training

Last Updated: Mar 10, 2026

This topic describes how to accelerate PyTorch Transformer model training with PAI-Rapidformer, using either the black-box (CLI-based) or the white-box (code template) method.

Prerequisites

Acceleration methods

Rapidformer supports two acceleration methods:

Black box: Hugging Face fine-tuning

  1. Register your dataset with Hugging Face, or use an existing dataset. Pass it to Rapidformer using the --data-path and --data-name parameters.

  2. Register your model with Hugging Face, or use an existing model. Pass it to Rapidformer using --pretrained-model-name-or-path.

  3. Configure the Rapidformer CLI to start training.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    rapidformer --task sequence_classification \ # Task name
                --pretrained-model-name-or-path 'bert-base-cased' \  # Registered model name
                --data-path glue \                      # Registered data path name
                --data-name mrpc \                      # Registered data file name
                --epochs 3 \                               # Number of training epochs
                --micro-batch-size 16 \                    # Batch size on each GPU
                --global-batch-size 64 \                   # Total batch size for distributed training
                --lr 2e-5 \                                # Learning rate
                --lr-decay-style linear \                  # Learning rate decay policy
                --lr-warmup-iters 100 \                    # Number of learning rate warmup steps
                --weight-decay 1e-2 \                      # Weight decay coefficient
                --clip-grad 1.0 \                          # Gradient clip coefficient
                --seed 42 \                                # Random seed
                --mixed-precision \                        # Enable mixed-precision training
                --onnx-runtime-training \                  # Enable computational graph optimization
                --zero-1-memory-optimization               # Enable optimizer state partitioning

    For more information about each parameter, see Parameter settings guide.
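
    The batch settings interact as follows: with 4 visible GPUs and no model parallelism, the data-parallel size is 4, so --global-batch-size 64 with --micro-batch-size 16 means each optimizer step consumes exactly one micro-batch per GPU. A back-of-the-envelope check of this arithmetic (illustrative only, not Rapidformer code):

    num_gpus = 4              # CUDA_VISIBLE_DEVICES=4,5,6,7
    micro_batch_size = 16     # --micro-batch-size
    global_batch_size = 64    # --global-batch-size
    # Gradient-accumulation steps implied by the configuration:
    accum_steps = global_batch_size // (micro_batch_size * num_gpus)  # = 1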

Black box: Hugging Face pre-training

  1. Create an mmap-type dataset for pre-training.

    See Megatron data processing script for details. Example command to create an mmap dataset:

    python preprocess_data.py \
      --input book_wiki_owtv2_small.json  \
      --output-prefix gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
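
    The input file is expected to be in loose JSON format, one document per line; by default preprocess_data.py reads the text field of each line (configurable with --json-keys). A minimal book_wiki_owtv2_small.json could look like this (contents are hypothetical):

    {"text": "First document. Plain text that can span many sentences."}
    {"text": "Second document ..."}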
  2. Register your model with Hugging Face, or use an existing model. Pass it to Rapidformer using --pretrained-model-name-or-path.

  3. Configure the Rapidformer CLI to start training.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    rapidformer --task pretraining \
           --pretrained-model-name-or-path 'bert-base-uncased' \
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 128 \               # Enable gradient accumulation
           --seq-length 512 \
           --tokenizer-type BertWordPieceLowerCase \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path book_wiki_owtv2_small_text_sentence \
           --vocab-file bert-en-uncased-vocab.txt  \
           --data-impl mmap \
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --min-lr 0.0 \
           --lr-decay-iters 2000 \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --mixed-precision \                    # Enable mixed-precision training
           --onnx-runtime-training \              # Enable computational graph optimization
           --fsdp-memory-optimization             # Enable model state partitioning

    For more information about each parameter, see Parameter settings guide.

White box: Hugging Face fine-tuning with Finetuner template

The Rapidformer Finetuner code template lets you quickly create Hugging Face fine-tuning jobs. It contains four functions:

  • train_valid_test_datasets_provider to create datasets

  • model_optimizer_lr_scheduler_provider to construct the model, optimizer, and learning rate scheduler

  • run_forward_step to define forward operation logic

  • run_compute_metrics to compute evaluation metrics

See Rapidformer API for details on these functions. The following skeleton summarizes their inputs and outputs:

class MyFinetuner(Finetuner):

    def __init__(self, engine):
        super().__init__(engine=engine)

    # Get the training/validation/test datasets
    # Input: None
    # Output: Three dataset objects and one collate function
    def train_valid_test_datasets_provider(self):

        return train_dataset, valid_dataset, test_dataset, collate_fn

    # Create the model/optimizer/learning rate scheduler
    # Input: None
    # Output: Three objects
    def model_optimizer_lr_scheduler_provider(self):

        return model, optimizer, lr_scheduler

    # Write the forward logic
    # Input: batch or iterator, model
    # Output: loss
    def run_forward_step(self, batch_or_iterator, model):
        return loss

    # Write the validation set evaluation logic, dedicated for fine-tuning
    # Input: model, validation set data loader
    # Output: metric object
    def run_compute_metrics(self, model, eval_dataloader):
        return metric

After understanding the custom code template, prepare the dataset and model as shown in Black box: Hugging Face fine-tuning. Then perform these steps:

  1. Import the Rapidformer and Hugging Face interfaces.

    import torch
    from transformers import AutoConfig, AutoTokenizer, BertForSequenceClassification
    from datasets import load_dataset, load_metric
    from rapidformer import RapidformerEngine
    from rapidformer import get_args
    from rapidformer import get_logger
    from rapidformer import get_timers
    from rapidformer import Finetuner
    from rapidformer import Pretrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface
  2. Complete the four functions in the code template.

    class MyFinetuner(Finetuner):
        def __init__(self, engine):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self):
            args = get_args()
            tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    
            def tokenize_function(examples):
                # max_length=None => use the model max length (it's actually the default)
                outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
                return outputs
    
            datasets = load_dataset(args.dataset_path, args.dataset_name)
            # Apply the method we just defined to all the examples in all the splits of the dataset
            tokenized_datasets = datasets.map(
                tokenize_function,
                batched=True,
                remove_columns=["idx", "sentence1", "sentence2"],
            )
            tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    
            train_dataset = tokenized_datasets["train"]
            valid_dataset = tokenized_datasets['validation']
            test_dataset = tokenized_datasets['test']
    
            def collate_fn(examples):
                return tokenizer.pad(examples, padding="longest", return_tensors="pt")
    
            return train_dataset, valid_dataset, test_dataset, collate_fn
    
        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = BertForSequenceClassification.from_pretrained(args.load)
            return model, None, None
    
        def run_forward_step(self, batch, model):
            output_tensor = model(**batch)
            return output_tensor.loss
    
        # after each epoch run metric on eval dataset
        def run_compute_metrics(self, model, eval_dataloader):
            args = get_args()
            model = model[0]
            metric = load_metric(args.dataset_path, args.dataset_name)
            for step, batch in enumerate(eval_dataloader):
                with torch.no_grad():
                    outputs = model(**batch)
                predictions = outputs.logits.argmax(dim=-1)
    
                metric.add_batch(
                    predictions=self.gather(predictions),
                    references=self.gather(batch["labels"]),
                )
    
            eval_metric = metric.compute()
            return eval_metric
  3. Initialize the Rapidformer engine, create a trainer object, call train(), and save the file as rapidformer_finetune_huggingface_bert_trainer.py.

    engine = RapidformerEngine()
    trainer = MyFintuner(engine=engine)
    trainer.train()
  4. Prepare a startup script based on the CLI. Set --user-script to rapidformer_finetune_hugging_face_bert_trainer.py and set the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    rapidformer --user-script rapidformer_finetune_huggingface_bert_trainer.py \
                --task sequence_classification \
                --pretrained-model-name-or-path 'bert-base-cased' \
                --data-path glue \
                --data-name mrpc \
                --epochs 3 \
                --micro-batch-size 16 \
                --global-batch-size 16 \
                --lr 2e-5 \
                --lr-decay-style linear \
                --lr-warmup-iters 100 \
                --weight-decay 1e-2 \
                --clip-grad 1.0 \
                --mixed-precision \                               # Enable mixed-precision training
                --zero-3-memory-optimization \                    # Enable model state partitioning
                --onnx-runtime-training                           # Enable computational graph optimization

White box: Hugging Face pre-training with Pretrainer template

The Rapidformer Pretrainer code template lets you quickly create Hugging Face pre-training jobs. It contains these functions:

  • train_valid_test_datasets_provider to create datasets

  • model_optimizer_lr_scheduler_provider to construct the model, optimizer, and learning rate scheduler

  • run_forward_step to define forward pass logic

See Rapidformer API for details on these functions. For inputs and outputs, see White box: Hugging Face fine-tuning with Finetuner template.

After understanding the custom code template, prepare the dataset and model as shown in Black box: Hugging Face fine-tuning. Then perform these steps:

  1. Import the Rapidformer and Hugging Face interfaces.

    Note

    Pre-training uses an iterator to read data, so import mpu for data parallelism.

    import torch
    from megatron import mpu
    from transformers import AutoModelForPreTraining
    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface
  2. Inherit from PreTrainer and complete the pre-training code.

    class MyBertPreTrainer(PreTrainer):
    
        def __init__(self,engine):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets_for_huggingface(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                max_seq_length=args.seq_length,
                masked_lm_prob=args.mask_prob,
                short_seq_prob=args.short_seq_prob,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup),
                binary_head=True)
    
            return train_ds, valid_ds, test_ds
    
        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = AutoModelForPreTraining.from_pretrained(args.pretrained_model_name_or_path)
            return model, None, None
    
        def run_forward_step(self, data_iterator, model):
            # Items and their type.
            keys = ['input_ids', 'attention_mask', 'token_type_ids', 'labels', 'next_sentence_label']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
            input_ids = data_b['input_ids'].long()
            attention_mask = data_b['attention_mask'].long()
            token_type_ids = data_b['token_type_ids'].long()
            labels = data_b['labels'].long()
            next_sentence_label = data_b['next_sentence_label'].long()
            output_tensor = model(input_ids=input_ids, attention_mask=attention_mask,
                                  token_type_ids=token_type_ids, labels=labels, next_sentence_label=next_sentence_label)
    
            return output_tensor['loss']
  3. Initialize the Rapidformer engine, create a trainer object, call train(), and save the file as rapidformer_pretrain_huggingface_bert_trainer.py.

    engine = RapidformerEngine()
    trainer = MyBertPreTrainer(engine=engine)
    trainer.train()
  4. Prepare a startup script based on the CLI and set the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    DATA_PATH=book_wiki_owtv2_small_text_sentence
    
    rapidformer --user-script rapidformer_pretrain_huggingface_bert_trainer.py \
           --pretrained-model-name-or-path 'bert-base-uncased' \
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 64 \
           --seq-length 512 \
           --tokenizer-type BertWordPieceLowerCase \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path $DATA_PATH \
           --vocab-file bert-en-uncased-vocab.txt  \
           --data-impl mmap \                               # Enable data acceleration
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --zero-3-memory-optimization \                    # Enable model state partitioning
           --onnx-runtime-training \                         # Enable computational graph optimization
           --mixed-precision                                 # Enable mixed-precision training

White box: Hugging Face fine-tuning with custom Trainer

For programs with custom Trainers, Rapidformer provides limited acceleration: the Apex optimizer, model state partitioning, and computational graph optimization. Enabling mixed-precision training requires extensive modifications, so for best results use the template-based methods described earlier. This section shows how to apply intrusive acceleration to typical Hugging Face fine-tuning code.

Hugging Face fine-tuning code example:

import torch
from datasets import load_dataset, load_metric
from torch.utils.data import DataLoader
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    BertForSequenceClassification,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")

def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
    return outputs

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["idx", "sentence1", "sentence2"],
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_iters,
    num_training_steps=args.train_iters
)

device = torch.device("cuda", args.local_rank)

for epoch in range(args.epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    for step, batch in enumerate(eval_dataloader):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            metric.add_batch(
                    predictions=engine.gather(predictions),
                    references=engine.gather(batch["labels"]))

    eval_metric = metric.compute()
    print("epoch {}: {}".format(epoch, eval_metric))

This code has issues: no data parallelism support, slow optimizer, and no mixed-precision training. The following steps modify this code using Rapidformer APIs.

  1. Add data parallelism support.

    Create a finetuner object, then call finetuner.build_data_loader to create a data loader. This loader supports data parallelism and automatically moves each batch to the GPU, so remove batch.to(device) from the original code. A conceptual sketch of such a loader follows the diff below.

    + from rapidformer import RapidformerEngine
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)
    
    - train_dataloader = DataLoader(tokenized_datasets["train"])
    - eval_dataloader = DataLoader(tokenized_datasets["train"])
    
    + train_dataloader = finetuner.build_data_loader(tokenized_datasets["train"])
    + eval_dataloader = finetuner.build_data_loader(tokenized_datasets["validation"])
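
    For intuition only (this is not the Rapidformer implementation), a data-parallel loader of this kind is typically a regular DataLoader wrapped around a torch DistributedSampler, with device placement handled in the training loop:

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def build_data_loader_sketch(dataset, batch_size, collate_fn=None):
        # Assumes torch.distributed has been initialized; each rank then
        # iterates over a disjoint shard of the dataset.
        sampler = DistributedSampler(dataset)
        return DataLoader(dataset, batch_size=batch_size,
                          sampler=sampler, collate_fn=collate_fn)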
  2. Use the Apex optimizer on top of data parallelism.

    Replace the native optimizer with the faster Apex FusedAdam supplied by Rapidformer: remove the original optimizer and learning rate scheduler construction, wrap the scheduler factory in functools.partial, and call engine.compose to wire up the model, optimizer, and learning rate scheduler.

    + from rapidformer import RapidformerEngine
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)
    
    - optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)
    - lr_scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
        num_warmup_steps=args.lr_warmup_iters,
        num_training_steps=args.train_iters
    )
    
    
    + from functools import partial
    + lr_scheduler = partial(
            get_linear_schedule_with_warmup,
            num_warmup_steps=args.lr_warmup_iters,
            num_training_steps=args.train_iters
        )
    
    + model, optimizer, lr_scheduler = engine.compose(model_obj=model,
          lr_scheduler_fn=lr_scheduler)
    Note

    Combining the Apex optimizer and mixed precision with data parallelism is complex: mixed-precision training involves switching the model to fp16 and scaling the loss, and making such modifications to a frontend program that has no trainer is error-prone. Prefer a Trainer-based solution: Rapidformer's Finetuner integrates data parallelism, Apex, PyTorch mixed-precision training, Megatron mixed-precision optimizers, and the VRAM optimizations from FairScale and DeepSpeed.

White box: Megatron pre-training with Pretrainer template

After understanding the preceding white-box methods, you can bypass the Data Hub and Model Hub entirely for maximum flexibility: write custom dataset creation logic in train_valid_test_datasets_provider, a custom model in model_optimizer_lr_scheduler_provider, and custom forward logic in run_forward_step.

  1. Create an mmap-type dataset for pre-training.

    See Megatron data processing script for details. Example command to create an mmap dataset:

    python preprocess_data.py \
      --input /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/book_wiki_owtv2_small.json  \
      --output-prefix /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
  2. Inherit from PreTrainer and implement the custom dataset function train_valid_test_datasets_provider in your pre-training code.

    Write your own logic to create the train, validation, and test datasets without relying on third-party libraries. Each dataset should inherit from torch.utils.data.Dataset; a minimal sketch follows the example below.

    from rapidformer import RapidformerEngine, get_args, PreTrainer
    # build_train_valid_test_datasets below stands for your own
    # dataset-construction logic, e.g. adapted from Megatron's data pipeline.
    
    class MegatronGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                seq_length=args.seq_length,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup))
    
            return train_ds, valid_ds, test_ds
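
    As a minimal illustration of such custom logic (the class and fields here are hypothetical, not part of Rapidformer), a dataset only needs __len__ and __getitem__:

    from torch.utils.data import Dataset

    class MyTokenDataset(Dataset):
        """Hypothetical example: serves fixed-length token id sequences."""
        def __init__(self, token_ids, seq_length):
            self.token_ids = token_ids      # 1-D tensor of token ids
            self.seq_length = seq_length

        def __len__(self):
            return len(self.token_ids) // self.seq_length

        def __getitem__(self, idx):
            start = idx * self.seq_length
            chunk = self.token_ids[start:start + self.seq_length]
            # 'text' matches the key consumed by run_forward_step below
            return {'text': chunk}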
  3. Inherit from PreTrainer and implement the custom model function model_optimizer_lr_scheduler_provider in your pre-training code.

    Write your own logic to create the model object without relying on third-party libraries. The model should inherit from torch.nn.Module; a minimal sketch follows the example below.

    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from yourmodel import GPTModel
    
    class MegatronGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def model_optimizer_lr_scheduler_provider(self):
            model = GPTModel()
            return model, None, None
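
    For illustration only (GPTModel above is a placeholder for your own implementation), a compatible model is any torch.nn.Module whose forward accepts what run_forward_step in the next step passes in; position ids and the attention mask are ignored here for brevity:

    import torch.nn as nn

    class TinyGPTModel(nn.Module):
        """Hypothetical stand-in for a real GPT implementation."""
        def __init__(self, vocab_size=50257, hidden_size=768,
                     num_layers=12, num_heads=12):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            layer = nn.TransformerEncoderLayer(hidden_size, num_heads,
                                               batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers)
            self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

        def forward(self, tokens, position_ids=None, attention_mask=None,
                    labels=None):
            logits = self.lm_head(self.blocks(self.embed(tokens)))
            if labels is not None:
                # run_forward_step applies the loss mask itself, so return
                # per-token losses rather than a scalar.
                loss = nn.functional.cross_entropy(
                    logits.view(-1, logits.size(-1)), labels.view(-1),
                    reduction='none')
                return loss.view(labels.shape)
            return logits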
  4. Inherit from PreTrainer and implement the custom forward function run_forward_step in your pre-training code.

    import torch
    from megatron import mpu, get_tokenizer
    from megatron.utils import get_ltor_masks_and_position_ids
    from rapidformer import RapidformerEngine, get_args, PreTrainer
    
    class MyGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
    
        def run_forward_step(self, data_iterator, model):
            """Forward step."""
            args = get_args()
    
            tokenizer = get_tokenizer()
    
            # Items and their type.
            keys = ['text']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
    
            # Unpack.
            tokens_ = data_b['text'].long()
            labels = tokens_[:, 1:].contiguous()
            tokens = tokens_[:, :-1].contiguous()
    
            # Get the masks and position ids.
            attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
                tokens,
                tokenizer.eod,
                args.reset_position_ids,
                args.reset_attention_mask,
                args.eod_mask_loss)
    
            output_tensor = model(tokens, position_ids, attention_mask,
                                  labels=labels)
    
            losses = output_tensor.float()
            loss_mask = loss_mask.view(-1).float()
            loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
    
            return loss
  5. Initialize the Rapidformer engine, create a trainer object, call train(), and save the file as rapidformer_pretrain_megatron_gpt_trainer.py.

    engine = RapidformerEngine()
    trainer = MyGPTPreTrainer(engine=engine)
    trainer.train()
  6. Prepare a startup script and set the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    DATA_PATH=book_wiki_owtv2_small_text_sentence
    PRETRAINED_CHECKPOINT=
    
    rapidformer --user-script rapidformer_pretrain_megatron_gpt_trainer.py \
           --tensor-model-parallel-size 2 \          # Enable operator splitting optimization
           --pipeline-model-parallel-size 2 \        # Enable pipeline parallelism optimization
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 128 \                  # Enable gradient accumulation optimization
           --seq-length 512 \
           --tokenizer-type GPT2BPETokenizer \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path $DATA_PATH \
           --vocab-file gpt2-vocab.json \
           --merge-file gpt2-merges.txt \
           --data-impl mmap \                         # Enable data acceleration
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --log-interval 1 \
           --zero-2-memory-optimization \              # Enable model state partitioning
           --checkpoint-activations \                  # Enable gradient checkpointing
           --mixed-precision                           # Enable mixed-precision training
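
    The parallel layout must match the number of visible GPUs: tensor-model-parallel size 2 times pipeline-model-parallel size 2 consumes all 4 GPUs for one model replica, leaving a data-parallel size of 1, so --global-batch-size 128 with --micro-batch-size 16 implies 8 gradient-accumulation steps. A quick sanity check of this arithmetic (illustrative only):

    num_gpus = 4                               # CUDA_VISIBLE_DEVICES=4,5,6,7
    tensor_parallel, pipeline_parallel = 2, 2  # model-parallel sizes above
    data_parallel = num_gpus // (tensor_parallel * pipeline_parallel)  # = 1
    accum_steps = 128 // (16 * data_parallel)  # global / (micro * dp) = 8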