Continued pre-training on task- or domain-specific data can improve the performance of large language models (LLMs) in real-world applications. This topic uses the Qwen2-72B model as an example to describe how to perform continued pre-training on LLMs.
Background Information
This topic uses the Pai-Megatron-Patch toolkit to perform continued pre-training. Pai-Megatron-Patch is a training toolkit that makes it easy to train LLMs and vision language models (VLMs) with the Megatron framework. It aims to use GPU computing power efficiently for LLM training and provides convenient access to the acceleration techniques of Megatron-LM for commonly used LLMs.
Prepare data
Pai-Megatron-Patch uses pre-training data in the MMAP format. The data is pre-tokenized, which speeds up data loading, especially for large datasets. You can prepare data in the MMAP format in one of the following ways:
Convert raw data to the MMAP format yourself. For more information, see pretrain_data_preprocessing. An illustrative command sketch follows this list.
Use the Convert Data to mmap Format component in Visualized Modeling (Designer). For more information, see Overview of Machine Learning Designer.
Download a small-scale dataset for a trial run:
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.idx
mv wudao_llama3bpe_content_document.bin dataset.bin
mv wudao_llama3bpe_content_document.idx dataset.idx
To use the data in PAI-QuickStart, the MMAP files must be named dataset.bin and dataset.idx.
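The following is a minimal sketch of how raw JSONL data might be converted to the MMAP format. It assumes the upstream Megatron-LM preprocess_data.py interface that the pretrain_data_preprocessing toolkit builds on; the script path, tokenizer options, and file names are illustrative assumptions, so refer to pretrain_data_preprocessing for the exact commands.
# Illustrative sketch only. Assumes Megatron-LM's tools/preprocess_data.py; the
# Pai-Megatron-Patch pretrain_data_preprocessing scripts may use different names and flags.
# Input: a JSONL file with one {"text": "..."} object per line (hypothetical file name).
python Megatron-LM/tools/preprocess_data.py \
  --input raw_corpus.jsonl \
  --json-keys text \
  --output-prefix dataset \
  --workers 16 \
  --append-eod \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model Qwen/Qwen2-72B
# The tool writes dataset_text_document.bin and dataset_text_document.idx.
# Rename them to dataset.bin and dataset.idx before using them in PAI-QuickStart.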
Use PAI-QuickStart for continued pre-training
The following steps describe how to perform continued pre-training on Qwen2-72B with the prepared data.
Go to the Model Gallery page.
Log on to the PAI console.
In the top navigation bar, select a region.
In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace that you want to manage and click the name of the workspace.
In the left-side navigation pane, choose QuickStart > Model Gallery.
On the Model Gallery page, find and click the Qwen2-72B-Base (Megatron) model card.
On the Qwen2-72B-Base (Megatron) page, click Train in the upper right corner. In the Train panel, configure the following key parameters:
Output Configuration: Specify a Model Output Path. Only NAS datasets are supported. For information about how to create a dataset, see Create and manage datasets. During training, Megatron checkpoints are saved to the checkpoint subfolder of this path.
Computing resources: Continued pre-training of Qwen2-72B requires at least 4 nodes with a total of 32 GPUs (A100, A800, or H800 with 80 GB of memory each).
Hyperparameters: The training algorithm supports the following hyperparameters. You can adjust them based on your data and resources or use the default configuration. A worked example that derives the iteration counts from the default values is provided after this procedure.
| Hyperparameter | Default value | Type | Description |
| --- | --- | --- | --- |
| job_name | qwen2-72b-cpt | string | The name of the training job. No change required. |
| batch_size | 1 | int | The amount of data processed per iteration by a single GPU. |
| global_batch_size | 32 | int | The total amount of data processed per iteration by all GPUs. The value is batch_size multiplied by the number of GPUs. |
| learning_rate | 5e-5 | float | The learning rate. |
| min_learning_rate | 5e-6 | float | The minimum learning rate. |
| sequence_length | 1024 | int | The length of text sequences, in tokens. |
| pad_length | 128 | int | The padding length of text sequences, in tokens. |
| save_interval | 1000 | int | The interval at which checkpoints are saved, measured in iterations. |
| train_tokens | 1638400 | int | The total number of tokens to be consumed by the training job. Each iteration consumes global_batch_size multiplied by sequence_length tokens. |
| warmup_tokens | 163840 | int | The number of tokens to be consumed during the warmup phase of the training job. |
Click Fine-tune to start training. PAI-QuickStart redirects you to the model training page, where you can monitor the status and logs of the training job.
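For reference, the default hyperparameter values translate into the iteration counts below. The arithmetic uses only the defaults listed in the table above; if you change batch_size, global_batch_size, sequence_length, train_tokens, or warmup_tokens, the derived values change accordingly.
# Arithmetic derived from the default hyperparameter values (illustrative only).
echo $(( 1 * 32 ))            # global_batch_size = batch_size x number of GPUs = 32
echo $(( 32 * 1024 ))         # tokens per iteration = global_batch_size x sequence_length = 32768
echo $(( 1638400 / 32768 ))   # training iterations = train_tokens / tokens per iteration = 50
echo $(( 163840 / 32768 ))    # warmup iterations = warmup_tokens / tokens per iteration = 5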
(Optional) Convert checkpoints to the HuggingFace format
The pre-training of Qwen2-72B uses the Megatron Dense Checkpoint format. To convert checkpoints of this format to the HuggingFace format, see Convert Megatron-Core to Huggingface.
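As an orientation aid, the checkpoints written during training typically follow the standard Megatron-LM layout sketched below. The mount path is a hypothetical example; the actual location is the checkpoint subfolder of the Model Output Path that you configured, and the directory names depend on save_interval and the length of training.
# Illustrative listing only; /mnt/nas/qwen2-cpt is a hypothetical mount point for
# the configured Model Output Path, and the iteration directory name is an example.
ls /mnt/nas/qwen2-cpt/checkpoint
# latest_checkpointed_iteration.txt  iter_0000050/  ...
# Pass this checkpoint directory as the input of the conversion tool described in
# Convert Megatron-Core to Huggingface to obtain HuggingFace-format weights.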