Large language models (LLMs) drive technological progress in AI and natural language processing. Continued pre-training on domain-specific data improves model performance for specialized tasks. This guide demonstrates continued pre-training using the Qwen2-72B model on Alibaba Cloud PAI.
Overview
Continued pre-training (CPT) adapts a pre-trained model to a specific domain by training it on additional domain-relevant data. Unlike fine-tuning, which uses labeled examples for specific tasks, CPT uses large amounts of unlabeled text to enhance the model's general understanding of a domain.
When to use continued pre-training:
You have large amounts of domain-specific unlabeled data (documents, articles, code)
You want to improve the model's domain knowledge without task-specific fine-tuning
Your domain vocabulary or concepts differ significantly from general text
When to use fine-tuning instead:
You have labeled input-output pairs for specific tasks
You need the model to follow specific instructions or formats
Your dataset is smaller (hundreds to thousands of examples)
Before you begin
Ensure you have the following prerequisites:
Alibaba Cloud account with PAI service activated
Workspace created in PAI console
NAS (Network Attached Storage) dataset configured for training output. See Create dataset.
Sufficient GPU quota: 32 A100, A800, or H800 (80 GB) GPUs across 4 nodes
Continued pre-training of Qwen2-72B requires significant compute resources. Cost factors include GPU hours, storage, and data transfer. For pricing details, see PAI billing documentation.
Step 1: Prepare training data
The continued pre-training solution uses the Pai-Megatron-Patch toolkit, which simplifies LLM and Vision Language Model (VLM) training using the Megatron framework. This toolkit efficiently utilizes GPU resources and applies various Megatron-LM acceleration techniques.
Data format requirements:
Pai-Megatron-Patch requires pre-training data in MMAP (memory-mapped) format. This pre-tokenized format significantly reduces data loading time, especially with large datasets.
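To illustrate why memory-mapped files cut loading time, here is a minimal sketch using NumPy's memmap. The file name and dtype are assumptions for demonstration only; the actual .bin/.idx layout used by Pai-Megatron-Patch is toolkit-specific.

```python
import numpy as np

# Write a small array of token IDs to disk, simulating a pre-tokenized
# dataset (the real .bin/.idx layout used by Pai-Megatron-Patch differs).
tokens = np.arange(1000, dtype=np.int32)
tokens.tofile("demo_tokens.bin")

# Memory-map the file: token IDs are paged in on demand instead of being
# read and tokenized up front, which is what makes MMAP loading fast.
mapped = np.memmap("demo_tokens.bin", dtype=np.int32, mode="r")

# Slice a "sequence" without loading the whole file into RAM.
sequence = mapped[100:108]
print(sequence.tolist())
```

Because tokenization happens once, offline, each training run pays only the cost of mapping the file, not re-reading and re-encoding the raw text.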
PAI-QuickStart requires MMAP data files to be named dataset.bin and dataset.idx.
Option A: Convert your data
Convert raw text data to MMAP format using one of these methods:
Data transformation script: Follow the data preprocessing tutorial
Designer component: Use the built-in "Convert text data to mmap format" component in Designer
Option B: Use sample data
For testing purposes, download the pre-processed sample dataset:
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.idx
mv wudao_llama3bpe_content_document.bin dataset.bin
mv wudao_llama3bpe_content_document.idx dataset.idx
Step 2: Start training in PAI-QuickStart
After you prepare the data, you can perform continued pre-training on the model in PAI-QuickStart. This section uses the Qwen2-72B model as an example.
Access Model Gallery
Log on to the PAI console.
In the upper-left corner, select your region.
In the left navigation pane, choose Workspaces and open your workspace.
In the left navigation pane, choose QuickStart > Model Gallery.
On the Model Gallery page, find the Qwen2-72B-Pre-trained (Megatron Edition) model card and click it.
Configure training
On the model details page, click Train in the upper-right corner. Configure the following settings:
Training output: Only NAS datasets are supported as the output channel. Megatron checkpoints are saved in the checkpoint subfolder of your output directory.
Compute resources: Continued pre-training of Qwen2-72B requires 4 nodes with a total of 32 A100, A800, or H800 (80 GB) GPUs.
Hyperparameters: Adjust hyperparameters based on your data and compute resources, or use the defaults:
| Parameter | Default | Type | Description |
| --- | --- | --- | --- |
| job_name | qwen2-72b-cpt | string | Training task type. Do not modify. |
| batch_size | 1 | int | Data processed per GPU per iteration. |
| global_batch_size | 32 | int | Total data across all GPUs per iteration (batch_size × GPU count). |
| learning_rate | 5e-5 | float | Training learning rate. |
| min_learning_rate | 5e-6 | float | Minimum learning rate. |
| sequence_length | 1024 | int | Maximum text sequence length. |
| pad_length | 128 | int | Sequence padding length. |
| save_interval | 1000 | int | Iterations between checkpoint saves. |
| train_tokens | 1638400 | int | Total training tokens. Tokens per iteration = global_batch_size × sequence_length. |
| warmup_tokens | 163840 | int | Tokens consumed during the warmup phase. |
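To sanity-check a configuration before launching a job, the tokens per iteration and the implied iteration counts can be computed directly from the default hyperparameter values above; a quick sketch:

```python
# Default hyperparameters from the table above.
global_batch_size = 32
sequence_length = 1024
train_tokens = 1_638_400
warmup_tokens = 163_840

# Tokens consumed per iteration = global_batch_size × sequence_length.
tokens_per_iter = global_batch_size * sequence_length

# Iteration counts implied by the token budgets.
total_iters = train_tokens // tokens_per_iter
warmup_iters = warmup_tokens // tokens_per_iter

print(tokens_per_iter, total_iters, warmup_iters)  # 32768 50 5
```

Note that the default train_tokens implies a very short run (50 iterations), suitable for a trial with the sample dataset; for a real domain-adaptation job you would raise train_tokens well beyond this default.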
Start training
Click Train. PAI-QuickStart redirects you to the training page where you can monitor training status, logs, and GPU utilization.
Step 3: Convert checkpoint (optional)
The pre-training output is a checkpoint in Megatron Dense Checkpoint format. To use the model with Hugging Face libraries or deploy it to other platforms, convert it to Hugging Face format.
For conversion instructions, see Megatron-Core model format conversion.
Best practices
Data preparation
Data quality: Ensure training data is clean, deduplicated, and domain-relevant.
Data size: Larger datasets generally produce better results. Aim for at least 1B tokens for meaningful domain adaptation.
Data mixing: Consider mixing domain data with general data to prevent catastrophic forgetting.
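One simple way to apply the data-mixing suggestion is to sample documents from the domain and general corpora at a fixed ratio when assembling the pre-training corpus. The sketch below is illustrative only; the 80/20 ratio, in-memory document lists, and function name are assumptions, not part of the PAI toolchain.

```python
import random

def mix_corpora(domain_docs, general_docs, domain_ratio=0.8, n_samples=10, seed=0):
    """Sample documents with probability domain_ratio from the domain
    corpus and (1 - domain_ratio) from the general corpus, so general
    text keeps refreshing broad knowledge during domain adaptation."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        source = domain_docs if rng.random() < domain_ratio else general_docs
        mixed.append(rng.choice(source))
    return mixed

docs = mix_corpora(["domain doc"] * 3, ["general doc"] * 3)
print(docs)
```

In practice you would stream from files rather than lists, but the sampling logic is the same: keeping a share of general text in every batch is what guards against catastrophic forgetting.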
Training configuration
Learning rate: Start with the default (5e-5) and reduce if training becomes unstable.
Checkpoint frequency: Save checkpoints frequently (every 500-1000 iterations) to enable recovery.
Monitoring: Watch for loss spikes or divergence, which may indicate learning rate issues.
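The loss-spike check can be automated while reviewing training logs: flag any iteration whose loss exceeds a multiple of the rolling mean of recent losses. This is a sketch; the window size and threshold are assumptions you would tune for your run.

```python
from collections import deque

def detect_loss_spikes(losses, window=5, threshold=1.5):
    """Return indices of iterations whose loss exceeds `threshold` times
    the rolling mean of the previous `window` losses."""
    recent = deque(maxlen=window)
    spikes = []
    for i, loss in enumerate(losses):
        if len(recent) == window and loss > threshold * (sum(recent) / window):
            spikes.append(i)
        recent.append(loss)
    return spikes

# A smoothly decreasing loss curve with one spike at index 7.
losses = [2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 4.0, 1.3]
print(detect_loss_spikes(losses))  # [7]
```

Repeated spikes, or a loss that climbs instead of recovering, typically call for lowering the learning rate or resuming from the last good checkpoint.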
What's next
After completing continued pre-training:
Convert checkpoint: Convert to Hugging Face format if needed for deployment.
Fine-tune: Apply supervised fine-tuning for specific tasks.
Deploy: Deploy the model using EAS (Elastic Algorithm Service) for inference.