Large language models (LLMs) drive technological progress in AI and natural language processing. Continued pre-training on domain-specific data improves model performance for specialized tasks. This guide demonstrates continued pre-training using the Qwen2-72B model on Alibaba Cloud PAI.
Overview
Continued pre-training (CPT) adapts a pre-trained model to a specific domain by training it on additional domain-relevant data. Unlike fine-tuning, which uses labeled examples for specific tasks, CPT uses large amounts of unlabeled text to enhance the model's general understanding of a domain.
When to use continued pre-training:
You have large amounts of domain-specific unlabeled data (documents, articles, code)
You want to improve the model's domain knowledge without task-specific fine-tuning
Your domain vocabulary or concepts differ significantly from general text
When to use fine-tuning instead:
You have labeled input-output pairs for specific tasks
You need the model to follow specific instructions or formats
Your dataset is smaller (hundreds to thousands of examples)
Before you begin
Ensure you have the following prerequisites:
Alibaba Cloud account with PAI service activated
Workspace created in PAI console
NAS (Network Attached Storage) dataset configured for training output. See Create dataset.
Sufficient GPU quota: 32 A100, A800, or H800 (80 GB) GPUs across 4 nodes
Continued pre-training of Qwen2-72B requires significant compute resources. Cost factors include GPU hours, storage, and data transfer. For pricing details, see PAI billing documentation.
Step 1: Prepare training data
The continued pre-training solution uses the Pai-Megatron-Patch toolkit, which simplifies LLM and Vision Language Model (VLM) training using the Megatron framework. This toolkit efficiently utilizes GPU resources and applies various Megatron-LM acceleration techniques.
Data format requirements:
Pai-Megatron-Patch requires pre-training data in MMAP (memory-mapped) format. This pre-tokenized format significantly reduces data loading time, especially with large datasets.
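To illustrate why memory-mapped files cut loading time, here is a minimal sketch using NumPy's memmap. The file name and dtype are assumptions for demonstration only; the actual .bin/.idx layout used by Pai-Megatron-Patch is toolkit-specific.

```python
import numpy as np

# Write a small array of token IDs to disk, simulating a pre-tokenized
# dataset (the real .bin/.idx layout used by Pai-Megatron-Patch differs).
tokens = np.arange(1000, dtype=np.int32)
tokens.tofile("demo_tokens.bin")

# Memory-map the file: token IDs are paged in on demand instead of being
# read and tokenized up front, which is what makes MMAP loading fast.
mapped = np.memmap("demo_tokens.bin", dtype=np.int32, mode="r")

# Slice a "sequence" without loading the whole file into RAM.
sequence = mapped[100:108]
print(sequence.tolist())
```

Because tokenization happens once, offline, each training run pays only the cost of mapping the file, not re-reading and re-encoding the raw text.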
PAI-QuickStart requires MMAP data files to be named dataset.bin and dataset.idx.
Option A: Convert your data
Convert raw text data to MMAP format using one of these methods:
Data transformation script: Follow the data preprocessing tutorial
Designer component: Use the built-in "Convert text data to mmap format" component in Designer
Option B: Use sample data
For testing purposes, download the pre-processed sample dataset:
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.idx
mv wudao_llama3bpe_content_document.bin dataset.bin
mv wudao_llama3bpe_content_document.idx dataset.idx
Step 2: Start training in PAI-QuickStart
After you prepare the data, you can perform continued pre-training on the model in PAI-QuickStart. This section uses the Qwen2-72B model as an example.
Access Model Gallery
Log on to the PAI console.
In the upper-left corner, select your region.
In the left navigation pane, choose Workspaces and open your workspace.
In the left navigation pane, choose QuickStart > Model Gallery.
On the Model Gallery page, find the Qwen2-72B-Pre-trained (Megatron Edition) model card and click it.
Configure training
On the model details page, click Train in the upper-right corner. Configure the following settings:
Training output: Only NAS datasets are supported as the output channel. Megatron checkpoints are saved in the checkpoint subfolder of your output directory.
Compute resources: Continued pre-training of Qwen2-72B requires 4 nodes with a total of 32 A100, A800, or H800 (80 GB) GPUs.
Hyperparameters: Adjust hyperparameters based on your data and compute resources, or use the defaults:
| Parameter | Default | Type | Description |
| --- | --- | --- | --- |
| job_name | qwen2-72b-cpt | string | Training task type. Do not modify. |
| batch_size | 1 | int | Data processed per GPU per iteration. |
| global_batch_size | 32 | int | Total data across all GPUs per iteration (batch_size × GPU count). |
| learning_rate | 5e-5 | float | Training learning rate. |
| min_learning_rate | 5e-6 | float | Minimum learning rate. |
| sequence_length | 1024 | int | Maximum text sequence length. |
| pad_length | 128 | int | Sequence padding length. |
| save_interval | 1000 | int | Iterations between checkpoint saves. |
| train_tokens | 1638400 | int | Total training tokens. Tokens per iteration = global_batch_size × sequence_length. |
| warmup_tokens | 163840 | int | Tokens consumed during the warmup phase. |
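To sanity-check a configuration before launching a job, the tokens per iteration and the implied iteration counts can be computed directly from the default hyperparameter values above; a quick sketch:

```python
# Default hyperparameters from the table above.
global_batch_size = 32
sequence_length = 1024
train_tokens = 1_638_400
warmup_tokens = 163_840

# Tokens consumed per iteration = global_batch_size × sequence_length.
tokens_per_iter = global_batch_size * sequence_length

# Iteration counts implied by the token budgets.
total_iters = train_tokens // tokens_per_iter
warmup_iters = warmup_tokens // tokens_per_iter

print(tokens_per_iter, total_iters, warmup_iters)  # 32768 50 5
```

Note that the default train_tokens implies a very short run (50 iterations), suitable for a trial with the sample dataset; for a real domain-adaptation job you would raise train_tokens well beyond this default.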
Start training
Click Train. PAI-QuickStart redirects you to the training page where you can monitor training status, logs, and GPU utilization.
Step 3: Convert checkpoint (optional)
The pre-training output is a checkpoint in Megatron Dense Checkpoint format. To use the model with Hugging Face libraries or deploy it to other platforms, convert it to Hugging Face format.
For conversion instructions, see Megatron-Core model format conversion.
Best practices
Data preparation
Data quality: Ensure training data is clean, deduplicated, and domain-relevant.
Data size: Larger datasets generally produce better results. Aim for at least 1B tokens for meaningful domain adaptation.
Data mixing: Consider mixing domain data with general data to prevent catastrophic forgetting.
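One simple way to apply the data-mixing suggestion is to sample documents from the domain and general corpora at a fixed ratio when assembling the pre-training corpus. The sketch below is illustrative only; the 80/20 ratio, in-memory document lists, and function name are assumptions, not part of the PAI toolchain.

```python
import random

def mix_corpora(domain_docs, general_docs, domain_ratio=0.8, n_samples=10, seed=0):
    """Sample documents with probability domain_ratio from the domain
    corpus and (1 - domain_ratio) from the general corpus, so general
    text keeps refreshing broad knowledge during domain adaptation."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        source = domain_docs if rng.random() < domain_ratio else general_docs
        mixed.append(rng.choice(source))
    return mixed

docs = mix_corpora(["domain doc"] * 3, ["general doc"] * 3)
print(docs)
```

In practice you would stream from files rather than lists, but the sampling logic is the same: keeping a share of general text in every batch is what guards against catastrophic forgetting.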
Training configuration
Learning rate: Start with the default (5e-5) and reduce if training becomes unstable.
Checkpoint frequency: Save checkpoints frequently (every 500-1000 iterations) to enable recovery.
Monitoring: Watch for loss spikes or divergence, which may indicate learning rate issues.
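The loss-spike check can be automated while reviewing training logs: flag any iteration whose loss exceeds a multiple of the rolling mean of recent losses. This is a sketch; the window size and threshold are assumptions you would tune for your run.

```python
from collections import deque

def detect_loss_spikes(losses, window=5, threshold=1.5):
    """Return indices of iterations whose loss exceeds `threshold` times
    the rolling mean of the previous `window` losses."""
    recent = deque(maxlen=window)
    spikes = []
    for i, loss in enumerate(losses):
        if len(recent) == window and loss > threshold * (sum(recent) / window):
            spikes.append(i)
        recent.append(loss)
    return spikes

# A smoothly decreasing loss curve with one spike at index 7.
losses = [2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 4.0, 1.3]
print(detect_loss_spikes(losses))  # [7]
```

Repeated spikes, or a loss that climbs instead of recovering, typically call for lowering the learning rate or resuming from the last good checkpoint.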
What's next
After completing continued pre-training:
Convert checkpoint: Convert to Hugging Face format if needed for deployment.
Fine-tune: Apply supervised fine-tuning for specific tasks.
Deploy: Deploy the model using EAS (Elastic Algorithm Service) for inference.