
Platform for AI: Getting started: Continued pre-training for large models

Last Updated: Jan 29, 2026

Large language models (LLMs) drive technological progress in AI and natural language processing. Continued pre-training on domain-specific data improves model performance for specialized tasks. This guide demonstrates continued pre-training using the Qwen2-72B model on Alibaba Cloud PAI.

Overview

Continued pre-training (CPT) adapts a pre-trained model to a specific domain by training it on additional domain-relevant data. Unlike fine-tuning, which uses labeled examples for specific tasks, CPT uses large amounts of unlabeled text to enhance the model's general understanding of a domain.

When to use continued pre-training:

  • You have large amounts of domain-specific unlabeled data (documents, articles, code)

  • You want to improve the model's domain knowledge without task-specific fine-tuning

  • Your domain vocabulary or concepts differ significantly from general text

When to use fine-tuning instead:

  • You have labeled input-output pairs for specific tasks

  • You need the model to follow specific instructions or formats

  • Your dataset is smaller (hundreds to thousands of examples)

Before you begin

Ensure you have the following prerequisites:

  • Alibaba Cloud account with PAI service activated

  • Workspace created in PAI console

  • NAS (Network Attached Storage) dataset configured for training output. See Create dataset.

  • Sufficient GPU quota: 32 A100, A800, or H800 (80 GB) GPUs across 4 nodes

Note

Continued pre-training of Qwen2-72B requires significant compute resources. Cost factors include GPU hours, storage, and data transfer. For pricing details, see PAI billing documentation.

Step 1: Prepare training data

The continued pre-training solution uses the Pai-Megatron-Patch toolkit, which simplifies LLM and Vision Language Model (VLM) training using the Megatron framework. This toolkit efficiently utilizes GPU resources and applies various Megatron-LM acceleration techniques.

Data format requirements:

Pai-Megatron-Patch requires pre-training data in MMAP (memory-mapped) format. This pre-tokenized format significantly reduces data loading time, especially with large datasets.

Important

PAI-QuickStart requires MMAP data files to be named dataset.bin and dataset.idx.
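Memory mapping is what makes the pre-tokenized format fast to load: the file is paged in from disk on demand instead of being read and parsed up front. The toy sketch below illustrates the idea with `numpy.memmap`; it is not the real Megatron `.bin`/`.idx` layout, and the filename and dtype are illustrative assumptions:

```python
import numpy as np

# Toy illustration of memory-mapped token storage (NOT the actual
# Megatron MMAP layout; filename and dtype here are assumptions).
tokens = np.array([101, 2023, 2003, 1037, 7099, 102], dtype=np.uint16)
tokens.tofile("toy_dataset.bin")

# np.memmap maps the file into the process address space without
# copying it, so even multi-GB token files "open" almost instantly.
mapped = np.memmap("toy_dataset.bin", dtype=np.uint16, mode="r")
print(mapped[:3])  # slices are read lazily from disk
```

Because slices are materialized only when accessed, training workers can share one large token file without each loading a full copy into RAM.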

Option A: Convert your data

Convert raw text data to MMAP format using one of these methods:

  • Data transformation script: Follow the data preprocessing tutorial

  • Designer component: Use the built-in "Convert text data to mmap format" component in Designer

Option B: Use sample data

For testing purposes, download the pre-processed sample dataset:

wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.idx
mv wudao_llama3bpe_content_document.bin dataset.bin
mv wudao_llama3bpe_content_document.idx dataset.idx

Step 2: Start training in PAI-QuickStart

After you prepare the data, you can perform continued pre-training on the model in PAI-QuickStart. This section uses the Qwen2-72B model as an example.

  1. Access Model Gallery

    1. Log on to the PAI console.

    2. In the upper-left corner, select your region.

    3. In the left navigation pane, choose Workspaces and open your workspace.

    4. In the left navigation pane, choose QuickStart > Model Gallery.

  2. On the Model Gallery page, find the Qwen2-72B-Pre-trained (Megatron Edition) model card and click it.

  3. Configure training

    On the model details page, click Train in the upper-right corner. Configure the following settings:

    • Training output: Only NAS datasets are supported as the output channel. Megatron checkpoints are saved in the checkpoint subfolder of your output directory.

    • Compute resources: Continued pre-training of Qwen2-72B requires 4 nodes with 32 A100, A800, or H800 (80 GB) GPUs total.

    • Hyperparameters: Adjust hyperparameters based on your data and compute resources, or use the defaults:

      | Parameter         | Default       | Type   | Description |
      | ----------------- | ------------- | ------ | ----------- |
      | job_name          | qwen2-72b-cpt | string | Training task type. Do not modify. |
      | batch_size        | 1             | int    | Data processed per GPU per iteration. |
      | global_batch_size | 32            | int    | Total data across all GPUs per iteration (batch_size × GPU count). |
      | learning_rate     | 5e-5          | float  | Training learning rate. |
      | min_learning_rate | 5e-6          | float  | Minimum learning rate. |
      | sequence_length   | 1024          | int    | Maximum text sequence length. |
      | pad_length        | 128           | int    | Sequence padding length. |
      | save_interval     | 1000          | int    | Iterations between checkpoint saves. |
      | train_tokens      | 1638400       | int    | Total training tokens. Tokens per iteration = global_batch_size × sequence_length. |
      | warmup_tokens     | 163840        | int    | Tokens consumed during the warmup phase. |

  4. Start training

    Click Train. PAI-QuickStart redirects you to the training page where you can monitor training status, logs, and GPU utilization.
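With the default hyperparameters, the training schedule can be sanity-checked with a little arithmetic. The values below come straight from the hyperparameter table, and the formulas restate the table's own definitions:

```python
# Defaults from the hyperparameter table.
global_batch_size = 32
sequence_length = 1024
train_tokens = 1_638_400
warmup_tokens = 163_840

# Tokens consumed per iteration, as defined in the table.
tokens_per_iteration = global_batch_size * sequence_length  # 32768

# Derived schedule lengths.
total_iterations = train_tokens // tokens_per_iteration    # 50 iterations
warmup_iterations = warmup_tokens // tokens_per_iteration  # 5 iterations

print(tokens_per_iteration, total_iterations, warmup_iterations)
```

The small default `train_tokens` keeps the template job short; for meaningful domain adaptation you would raise it substantially, which lengthens the run proportionally.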

Step 3: Convert checkpoint (optional)

The pre-training output is a checkpoint in Megatron Dense Checkpoint format. To use the model with Hugging Face libraries or deploy it to other platforms, convert it to Hugging Face format.

Best practices

Data preparation

  • Data quality: Ensure training data is clean, deduplicated, and domain-relevant.

  • Data size: Larger datasets generally produce better results. Aim for at least 1B tokens for meaningful domain adaptation.

  • Data mixing: Consider mixing domain data with general data to prevent catastrophic forgetting.
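As an example of the deduplication step above, here is a minimal exact-match dedup over documents. This is a sketch only; production pipelines typically add fuzzy or MinHash-based near-duplicate detection on top of it:

```python
import hashlib

def dedupe_exact(docs):
    """Drop exact duplicate documents by hashing normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        # Normalize lightly (strip whitespace, lowercase) before hashing,
        # so trivially different copies collapse to one entry.
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["Domain text A.", "domain text a.", "Domain text B."]
print(dedupe_exact(corpus))  # keeps the first occurrence of each document
```

Hashing rather than storing the full text keeps the `seen` set small even for corpora with billions of documents.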

Training configuration

  • Learning rate: Start with the default (5e-5) and reduce if training becomes unstable.

  • Checkpoint frequency: Save checkpoints frequently (every 500-1000 iterations) to enable recovery.

  • Monitoring: Watch for loss spikes or divergence, which may indicate learning rate issues.
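The loss-spike check in the last bullet can be automated with a simple running-average monitor. This is a sketch; the window size and the 1.5x threshold are arbitrary assumptions, not PAI settings:

```python
def find_loss_spikes(losses, window=5, factor=1.5):
    """Return indices where loss exceeds `factor` times the mean of
    the previous `window` values -- a crude divergence alarm."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = sum(losses[i - window:i]) / window
        if losses[i] > factor * baseline:
            spikes.append(i)
    return spikes

# A mostly-declining loss curve with one spike at step 5.
history = [2.1, 2.0, 1.9, 1.9, 1.8, 5.0, 1.7]
print(find_loss_spikes(history))  # flags index 5
```

When a spike is flagged, the usual responses are to lower the learning rate or resume from the checkpoint saved before the spike.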

What's next

After completing continued pre-training:

  • Convert checkpoint: Convert to Hugging Face format if needed for deployment.

  • Fine-tune: Apply supervised fine-tuning for specific tasks.

  • Deploy: Deploy the model using EAS (Elastic Algorithm Service) for inference.