Platform For AI: LLM continued pre-training

Last Updated: Feb 20, 2025

Continued pre-training tailored to specific tasks or domains can improve the performance of large language models (LLMs) in actual use. This topic uses the Qwen2-72B model as an example to describe how to perform continued pre-training on LLMs.

Background Information

This topic uses the Pai-Megatron-Patch toolkit to perform continued pre-training. Pai-Megatron-Patch is a training toolkit that makes it easy to train LLMs and vision language models (VLMs) with the Megatron framework. It aims to make efficient use of GPU computing power for LLM training and lets you conveniently train commonly used LLMs with the acceleration techniques provided by Megatron-LM.

Prepare data

Pai-Megatron-Patch uses pre-training data in the MMAP format. MMAP data is pre-tokenized, which speeds up data loading, especially for large datasets. You can prepare data in the MMAP format by using one of the following methods:

  • Convert raw data to the MMAP format yourself. For more information, see pretrain_data_preprocessing.

  • Use the Convert Data to mmap Format component in Visualized Modeling (Designer). For more information, see Overview of Machine Learning Designer.

  • Download a small-scale sample dataset for a trial run:

    # Download a pre-tokenized sample dataset in the MMAP format (.bin and .idx files)
    wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.bin
    wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama3-datasets/wudao_llama3bpe_content_document.idx
    # Rename the files to the names required by PAI-QuickStart
    mv wudao_llama3bpe_content_document.bin dataset.bin
    mv wudao_llama3bpe_content_document.idx dataset.idx
Note

To use the data in PAI-QuickStart, the MMAP files must be named dataset.bin and dataset.idx.
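
If you prepare the files yourself, you can run a quick local check before uploading them. The following is a minimal sketch and is not part of the toolkit; data_dir is an assumed placeholder for the directory that you upload to your NAS dataset.

    import os
    import sys

    # Assumed placeholder: the local directory that holds the prepared MMAP files.
    data_dir = "./pretrain_data"

    # PAI-QuickStart requires exactly these file names.
    required = ["dataset.bin", "dataset.idx"]

    missing = [name for name in required if not os.path.isfile(os.path.join(data_dir, name))]
    if missing:
        sys.exit(f"Missing or misnamed files in {data_dir}: {missing}")

    for name in required:
        size_mib = os.path.getsize(os.path.join(data_dir, name)) / 1024**2
        print(f"{name}: {size_mib:.1f} MiB")
    print("MMAP dataset files are named as PAI-QuickStart expects.")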

Use PAI-QuickStart for continued pre-training

The following steps describe how to perform continued pre-training on Qwen2-72B by using the prepared data.

  1. Go to the Model Gallery page.

    1. Log on to the PAI console.

    2. In the top navigation bar, select a region.

    3. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace that you want to manage and click the name of the workspace.

    4. In the left-side navigation pane, choose QuickStart > Model Gallery.

  2. On the Model Gallery page, find the Qwen2-72B-Base (Megatron) model card and click the model card.

  3. On the Qwen2-72B-Base (Megatron) page, click Train in the upper right corner. In the Train panel, configure the following key parameters:

    • Output Configuration: Specify a Model Output Path. Only NAS datasets are supported. For information about how to create a dataset, see Create and manage datasets. During training, the Megatron checkpoints are saved to the checkpoint subfolder of the path.

    • Computing resources: The continued pre-training of Qwen2-72B requires 4 nodes with a total of 32 GPUs that each have at least 80 GB of memory, such as A100, A800, or H800 GPUs.

    • Hyper-parameters: The training algorithm supports the following hyperparameters. You can adjust them based on your data and resources or use the default configuration. For a worked example of how the token budget relates to these values, see the sketch after this procedure.

      Hyperparameter | Default value | Type | Description
      --- | --- | --- | ---
      job_name | qwen2-72b-cpt | string | The training job type. No change required.
      batch_size | 1 | int | The size of data processed per iteration by one GPU.
      global_batch_size | 32 | int | The total size of data processed per iteration by all GPUs. The value is batch_size multiplied by the number of GPUs.
      learning_rate | 5e-5 | float | The learning rate.
      min_learning_rate | 5e-6 | float | The minimum learning rate.
      sequence_length | 1024 | int | The length of text sequences.
      pad_length | 128 | int | The padding length of text sequences.
      save_interval | 1000 | int | The frequency at which checkpoints are saved, measured in iterations.
      train_tokens | 1638400 | int | The total number of tokens to be consumed by the training job. Each iteration consumes global_batch_size multiplied by sequence_length tokens.
      warmup_tokens | 163840 | int | The number of tokens to be consumed during the warmup phase of the training job.

  4. Click Fine-tune to start training. PAI-QuickStart redirects you to the model training page, where you can monitor the status and logs of the training job.
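
To see how the default hyperparameter values translate into training iterations, the following is a minimal sketch based only on the defaults listed in the table above; the variable names are illustrative and are not job configuration keys.

    # Default hyperparameter values from the table above
    batch_size = 1             # data processed per iteration by one GPU
    num_gpus = 32              # 4 nodes with 8 GPUs each
    global_batch_size = batch_size * num_gpus   # 32, as in the default configuration
    sequence_length = 1024
    train_tokens = 1638400
    warmup_tokens = 163840

    # Each iteration consumes global_batch_size * sequence_length tokens.
    tokens_per_iteration = global_batch_size * sequence_length   # 32768

    # Iteration counts implied by the token budgets.
    total_iterations = train_tokens // tokens_per_iteration      # 50
    warmup_iterations = warmup_tokens // tokens_per_iteration    # 5

    print(f"Tokens per iteration: {tokens_per_iteration}")
    print(f"Total iterations:     {total_iterations}")
    print(f"Warmup iterations:    {warmup_iterations}")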

(Optional) Convert checkpoints to the HuggingFace format

The pre-training of Qwen2-72B uses the Megatron Dense Checkpoint format. To convert checkpoints of this format to the HuggingFace format, see Convert Megatron-Core to Huggingface.
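
After the conversion, the checkpoint can be loaded with the Hugging Face Transformers library like any other Qwen2 model. The following is a minimal sketch: the path ./qwen2-72b-hf is an assumed placeholder for the conversion output directory, and loading a 72B model requires multiple GPUs or CPU offloading (device_map="auto" relies on the accelerate package).

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed placeholder: the directory that contains the converted HuggingFace checkpoint.
    model_path = "./qwen2-72b-hf"

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,  # adjust the precision as needed
        device_map="auto",           # shard the model across the available GPUs
    )

    # Quick smoke test of the converted checkpoint.
    inputs = tokenizer("Continued pre-training adapts a base model to", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))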