Platform for AI: Best practices for ChatLearn alignment on DLC

Last Updated: Feb 27, 2025

This topic describes how to use ChatLearn of Platform for AI (PAI) to perform alignment training for the Llama2-7B model.

Background information

ChatLearn of PAI is a flexible, easy-to-use, and efficient training framework for large-scale alignment. ChatLearn provides alignment training methods, such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), OnlineDPO, and Group Relative Policy Optimization (GRPO). ChatLearn also supports custom pipelines for models. You can build models by combining different backends. For example, you can use Megatron-LM to accelerate training and vLLM to speed up inference.

Prerequisites

Before you perform the operations that are described in this topic, make sure that the following requirements are met:

  • A workspace is created. For more information, see Create a workspace.

  • Lingjun resource quotas are prepared. In this example, a gu7xf instance type, such as ml.gu7xf.c96m1600.8-gu108, is used. For information about how to purchase Lingjun resources, see Create a resource group and purchase Lingjun resources. For information about how to create Lingjun resource quotas, see Lingjun resource quotas.

  • A general-purpose File Storage NAS (NAS) file system is created. For more information, see Create a file system.

  • A dataset is created based on a general-purpose NAS file system to store the files required for training and the generated result files. The default mount path is configured as /mnt/data/. For more information, see Create and manage datasets.

Limits

  • This best practice is supported only in the China (Ulanqab) region.

  • Lingjun resources, NAS file systems, and Deep Learning Containers (DLC) jobs must reside in the same region.

  • Lingjun resource quotas and NAS file systems must reside in the same virtual private cloud (VPC).

Make preparations

Prepare code

  1. In the terminal, run the following commands to download code:

    • Download the ChatLearn code.

      git clone https://github.com/alibaba/ChatLearn.git
    • Download the Megatron-LM code.

      Note

      If you want to perform alignment training based on the Megatron-LM framework, you must download the Megatron-LM code.

      git clone https://github.com/NVIDIA/Megatron-LM.git
      # Replace the path with the actual value. 
      cd /**/Megatron-LM
      git checkout 5161b1689
  2. Upload the downloaded code files to the root directory (/) of the general-purpose NAS file system based on the directory hierarchy. For more information, see Access a NAS file system from a data center by using a NAT gateway. You can also upload the code files to the NAS file system by using a method of your choice, as shown in the following example.
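
    The following commands are a minimal sketch of one such method. They assume an ECS instance that resides in the same VPC as the NAS file system and use a placeholder mount target domain name that you must replace with your own.

      # Mount the NAS file system over NFS (replace the mount target domain name with your own).
      sudo mkdir -p /mnt/nas
      sudo mount -t nfs -o vers=3,nolock,proto=tcp,noresvport \
          your-mount-target-domain-name:/ /mnt/nas

      # Copy the cloned code repositories to the root directory of the file system.
      cp -r ChatLearn Megatron-LM /mnt/nas/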

Prepare training data

Prepare the following data for ChatLearn alignment training and upload the data to the NAS file system. For more information about how to prepare the data, see Data. For more information about how to upload the data to the NAS file system, see Access a NAS file system from a data center by using a NAT gateway.

  • Prepare supervised fine-tuning (SFT) training data and upload the data to the /sft directory of NAS.

  • Prepare the reward training data and upload the data to the /rm directory of NAS.

  • Prepare RLHF alignment training data and upload the data to the /alignment directory of NAS.
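
After the preparations, the code and data directories on the NAS file system are expected to look like the following listing. The listing only summarizes the example paths that are used in this topic; the directory names may differ in your environment.

    # Layout of the NAS file system as seen from the default mount path /mnt/data/ in DLC jobs:
    # /mnt/data/ChatLearn/       <- ChatLearn code repository
    # /mnt/data/Megatron-LM/     <- Megatron-LM code repository
    # /mnt/data/llama-2-7b-hf/   <- pre-trained Hugging Face model (downloaded in Step 1)
    # /mnt/data/sft/             <- SFT training data
    # /mnt/data/rm/              <- reward training data
    # /mnt/data/alignment/       <- alignment training data, such as train.jsonl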

Procedure

Step 1: Perform SFT training

SFT fine-tunes a pre-trained large language model (LLM) on labeled data. In this example, you download a pre-trained model and then start SFT training. Perform the following steps:

1. Download and convert a pre-trained model

  1. Download pre-training checkpoints.

    You can use models from Hugging Face Transformers, such as Llama-2-7b-hf from the Hugging Face Hub. You can also use a model saved on your on-premises machine. In this example, llama-2-7b-hf is used.
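
    For example, one possible way to download the model is to clone it from the Hugging Face Hub by using Git LFS. The following commands are only a sketch and assume that Git LFS is installed and that you have been granted access to the Llama 2 models:

      # Download the Llama-2-7b-hf model from the Hugging Face Hub (access to the model is required).
      git lfs install
      git clone https://huggingface.co/meta-llama/Llama-2-7b-hf llama-2-7b-hf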

  2. Upload the downloaded Llama2 model to the root directory (/) of the general-purpose NAS file system based on the directory hierarchy. For more information, see Access a NAS file system from a data center by using a NAT gateway.

  3. Convert the Hugging Face Transformers model to the Megatron-LM model format.

    1. Go to the Create Job page.

      1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

      2. On the Deep Learning Containers (DLC) page, click Create Job.

    2. On the Create Job page, configure the parameters. The following table describes the parameters. For information about other parameters, see Submit training jobs.

      Parameter

      Description

      Environment Information

      Node Image

      Select Image Address and enter the following image address in the field: registry.cn-wulanchabu.aliyuncs.com/pai-dlc/pytorch-training:2.4.0-gpu-py3.10-cu12.5-ngc24.06-ubuntu22.04.

      Data Set

      Click Custom Dataset and configure the following parameters:

      • Custom Dataset: Select a general-purpose NAS-based dataset.

      • Mount Path: Use the default mount path /mnt/data/.

      Startup Command

      Configure the following commands:

      export MEGATRON=path-to-megatron-lm
      export CHATLEARN=path-to-chatlearn
      
      cd ${CHATLEARN}/examples/megatron/
      
      TP=$num_of_tp \
      PP=$num_of_pp \
      LOAD_PATH=path-to-hf-model \
      TOKENIZER_MODEL=$LOAD_PATH/tokenizer.model \
      SAVE_PATH=path-to-megatron-model \
      bash scripts/convert_hf_to_megatron.sh

      Take note of the following parameters:

      • MEGATRON: the path of the cloned Megatron-LM code repository. In this example, the /mnt/data/Megatron-LM path is used.

      • CHATLEARN: the path of the cloned ChatLearn code repository. In this example, the /mnt/data/ChatLearn path is used.

      • LOAD_PATH: the path of the Llama2 model. In this example, the /mnt/data/llama-2-7b-hf path is used.

      • SAVE_PATH: the path of the converted Megatron-LM model. In this example, the /mnt/data/llama-2-7b-hf-to-megatron path is used.

      • TP (tensor_model_parallel_size) and PP (pipeline_model_parallel_size): Configure the parameters based on the model size:

        • For Llama2-7B models, the system converts the models into checkpoints with TP set to 4 and PP set to 1.

        • For Llama2-13B models, the system converts the models into checkpoints with TP set to 8 and PP set to 1.

        • For Llama2-70B models, the system converts the models into checkpoints with TP set to 8 and PP set to 4.
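
      For example, with the example paths in this topic and a Llama2-7B model (TP set to 4 and PP set to 1), the startup command can be filled in as follows. This is only a sketch; adjust the paths to your environment.

      export MEGATRON=/mnt/data/Megatron-LM
      export CHATLEARN=/mnt/data/ChatLearn

      cd ${CHATLEARN}/examples/megatron/

      TP=4 \
      PP=1 \
      LOAD_PATH=/mnt/data/llama-2-7b-hf \
      TOKENIZER_MODEL=/mnt/data/llama-2-7b-hf/tokenizer.model \
      SAVE_PATH=/mnt/data/llama-2-7b-hf-to-megatron \
      bash scripts/convert_hf_to_megatron.sh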

      Resource Information

      Resource Type

      Select Lingjun AI Computing Service.

      Source

      Select Resource Quota.

      Resource Quota

      Select a resource quota.

      Framework

      Select PyTorch.

      Job Resource

      Configure the following parameters for worker nodes:

      • Number of Nodes: 1

      • vCPUs: 80

      • Memory (GiB): 800

      • Shared Memory (GiB): 800

      • GPUs: 1

    3. Click Confirm.

      The Deep Learning Containers (DLC) page appears. After the job is successfully executed, the model conversion is complete.

2. Start SFT training

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. On the Create Job page, configure the parameters. The following table describes the parameters. For information about other parameters, see Submit training jobs.

    Parameter

    Description

    Environment Information

    Node Image

    Select Image Address and enter the following image address in the field: registry.cn-wulanchabu.aliyuncs.com/pai-dlc/pytorch-training:2.4.0-gpu-py3.10-cu12.5-ngc24.06-ubuntu22.04.

    Data Set

    Click Custom Dataset and configure the following parameters:

    • Custom Dataset: Select a general-purpose NAS-based dataset.

    • Mount Path: Use the default mount path /mnt/data/.

    Startup Command

    Configure the following commands:

    export CHATLEARN=path-to-chatlearn
    export MEGATRON=path-to-megatron-lm
    cd ${CHATLEARN}/examples/megatron/
    
    export model_size=llama2-7B
    
    LOAD_PATH=$MEGATRON_LLAMA2_CKPT_PATH \
    TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
    DATASET_PATH=$DATASET_ROOT/sft/ \
    bash scripts/train_sft_llama.sh

    Take note of the following parameters:

    • MEGATRON: the path of the cloned Megatron-LM code repository. In this example, the /mnt/data/Megatron-LM path is used.

    • CHATLEARN: the path of the cloned ChatLearn code repository. In this example, the /mnt/data/ChatLearn path is used.

    • LOAD_PATH: the path of the converted Megatron-LM model. In this example, the /mnt/data/llama-2-7b-hf-to-megatron path is used.

    • TOKENIZER_MODEL: the path of the tokenizer.model file required by the Llama2 tokenizer. In this example, the /mnt/data/llama-2-7b-hf/tokenizer.model path is used.

    • DATASET_PATH: the path of the SFT training dataset. In this example, the /mnt/data/sft/ path is used.

    By default, the logs generated during the training and the trained model are stored in the ${CHATLEARN}/output/sft path. You can use the CHECKPOINT_PATH environment variable to configure the path of the trained model. For more information, see the train_sft_llama.sh script in the ${CHATLEARN}/examples/megatron/scripts/ path.
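
    For example, with the example paths in this topic, the startup command can be filled in as follows. This is only a sketch; adjust the paths to your environment.

    export CHATLEARN=/mnt/data/ChatLearn
    export MEGATRON=/mnt/data/Megatron-LM
    cd ${CHATLEARN}/examples/megatron/

    export model_size=llama2-7B

    LOAD_PATH=/mnt/data/llama-2-7b-hf-to-megatron \
    TOKENIZER_MODEL=/mnt/data/llama-2-7b-hf/tokenizer.model \
    DATASET_PATH=/mnt/data/sft/ \
    bash scripts/train_sft_llama.sh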

    Resource Information

    Resource Type

    Select Lingjun AI Computing Service.

    Source

    Select Resource Quota.

    Resource Quota

    Select a resource quota.

    Framework

    Select PyTorch.

    Job Resource

    Configure the following parameters for worker nodes:

    • Number of Nodes: 1

    • vCPUs: 80

    • Memory (GiB): 800

    • Shared Memory (GiB): 800

    • GPUs: 8

    The training script requires the following resource configurations. You can modify the parameters based on your business requirements.

    • llama2-7B RLHF: 8 GPUs

    • llama2-13B RLHF: 2 groups with 8 GPUs each

    • llama2-70B RLHF: 4 groups with 8 GPUs each

  3. Click Confirm.

Step 2: Train a reward model

The goal of a reward model in RLHF is to evaluate how well a model response aligns with human preferences. The reward model takes a prompt and a response as inputs and returns a scalar score that indicates the quality of the response.

Note

The DPO training method does not require reward model training.

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. On the Create Job page, configure the parameters. The following table describes the parameters. For information about other parameters, see Submit training jobs.

    Parameter

    Description

    Environment Information

    Node Image

    Select Image Address and enter the following image address in the field: registry.cn-wulanchabu.aliyuncs.com/pai-dlc/pytorch-training:2.4.0-gpu-py3.10-cu12.5-ngc24.06-ubuntu22.04.

    Data Set

    Click Custom Dataset and configure the following parameters:

    • Custom Dataset: Select a general-purpose NAS-based dataset.

    • Mount Path: Use the default mount path /mnt/data/.

    Startup Command

    Configure the following commands:

    export CHATLEARN=path-to-chatlearn
    export MEGATRON=path-to-megatron-lm
    cd ${CHATLEARN}/examples/megatron/
    
    LOAD_PATH=path-to-sft-ckpt \
    TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
    DATASET_PATH=$DATASET_ROOT/rm \
    bash scripts/train_reward_llama.sh

    Take note of the following parameters:

    • MEGATRON: the path of the cloned Megatron-LM code repository. In this example, the /mnt/data/Megatron-LM path is used.

    • CHATLEARN: the path of the cloned ChatLearn code repository. In this example, the /mnt/data/ChatLearn path is used.

    • LOAD_PATH: the checkpoint path generated by SFT. In this example, the ${CHATLEARN}/output/sft/ path is used.

    • TOKENIZER_MODEL: the path of the tokenizer.model file required by the Llama2 tokenizer. In this example, the /mnt/data/llama-2-7b-hf/tokenizer.model path is used.

    • DATASET_PATH: the path of the reward training dataset. In this example, the /mnt/data/rm path is used.

    By default, the logs generated during the training and the trained model are stored in the ${CHATLEARN}/output/reward path. You can use the CHECKPOINT_PATH environment variable to configure the path of the trained model. For more information, see the train_reward_llama.sh script in the ${CHATLEARN}/examples/megatron/scripts/ path.
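
    For example, with the example paths in this topic, the startup command can be filled in as follows. This is only a sketch; adjust the paths to your environment.

    export CHATLEARN=/mnt/data/ChatLearn
    export MEGATRON=/mnt/data/Megatron-LM
    cd ${CHATLEARN}/examples/megatron/

    LOAD_PATH=${CHATLEARN}/output/sft/ \
    TOKENIZER_MODEL=/mnt/data/llama-2-7b-hf/tokenizer.model \
    DATASET_PATH=/mnt/data/rm \
    bash scripts/train_reward_llama.sh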

    Resource Information

    Resource Type

    Select Lingjun AI Computing Service.

    Source

    Select Resource Quota.

    Resource Quota

    Select a resource quota.

    Framework

    Select PyTorch.

    Job Resource

    Configure the following parameters for worker nodes:

    • Number of Nodes: 1

    • vCPUs: 80

    • Memory (GiB): 800

    • Shared Memory (GiB): 800

    • GPUs: 8

    The training script requires the following resource configurations. You can modify the parameters based on your business requirements.

    • llama2-7B RLHF: 8 GPUs

    • llama2-13B RLHF: 2 groups with 8 GPUs each

    • llama2-70B RLHF: 4 groups with 8 GPUs each

  3. Click Confirm.

Step 3: Perform alignment training

RLHF training method

ChatLearn supports multiple alignment training methods, such as RLHF, DPO, OnlineDPO, and GRPO. In this example, the Llama2-7B model is used to describe how to perform alignment training. To perform alignment training, perform the following steps:

  1. Go to the Create Job page.

    1. Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).

    2. On the Deep Learning Containers (DLC) page, click Create Job.

  2. On the Create Job page, configure the parameters. The following table describes the parameters. For information about other parameters, see Submit training jobs.

    Parameter

    Description

    Environment Information

    Node Image

    Select Image Address and enter the following image address in the field: registry.cn-wulanchabu.aliyuncs.com/pai-dlc/pytorch-training:2.4.0-gpu-py3.10-cu12.5-ngc24.06-ubuntu22.04.

    Data Set

    Click Custom Dataset and configure the following parameters:

    • Custom Dataset: Select a general-purpose NAS-based dataset.

    • Mount Path: Use the default mount path /mnt/data/.

    Startup Command

    Configure the following commands:

    export CHATLEARN=path-to-chatlearn
    export MEGATRON=path-to-megatron-lm
    export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl
    
    cd ${CHATLEARN}/examples/megatron/
    
    export model_size=llama2-7B
    
    POLICY_LOAD=path-to-sft-ckpt \
    REWARD_LOAD=path-to-rm-ckpt \
    REWARD_LOAD_ITERATION=1000 \
    TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
    tokenizer_load=${HF_MODEL_DIR} \
    bash scripts/train_rlhf_llama.sh

    Take note of the following parameters:

    • MEGATRON: the path of the cloned Megatron-LM code repository. In this example, the /mnt/data/Megatron-LM path is used.

    • CHATLEARN: the path of the cloned ChatLearn code repository. In this example, the /mnt/data/ChatLearn path is used.

    • DATASET_PATH: the path of the alignment training dataset. In this example, the /mnt/data/alignment/train.jsonl path is used.

    • POLICY_LOAD: the checkpoint path generated by SFT. Policy models and reference models are initialized by using checkpoints from SFT. In this example, the ${CHATLEARN}/output/sft/hh_sft***/ path is used.

    • REWARD_LOAD: the checkpoint path generated by reward model training. In this example, the ${CHATLEARN}/output/reward/reward_hh*** path is used.

    • REWARD_LOAD_ITERATION: the iteration number of the reward model checkpoint to load. The reward model and the value model are initialized by using the weights of the trained reward model.

    • TOKENIZER_MODEL: the path of the tokenizer.model file required by the Llama2 tokenizer. In this example, the /mnt/data/llama-2-7b-hf/tokenizer.model path is used.

    • tokenizer_load: the path of the Hugging Face model that you downloaded. In this example, the /mnt/data/llama-2-7b-hf path is used.

    • model_size: In this example, a Llama2-7B model is used. To train the llama2-13B or llama2-70B model, you only need to replace export model_size=llama2-7B in the preceding training script with export model_size=llama2-13B or export model_size=llama2-70B.

    The system saves the trained model to the ${CHATLEARN}/output/**-rlhf path.
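
    For example, with the example paths in this topic, the startup command can be filled in as follows. This is only a sketch; replace the wildcard checkpoint directory names with the actual directory names generated by the SFT and reward training jobs, and adjust the other paths to your environment.

    export CHATLEARN=/mnt/data/ChatLearn
    export MEGATRON=/mnt/data/Megatron-LM
    export DATASET_PATH=/mnt/data/alignment/train.jsonl

    cd ${CHATLEARN}/examples/megatron/

    export model_size=llama2-7B

    POLICY_LOAD=${CHATLEARN}/output/sft/hh_sft***/ \
    REWARD_LOAD=${CHATLEARN}/output/reward/reward_hh*** \
    REWARD_LOAD_ITERATION=1000 \
    TOKENIZER_MODEL=/mnt/data/llama-2-7b-hf/tokenizer.model \
    tokenizer_load=/mnt/data/llama-2-7b-hf \
    bash scripts/train_rlhf_llama.sh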

    Resource Information

    Resource Type

    Select Lingjun AI Computing Service.

    Source

    Select Resource Quota.

    Resource Quota

    Select a resource quota.

    Framework

    Select PyTorch.

    Job Resource

    Configure the following parameters for worker nodes:

    • Number of Nodes: 1

    • vCPUs: 80

    • Memory (GiB): 800

    • Shared Memory (GiB): 800

    • GPUs: 8

    The training script requires the following resource configurations. You can modify the parameters based on your business requirements.

    • llama2-7B RLHF: 8 GPUs

    • llama2-13B RLHF: 2 groups with 8 GPUs each

    • llama2-70B RLHF: 4 groups with 8 GPUs each

  3. After you configure the parameters, click Confirm.

    On the Deep Learning Containers (DLC) page, you can click the name of a job to go to the job details page to view the job status.

Other alignment training methods

The procedure for other alignment training methods is the same as that for the RLHF training method. You only need to replace the startup command with the startup command of the required method when you create a training job. The following section describes the startup commands of other alignment training methods:

  • OnlineDPO/GRPO

    The OnlineDPO or GRPO training process is similar to the RLHF training process but does not require value models. The following sample code provides the training script of a Llama2-7B policy model and a Llama2-7B reward model.

    export CHATLEARN=path-to-chatlearn
    export MEGATRON=path-to-megatron-lm
    export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl
    
    cd ${CHATLEARN}/examples/megatron/
    
    export model_size=llama2-7B
    
    POLICY_LOAD=path-to-sft-ckpt \
    REWARD_LOAD=path-to-rm-ckpt \
    REWARD_LOAD_ITERATION=1000 \
    TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
    tokenizer_load=${HF_MODEL_DIR} \
    bash scripts/train_online_dpo_llama.sh

    You must set the DATASET_PATH environment variable to the path of the training dataset supported by the training method. For more information about how to create a dataset, see Data. For information about other parameters, see parameter configurations in RLHF training method.

  • DPO

    The following sample code provides the training script of a Llama2-7B policy model:

    export CHATLEARN=path-to-chatlearn
    export MEGATRON=path-to-megatron-lm
    export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl
    
    cd ${CHATLEARN}/examples/megatron/
    
    export model_size=llama2-7B
    
    POLICY_LOAD=path-to-sft-ckpt \
    TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
    bash scripts/train_dpo_llama.sh

    You must set the DATASET_PATH environment variable to the path of the training dataset supported by the training method. For more information about how to create a dataset, see Data. For information about other parameters, see parameter configurations in RLHF training method.

  • GRPO Math

    The following sample code provides the training script of a Llama2-7B policy model and a Llama2-7B reward model for GRPO Math training:

    export CHATLEARN=path-to-chatlearn
    export MEGATRON=path-to-megatron-lm
    export DATASET_PATH=$DATASET_ROOT/math/train.jsonl
    
    cd ${CHATLEARN}/examples/megatron/
    
    export model_size=llama2-7B
    
    POLICY_LOAD=path-to-sft-ckpt \
    REWARD_LOAD=path-to-rm-ckpt \
    REWARD_LOAD_ITERATION=1000 \
    TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
    tokenizer_load=${HF_MODEL_DIR} \
    bash scripts/train_grpo_math_llama.sh

    You must set the DATASET_PATH environment variable to the path of the training dataset supported by the training method. For more information about how to create a dataset, see Data. For information about other parameters, see parameter configurations in RLHF training method.