This topic describes how to use ChatLearn of Platform for AI (PAI) to perform alignment training for the Llama2-7B model.
Background information
ChatLearn of PAI is a flexible, easy-to-use, and efficient training framework for large-scale alignment. ChatLearn provides alignment training methods, such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), OnlineDPO, and Group Relative Policy Optimization (GRPO). ChatLearn also supports custom pipelines for models. You can build models by combining different backends. For example, you can use Megatron-LM to accelerate training and vLLM to speed up inference.
Prerequisites
Before you perform the operations that are described in this topic, make sure that the following requirements are met:
A workspace is created. For more information, see Create a workspace.
Lingjun resource quotas are prepared. In this example, a gu7xf instance type, such as ml.gu7xf.c96m1600.8-gu108, is used. For information about how to purchase Lingjun resources, see Create a resource group and purchase Lingjun resources. For information about how to create Lingjun resource quotas, see Lingjun resource quotas.
A general-purpose File Storage NAS (NAS) file system is created. For more information, see Create a file system.
A dataset is created based on a general-purpose NAS file system to store the files and result files required for training. The default mount path is configured as /mnt/data/. For more information, see Create and manage datasets.
Limits
The best practice is supported only in the China (Ulanqab) region.
Lingjun resources, NAS file systems, and Deep Learning Containers (DLC) jobs must reside in the same region.
Lingjun resource quotas and NAS file systems must reside in the same virtual private cloud (VPC).
Make preparations
Prepare code
In the terminal, run the following commands to download code:
Download the ChatLearn code.
git clone https://github.com/alibaba/ChatLearn.git
Download the Megatron-LM code.
Note: If you want to perform alignment training based on the Megatron-LM framework, you must download the Megatron-LM code.
git clone https://github.com/NVIDIA/Megatron-LM.git
# Replace the path with the actual value.
cd /**/Megatron-LM
git checkout 5161b1689
Upload the downloaded code files to the root directory (/) of the general-purpose NAS file system while preserving the directory hierarchy. For more information, see Access a NAS file system from a data center by using a NAT gateway. You can also upload the code files to the NAS file system by using a method of your choice.
Prepare training data
Prepare the following data for ChatLearn alignment training and upload the data to the NAS file system. For more information about how to prepare the data, see Data. For more information about how to upload the data, see Access a NAS file system from a data center by using a NAT gateway.
Prepare supervised fine-tuning (SFT) training data and upload the data to the /sft directory of the NAS file system.
Prepare reward model training data and upload the data to the /rm directory of the NAS file system.
Prepare RLHF alignment training data and upload the data to the /alignment directory of the NAS file system.
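For example, assuming the dataset is mounted at the default path /mnt/data/, the uploaded training data is expected to appear in the training environment under the following directories. This layout is a sketch based on the paths used later in this topic; the file names are illustrative.
/mnt/data/sft/                    # SFT training data
/mnt/data/rm/                     # reward model training data
/mnt/data/alignment/train.jsonl   # alignment training data used in Step 3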
Procedure
Step 1: Perform SFT training
SFT uses labeled data to fine-tune a pre-trained large language model (LLM). In this example, you download a pre-trained model and then start SFT training. Perform the following steps:
1. Download and convert a pre-trained model
Download pre-training checkpoints.
You can use a model in the Hugging Face Transformers format, such as Llama-2-7b-hf from the Hugging Face Hub. You can also use an SFT model saved on your on-premises machine. In this example, llama-2-7b-hf is used.
Upload the downloaded Llama2 model to the root directory (/) of the general-purpose NAS file system while preserving the directory hierarchy. For more information, see Access a NAS file system from a data center by using a NAT gateway.
Convert the Hugging Face Transformers model to the Megatron-LM model format.
Go to the Create Job page.
Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).
On the Deep Learning Containers (DLC) page, click Create Job.
On the Create Job page, configure the parameters. The following table describes the parameters. For information about other parameters, see Submit training jobs.
Parameter
Description
Environment Information
Node Image
Select Image Address and enter the following image address in the field:
registry.cn-wulanchabu.aliyuncs.com/pai-dlc/pytorch-training:2.4.0-gpu-py3.10-cu12.5-ngc24.06-ubuntu22.04
Data Set
Click Custom Dataset and configure the following parameters:
Custom Dataset: Select a general-purpose NAS-based dataset.
Mount Path: Use the default mount path /mnt/data/.
Startup Command
Configure the following commands:
export MEGATRON=path-to-megatron-lm
export CHATLEARN=path-to-chatlearn
cd ${CHATLEARN}/examples/megatron/
TP=$num_of_tp \
PP=$num_of_pp \
LOAD_PATH=path-to-hf-model \
TOKENIZER_MODEL=$LOAD_PATH/tokenizer.model \
SAVE_PATH=path-to-megatron-model \
bash scripts/convert_hf_to_megatron.sh
Take note of the following parameters:
MEGATRON: the path of the cloned Megatron-LM code repository. In this example, the /mnt/data/Megatron-LM path is used.
CHATLEARN: the path of the cloned ChatLearn code repository. In this example, the /mnt/data/ChatLearn path is used.
LOAD_PATH: the path of the Llama2 model. In this example, the /mnt/data/llama-2-7b-hf path is used.
SAVE_PATH: the path of the converted Megatron-LM model. In this example, the /mnt/data/llama-2-7b-hf-to-megatron path is used.
TP (tensor_model_parallel_size) and PP (pipeline_model_parallel_size): configure the parameters based on the model size:
For Llama2-7B models, the system converts the models into checkpoints with TP set to 4 and PP set to 1.
For Llama2-13B models, the system converts the models into checkpoints with TP set to 8 and PP set to 1.
For Llama2-70B models, the system converts the models into checkpoints with TP set to 8 and PP set to 4.
Resource Information
Resource Type
Select Lingjun AI Computing Service.
Source
Select Resource Quota.
Resource Quota
Select a resource quota.
Framework
Select PyTorch.
Job Resource
Configure the following parameters for worker nodes:
Number of Nodes: 1
vCPUs: 80
Memory (GiB): 800
Shared Memory (GiB): 800
GPUs: 1
Click Confirm.
The Deep Learning Containers (DLC) page appears. After the job is successfully executed, the model conversion is complete.
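For reference, the following is a minimal sketch of the conversion command from the preceding step with the example paths in this topic substituted. It assumes the Llama2-7B model, for which TP is set to 4 and PP is set to 1. Adjust the paths to match your environment.
export MEGATRON=/mnt/data/Megatron-LM
export CHATLEARN=/mnt/data/ChatLearn
cd ${CHATLEARN}/examples/megatron/
# Llama2-7B is converted with TP=4 and PP=1.
TP=4 \
PP=1 \
LOAD_PATH=/mnt/data/llama-2-7b-hf \
TOKENIZER_MODEL=$LOAD_PATH/tokenizer.model \
SAVE_PATH=/mnt/data/llama-2-7b-hf-to-megatron \
bash scripts/convert_hf_to_megatron.sh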
2. Enable SFT training
Go to the Create Job page.
Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).
On the Deep Learning Containers (DLC) page, click Create Job.
On the Create Job page, configure the parameters. The following table describes the parameters. For information about other parameters, see Submit training jobs.
Parameter
Description
Environment Information
Node Image
Select Image Address and enter the following image address in the field:
registry.cn-wulanchabu.aliyuncs.com/pai-dlc/pytorch-training:2.4.0-gpu-py3.10-cu12.5-ngc24.06-ubuntu22.04
Data Set
Click Custom Dataset and configure the following parameters:
Custom Dataset: Select a general-purpose NAS-based dataset.
Mount Path: Use the default mount path /mnt/data/.
Startup Command
Configure the following commands:
export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
cd ${CHATLEARN}/examples/megatron/
export model_size=llama2-7B
LOAD_PATH=$MEGATRON_LLAMA2_CKPT_PATH \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
DATASET_PATH=$DATASET_ROOT/sft/ \
bash scripts/train_sft_llama.sh
Take note of the following parameters:
MEGATRON: the path of the cloned Megatron-LM code repository. In this example, the /mnt/data/Megatron-LM path is used.
CHATLEARN: the path of the cloned ChatLearn code repository. In this example, the /mnt/data/ChatLearn path is used.
LOAD_PATH: the path of the converted Megatron-LM model. In this example, the /mnt/data/llama-2-7b-hf-to-megatron path is used.
TOKENIZER_MODEL: the path of the tokenizer.model file required by the Llama2 tokenizer. In this example, the /mnt/data/llama-2-7b-hf/tokenizer.model path is used.
DATASET_PATH: the path of the SFT training dataset. In this example, the /mnt/data/sft/ path is used.
By default, the logs generated during training and the trained model are stored in the ${CHATLEARN}/output/sft path. You can use the CHECKPOINT_PATH environment variable to configure the path of the trained model. For more information, see the train_sft_llama.sh script in the ${CHATLEARN}/examples/megatron/scripts/ path.
Resource Information
Resource Type
Select Lingjun AI Computing Service.
Source
Select Resource Quota.
Resource Quota
Select a resource quota.
Framework
Select PyTorch.
Job Resource
Configure the following parameters for worker nodes:
Number of Nodes: 1
vCPUs: 80
Memory (GiB): 800
Shared Memory (GiB): 800
GPUs: 8
The training script requires the following resource configurations. You can modify the parameters based on your business requirements.
llama2-7B RLHF: 8 GPUs
llama2-13B RLHF: 2 groups with 8 GPUs each
llama2-70B RLHF: 4 groups with 8 GPUs each
Click Confirm.
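For reference, the following is a minimal sketch of the SFT startup command from the preceding table with the example paths in this topic substituted. Adjust the paths to match your environment.
export CHATLEARN=/mnt/data/ChatLearn
export MEGATRON=/mnt/data/Megatron-LM
cd ${CHATLEARN}/examples/megatron/
export model_size=llama2-7B
# Load the converted Megatron-LM checkpoint and the SFT dataset prepared earlier.
LOAD_PATH=/mnt/data/llama-2-7b-hf-to-megatron \
TOKENIZER_MODEL=/mnt/data/llama-2-7b-hf/tokenizer.model \
DATASET_PATH=/mnt/data/sft/ \
bash scripts/train_sft_llama.sh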
Step 2: Train a reward model
In RLHF, a reward model evaluates how well a model response aligns with human preferences. The reward model takes a prompt and a response as inputs and returns a scalar score that indicates the quality of the response.
The DPO training method does not require reward model training.
Go to the Create Job page.
Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).
On the Deep Learning Containers (DLC) page, click Create Job.
On the Create Job page, configure the parameters. The following table describes the parameters. For information about other parameters, see Submit training jobs.
Parameter
Description
Environment Information
Node Image
Select Image Address and enter the following image address in the field:
registry.cn-wulanchabu.aliyuncs.com/pai-dlc/pytorch-training:2.4.0-gpu-py3.10-cu12.5-ngc24.06-ubuntu22.04
Data Set
Click Custom Dataset and configure the following parameters:
Custom Dataset: Select a general-purpose NAS-based dataset.
Mount Path: Use the default mount path /mnt/data/.
Startup Command
Configure the following commands:
export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
cd ${CHATLEARN}/examples/megatron/
LOAD_PATH=path-to-sft-ckpt \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
DATASET_PATH=$DATASET_ROOT/rm \
bash scripts/train_reward_llama.sh
Take note of the following parameters:
MEGATRON: the path of the cloned Megatron-LM code repository. In this example, the /mnt/data/Megatron-LM path is used.
CHATLEARN: the path of the cloned ChatLearn code repository. In this example, the /mnt/data/ChatLearn path is used.
LOAD_PATH: the checkpoint path generated by SFT training. In this example, the ${CHATLEARN}/output/sft/ path is used.
TOKENIZER_MODEL: the path of the tokenizer.model file required by the Llama2 tokenizer. In this example, the /mnt/data/llama-2-7b-hf/tokenizer.model path is used.
DATASET_PATH: the path of the reward training dataset. In this example, the /mnt/data/rm path is used.
By default, the logs generated during training and the trained model are stored in the ${CHATLEARN}/output/reward path. You can use the CHECKPOINT_PATH environment variable to configure the path of the trained model. For more information, see the train_reward_llama.sh script in the ${CHATLEARN}/examples/megatron/scripts/ path.
Resource Information
Resource Type
Select Lingjun AI Computing Service.
Source
Select Resource Quota.
Resource Quota
Select a resource quota.
Framework
Select PyTorch.
Job Resource
Configure the following parameters for worker nodes:
Number of Nodes: 1
vCPUs: 80
Memory (GiB): 800
Shared Memory (GiB): 800
GPUs: 8
The training script requires the following resource configurations. You can modify the parameters based on your business requirements.
llama2-7B RLHF: 8 GPUs
llama2-13B RLHF: 2 groups with 8 GPUs each
llama2-70B RLHF: 4 groups with 8 GPUs each
Click Confirm.
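For reference, the following is a minimal sketch of the reward model training command from the preceding table with the example paths in this topic substituted. Adjust the paths to match your environment.
export CHATLEARN=/mnt/data/ChatLearn
export MEGATRON=/mnt/data/Megatron-LM
cd ${CHATLEARN}/examples/megatron/
# Initialize the reward model from the checkpoint generated by SFT training.
LOAD_PATH=${CHATLEARN}/output/sft/ \
TOKENIZER_MODEL=/mnt/data/llama-2-7b-hf/tokenizer.model \
DATASET_PATH=/mnt/data/rm \
bash scripts/train_reward_llama.sh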
Step 3: Perform alignment training
RLHF training method
ChatLearn supports multiple alignment training methods, such as RLHF, DPO, OnlineDPO, and GRPO. In this example, the Llama2-7B model is used to show how to perform RLHF alignment training. Perform the following steps:
Go to the Create Job page.
Log on to the PAI console. Select a region and a workspace. Then, click Enter Deep Learning Containers (DLC).
On the Deep Learning Containers (DLC) page, click Create Job.
On the Create Job page, configure the parameters. The following table describes the parameters. For information about other parameters, see Submit training jobs.
Parameter
Description
Environment Information
Node Image
Select Image Address and enter the following image address in the field:
registry.cn-wulanchabu.aliyuncs.com/pai-dlc/pytorch-training:2.4.0-gpu-py3.10-cu12.5-ngc24.06-ubuntu22.04
Data Set
Click Custom Dataset and configure the following parameters:
Custom Dataset: Select a general-purpose NAS-based dataset.
Mount Path: Use the default mount path /mnt/data/.
Startup Command
Configure the following commands:
export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl
cd ${CHATLEARN}/examples/megatron/
export model_size=llama2-7B
POLICY_LOAD=path-to-sft-ckpt \
REWARD_LOAD=path-to-rm-ckpt \
REWARD_LOAD_ITERATION=1000 \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
tokenizer_load=${HF_MODEL_DIR} bash scripts/train_rlhf_llama.sh
Take note of the following parameters:
MEGATRON: the path of the cloned Megatron-LM code repository. In this example, the /mnt/data/Megatron-LM path is used.
CHATLEARN: the path of the cloned ChatLearn code repository. In this example, the /mnt/data/ChatLearn path is used.
DATASET_PATH: the path of the alignment training dataset. In this example, the /mnt/data/alignment/train.jsonl path is used.
POLICY_LOAD: the checkpoint path generated by SFT. The policy model and the reference model are initialized from the SFT checkpoints. In this example, the ${CHATLEARN}/output/sft/hh_sft***/ path is used.
REWARD_LOAD: the checkpoint path generated by reward model training. In this example, the ${CHATLEARN}/output/reward/reward_hh*** path is used.
REWARD_LOAD_ITERATION: the iteration from which the reward checkpoint is loaded. The reward model and the value model are initialized from the weights of the reward model.
TOKENIZER_MODEL: the path of the tokenizer.model file required by the Llama2 tokenizer. In this example, the /mnt/data/llama-2-7b-hf/tokenizer.model path is used.
tokenizer_load: the path of the downloaded Hugging Face model. In this example, the /mnt/data/llama-2-7b-hf path is used.
model_size: In this example, a Llama2-7B model is used. To train a llama2-13B or llama2-70B model, replace export model_size=llama2-7B in the preceding training script with export model_size=llama2-13B or export model_size=llama2-70B.
The system saves the trained model to the ${CHATLEARN}/output/**-rlhf path.
Resource Information
Resource Type
Select Lingjun AI Computing Service.
Source
Select Resource Quota.
Resource Quota
Select a resource quota.
Framework
Select PyTorch.
Job Resource
Configure the following parameters for worker nodes:
Number of Nodes: 1
vCPUs: 80
Memory (GiB): 800
Shared Memory (GiB): 800
GPUs: 8
The training script requires the following resource configurations. You can modify the parameters based on your business requirements.
llama2-7B RLHF: 8 GPUs
llama2-13B RLHF: 2 groups with 8 GPUs each
llama2-70B RLHF: 4 groups with 8 GPUs each
After you configure the parameters, click Confirm.
On the Deep Learning Containers (DLC) page, you can click the name of a job to go to the job details page to view the job status.
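For reference, the following is a minimal sketch of the RLHF startup command from the preceding table with the example paths in this topic substituted. The hh_sft*** and reward_hh*** segments are placeholders from the parameter descriptions; replace them with the actual checkpoint directory names in your output path.
export CHATLEARN=/mnt/data/ChatLearn
export MEGATRON=/mnt/data/Megatron-LM
export DATASET_PATH=/mnt/data/alignment/train.jsonl
cd ${CHATLEARN}/examples/megatron/
export model_size=llama2-7B
# Replace hh_sft*** and reward_hh*** with the actual checkpoint directory names.
POLICY_LOAD=${CHATLEARN}/output/sft/hh_sft***/ \
REWARD_LOAD=${CHATLEARN}/output/reward/reward_hh*** \
REWARD_LOAD_ITERATION=1000 \
TOKENIZER_MODEL=/mnt/data/llama-2-7b-hf/tokenizer.model \
tokenizer_load=/mnt/data/llama-2-7b-hf bash scripts/train_rlhf_llama.sh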
Other alignment training methods
The procedures for other alignment training methods are the same as the RLHF procedure. You only need to replace the startup command when you create the training job. The following sections describe the startup commands of the other alignment training methods:
OnlineDPO/GRPO
The OnlineDPO or GRPO training process is similar to the RLHF training process but does not require value models. The following sample code provides the training script of a Llama2-7B policy model and a Llama2-7B reward model.
export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl
cd ${CHATLEARN}/examples/megatron/
export model_size=llama2-7B
POLICY_LOAD=path-to-sft-ckpt \
REWARD_LOAD=path-to-rm-ckpt \
REWARD_LOAD_ITERATION=1000 \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
tokenizer_load=${HF_MODEL_DIR} bash scripts/train_online_dpo_llama.sh
You must set the DATASET_PATH environment variable to the path of the training dataset supported by the training method. For more information about how to create a dataset, see Data. For information about other parameters, see parameter configurations in RLHF training method.
DPO
The following sample code provides the training script of a Llama2-7B policy model:
export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl
cd ${CHATLEARN}/examples/megatron/
export model_size=llama2-7B
POLICY_LOAD=path-to-sft-ckpt \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
bash scripts/train_dpo_llama.sh
You must set the DATASET_PATH environment variable to the path of the training dataset supported by the training method. For more information about how to create a dataset, see Data. For information about other parameters, see parameter configurations in RLHF training method.
GRPO Math
export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
export DATASET_PATH=$DATASET_ROOT/math/train.jsonl
cd ${CHATLEARN}/examples/megatron/
export model_size=llama2-7B
POLICY_LOAD=path-to-sft-ckpt \
REWARD_LOAD=path-to-rm-ckpt \
REWARD_LOAD_ITERATION=1000 \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
tokenizer_load=${HF_MODEL_DIR} bash scripts/train_grpo_math_llama.sh
You must set the DATASET_PATH environment variable to the path of the training dataset supported by the training method. For more information about how to create a dataset, see Data. For information about other parameters, see parameter configurations in RLHF training method.