
Platform for AI: PAI-ChatLearn: Best practices for Qwen3 reinforcement learning

Last Updated: Jan 30, 2026

This topic uses the Qwen3 model as an example to describe how to use the PAI-ChatLearn training framework and Lingjun resources in Platform for AI (PAI) to perform efficient, distributed reinforcement learning for a large language model (LLM), and then deploy the trained model.

1. Preparations

1.1 Prepare the development environment

Before you begin, make sure you have completed the following tasks:

  1. Activate PAI and create a default workspace.

  2. Purchase Lingjun resources and create a resource quota. You need two machines with the node specification ml.gx8xf.8xlarge-gu108. For more information about the node specifications of Lingjun resources, see Billing of AI computing resources.

  3. Create a dataset to store the required training files and result files.

    • Dataset Type: Select Basic.

    • Storage Type: Select a file storage class. This topic uses a General-purpose NAS file system. If you do not have a NAS file system, see Create a file system.

      Note

      If your training task requires high read/write speeds and performance, use File Storage (CPFS).

    • Default Mount Path: Use the default value /mnt/data/.

  4. Create a Data Science Workshop (DSW) instance with the following key parameter settings.

    • Resource Type: Select Resource Quota.

    • Resource Quota: Select the resource quota for the Lingjun resources that you created.

    • Instance Type: Configure the following resource specifications.

      • GPUs: At least 8.

      • vCPUs: 90.

      • Memory (GiB): 1024.

      • Shared Memory (GiB): 1024.

    • Image Information: Select Image Address and set the image to dsw-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.5.1-vllm0.6.6-ubuntu22.04-cuda12.6-py310. You must change the region in the image address to match your current region. For example, if you start a DSW instance in Shanghai, change the region in the image address to cn-shanghai.

    • Dataset Mounting: Click Custom Dataset, select the dataset that you created, and use the default mount path.

  5. If you use a Resource Access Management (RAM) user to perform the following operations, you must grant the RAM user the permissions for DSW, Deep Learning Containers (DLC), or EAS. For more information, see Product dependencies and permissions: DSW, Product dependencies and permissions: DLC, or Product dependencies and permissions: EAS.

1.2 Download the code repository

  1. Enter the PAI-DSW development environment.

    1. Log on to the PAI console. In the upper-left corner of the page, select a region. China (Ulanqab) is recommended.

    2. In the left navigation pane, click Workspace List. On the Workspace List page, click the name of the desired workspace to open it.

    3. In the navigation pane on the left, choose Model Training > Data Science Workshop (DSW). Find the target instance and click Open in the Actions column.

  2. In the top menu bar, click Terminal. On the new tab, click Create Terminal.

  3. Download the ChatLearn code repository.

    git clone https://github.com/alibaba/ChatLearn.git && cd ChatLearn && git checkout 4ad5912306df5d4a814dc2dd5567fcb26f5d473b

1.3 Prepare the Qwen3 model

Download the Qwen3 model weights from ModelScope.

modelscope download --model Qwen/Qwen3-8B --local_dir Qwen3-8B

1.4 Prepare the training dataset

This example uses the MATH-lighteval dataset to demonstrate the ChatLearn reinforcement learning workflow.

  • This is a mathematical reasoning dataset that uses fixed rules to calculate reward scores.

  • To perform reinforcement learning training on a custom task, you can implement a custom reward scoring function based on the examples/fsdp/models/rule_reward.py file in the ChatLearn code repository.

# Download the dataset
mkdir -p dataset
modelscope download --dataset AI-ModelScope/MATH-lighteval --local_dir dataset/MATH-lighteval
# Pre-process the dataset
python examples/fsdp/data/data_preprocess/math_lighteval.py --input_dir dataset/MATH-lighteval --local_dir dataset/MATH-lighteval
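For reference, a rule-based reward for a math task like this one can be sketched as follows. This is a simplified, hypothetical illustration of the idea, not the actual implementation in examples/fsdp/models/rule_reward.py; the function names and the exact-match scoring rule are assumptions.

```python
import re

def extract_boxed(text):
    # Return the contents of the last \boxed{...} in a response, if any.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def rule_reward(response, ground_truth):
    # Hypothetical rule-based scoring: 1.0 for an exact match with the
    # reference answer, 0.0 otherwise. Real scorers typically normalize
    # the answer string (strip formatting, compare numerically) first.
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

print(rule_reward("The final answer is \\boxed{159}.", "159"))  # 1.0
print(rule_reward("I am not sure.", "159"))                     # 0.0
```

A custom task would plug a function like this into the reward model so that each generated response receives a scalar score for GRPO training.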

2. Reinforcement learning training

Note

First, develop and debug in the DSW environment. Then, submit a multi-node, multi-GPU distributed training task in the DLC environment.

This example uses FSDP as the training engine. To use Megatron to accelerate training, see tutorial_grpo_mcore.

2.1 Single-node training in DSW

Continue to run the following command in the DSW environment to start training. The trained model is saved to the mounted dataset for later deployment.

bash examples/fsdp/scripts/train_grpo_qwen3.sh

Note

With the default parameters of train_grpo_qwen3.sh, training is expected to take 2 to 3 hours.

2.2 Multi-node training in DLC

After you develop and debug on a single node, you can configure a multi-node, multi-GPU distributed task in the DLC environment to accelerate model training. The procedure is as follows:

  1. Go to the Create Task page.

    1. Log on to the PAI console. At the top of the page, select the destination region and the target workspace, and then click Enter DLC.

    2. On the Deep Learning Containers (DLC) page, click Create Task.

  2. On the Create Task page, configure the following key parameters. Use the default values for other parameters. For more information, see Create a training task.

    Basic Information

    • Job Name: Enter a custom task name. This example uses test_qwen3_dlc.

    Environment Information

    • Image Information: Select Image Address and enter dsw-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.5.1-vllm0.6.6-ubuntu22.04-cuda12.6-py310 in the text box. You must change the region in the image address to match your current region.

    • Dataset Mounting: Click Custom Dataset, select the dataset that you created, and use the default mount path /mnt/data/.

    • Startup Command: Configure the following command. The startup parameters of train_grpo_qwen3.sh are the same as those used for single-node training in DSW.

      cd /mnt/data/ChatLearn && bash examples/fsdp/scripts/train_grpo_qwen3.sh

    Resource Information

    • Resource Type: Select Lingjun Intelligent Computing.

    • Source: Select Resource Quota.

    • Resource Quota: Select the resource quota for the Lingjun resources that you created.

    • Framework: Select PyTorch.

    • Job Resource: On the Worker node configuration tab, configure the following parameters:

      • Quantity: 2. For multi-node training, set this to the required number of machines.

      • GPUs: 8

      • vCPUs: 90

      • Memory (GiB): 1024

      • Shared Memory (GiB): 1024

  3. Click OK. You are redirected to the Deep Learning Containers (DLC) page. You can click the task name to view the task execution status on the Task Details page. When the Status changes to Succeeded, the training task is complete.

    Note

    If the DLC job fails with the error ray.exceptions.RpcError: Timed out while waiting for GCS to become available., the error typically occurs while the Ray cluster shuts down after training has already finished. Check the output directory; if the model was saved, you can still deploy the service using the saved model.

2.3 Main parameter descriptions

The following list describes the main parameters to configure in train_grpo_qwen3.sh.

  • model_path: Path of the model weights.

  • output_dir: Output path for logs, saved models, and data.

  • train_data_path: Path of the training dataset.

  • eval_data_path: Path of the evaluation dataset.

  • sp_size: Ulysses sequence parallelism size, used for long-context model training.

  • tensor_model_parallel_size: Tensor parallelism size for the vLLM inference service.

  • gpu_memory_utilization: Fraction of GPU memory pre-allocated by the vLLM inference service.

  • seq_length: Maximum sequence length (prompt length + generation length).

  • max_new_tokens: Maximum generation length.

  • num_inference_per_prompt: Number of responses generated for each prompt.

  • sample_per_episode: Number of samples trained in each episode, equal to the number of prompts × num_inference_per_prompt.

  • train_micro_batch_size: Micro-batch size for the forward pass during training.

  • enable_eval_before_training: Specifies whether to run evaluation before training.

  • num_episode: Number of reinforcement learning training episodes.

  • eval_episode_interval: Number of episodes between evaluations.

  • save_episode_interval: Number of episodes between model checkpoints.
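To make the relationships among these parameters concrete, the following is a small arithmetic sketch with hypothetical values. The numbers and the prompts_per_episode name are illustrative, not the script defaults:

```python
# Hypothetical values, not the defaults in train_grpo_qwen3.sh.
num_inference_per_prompt = 8   # responses sampled per prompt
prompts_per_episode = 64       # prompts drawn in one episode (assumed name)

# sample_per_episode = number of prompts x num_inference_per_prompt
sample_per_episode = prompts_per_episode * num_inference_per_prompt

max_new_tokens = 2048          # maximum generation length
prompt_length_budget = 2048    # room left for the prompt (assumed)

# seq_length must cover the prompt plus the generated tokens.
seq_length = prompt_length_budget + max_new_tokens

print(sample_per_episode)  # 512
print(seq_length)          # 4096
```

In practice this means that increasing num_inference_per_prompt raises the per-episode sample count proportionally, and seq_length must always be at least max_new_tokens plus the longest expected prompt.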

3. Deploy and invoke the model

After model training is complete, you can deploy the model as an online service and invoke it in a production environment.

3.1 Deploy the model service

  1. Log on to the PAI console. At the top of the page, select the destination region and the target workspace, and then click Enter EAS.

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. On the Custom Deployment page, configure the following key parameters. Use the default values for other parameters.

    Basic Information

    • Service Name: Enter a custom name for the model service. The name must be unique within the same region. This example uses test_qwen3.

    Environment Information

    • Deployment Method: This example uses Image-based Deployment.

    • Image Configuration: Select Image Address and enter eas-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-eas/vllm:v0.8.5.post1 in the text box. You must change the region in the image address to match your current region.

    • Mount Storage: Select General-purpose NAS and configure the following parameters:

      • Select File System: Select the NAS file system that you used to create the dataset.

      • File System Mount Target: Select the mount target that you used to create the dataset.

      • File System Path: Set this to the path of the Hugging Face format model stored in NAS. This example uses /ChatLearn/output/qwen3-grpo/save_model/policy_trainer/20/huggingface/.

      • Mount Path: Specify the mount path inside the container. This example uses /qwen3_rlhf.

    • Command: Set the command to vllm serve /qwen3_rlhf --host 0.0.0.0 --port 8000 --max-model-len 8192.

      Note: If you deploy on a V100 instance, set the command to vllm serve /qwen3_rlhf --host 0.0.0.0 --port 8000 --max-model-len 8192 --dtype=half.

    • Port Number: Set this to 8000.

    Resource Information

    • Resource Type: This example uses Public Resources.

    • Number of Replicas: Configure this parameter based on the model and selected resources. For an 8B model, set this to 1.

    • Deployment: For the resource specification, select A10 or V100. This example uses ecs.gn7i-c32g1.8xlarge.

    Network Information

    • VPC, vSwitch, and Security Group: After you configure the NAS mount target, the system automatically matches the VPC and vSwitch of the NAS file system. Set the security group as needed.

  4. Click Deploy. The service deployment takes about 6 minutes. When the Service Status changes to Running, the service deployment is complete.

3.2 Invoke the service

  1. Obtain the service endpoint and token. On the Inference Service tab, find the target service and navigate to the Overview page. In the Basic Information section, click View Endpoint Information.

  2. Use the following code to invoke the service. Replace <YOUR EAS URL> with the endpoint obtained in Step 1. We recommend that you set the token as an environment variable.

    import os
    from openai import OpenAI
    
    # Set the Token as an environment variable.
    openai_api_key = os.environ.get("Token")
    # Replace <YOUR EAS URL> with the service endpoint.
    openai_api_base = "<YOUR EAS URL>/v1"
    
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    
    chat_response = client.chat.completions.create(
        model="/qwen3_rlhf",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Find the smallest positive integer solution to $\\tan{19x^{\\circ}}=\\dfrac{\\cos{96^{\\circ}}+\\sin{96^{\\circ}}}{\\cos{96^{\\circ}}-\\sin{96^{\\circ}}}$. Let's think step by step and output the final answer within \\boxed{}."},
        ],
        temperature=0.7,
        top_p=0.8,
        presence_penalty=1.5,
        extra_body={
            "top_k": 20, 
            "chat_template_kwargs": {"enable_thinking": False},
        }
    )
    print("Chat response:", chat_response)
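If you prefer not to use the OpenAI SDK, the same request can be sent as plain JSON to the /v1/chat/completions path of the service. The following sketch only builds the request body; the endpoint and token are placeholders you must fill in, and the commented-out requests call is one possible way to post it:

```python
import json

# Placeholders: replace with your EAS endpoint and token before sending.
EAS_URL = "<YOUR EAS URL>"
TOKEN = "<YOUR_TOKEN>"

payload = {
    "model": "/qwen3_rlhf",  # the container mount path used at deployment
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    # Fields passed via extra_body in the SDK example sit at the top
    # level of the raw JSON body.
    "top_k": 20,
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)

# To send (requires the requests package):
# import requests
# r = requests.post(f"{EAS_URL}/v1/chat/completions",
#                   headers={"Authorization": TOKEN,
#                            "Content-Type": "application/json"},
#                   data=body)
# print(r.json()["choices"][0]["message"]["content"])
```

This can be useful for invoking the service from environments where installing the OpenAI SDK is inconvenient, such as minimal containers or shell-driven pipelines.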

Appendix: Resource specification reference for training and deployment

The following table lists the supported resource specifications for different model sizes.

  • Qwen3-8B

    • Full-parameter training resources (minimum): 1 node of ml.gu7xf.c96m1600.8-gu108, ml.gu7ef.c96m1600.8-gu100, or ml.gx8xf.8xlarge-gu108

    • Inference resources (minimum): 1 × V100 (32 GB VRAM) or 1 × A10 (24 GB VRAM)

  • Qwen3-32B

    • Full-parameter training resources (minimum): 2 nodes of the preceding specifications

    • Inference resources (minimum): 4 × V100 (32 GB VRAM) or 4 × A10 (24 GB VRAM)

  • Qwen3-30B-A3B

    • Full-parameter training resources (minimum): 2 nodes of the preceding specifications

    • Inference resources (minimum): 4 × V100 (32 GB VRAM) or 4 × A10 (24 GB VRAM)