This topic uses the Qwen3 model as an example. It describes how to use the PAI-ChatLearn training framework and Lingjun resources in PAI to perform efficient, distributed reinforcement learning for a large language model (LLM) and deploy the trained model.
1. Preparations
1.1 Prepare the development environment
Before you begin, make sure you have completed the following tasks:
Purchase Lingjun resources and create a resource quota. You need two machines with the node specification ml.gx8xf.8xlarge-gu108. For more information about the node specifications of Lingjun resources, see Billing of AI computing resources.
Create a dataset to store the required training files and result files.
Dataset Type: Select Basic.
Storage Type: Select a file storage class. This topic uses a General-purpose NAS file system. If you do not have a NAS file system, see Create a file system.
Note: If your training task requires high read/write speeds and performance, use File Storage (CPFS).
Default Mount Path: Use the default value /mnt/data/.
Create a Data Science Workshop (DSW) instance with the following key parameter settings.
Resource Type: Select Resource Quota.
Resource Quota: Select the resource quota for the Lingjun resources that you created.
Instance Type: Configure the following resource specifications.
GPUs: At least 8.
vCPUs: 90.
Memory (GiB): 1024.
Shared Memory (GiB): 1024.
Image Information: Select Image Address and set the image to dsw-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.5.1-vllm0.6.6-ubuntu22.04-cuda12.6-py310. You must change the region in the image address to match your current region. For example, if you start a DSW instance in Shanghai, change the region in the image address to cn-shanghai.
Dataset Mounting: Click Custom Dataset, select the dataset that you created, and use the default mount path.
If you use a Resource Access Management (RAM) user to perform the following operations, you must grant the RAM user the permissions for DSW, Deep Learning Containers (DLC), or EAS. For more information, see Product dependencies and permissions: DSW, Product dependencies and permissions: DLC, or Product dependencies and permissions: EAS.
1.2 Download the code repository
Enter the PAI-DSW development environment.
Log on to the PAI console. In the upper-left corner of the page, select a region. China (Ulanqab) is recommended.
In the left navigation pane, click Workspace List. On the Workspace List page, click the name of the desired workspace to open it.
In the navigation pane on the left, choose Model Training > Data Science Workshop (DSW). Find the target instance and click Open in the Actions column.
In the top menu bar, click Terminal. On the new tab, click Create Terminal.
Download the ChatLearn code repository.
git clone https://github.com/alibaba/ChatLearn.git && cd ChatLearn && git checkout 4ad5912306df5d4a814dc2dd5567fcb26f5d473b
1.3 Prepare the Qwen3 model
Download the Qwen3 model weights from ModelScope.
modelscope download --model Qwen/Qwen3-8B --local_dir Qwen3-8B
1.4 Prepare the training dataset
This example uses the MATH-lighteval dataset to demonstrate the ChatLearn reinforcement learning workflow.
This is a mathematical reasoning dataset that uses fixed rules to calculate reward scores.
To perform reinforcement learning training on a custom task, you can implement a custom reward scoring function based on the examples/fsdp/models/rule_reward.py file in the ChatLearn code repository.
# Download the dataset
mkdir -p dataset
modelscope download --dataset AI-ModelScope/MATH-lighteval --local_dir dataset/MATH-lighteval
# Pre-process the dataset
python examples/fsdp/data/data_preprocess/math_lighteval.py --input_dir dataset/MATH-lighteval --local_dir dataset/MATH-lighteval
2. Reinforcement learning training
First, develop and debug in the DSW environment. Then, submit a multi-node, multi-GPU distributed training task in the DLC environment.
This example uses FSDP as the training engine. To use Megatron to accelerate training, see tutorial_grpo_mcore.
2.1 Single-node training in DSW
Continue to run the following command in the DSW environment to start training. The trained model is saved to the mounted dataset for later deployment.
bash examples/fsdp/scripts/train_grpo_qwen3.sh
With the default parameters of train_grpo_qwen3.sh, training is expected to take 2 to 3 hours.
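The trained policy is exported in Hugging Face format under the mounted dataset; the deployment section later in this topic mounts /ChatLearn/output/qwen3-grpo/save_model/policy_trainer/20/huggingface/. Before deploying, you can quickly sanity-check that the export directory looks like a Hugging Face checkpoint. This helper is a minimal sketch, and the exact save path (including the step number 20) depends on your training configuration:

```python
import os

def looks_like_hf_checkpoint(path: str) -> bool:
    """Rough check that a directory contains a Hugging Face format model:
    a config.json plus at least one weights file (*.safetensors or *.bin)."""
    if not os.path.isdir(path):
        return False
    names = os.listdir(path)
    has_config = "config.json" in names
    has_weights = any(n.endswith((".safetensors", ".bin")) for n in names)
    return has_config and has_weights

# Example path from this topic; the step number "20" is configuration-dependent.
save_dir = "/mnt/data/ChatLearn/output/qwen3-grpo/save_model/policy_trainer/20/huggingface"
print(looks_like_hf_checkpoint(save_dir))
```

If this prints False, confirm that training finished and check the actual contents of output/qwen3-grpo/save_model/ before configuring the EAS mount path.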
2.2 Multi-node training in DLC
After you develop and debug on a single node, you can configure a multi-node, multi-GPU distributed task in the DLC environment to accelerate model training. The procedure is as follows:
Go to the Create Task page.
Log on to the PAI console. At the top of the page, select the destination region and the target workspace, and then click Enter DLC.
On the Deep Learning Containers (DLC) page, click Create Task.
On the Create Task page, configure the following key parameters. Use the default values for other parameters. For more information, see Create a training task.
Parameter
Description
Basic Information
Job Name
Enter a custom task name. This example uses test_qwen3_dlc.
Environment Information
Image Information
Select Image Address and enter dsw-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.5.1-vllm0.6.6-ubuntu22.04-cuda12.6-py310 in the text box. You must change the region in the image address based on your current region.
Mount dataset
Click Custom Dataset, select the dataset that you created, and use the default mount path
/mnt/data/.
Startup Command
Configure the following command. The startup parameters for the train_grpo_qwen3.sh script are the same as those used for single-node training in DSW.
cd /mnt/data/ChatLearn && bash examples/fsdp/scripts/train_grpo_qwen3.sh
Resource Information
Resource Type
Select Lingjun Intelligent Computing.
Source
Select Resource Quota.
Resource Quota
This example uses the resource quota for the Lingjun resources that you created.
Framework
Select PyTorch.
Job Resource
On the Worker node configuration tab, configure the following parameters:
Quantity: 2. For multi-node training, set this to the required number of machines.
GPUs: 8
vCPUs: 90
Memory (GiB): 1024
Shared Memory (GiB): 1024
Click OK. You are redirected to the Deep Learning Containers (DLC) page. You can click the task name to view the task execution status on the Task Details page. When the Status changes to Succeeded, the training task is complete.
Note: If the DLC job fails with the error ray.exceptions.RpcError: Timed out while waiting for GCS to become available., this error typically occurs during shutdown after training has already finished. In that case, the training task is complete, and you can still deploy the service using the saved model.
2.3 Main parameter descriptions
3. Deploy and invoke the model
After model training is complete, you can deploy the model as an online service and invoke it in a production environment.
3.1 Deploy the model service
Log on to the PAI console. At the top of the page, select the destination region and the target workspace, and then click Enter EAS.
Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
On the Custom Deployment page, configure the following key parameters. Use the default values for other parameters.
Parameter
Description
Basic Information
Service Name
Enter a custom name for the model service. The name must be unique within the same region. This example uses test_qwen3.
Environment Information
Deployment Method
This example uses Image-based Deployment.
Image Configuration
Select Image Address and enter the image address eas-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-eas/vllm:v0.8.5.post1 in the text box. You must change the region in the image address based on your current region.
Mount storage
Select General-purpose NAS and configure the following parameters:
Select File System: Select the NAS file system that you used to create the dataset.
File System Mount Target: Select the mount target that you used to create the dataset.
File System Path: Set this to the path of the Hugging Face format model stored in NAS. This example uses /ChatLearn/output/qwen3-grpo/save_model/policy_trainer/20/huggingface/.
Mount Path: Specify the path after mounting. This example uses /qwen3_rlhf.
Command
Set the command to vllm serve /qwen3_rlhf --host 0.0.0.0 --port 8000 --max-model-len 8192.
Note: If you deploy on a V100 instance, set the execution command to vllm serve /qwen3_rlhf --host 0.0.0.0 --port 8000 --max-model-len 8192 --dtype=half.
Port Number
Set this to 8000.
Resource Information
Resource Type
This example uses Public Resources.
Number of Replicas
Configure this parameter based on the model and selected resources. For an 8B model, set this to 1.
Deployment
For the resource specification, select A10 or V100. This example uses ecs.gn7i-c32g1.8xlarge.
Network Information
VPC
After you configure the NAS mount target, the system automatically matches the VPC and vSwitch with the preset NAS file system. Set the security group as needed.
vSwitch
Security group
Click Deploy. The service deployment takes about 6 minutes. When the Service Status changes to Running, the service deployment is complete.
3.2 Invoke the service
Obtain the service endpoint and token. On the Inference Service tab, find the target service and navigate to the Overview page. In the Basic Information section, click View Endpoint Information.

Use the following code to invoke the service. Replace <YOUR EAS URL> with the endpoint that you obtained in the previous step. We recommend that you set the token as an environment variable.
import os
from openai import OpenAI

# Set the Token as an environment variable.
openai_api_key = os.environ.get("Token")
# Replace <YOUR EAS URL> with the service endpoint.
openai_api_base = "<YOUR EAS URL>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="/qwen3_rlhf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Find the smallest positive integer solution to $\\tan{19x^{\\circ}}=\\dfrac{\\cos{96^{\\circ}}+\\sin{96^{\\circ}}}{\\cos{96^{\\circ}}-\\sin{96^{\\circ}}}$. Let's think step by step and output the final answer within \\boxed{}."},
    ],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("Chat response:", chat_response)
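The returned object follows the standard OpenAI SDK chat-completion shape, so the assistant text is at choices[0].message.content. A small illustrative helper (it works on any object exposing that attribute path, not only on SDK response objects):

```python
def extract_reply(chat_response) -> str:
    """Return the assistant's text from an OpenAI-style chat completion response."""
    return chat_response.choices[0].message.content
```

For the math prompt above, the model's final answer appears inside \boxed{} within the returned text.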
Appendix: Resource specification reference for training and deployment
The following table lists the supported resource specifications for different model sizes.
Model Size | Training Quantity (Minimum) | Training Node Specification | Inference Resources (Minimum)
Qwen3-8B | 1 unit | ml.gx8xf.8xlarge-gu108 | 1 × V100 (32 GB VRAM) or 1 × A10 (24 GB VRAM)
Qwen3-32B | 2 units | ml.gx8xf.8xlarge-gu108 | 4 × V100 (32 GB VRAM) or 4 × A10 (24 GB VRAM)
Qwen3-30B-A3B | 2 units | ml.gx8xf.8xlarge-gu108 | 4 × V100 (32 GB VRAM) or 4 × A10 (24 GB VRAM)