
Platform for AI: PAI-ChatLearn: Best practices for Qwen3 reinforcement learning

Last Updated: Jan 30, 2026

This topic uses the Qwen3 model as an example to describe how to use the PAI-ChatLearn training framework and Lingjun resources in Platform for AI (PAI) to perform efficient, distributed reinforcement learning for a large language model (LLM), and then deploy the trained model.

1. Preparations

1.1 Prepare the development environment

Before you begin, make sure you have completed the following tasks:

  1. Activate PAI and create a default workspace.

  2. Purchase Lingjun resources and create a resource quota. You need two machines with the node specification ml.gx8xf.8xlarge-gu108. For more information about the node specifications of Lingjun resources, see Billing of AI computing resources.

  3. Create a dataset to store the required training files and result files.

    • Dataset Type: Select Basic.

    • Storage Type: Select a file storage class. This topic uses a General-purpose NAS file system. If you do not have a NAS file system, see Create a file system.

      Note

      If your training task requires high read/write speeds and performance, use File Storage (CPFS).

    • Default Mount Path: Use the default value /mnt/data/.

  4. Create a Data Science Workshop (DSW) instance with the following key parameter settings.

    • Resource Type: Select Resource Quota.

    • Resource Quota: Select the resource quota for the Lingjun resources that you created.

    • Instance Type: Configure the following resource specifications.

      • GPUs: At least 8.

      • vCPUs: 90.

      • Memory (GiB): 1024.

      • Shared Memory (GiB): 1024.

    • Image Information: Select Image Address and set the image to dsw-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.5.1-vllm0.6.6-ubuntu22.04-cuda12.6-py310. You must change the region in the image address to match your current region. For example, if you start a DSW instance in Shanghai, change the region in the image address to cn-shanghai.

    • Dataset Mounting: Click Custom Dataset, select the dataset that you created, and use the default mount path.

  5. If you use a Resource Access Management (RAM) user to perform the following operations, you must grant the RAM user the permissions for DSW, Deep Learning Containers (DLC), or EAS. For more information, see Product dependencies and permissions: DSW, Product dependencies and permissions: DLC, or Product dependencies and permissions: EAS.

1.2 Download the code repository

  1. Enter the PAI-DSW development environment.

    1. Log on to the PAI console. In the upper-left corner of the page, select a region. China (Ulanqab) is recommended.

    2. In the left navigation pane, click Workspace List. On the Workspace List page, click the name of the desired workspace to open it.

    3. In the navigation pane on the left, choose Model Training > Data Science Workshop (DSW). Find the target instance and click Open in the Actions column.

  2. In the top menu bar, click Terminal. On the new tab, click Create Terminal.

  3. Download the ChatLearn code repository.

    git clone https://github.com/alibaba/ChatLearn.git && cd ChatLearn && git checkout 4ad5912306df5d4a814dc2dd5567fcb26f5d473b

1.3 Prepare the Qwen3 model

Download the Qwen3 model weights from ModelScope.

modelscope download --model Qwen/Qwen3-8B --local_dir Qwen3-8B

1.4 Prepare the training dataset

This example uses the MATH-lighteval dataset to demonstrate the ChatLearn reinforcement learning workflow.

  • This is a mathematical reasoning dataset that uses fixed rules to calculate reward scores.

  • To perform reinforcement learning training on a custom task, you can implement a custom reward scoring function based on the examples/fsdp/models/rule_reward.py file in the ChatLearn code repository.

# Download the dataset
mkdir -p dataset
modelscope download --dataset AI-ModelScope/MATH-lighteval --local_dir dataset/MATH-lighteval
# Pre-process the dataset
python examples/fsdp/data/data_preprocess/math_lighteval.py --input_dir dataset/MATH-lighteval --local_dir dataset/MATH-lighteval
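For reference, a rule-based reward for a math task like this one can be sketched as follows. This is a simplified, hypothetical illustration of the idea, not the actual implementation in examples/fsdp/models/rule_reward.py; the function names and the exact-match scoring rule are assumptions.

```python
import re

def extract_boxed(text):
    # Return the contents of the last \boxed{...} in a response, if any.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def rule_reward(response, ground_truth):
    # Hypothetical rule-based scoring: 1.0 for an exact match with the
    # reference answer, 0.0 otherwise. Real scorers typically normalize
    # the answer string (strip formatting, compare numerically) first.
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

print(rule_reward("The final answer is \\boxed{159}.", "159"))  # 1.0
print(rule_reward("I am not sure.", "159"))                     # 0.0
```

A custom task would plug a function like this into the reward model so that each generated response receives a scalar score for GRPO training.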

2. Reinforcement learning training

Note

First, develop and debug in the DSW environment. Then, submit a multi-node, multi-GPU distributed training task in the DLC environment.

This example uses FSDP as the training engine. To use Megatron to accelerate training, see tutorial_grpo_mcore.

2.1 Single-node training in DSW

Continue to run the following command in the DSW environment to start training. The trained model is saved to the mounted dataset for later deployment.

bash examples/fsdp/scripts/train_grpo_qwen3.sh

Note

With the default parameters of train_grpo_qwen3.sh, training is expected to take 2 to 3 hours.

2.2 Multi-node training in DLC

After you develop and debug on a single node, you can configure a multi-node, multi-GPU distributed task in the DLC environment to accelerate model training. The procedure is as follows:

  1. Go to the Create Task page.

    1. Log on to the PAI console. At the top of the page, select the destination region and the target workspace, and then click Enter DLC.

    2. On the Deep Learning Containers (DLC) page, click Create Task.

  2. On the Create Task page, configure the following key parameters. Use the default values for other parameters. For more information, see Create a training task.

    Basic Information

    • Job Name: Enter a custom task name. This example uses test_qwen3_dlc.

    Environment Information

    • Image Information: Select Image Address and enter dsw-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.5.1-vllm0.6.6-ubuntu22.04-cuda12.6-py310 in the text box. You must change the region in the image address to match your current region.

    • Dataset Mounting: Click Custom Dataset, select the dataset that you created, and use the default mount path /mnt/data/.

    • Startup Command: Configure the following command. The startup parameters of train_grpo_qwen3.sh are the same as those used for single-node training in DSW.

      cd /mnt/data/ChatLearn && bash examples/fsdp/scripts/train_grpo_qwen3.sh

    Resource Information

    • Resource Type: Select Lingjun Intelligent Computing.

    • Source: Select Resource Quota.

    • Resource Quota: Select the resource quota for the Lingjun resources that you created.

    • Framework: Select PyTorch.

    • Job Resource: On the Worker node configuration tab, configure the following parameters:

      • Quantity: 2. For multi-node training, set this to the required number of machines.

      • GPUs: 8

      • vCPUs: 90

      • Memory (GiB): 1024

      • Shared Memory (GiB): 1024

  3. Click OK. You are redirected to the Deep Learning Containers (DLC) page. You can click the task name to view the task execution status on the Task Details page. When the Status changes to Succeeded, the training task is complete.

    Note

    If the DLC job fails with the error ray.exceptions.RpcError: Timed out while waiting for GCS to become available., the error typically occurs while the Ray cluster shuts down after training has already finished. Check the output directory; if the model was saved, you can still deploy the service using the saved model.

2.3 Main parameter descriptions

The following list describes the main parameters to configure in train_grpo_qwen3.sh.

  • model_path: Path of the model weights.

  • output_dir: Output path for logs, saved models, and data.

  • train_data_path: Path of the training dataset.

  • eval_data_path: Path of the evaluation dataset.

  • sp_size: Ulysses sequence parallelism size, used for long-context model training.

  • tensor_model_parallel_size: Tensor parallelism size for the vLLM inference service.

  • gpu_memory_utilization: Fraction of GPU memory pre-allocated by the vLLM inference service.

  • seq_length: Maximum sequence length (prompt length + generation length).

  • max_new_tokens: Maximum generation length.

  • num_inference_per_prompt: Number of responses generated for each prompt.

  • sample_per_episode: Number of samples trained in each episode, equal to the number of prompts × num_inference_per_prompt.

  • train_micro_batch_size: Micro-batch size for the forward pass during training.

  • enable_eval_before_training: Specifies whether to run evaluation before training.

  • num_episode: Number of reinforcement learning training episodes.

  • eval_episode_interval: Number of episodes between evaluations.

  • save_episode_interval: Number of episodes between model checkpoints.
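To make the relationships among these parameters concrete, the following is a small arithmetic sketch with hypothetical values. The numbers and the prompts_per_episode name are illustrative, not the script defaults:

```python
# Hypothetical values, not the defaults in train_grpo_qwen3.sh.
num_inference_per_prompt = 8   # responses sampled per prompt
prompts_per_episode = 64       # prompts drawn in one episode (assumed name)

# sample_per_episode = number of prompts x num_inference_per_prompt
sample_per_episode = prompts_per_episode * num_inference_per_prompt

max_new_tokens = 2048          # maximum generation length
prompt_length_budget = 2048    # room left for the prompt (assumed)

# seq_length must cover the prompt plus the generated tokens.
seq_length = prompt_length_budget + max_new_tokens

print(sample_per_episode)  # 512
print(seq_length)          # 4096
```

In practice this means that increasing num_inference_per_prompt raises the per-episode sample count proportionally, and seq_length must always be at least max_new_tokens plus the longest expected prompt.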

3. Deploy and invoke the model

After model training is complete, you can deploy the model as an online service and invoke it in a production environment.

3.1 Deploy the model service

  1. Log on to the PAI console. At the top of the page, select the destination region and the target workspace, and then click Enter EAS.

  2. Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

  3. On the Custom Deployment page, configure the following key parameters. Use the default values for other parameters.

    Basic Information

    • Service Name: Enter a custom name for the model service. The name must be unique within the same region. This example uses test_qwen3.

    Environment Information

    • Deployment Method: This example uses Image-based Deployment.

    • Image Configuration: Select Image Address and enter eas-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-eas/vllm:v0.8.5.post1 in the text box. You must change the region in the image address to match your current region.

    • Mount Storage: Select General-purpose NAS and configure the following parameters:

      • Select File System: Select the NAS file system that you used to create the dataset.

      • File System Mount Target: Select the mount target that you used to create the dataset.

      • File System Path: Set this to the path of the Hugging Face format model stored in NAS. This example uses /ChatLearn/output/qwen3-grpo/save_model/policy_trainer/20/huggingface/.

      • Mount Path: Specify the mount path inside the container. This example uses /qwen3_rlhf.

    • Command: Set the command to vllm serve /qwen3_rlhf --host 0.0.0.0 --port 8000 --max-model-len 8192.

      Note: If you deploy on a V100 instance, set the command to vllm serve /qwen3_rlhf --host 0.0.0.0 --port 8000 --max-model-len 8192 --dtype=half.

    • Port Number: Set this to 8000.

    Resource Information

    • Resource Type: This example uses Public Resources.

    • Number of Replicas: Configure this parameter based on the model and selected resources. For an 8B model, set this to 1.

    • Deployment: For the resource specification, select A10 or V100. This example uses ecs.gn7i-c32g1.8xlarge.

    Network Information

    • VPC, vSwitch, and Security Group: After you configure the NAS mount target, the system automatically matches the VPC and vSwitch of the NAS file system. Set the security group as needed.

  4. Click Deploy. The service deployment takes about 6 minutes. When the Service Status changes to Running, the service deployment is complete.

3.2 Invoke the service

  1. Obtain the service endpoint and token. On the Inference Service tab, find the target service and navigate to the Overview page. In the Basic Information section, click View Endpoint Information.

  2. Use the following code to invoke the service. Replace <YOUR EAS URL> with the endpoint obtained in Step 1. We recommend that you set the token as an environment variable.

    import os
    from openai import OpenAI
    
    # Set the Token as an environment variable.
    openai_api_key = os.environ.get("Token")
    # Replace <YOUR EAS URL> with the service endpoint.
    openai_api_base = "<YOUR EAS URL>/v1"
    
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    
    chat_response = client.chat.completions.create(
        model="/qwen3_rlhf",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Find the smallest positive integer solution to $\\tan{19x^{\\circ}}=\\dfrac{\\cos{96^{\\circ}}+\\sin{96^{\\circ}}}{\\cos{96^{\\circ}}-\\sin{96^{\\circ}}}$. Let's think step by step and output the final answer within \\boxed{}."},
        ],
        temperature=0.7,
        top_p=0.8,
        presence_penalty=1.5,
        extra_body={
            "top_k": 20, 
            "chat_template_kwargs": {"enable_thinking": False},
        }
    )
    print("Chat response:", chat_response)
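If you prefer not to use the OpenAI SDK, the same request can be sent as plain JSON to the /v1/chat/completions path of the service. The following sketch only builds the request body; the endpoint and token are placeholders you must fill in, and the commented-out requests call is one possible way to post it:

```python
import json

# Placeholders: replace with your EAS endpoint and token before sending.
EAS_URL = "<YOUR EAS URL>"
TOKEN = "<YOUR_TOKEN>"

payload = {
    "model": "/qwen3_rlhf",  # the container mount path used at deployment
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    # Fields passed via extra_body in the SDK example sit at the top
    # level of the raw JSON body.
    "top_k": 20,
    "chat_template_kwargs": {"enable_thinking": False},
}

body = json.dumps(payload)

# To send (requires the requests package):
# import requests
# r = requests.post(f"{EAS_URL}/v1/chat/completions",
#                   headers={"Authorization": TOKEN,
#                            "Content-Type": "application/json"},
#                   data=body)
# print(r.json()["choices"][0]["message"]["content"])
```

This can be useful for invoking the service from environments where installing the OpenAI SDK is inconvenient, such as minimal containers or shell-driven pipelines.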

Appendix: Resource specification reference for training and deployment

The following table lists the supported resource specifications for different model sizes.

  • Qwen3-8B

    • Full-parameter training resources (minimum): 1 node of ml.gu7xf.c96m1600.8-gu108, ml.gu7ef.c96m1600.8-gu100, or ml.gx8xf.8xlarge-gu108

    • Inference resources (minimum): 1 × V100 (32 GB VRAM) or 1 × A10 (24 GB VRAM)

  • Qwen3-32B

    • Full-parameter training resources (minimum): 2 nodes of the preceding specifications

    • Inference resources (minimum): 4 × V100 (32 GB VRAM) or 4 × A10 (24 GB VRAM)

  • Qwen3-30B-A3B

    • Full-parameter training resources (minimum): 2 nodes of the preceding specifications

    • Inference resources (minimum): 4 × V100 (32 GB VRAM) or 4 × A10 (24 GB VRAM)