
Platform for AI: Deploy, fine-tune, and evaluate Qwen3 models

Last Updated: May 12, 2025

Qwen3 is the latest large language model (LLM) series released by the Qwen team of Alibaba Cloud on April 29, 2025. It comprises two Mixture of Experts (MoE) models and six dense models. Based on extensive training, Qwen3 makes breakthrough progress in reasoning, instruction following, agent capabilities, and multilingual support. Model Gallery of Platform for AI (PAI) has integrated all eight models, along with their corresponding Base and FP8 models, for a total of 22 models. This topic describes how to deploy, fine-tune, and evaluate these models in Model Gallery.

Deploy and call the model

Deploy the model

This section uses the SGLang accelerated deployment of Qwen3-235B-A22B as an example.

  1. Go to the Model Gallery page.

    1. Log on to the PAI console and select a region in the upper-left corner. (You can switch regions to find one with sufficient computing resource inventory.)

    2. In the left-side navigation pane, choose Workspaces. Click a workspace name to enter the corresponding workspace.

    3. In the left-side navigation pane, choose QuickStart > Model Gallery.

  2. In the model list on the right side of the Model Gallery page, click the Qwen3-235B-A22B model card.

  3. Click Deploy in the upper-right corner, then select the deployment method and resources to deploy the model to Elastic Algorithm Service (EAS).

    Deployment Resources: For the minimum resources required by the model, see Computing power required for deployment & supported token length.

    • EAS Resource Group: To use dedicated resources, purchase them in advance on the EAS prepaid (dedicated machine) purchase page.

    • Public Resources: Used by default, with recommended specifications. The resource type list is automatically filtered to show only the public resource specifications available for the model. If all options are grayed out and cannot be selected, the resource inventory is insufficient; consider switching regions.

      Important

      Lingjun preemptible resources (ml.gu7ef.8xlarge-gu100, ml.gu7xf.8xlarge-gu108, ml.gu8xf.8xlarge-gu108, ml.gu8tf.8.40xlarge) can only be used in the China (Ulanqab) region and do not require whitelist approval. However, they may be preempted, so pay attention to your bid price.


Debug the model online

Click EAS Online Debugging at the bottom of the Service details page.

Important

For vLLM accelerated deployment, you need to add a Content-Type header with the value application/json to the request headers. This is not required for SGLang accelerated deployment.
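
The same requirement applies if you call the service endpoint directly instead of using the online debugging page. The following is a minimal sketch that sends a request with the Python requests library and sets the Content-Type header explicitly; <EAS_ENDPOINT>, <EAS_TOKEN>, and <MODEL_NAME> are placeholders for your service endpoint, token, and model name.

import json
import requests

EAS_ENDPOINT = "<EAS_ENDPOINT>"  # replace with the deployed service endpoint
EAS_TOKEN = "<EAS_TOKEN>"        # replace with the deployed service token

headers = {
    "Content-Type": "application/json",  # required for vLLM accelerated deployment
    "Authorization": EAS_TOKEN,
}
payload = {
    "model": "<MODEL_NAME>",  # obtain the model name through the /v1/models API
    "messages": [{"role": "user", "content": "hello!"}],
}

response = requests.post(
    f"{EAS_ENDPOINT}/v1/chat/completions",
    headers=headers,
    data=json.dumps(payload),
)
print(response.json())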


Call the model through API operations

  1. Obtain the service endpoint and token.

    1. In Model Gallery > Job Management > Deployment Jobs, click the name of the deployed service to go to the service details page.

    2. Click View Call Information to obtain the endpoint and token.


  2. Call the model service through the API. The following sample requests call the chat API /v1/chat/completions (SGLang accelerated deployment), first with curl and then with the OpenAI Python SDK:

    curl -X POST \
        -H "Content-Type: application/json" \
        -H "Authorization: <EAS_TOKEN>" \
        -d '{
            "model": "<model name, obtained through '/v1/models' API>",
            "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "hello!"
            }
            ]
        }' \
        <EAS_ENDPOINT>/v1/chat/completions
    Python (OpenAI SDK):

    from openai import OpenAI
    
    ##### API configuration #####
    # <EAS_ENDPOINT> needs to be replaced with the deployed service endpoint, <EAS_TOKEN> needs to be replaced with the deployed service Token.
    openai_api_key = "<EAS_TOKEN>"
    openai_api_base = "<EAS_ENDPOINT>/v1"
    
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    
    models = client.models.list()
    model = models.data[0].id
    print(model)
    
    stream = True
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        model=model,
        max_completion_tokens=2048,
        stream=stream,
    )
    
    if stream:
        for chunk in chat_completion:
            print(chunk.choices[0].delta.content or "", end="")
    else:
        result = chat_completion.choices[0].message.content
        print(result)

    Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.

Different deployment methods correspond to different calling methods. For more information, see API calling.

Fine-tune the model

  • The Qwen3-32B/14B/8B/4B/1.7B/0.6B models support SFT (full-parameter, LoRA, and QLoRA fine-tuning).

  • You can submit training jobs with one click to train models for specific scenarios (see the data format sketch below).
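
Each model card in Model Gallery describes the training data format it expects. As a rough illustration only, SFT data is commonly prepared as JSON Lines with instruction/output pairs; the field names in the following sketch are assumptions, so confirm them against the model card before submitting a training job.

import json

# Illustrative SFT samples. The "instruction"/"output" field names are assumptions;
# check the Qwen3 model card in Model Gallery for the exact schema it expects.
samples = [
    {"instruction": "Summarize the following paragraph.", "output": "..."},
    {"instruction": "Translate 'hello' into French.", "output": "Bonjour."},
]

# Write one JSON object per line (JSON Lines), a common layout for SFT training data.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")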


Evaluate the model

All models except the 235B model, which is still under testing, can be evaluated by following the model evaluation documentation.

Appendix: Computing power required for deployment & supported token length

The following table lists the minimum specifications required to deploy each Qwen3 model, along with the maximum number of tokens supported by the different inference frameworks and machine types.

Note

Among the FP8 models, only Qwen3-235B-A22B-FP8 requires less computing power than its original model. The other FP8 models require the same computing power as their non-FP8 counterparts and are therefore not listed in the table. For example, for the computing power required by Qwen3-30B-A3B-FP8, refer to Qwen3-30B-A3B.

| Model | Maximum token length (input + output): SGLang accelerated deployment | Maximum token length (input + output): vLLM accelerated deployment | Minimum specifications |
| --- | --- | --- | --- |
| Qwen3-235B-A22B | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 8 GPU H / GU120 cards (8 * 96 GB video memory) |
| Qwen3-235B-A22B-FP8 | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 4 GPU H / GU120 cards (4 * 96 GB video memory) |
| Qwen3-30B-A3B, Qwen3-30B-A3B-Base, Qwen3-32B | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 GPU H / GU120 card (96 GB video memory) |
| Qwen3-14B, Qwen3-14B-Base | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 GPU L / GU60 card (48 GB video memory) |
| Qwen3-8B, Qwen3-4B, Qwen3-1.7B, Qwen3-0.6B, Qwen3-8B-Base, Qwen3-4B-Base, Qwen3-1.7B-Base, Qwen3-0.6B-Base | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 A10 / GU30 card (24 GB video memory) |

FAQ

How to extend context token length

Qwen3 supports a maximum token length of 32,768. RoPE scaling can extend it to 131,072, but it may cause some performance loss. Edit the script in the service configuration JSON as follows (a token-counting sketch follows the commands):

  • vLLM:

    vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
  • SGLang:

    python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
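
To check whether a prompt fits within the 32,768-token window (or 131,072 with RoPE scaling) before sending it, you can count tokens locally. The following is a minimal sketch that assumes the transformers library is installed and that the Qwen3 checkpoints share one tokenizer, so a small checkpoint such as Qwen/Qwen3-0.6B is enough for counting.

from transformers import AutoTokenizer

# Assumption: the Qwen3 series shares one tokenizer, so a small checkpoint
# is sufficient for counting tokens locally.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

prompt = "Give me a short introduction to large language models."
num_tokens = len(tokenizer.encode(prompt))
print(f"Prompt tokens: {num_tokens}")

# Input and output must fit in the window together, so leave room for the completion.
MAX_MODEL_LEN = 32768  # 131072 if RoPE scaling is enabled
print("Fits within the window:", num_tokens < MAX_MODEL_LEN)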

Function call support

vLLM and SGLang can parse the tool calling information generated by the model into a structured format. Edit the script in the service configuration JSON as follows (a usage sketch with the OpenAI SDK follows the commands):

  • vLLM:

    vllm serve ... --enable-auto-tool-choice --tool-call-parser hermes
  • SGLang:

    python -m sglang.launch_server ... --tool-call-parser qwen25
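
With a tool call parser enabled, tool calls can be consumed through the OpenAI-compatible interface as structured objects instead of raw text. The sketch below uses the OpenAI Python SDK; the get_weather tool is a hypothetical example, and <EAS_ENDPOINT>, <EAS_TOKEN>, and <MODEL_NAME> are placeholders as in the earlier examples.

from openai import OpenAI

# Replace the placeholders with the deployed service endpoint, token, and model name.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

# A hypothetical tool definition used only for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "What is the weather in Hangzhou?"}],
    tools=tools,
)

# With the parser enabled, tool calls are returned as structured objects
# rather than as raw text in the message content.
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)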

Enable or disable thinking mode

Qwen3 supports enabling and disabling its thinking mode. When deploying in Model Gallery, you can turn thinking mode on or off in the request as follows:

curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<MODEL_NAME>",
        "messages": [
            {
                "role": "user",
                "content": "Give me a short introduction to large language models."
            }
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": 8192,
        "presence_penalty": 1.5,
        "chat_template_kwargs": {"enable_thinking": true}
    }' \
    <EAS_ENDPOINT>/v1/chat/completions
from openai import OpenAI
# Replace <EAS_ENDPOINT> with the deployed service endpoint and <EAS_TOKEN> with the deployed service token.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print("Chat response:", chat_response)

Replace <EAS_ENDPOINT> with the endpoint of the deployed service, <EAS_TOKEN> with the token of the service, and <MODEL_NAME> with the actual model name, which you can obtain through the /v1/models API. To disable thinking mode, set enable_thinking to false.

To distinguish the thinking part in the output, edit the script in the service configuration JSON as follows (a sketch that reads the separated thinking part follows the commands):

  • vLLM:

    vllm serve ... --enable-reasoning --reasoning-parser qwen3

    You must replace the image with eas-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-eas/pai-quickstart:vllm-v0.8.5-netcat. Note that the region ID in this image address is cn-wulanchabu. If your service is deployed in a different region, replace cn-wulanchabu with your region ID.

  • SGLang:

    python -m sglang.launch_server ... --reasoning-parser deepseek-r1
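
With a reasoning parser enabled, the thinking part is expected to be returned in a separate reasoning_content field of the response message rather than mixed into content. The following minimal sketch reads it defensively in case the field is absent; the field name follows the common OpenAI-compatible convention and is an assumption here, and the placeholders are the same as in the earlier examples.

from openai import OpenAI

# Replace the placeholders with the deployed service endpoint, token, and model name.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)

message = response.choices[0].message
# Assumption: the separated thinking part is exposed as reasoning_content;
# fall back gracefully if the field is not present.
thinking = getattr(message, "reasoning_content", None)
if thinking:
    print("Thinking:", thinking)
print("Answer:", message.content)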

How to connect the deployed model service to Chatbox or Dify

See How to connect to Chatbox or Dify.

How to edit script

In the Service Configuration section of the deployment panel, edit the JSON.
