Platform For AI: Quickstart: Deploy, fine-tune, and evaluate Qwen3 models

Last Updated: Nov 27, 2025

Qwen3 is the latest large language model (LLM) series released by the Alibaba Cloud Qwen team on April 29, 2025. It includes two Mixture-of-Experts (MoE) models and six dense models. Through extensive training, Qwen3 achieves breakthroughs in reasoning, instruction following, agent capabilities, and multilingual support. The Platform for AI (PAI) Model Gallery provides access to all eight model sizes, along with their corresponding Base and FP8 variants, for a total of 22 models. This guide explains how to deploy, fine-tune, and evaluate the Qwen3 model series in the Model Gallery.

Model deployment and invocation

Deploy the model

This section shows how to deploy the Qwen3-235B-A22B model with SGLang.

  1. Go to the Model Gallery page.

    1. Log on to the PAI console and select a region in the upper-left corner. You can switch regions to find one with sufficient computing resources.

    2. In the navigation pane on the left, click Workspace Management and click the name of the target workspace.

    3. In the left navigation pane, choose QuickStart > Model Gallery.

  2. On the Model Gallery page, click the Qwen3-235B-A22B model card to open the model details page.

  3. Click Deploy in the upper-right corner. Configure the following parameters and use the default values for the others to deploy the model to the Elastic Algorithm Service (EAS).

    • Deployment Method: Set Inference Engine to SGLang and Deployment Template to Single-Node.

    • Resource Information: Set Resource Type to Public Resources. The system automatically recommends an instance type. For the minimum required configuration, see Appendix: Required computing power and supported token count.

    • Important

      If no instance types are available, it means the public resource inventory in the region is insufficient. Consider the following options:

      • Switch regions. For example, the China (Ulanqab) region has a larger inventory of Lingjun preemptible resources, such as ml.gu7ef.8xlarge-gu100, ml.gu7xf.8xlarge-gu108, ml.gu8xf.8xlarge-gu108, and ml.gu8tf.8.40xlarge. Because preemptible resources can be reclaimed, be mindful of your bid.

      • Use an EAS resource group. You can purchase dedicated EAS resources from EAS Dedicated Resources Subscription.

Debug online

On the Service Details page, click Online Debugging to test the deployed service directly in the console.

Call the API

  1. Obtain the service endpoint and token.

    1. In Model Gallery > Job Management > Deployment Jobs, click the name of the deployed service to open the service details page.

    2. Click View Invocation Method to view the Internet Endpoint and token.

  2. Call the /v1/chat/completions endpoint of the SGLang deployment. The following examples show the request with curl and with the OpenAI Python SDK.

    curl -X POST \
        -H "Content-Type: application/json" \
        -H "Authorization: <EAS_TOKEN>" \
        -d '{
            "model": "<model_name, get from the /v1/models API>",
            "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "hello!"
            }
            ]
        }' \
        <EAS_ENDPOINT>/v1/chat/completions

    The equivalent Python example uses the OpenAI SDK:

    from openai import OpenAI
    
    ##### API configuration #####
    # Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the service.
    openai_api_key = "<EAS_TOKEN>"
    openai_api_base = "<EAS_ENDPOINT>/v1"
    
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    
    models = client.models.list()
    model = models.data[0].id
    print(model)
    
    stream = True
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        model=model,
        max_completion_tokens=2048,
        stream=stream,
    )
    
    if stream:
        for chunk in chat_completion:
            # Some chunks (such as the first role-only chunk) carry no content.
            print(chunk.choices[0].delta.content or "", end="")
    else:
        result = chat_completion.choices[0].message.content
        print(result)

    Replace <EAS_ENDPOINT> with your service endpoint and <EAS_TOKEN> with your service token.

The invocation method varies by deployment type. For more examples, see Deploy large language models and call APIs.

Integrate third-party applications

To connect to Chatbox, Dify, or Cherry Studio, see Integrate third-party clients.

Advanced configuration

You can enable advanced features, such as adjusting the token limit or enabling tool calling, by modifying the service's JSON configuration.

To modify the configuration, edit the JSON in the Service Configuration section of the deployment page. For a service that is already deployed, update the service to open this page.
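
The commands shown in the following subsections all go into the script field of a containers entry in this JSON. The sketch below shows that part of the configuration as a Python dict for readability; the surrounding field names and values are illustrative, and your actual configuration contains more fields.

    # Simplified view of the service configuration JSON; only containers.script matters
    # for the subsections below. Other fields and values here are illustrative.
    service_config = {
        "metadata": {"name": "qwen3_service", "instance": 1},
        "containers": [
            {
                "image": "<inference engine image>",
                # The command line that the following subsections modify:
                "script": "vllm serve ... --max-model-len 131072",
                "port": 8000,
            }
        ],
    }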

Modify the token limit

Qwen3 models natively support a context length of 32,768 tokens. You can use RoPE scaling to extend this to a maximum of 131,072 tokens, although this may cause slight performance degradation. To do so, modify the containers.script field in the service configuration JSON as follows:

  • vLLM:

    vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
  • SGLang:

    python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
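
After the service is updated with the new configuration, you can sanity-check the effective context window by listing the models that the service exposes. vLLM, for example, typically reports a max_model_len field per model in its /v1/models response; other engines may not include it. A minimal sketch, using the same placeholders as the earlier examples:

    import requests

    # Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
    EAS_ENDPOINT = "<EAS_ENDPOINT>"
    EAS_TOKEN = "<EAS_TOKEN>"

    # List the models served by the deployment and print any reported context length.
    # max_model_len is a vLLM extension field; it may be absent on other engines.
    resp = requests.get(
        f"{EAS_ENDPOINT}/v1/models",
        headers={"Authorization": EAS_TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    for model in resp.json().get("data", []):
        print(model.get("id"), model.get("max_model_len"))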

Parse tool calls

vLLM and SGLang support parsing the model's tool calling output into a structured message. To enable this, modify the containers.script field in the service configuration JSON file as follows:

  • vLLM:

    vllm serve ... --enable-auto-tool-choice --tool-call-parser hermes
  • SGLang:

    python -m sglang.launch_server ... --tool-call-parser qwen25
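
With one of the parsers above enabled, tool calls come back as structured tool_calls entries in the OpenAI-compatible response rather than as raw text. The following minimal sketch uses the OpenAI SDK; the get_weather tool definition and the placeholders are illustrative:

    from openai import OpenAI

    # Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
    client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

    # A hypothetical tool definition in the OpenAI function-calling format.
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ]

    model = client.models.list().data[0].id
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What is the weather in Hangzhou?"}],
        tools=tools,
    )

    # With a tool-call parser enabled, the parsed call arrives here instead of in content.
    message = response.choices[0].message
    if message.tool_calls:
        for call in message.tool_calls:
            print(call.function.name, call.function.arguments)
    else:
        print(message.content)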

Control the thinking mode

Qwen3 uses a thinking mode by default. You can control this behavior with a hard switch (to completely disable thinking) or a soft switch (where the model follows the user's instruction on whether to think).

Use a soft switch /no_think

Example request body:

{
  "model": "<MODEL_NAME>",
  "messages": [
    {
      "role": "user",
      "content": "/no_think Hello!"
    }
  ],
  "max_tokens": 1024
}
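
The same soft switch can be sent through the OpenAI SDK; prefixing a user message with /no_think asks the model to answer that turn without thinking (placeholders as in the earlier examples):

    from openai import OpenAI

    # Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
    client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

    model = client.models.list().data[0].id

    # The /no_think prefix is the soft switch: the model answers this turn without thinking.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "/no_think Hello!"}],
        max_tokens=1024,
    )
    print(response.choices[0].message.content)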

Use a hard switch

  • Control with an API parameter (for vLLM and SGLang): Add the chat_template_kwargs parameter to your API call. Example:

    curl -X POST \
        -H "Content-Type: application/json" \
        -H "Authorization: <EAS_TOKEN>" \
        -d '{
            "model": "<MODEL_NAME>",
            "messages": [
                {
                    "role": "user",
                    "content": "Give me a short introduction to large language models."
                }
            ],
            "temperature": 0.7,
            "top_p": 0.8,
            "max_tokens": 8192,
            "presence_penalty": 1.5,
            "chat_template_kwargs": {"enable_thinking": false}
        }' \
        <EAS_ENDPOINT>/v1/chat/completions

    The equivalent Python example uses the OpenAI SDK:

    from openai import OpenAI
    # Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the service.
    openai_api_key = "<EAS_TOKEN>"
    openai_api_base = "<EAS_ENDPOINT>/v1"
    
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    
    chat_response = client.chat.completions.create(
        model="<MODEL_NAME>",
        messages=[
            {"role": "user", "content": "Give me a short introduction to large language models."},
        ],
        temperature=0.7,
        top_p=0.8,
        presence_penalty=1.5,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )
    print("Chat response:", chat_response)

    Replace <EAS_ENDPOINT> with your service endpoint, <EAS_TOKEN> with your service token, and <MODEL_NAME> with the model name retrieved from the /v1/models API.

  • Disable by modifying the service configuration (for BladeLLM): Launch the model with a chat template that prevents it from generating thinking content.

    • On the model's product page in the Model Gallery, check if a method is provided to disable the thinking mode for BladeLLM. For example, with Qwen3-8B, you can disable the thinking mode by modifying the containers.script field in the service configuration JSON file as follows:

      blade_llm_server ... --chat_template /model_dir/no_thinking.jinja
    • Write a custom chat template, such as no_thinking.jinja, mount it from OSS, and modify the containers.script field in the service configuration JSON file.

Parse thinking content

To output the thinking part separately, modify the containers.script field in the service configuration JSON file as follows:

  • vLLM:

    vllm serve ... --enable-reasoning --reasoning-parser qwen3
  • SGLang:

    python -m sglang.launch_server ... --reasoning-parser deepseek-r1
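
With a reasoning parser enabled, the OpenAI-compatible response typically carries the thinking part in a reasoning_content field alongside the final answer in content. This field is an extension rather than part of the standard OpenAI schema, so the sketch below reads it defensively:

    from openai import OpenAI

    # Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
    client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

    model = client.models.list().data[0].id
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "How many prime numbers are there below 20?"}],
    )

    message = response.choices[0].message
    # reasoning_content is added by the reasoning parser; read it with getattr in case
    # the deployed framework or version does not return it.
    print("Thinking:", getattr(message, "reasoning_content", None))
    print("Answer:", message.content)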

Model fine-tuning

  • The Qwen3-32B, 14B, 8B, 4B, 1.7B, and 0.6B models support Supervised Fine-Tuning (SFT) with full-parameter, LoRA, or QLoRA training, as well as Group Relative Policy Optimization (GRPO) training.

  • Submit one-click training jobs to create models tailored to your business scenarios.
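
Training jobs read a dataset that you prepare in advance, and the exact schema is described on each model's training page in the Model Gallery. As an illustration only, the sketch below writes SFT samples as JSON Lines with assumed instruction and output fields; adjust the field names to match the format documented for your training job:

    import json

    # Illustrative SFT samples; the field names (instruction/output) are assumptions and
    # must match the data format described on the model's training page in Model Gallery.
    samples = [
        {"instruction": "Summarize the following ticket: ...", "output": "The customer reports ..."},
        {"instruction": "Translate to French: Hello, world.", "output": "Bonjour, le monde."},
    ]

    # Write one JSON object per line (JSON Lines), a common layout for SFT datasets.
    with open("train.jsonl", "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")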

Model evaluation

For detailed instructions on model evaluation, see Model evaluation and Best practices for LLM evaluation.

Appendix: Required computing power and supported token count

The following table lists the minimum configurations required to deploy Qwen3 models and the maximum supported token counts on different inference frameworks and instance types.

Note

Among the FP8 models, only Qwen3-235B-A22B has a lower computing power requirement than its original counterpart. The requirements for other FP8 models are identical to their non-FP8 versions and are therefore not listed in this table. For example, to find the computing power required for Qwen3-30B-A3B-FP8, refer to Qwen3-30B-A3B.

| Model | Maximum token count (input + output): SGLang accelerated deployment | Maximum token count (input + output): vLLM accelerated deployment | Minimum configuration |
| --- | --- | --- | --- |
| Qwen3-235B-A22B | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 8 × GPU H / GU120 (8 × 96 GB GPU memory) |
| Qwen3-235B-A22B-FP8 | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 4 × GPU H / GU120 (4 × 96 GB GPU memory) |
| Qwen3-30B-A3B, Qwen3-30B-A3B-Base, Qwen3-32B | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 × GPU H / GU120 (96 GB GPU memory) |
| Qwen3-14B, Qwen3-14B-Base | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 × GPU L / GU60 (48 GB GPU memory) |
| Qwen3-8B, Qwen3-4B, Qwen3-1.7B, Qwen3-0.6B, Qwen3-8B-Base, Qwen3-4B-Base, Qwen3-1.7B-Base, Qwen3-0.6B-Base | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 × A10 / GU30 (24 GB GPU memory) |

Important

An 8B model with RoPE scaling requires 48 GB of GPU memory.

FAQ

Q: How can I maintain conversation context across multiple API calls with a model deployed on PAI?

Model services deployed on PAI are stateless. Each API call is independent, and the server does not retain context between requests.

To implement a multi-turn conversation, you must manage the conversation history on the client side. In each new API call, you need to pass the entire conversation history in the messages payload. For an example, see How do I implement a multi-turn conversation?
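
A minimal client-side sketch of this pattern with the OpenAI SDK: the messages list carries every previous turn, and each assistant reply is appended before the next user turn (placeholders as in the earlier examples):

    from openai import OpenAI

    # Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
    client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
    model = client.models.list().data[0].id

    # The client keeps the full history; the service itself is stateless.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    for user_input in ["Hello! My name is Alice.", "What is my name?"]:
        messages.append({"role": "user", "content": user_input})
        response = client.chat.completions.create(model=model, messages=messages)
        reply = response.choices[0].message.content
        # Append the assistant turn so the next request includes the full context.
        messages.append({"role": "assistant", "content": reply})
        print(f"User: {user_input}\nAssistant: {reply}\n")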