Qwen3 is the latest large language model (LLM) series released by the Alibaba Cloud Qwen team on April 29, 2025. It includes two Mixture-of-Experts (MoE) models and six dense models. Qwen3 delivers major improvements in reasoning, instruction following, agent capabilities, and multilingual support. The Platform for AI (PAI) Model Gallery provides access to all eight model sizes, along with their corresponding Base and FP8 variants, for a total of 22 models. This guide explains how to deploy, fine-tune, and evaluate the Qwen3 model series in the Model Gallery.
Model deployment and invocation
Deploy the model
This section shows how to deploy the Qwen3-235B-A22B model with SGLang.
Go to the Model Gallery page.
Log on to the PAI console and select a region in the upper-left corner. You can switch regions to find one with sufficient computing resources.
In the navigation pane on the left, click Workspace Management and click the name of the target workspace.
In the left navigation pane, choose QuickStart > Model Gallery.
On the Model Gallery page, click the Qwen3-235B-A22B model card to open the model details page.
Click Deploy in the upper-right corner. Configure the following parameters and use the default values for the others to deploy the model to the Elastic Algorithm Service (EAS).
Deployment Method: Set Inference Engine to SGLang and Deployment Template to Single-Node.
Resource Information: Set Resource Type to Public Resources. The system automatically recommends an instance type. For the minimum required configuration, see Required computing power & supported token count.
Important
If no instance types are available, it means the public resource inventory in the region is insufficient. Consider the following options:
Switch regions. For example, the China (Ulanqab) region has a larger inventory of Lingjun preemptible resources, such as ml.gu7ef.8xlarge-gu100, ml.gu7xf.8xlarge-gu108, ml.gu8xf.8xlarge-gu108, and ml.gu8tf.8.40xlarge. Because preemptible resources can be reclaimed, be mindful of your bid.
Use an EAS resource group. You can purchase dedicated EAS resources from EAS Dedicated Resources Subscription.

Debug online
On the Service Details page, click Online Debugging to test the service directly in the console.

Call the API
Obtain the service endpoint and token.
In Model Gallery > Job Management > Deployment Jobs, click the name of the deployed service to open the service details page.
Click View Invocation Method to view the Internet Endpoint and token.

The following example shows how to call the /v1/chat/completions endpoint for an SGLang deployment.

curl:

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: <EAS_TOKEN>" \
  -d '{
    "model": "<model_name, get from the /v1/models API>",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "hello!" }
    ]
  }' \
  <EAS_ENDPOINT>/v1/chat/completions
```

Python:

```python
from openai import OpenAI

##### API configuration #####
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the service.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
print(model)

stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    model=model,
    max_completion_tokens=2048,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
```

Replace <EAS_ENDPOINT> with your service endpoint and <EAS_TOKEN> with your service token.
The invocation method varies by deployment type. For more examples, see Deploy large language models and call APIs.
Integrate third-party applications
To connect to Chatbox, Dify, or Cherry Studio, see Integrate third-party clients.
Advanced configuration
You can enable advanced features, such as adjusting the token limit or enabling tool calling, by modifying the service's JSON configuration.
To modify the configuration, edit the JSON in the Service Configuration section of the deployment page. For a service that is already deployed, update the service to open its deployment page.

Modify the token limit
Qwen3 models natively support a context length of 32,768 tokens. You can use RoPE scaling to extend this to a maximum of 131,072 tokens, though this might cause slight performance degradation. To do so, modify the containers.script field in the service configuration JSON as follows:
vLLM:

```bash
vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```

SGLang:

```bash
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```
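After the window is extended, requests can carry prompts longer than the native 32,768-token limit. The following minimal sketch assumes the service was launched with the 131,072-token configuration shown above; long_document is a placeholder for your own long input, and the endpoint and token are placeholders as in the earlier examples.

```python
from openai import OpenAI

# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
model = client.models.list().data[0].id

# Placeholder for a long input. With RoPE scaling enabled, the combined
# prompt and completion can use up to 131,072 tokens.
long_document = "..."

response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "user", "content": "Summarize the following document:\n\n" + long_document}
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```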
Parse tool calls
vLLM and SGLang support parsing the model's tool calling output into a structured message. To enable this, modify the containers.script field in the service configuration JSON file as follows:
vLLM:

```bash
vllm serve ... --enable-auto-tool-choice --tool-call-parser hermes
```

SGLang:

```bash
python -m sglang.launch_server ... --tool-call-parser qwen25
```
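After the parser is enabled, an OpenAI-compatible client can pass function definitions through the standard tools parameter and read the parsed result from tool_calls. The following is a minimal sketch; the get_weather tool is a hypothetical example, and the exact response structure can vary slightly across framework versions.

```python
from openai import OpenAI

# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

# A hypothetical tool definition used only for illustration.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "What is the weather in Hangzhou?"}],
    tools=tools,
)

# With the tool-call parser enabled, the structured call appears in tool_calls
# instead of being embedded as plain text in the message content.
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```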
Control the thinking mode
Qwen3 uses a thinking mode by default. You can control this behavior with a hard switch (to completely disable thinking) or a soft switch (where the model follows the user's instruction on whether to think).
Use a soft switch /no_think
Example request body:
```json
{
  "model": "<MODEL_NAME>",
  "messages": [
    {
      "role": "user",
      "content": "/no_think Hello!"
    }
  ],
  "max_tokens": 1024
}
```
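The same soft switch can also be sent from an OpenAI-compatible client by prefixing the user message, as in the following minimal sketch (the endpoint, token, and model name are placeholders, as in the earlier examples):

```python
from openai import OpenAI

# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

# The /no_think prefix asks the model to answer without a thinking phase.
response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "/no_think Hello!"}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```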
Use a hard switch

Control with an API parameter (for vLLM and SGLang): Add the chat_template_kwargs parameter to your API call. Example:

curl:

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: <EAS_TOKEN>" \
  -d '{
    "model": "<MODEL_NAME>",
    "messages": [
      { "role": "user", "content": "Give me a short introduction to large language models." }
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 8192,
    "presence_penalty": 1.5,
    "chat_template_kwargs": {"enable_thinking": false}
  }' \
  <EAS_ENDPOINT>/v1/chat/completions
```

Python:

```python
from openai import OpenAI

# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the service.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print("Chat response:", chat_response)
```

Replace <EAS_ENDPOINT> with your service endpoint, <EAS_TOKEN> with your service token, and <MODEL_NAME> with the model name retrieved from the /v1/models API.

Disable by modifying the service configuration (for BladeLLM): Use a chat template that prevents the model from generating thinking content when the model is launched.
On the model's product page in the Model Gallery, check whether a method is provided to disable the thinking mode for BladeLLM. For example, with Qwen3-8B, you can disable the thinking mode by modifying the containers.script field in the service configuration JSON file as follows:

```bash
blade_llm_server ... --chat_template /model_dir/no_thinking.jinja
```

If no such template is provided, write a custom chat template, such as no_thinking.jinja, mount it from OSS, and reference it in the containers.script field of the service configuration JSON file.
Parse thinking content
To output the thinking part separately, modify the containers.script field in the service configuration JSON file as follows:
vLLM:

```bash
vllm serve ... --enable-reasoning --reasoning-parser qwen3
```

SGLang:

```bash
python -m sglang.launch_server ... --reasoning-parser deepseek-r1
```
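With a reasoning parser enabled, the thinking content is typically returned in a separate field of the message rather than inline in content (for example, reasoning_content in vLLM's OpenAI-compatible responses; the exact field name can vary by framework and version). A minimal sketch:

```python
from openai import OpenAI

# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
model = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "How many prime numbers are there below 20?"}],
)

message = response.choices[0].message
# The parsed thinking content is exposed as an extra field on the message;
# getattr keeps the sketch working if the field is absent or named differently.
print("Thinking:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)
```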
Model fine-tuning
The Qwen3-32B, 14B, 8B, 4B, 1.7B, and 0.6B models support Supervised Fine-Tuning (SFT) with full-parameter, LoRA, or QLoRA training, along with Group Relative Policy Optimization (GRPO) training.
Submit one-click training jobs to create models tailored to your business scenarios.


Model evaluation
For detailed instructions on model evaluation, see Model evaluation and Best practices for LLM evaluation.
Appendix: Required computing power and supported token count
The following table lists the minimum configurations required to deploy Qwen3 models and the maximum supported token counts on different inference frameworks and instance types.
Among the FP8 models, only Qwen3-235B-A22B-FP8 requires less computing power than its non-FP8 counterpart. The requirements of the other FP8 models are identical to their non-FP8 versions and are therefore not listed separately. For example, for the computing power required by Qwen3-30B-A3B-FP8, see Qwen3-30B-A3B.
| Model | Maximum token count with SGLang (input + output) | Maximum token count with vLLM (input + output) | Minimum configuration |
| --- | --- | --- | --- |
| Qwen3-235B-A22B | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 8 × GPU H / GU120 (8 × 96 GB GPU memory) |
| Qwen3-235B-A22B-FP8 | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 4 × GPU H / GU120 (4 × 96 GB GPU memory) |
| Qwen3-30B-A3B, Qwen3-30B-A3B-Base, Qwen3-32B | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 × GPU H / GU120 (96 GB GPU memory) |
| Qwen3-14B, Qwen3-14B-Base | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 × GPU L / GU60 (48 GB GPU memory) |
| Qwen3-8B, Qwen3-4B, Qwen3-1.7B, Qwen3-0.6B, Qwen3-8B-Base, Qwen3-4B-Base, Qwen3-1.7B-Base, Qwen3-0.6B-Base | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 × A10 / GU30 (24 GB GPU memory). Important: an 8B model with RoPE scaling requires 48 GB of GPU memory. |
FAQ
Q: How can I maintain conversation context across multiple API calls with a model deployed on PAI?
A: Model services deployed on PAI are stateless. Each API call is independent, and the server does not retain context between requests.
To implement a multi-turn conversation, you must manage the conversation history on the client side. In each new API call, you need to pass the entire conversation history in the messages payload. For an example, see How do I implement a multi-turn conversation?
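The following minimal sketch illustrates this pattern: the client appends each user message and assistant reply to a local messages list and resends the full list with every call (the endpoint, token, and example inputs are placeholders, as in the earlier examples).

```python
from openai import OpenAI

# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service endpoint and token.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
model = client.models.list().data[0].id

# The client keeps the full conversation and sends it with every call.
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["My name is Alice.", "What is my name?"]:
    messages.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(model=model, messages=messages)
    answer = response.choices[0].message.content
    # Append the assistant reply so the next turn includes it as context.
    messages.append({"role": "assistant", "content": answer})
    print(f"User: {user_input}\nAssistant: {answer}\n")
```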