Qwen3 is the latest large language model (LLM) series released by the Qwen team of Alibaba Cloud on April 29, 2025. It includes 2 Mixture of Experts (MoE) models and 6 Dense models. Based on extensive training, Qwen3 has made breakthrough progress in reasoning, instruction following, agent capabilities, and multilingual support. Model Gallery of Platform for AI (PAI) has integrated all 8 models, along with their corresponding Base models and FP8 models, for a total of 22 models. This topic describes how to deploy, fine-tune, and evaluate these models in Model Gallery.
Deploy and call the model
Deploy the model
This section uses the SGLang accelerated deployment of Qwen3-235B-A22B as an example.
Go to the Model Gallery page.
Log on to the PAI console, and select a region in the upper-left corner (you can switch to find regions with sufficient computing resource inventory).
In the left-side navigation pane, choose Workspaces. Click a workspace name to enter the corresponding workspace.
In the left-side navigation pane, choose QuickStart > Model Gallery.
In the model list on the right side of the Model Gallery page, click the Qwen3-235B-A22B model card.
Click Deploy in the upper-right corner, then select the deployment method and resources to deploy the model to Elastic Algorithm Service (EAS).
Deployment Resources: For the minimum resources required by the model, see Computing power required for deployment & supported token length.
EAS Resource Group: To use dedicated resources, first purchase EAS dedicated machines through an EAS subscription (prepaid) plan.
Public Resources: Used by default, with recommended specifications. The resource type list has automatically filtered out the public resource specifications available for the model. If all options are grayed out and cannot be selected, it indicates insufficient resource inventory. Consider switching regions.
Important: Lingjun preemptible resources (ml.gu7ef.8xlarge-gu100, ml.gu7xf.8xlarge-gu108, ml.gu8xf.8xlarge-gu108, ml.gu8tf.8.40xlarge) can be used only in the China (Ulanqab) region and do not require whitelist approval. However, they may be preempted, so set your bid price carefully.
Debug the model online
Click EAS Online Debugging at the bottom of the Service details page.
For vLLM accelerated deployment, you must add a Content-Type header with the value application/json to the request headers. This is not required for SGLang accelerated deployment.
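The same header requirement applies when you call a vLLM accelerated deployment over HTTP outside the console. The following is a minimal sketch using the requests library; replace <EAS_ENDPOINT> and <EAS_TOKEN> with the values from the service details page, and obtain the model name through the /v1/models API:
import requests

# Sketch: call a vLLM accelerated deployment; Content-Type must be application/json.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
EAS_TOKEN = "<EAS_TOKEN>"

resp = requests.post(
    f"{EAS_ENDPOINT}/v1/chat/completions",
    headers={
        "Authorization": EAS_TOKEN,
        "Content-Type": "application/json",  # required for vLLM accelerated deployment
    },
    json={
        "model": "<model name, obtained through the /v1/models API>",
        "messages": [{"role": "user", "content": "hello!"}],
    },
    timeout=60,
)
print(resp.json())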
Call the model by using API operations
Obtain the service endpoint and token.
In Model Gallery > Job Management > Deployment Jobs, click the name of the deployed service to go to the service details page.
Click View Call Information to obtain the endpoint and token.
Call the model service by using API operations. The following sample request calls the chat API /v1/chat/completions (SGLang accelerated deployment):
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: <EAS_TOKEN>" \
  -d '{
    "model": "<model name, obtained through the /v1/models API>",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "hello!" }
    ]
  }' \
  <EAS_ENDPOINT>/v1/chat/completions
from openai import OpenAI

##### API configuration #####
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
print(model)

stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    model=model,
    max_completion_tokens=2048,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
Different deployment methods correspond to different calling methods. For more information, see API calling.
Fine-tune the model
Qwen3-32B/14B/8B/4B/1.7B/0.6B models support SFT (full parameter/LoRA/QLoRA fine-tuning).
You can submit a training job with one click to train a model for your own scenario.
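The exact training data format is described on the model's training configuration page in Model Gallery. As a rough illustration only, the sketch below assumes a common instruction/output JSON schema; verify the actual schema on the training page before submitting a job.
import json

# Sketch of preparing an SFT dataset. The "instruction"/"output" schema here is an
# assumption for illustration; check the data format described on the training page
# of the specific Qwen3 model before you submit the job.
samples = [
    {
        "instruction": "Classify the sentiment of this review: The battery lasts two full days.",
        "output": "positive",
    },
    {
        "instruction": "Summarize in one sentence: Qwen3 adds a switchable thinking mode.",
        "output": "Qwen3 lets callers turn its thinking mode on or off per request.",
    },
]

# Write the samples to a local file that can be uploaded as the training dataset.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)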
Evaluate the model
All models can be evaluated except the 235B model, for which evaluation support is still under testing. To evaluate a model, see the following documents:
Appendix: Computing power required for deployment & supported token length
The following table provides the minimum specifications required for Qwen3 deployment, along with the maximum number of Tokens supported by different inference frameworks and machine types.
Among the FP8 models, only Qwen3-235B-A22B-FP8 requires less computing power than its non-FP8 counterpart. The other FP8 models require the same computing power as their non-FP8 counterparts and are therefore not listed in the table. For example, for the computing power required by Qwen3-30B-A3B-FP8, refer to Qwen3-30B-A3B.
| Model | Maximum token length (input + output), SGLang accelerated deployment | Maximum token length (input + output), vLLM accelerated deployment | Minimum specifications |
| --- | --- | --- | --- |
| Qwen3-235B-A22B | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 8 × GPU H / GU120 (8 × 96 GB GPU memory) |
| Qwen3-235B-A22B-FP8 | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 4 × GPU H / GU120 (4 × 96 GB GPU memory) |
| Qwen3-30B-A3B, Qwen3-30B-A3B-Base, Qwen3-32B | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 × GPU H / GU120 (96 GB GPU memory) |
| Qwen3-14B, Qwen3-14B-Base | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 × GPU L / GU60 (48 GB GPU memory) |
| Qwen3-8B, Qwen3-4B, Qwen3-1.7B, Qwen3-0.6B, Qwen3-8B-Base, Qwen3-4B-Base, Qwen3-1.7B-Base, Qwen3-0.6B-Base | 32,768 (with RoPE scaling: 131,072) | 32,768 (with RoPE scaling: 131,072) | 1 × A10 / GU30 (24 GB GPU memory) |
FAQ
How to extend context token length
Qwen3 supports a context length of 32,768 tokens. RoPE scaling can extend it to 131,072 tokens, but may cause some performance loss. To enable it, edit the script in the service configuration JSON as follows:
vLLM:
vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
SGLang:
python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
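After redeploying with RoPE scaling, you can check the context window that the service reports. The sketch below queries the OpenAI-compatible /v1/models endpoint; the max_model_len field is exposed by vLLM and is an assumption for other frameworks, so the code prints a fallback message if the field is absent.
import requests

# Sketch: verify the context window reported by the service after enabling RoPE scaling.
# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the values from the service details page.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
EAS_TOKEN = "<EAS_TOKEN>"

resp = requests.get(
    f"{EAS_ENDPOINT}/v1/models",
    headers={"Authorization": EAS_TOKEN},
    timeout=30,
)
resp.raise_for_status()
for card in resp.json().get("data", []):
    # max_model_len is reported by vLLM; other frameworks may omit it.
    print(card.get("id"), "max_model_len:", card.get("max_model_len", "not reported"))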
Function call support
vLLM and SGLang support parsing the tool calling information generated by the model into a structured format. Edit the script in the service configuration JSON as follows:
vLLM:
vllm serve ... --enable-auto-tool-choice --tool-call-parser hermes
SGLang:
python -m sglang.launch_server ... --tool-call-parser qwen25
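With the parser enabled, you can pass tools through the standard OpenAI tools parameter and receive the model's calls as structured tool_calls. The following is a minimal sketch; the get_current_weather tool and its schema are hypothetical examples, and <EAS_ENDPOINT> and <EAS_TOKEN> must be replaced with your service values.
from openai import OpenAI

# Sketch of a tool-calling request against the deployed service.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

# Hypothetical example tool; define schemas that match your own functions.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. Hangzhou"}
                },
                "required": ["city"],
            },
        },
    }
]

model = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is the weather in Hangzhou right now?"}],
    tools=tools,
)
# With the tool call parser enabled, the call arrives as structured tool_calls
# instead of raw text in message.content.
print(response.choices[0].message.tool_calls)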
Enable or disable thinking mode
Qwen3 supports enabling and disabling its thinking mode. When deploying in Model Gallery, you can turn on/off thinking mode through the following methods:
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: <EAS_TOKEN>" \
-d '{
"model": "<MODEL_NAME>",
"messages": [
{
"role": "user",
"content": "Give me a short introduction to large language models."
}
],
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 8192,
"presence_penalty": 1.5,
"chat_template_kwargs": {"enable_thinking": true}
}' \
<EAS_ENDPOINT>/v1/chat/completions
from openai import OpenAI
# Replace <EAS_ENDPOINT> with the deployed service endpoint and <EAS_TOKEN> with the deployed service token.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="<MODEL_NAME>",
messages=[
{"role": "user", "content": "Give me a short introduction to large language models."},
],
temperature=0.7,
top_p=0.8,
presence_penalty=1.5,
extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print("Chat response:", chat_response)
Replace <EAS_ENDPOINT> with the endpoint of the deployed service, <EAS_TOKEN> with the token of the service, and <MODEL_NAME> with the actual model name, which can be obtained through the /v1/models API.
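To disable thinking mode, pass enable_thinking with a value of false in the same way. A minimal sketch based on the Python example above:
from openai import OpenAI

# Same request as above, but with thinking mode turned off.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
chat_response = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print("Chat response:", chat_response)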
To distinguish the thinking part in the output, edit the script in the service configuration JSON as follows:
vLLM:
vllm serve ... --enable-reasoning --reasoning-parser qwen3
You must replace the image with eas-registry-vpc.cn-wulanchabu.cr.aliyuncs.com/pai-eas/pai-quickstart:vllm-v0.8.5-netcat. Note that the region ID in this image address is cn-wulanchabu. If your service is deployed in a different region, replace cn-wulanchabu with your region ID.
SGLang:
python -m sglang.launch_server ... --reasoning-parser deepseek-r1
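Once a reasoning parser is enabled, the thinking part is returned separately from the final answer. The sketch below reads it with the OpenAI SDK; the reasoning_content field name matches what the vLLM and SGLang reasoning parsers emit, but it is accessed defensively in case your framework version differs.
from openai import OpenAI

# Sketch: read the parsed thinking separately from the final answer.
# Replace <EAS_ENDPOINT>, <EAS_TOKEN>, and <MODEL_NAME> with your own values.
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

resp = client.chat.completions.create(
    model="<MODEL_NAME>",
    messages=[{"role": "user", "content": "How many prime numbers are there below 20?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
message = resp.choices[0].message
# The field name depends on the framework and version, so avoid a hard AttributeError.
print("Thinking:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)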
How to connect the deployed model service to Chatbox or Dify
How to edit script
In the Service Configuration section of the deployment panel, edit the JSON: