Deploy, fine-tune, and evaluate Qwen3 models in Model Gallery using SGLang, vLLM, or BladeLLM inference engines.
Deploy and invoke models
Deploy a model
This example deploys Qwen3-235B-A22B with SGLang.
1. Go to the Model Gallery page.
   1. Log on to the PAI console. In the upper-left corner, select a region with available compute resources.
   2. In the left navigation pane, select Workspace List, and click the workspace name.
   3. In the left navigation pane, choose QuickStart > Model Gallery.
2. On the Model Gallery page, find and click Qwen3-235B-A22B to view model details.
3. In the upper-right corner, click Deploy. Configure the following parameters, keep the default settings for the others, and deploy the model to PAI-EAS.
   - Deployment Method: Set Inference Engine to SGLang and Deployment Template to Single Machine.
   - Resource Information: For Resource Type, select Public Resources. Recommended specifications are provided. For minimum configuration requirements, see Computing power and token requirements.
     Important: If no resource specifications are available, public resources in this region are out of stock. Try the following solutions:
     - Switch regions. For example, China (Ulanqab) has a large inventory of Lingjun preemptible resources (ml.gu7ef.8xlarge-gu100, ml.gu7xf.8xlarge-gu108, ml.gu8xf.8xlarge-gu108, ml.gu8tf.8.40xlarge). Preemptible resources may be reclaimed, so monitor your bid.
     - Use an EAS resource group. Go to EAS Subscription for Dedicated Resources to purchase dedicated EAS resources.
Test the deployment
At the bottom of the Service Details page, click Online Debugging.

Call the API
- Retrieve the service endpoint and token:
  1. Go to Model Gallery > Task Management > Deployment. Click the deployed service name to view details.
  2. Click View Endpoint Information to obtain the Internet endpoint and token.
The following examples show how to call the /v1/chat/completions chat API for an SGLang deployment. Replace <EAS_ENDPOINT> with the service endpoint and <EAS_TOKEN> with the service token.

cURL:

```shell
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: <EAS_TOKEN>" \
  -d '{
        "model": "<Model name, obtained from the /v1/models API>",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "hello!"}
        ]
      }' \
  <EAS_ENDPOINT>/v1/chat/completions
```

Python (OpenAI SDK):

```python
from openai import OpenAI

##### API configuration #####
# Replace <EAS_ENDPOINT> with the service endpoint and <EAS_TOKEN> with the service token.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# The model name is returned by the /v1/models API.
models = client.models.list()
model = models.data[0].id
print(model)

stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    model=model,
    max_completion_tokens=2048,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
```
The invocation method varies by deployment method. For more information, see LLM API calls.
Integrate third-party applications
To connect to Chatbox, Dify, or Cherry Studio, see Integrate third-party clients.
Advanced configuration
Modify the service's JSON configuration to enable advanced features, such as adjusting token limits or enabling tool calling (Function Calling).
Procedure: On the deployment page, go to the Service Configuration section and edit the JSON. If the service is already deployed, update the service to open its deployment page.

Modify token limits
Qwen3 models natively support a token length of 32,768. Use RoPE scaling to support a maximum token length of 131,072, with some performance loss (the YaRN factor of 4.0 extends the native context to 4 × 32,768 = 131,072 tokens). Modify the containers.script field in the service configuration JSON:
- vLLM:

  vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072

- SGLang:

  python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
Parse tool calls
vLLM and SGLang support parsing tool call content generated by the model into structured messages. To enable this, modify the containers.script field in the service configuration JSON:
- vLLM:

  vllm serve ... --enable-auto-tool-choice --tool-call-parser hermes

- SGLang:

  python -m sglang.launch_server ... --tool-call-parser qwen25
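With a tool call parser enabled, the assistant message returns structured tool calls (a function name plus a JSON-encoded argument string) instead of raw text, and the client runs the matching local function. The sketch below illustrates that client-side dispatch under stated assumptions: the `get_weather` tool, its schema, and the sample parsed call are hypothetical, not from this article.

```python
import json

# Hypothetical tool schema, as it would be passed in the `tools`
# parameter of the chat.completions request (illustrative only).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch_tool_call(name, arguments_json):
    """Run the local function named by a structured tool call."""
    registry = {"get_weather": lambda city: f"Sunny in {city}"}
    args = json.loads(arguments_json)  # the engine returns arguments as a JSON string
    return registry[name](**args)

# With the parser enabled, one parsed call in the response looks like:
call = {"name": "get_weather", "arguments": '{"city": "Beijing"}'}
print(dispatch_tool_call(call["name"], call["arguments"]))  # Sunny in Beijing
```

The result is then appended to the conversation as a `tool` role message so the model can produce a final answer.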
Control thinking mode
Qwen3 uses thinking mode by default. Control this feature with a hard switch, which completely disables thinking, or a soft switch, which lets the model follow user instructions on whether to think.
Use soft switch /no_think
Sample request body:
```json
{
  "model": "<MODEL_NAME>",
  "messages": [
    {
      "role": "user",
      "content": "/no_think Hello!"
    }
  ],
  "max_tokens": 1024
}
```
Use hard switch
- Control with an API parameter (for vLLM and SGLang): Add the chat_template_kwargs parameter to the API call. Replace <EAS_ENDPOINT> with the service endpoint, <EAS_TOKEN> with the service token, and <MODEL_NAME> with the actual model name retrieved from the /v1/models API.

  cURL:

  ```shell
  curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
          "model": "<MODEL_NAME>",
          "messages": [
            {"role": "user", "content": "Give me a short introduction to large language models."}
          ],
          "temperature": 0.7,
          "top_p": 0.8,
          "max_tokens": 8192,
          "presence_penalty": 1.5,
          "chat_template_kwargs": {"enable_thinking": false}
        }' \
    <EAS_ENDPOINT>/v1/chat/completions
  ```

  Python (OpenAI SDK):

  ```python
  from openai import OpenAI

  # Replace <EAS_ENDPOINT> with the service endpoint and <EAS_TOKEN> with the service token.
  openai_api_key = "<EAS_TOKEN>"
  openai_api_base = "<EAS_ENDPOINT>/v1"

  client = OpenAI(
      api_key=openai_api_key,
      base_url=openai_api_base,
  )

  chat_response = client.chat.completions.create(
      model="<MODEL_NAME>",
      messages=[
          {"role": "user", "content": "Give me a short introduction to large language models."},
      ],
      temperature=0.7,
      top_p=0.8,
      presence_penalty=1.5,
      extra_body={"chat_template_kwargs": {"enable_thinking": False}},
  )
  print("Chat response:", chat_response)
  ```

- Disable by modifying the service configuration (for BladeLLM): Use a chat template that prevents the model from generating thinking content at startup.
  - On the model's introduction page in Model Gallery, check whether a method is provided to disable thinking mode for BladeLLM. For example, with Qwen3-8B, disable thinking mode by modifying the containers.script field in the service configuration JSON:

    blade_llm_server ... --chat_template /model_dir/no_thinking.jinja

  - Alternatively, write your own chat template, such as no_thinking.jinja, mount it from OSS for reading, and modify the containers.script field in the service configuration JSON.
Parse thinking content
To output the "think" part separately, modify the containers.script field in the service configuration JSON:
- vLLM:

  vllm serve ... --enable-reasoning --reasoning-parser qwen3

- SGLang:

  python -m sglang.launch_server ... --reasoning-parser deepseek-r1
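Once a reasoning parser is enabled, the assistant message carries the thinking text in a field separate from the final answer (commonly named reasoning_content in vLLM and SGLang responses). A minimal client-side sketch of reading the two parts; the sample payload below is illustrative, not a real response:

```python
# Illustrative assistant message as a dict, with the thinking text
# already separated out by the server-side reasoning parser.
message = {
    "role": "assistant",
    "reasoning_content": "The user greeted me, so a short greeting suffices.",
    "content": "Hello! How can I help you today?",
}

def split_thinking(msg):
    """Return (thinking, answer); thinking is empty when the model did not think."""
    return msg.get("reasoning_content") or "", msg["content"]

thinking, answer = split_thinking(message)
print("THINKING:", thinking)
print("ANSWER:", answer)
```

Without the parser, both parts arrive concatenated in `content` and the client would have to strip the think tags itself.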
Fine-tune models
- Qwen3-32B, 14B, 8B, 4B, 1.7B, and 0.6B models support Supervised Fine-Tuning (SFT) (full-parameter, LoRA, or QLoRA) and GRPO training.
- Submit training tasks with one click to train models for specific business scenarios.
Evaluate models
For detailed instructions on model evaluation, see Model evaluation and Best practices for LLM evaluation.
Computing power and token requirements
The following table lists the minimum configuration required to deploy each Qwen3 model and the maximum number of tokens supported by each inference framework on various instance types.
Among FP8 models, only Qwen3-235B-A22B has reduced computing power requirements compared to the original model. Computing power requirements for other FP8 models are the same as their non-FP8 counterparts and are not listed in the table. For example, to find computing power required by Qwen3-30B-A3B-FP8, refer to Qwen3-30B-A3B.
| Model | Maximum tokens (input + output), SGLang accelerated deployment | Maximum tokens (input + output), vLLM accelerated deployment | Minimum configuration |
| --- | --- | --- | --- |
| Qwen3-235B-A22B | 32768 (with RoPE scaling: 131072) | 32768 (with RoPE scaling: 131072) | 8 × GPU H / GU120 (8 × 96 GB VRAM) |
| Qwen3-235B-A22B-FP8 | 32768 (with RoPE scaling: 131072) | 32768 (with RoPE scaling: 131072) | 4 × GPU H / GU120 (4 × 96 GB VRAM) |
| Qwen3-30B-A3B, Qwen3-30B-A3B-Base, Qwen3-32B | 32768 (with RoPE scaling: 131072) | 32768 (with RoPE scaling: 131072) | 1 × GPU H / GU120 (96 GB VRAM) |
| Qwen3-14B, Qwen3-14B-Base | 32768 (with RoPE scaling: 131072) | 32768 (with RoPE scaling: 131072) | 1 × GPU L / GU60 (48 GB VRAM) |
| Qwen3-8B, Qwen3-4B, Qwen3-1.7B, Qwen3-0.6B, Qwen3-8B-Base, Qwen3-4B-Base, Qwen3-1.7B-Base, Qwen3-0.6B-Base | 32768 (with RoPE scaling: 131072) | 32768 (with RoPE scaling: 131072) | 1 × A10 / GU30 (24 GB VRAM) |

Important: The 8B model requires 48 GB of VRAM when RoPE scaling is enabled.
Frequently asked questions
Do model services deployed on PAI support session functionality?
No. The model service API deployed on PAI is stateless. Each call is independent, and the server does not retain any context or session state between requests.
To implement multi-turn conversations, the client must save conversation history and include it in subsequent model invocation requests. For a request example, see How do I implement a multi-turn conversation?
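The bookkeeping described above can be sketched as follows. This is a minimal illustration of client-side history management for a stateless API: `call_model` stands in for `client.chat.completions.create(messages=history, ...)`, and the stub model here is a placeholder, not a real endpoint.

```python
# Client-side multi-turn state for a stateless chat API: the client keeps
# the full history and resends it on every call.
def make_turn(history, user_text, call_model):
    """Append the user turn, send the full history, and store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # e.g. client.chat.completions.create(messages=history, ...)
    history.append({"role": "assistant", "content": reply})
    return reply

# Stub model: reports how many messages of context it received.
fake_model = lambda msgs: f"(model saw {len(msgs)} messages of context)"

history = [{"role": "system", "content": "You are a helpful assistant."}]
make_turn(history, "Hello!", fake_model)
print(make_turn(history, "What did I just say?", fake_model))
# The second call sends 4 messages: system, user, assistant, user.
```

Because every request carries the whole history, long conversations eventually hit the model's token limit, and the client is responsible for truncating or summarizing older turns.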