Manually deploying a large language model (LLM) involves complex environment configuration, performance tuning, and cost management. Elastic Algorithm Service (EAS) offers a one-stop solution to deploy popular LLMs such as DeepSeek and Qwen with a single click.
Step 1: Deploy an LLM service
This topic demonstrates deploying Qwen3-8B from Public Models.
A Public Model is a model with a pre-configured Deployment Template, enabling one-click deployment without model file preparation. If you select a custom model, you must mount the model files from a service like Object Storage Service (OSS).
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Inference Service tab, click Deploy Service, and in the Scenario-based Model Deployment area, click LLM Deployment.
On the LLM Deployment page, configure the following key parameters.
Model Settings: Select Public Model, then search for and select Qwen3-8B from the list.
Inference Engine: We recommend SGLang/vLLM for their high compatibility with the OpenAI API standard. This guide uses vLLM. For more information, see Select a suitable inference engine.
Deployment Template: Select Single-Node. The system automatically populates the recommended Instance Type, image, and other parameters from the template.
Click Deploy. The service deployment takes about 5 minutes. When the service status changes to Running, the deployment is complete.
Note: If the service deployment fails, see Abnormal service status for solutions.
Step 2: Debug Online
After deployment, first verify the service is running correctly. Click the target service name to go to the detail page, switch to the Online Debugging tab, and then construct and send a request as follows.
Select the POST method.
Append the path /v1/chat/completions to the end of the auto-filled URL.
Make sure that the Headers include Content-Type: application/json.
Fill in the Body: When using the vLLM Inference Engine, you must replace the model value with the correct model name. To obtain the model name, send a GET request to the /v1/models endpoint (a quick way to do this is sketched after these steps). Because you deployed Qwen3-8B in Step 1, you must replace <model_name> with Qwen3-8B.
{
    "model": "<model_name>",
    "messages": [
        {
            "role": "user",
            "content": "Hello!"
        }
    ],
    "max_tokens": 1024
}
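If you prefer to look up the model name from a terminal instead of the console, the following minimal sketch queries the /v1/models endpoint with the Python requests library. It assumes requests is installed and that <EAS_ENDPOINT> and <EAS_TOKEN> are replaced with the values described in Step 3; BladeLLM deployments do not expose this endpoint.
# Minimal sketch: list the model names exposed by a vLLM/SGLang service.
# Replace the placeholders with your service's endpoint and token.
import requests

resp = requests.get(
    "<EAS_ENDPOINT>/v1/models",
    headers={"Authorization": "<EAS_TOKEN>"},
)
print([m["id"] for m in resp.json()["data"]])  # for example: ['Qwen3-8B']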

Step 3: Call the LLM service
Before you make a call, go to the Overview tab of the service details page, click View Endpoint Information, and obtain the endpoint and token. These values are referenced as <EAS_ENDPOINT> and <EAS_TOKEN> in the following examples.
API call
The handling of the model parameter differs significantly between Inference Engines:
vLLM/SGLang: The model value is configured as the model name, which can be obtained by sending a GET request to the /v1/models endpoint.
BladeLLM: The BladeLLM endpoint itself does not require the model parameter. However, when using the OpenAI SDK, this parameter is mandatory on the client side. To ensure compatibility, you can set it to an empty string "". For more information, see BladeLLM service invocation parameter configuration.
Important: When using BladeLLM, you must explicitly set the max_tokens parameter in your request. Otherwise, the output is truncated to 16 tokens by default. The resulting request bodies are sketched below.
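Before the full examples, here is a rough illustration of the difference: the two request bodies below show how the same chat request might be written for each engine. The model names are placeholders, not values returned by your service.
# Sketch only: the same chat request written for each engine.
vllm_payload = {
    "model": "Qwen3-8B",  # must match a name returned by GET /v1/models
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,
}
bladellm_payload = {
    "model": "",          # ignored by BladeLLM; "" satisfies OpenAI SDK clients
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 1024,   # set explicitly, otherwise output is truncated to 16 tokens
}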
The following code provides examples of how to invoke the service:
OpenAI SDK
We recommend using the official Python SDK to interact with the service. Make sure you have the OpenAI SDK installed: pip install openai.
from openai import OpenAI
# 1. Configure the client
# Replace <EAS_TOKEN> with the token of the deployed service
openai_api_key = "<EAS_TOKEN>"
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service
openai_api_base = "<EAS_ENDPOINT>/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
# 2. Get the model name
# For BladeLLM, set model = "". BladeLLM does not require the model input parameter and does not support using client.models.list() to get the model name. Set it to an empty string to meet the OpenAI SDK's mandatory parameter requirement.
models = client.models.list()
model = models.data[0].id
print(model)
# 3. Initiate a chat request
# Supports streaming (stream=True) and non-streaming (stream=False) output
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    model=model,
    top_p=0.8,
    temperature=0.7,
    max_tokens=1024,
    stream=stream,
)
if stream:
    for chunk in chat_completion:
        print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
cURL
For quick testing or script integration, you can use cURL.
curl -X POST <EAS_ENDPOINT>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: <EAS_TOKEN>" \
-d '{
    "model": "<model_name>",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "hello"
        }
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.8,
    "stream": true
}'
Where:
Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service's Endpoint and Token.
Replace <model_name> with the model name. For vLLM/SGLang, you can obtain it from the model list endpoint <EAS_ENDPOINT>/v1/models. For BladeLLM, this endpoint is not supported, and you can omit this field or set it to "".
curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"
Python requests library
If you prefer not to add the OpenAI SDK dependency, you can use the requests library.
import json
import requests
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service
EAS_ENDPOINT = "<EAS_ENDPOINT>"
# Replace <EAS_TOKEN> with the token of the deployed service
EAS_TOKEN = "<EAS_TOKEN>"
# Replace <model_name> with the model name. You can get the name from the model list interface at <EAS_ENDPOINT>/v1/models. For BladeLLM, this interface is not supported. You can omit the "model" field or set it to "".
model = "<model_name>"
url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}
stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]
req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024,
    "model": model,
}
response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)
if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        # The following code processes streaming responses in Server-Sent Events (SSE) format
        if msg.startswith("data:"):
            info = msg[6:]
            if info == "[DONE]":
                break
            else:
                resp = json.loads(info)
                if resp["choices"][0]["delta"].get("content") is not None:
                    print(resp["choices"][0]["delta"]["content"], end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])
Build a local WebUI
Gradio is a user-friendly Python library for quickly creating interactive interfaces for machine learning models. Follow these steps to run a Gradio WebUI locally.
Download the code: Download the appropriate code based on the inference engine you selected during deployment. Use the GitHub link if you have stable network access to GitHub. Otherwise, use the OSS link.
vLLM, SGLang: vLLM/SGLang_github, vLLM/SGLang_oss
BladeLLM: BladeLLM_github, BladeLLM_oss
Prepare the environment: Python 3.10 or later is required. Install the dependencies: pip install openai gradio.
Start the web application: Run the following command in your terminal. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service's Endpoint and Token.
python webui_client.py --eas_endpoint "<EAS_ENDPOINT>" --eas_token "<EAS_TOKEN>"
After the application starts successfully, a local URL (usually http://127.0.0.1:7860) is printed to your console. Open this URL in your browser to access the WebUI. A minimal self-built Gradio client is also sketched after these steps.
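If you only need a very small local chat page and do not want to download the script, the following sketch builds one with gr.ChatInterface. It assumes the vLLM/SGLang engine, Gradio 4.44 or later, and the OpenAI SDK; for BladeLLM, set model to an empty string instead of calling client.models.list().
# Minimal self-built Gradio client (sketch). Replace the placeholders first.
import gradio as gr
from openai import OpenAI

client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
model = client.models.list().data[0].id  # BladeLLM: use model = "" instead

def chat(message, history):
    # With type="messages", history is already a list of {"role", "content"} dicts.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    messages += history + [{"role": "user", "content": message}]
    stream = client.chat.completions.create(
        model=model, messages=messages, max_tokens=1024, stream=True
    )
    reply = ""
    for chunk in stream:
        reply += chunk.choices[0].delta.content or ""
        yield reply  # stream partial replies to the UI

gr.ChatInterface(chat, type="messages").launch()  # serves http://127.0.0.1:7860 by default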
Integrate with third-party applications
You can integrate EAS services with various clients and development tools that support the OpenAI API. The core configuration requires the service endpoint, the service token, and the model name.
Dify
Install the OpenAI-API-compatible model provider
Click your profile picture in the upper-right corner and select Settings. In the left-side navigation pane, choose Model Provider. If OpenAI-API-compatible does not appear under Models, find it in the list below and install it.

Add model
Click Add Model in the lower-right corner of the OpenAI-API-compatible card and configure the following parameters:
Model Type: Select LLM.
Model Name: For vLLM deployments, obtain the name by sending a GET request to the /v1/models endpoint. This example uses Qwen3-8B.
API Key: Enter the EAS service token.
API endpoint URL: Enter the public Endpoint of the EAS service. Note: Append /v1 to the end.
Test the model
On the Dify main page, click Create from Blank. Select the Chatflow type, enter an application name and other information, and then click Create.
Click the LLM node, select the model you added, and set the context and Prompt.

Click Preview in the upper-right corner and enter a question.

Chatbox
Go to Chatbox, download and install the appropriate version for your device, or directly Launch Web App. This guide uses macOS M3 as an example.
Add a model provider. Click Settings, add a model provider, and enter a name such as pai. For API Mode, select OpenAI API Compatible.
Select the pai model provider and configure the following parameters.
API Key: The EAS service token.
API Host: Enter the public endpoint of the EAS service. Note: Append /v1 to the end of the URL.
API Path: Leave this field blank.
Model: Click Fetch to add models. If the Inference Engine is BladeLLM, you cannot get models through this interface. Click New and enter the model name manually.

Test the chat. Click New Chat, and select the model service in the lower-right corner of the text input box.

Cherry Studio
Install the client
Visit Cherry Studio to download and install the client.
You can also download it from https://github.com/CherryHQ/cherry-studio/releases.
Configure the model service.
Click the settings button in the lower-left corner, and then click Add under the Model Provider section. Enter a custom name, such as PAI, in the Provider Name field, and select OpenAI as the provider type. Click OK.
Enter the EAS service Token in the API Key field and the public Endpoint of the EAS service in the API Host field.
Click Add. In the Model ID field, enter the model name. For vLLM deployment, obtain the name by sending a GET request to the /v1/models endpoint. This example uses Qwen3-8B. Note that the name is case-sensitive.
Click Check next to the API Key input box to verify connectivity.
Test the model
Return to the chat interface, select the model at the top, and start a conversation.

Billing
Costs may include the following. For more information, see Elastic Algorithm Service (EAS) billing details.
Compute fees: This is the main source of cost. When creating an EAS service, choose pay-as-you-go or subscription resources based on your needs.
Storage fees: If you use a custom model, files stored in OSS will incur storage fees.
Going live
Choose a suitable model
Define your application scenario:
General conversation: Be sure to choose an Instruction-Tuned Model, not a Foundation Model, to ensure the model can understand and follow your instructions.
Code generation: Choose specialized code models, such as the Qwen3-Coder series. They typically perform much better on code-related tasks than general-purpose models.
Domain-specific tasks: If the task is highly specialized, such as in finance or law, consider finding a model that has been fine-tuned for that domain or fine-tuning a general-purpose model yourself.
Balance performance and cost: Generally, a larger parameter count means a more capable model, but also one that requires more Computing Power for deployment, leading to higher inference costs. We recommend starting with a smaller model (such as a 7B model) for validation. If its performance does not meet your requirements, gradually try larger models.
Refer to authoritative benchmarks: You can refer to industry-recognized leaderboards like OpenCompass and LMSys Chatbot Arena. These benchmarks provide objective evaluations of models across dimensions, such as reasoning, coding, and math, offering valuable guidance for model selection.
Choose a suitable inference engine
vLLM/SGLang: As mainstream choices in the open-source community, they offer broad model support and extensive community documentation and examples, making them easy to integrate and troubleshoot.
BladeLLM: Developed by the Alibaba Cloud PAI team, BladeLLM is deeply optimized for specific models, especially the Qwen series, often achieving higher performance and lower GPU Memory consumption.
Optimize inference
LLM intelligent router: Dynamically distributes requests based on real-time metrics like token Throughput and GPU Memory usage. This balances the Computing Power and GPU Memory allocation across inference instances, improving cluster resource utilization and system stability. It is suitable for scenarios with multiple inference instances and an expected uneven request load.
Deploy MoE models based on expert parallelism and PD separation: For Mixture-of-Experts (MoE) models, this approach uses technologies like expert parallelism (EP) and Prefill-Decode (PD) separation to increase inference Throughput and reduce deployment costs.
FAQ
Error: Unsupported Media Type: Only 'application/json' is allowed
Ensure the request Headers include Content-Type: application/json.
Error: The model '<model_name>' does not exist.
The vLLM Inference Engine requires a correct model field. Obtain the model name by sending a GET request to the /v1/models endpoint. A small debugging sketch for both errors follows.
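When you hit either error, a quick way to see the exact server response is to send a minimal request and print the status code and body. The sketch below uses the Python requests library; the placeholder values must be replaced with your own.
# Debugging sketch: print the raw response to diagnose the errors above.
import requests

resp = requests.post(
    "<EAS_ENDPOINT>/v1/chat/completions",
    headers={
        "Content-Type": "application/json",  # missing -> Unsupported Media Type
        "Authorization": "<EAS_TOKEN>",
    },
    json={
        "model": "<model_name>",  # wrong name -> "The model ... does not exist"
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 16,
    },
)
print(resp.status_code)
print(resp.text)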
For more information, see EAS FAQ.