Elastic Algorithm Service (EAS) provides a one-stop solution for deploying large language models. Deploy popular models like DeepSeek and Qwen with a single click, simplifying environment configuration, performance tuning, and cost management.
Quick Start: Deploy an open source model
This section uses the deployment of the open source model Qwen3-8B as an example. The same process applies to other supported models.
Step 1: Create a service
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
Click Deploy Service. In the Scenario-based Model Deployment area, click LLM Deployment.
Configure the following key parameters:
Model Settings: Select Public Model. Search for and select Qwen3-8B.
Inference Engine: Select vLLM, which is recommended and compatible with the OpenAI API.
Deployment Template: Select Single Machine to automatically fill in the recommended instance type, runtime image, and other parameters.
Click Deploy. Deployment takes about 5 minutes and completes when the service status changes to Running.
Note: If the service deployment fails, see Service deployment and status issues.
Step 2: Verify with online debugging
After the service is deployed, you can use online debugging to verify that it is running correctly.
Click the service name to go to the service details page. Switch to the Online Debugging tab.
Configure the request parameters as follows:
Request Method: POST
URL Path: Append /v1/chat/completions to your service URL. For example: /api/predict/llm_qwen3_8b_test/v1/chat/completions.
Body:
{
  "model": "Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 1024
}
Headers: Include Content-Type: application/json.
Click Send Request to receive a response containing the model's reply.
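A successful call returns a response in the OpenAI chat completions format. The following is an abridged, illustrative example; the ID, token counts, and reply text will differ in your deployment:
{
  "id": "chatcmpl-xxxx",
  "object": "chat.completion",
  "model": "Qwen3-8B",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 10, "completion_tokens": 12, "total_tokens": 22}
}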

Call using an API
Before you make a call, go to the Overview tab on the service details page. Click View Endpoint Information to obtain the endpoint and token.
Call the service using one of the following methods:
cURL
curl -X POST <EAS_ENDPOINT>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: <EAS_TOKEN>" \
-d '{
  "model": "<model_name>",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "hello"
    }
  ],
  "max_tokens": 1024,
  "temperature": 0.7,
  "top_p": 0.8,
  "stream": true
}'
Where:
Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.
Replace <model_name> with the model name. For vLLM/SGLang, you can retrieve the model name from the model list API at /v1/models:
curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"
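For reference, the /v1/models API returns a model list in the OpenAI format. The following is an illustrative example; field values such as created and owned_by vary by deployment and inference engine:
{
  "object": "list",
  "data": [
    {
      "id": "Qwen3-8B",
      "object": "model",
      "created": 1700000000,
      "owned_by": "vllm"
    }
  ]
}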
OpenAI SDK
Install the OpenAI SDK: pip install openai. Use it to interact with the service.
from openai import OpenAI
# 1. Configure the client
# Replace with the token of your deployed service
openai_api_key = ""
# Replace with the endpoint of your deployed service
openai_api_base = "/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
# 2. Get the model name
# For BladeLLM, set model = "". BladeLLM does not require the model parameter and does not support getting the model name using client.models.list(). Set it to an empty string to meet the OpenAI SDK's mandatory parameter requirement.
models = client.models.list()
model = models.data[0].id
print(model)
# 3. Send a chat request
# Supports streaming (stream=True) and non-streaming (stream=False) output
stream = True
chat_completion = client.chat.completions.create(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "hello"},
],
model=model,
top_p=0.8,
temperature=0.7,
max_tokens=1024,
stream=stream,
)
if stream:
for chunk in chat_completion:
print(chunk.choices[0].delta.content, end="")
else:
result = chat_completion.choices[0].message.content
print(result)Python requests library
For scenarios without OpenAI SDK dependency, use the requests library.
import json
import requests
# Replace with the endpoint of your deployed service
EAS_ENDPOINT = ""
# Replace with the token of your deployed service
EAS_TOKEN = ""
# Replace with the model name. You can get it from the model list API at /v1/models. For BladeLLM, this API is not supported. You can omit the "model" field or set it to "".
model = ""
url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
"Content-Type": "application/json",
"Authorization": EAS_TOKEN,
}
stream = True
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "hello"},
]
req = {
"messages": messages,
"stream": stream,
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 1024,
"model": model,
}
response = requests.post(
url,
json=req,
headers=headers,
stream=stream,
)
if stream:
for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
msg = chunk.decode("utf-8")
# The following code processes streaming responses in Server-Sent Events (SSE) format
if msg.startswith("data:"):
info = msg[6:]
if info == "[DONE]":
break
else:
resp = json.loads(info)
if resp["choices"][0]["delta"].get("content") is not None:
print(resp["choices"][0]["delta"]["content"], end="", flush=True)
else:
resp = json.loads(response.text)
print(resp["choices"][0]["message"]["content"])Build a local Web UI with Gradio
Gradio is a user-friendly Python library for creating interactive machine learning interfaces. Follow these steps to run the Gradio WebUI locally.
Download the code
Prepare the environment
Requires Python 3.10 or later. Install the dependencies: pip install openai gradio.
Start the web application
Run the following command in your terminal. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.
python webui_client.py --eas_endpoint "<EAS_ENDPOINT>" --eas_token "<EAS_TOKEN>"
After the application starts successfully, a local URL is displayed, usually http://127.0.0.1:7860. Open this URL in a browser to access the UI.
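If you want a sense of what the client does before downloading the code, the following is a minimal sketch of such a Gradio chat client, assuming the service exposes an OpenAI-compatible API. The file name webui_client.py, the argument names, and the tuple-style chat history are illustrative and may differ from the official sample.
import argparse

import gradio as gr
from openai import OpenAI

# Parse the endpoint and token from the command line (argument names are illustrative)
parser = argparse.ArgumentParser()
parser.add_argument("--eas_endpoint", required=True)
parser.add_argument("--eas_token", required=True)
args = parser.parse_args()

# Create an OpenAI-compatible client that points at the EAS service
client = OpenAI(api_key=args.eas_token, base_url=f"{args.eas_endpoint}/v1")
model = client.models.list().data[0].id  # for BladeLLM, set the model name manually

def chat(message, history):
    # Rebuild the conversation from Gradio's (user, assistant) history and stream the reply
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    reply = ""
    for chunk in stream:
        reply += chunk.choices[0].delta.content or ""
        yield reply

# Serves on http://127.0.0.1:7860 by default
gr.ChatInterface(chat).launch()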
Integrate with third-party applications
Integrate EAS services with clients and development tools that support the OpenAI API by using the service endpoint, token, and model name.
Dify
Install the OpenAI-API-compatible model provider
Click your profile picture in the upper-right corner and select Settings. In the navigation pane on the left, select Model Providers. If OpenAI-API-compatible is not in the Model List, find it in the list below and click Install.

Add a model
Click Add Model in the lower-right corner of the OpenAI-API-compatible card and configure the parameters as follows:
Model Type: Select LLM.
Model Name: For vLLM deployments, send a GET request to the /v1/models API to retrieve the name. For example, enter Qwen3-8B.
API Key: Enter the EAS service token.
API endpoint URL: Enter the Internet endpoint of the EAS service. Note: Append /v1 to the end.
Usage
On the Dify main page, click Create Blank App. Select the Chatflow type, enter the application name and other information, and then click Create.
Click the LLM node, select the model you added, and set the context and prompt.

Click Preview in the upper-right corner and enter a question.

Chatbox
Go to Chatbox. Download and install the version for your device, or click Launch Web App to use the web version. This example uses macOS on an M3 chip.
Add a model provider. Click Settings, add a model provider, enter a name, such as pai, and select OpenAI API Compatible for the API Mode.

Select the pai model provider and configure the following parameters.
API Key: Enter the EAS service token.
API Host: Enter the Internet endpoint of the EAS service. Note: Append /v1 to the end.
API Path: Leave this empty.
Model: Click Get to add a model. If the inference engine is BladeLLM, which does not support retrieval through the API, click New to enter the model name.

Test the conversation. Click New Chat. In the lower-right corner of the text input box, select the model service.

Cherry Studio
Install the client
Visit Cherry Studio to download and install the client.
You can also download it from https://github.com/CherryHQ/cherry-studio/releases.
Configure the model service
Click the Settings button in the lower-left corner. In the Model Service section, click Add. For Provider Name, enter a custom name, such as PAI. Set the Provider Type to OpenAI. Click OK.
For API Key, enter the EAS service token. For API Address, enter the Internet endpoint of the EAS service.
Click Add. For Model ID, enter the model name. For vLLM deployments, send a GET request to the /v1/models API to retrieve the name. For example, enter Qwen3-8B. Note that the name is case-sensitive.
Next to the API Key input box, click Test to confirm connectivity.
Quickly test the model
Return to the dialog box. At the top, select the model and start a conversation.

Billing
The following items are billable. For more information, see Billing of Elastic Algorithm Service (EAS).
Compute fees: These fees are the primary cost. When creating an EAS service, choose pay-as-you-go or subscription resources based on your needs.
Storage fees: If you use a custom model, the model files are stored in Object Storage Service (OSS). You will be charged for OSS storage based on your usage.
Going live
Choose the right model
Define your application scenario:
General-purpose conversation: Choose an instruction-tuned model, not a foundation model. This ensures the model can understand and follow your instructions.
Code generation: Choose a specialized code model, such as the Qwen3-Coder series. These models typically perform much better on code-related tasks than general-purpose models.
Domain-specific tasks: If the task is highly specialized, such as in finance or law, consider finding a model that has been fine-tuned for that domain. You can also fine-tune a general-purpose model yourself.
Performance and cost: Generally, models with more parameters are more capable. However, they also require more compute resources to deploy, which increases inference costs. We recommend starting with a smaller model, such as a 7B model, for validation. If its performance does not meet your needs, gradually try larger models.
Consult authoritative benchmarks: You can refer to industry-recognized leaderboards such as OpenCompass and the LMSys Chatbot Arena. These leaderboards provide objective evaluations of models across multiple dimensions, such as reasoning, coding, and math. They can offer valuable guidance for model selection.
Choose the right inference engine
vLLM/SGLang: These are mainstream choices in the open source community. They offer broad model support and extensive community documentation and examples. This makes them easy to integrate and troubleshoot.
BladeLLM: This is an inference engine developed by the Alibaba Cloud PAI team. It is deeply optimized for specific models, especially the Qwen series. It may provide higher performance and lower GPU memory usage.
Inference optimization
Deploy an LLM intelligent router: This feature dynamically distributes requests based on real-time metrics such as token throughput and GPU memory usage. It balances the computing power and memory allocation across inference instances. This is suitable for scenarios where you deploy multiple inference instances and load imbalance is expected. It improves cluster resource utilization and system stability.
Deploy MoE models using expert parallelism and PD separation: For Mixture-of-Experts (MoE) models, this approach uses techniques such as expert parallelism (EP) and Prefill-Decode (PD) separation to increase inference throughput and reduce deployment costs.
FAQ
Q: What should I do if the service is stuck in the Pending state and won't start?
Follow these steps to troubleshoot the issue:
Check the instance status: On the service list page, click the service name to go to the service details page. In the Service Instance section, check the instance status. If it shows Out of Stock, this indicates that the public resource group has insufficient resources.
Solutions (in order of priority):
Solution 1: Change the instance type. Return to the deployment page and select a different GPU model.
Solution 2: Use dedicated resources. Set Resource Type to EAS Resource Group to use dedicated resources. You must create the resource group in advance.
Preventive measures:
To avoid being limited by public resources, enterprise users should create dedicated resource groups.
During peak hours, we recommend testing in multiple regions.
Q: What should I do if a call returns an error?
The call returns the error Unsupported Media Type: Only 'application/json' is allowed: Ensure that the request headers include Content-Type: application/json.
The call returns the error The model '<model_name>' does not exist: The vLLM inference engine requires the model field to be set correctly. You can retrieve the model name by calling the /v1/models API with a GET request.
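For example, assuming the same <EAS_ENDPOINT> and <EAS_TOKEN> placeholders as in the cURL example above, the following request lists the available model names:
curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"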
For more information, see the EAS FAQ.