EAS provides one-click LLM deployment for models like DeepSeek and Qwen, with built-in environment setup, performance tuning, and cost management.
Quick start: Deploy an open source model
This example deploys Qwen3-8B. The same process applies to other supported models.
Prerequisites
Before you begin, ensure the following:
-
Model file prepared in supported format (ONNX, PyTorch, TensorFlow SavedModel)
-
Model uploaded to an OSS bucket in the same region as EAS (see Regions and zones).
-
OSS bucket accessible (test download:
ossutil64 cp oss://bucket/path/model.onnx ./)
-
Account has sufficient balance for selected instance type
-
AccessKey configured with AliyunPAIFullAccess permission
-
RAM role (if using) has OSS read permission
Estimated time: 10-15 minutes (first-time setup: 30 minutes)
Step 1: Create a service
-
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
-
Click Deploy Service. In the Scenario-based Model Deployment area, click LLM Deployment.
-
Configure these parameters:
Parameter | Value
Model Settings | Select Public Model. Search for and select Qwen3-8B.
Inference Engine | Select vLLM (recommended, OpenAI API-compatible).
Deployment Template | Select Single Machine to auto-fill the recommended instance type and runtime image.
-
Click Deploy. Deployment takes about 5 minutes. Service status changes to Running when complete.
Note: If deployment fails, see Service deployment and status issues.
Verification:
-
Service status should change from "Pending" → "Running" within 5-10 minutes
-
If the service is stuck in "Pending" for more than 15 minutes, see the Troubleshooting section
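The status check above can be automated with a small polling loop. The following is a minimal sketch, not an EAS API: `get_status` is a hypothetical callable that you supply (for example, a wrapper around whatever tooling you use to read the service status), and the 15-minute limit and 15-second interval are assumptions taken from the guidance above.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],           # hypothetical: returns "Pending", "Running", ...
    timeout_s: float = 900.0,                # 15 minutes, per the guidance above
    interval_s: float = 15.0,                # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the service reports Running; return False after timeout_s."""
    waited = 0.0
    while waited <= timeout_s:
        if get_status() == "Running":
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

If the function returns False, treat the service as stuck in Pending and continue with the Troubleshooting section.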
Step 2: Verify deployment
Verify the service using online debugging.
-
Click the service name to go to the service details page. Switch to the Online Debugging tab.
-
Configure request parameters:
Parameter | Value
Request Method | POST
URL Path | Append /v1/chat/completions to your service URL. For example: /api/predict/llm_qwen3_8b_test/v1/chat/completions.
Body | { "model": "Qwen3-8B", "messages": [ {"role": "user", "content": "Hello!"} ], "max_tokens": 1024 }
Headers | Include Content-Type: application/json.
-
Click Send Request to receive a response containing the model's reply.
Verification:
-
API response should return HTTP 200 OK
-
Response time should be less than 30 seconds for first request (model loading)
-
Subsequent requests should be less than 5 seconds
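These checks can also be scripted. The sketch below is illustrative only: `send` is a hypothetical callable that performs one request and returns the HTTP status code, and the two limits are the 30-second and 5-second figures from the list above.

```python
import time
from typing import Callable

FIRST_REQUEST_LIMIT_S = 30.0   # first request may include model loading
WARM_REQUEST_LIMIT_S = 5.0     # subsequent requests


def check_latency(
    send: Callable[[], int],                     # hypothetical: sends one request, returns status code
    limit_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True if the request returned HTTP 200 within limit_s seconds."""
    start = clock()
    status = send()
    elapsed = clock() - start
    return status == 200 and elapsed < limit_s
```

Call it once with `FIRST_REQUEST_LIMIT_S` for the first request, then with `WARM_REQUEST_LIMIT_S` for later requests.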
Call using an API
Before making calls, go to the Overview tab on the service details page. Click View Endpoint Information to obtain the endpoint and token.
Call the service using this code:
cURL
curl -X POST <EAS_ENDPOINT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: <EAS_TOKEN>" \
  -d '{
    "model": "<model_name>",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "hello"
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.8,
    "stream": true
  }'
Where:
-
Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.
-
Replace <model_name> with the model name. For vLLM/SGLang, you can retrieve the model name from the model list API at /v1/models.
curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"
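To pick the model name out of the /v1/models response programmatically, a sketch like the following may help. It assumes the response follows the OpenAI-style list shape (a `data` array whose entries carry an `id`); the sample payload is illustrative, not captured from a real service.

```python
import json


def model_ids(models_response: str) -> list[str]:
    """Extract model names from the JSON body returned by /v1/models."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


# Illustrative payload in the OpenAI-style list format (assumption)
sample = '{"object": "list", "data": [{"id": "Qwen3-8B", "object": "model"}]}'
print(model_ids(sample))  # prints ['Qwen3-8B']
```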
OpenAI SDK
Install the OpenAI SDK: pip install openai. Use it to interact with the service.
from openai import OpenAI

# 1. Configure the client
# Replace <EAS_TOKEN> with the token of your deployed service
openai_api_key = "<EAS_TOKEN>"
# Replace <EAS_ENDPOINT> with the endpoint of your deployed service
openai_api_base = "<EAS_ENDPOINT>/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# 2. Get the model name
# For BladeLLM, set model = "" instead: it does not require a model name
# and does not support client.models.list().
models = client.models.list()
model = models.data[0].id
print(model)

# 3. Send a chat request
# Supports streaming (stream=True) and non-streaming (stream=False) output
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    model=model,
    top_p=0.8,
    temperature=0.7,
    max_tokens=1024,
    stream=stream,
)
if stream:
    for chunk in chat_completion:
        # Some chunks carry no text, so skip empty deltas instead of printing "None"
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
Python requests library
For scenarios without OpenAI SDK dependency, use the requests library.
import json
import requests

# Replace with your deployed service endpoint
EAS_ENDPOINT = "<EAS_ENDPOINT>"
# Replace with your deployed service token
EAS_TOKEN = "<EAS_TOKEN>"
# Replace with model name from /v1/models API
# For BladeLLM: Omit "model" field or set to ""
model = "<model_name>"

# Construct API endpoint URL
url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

# Enable streaming for real-time responses
stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]

# Build request payload
req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.7,  # Controls randomness (0.0-2.0)
    "top_p": 0.8,        # Controls diversity (0.0-1.0)
    "max_tokens": 1024,  # Maximum response length
    "model": model,
}

# Send POST request with timeout and error handling
try:
    response = requests.post(
        url,
        json=req,
        headers=headers,
        stream=stream,
        timeout=30,  # Prevent hanging on slow model
    )
    response.raise_for_status()  # Raise error for 4xx/5xx

    # Process response based on streaming mode
    if stream:
        # Handle Server-Sent Events (SSE) format
        for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
            msg = chunk.decode("utf-8")
            if msg.startswith("data:"):
                # Strip the "data:" prefix with or without a following space
                info = msg[len("data:"):].strip()
                if info == "[DONE]":
                    break
                resp = json.loads(info)
                content = resp["choices"][0]["delta"].get("content")
                if content is not None:
                    print(content, end="", flush=True)
    else:
        # Handle non-streaming response
        resp = response.json()
        print(resp["choices"][0]["message"]["content"])
except requests.exceptions.Timeout:
    print("Error: Request timeout - model may be loading or overloaded")
except requests.exceptions.HTTPError as e:
    print(f"HTTP Error {e.response.status_code}: {e.response.text}")
except Exception as e:
    print(f"Unexpected error: {e}")
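The SSE handling in the streaming branch above can also be factored into a small pure function, which makes it easier to unit test without a live service. This is a sketch under the assumption that each event line has the OpenAI-compatible form `data: {json}` and that the stream ends with `data: [DONE]`; `parse_sse_line` is a name introduced here for illustration.

```python
import json
from typing import Optional


def parse_sse_line(raw: bytes) -> Optional[str]:
    """Return the text delta carried by one SSE line, or None if it has none."""
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None                      # blank keep-alive lines, comments, other fields
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None                      # end-of-stream marker carries no text
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")
```

Because the function returns None for both the end-of-stream marker and empty events, a caller that needs to stop early should still check for `[DONE]` itself.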
Build a local Web UI with Gradio
Gradio is a Python library for interactive ML interfaces. To run the WebUI locally:
-
Download the code
-
Prepare the environment
Requires Python 3.10 or later. Install the dependencies:
pip install openai gradio
-
Start the web application
Run the following command in your terminal. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.
python webui_client.py --eas_endpoint "<EAS_ENDPOINT>" --eas_token "<EAS_TOKEN>"
-
After the application starts successfully, a local URL is displayed, usually http://127.0.0.1:7860. Open this URL in a browser to access the UI.
Integrate with third-party applications
Integrate EAS with OpenAI API-compatible clients and tools using your service endpoint, token, and model name.
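Dify and Chatbox, described below, expect the endpoint with /v1 appended. A small helper such as the following sketch can normalize the three values before you paste them into a client. The function name and the returned keys are illustrative and do not correspond to any client's configuration format.

```python
def openai_compatible_settings(endpoint: str, token: str, model: str) -> dict:
    """Normalize the EAS endpoint so that it ends with exactly one /v1."""
    base = endpoint.rstrip("/")
    if not base.endswith("/v1"):
        base += "/v1"
    # Keys are illustrative; map them to the fields of the client you use
    return {"base_url": base, "api_key": token, "model": model}
```

For example, passing an endpoint with or without a trailing slash, or one that already ends in /v1, yields the same `base_url`.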
Dify
-
Install the OpenAI-API-compatible model provider
Go to your profile picture > Settings > Model Providers. If OpenAI-API-compatible is not in the Model List, find it below and click Install.

-
Add a model
Click Add Model on the OpenAI-API-compatible card and configure:
-
Model Type: Select LLM.
-
Model Name: Retrieve from the /v1/models API (vLLM). Example: Qwen3-8B.
-
API Key: Enter the EAS service token.
-
API endpoint URL: Enter the EAS Internet endpoint with /v1 appended.
-
Usage
-
On the Dify main page, click Create Blank App. Select Chatflow, enter the app name, and click Create.
-
Click the LLM node, select the model you added, and set the context and prompt.

-
Click Preview in the upper-right corner and enter a question.

Chatbox
-
Go to Chatbox and install the client, or click Launch Web App for the web version.
-
Click Settings and add a model provider. Enter a name (such as pai) and select OpenAI API Compatible as the API Mode.

-
Select the pai provider and configure:
-
API Key: Enter the EAS service token.
-
API Host: Enter the Internet endpoint of the EAS service with /v1 appended.
-
API Path: Leave empty.
-
Model: Click Get to retrieve the model. For BladeLLM (does not support API retrieval), click New and enter the model name manually.

-
Click New Chat and select the model in the lower-right corner of the input box.

Cherry Studio
-
Install the client
Visit Cherry Studio to download and install the client.
You can also download it from https://github.com/CherryHQ/cherry-studio/releases.
-
Configure the model service
-
Click Settings in the lower-left corner. In Model Service, click Add. Set Provider Name (such as PAI) and Provider Type to OpenAI. Click OK.
-
Enter the EAS service token as API Key and the Internet endpoint as API Address.
-
Click Add and enter the model name as Model ID. Retrieve the name from the /v1/models API (for example, Qwen3-8B). The name is case-sensitive.
-
Click Test next to API Key to verify connectivity.
-
Test the model
Return to the dialog box, select the model at the top, and start a conversation.

Billing
The following items are billable. For details, see Billing of Elastic Algorithm Service (EAS).
-
Compute fees: Primary cost component. Choose pay-as-you-go or subscription billing when creating the service.
-
Storage fees: Custom model files stored in OSS incur standard OSS storage fees.
Production deployment
Choose a model
-
Define your application scenario:
-
General-purpose conversation: Choose an instruction-tuned model, not a foundation model.
-
Code generation: Choose a specialized code model such as the Qwen3-Coder series for better performance on code tasks.
-
Domain-specific tasks: For specialized domains such as finance or law, use a domain-specific fine-tuned model or fine-tune a general-purpose model.
-
Performance and cost: Larger models are more capable but cost more to deploy. Start with a smaller model (such as 7B) and scale up if needed.
-
Consult authoritative benchmarks: Use leaderboards such as OpenCompass and LMSys Chatbot Arena for objective evaluations across reasoning, coding, and math.
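To make the size and cost trade-off concrete, the GPU memory needed just to hold the model weights can be estimated as parameter count times bytes per parameter. The sketch below is a rough lower bound only: it ignores the KV cache, activations, and runtime overhead, so the memory actually required by an inference engine is higher.

```python
# Bytes needed to store one parameter at common precisions
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Weights-only memory in GB (10^9 bytes) for a model of the given size."""
    # billions of parameters x bytes per parameter = gigabytes
    return params_billion * BYTES_PER_PARAM[dtype]


print(weight_memory_gb(8))          # nominal 8B model in FP16: prints 16.0
print(weight_memory_gb(8, "int4"))  # same model quantized to INT4: prints 4.0
```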
Choose an inference engine
-
vLLM/SGLang: Mainstream open-source engines with broad model support and extensive community documentation. Easy to integrate and troubleshoot.
-
BladeLLM: Alibaba Cloud PAI inference engine, optimized for Qwen series with higher performance and lower GPU memory usage.
Inference optimization
-
Deploy an LLM intelligent router: Dynamically distributes requests across inference instances based on real-time metrics (token throughput, GPU memory). Improves resource utilization for multi-instance deployments.
-
Deploy MoE models using expert parallelism and PD separation: Uses expert parallelism (EP) and Prefill-Decode (PD) separation to increase MoE model throughput and reduce deployment costs.
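As an illustration of what metric-based routing means, the toy sketch below picks an instance from reported GPU memory use and token throughput. It is not the algorithm used by the EAS LLM intelligent router; the class, the field names, and the 0.9 memory limit are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class InstanceMetrics:
    name: str
    tokens_per_s: float   # current token throughput, used here as a load indicator
    gpu_mem_used: float   # fraction of GPU memory in use, 0.0-1.0


def pick_instance(instances: list[InstanceMetrics], mem_limit: float = 0.9) -> InstanceMetrics:
    """Prefer instances below the memory limit, then the least loaded one."""
    if not instances:
        raise ValueError("no instances available")
    # Fall back to all instances if every one is above the limit
    candidates = [i for i in instances if i.gpu_mem_used < mem_limit] or instances
    return min(candidates, key=lambda i: (i.gpu_mem_used, i.tokens_per_s))
```

A production router weighs more signals than this, which is the reason to use the managed component instead of client-side logic.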
Troubleshooting
Model not visible in dropdown after deployment
Root cause: Model deployment incomplete or UI cache issue.
Solution:
-
Check if service status shows Running. If Pending or Starting, wait for deployment to complete.
-
Refresh the browser (F5) to clear the UI cache.
-
On the service details page, click the Logs tab to check for errors.
Prevention: Wait for Running status (5–10 minutes) before accessing the model.
Deployment failed - insufficient resources
Root cause: Selected instance type unavailable in the current region. For regional availability, see Regions and zones.
Solution:
-
Return to the deployment page and select a different GPU instance type (for example, A10 to T4).
-
Try a region with higher resource availability, such as cn-hangzhou, cn-beijing, or cn-shanghai.
-
Set Resource Type to EAS Resource Group to use dedicated resources.
Prevention: Check resource availability before deployment.
Model loading timeout
Root cause: Model file too large, OSS access permissions missing, or network latency.
Solution:
-
Verify that the service account has read access to the model's OSS bucket.
-
For models exceeding 50 GB, use a quantized version (INT8 or INT4) to reduce loading time.
-
Set a longer initialization timeout in the deployment configuration (for example, 600 seconds for large models).
Prevention: Test OSS access before deployment. Expect approximately 1–2 minutes per 10 GB for model loading.
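The loading-time rule of thumb above can be turned into a quick check of whether a configured initialization timeout is likely to be long enough. This is a back-of-the-envelope sketch that uses only the 1–2 minutes per 10 GB figure; actual loading time depends on storage and network conditions.

```python
def estimate_load_seconds(model_size_gb: float) -> tuple[float, float]:
    """Return (fast, slow) loading-time estimates at 1-2 minutes per 10 GB."""
    units = model_size_gb / 10.0
    return (units * 60.0, units * 120.0)


def timeout_is_sufficient(model_size_gb: float, timeout_s: float) -> bool:
    """True if the timeout covers the slow end of the estimate."""
    _, slow = estimate_load_seconds(model_size_gb)
    return slow <= timeout_s


print(estimate_load_seconds(50))       # prints (300.0, 600.0)
print(timeout_is_sufficient(80, 600))  # prints False
```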
FAQ
Q: What do I do if the service is stuck in Pending?
Troubleshoot as follows:
-
Check the instance status: On the service details page, check instance status in the Service Instance section. Out of Stock indicates insufficient public resources.
-
Solutions (in order of priority):
-
Change the instance type. Select a different GPU model on the deployment page.
-
Use dedicated resources. Set Resource Type to EAS Resource Group. Create the resource group in advance.
-
Preventive measures:
-
Enterprise users: create dedicated resource groups to avoid public resource limits.
-
During peak hours, test in multiple regions.
Q: Call errors
-
Error: Unsupported Media Type: Only 'application/json' is allowed
Ensure request headers include Content-Type: application/json.
-
Error: The model '<model_name>' does not exist.
vLLM requires the correct model name. Retrieve it from the /v1/models API.
-
Error: 403 Forbidden - Disable the 'use free tier only' mode
Occurs when calling Model Studio API models (such as Qwen2.5-VL via DashScope) with the free quota exhausted. Disable the "use free tier only" setting in the Model Studio console. For details, see Model Studio free quota management. Models deployed on PAI-EAS are not affected and use standard EAS pricing.
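The first two errors can be caught on the client before a request is sent. The following is a minimal sketch; `preflight` is a name introduced here, and `available_models` is assumed to be the list of names you retrieved from /v1/models.

```python
def preflight(headers: dict, model: str, available_models: list[str]) -> list[str]:
    """Return a list of problems that would trigger the call errors above."""
    problems = []
    # Header names are case-insensitive; the value may carry a charset suffix
    lowered = {k.lower(): v for k, v in headers.items()}
    content_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if content_type != "application/json":
        problems.append("Content-Type must be application/json")
    # Model names are compared exactly, including case
    if model not in available_models:
        problems.append(f"model '{model}' is not in the /v1/models list")
    return problems
```

An empty list means neither problem was detected; it does not guarantee that the call succeeds.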
For more information, see the EAS FAQ.