Deploying large language models on PAI-EAS

Quick start: Deploy an open source model

This example deploys Qwen3-8B. The same process applies to other supported models.

Prerequisites

Before you begin, ensure the following:

Model file prepared in supported format (ONNX, PyTorch, TensorFlow SavedModel)
Model uploaded to an OSS bucket in the same region as EAS (see Regions and zones).
OSS bucket accessible (test download: ossutil64 cp oss://bucket/path/model.onnx ./)
Account has sufficient balance for selected instance type
AccessKey configured with AliyunPAIFullAccess permission
RAM role (if using) has OSS read permission

Estimated time: 10-15 minutes (first-time setup: 30 minutes)

Step 1: Create a service

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
Click Deploy Service. In the Scenario-based Model Deployment area, click LLM Deployment.

Configure these parameters:

Parameter	Value
Model Settings	Select Public Model. Search for and select Qwen3-8B.
Inference Engine	Select vLLM (recommended, OpenAI API-compatible).
Deployment Template	Select Single Machine to auto-fill the recommended instance type and runtime image.

Click Deploy. Deployment takes about 5 minutes. Service status changes to Running when complete.

Note
If deployment fails, see Service deployment and status issues.

Verification:
- Service status should change from "Pending" → "Running" within 5-10 minutes
- If the service is stuck in "Pending" for more than 15 minutes, see the Troubleshooting section
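
The status check above can be automated with a small polling loop. The following is a minimal sketch, not an EAS API: `get_status` is a hypothetical callable that you supply (for example, a wrapper around whatever tool you use to read the service status), and the 15-minute default mirrors the threshold in the list above.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],   # hypothetical: returns the current status string
    timeout_s: float = 15 * 60,      # 15-minute threshold from the verification list
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status is "Running"; return False if the timeout elapses first."""
    waited = 0.0
    while True:
        if get_status() == "Running":
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Example with a fake status source: Pending twice, then Running.
_statuses = iter(["Pending", "Pending", "Running"])
print(wait_until_running(lambda: next(_statuses), sleep=lambda _: None))
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting.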

Step 2: Verify deployment

Verify the service using online debugging.

Click the service name to go to the service details page. Switch to the Online Debugging tab.

Configure request parameters:

Parameter	Value
Request Method	POST
URL Path	Append `/v1/chat/completions` to your service URL. For example: `/api/predict/llm_qwen3_8b_test/v1/chat/completions`.
Body	`{ "model": "Qwen3-8B", "messages": [ {"role": "user", "content": "Hello!"} ], "max_tokens": 1024 }`
Headers	Include `Content-Type: application/json`.

Click Send Request to receive a response containing the model's reply.

Verification:
- API response should return HTTP 200 OK
- Response time should be less than 30 seconds for first request (model loading)
- Subsequent requests should be less than 5 seconds
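
A successful non-streaming response carries the reply at `choices[0].message.content`, which is the same path the client code later in this document reads. The sketch below shows how that field might be extracted. The sample payload is illustrative only (not captured output), and real responses contain additional fields.

```python
import json


def extract_reply(body: str) -> str:
    """Return the assistant text from a non-streaming chat completion response body."""
    payload = json.loads(body)
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Illustrative payload only; real responses carry more fields.
sample_body = (
    '{"choices": [{"message": '
    '{"role": "assistant", "content": "Hello! How can I help?"}}]}'
)
print(extract_reply(sample_body))
```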

Call using an API

Before making calls, go to the Overview tab on the service details page. Click View Endpoint Information to obtain the endpoint and token.

Call the service using this code:

cURL

curl -X POST <EAS_ENDPOINT>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "hello"
            }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.8,
        "stream": true
    }'

Where:

Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.
Replace <model_name> with the model name. For vLLM/SGLang, you can retrieve the model name from the model list API at /v1/models.
```
curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"
```
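
The model list is returned as JSON. Assuming it follows the OpenAI-compatible shape that the SDK example in this document relies on (`data[0].id`), the model names could be pulled out as sketched below. The sample response is illustrative, not captured output.

```python
import json
from typing import List


def model_ids(models_response: str) -> List[str]:
    """Return every model ID listed in a /v1/models response body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


# Illustrative response body; real responses carry more fields per entry.
sample_models = '{"object": "list", "data": [{"id": "Qwen3-8B", "object": "model"}]}'
print(model_ids(sample_models))
```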

OpenAI SDK

Install the OpenAI SDK: pip install openai. Use it to interact with the service.

from openai import OpenAI

# 1. Configure the client
# Replace <EAS_TOKEN> with the token of your deployed service
openai_api_key = "<EAS_TOKEN>"
# Replace <EAS_ENDPOINT> with the endpoint of your deployed service
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# 2. Get the model name
# For BladeLLM, set model = "". It doesn't require model or support client.models.list().
models = client.models.list()
model = models.data[0].id
print(model)

# 3. Send a chat request
# Supports streaming (stream=True) and non-streaming (stream=False) output
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    model=model,
    top_p=0.8,
    temperature=0.7,
    max_tokens=1024,
    stream=stream,
)

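if stream:
    for chunk in chat_completion:
        # Some chunks (for example, the role or final chunk) carry no text
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()
else:
    result = chat_completion.choices[0].message.content
    print(result)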

Python requests library

For scenarios without OpenAI SDK dependency, use the requests library.

import json
import requests

# Replace with your deployed service endpoint
EAS_ENDPOINT = "<EAS_ENDPOINT>"
# Replace with your deployed service token
EAS_TOKEN = "<EAS_TOKEN>"
# Replace with model name from /v1/models API
# For BladeLLM: Omit "model" field or set to ""
model = "<model_name>"

# Construct API endpoint URL
url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

# Enable streaming for real-time responses
stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]

# Build request payload
req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.7,  # Controls randomness (0.0-2.0)
    "top_p": 0.8,        # Controls diversity (0.0-1.0)
    "max_tokens": 1024,  # Maximum response length
    "model": model,
}

# Send POST request with timeout and error handling
try:
    response = requests.post(
        url,
        json=req,
        headers=headers,
        stream=stream,
        timeout=30  # Prevent hanging on slow model
    )
    response.raise_for_status()  # Raise error for 4xx/5xx

    # Process response based on streaming mode
    if stream:
        # Handle Server-Sent Events (SSE) format
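        for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
            msg = chunk.decode("utf-8")
            if not msg.startswith("data:"):
                continue  # Skip blank keep-alive lines and non-data fields
            # Strip the "data:" prefix and any whitespace that follows it
            info = msg[len("data:"):].strip()
            if info == "[DONE]":
                break
            resp = json.loads(info)
            choices = resp.get("choices") or []
            if choices and choices[0]["delta"].get("content") is not None:
                print(choices[0]["delta"]["content"], end="", flush=True)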
    else:
        # Handle non-streaming response
        resp = json.loads(response.text)
        print(resp["choices"][0]["message"]["content"])

except requests.exceptions.Timeout:
    print("Error: Request timeout - model may be loading or overloaded")
except requests.exceptions.HTTPError as e:
    # Read the status and body from the response attached to the exception
    print(f"HTTP Error {e.response.status_code}: {e.response.text}")
except Exception as e:
    print(f"Unexpected error: {e}")
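
The streaming branch above depends on parsing Server-Sent Events lines. As a sketch under the same assumptions as the script (each event is a `data:` line holding a JSON chunk, and the stream ends with `[DONE]`), the parsing step can be isolated into a pure function that is easy to test without a live service. The sample lines are illustrative, not captured output.

```python
import json
from typing import Optional


def parse_sse_line(line: str) -> Optional[str]:
    """Return the content delta carried by one SSE line, or None if there is none."""
    if not line.startswith("data:"):
        return None                      # blank keep-alive lines and other fields
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None                      # end-of-stream marker
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


# Illustrative stream: a role-only chunk, two text chunks, and the end marker.
sse_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(text for text in map(parse_sse_line, sse_lines) if text))
```

This helper does not distinguish the end marker from an empty chunk. A caller that needs to stop reading early would still check for `[DONE]` itself.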

Build a local Web UI with Gradio

Gradio is a Python library for interactive ML interfaces. To run the WebUI locally:

Download the code

GitHub link | OSS link

Prepare the environment

Requires Python 3.10 or later. Install the dependencies: pip install openai gradio.

Start the web application

Run the following command in your terminal. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.
```
python webui_client.py --eas_endpoint "<EAS_ENDPOINT>" --eas_token "<EAS_TOKEN>"
```
After the application starts successfully, a local URL is displayed, usually http://127.0.0.1:7860. Open this URL in a browser to access the UI.

Integrate with third-party applications

Integrate EAS with OpenAI API-compatible clients and tools using your service endpoint, token, and model name.
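
Several of the tools below (Dify and Chatbox) expect the endpoint with `/v1` appended. The helper below is a small sketch of how that value might be normalized before it is pasted into a client. The function names and the example endpoint are hypothetical; the header layout mirrors the request examples earlier in this document.

```python
def openai_base_url(endpoint: str) -> str:
    """Return the endpoint with exactly one trailing /v1 segment."""
    base = endpoint.strip().rstrip("/")
    if not base.endswith("/v1"):
        base += "/v1"
    return base


def auth_headers(token: str) -> dict:
    """Headers used by the request examples in this document."""
    return {"Content-Type": "application/json", "Authorization": token}


# Hypothetical endpoint, for illustration only.
print(openai_base_url("http://example.invalid/api/predict/my_service/"))
```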

Dify

Install the OpenAI-API-compatible model provider

Go to your profile picture > Settings > Model Providers. If OpenAI-API-compatible is not in the Model List, find it below and click Install.
Add a model

Click Add Model on the OpenAI-API-compatible card and configure:
- Model Type: Select LLM.
- Model Name: Retrieve from the /v1/models API (vLLM). Example: Qwen3-8B.
- API Key: Enter the EAS service token.
- API endpoint URL: Enter the EAS Internet endpoint with /v1 appended.
Usage
1. On the Dify main page, click Create Blank App. Select Chatflow, enter the app name, and click Create.
2. Click the LLM node, select the model you added, and set the context and prompt.
3. Click Preview in the upper-right corner and enter a question.

Chatbox

Go to Chatbox and install the client, or click Launch Web App for the web version.
Click Settings and add a model provider. Enter a name (such as pai) and select OpenAI API Compatible as the API Mode.
Select the pai provider and configure:
- API Key: Enter the EAS service token.
- API Host: Enter the Internet endpoint of the EAS service with /v1 appended.
- API Path: Leave empty.
- Model: Click Get to retrieve the model. For BladeLLM (does not support API retrieval), click New and enter the model name manually.
Click New Chat and select the model in the lower-right corner of the input box.

Cherry Studio

Install the client

Visit Cherry Studio to download and install the client.

You can also download it from https://github.com/CherryHQ/cherry-studio/releases.
Configure the model service.
1. Click Settings in the lower-left corner. In Model Service, click Add. Set Provider Name (such as PAI) and Provider Type to OpenAI. Click OK.
2. Enter the EAS service token as API Key and the Internet endpoint as API Address.
3. Click Add and enter the model name as Model ID. Retrieve the name from the /v1/models API (for example, Qwen3-8B). The model ID is case-sensitive.
4. Click Test next to API Key to verify connectivity.
Test the model

Return to the dialog box, select the model at the top, and start a conversation.

Billing

The following items are billable. For details, see Billing of Elastic Algorithm Service (EAS).

Compute fees: Primary cost component. Choose pay-as-you-go or subscription billing when creating the service.
Storage fees: Custom model files stored in OSS incur standard OSS storage fees.

Production deployment

Choose model

Define your application scenario:
- General-purpose conversation: Choose an instruction-tuned model, not a foundation model.
- Code generation: Choose a specialized code model such as the Qwen3-Coder series for better performance on code tasks.
- Domain-specific tasks: For specialized domains such as finance or law, use a domain-specific fine-tuned model or fine-tune a general-purpose model.
Performance and cost: Larger models are more capable but cost more to deploy. Start with a smaller model (such as 7B) and scale up if needed.
Consult authoritative benchmarks: Use leaderboards such as OpenCompass and LMSys Chatbot Arena for objective evaluations across reasoning, coding, and math.
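
To make the cost trade-off concrete, the memory needed for the weights alone can be approximated as parameter count times bytes per parameter. This is a back-of-the-envelope sketch, not an EAS sizing tool. It uses nominal parameter counts and ignores the KV cache, activations, and runtime overhead, so actual GPU memory use is higher.

```python
# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in ("fp16", "int8", "int4"):
    print(f"8B @ {precision}: ~{weight_memory_gb(8, precision):.0f} GB")
```

For a nominal 8B model this gives roughly 16 GB at FP16, 8 GB at INT8, and 4 GB at INT4 for the weights only.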

Choose inference engine

vLLM/SGLang: Mainstream open-source engines with broad model support and extensive community documentation. Easy to integrate and troubleshoot.
BladeLLM: Alibaba Cloud PAI inference engine, optimized for Qwen series with higher performance and lower GPU memory usage.

Inference optimization

Deploy an LLM intelligent router: Dynamically distributes requests across inference instances based on real-time metrics (token throughput, GPU memory). Improves resource utilization for multi-instance deployments.
Deploy MoE models using expert parallelism and PD separation: Uses expert parallelism (EP) and Prefill-Decode (PD) separation to increase MoE model throughput and reduce deployment costs.
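
To illustrate the idea of metric-based routing (not how the EAS LLM intelligent router is actually implemented), the toy sketch below picks an instance using two made-up metrics. The class, field names, and the 0.9 memory limit are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceMetrics:
    name: str
    tokens_per_s: float   # current token throughput (hypothetical metric)
    gpu_mem_used: float   # fraction of GPU memory in use, 0.0-1.0 (hypothetical metric)


def pick_instance(instances: List[InstanceMetrics], mem_limit: float = 0.9) -> InstanceMetrics:
    """Prefer instances under the memory limit, then the one with the lowest current load."""
    if not instances:
        raise ValueError("no instances available")
    # Fall back to all instances if every one is over the memory limit.
    eligible = [i for i in instances if i.gpu_mem_used < mem_limit] or instances
    return min(eligible, key=lambda i: (i.tokens_per_s, i.gpu_mem_used))


fleet = [
    InstanceMetrics("a", tokens_per_s=1200.0, gpu_mem_used=0.95),
    InstanceMetrics("b", tokens_per_s=800.0, gpu_mem_used=0.70),
    InstanceMetrics("c", tokens_per_s=300.0, gpu_mem_used=0.93),
]
print(pick_instance(fleet).name)
```

Here instance `b` is chosen because it is the only one below the memory limit, even though `c` currently has the lowest throughput.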

Troubleshooting

Model not visible in dropdown after deployment

Root cause: Model deployment incomplete or UI cache issue.

Solution:

Check if service status shows Running. If Pending or Starting, wait for deployment to complete.
Refresh the browser (F5) to clear the UI cache.
On the service details page, click the Logs tab to check for errors.

Prevention: Wait for Running status (5–10 minutes) before accessing the model.

Deployment failed - insufficient resources

Root cause: Selected instance type unavailable in the current region. For regional availability, see Regions and zones.

Solution:

Return to the deployment page and select a different GPU instance type (for example, switch from A10 to T4).
Try a region with higher resource availability, such as cn-hangzhou, cn-beijing, or cn-shanghai.
Set Resource Type to EAS Resource Group to use dedicated resources.

Prevention: Check resource availability before deployment.

Model loading timeout

Root cause: Model file too large, OSS access permissions missing, or network latency.

Solution:

Verify that the service account has read access to the model's OSS bucket.
For models exceeding 50 GB, use a quantized version (INT8 or INT4) to reduce loading time.
Set a longer initialization timeout in the deployment configuration (for example, 600 seconds for large models).

Prevention: Test OSS access before deployment. Expect approximately 1–2 minutes per 10 GB for model loading.
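
The rule of thumb above (1 to 2 minutes per 10 GB) can be turned into a quick estimate when choosing an initialization timeout. This is a rough sketch: the safety factor is an assumption for illustration, not an EAS default.

```python
from typing import Tuple


def estimated_load_seconds(model_size_gb: float) -> Tuple[float, float]:
    """Return (low, high) load-time estimates in seconds at 1-2 minutes per 10 GB."""
    units = model_size_gb / 10.0
    return (units * 60.0, units * 120.0)


def suggested_timeout_seconds(model_size_gb: float, safety_factor: float = 1.5) -> int:
    """Upper estimate multiplied by a safety factor (the factor is an assumption)."""
    _, high = estimated_load_seconds(model_size_gb)
    return int(round(high * safety_factor))


print(estimated_load_seconds(50))      # (300.0, 600.0)
print(suggested_timeout_seconds(50))   # 900
```

For a 50 GB model the upper estimate is 600 seconds, which matches the example timeout given in the solution list.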

FAQ

Q: What do I do if the service is stuck in Pending?

Troubleshoot as follows:

Check the instance status: On the service details page, check instance status in the Service Instance section. Out of Stock indicates insufficient public resources.
Solutions (in order of priority):
1. Change the instance type. Select a different GPU model on the deployment page.
2. Use dedicated resources. Set Resource Type to EAS Resource Group. Create the resource group in advance.
Preventive measures:
1. Enterprise users: create dedicated resource groups to avoid public resource limits.
2. During peak hours, test in multiple regions.

Q: How do I fix common call errors?

Error: Unsupported Media Type: Only 'application/json' is allowed

Ensure request headers include Content-Type: application/json.

Error: The model '<model_name>' does not exist.

vLLM requires the correct model name. Retrieve it from the /v1/models API.

Error: 403 Forbidden - Disable the 'use free tier only' mode

This error occurs when you call Model Studio API models (such as Qwen2.5-VL via DashScope) after the free quota is exhausted. Disable the "use free tier only" setting in the Model Studio console. For details, see Model Studio free quota management. Models deployed on PAI-EAS are not affected and are billed at standard EAS pricing.
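
For client code that needs to surface these fixes automatically, the three errors above can be matched by substring. This is a minimal sketch: the table and function name are hypothetical, and the match strings are taken from the error messages listed in this section.

```python
# (substring of the error message, suggested fix), in the order listed above.
REMEDIES = [
    ("Unsupported Media Type", "Add the header Content-Type: application/json."),
    ("does not exist", "Use the model name returned by the /v1/models API."),
    ("use free tier only", "Disable the 'use free tier only' setting in the Model Studio console."),
]


def suggest_fix(error_message: str) -> str:
    """Return the documented fix for a known error message, or a pointer to the FAQ."""
    lowered = error_message.lower()
    for needle, remedy in REMEDIES:
        if needle.lower() in lowered:
            return remedy
    return "See the EAS FAQ."


print(suggest_fix("The model 'qwen3-8b' does not exist."))
```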

For more information, see the EAS FAQ.