
Platform for AI: Deploying large language models on PAI-EAS

Last Updated: Feb 06, 2026

Elastic Algorithm Service (EAS) provides a one-stop solution for deploying large language models. Deploy popular models like DeepSeek and Qwen with a single click, simplifying environment configuration, performance tuning, and cost management.

Quick Start: Deploy an open source model

This section deploys the open source model Qwen3-8B as an example. The same process applies to other supported models.

Step 1: Create a service

  1. Log on to the PAI console. Select a region at the top of the page, select the desired workspace, and then click Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Scenario-based Model Deployment area, click LLM Deployment.

  3. Configure the following key parameters:

    • Model Settings: Select Public Model. Search for and select Qwen3-8B.

    • Inference Engine: Select vLLM, which is recommended and compatible with the OpenAI API.

    • Deployment Template: Select Single Machine to automatically fill in the recommended instance type, runtime image, and other parameters.

  4. Click Deploy. Deployment takes about 5 minutes and completes when the service status changes to Running.

    Note

    If the service deployment fails, see Service deployment and status issues.

Step 2: Verify with online debugging

After the service is deployed, you can use online debugging to verify that it is running correctly.

  1. Click the service name to go to the service details page. Switch to the Online Debugging tab.

  2. Configure the request parameters as follows:

    • Request Method: POST

    • URL Path: Append /v1/chat/completions to your service URL. For example: /api/predict/llm_qwen3_8b_test/v1/chat/completions.

    • Body:

      {
        "model": "Qwen3-8B",
        "messages": [
          {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 1024
      }

    • Headers: Include Content-Type: application/json.

  3. Click Send Request to receive a response containing the model's reply.


Call using an API

Before you make a call, go to the Overview tab on the service details page. Click View Endpoint Information to obtain the endpoint and token.

Call the service using any of the following methods:

cURL

curl -X POST <EAS_ENDPOINT>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "hello"
        }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.8,
        "stream": true
    }'

Where:

  • Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.

  • Replace <model_name> with the model name. For vLLM/SGLang, you can retrieve the model name from the model list API at /v1/models.

    curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"

OpenAI SDK

Install the OpenAI SDK with pip install openai, then use it to interact with the service.

from openai import OpenAI

# 1. Configure the client
# Replace <EAS_TOKEN> with the token of your deployed service
openai_api_key = "<EAS_TOKEN>"
# Replace <EAS_ENDPOINT> with the endpoint of your deployed service
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# 2. Get the model name
# For BladeLLM, set model = "" instead: BladeLLM does not require the model parameter
# and does not support retrieving the model name with client.models.list(), so an
# empty string is enough to satisfy the OpenAI SDK's required parameter.
models = client.models.list()
model = models.data[0].id
print(model)

# 3. Send a chat request
# Supports streaming (stream=True) and non-streaming (stream=False) output
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    model=model,
    top_p=0.8,
    temperature=0.7,
    max_tokens=1024,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)

Python requests library

If you do not want to add the OpenAI SDK as a dependency, use the requests library instead.

import json
import requests

# Replace <EAS_ENDPOINT> with the endpoint of your deployed service
EAS_ENDPOINT = "<EAS_ENDPOINT>"
# Replace <EAS_TOKEN> with the token of your deployed service
EAS_TOKEN = "<EAS_TOKEN>"
# Replace <model_name> with the model name, which you can get from the model list API at
# /v1/models. BladeLLM does not support this API; omit the "model" field or set it to "".
model = "<model_name>"

url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]

req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024,
    "model": model,
}
response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)

if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        # The following code processes streaming responses in Server-Sent Events (SSE) format
        if msg.startswith("data:"):
            info = msg[6:]
            if info == "[DONE]":
                break
            else:
                resp = json.loads(info)
                if resp["choices"][0]["delta"].get("content") is not None:
                    print(resp["choices"][0]["delta"]["content"], end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])

Build a local Web UI with Gradio

Gradio is a user-friendly Python library for creating interactive machine learning interfaces. Follow these steps to run the Gradio WebUI locally; a minimal client sketch is shown after the steps.

  1. Download the code

    GitHub link | OSS link

  2. Prepare the environment

    Requires Python 3.10 or later. Install the dependencies: pip install openai gradio.

  3. Start the web application

    Run the following command in your terminal. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.

    python webui_client.py --eas_endpoint "<EAS_ENDPOINT>" --eas_token "<EAS_TOKEN>"

  4. After the application starts successfully, a local URL is displayed, usually http://127.0.0.1:7860. Open this URL in a browser to access the UI.
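
The downloaded script handles these details for you, but for reference, here is a minimal sketch of a comparable Gradio chat client. It is not the actual webui_client.py: it assumes a recent Gradio release that supports type="messages" chat interfaces and the OpenAI-compatible endpoint described above. Replace the placeholders before running.

import gradio as gr
from openai import OpenAI

# Replace <EAS_TOKEN> and <EAS_ENDPOINT> with the values of your deployed service
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
model = client.models.list().data[0].id

def chat(message, history):
    # Rebuild the conversation in OpenAI message format
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    messages += [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})
    # Stream tokens back to the UI as they arrive
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1024,
        temperature=0.7,
        top_p=0.8,
        stream=True,
    )
    reply = ""
    for chunk in stream:
        reply += chunk.choices[0].delta.content or ""
        yield reply

# Serves the chat UI on http://127.0.0.1:7860 by default
gr.ChatInterface(chat, type="messages").launch()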

Integrate with third-party applications

You can integrate EAS services with any client or development tool that supports the OpenAI API by supplying the service endpoint, token, and model name.

Dify

  1. Install the OpenAI-API-compatible model provider

    Click your profile picture in the upper-right corner and select Settings. In the navigation pane on the left, select Model Providers. If OpenAI-API-compatible is not in the Model List, find it in the list below and click Install.


  2. Add a model

    Click Add Model in the lower-right corner of the OpenAI-API-compatible card and configure the parameters as follows:

    • Model Type: Select LLM.

    • Model Name: For vLLM deployments, send a GET request to the /v1/models API to retrieve the name. For example, enter Qwen3-8B.

    • API Key: Enter the EAS service token.

    • API endpoint URL: Enter the Internet endpoint of the EAS service. Note: Append /v1 to the end.

  3. Usage

    1. On the Dify main page, click Create Blank App. Select the Chatflow type, enter the application name and other information, and then click Create.

    2. Click the LLM node, select the model you added, and set the context and prompt.

    3. Click Preview in the upper-right corner and enter a question.


Chatbox

  1. Go to Chatbox. Download and install the version for your device, or click Launch Web App to use the web version. This example uses macOS on an M3 chip.

  2. Add a model provider. Click Settings, add a model provider, enter a name, such as pai, and select OpenAI API Compatible for the API Mode.


  3. Select the pai model provider and configure the following parameters.

    • API Key: Enter the EAS service token.

    • API Host: Enter the Internet endpoint of the EAS service. Note: Append /v1 to the end.

    • API Path: Leave this empty.

    • Model: Click Get to add a model. If the inference engine is BladeLLM, which does not support retrieval through the API, click New to enter the model name.


  4. Test the conversation. Click New Chat. In the lower-right corner of the text input box, select the model service.


Cherry Studio

Billing

The following items are billable. For more information, see Billing of Elastic Algorithm Service (EAS).

  • Compute fees: These fees are the primary cost. When creating an EAS service, choose pay-as-you-go or subscription resources based on your needs.

  • Storage fees: If you use a custom model, the model files are stored in Object Storage Service (OSS). You will be charged for OSS storage based on your usage.

Going live

Choose the right model

  1. Define your application scenario:

    • General-purpose conversation: Choose an instruction-tuned model, not a foundation model. This ensures the model can understand and follow your instructions.

    • Code generation: Choose a specialized code model, such as the Qwen3-Coder series. They typically perform much better on code-related tasks than general-purpose models.

    • Domain-specific tasks: If the task is highly specialized, such as in finance or law, consider finding a model that has been fine-tuned for that domain. You can also fine-tune a general-purpose model yourself.

  2. Performance and cost: Generally, models with more parameters are more capable. However, they also require more compute resources to deploy, which increases inference costs. We recommend starting with a smaller model, such as a 7B model, for validation. If its performance does not meet your needs, gradually try larger models.

  3. Consult authoritative benchmarks: You can refer to industry-recognized leaderboards such as OpenCompass and the LMSys Chatbot Arena. These leaderboards provide objective evaluations of models across multiple dimensions, such as reasoning, coding, and math. They can offer valuable guidance for model selection.

Choose the right inference engine

  • vLLM/SGLang: These are mainstream choices in the open source community. They offer broad model support and extensive community documentation and examples. This makes them easy to integrate and troubleshoot.

  • BladeLLM: This is an inference engine developed by the Alibaba Cloud PAI team. It is deeply optimized for specific models, especially the Qwen series. It may provide higher performance and lower GPU memory usage.

Inference optimization

  • Deploy an LLM intelligent router: This feature dynamically distributes requests based on real-time metrics such as token throughput and GPU memory usage. It balances the computing power and memory allocation across inference instances. This is suitable for scenarios where you deploy multiple inference instances and load imbalance is expected. It improves cluster resource utilization and system stability.

  • Deploy MoE models using expert parallelism and PD separation: For Mixture-of-Experts (MoE) models, this approach uses techniques such as expert parallelism (EP) and Prefill-Decode (PD) separation to increase inference throughput and reduce deployment costs.

FAQ

Q: What should I do if the service is stuck in the Pending state and won't start?

Follow these steps to troubleshoot the issue:

  1. Check the instance status: On the service list page, click the service name to go to the service details page. In the Service Instance section, check the instance status. If it shows Out of Stock, this indicates that the public resource group has insufficient resources.

  2. Solutions (in order of priority):

    1. Solution 1: Change the instance type. Return to the deployment page and select a different GPU model.

    2. Solution 2: Use dedicated resources. Set Resource Type to EAS Resource Group to use dedicated resources. You must create the resource group in advance.

  3. Preventive measures:

    1. To avoid being limited by public resources, enterprise users should create dedicated resource groups.

    2. During peak hours, we recommend testing in multiple regions.

Q: What should I do if a call returns an error?

  1. The call returns the error Unsupported Media Type: Only 'application/json' is allowed

    Ensure that the request headers include Content-Type: application/json.

  2. The call returns the error The model '<model_name>' does not exist.

    The vLLM inference engine requires the model field to be set correctly. You can retrieve the model name by sending a GET request to the /v1/models API, as in the sketch below.
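
    For example, a minimal sketch with the Python requests library (an illustration only; replace the placeholders with your own endpoint and token, as in the earlier examples):

    import requests

    # List the models served by the deployment and print their IDs
    resp = requests.get(
        "<EAS_ENDPOINT>/v1/models",
        headers={"Authorization": "<EAS_TOKEN>"},
    )
    print([m["id"] for m in resp.json()["data"]])  # use one of these IDs as the "model" value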

For more information, see the EAS FAQ.