
Platform for AI: Deploying large language models on PAI-EAS

Last Updated: Feb 06, 2026

Elastic Algorithm Service (EAS) provides a one-stop solution for deploying large language models. Deploy popular models like DeepSeek and Qwen with a single click, simplifying environment configuration, performance tuning, and cost management.

Quick Start: Deploy an open source model

This section deploys the open source model Qwen3-8B as an example. The same process applies to other supported models.

Step 1: Create a service

  1. Log on to the PAI console. Select a region at the top of the page, select the desired workspace, and then click Elastic Algorithm Service (EAS).

  2. Click Deploy Service. In the Scenario-based Model Deployment area, click LLM Deployment.

  3. Configure the following key parameters:

    • Model Settings: Select Public Model. Search for and select Qwen3-8B.

    • Inference Engine: Select vLLM, which is recommended and compatible with the OpenAI API.

    • Deployment Template: Select Single Machine to automatically fill in the recommended instance type, runtime image, and other parameters.

  4. Click Deploy. Deployment takes about 5 minutes and completes when the service status changes to Running.

    Note

    If the service deployment fails, see Service deployment and status issues.

Step 2: Verify with online debugging

After the service is deployed, you can use online debugging to verify that it is running correctly.

  1. Click the service name to go to the service details page. Switch to the Online Debugging tab.

  2. Configure the request parameters as follows:

    • Request Method: POST

    • URL Path: Append /v1/chat/completions to your service URL. For example: /api/predict/llm_qwen3_8b_test/v1/chat/completions.

    • Body:

      {
        "model": "Qwen3-8B",
        "messages": [
          {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 1024
      }

    • Headers: Include Content-Type: application/json.

  3. Click Send Request to receive a response containing the model's reply.


Call using an API

Before you make a call, go to the Overview tab on the service details page. Click View Endpoint Information to obtain the endpoint and token.

Call the service using any of the following methods:

cURL

curl -X POST <EAS_ENDPOINT>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "hello"
        }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.8,
        "stream": true
    }'

Where:

  • Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.

  • Replace <model_name> with the model name. For vLLM/SGLang, you can retrieve the model name from the model list API at /v1/models.

    curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"

OpenAI SDK

Install the OpenAI SDK with pip install openai, then use it to interact with the service.

from openai import OpenAI

# 1. Configure the client
# Replace <EAS_TOKEN> with the token of your deployed service
openai_api_key = "<EAS_TOKEN>"
# Replace <EAS_ENDPOINT> with the endpoint of your deployed service
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# 2. Get the model name
# For BladeLLM, set model = "" instead: BladeLLM does not require the model parameter
# and does not support retrieving the model name with client.models.list(), so an
# empty string is enough to satisfy the OpenAI SDK's required parameter.
models = client.models.list()
model = models.data[0].id
print(model)

# 3. Send a chat request
# Supports streaming (stream=True) and non-streaming (stream=False) output
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    model=model,
    top_p=0.8,
    temperature=0.7,
    max_tokens=1024,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)

Python requests library

If you do not want to add the OpenAI SDK as a dependency, use the requests library instead.

import json
import requests

# Replace <EAS_ENDPOINT> with the endpoint of your deployed service
EAS_ENDPOINT = "<EAS_ENDPOINT>"
# Replace <EAS_TOKEN> with the token of your deployed service
EAS_TOKEN = "<EAS_TOKEN>"
# Replace <model_name> with the model name, which you can get from the model list API at
# /v1/models. BladeLLM does not support this API; omit the "model" field or set it to "".
model = "<model_name>"

url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]

req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024,
    "model": model,
}
response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)

if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        # The following code processes streaming responses in Server-Sent Events (SSE) format
        if msg.startswith("data:"):
            info = msg[6:]
            if info == "[DONE]":
                break
            else:
                resp = json.loads(info)
                if resp["choices"][0]["delta"].get("content") is not None:
                    print(resp["choices"][0]["delta"]["content"], end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])

Build a local Web UI with Gradio

Gradio is a user-friendly Python library for creating interactive machine learning interfaces. Follow these steps to run the Gradio WebUI locally; a minimal client sketch is shown after the steps.

  1. Download the code

    GitHub link | OSS link

  2. Prepare the environment

    Requires Python 3.10 or later. Install the dependencies: pip install openai gradio.

  3. Start the web application

    Run the following command in your terminal. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.

    python webui_client.py --eas_endpoint "<EAS_ENDPOINT>" --eas_token "<EAS_TOKEN>"

  4. After the application starts successfully, a local URL is displayed, usually http://127.0.0.1:7860. Open this URL in a browser to access the UI.
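
The downloaded script handles these details for you, but for reference, here is a minimal sketch of a comparable Gradio chat client. It is not the actual webui_client.py: it assumes a recent Gradio release that supports type="messages" chat interfaces and the OpenAI-compatible endpoint described above. Replace the placeholders before running.

import gradio as gr
from openai import OpenAI

# Replace <EAS_TOKEN> and <EAS_ENDPOINT> with the values of your deployed service
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")
model = client.models.list().data[0].id

def chat(message, history):
    # Rebuild the conversation in OpenAI message format
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    messages += [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})
    # Stream tokens back to the UI as they arrive
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1024,
        temperature=0.7,
        top_p=0.8,
        stream=True,
    )
    reply = ""
    for chunk in stream:
        reply += chunk.choices[0].delta.content or ""
        yield reply

# Serves the chat UI on http://127.0.0.1:7860 by default
gr.ChatInterface(chat, type="messages").launch()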

Integrate with third-party applications

You can integrate EAS services with any client or development tool that supports the OpenAI API by supplying the service endpoint, token, and model name.

Dify

  1. Install the OpenAI-API-compatible model provider

    Click your profile picture in the upper-right corner and select Settings. In the navigation pane on the left, select Model Providers. If OpenAI-API-compatible is not in the Model List, find it in the list below and click Install.


  2. Add a model

    Click Add Model in the lower-right corner of the OpenAI-API-compatible card and configure the parameters as follows:

    • Model Type: Select LLM.

    • Model Name: For vLLM deployments, send a GET request to the /v1/models API to retrieve the name. For example, enter Qwen3-8B.

    • API Key: Enter the EAS service token.

    • API endpoint URL: Enter the Internet endpoint of the EAS service. Note: Append /v1 to the end.

  3. Usage

    1. On the Dify main page, click Create Blank App. Select the Chatflow type, enter the application name and other information, and then click Create.

    2. Click the LLM node, select the model you added, and set the context and prompt.

    3. Click Preview in the upper-right corner and enter a question.


Chatbox

  1. Go to Chatbox. Download and install the version for your device, or click Launch Web App to use the web version. This example uses macOS on an M3 chip.

  2. Add a model provider. Click Settings, add a model provider, enter a name, such as pai, and select OpenAI API Compatible for the API Mode.


  3. Select the pai model provider and configure the following parameters.

    • API Key: Enter the EAS service token.

    • API Host: Enter the Internet endpoint of the EAS service. Note: Append /v1 to the end.

    • API Path: Leave this empty.

    • Model: Click Get to add a model. If the inference engine is BladeLLM, which does not support retrieval through the API, click New to enter the model name.


  4. Test the conversation. Click New Chat. In the lower-right corner of the text input box, select the model service.


Cherry Studio

Billing

The following items are billable. For more information, see Billing of Elastic Algorithm Service (EAS).

  • Compute fees: These fees are the primary cost. When creating an EAS service, choose pay-as-you-go or subscription resources based on your needs.

  • Storage fees: If you use a custom model, the model files are stored in Object Storage Service (OSS). You will be charged for OSS storage based on your usage.

Going live

Choose the right model

  1. Define your application scenario:

    • General-purpose conversation: Choose an instruction-tuned model, not a foundation model. This ensures the model can understand and follow your instructions.

    • Code generation: Choose a specialized code model, such as the Qwen3-Coder series. They typically perform much better on code-related tasks than general-purpose models.

    • Domain-specific tasks: If the task is highly specialized, such as in finance or law, consider finding a model that has been fine-tuned for that domain. You can also fine-tune a general-purpose model yourself.

  2. Performance and cost: Generally, models with more parameters are more capable. However, they also require more compute resources to deploy, which increases inference costs. We recommend starting with a smaller model, such as a 7B model, for validation. If its performance does not meet your needs, gradually try larger models.

  3. Consult authoritative benchmarks: You can refer to industry-recognized leaderboards such as OpenCompass and the LMSys Chatbot Arena. These leaderboards provide objective evaluations of models across multiple dimensions, such as reasoning, coding, and math. They can offer valuable guidance for model selection.

Choose the right inference engine

  • vLLM/SGLang: These are mainstream choices in the open source community. They offer broad model support and extensive community documentation and examples. This makes them easy to integrate and troubleshoot.

  • BladeLLM: This is an inference engine developed by the Alibaba Cloud PAI team. It is deeply optimized for specific models, especially the Qwen series. It may provide higher performance and lower GPU memory usage.

Inference optimization

  • Deploy an LLM intelligent router: This feature dynamically distributes requests based on real-time metrics such as token throughput and GPU memory usage. It balances the computing power and memory allocation across inference instances. This is suitable for scenarios where you deploy multiple inference instances and load imbalance is expected. It improves cluster resource utilization and system stability.

  • Deploy MoE models using expert parallelism and PD separation: For Mixture-of-Experts (MoE) models, this approach uses techniques such as expert parallelism (EP) and Prefill-Decode (PD) separation to increase inference throughput and reduce deployment costs.

FAQ

Q: What should I do if the service is stuck in the Pending state and won't start?

Follow these steps to troubleshoot the issue:

  1. Check the instance status: On the service list page, click the service name to go to the service details page. In the Service Instance section, check the instance status. If it shows Out of Stock, this indicates that the public resource group has insufficient resources.

  2. Solutions (in order of priority):

    1. Solution 1: Change the instance type. Return to the deployment page and select a different GPU model.

    2. Solution 2: Use dedicated resources. Set Resource Type to EAS Resource Group to use dedicated resources. You must create the resource group in advance.

  3. Preventive measures:

    1. To avoid being limited by public resources, enterprise users should create dedicated resource groups.

    2. During peak hours, we recommend testing in multiple regions.

Q: What should I do if a call returns an error?

  1. The call returns the error Unsupported Media Type: Only 'application/json' is allowed

    Ensure that the request headers include Content-Type: application/json.

  2. The call returns the error The model '<model_name>' does not exist.

    The vLLM inference engine requires the model field to be set correctly. You can retrieve the model name by sending a GET request to the /v1/models API, as in the sketch below.
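
    For example, a minimal sketch with the Python requests library (an illustration only; replace the placeholders with your own endpoint and token, as in the earlier examples):

    import requests

    # List the models served by the deployment and print their IDs
    resp = requests.get(
        "<EAS_ENDPOINT>/v1/models",
        headers={"Authorization": "<EAS_TOKEN>"},
    )
    print([m["id"] for m in resp.json()["data"]])  # use one of these IDs as the "model" value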

For more information, see the EAS FAQ.