Platform For AI: Deploy large language models using EAS

Last Updated: Dec 09, 2025

Manually deploying large language models (LLMs) involves complex environment configuration, performance tuning, and cost management. Elastic Algorithm Service (EAS) offers a one-stop solution for deploying popular LLMs such as DeepSeek and Qwen with a single click.

Step 1: Deploy an LLM service

This topic demonstrates deploying Qwen3-8B from Public Models.

Note

A Public Model is a model with a pre-configured Deployment Template, enabling one-click deployment without model file preparation. If you select a custom model, you must mount the model files from a service like Object Storage Service (OSS).

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service, and in the Scenario-based Model Deployment area, click LLM Deployment.

  3. On the LLM Deployment page, configure the following key parameters.

    • Model Settings: Select Public Model, then search for and select Qwen3-8B from the list.

    • Inference Engine: We recommend SGLang/vLLM for their high compatibility with the OpenAI API standard. This guide uses vLLM. For more information, see Choose a suitable inference engine.

    • Deployment Template: Select Single-Node. The system automatically populates the recommended Instance Type, image, and other parameters from the template.

  4. Click Deploy. The service deployment takes about 5 minutes. When the service status changes to Running, the deployment is complete.

    Note

    If the service deployment fails, see Abnormal service status for solutions.

Step 2: Debug the service online

After deployment, first verify that the service is running correctly. Click the name of the target service to go to its details page, switch to the Online Debugging tab, and then construct and send a request as follows.

  1. Select the POST method.

  2. Append the path /v1/chat/completions to the end of the auto-filled URL.

  3. Make sure that the Headers include Content-Type: application/json.

  4. Fill in the Body: When using the vLLM Inference Engine, you must replace the model value with the correct model name. To obtain the model name, send a GET request to the /v1/models endpoint. Because you deployed Qwen3-8B in Step 1, replace <model_name> with Qwen3-8B.

    {
      "model": "<model_name>",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ],
      "max_tokens": 1024
    }

Step 3: Call the LLM service

Before you make a call, go to the Overview tab of the service details page, click View Endpoint Information, and obtain the endpoint and token. These values are referenced as <EAS_ENDPOINT> and <EAS_TOKEN> in the following examples.

API call

The handling of the model parameter differs significantly between Inference Engines:

  • vLLM/SGLang: The model value is configured as the model name, which can be obtained by sending a GET request to the /v1/models endpoint.

  • BladeLLM: The BladeLLM endpoint itself does not require the model parameter. However, when using the OpenAI SDK, this parameter is mandatory on the client side. To ensure compatibility, you can set it to an empty string "". For more information, see BladeLLM service invocation parameter configuration.

    Important

    When using BladeLLM, you must explicitly set the max_tokens parameter in your request. Otherwise, the output is truncated to 16 tokens by default.

The following code provides examples of how to invoke the service:

OpenAI SDK

We recommend using the official OpenAI Python SDK to interact with the service. Make sure the SDK is installed: pip install openai.

from openai import OpenAI

# 1. Configure the client
# Replace <EAS_TOKEN> with the token of the deployed service
openai_api_key = "<EAS_TOKEN>"
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# 2. Get the model name
# For BladeLLM, set model = "". BladeLLM does not require the model input parameter and does not support using client.models.list() to get the model name. Set it to an empty string to meet the OpenAI SDK's mandatory parameter requirement.
models = client.models.list()
model = models.data[0].id
print(model)

# 3. Initiate a chat request
# Supports streaming (stream=True) and non-streaming (stream=False) output
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},          
    ],
    model=model,
    top_p=0.8,
    temperature=0.7,
    max_tokens=1024,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        # Some chunks (for example, the initial role chunk) carry no content
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
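
If you deployed the service with BladeLLM instead, client.models.list() in step 2 of the example above is not available. A minimal variant of the same call, following the BladeLLM rules described earlier (model set to an empty string, max_tokens set explicitly), could look like the following sketch:

from openai import OpenAI

# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the values of your BladeLLM service
client = OpenAI(api_key="<EAS_TOKEN>", base_url="<EAS_ENDPOINT>/v1")

completion = client.chat.completions.create(
    model="",         # BladeLLM ignores this field; the OpenAI SDK requires it, so pass ""
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=1024,  # required for BladeLLM; otherwise output is truncated to 16 tokens
)
print(completion.choices[0].message.content)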

cURL

For quick testing or script integration, you can use cURL.

curl -X POST <EAS_ENDPOINT>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "hello"
        }
        ],
        "max_tokens":1024,
        "temperature": 0.7,
        "top_p": 0.8,
        "stream":true
    }' 

Where:

  • Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service's Endpoint and Token.

  • Replace <model_name> with the model name. For vLLM/SGLang, you can obtain it from the model list endpoint:

    curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"

    For BladeLLM, this endpoint is not supported; omit this field or set it to "".

Python requests library

If you prefer not to add the OpenAI SDK dependency, you can use the requests library.

import json
import requests

# Replace <EAS_ENDPOINT> with the endpoint of the deployed service
EAS_ENDPOINT = "<EAS_ENDPOINT>"
# Replace <EAS_TOKEN> with the token of the deployed service
EAS_TOKEN = "<EAS_TOKEN>"
# Replace <model_name> with the model name. You can get the name from the model list interface at <EAS_ENDPOINT>/v1/models. For BladeLLM, this interface is not supported. You can omit the "model" field or set it to "".
model = "<model_name>"

url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]

req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024,
    "model": model,
}
response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)

if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        # The following code processes streaming responses in Server-Sent Events (SSE) format
        if msg.startswith("data:"):
            info = msg[6:]
            if info == "[DONE]":
                break
            else:
                resp = json.loads(info)
                if resp["choices"][0]["delta"].get("content") is not None:
                    print(resp["choices"][0]["delta"]["content"], end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])
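
If you stay with the requests library, the model name for a vLLM/SGLang service can be fetched in the same way. The following minimal sketch assumes the OpenAI-compatible /v1/models response format used by these engines (the endpoint is not available for BladeLLM):

import requests

EAS_ENDPOINT = "<EAS_ENDPOINT>"  # replace with the endpoint of the deployed service
EAS_TOKEN = "<EAS_TOKEN>"        # replace with the token of the deployed service

# List the models served by a vLLM/SGLang service (not supported by BladeLLM)
response = requests.get(f"{EAS_ENDPOINT}/v1/models", headers={"Authorization": EAS_TOKEN})
response.raise_for_status()
print(response.json()["data"][0]["id"])  # use this value in the "model" field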

Build a local WebUI

Gradio is a user-friendly Python library for quickly creating interactive interfaces for machine learning models. Follow these steps to run a Gradio WebUI locally.

  1. Download the code: Choose the code package that matches the inference engine you selected during deployment. Use the GitHub link if you have stable network access to GitHub; otherwise, use the OSS link.

  2. Prepare the environment: Python 3.10 or later is required. Install the dependencies: pip install openai gradio.

  3. Start the web application: Run the following command in your terminal. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service's Endpoint and Token.

    python webui_client.py --eas_endpoint "<EAS_ENDPOINT>" --eas_token "<EAS_TOKEN>"
  4. After the application starts successfully, a local URL (usually http://127.0.0.1:7860) is printed to your console. Open this URL in your browser to access the WebUI.
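
If you just want to see how such a client works, the following is a minimal sketch of a Gradio chat UI built on the OpenAI SDK. It assumes a vLLM/SGLang service (OpenAI-compatible API) and a recent Gradio version; the downloadable webui_client.py remains the recommended client:

import gradio as gr
from openai import OpenAI

EAS_ENDPOINT = "<EAS_ENDPOINT>"  # replace with the endpoint of the deployed service
EAS_TOKEN = "<EAS_TOKEN>"        # replace with the token of the deployed service

client = OpenAI(api_key=EAS_TOKEN, base_url=f"{EAS_ENDPOINT}/v1")
# For BladeLLM, /v1/models is not supported; set model = "" instead
model = client.models.list().data[0].id


def chat(message, history):
    # Rebuild the conversation in the OpenAI message format
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    messages += [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})

    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1024,
        stream=True,
    )
    reply = ""
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            reply += chunk.choices[0].delta.content
            yield reply  # stream partial output to the chat window


# type="messages" makes history entries OpenAI-style role/content dictionaries
gr.ChatInterface(chat, type="messages").launch()  # serves on http://127.0.0.1:7860 by default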

Integrate with third-party applications

You can integrate EAS services with various clients and development tools that support the OpenAI API. The core configuration requires the service endpoint, token, and model name.

Dify

  1. Install the OpenAI-API-compatible model provider

    Click your profile picture in the upper-right corner and select Settings. In the left-side navigation pane, choose Model Provider. If OpenAI-API-compatible is not in your model provider list, find it in the list of available providers below and install it.

  2. Add model

    Click Add Model in the lower-right corner of the OpenAI-API-compatible card and configure the following parameters:

    • Model Type: Select LLM.

    • Model Name: For vLLM deployments, obtain the name by sending a GET request to the /v1/models endpoint. This example uses Qwen3-8B.

    • API Key: Enter the EAS service token.

    • API endpoint URL: Enter the public Endpoint of the EAS service. Note: Append /v1 to the end.

  3. Test the model

    1. On the Dify main page, click Create from Blank. Select the Chatflow type, enter an application name and other information, and then click Create.

    2. Click the LLM node, select the model you added, and set the context and prompt.

    3. Click Preview in the upper-right corner and enter a question.

Chatbox

  1. Go to Chatbox, then download and install the version appropriate for your device, or use Launch Web App directly. This guide uses macOS (M3) as an example.

  2. Add a model provider. Click Settings, add a model provider, and enter a name such as pai. For API Mode, select OpenAI API Compatible.

  3. Select the pai model provider and configure the following parameters.

    • API Key: The EAS service token.

    • API Host: Enter the public endpoint of the EAS service. Note: Append /v1 to the end of the URL.

    • API Path: Leave this field blank.

    • Model: Click Fetch to add models. If the Inference Engine is BladeLLM, you cannot get models through this interface. Click New and enter the model name manually.

  4. Test the chat. Click New Chat, and select the model service in the lower-right corner of the text input box.

Cherry Studio

Billing

Costs may include the following. For more information, see Elastic Algorithm Service (EAS) billing details.

  • Compute fees: This is the main source of cost. When creating an EAS service, choose pay-as-you-go or subscription resources based on your needs.

  • Storage fees: If you use a custom model, files stored in OSS will incur storage fees.

Going live

Choose a suitable model

  1. Define your application scenario:

    • General conversation: Be sure to choose an instruction-tuned model, not a foundation model, to ensure the model can understand and follow your instructions.

    • Code generation: Choose specialized code models, such as the Qwen3-Coder series. They typically perform much better on code-related tasks than general-purpose models.

    • Domain-specific tasks: If the task is highly specialized, such as in finance or law, consider finding a model that has been fine-tuned for that domain or fine-tuning a general-purpose model yourself.

  2. Balance performance and cost: Generally, a larger parameter count means a more capable model, but also one that requires more computing power for deployment, leading to higher inference costs. We recommend starting with a smaller model (such as a 7B model) for validation. If its performance does not meet your requirements, gradually try larger models.

  3. Refer to authoritative benchmarks: You can refer to industry-recognized leaderboards like OpenCompass and LMSys Chatbot Arena. These benchmarks provide objective evaluations of models across dimensions, such as reasoning, coding, and math, offering valuable guidance for model selection.

Choose a suitable inference engine

  • vLLM/SGLang: As mainstream choices in the open-source community, they offer broad model support and extensive community documentation and examples, making them easy to integrate and troubleshoot.

  • BladeLLM: Developed by the Alibaba Cloud PAI team, BladeLLM is deeply optimized for specific models, especially the Qwen series, often achieving higher performance and lower GPU memory consumption.

Optimize inference

  • LLM intelligent router: Dynamically distributes requests based on real-time metrics such as token throughput and GPU memory usage. This balances computing power and GPU memory allocation across inference instances, improving cluster resource utilization and system stability. It is suitable for scenarios with multiple inference instances and an expected uneven request load.

  • Deploy MoE models based on expert parallelism and PD separation: For Mixture-of-Experts (MoE) models, this approach uses technologies such as expert parallelism (EP) and Prefill-Decode (PD) separation to increase inference throughput and reduce deployment costs.

FAQ

  1. Error: Unsupported Media Type: Only 'application/json' is allowed

    Ensure the request Headers include Content-Type: application/json.

  2. Error: The model '<model_name>' does not exist.

    The vLLM Inference Engine requires a correct model field. Obtain the model name by sending a GET request to the /v1/models endpoint.

For more information, see EAS FAQ.