Elastic Algorithm Service (EAS) of Platform for AI (PAI) is an online model service for inference scenarios and provides a one-click solution for the automatic deployment of large language models (LLMs). EAS allows you to deploy multiple open source LLMs in an efficient manner and supports both standard and accelerated deployment methods. Accelerated deployment ensures high concurrency and low latency. This topic describes how to quickly deploy and call an LLM in EAS and provides answers to frequently asked questions.
Prerequisites
If you use a RAM user to deploy the model, the RAM user must have the permissions to use EAS.
Deploy an EAS service
Go to the EAS-Online Model Services page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace in which you want to deploy the model.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section of the Deploy Service page, select LLM Deployment.
On the LLM Deployment page, configure the parameters described in the following table. Retain the default values for other parameters.
Parameter
Description
Basic Information
Service Name
Specify a name for the service. Example: llm_demo001.
Version
Set this parameter to Open-source Model Quick Deployment.
Model Type
Set this parameter to qwen2.5-7b-instruct. EAS provides various model types to meet your business requirements, such as DeepSeek-R1, Qwen2-VL, and Meta-Llama-3.2-1B.
Deployment Method
Specify a deployment method based on the model type. In this example, select SGLang Accelerate Deployment and then Single-Node-Standard. If you select Transformers Deployment, which does not use an acceleration framework, both API calling and web UI calling are supported. Accelerated deployment supports only API calling.
Resource Information
Resource Type
Set this parameter to Public Resources.
Deployment Resources
After you select a model type, the system automatically recommends a suitable instance type.
The following figure shows sample configurations.
Click Deploy. The model deployment requires approximately five minutes.
Call the model service
Accelerated deployment supports only API calling. This section describes how to call API operations and debug the model service in the SGLang Accelerate Deployment scenario. For more information about how to call API operations when other deployment methods are used, see LLM deployment.
Online debugging
On the Elastic Algorithm Service (EAS) page, find the desired model service, and then select Online Debugging in the Actions column.
Initiate a POST request: enter the request path and the request body based on the deployment method that you use, and then click Send Request.
Request API:
/v1/chat/completions
Sample request body:
{ "model": "Qwen2.5-7B-Instruct", "messages": [ { "role": "user", "content": "What is the capital of Canada?" } ] }
API calling
View the access address and token of the model service.
On the Elastic Algorithm Service (EAS) page, find the desired model service and click Invocation Method in the Service Type column.
In the Invocation Method dialog box, view the access address and token of the service.
Run the following code to call the model service.
Python
from openai import OpenAI

##### API configurations #####
openai_api_key = "<EAS API KEY>"
openai_api_base = "<EAS API Endpoint>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
print(model)


def main():
    stream = True
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is the capital of Canada?",
                    }
                ],
            }
        ],
        model=model,
        stream=stream,
    )
    if stream:
        for chunk in chat_completion:
            print(chunk.choices[0].delta.content, end="")
    else:
        result = chat_completion.choices[0].message.content
        print(result)


if __name__ == "__main__":
    main()
Take note of the following parameters:
<EAS API KEY>: the token of the queried model service.
<EAS API Endpoint>: the endpoint of the queried model service.
Command lines
curl -X POST <service_url>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: <token>" \
    -d '{
        "model": "<model_name>",
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a helpful and harmless assistant."
                    }
                ]
            },
            {
                "role": "user",
                "content": "What is the capital of Canada?"
            }
        ]
    }'
Take note of the following parameters:
<service_url>: the URL of the queried service.
<token>: the token of the queried service.
<model_name>: the model name. You can obtain the model name by calling <service_url>/v1/models. Sample request:
curl -X GET \
    -H "Authorization: <token>" \
    <service_url>/v1/models
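The response is in the OpenAI-compatible list format that the preceding Python example also relies on. The model name that you pass as <model_name> is the id field of an entry in the data array. The following response is a simplified sketch for illustration; the actual response contains additional fields:
{
  "object": "list",
  "data": [
    {
      "id": "Qwen2.5-7B-Instruct",
      "object": "model"
    }
  ]
}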
FAQ
How do I switch to another open source LLM?
EAS provides the following open source LLMs: DeepSeek-R1, Llama, UI-TARS, QVQ, Gemma2, and Baichuan2. To switch between these models, perform the following steps:
On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update in the Actions column.
On the LLM Deployment page, set the Model Type parameter to the desired open source LLM. The system automatically updates the value of the Deployment Resources parameter.
Click Update.
How do I improve concurrency and reduce latency for the inference service?
EAS provides BladeLLM and vLLM, which are inference acceleration engines that you can use to ensure high concurrency and low latency. To use the inference acceleration engines, perform the following steps:
On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update in the Actions column.
In the Basic Information section of the LLM Deployment page, select an accelerated deployment method for the Deployment Method parameter. Then, click Update.
You can also set the Version parameter to High-performance Deployment on the LLM Deployment page. This way, you can implement fast deployment based on the BladeLLM engine developed by PAI.
How do I mount a custom model?
If you set the Version parameter to High-performance Deployment on the LLM Deployment page, you can mount a custom model. Only open source, fine-tuned, or quantized Qwen and Llama text models can be deployed. In this example, Object Storage Service (OSS) is used to mount a custom model.
Upload the custom model and related configuration files to your OSS bucket. For information about how to create a bucket and upload objects, see Create buckets and Upload objects.
The following figure shows the sample model files to be prepared.
The config.json file must be uploaded, and it must follow the Hugging Face model format. For more information about the sample file, see config.json.
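For reference, the following is a minimal illustrative config.json for a Qwen2.5-style model. The field names follow the Hugging Face format, but the values shown here are placeholders for illustration only; use the config.json file that is shipped with your model instead of writing one by hand:
{
  "architectures": ["Qwen2ForCausalLM"],
  "model_type": "qwen2",
  "hidden_size": 3584,
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "vocab_size": 152064,
  "torch_dtype": "bfloat16"
}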
On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update in the Actions column.
On the LLM Deployment page, specify the following parameters, and then click Update.
Parameter
Description
Basic Information
Version
Set this parameter to High-performance Deployment.
Image Version
Set this parameter to blade-llm:0.9.0.
Model Settings
Set this parameter to Custom Model and click OSS. Select the OSS path in which the custom model is stored.
Resource Information
Deployment Resources
Select an instance type. For more information, see Limits.
How do I call API operations to perform inference?
For more information, see API invocation.
References
If you deploy a model service by using public resources, the billing stops when the service is stopped. For more information about billing, see Billing of EAS.
For more information about EAS, see EAS overview.
After you integrate the LangChain framework on the web UI page, you can perform API-based model inference that uses the search results of your on-premises knowledge base. For more information, see RAG-based LLM chatbot.
For more information about the release notes of major versions of ChatLLM-WebUI, see Release notes for ChatLLM WebUI.
If you deploy the DeepSeek full version, we recommend that you use a multi-machine distributed inference solution.