Elastic Algorithm Service (EAS) of Platform for AI (PAI) provides a scenario-based deployment mode that allows you to deploy and call an open source large language model (LLM) by configuring a small number of parameters. This topic describes how to use EAS to deploy and call an LLM service.
Feature overview
LLMs such as ChatGPT and the Qwen model series have garnered significant attention, especially in inference tasks. EAS allows you to deploy an LLM in a convenient and efficient manner and provides the following deployment options:
Quick deployment of open-source models: EAS allows you to deploy various open-source LLMs, including DeepSeek-R1, DeepSeek-V3, QVQ-72B-Preview, QwQ-32B, QwQ-32B-Preview, Llama, Qwen, Marco, internlm3, Qwen2-VL, and AlphaFold2. The following deployment modes are supported: standard deployment, BladeLLM-based accelerated deployment, SGLang-based accelerated deployment, and vLLM-based accelerated deployment.
High-performance deployment: The BladeLLM engine, which is developed by PAI, is used to implement LLM inference with low latency and high throughput. High-performance deployment supports both open-source public models and custom models. To deploy a custom model, select this deployment option.
The following table describes the differences between the two deployment options.
| Type | Quick deployment of open-source models | High-performance deployment |
| --- | --- | --- |
| Model configuration | Open-source public models | Open-source public models and custom models |
| Accelerated framework | Accelerated deployment: BladeLLM, SGLang, or vLLM. Standard deployment: Transformers (no acceleration framework). | Accelerated deployment: BladeLLM |
| Calling method | WebUI calling (standard deployment only), online debugging, and API calling | API calling and online calling |
This topic uses quick deployment of open-source models as an example to describe how to deploy an LLM service. For information about how to perform high-performance deployment, see Get Started with BladeLLM.
Deploy an EAS service
Log on to the PAI console. Select a region and a workspace. Then, click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section of the Deploy Service page, select LLM Deployment.
On the LLM Deployment page, configure the parameters described in the following table.
| Parameter | Description |
| --- | --- |
| Basic Information | |
| Service Name | Specify a name for the model service. |
| Version | Select Open-source Model Quick Deployment. For information about how to perform high-performance deployment, see Get Started with BladeLLM. |
| Model Type | Select a model category. |
| Deployment Method | Different model categories support different deployment methods: accelerated deployment with BladeLLM, accelerated deployment with SGLang, accelerated deployment with vLLM, and standard deployment based on Transformers (no acceleration framework). You can view the deployment methods that a specific model category supports when you deploy a service. Accelerated deployment supports only API inference and online debugging. |
| Resource Deployment | |
| Resource Type | Public Resources is selected by default. If you want to use dedicated resources to deploy the service, you can use EAS resource groups or resource quotas. For more information about how to purchase resource groups and create resource quotas, see Work with EAS resource groups and Lingjun resource quotas. Note: You can use resource quotas only in the China (Ulanqab), Singapore, China (Shenzhen), China (Beijing), China (Shanghai), and China (Hangzhou) regions. |
| Deployment Resources | When you use public resources, the system automatically selects an appropriate instance type after you select a model category. |
Click Deploy.
Call an EAS service
WebUI calling
Only services deployed by using standard deployment support the WebUI calling method.
Click View Web App in the Service Type column of the desired service.
On the WebUI page, enter the chat content, such as What is the capital of Canada?, and click Send.
Online debugging
On the Elastic Algorithm Service (EAS) page, find the desired service and select Online Debugging in the Actions column.
In the Params section of the Online Debugging tab, add a request interface, such as /v1/chat/completions, configure the request body, and then click Send Request.
Important:
If the error message Unsupported Media Type: Only 'application/json' is allowed appears, add the following request header in the Headers section and click Send Request. Key: Content-Type
Value: application/json
If the error message The model "Model name" does not exist. appears, check whether the model name in the request body is correct. You can send a GET request to the /v1/models endpoint of the service to obtain the model name.
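For reference, the following minimal Python sketch shows one way to send that GET request with the requests library (an illustration only; <EAS_ENDPOINT> and <EAS_TOKEN> are placeholders for your service access address and token):
import requests

# Placeholders (assumed for this sketch): replace with your service access address and token.
endpoint = "<EAS_ENDPOINT>"
token = "<EAS_TOKEN>"

# List the model names exposed by the service at the /v1/models endpoint.
response = requests.get(f"{endpoint}/v1/models", headers={"Authorization": token})
print(response.json())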
Request path:
<EAS_ENDPOINT>/v1/chat/completions
Request body:
// vLLM/SGLang-accelerated deployment: Specify a model name for the model parameter. You can send a GET request to the /v1/models endpoint of the service to obtain the model name.
// BladeLLM-accelerated deployment: The model parameter is not required.
// Transformers standard deployment: Set the model parameter to /model_dir/. Some models may need to be modified based on the returned result.
{
  "model": "Qwen2.5-7B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of Canada?"
    }
  ]
}
Sample result: The response format varies based on the deployment method (vLLM/SGLang-accelerated deployment, BladeLLM-accelerated deployment, or Transformers standard deployment).
API calling
1. Obtain the service access address and token.
On the Elastic Algorithm Service (EAS) page, select a workspace and click Enter Elastic Algorithm Service (EAS).
On the Elastic Algorithm Service (EAS) page, find the desired service and click its name.
In the Basic Information section of the Overview tab, click View Endpoint Information to obtain the service token and access address.
Note: You can use the public endpoint or the virtual private cloud (VPC) endpoint:
To use the public endpoint, the client must support access over the Internet.
To use the VPC endpoint, the client must be in the VPC in which the EAS service resides.
2. Call the API
The API calling methods vary based on the deployment methods. Select an appropriate calling method based on your deployment method.
vLLM/SGLang-accelerated deployment
In the terminal, run the following code to call the service:
Python
from openai import OpenAI

##### API configuration #####
openai_api_key = "<EAS API KEY>"
openai_api_base = "<EAS API Endpoint>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
print(model)


def main():
    stream = True
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What is the capital of Canada?",
                    }
                ],
            }
        ],
        model=model,
        max_completion_tokens=2048,
        stream=stream,
    )
    if stream:
        for chunk in chat_completion:
            print(chunk.choices[0].delta.content, end="")
    else:
        result = chat_completion.choices[0].message.content
        print(result)


if __name__ == "__main__":
    main()
Take note of the following parameters:
Replace <EAS API KEY> with the service token.
Replace <EAS API Endpoint> with the service access address.
CLI
curl -X POST <service_url>/v1/chat/completions -d '{
"model": "<model_name>",
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are a helpful and harmless assistant."
}
]
},
{
"role": "user",
"content": "What is the capital of Canada?"
}
]
}' -H "Content-Type: application/json" -H "Authorization: <token>"
Take note of the following parameters:
Replace <service_url> with the service access address.
Replace <token> with the service token.
Replace <model_name> with the model name. You can call the <service_url>/v1/models endpoint to obtain the model name:
curl -X GET \
  -H "Authorization: <token>" \
  <service_url>/v1/models
BladeLLM-accelerated deployment
In the terminal, run the following command to call the service and obtain the generated text in streaming mode:
# Call EAS service
curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: AUTH_TOKEN_FOR_EAS" \
-d '{"prompt":"What is the capital of Canada?", "stream":"true"}' \
<service_url>/v1/completions
Take note of the following parameters:
Set the Authorization value to the service token.
Replace <service_url> with the service access address.
Sample result:
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" The"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":1,"total_tokens":8},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" capital"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":2,"total_tokens":9},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" of"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":3,"total_tokens":10},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" Canada"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":4,"total_tokens":11},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" is"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":5,"total_tokens":12},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":" Ottawa"}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":6,"total_tokens":13},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"","index":0,"logprobs":null,"text":"."}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":7,"total_tokens":14},"error_info":null}
data: {"id":"91f9a28a-f949-40fb-b720-08ceeeb2****","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"text":""}],"object":"text_completion","usage":{"prompt_tokens":7,"completion_tokens":8,"total_tokens":15},"error_info":null}
data: [DONE]
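If you prefer Python, the following minimal sketch shows one way to consume the same stream with the requests library (an illustration based on the curl call and sample output above, not an official client; <service_url> and <token> are placeholders for your service access address and token):
import json
import requests

# Placeholders (assumed for this sketch): replace with your service access address and token.
service_url = "<service_url>"
token = "<token>"

payload = {"prompt": "What is the capital of Canada?", "stream": "true"}
headers = {"Content-Type": "application/json", "Authorization": token}

# Stream the response and print the text of each "data:" chunk as it arrives.
with requests.post(f"{service_url}/v1/completions", headers=headers,
                   json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        decoded = line.decode("utf-8")
        if not decoded.startswith("data:"):
            continue
        data = decoded[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["text"], end="", flush=True)
print()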
Transformers standard deployment
This deployment method supports the following calling methods:
Use HTTP
Non-streaming mode
You can run curl commands to send the following types of standard HTTP requests:
STRING requests
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
Replace $authorization with the service token. Replace $host with the service access address. The chatllm_data.txt file is a plain text file that contains the prompt, such as: What is the capital of Canada?
Structured requests
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
Use the chatllm_data.json file to configure inference parameters. The following sample code provides a format example of the chatllm_data.json file:
{ "max_new_tokens": 4096, "use_stream_chat": false, "prompt": "What is the capital of Canada?", "system_prompt": "Act like you are a knowledgeable assistant who can provide information on geography and related topics.", "history": [ [ "Can you tell me what's the capital of France?", "The capital of France is Paris." ] ], "temperature": 0.8, "top_k": 10, "top_p": 0.8, "do_sample": true, "use_cache": true }
The following table describes the parameters in the preceding code. Configure the parameters based on your business requirements.
| Parameter | Description | Default value |
| --- | --- | --- |
| max_new_tokens | The maximum number of output tokens. | 2048 |
| use_stream_chat | Specifies whether to return the output tokens in streaming mode. | true |
| prompt | The user prompt. | "" |
| system_prompt | The system prompt. | "" |
| history | The dialogue history. The value is in the List[Tuple(str, str)] format. | [()] |
| temperature | The randomness of the model output. A larger value indicates higher randomness. A value of 0 produces a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
| top_k | The number of candidate outputs from which the result is sampled. | 30 |
| top_p | The probability threshold of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. | 0.8 |
| do_sample | Specifies whether to enable output sampling. | true |
| use_cache | Specifies whether to enable the KV cache. | true |
You can also implement your own client based on the Python requests package. You can use the --prompt parameter to specify the request content, for example, python xxx.py --prompt "What is the capital of Canada?".
import argparse
import json
from typing import Iterable, List

import requests


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_response(response: requests.Response) -> List[str]:
    # The server returns a JSON response that includes the inference result and dialogue history.
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<Public access address of the EAS service>"
    authorization = "<Public token of the EAS service>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # System prompts can be included in the request.
    system_prompt = "Act like you are programmer with \
5+ years of experience."

    # Dialogue history can be included in the client request. The client manages the dialogue
    # history to implement multi-round dialogues. In most cases, information from the previous
    # round of dialogue is used. The value is in the List[Tuple(str, str)] format.
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)
Take note of the following parameters:
Set the host parameter to the service access address.
Set the authorization parameter to the service token.
Streaming mode
In streaming mode, the HTTP SSE method is used. You can use the --prompt parameter to specify the request content, for example, python xxx.py --prompt "What is the capital of Canada?".
import argparse
import json
from typing import Iterable, List

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = ""
    authorization = ""

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are programmer with \
5+ years of experience."
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(f" --- stream line: {h} \n --- history: {history}", flush=True)
Take note of the following parameters:
Set the host parameter to the service access address.
Set the authorization parameter to the service token.
Use WebSocket
The WebSocket protocol can efficiently handle the dialogue history. You can use the WebSocket method to connect to the service and perform one or more rounds of dialogue. Sample code:
import os
import time
import json
import struct
from multiprocessing import Process
import websocket
round = 5
questions = 0
def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
              time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
              time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    if message == "<EOS>":
        questions = questions + 1
        if questions == 5:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)


# stream chat validation test
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """
    ws.send(raw_req)
    # end the client-side streaming


# multi-round query validation test
def on_open_2(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['prompt'] = "Hello"
    params_dict = {
        "system_prompt":
            "Act like you are programmer with 5+ years of experience."
    }
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please convert the programming language to Java."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the dialogue above."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# Langchain validation test.
def on_open_3(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""
host = "ws://" + ""


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=[
            'Authorization: ' + authorization],
    )
    # setup ping interval to keep long connection.
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))
        p1.start()
        p2.start()
        p3.start()
        p1.join()
        p2.join()
        p3.join()
Take note of the following parameters:
Set the authorization parameter to the service token.
Set the host parameter to the service access address. Replace the http prefix in the address with ws.
Use the use_stream_chat parameter to specify whether the output is returned in streaming mode. Default value: True.
Refer to the on_open_2 function in the preceding code to implement a multi-round dialogue.
References
You can use EAS to deploy a Retrieval-Augmented Generation (RAG)-based LLM chatbot. The chatbot supports information retrieval by using an on-premises knowledge base. After you use LangChain to integrate your business data, you can use WebUI or API operations to verify the inference capability of a model. For more information, see RAG-based LLM chatbot.