You can use Elastic Algorithm Service (EAS) of Platform for AI (PAI) to deploy a large language model (LLM) as an AI-powered web application. After you deploy the model, you can call the application by using the web UI or API operations. You can also use the LangChain framework to integrate an enterprise knowledge base and implement intelligent conversation and automation capabilities. EAS also provides the BladeLLM and vLLM inference acceleration engines, which support high concurrency and low latency.
Background information
As foundation models such as ChatGPT and Tongyi Qianwen gain popularity, inference applications built on LLMs have come under the spotlight. EAS allows you to choose open source foundation models based on their performance and your business requirements. For example, you can quickly launch third-party model files of LLMs such as Qwen, Llama2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B in EAS, and deploy an open source model as an inference application in a few clicks. This topic describes how to deploy an LLM in EAS and call the model service. This topic also provides answers to frequently asked questions.
Prerequisites
EAS is activated and the default workspace is created. For more information, see Activate PAI and create the default workspace.
If you use a RAM user to deploy the model, make sure that the RAM user is granted the management permissions on EAS. For more information, see Grant the permissions that are required to use EAS.
Limits
The inference acceleration engine supports only the following model types: Qwen, Llama2, Baichuan-13B, and Baichuan2-13B.
Deploy model service in EAS
Go to the EAS-Online Model Services page.
Log on to the Platform for AI (PAI) console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.
In the left-side navigation pane, choose Model Deployment>Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.
On the EAS-Online Model Services page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.
On the Deploy Service page, configure the required parameters. The following table describes key parameters.
Parameter
Description
Service Name
The name of the service. The service name llm_demo001 is used in this example.
Deployment Method
Select Deploy Web App by Using Image.
Select Image
Click PAI Image, select chat-llm-webui from the drop-down list, and select 2.1 as the image version.
Note: You can select the latest version of the image when you deploy the model service.
Command to Run
After you select an image version, the system automatically sets the command to python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat. This command calls the Qwen-7B model. If you want to call another LLM, replace the command. For more information, see the How do I switch to another open source foundation model? section of this topic.
Resource Group Type
Select Public Resource Group.
Resource Configuration Mode
Select General.
Resource Configuration
You must select a GPU instance type. For cost-effectiveness, we recommend the ml.gu7i.c16m60.1-gu30 instance type to run the Qwen-7B model. For the instance types that are recommended for other open source LLMs, see the How do I switch to another open source foundation model? section of this topic.
Click Deploy. The deployment requires several minutes to complete.
When the Model Status changes to Running, the service is deployed.
Use web UI to perform model inference
Find the service that you want to manage and click View Web App in the Service Type column.
Perform model inference on the web UI page.
In the input text box below the dialog box, enter a sentence, such as please provide a financial learning plan. Then, click Send to start the dialogue.
FAQ
How do I switch to another open source foundation model?
EAS allows you to deploy the following open source foundation models: Qwen, Llama2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B. To switch to one of these models, perform the following steps:
On the EAS-Online Model Services page, find the service that you want to update. Click Update Service in the Actions column of the service.
On the Deploy Service page, update the Command to Run and Instance Type parameters based on the following table. Then, click Deploy.
Model type
Command to Run
Recommended specification
Qwen-1.8B
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-1_8B-Chat
1 * GU30
1 * NVIDIA A10
1 * NVIDIA T4
1 * NVIDIA V100
Qwen-7B
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat
1 * GU30
1 * NVIDIA A10
Qwen-14B
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-14B-Chat
1 * NVIDIA V100 (gn6e)
2 * GU30
2 * NVIDIA A10
Qwen-72B
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-72B-Chat
8 * NVIDIA V100 (gn6e)
Llama2-7B
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf
1 * GU30
1 * NVIDIA A10
Llama2-13B
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16
1 * NVIDIA V100 (gn6e)
2 * GU30
2 * NVIDIA A10
chatglm2-6B
python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm2-6b
1 * GU30
1 * NVIDIA A10
chatglm3-6B
python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm3-6b
1 * GU30
1 * NVIDIA A10
baichuan-13B
python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan-13B-Chat
1 * NVIDIA V100 (gn6e)
2 * GU30
2 * NVIDIA A10
baichuan2-7B
python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat
1 * GU30
1 * NVIDIA A10
baichuan2-13B
python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat
1 * NVIDIA V100 (gn6e)
2 * GU30
2 * NVIDIA A10
Yi-6B
python webui/webui_server.py --port=8000 --model-path=01-ai/Yi-6B
1 * GU30
1 * NVIDIA A10
Mistral-7B
python webui/webui_server.py --model-path=mistralai/Mistral-7B-Instruct-v0.1
1 * GU30
1 * NVIDIA A10
falcon-7B
python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct
1 * GU30
1 * NVIDIA A10
How do I use LangChain to integrate my business data?
What is LangChain:
LangChain is an open source framework that allows AI developers to integrate LLMs like GPT-4 with external data to improve performance and optimize resource utilization.
How does LangChain work:
LangChain divides a document, such as a 20-page PDF file, into smaller chunks, converts the chunks into embeddings, and stores them in a vector store as the local knowledge base of the LLM.
During each inference, LangChain searches the local knowledge base for the content that is most similar to the user question, and then passes the retrieved content together with the user question to the LLM to generate a customized answer.
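The following Python snippet is a minimal sketch of this retrieval workflow, shown outside of EAS for illustration only. It assumes that the langchain, sentence-transformers, and faiss-cpu packages are installed; the file path docs/readme.md, the embedding model name, and the chunk sizes are placeholder values, and the exact module paths can differ across LangChain versions.
# Minimal sketch of the LangChain retrieval flow: split a document into chunks,
# embed the chunks into a vector store, and retrieve the chunks that are most
# similar to a user question. File paths and model names are placeholders.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Load a local document and split it into smaller chunks.
documents = TextLoader("docs/readme.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(documents)

# Embed the chunks and build a local vector store that serves as the knowledge base.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# Retrieve the chunks that are most similar to the user question. The retrieved
# text and the question are then sent to the LLM together to generate the answer.
question = "how to install deepspeed"
related_chunks = vectorstore.similarity_search(question, k=3)
prompt = "\n".join(c.page_content for c in related_chunks) + "\n\nQuestion: " + question
print(prompt)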
How to configure LangChain:
Click LangChain and go to the LangChain tab on the web UI page.
Upload custom data in the lower-left corner of the web UI page. You can upload files in TXT, MD, DOCX, and PDF formats.
For example, you can drag a README.md file into the upload area and then click Vectorstore knowledge in the lower-left corner. After the file is processed, the custom data is loaded as the knowledge base.
In the input box at the bottom of the web UI page, enter a question about the uploaded data, such as how to install deepspeed, and click Send to start the dialogue.
Note: After you use LangChain to integrate business data on the web UI page, you can perform model inference with the data by using API operations. You can also perform vector searches in an on-premises knowledge base. For more information, see Use PAI and a vector database to implement intelligent dialogue based on LLMs.
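The HTTP clients shown later in this topic enable this behavior by adding a langchain field to the request payload. The following Python snippet is a minimal sketch; the endpoint, token, and prompt values are placeholders.
# Minimal sketch: query the knowledge base that was integrated through LangChain
# by setting "langchain" to True in the request payload.
import requests

host = "<EAS service public endpoint>"
authorization = "<EAS service public token>"

payload = {
    "prompt": "how to install deepspeed",  # example question about the uploaded data
    "langchain": True,                      # route the request through the LangChain knowledge base
    "use_stream_chat": False,
}
response = requests.post(host, headers={"Authorization": authorization}, json=payload)
print(response.json()["response"])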
How do I improve concurrency and reduce latency for the inference service?
EAS provides BladeLLM and vLLM inference acceleration engines to ensure high concurrency and low latency for the inference service. Perform the following steps:
On the EAS-Online Model Services page, find the service that you want to update. Click Update Service in the Actions column of the service.
In the Model Service Information section, add the --backend=vllm parameter to the Command to Run parameter and click Deploy.
Important: The inference acceleration engine supports only the following model types: Qwen, Llama2, Baichuan-13B, and Baichuan2-13B.
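For example, if the service runs the Qwen-7B model with the default command, the updated command looks as follows. This example only illustrates where the parameter is appended; keep the rest of your command unchanged.
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat --backend=vllm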
Update the versions of Transformers and vLLM.
As new models are released, previously released models and newly released models may require different versions of toolkits such as Transformers and vLLM, which can cause compatibility issues. To resolve these issues, we recommend that you upgrade toolkits such as Transformers and vLLM based on your business requirements. Specify the required toolkit versions in the Third-party Library Settings section.
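For example, you can enter one library per line in the Third-party Library Settings section, typically in the pip requirements format. The version numbers below are placeholders; use the versions that your model requires.
transformers==4.37.0
vllm==0.2.7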
How do I mount a custom model?
You can use Object Storage Service (OSS) to mount a custom model. Procedure:
Upload the model and related configuration files to your OSS bucket. For more information about how to create a bucket and upload objects, see Create buckets and Upload objects.
The model directory must contain the model weight files and the related configuration files. The config.json file is required, and you must configure it based on the Hugging Face model format. For more information about the sample file, see config.json.
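The following listing is a hypothetical example of what such a directory can look like for a model in the Hugging Face format; the exact files vary by model.
data-oss/
├── config.json
├── generation_config.json
├── tokenizer_config.json
├── tokenizer.model
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
└── model.safetensors.index.json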
Click Update Service in the Actions column of the service.
In the Model Service Information section, specify the required parameters and click Deploy.
Parameter
Description
Model Settings
Click Specify Model Settings to configure the model.
Select Mount OSS Path. Set the OSS bucket path to the path in which the custom model files reside, for example, oss://bucket-test/data-oss/.
Set Mount Path to /data.
Enable Read-only Mode: turn off the read-only mode.
Command to Run
Add the following parameters to Command to Run:
--model-path: Set this parameter to /data, which is the mount path that you specified.
--model-type: the model type.
For more information about commands to run for different types of models, see Commands to run.
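For example, assuming that a Llama2 model is mounted to /data, the updated command can look as follows. The --model-type value is only illustrative; use the value for your model as described in Commands to run.
python webui/webui_server.py --port=8000 --model-path=/data --model-type=llama2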
How do I use API operations to perform model inference?
Obtain the service access endpoint and token.
Go to the EAS-Online Model Services page. For more information, see the Deploy model service in EAS section of this topic.
Click the name of the service to go to the Service Details tab.
In the Basic Information section, click Invocation Method. On the Public Endpoint tab, obtain the service token and endpoint.
Perform model inference by calling API operations.
Call the service by using HTTP
Non-streaming mode
When you run cURL commands, the client sends standard HTTP requests of the following types:
STRING requests
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
Replace $authorization with the service token. Replace $host with the service endpoint. The file chatllm_data.txt is a plain text file that contains the prompt.
Structured requests
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
Use the chatllm_data.json file to configure inference parameters. The following code provides an example of the content format of the chatllm_data.json file:
{ "max_new_tokens": 4096, "use_stream_chat": false, "prompt": "How to install it?", "system_prompt": "Act like you are programmer with 5+ years of experience." "history": [ [ "Can you tell me what's the bladellm?", "BladeLLM is an framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc. , and supporting popular LLMs like OPT, Bloom, LLaMA, etc." ] ], "temperature": 0.8, "top_k": 10, "top_p": 0.8, "do_sample": True, "use_cache": True, }
The following table describes the parameters. Configure the parameters based on your business requirements.
Parameter
Description
Default value
max_new_tokens
The maximum number of output tokens.
2048
use_stream_chat
Specify whether to return the output tokens in streaming mode.
True
prompt
The user prompt.
""
system_prompt
The system prompt.
""
history
The dialogue history. The value is of the List[Tuple(str, str)] type.
[()]
temperature
Specify the randomness of the model output. A larger value indicates a higher randomness. A value of 0 indicates a fixed output. The value is of the Float type and ranges from 0 to 1.
0.95
top_k
The number of outputs selected from the generated results.
30
top_p
The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1.
0.8
do_sample
Specify whether to enable output sampling.
True
use_cache
Specify whether to enable KV cache.
True
You can also implement your own client based on the Python requests package. Example:
import argparse
import json
from typing import Iterable, List

import requests


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_response(response: requests.Response) -> List[str]:
    # The server returns a JSON response that includes the inference result and dialogue history.
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # System prompts can be included in the requests.
    system_prompt = "Act like you are programmer with \
5+ years of experience."
    # Dialogue history can be included in the requests. The client manages the history
    # to implement multi-round dialogues. In most cases, the information from the previous
    # round of dialogue is used. The information is in the List[Tuple(str, str)] format.
    history = []

    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)
    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
Streaming mode
The streaming mode uses the HTTP SSE method. Sample code:
import argparse
import json
from typing import Iterable, List

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False, delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = ""
    authorization = ""

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are programmer with \
5+ years of experience."
    history = []

    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(f" --- stream line: {h} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
Call the service by using WebSocket
The WebSocket protocol is more efficient for handling the conversation history. You can use the WebSocket method to connect to the service and perform one or multiple rounds of conversation. Sample code:
import os
import time
import json
import struct
from multiprocessing import Process

import websocket

round = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(), time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(), time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    if message == "<EOS>":
        questions = questions + 1
        if questions == 5:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)


# stream chat validation test
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """
    ws.send(raw_req)
    # end the client-side streaming


# multi-round query validation test
def on_open_2(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['prompt'] = "Hello! "
    params_dict['system_prompt'] = "Act like you are programmer with 5+ years of experience."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please convert to the Java implementation."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself?"
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the dialogue above."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# LangChain validation test
def on_open_3(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""
host = "ws://" + ""


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=['Authorization: ' + authorization],
    )
    # set up a ping interval to keep the long connection alive
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))
        p1.start()
        p2.start()
        p3.start()
        p1.join()
        p2.join()
        p3.join()
Parameters:
Set authorization to the service token.
Set host to the service access endpoint. Replace the http in the endpoint with ws.
Use the use_stream_chat parameter to specify whether the server returns the output in streaming mode. The default value is True, which indicates that the server returns streaming data.
Refer to the implementation of the on_open_2 function in the preceding sample code to implement multi-round conversations.
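For example, the WebSocket address can be derived from the public endpoint that you obtain on the Invocation Method tab, as shown in the following minimal sketch. The endpoint value is a placeholder.
# Derive the WebSocket address from the HTTP endpoint of the EAS service.
endpoint = "http://<your-eas-endpoint>/api/predict/llm_demo001"  # placeholder
host = endpoint.replace("http://", "ws://", 1)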
References
For more information about EAS, see EAS overview.
You can also perform vector search in an on-premises knowledge base. For more information, see Use PAI and a vector database to implement intelligent dialogue based on LLMs.