Llama 2 Chat models are suited to dialogue use cases. Elastic Algorithm Service (EAS) of Platform for AI (PAI) lets you deploy a chat application that is powered by a Llama 2 Chat model and accessible from a web page. You can also use LangChain to integrate your business data into the application so that the answers align with your business requirements.
Background information
Llama 2 is a series of open-source large language models (LLMs) provided by Meta. The models are available in three parameter scales: 7B, 13B, and 70B. Llama 2 models were trained on 2 trillion tokens, which is a 40% increase over Llama 1. Llama 2 models support a maximum sequence length of 4,096 tokens, which is twice that of Llama 1 models.
Llama 2 Chat models are fine-tuned versions of pre-trained Llama 2 models that cater to chat scenarios. Supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are used in the fine-tuning process to improve the safety of the models and their alignment with human preferences. The fine-tuning data includes publicly available instruction datasets and over 1 million human-labeled samples. Llama 2 Chat can be used as a chat assistant in various natural language generation scenarios.
This topic describes how to deploy a Llama 2 model as a chat application in EAS and use the web UI to perform model inference. This topic also provides answers to frequently asked questions about Llama 2 model deployment. In this example, the Llama2-13b-chat model is used.
Deploy a model service in EAS
Go to the EAS-Online Model Services page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.
On the PAI-EAS Model Online Service page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.
On the Create Service page, configure the parameters. The following table describes the parameters.
Parameter | Description |
Service Name | The name of the service. In this example, chatllm_llama2_13b is specified. |
Deployment Method | Select Deploy Web App by Using Image. |
Select Image | Click PAI Image, select chat-llm-webui from the drop-down list, and then select 2.0 for the image version. Note: If a version newer than 2.0 is available when you deploy the service, select the latest version. |
Command to Run | Specify one of the following commands based on your model, and set the port number to 8000. In this example, the 13B model is used. Command for the 13B model: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16 Command for the 7B model: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf |
Resource Group Type | Select Public Resource Group. |
Resource Configuration Mode | Select General. |
Resource Configuration | Click GPU and then select an instance type from the list. In this example, ecs.gn6e-c12g1.3xlarge is selected. The 13B model requires an instance type that belongs to the gn6e instance family or higher. For the 7B model, an A10 or GU30 instance is recommended. |
Additional System Disk | Set the value to 50. |
Click Deploy. The deployment requires approximately five minutes to complete.
When the Model Status changes to Running, the service is deployed.
Use the web application to perform model inference
Find the service that you want to manage and click View Web App in the Service Type column.
Perform model inference on the web application.
Enter a prompt in the input text box, such as "Give me a plan for learning the basics of personal finance", and then click Send.
FAQ
How do I use LangChain to integrate my business data into the application?
What is LangChain?
LangChain is an open source framework that allows AI developers to integrate LLMs such as GPT-4 with external data to improve performance and optimize resource utilization.
How does LangChain work?
LangChain divides a document, such as a 20-page PDF file, into smaller chunks, converts the chunks into numerical vectors by using embedding models such as BAAI General Embedding (BGE) or text2vec, and then stores the vectors in a vector store.
LangChain stores the processed data locally as the knowledge base of the LLM. During each inference, LangChain searches the local knowledge base for content that is similar to the input question, and then passes the retrieved content together with the user input to the LLM to generate a custom answer.
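The chunk, embed, and retrieve flow described above can be sketched without any framework. The following is a minimal illustration only, not LangChain's API: the function names are hypothetical, and a bag-of-words counter stands in for a real embedding model such as BGE or text2vec.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_into_chunks(text: str, chunk_size: int = 40) -> List[str]:
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, store: List[Tuple[str, Counter]], k: int = 1) -> List[str]:
    """Return the k chunks whose vectors are most similar to the question."""
    query = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Combine the retrieved chunks and the user question into one LLM input."""
    context = "\n".join(context_chunks)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Made-up sample document used only for this illustration.
document = (
    "EAS deploys models as online services. "
    "The chat-llm-webui image serves a web UI on port 8000. "
    "Uploaded files can be TXT, MD, DOCX, or PDF."
)
# The "vector store" here is just a list of (chunk, vector) pairs.
store = [(chunk, embed(chunk)) for chunk in split_into_chunks(document, chunk_size=8)]
question = "Which port does the web UI use?"
top = retrieve(question, store, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

A production setup would replace the word counts with dense embeddings and the list with a real vector store, but the data flow stays the same: chunk, embed, store, retrieve, then prompt the LLM.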
How to integrate your business data
Click the LangChain tab in the upper-left corner of the page.
Upload custom data in the lower-left corner of the web page. You can upload files in the following formats: TXT, MD, DOCX, and PDF.
For example, you can drag and drop a README.md file to upload the file and then click Vectorstore knowledge in the lower-left corner. A message is displayed when the custom data is loaded.
In the input box in the lower part of the page, enter a sentence to start a dialogue.
How do I switch to another open source foundation model?
EAS comes with several open source foundation models, such as Llama 2, ChatGLM, and Tongyi Qianwen. Perform the following steps to switch the deployed service to another model.
On the Elastic Algorithm Service (EAS) page, find the service that you want to update. Click Update Service in the Actions column of the service.
On the Update Service page, modify the Command to Run and Instance Type parameters based on the following table. Then, click Update.
Model type | Use method | Command to Run | Recommended specification |
Llama2-13b | API+WebUI | python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16 | 1 * NVIDIA V100 (gn6e); 2 * GU30; 2 * NVIDIA A10 |
Llama2-7b | API+WebUI | python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf | 1 * GU30; 1 * NVIDIA A10 |
ChatGLM2-6B | API+WebUI | python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm2-6b | 1 * GU30; 1 * NVIDIA A10 |
Qwen-7b (Tongyi Qianwen-7b) | API+WebUI | python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat | 1 * GU30; 1 * NVIDIA A10 |
ChatGLM-6B | API+WebUI | python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm-6b | 1 * GU30; 1 * NVIDIA A10 |
Baichuan-13B | API+WebUI | python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan-13B-Chat | 1 * NVIDIA V100 (gn6e); 2 * GU30; 2 * NVIDIA A10 |
Falcon-7B | API+WebUI | python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct | 1 * GU30; 1 * NVIDIA A10 |
Baichuan2-7B | API+WebUI | python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat | 1 * GU30; 1 * NVIDIA A10 |
Baichuan2-13B | API+WebUI | python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat | 2 * GU30; 2 * NVIDIA A10 |
Qwen-14b (Tongyi Qianwen-14b) | API+WebUI | python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-14B-Chat | 1 * NVIDIA V100 (gn6e); 2 * GU30; 2 * NVIDIA A10 |
How do I deploy a custom model?
You can use Object Storage Service (OSS) to mount a custom model. Perform the following steps:
Upload the model and related configuration files to your OSS bucket. For information about how to create a bucket and upload objects, see Get started by using the OSS console.
The following figure shows a sample of the model files that you need to prepare:
The config.json file must be uploaded. Configure the config.json file based on the Hugging Face model format. For more information about the sample file, see config.json.
Click Update Service in the Actions column of the service.
In the Model Service Information section, configure the following parameters and click Update.
Parameter | Description |
Model Settings | Click Specify Model Settings. Select Mount OSS Path in the Model Settings section. Set the OSS bucket path to the path of the custom model files, for example, oss://bucket-test/data-oss/ and set Mount Path to /data. Turn off Enable Read-only Mode to disable the read-only mode. |
Command to Run | Add the following options to the command. --model-path: Set the value to /data, which is the same as the mount path. --model-type: the type of the model. For more information about commands to run for different types of models, see Commands to run. |
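Because config.json must be part of the upload, a quick local check before you copy the files to OSS can save a failed deployment. The following sketch is an assumption about a convenient workflow, not part of EAS: the check_model_dir helper is hypothetical, and it only verifies that config.json exists and parses as JSON.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def check_model_dir(model_dir: str) -> List[str]:
    """Return a list of problems found; an empty list means the basic check passed."""
    problems: List[str] = []
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        problems.append("config.json is missing")
    else:
        try:
            json.loads(config_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"config.json is not valid JSON: {exc}")
    return problems


# Demonstration on a temporary directory instead of a real model folder.
with tempfile.TemporaryDirectory() as tmp:
    before = check_model_dir(tmp)
    # Sample content only; a real config.json contains many more fields.
    (Path(tmp) / "config.json").write_text('{"model_type": "llama"}', encoding="utf-8")
    after = check_model_dir(tmp)

print("before:", before)
print("after:", after)
```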
How do I use API operations to perform model inference?
Obtain the service access endpoint and token.
Go to the Elastic Algorithm Service (EAS) page. For more information, see the "Deploy a model service in EAS" section of this topic.
Click the name of the service to go to the Service Details tab.
In the Basic Information section, click Invocation Method. On the Public Endpoint tab, obtain the service token and endpoint.
Perform model inference by calling API operations.
Non-streaming mode
The client sends standard HTTP requests of the following types when cURL commands are run.
STRING requests
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
Replace $authorization with the service token. Replace $host with the service endpoint. The file chatllm_data.txt is a plain text file that contains the prompt.
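The same STRING request can be built in Python with only the standard library. This is a minimal sketch: the build_string_request helper, the endpoint URL, and the token value are placeholders introduced for illustration, and the request is constructed but not sent.

```python
import urllib.request


def build_string_request(host: str, authorization: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a plain-text POST request that mirrors the cURL call."""
    return urllib.request.Request(
        url=host,
        data=prompt.encode("utf-8"),  # the raw prompt text is the request body
        headers={"Authorization": authorization},
        method="POST",
    )


request = build_string_request(
    host="http://example.invalid/chatllm",  # placeholder, use your service endpoint
    authorization="<service token>",        # placeholder, use your service token
    prompt="Give me a plan for learning the basics of personal finance",
)
print(request.get_method(), request.full_url)
# To send it against a real service:
# answer = urllib.request.urlopen(request).read().decode("utf-8")
```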
Structured requests
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
Use the chatllm_data.json file to configure inference parameters. The following sample code provides an example of the content format of the chatllm_data.json file:
{
  "max_new_tokens": 4096,
  "use_stream_chat": false,
  "prompt": "How to install it?",
  "system_prompt": "Act like you are programmer with 5+ years of experience.",
  "history": [
    [
      "Can you tell me what's the bladellm?",
      "BladeLLM is a framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc., and supporting popular LLMs like OPT, Bloom, LLaMA, etc."
    ]
  ],
  "temperature": 0.8,
  "top_k": 10,
  "top_p": 0.8,
  "do_sample": true,
  "use_cache": true
}
The following table describes the parameters in the preceding code. Configure the parameters based on your business requirements.
Parameter | Description | Default value |
max_new_tokens | The maximum number of output tokens. | 2048 |
use_stream_chat | Specify whether to return the output tokens in streaming mode. | True |
prompt | The user prompt. | "" |
system_prompt | The system prompt. | "" |
history | The dialogue history. The value is in the List[Tuple(str, str)] format. | [()] |
temperature | Specify the randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
top_k | The number of outputs selected from the generated results. | 30 |
top_p | The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. | 0.8 |
do_sample | Specify whether to enable output sampling. | True |
use_cache | Specify whether to enable KV cache. | True |
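To avoid malformed request bodies, you can assemble the JSON in code and validate it before sending. The sketch below is one possible client-side helper, not part of the service: build_payload is a hypothetical name, the defaults are copied from the table above, and an empty list is sent when no history is given.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

# Defaults copied from the parameter table.
DEFAULTS: Dict[str, Any] = {
    "max_new_tokens": 2048,
    "use_stream_chat": True,
    "system_prompt": "",
    "temperature": 0.95,
    "top_k": 30,
    "top_p": 0.8,
    "do_sample": True,
    "use_cache": True,
}


def build_payload(prompt: str,
                  history: Optional[List[Tuple[str, str]]] = None,
                  **overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults and check the documented value ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {**DEFAULTS, **overrides, "prompt": prompt, "history": history or []}
    for name in ("temperature", "top_p"):
        if not 0.0 <= payload[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return payload


# json.dumps emits valid JSON literals (true/false) and no trailing commas.
body = json.dumps(build_payload("How to install it?", temperature=0.8, use_stream_chat=False))
print(body)
```

The resulting string can be written to chatllm_data.json or passed directly as the request body.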
You can also implement your own client based on the Python requests package. Example:
import argparse
import json
from typing import Tuple

import requests


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    # If no history is passed in, a sample dialogue round is used.
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. "
                "It is known for its iconic landmarks, such as the Golden Gate Bridge "
                "and Alcatraz Island, as well as its vibrant culture, diverse population, "
                "and tech industry. The city is also home to many famous companies and "
                "startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


# The server returns a JSON response that includes the inference result and dialogue history.
def get_response(response: requests.Response) -> Tuple[str, list]:
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # System prompts can be included in the requests.
    system_prompt = "Act like you are programmer with 5+ years of experience."
    # Dialogue history can be included in the requests. The client manages the
    # history to implement multi-round dialogues. In most cases, the previous
    # round of dialogue information is used. The information is in the
    # List[Tuple(str, str)] format.
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)
    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
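Because the client manages the dialogue history, a multi-round conversation amounts to passing the history returned by one round into the next. The following sketch shows that loop in isolation. The names chat_rounds and fake_send are hypothetical, and fake_send is an offline stand-in for a call to post_http_request followed by get_response.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]


def chat_rounds(prompts: List[str],
                send: Callable[[str, History], Tuple[str, History]]) -> History:
    """Send prompts one by one, feeding the returned history into the next round."""
    history: History = []
    for prompt in prompts:
        output, history = send(prompt, history)
        print(f"user: {prompt}\nassistant: {output}")
    return history


def fake_send(prompt: str, history: History) -> Tuple[str, History]:
    """Offline stand-in for the HTTP call: echoes the prompt and appends the round."""
    output = f"echo({prompt}) after {len(history)} earlier round(s)"
    return output, history + [(prompt, output)]


final_history = chat_rounds(["Hello!", "Please summarize the dialogue above."], fake_send)
```

To use a real service, replace fake_send with a function that calls post_http_request with the current history and returns the result of get_response.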
Streaming mode
The streaming mode uses the HTTP SSE method. Sample code:
import argparse
import json
from typing import Iterable, Tuple

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    # If no history is passed in, a sample dialogue round is used.
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. "
                "It is known for its iconic landmarks, such as the Golden Gate Bridge "
                "and Alcatraz Island, as well as its vibrant culture, diverse population, "
                "and tech industry. The city is also home to many famous companies and "
                "startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[Tuple[str, list]]:
    # Each message is a JSON document; messages are separated by NUL bytes.
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = ""           # the service access endpoint
    authorization = ""  # the service token

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are programmer with 5+ years of experience."
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(f" --- stream line: {h} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
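The streaming client reads JSON messages that are separated by NUL bytes. The sketch below reproduces that framing offline so that the parsing logic can be tested without a running service. The parse_stream helper is hypothetical and the sample messages are made up; only the "response" and "history" keys are taken from the sample code.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def parse_stream(byte_chunks: Iterable[bytes]) -> Iterator[Tuple[str, list]]:
    """Yield (response, history) for each NUL-delimited JSON message."""
    buffer = b""
    for piece in byte_chunks:
        buffer += piece
        # A network chunk may hold several messages or only part of one.
        while b"\x00" in buffer:
            message, buffer = buffer.split(b"\x00", 1)
            if message:
                data = json.loads(message.decode("utf-8"))
                yield data["response"], data["history"]
    # Handle a final message that is not followed by a delimiter.
    if buffer.strip():
        data = json.loads(buffer.decode("utf-8"))
        yield data["response"], data["history"]


# Two network chunks; the second message is split across them.
simulated: List[bytes] = [
    b'{"response": "Hel", "history": []}\x00{"respo',
    b'nse": "Hello", "history": [["Hi", "Hello"]]}\x00',
]
outputs = [text for text, _ in parse_stream(simulated)]
print(outputs)
```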
WebSocket
The WebSocket protocol can efficiently handle the conversation history. You can use the WebSocket method to connect to the service and perform one or more rounds of conversation. In the sample code that follows:
Set authorization to the service token.
Set host to the service access endpoint. Replace the http in the endpoint with ws.
Use the use_stream_chat parameter to specify whether the server returns output in streaming mode. The default value is True, which specifies that the server returns streaming data.
The on_open_2 function is used to implement multi-round dialogues.
Sample code:
import os
import time
import json
import struct
from multiprocessing import Process

import websocket

round = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(
            os.getpid(), time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(
            os.getpid(), time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    if message == "<EOS>":
        questions = questions + 1
        if questions == 5:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)


# stream chat validation test
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """
    ws.send(raw_req)
    # end the client-side streaming


# multi-round query validation test
def on_open_2(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['prompt'] = "Hello!"
    # Add the system prompt to the existing parameters instead of
    # replacing the dictionary, so that the settings above are kept.
    params_dict['system_prompt'] = \
        "Act like you are programmer with 5+ years of experience."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please convert the code to Java."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the dialogue above."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# Langchain validation test.
def on_open_3(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""    # the service token
host = "ws://" + ""   # the service access endpoint, with http replaced by ws


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=[
            'Authorization: ' + authorization],
    )
    # setup ping interval to keep long connection.
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))
        p1.start()
        p2.start()
        p3.start()
        p1.join()
        p2.join()
        p3.join()
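Deriving the WebSocket address from the HTTP endpoint is a simple scheme swap. The sketch below shows one way to do it and to build the header list that the sample passes to the client. The helper names to_websocket_url and auth_header are hypothetical, and the endpoint and token are placeholders.

```python
from typing import List


def to_websocket_url(endpoint: str) -> str:
    """Replace a leading http/https scheme with ws/wss."""
    if endpoint.startswith("https://"):
        return "wss://" + endpoint[len("https://"):]
    if endpoint.startswith("http://"):
        return "ws://" + endpoint[len("http://"):]
    raise ValueError("endpoint must start with http:// or https://")


def auth_header(token: str) -> List[str]:
    """Build the header list in the 'Name: value' form used by the sample code."""
    return ["Authorization: " + token]


ws_url = to_websocket_url("http://example.invalid/api/predict/chatllm")  # placeholder endpoint
headers = auth_header("<service token>")                                 # placeholder token
print(ws_url, headers)
```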
What options are supported in the commands?
The following table describes the options that you can configure in the command.
Option | Description | Default value |
--model-path | Specify the preset model name or a custom model path. | meta-llama/Llama-2-7b-chat-hf |
--cpu | Use CPU to perform model inference. | By default, GPU is used for model inference. |
--precision | Specify the precision of the Llama 2 model. Valid values: fp32 and fp16. | The system automatically specifies the precision of the 7B model based on the GPU memory size. |
--port | Specify the listening port of the server. | 8000 |
--api-only | Allows users to access the service only by calling API operations. By default, the service starts both the web UI and API server. | False |
--no-api | Allows users to access the service only by using the web UI. By default, the service starts both the web UI and API server. | False |
--max-new-tokens | The maximum number of output tokens. | 2048 |
--temperature | Specify the randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
--max_round | The number of dialogue rounds supported during inference. | 5 |
--top_k | The number of outputs selected from the generated results. | N/A |
--top_p | The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. | N/A |
--no-template | Models such as Llama 2 and Falcon provide a default prompt template. Specify this option if you want to use your own template instead of the default prompt template. | If this option is not specified, the default prompt template is automatically used. |
--log-level | Specify the log output level. Valid values: DEBUG, INFO, WARNING, and ERROR. | INFO |
--export-history-path | You can use EAS-LLM to export the conversation history. In this case, you must specify an output path to which you want to export the conversation history when you start the service. In most cases, you can specify a mount path of an OSS bucket. EAS exports the records of the conversation that happened over a specific period of time to a file. | By default, this feature is disabled. |
--export-interval | The period of time during which the conversation is recorded. Unit: seconds. | 3600 |
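When you script service updates, it can help to assemble the Command to Run value from a dictionary of options. The following sketch is a hypothetical helper, not part of EAS. It assumes that valued options use the --name=value form shown in the commands in this topic and that switches such as --api-only are passed without a value.

```python
import shlex
from typing import Dict, List, Union

OptionValue = Union[str, int, float, bool, None]


def build_command(options: Dict[str, OptionValue]) -> str:
    """Render options as a command line for webui/webui_server.py."""
    parts: List[str] = ["python", "webui/webui_server.py"]
    for name, value in options.items():
        if value is True:
            parts.append(f"--{name}")          # switch without a value (assumption)
        elif value is False or value is None:
            continue                           # omitted switches keep their defaults
        else:
            parts.append(f"--{name}={value}")  # valued option
    # Quote each part so that unusual paths stay one shell argument.
    return " ".join(shlex.quote(part) for part in parts)


command = build_command({
    "port": 8000,
    "model-path": "meta-llama/Llama-2-13b-chat-hf",
    "precision": "fp16",
    "api-only": False,
})
print(command)
```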
References
For more information about EAS, see EAS overview.
You can also perform vector search in an on-premises knowledge base. For more information, see Use PAI and a vector database to implement intelligent dialogue based on LLMs.