Llama 2 Chat models are suitable for dialogue use cases. Elastic Algorithm Service (EAS) of Platform for AI (PAI) allows you to easily deploy a chat application that is powered by a Llama 2 Chat model and accessible by using a web page. You can also use LangChain to integrate your business data into the application to ensure that the answers meet your business requirements.
Background information
Llama 2 is a series of open-source large language models (LLMs) provided by Meta. The models are available in three parameter scales: 7B, 13B, and 70B. Llama 2 models are trained on 2 trillion tokens, a 40% increase over Llama 1, and support a maximum context length of 4,096 tokens, which is twice that of Llama 1 models. Llama 2 Chat models are fine-tuned versions of the pre-trained Llama 2 models that are tailored to chat scenarios. Techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are used in the fine-tuning process to enhance model safety and alignment with human preferences. The fine-tuning data includes publicly available instruction datasets and more than 1 million human-labeled samples. Llama 2 Chat can be used as a chat assistant in various natural language generation scenarios. This topic describes how to deploy a Llama 2 model as a chat application in EAS and how to use the web application to perform model inference. This topic also provides answers to frequently asked questions about Llama 2 model deployment. In this example, the Llama2-13b-chat model is used.
Deploy a model service in EAS
Go to the EAS-Online Model Services page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.
In the left-side navigation pane, choose Model Deployment>Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.
On the EAS-Online Model Services page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.
On the Deploy Service page, configure the parameters. The following table describes the parameters.
Parameter
Description
Service Name
The name of the service. In this example, chatllm_llama2_13b is specified.
Deployment Method
Select Deploy Web App by Using Image.
Select Image
Click PAI Image, select chat-llm-webui from the drop-down list, and then select 2.0 for the image version.
Note: The image version is updated frequently. We recommend that you select the latest version.
Command to Run
Specify one of the following commands based on your model. In this example, the 13B model is used.
Command to run if you use the 13B model:
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16
Command to run if you use the 7B model:
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf
Set the port number to 8000.
Resource Group Type
Select Public Resource Group.
Resource Configuration Mode
Select General.
Resource Configuration
Click GPU and then select an instance type from the list. In this example, ecs.gn6e-c12g1.3xlarge is selected.
The 13B model requires an instance type that belongs to the gn6e instance family or higher.
For the 7B model, we recommend that you use an A10 or GU30 instance.
Additional System Disk
Set the value to 50.
Click Deploy. The deployment requires approximately 5 minutes to complete.
When the Model Status changes to Running, the service is deployed.
Use the web application to perform model inference
Find the service that you want to manage and click View Web App in the Service Type column.
Perform model inference on the web application.
Enter a prompt in the input text box, such as "Give me a plan for learning the basics of personal finance", and then click Send.
FAQ
How do I use LangChain to integrate my business data into the application?
What is LangChain?
LangChain is an open source framework that allows AI developers to integrate LLMs such as GPT-4 with external data to improve performance and optimize resource utilization.
How does LangChain work?
LangChain splits a document, such as a 20-page PDF file, into smaller chunks, converts the chunks into numerical vectors by using an embedding model such as BAAI General Embedding (BGE) or text2vec, and stores the vectors in a vector store.
LangChain uses the processed data as the local knowledge base of the LLM. During each inference, LangChain searches the knowledge base for text chunks that are similar to the input question, and then passes the retrieved chunks together with the prompt to the LLM to generate a customized answer, as illustrated in the sketch below.
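The following sketch illustrates this retrieval-augmented flow by using the open-source LangChain and FAISS libraries. It is not the exact implementation that the web application uses; the file name, embedding model, and retrieval parameters are assumptions for illustration, and import paths can differ across LangChain versions.
# A minimal retrieval-augmented generation sketch. Assumes that the langchain,
# faiss-cpu, and sentence-transformers packages are installed.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Split the document into smaller chunks.
document_text = open("README.md", encoding="utf-8").read()  # placeholder document
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(document_text)

# 2. Convert the chunks into vectors and store them in a vector store.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")  # example embedding model
vectorstore = FAISS.from_texts(chunks, embeddings)

# 3. For each question, retrieve similar chunks and build an augmented prompt for the LLM.
question = "How do I install the project?"
similar_chunks = vectorstore.similarity_search(question, k=3)
context = "\n".join(chunk.page_content for chunk in similar_chunks)
prompt = f"Answer the question based on the following context.\n{context}\nQuestion: {question}"
# Send the augmented prompt to the deployed LLM service, for example by using
# the API calls that are described later in this topic.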
How do I integrate my business data?
Click the LangChain tab in the upper-left corner of the web application interface.
Upload custom data in the lower-left corner of the web application interface. You can upload files in the following formats: TXT, MD, DOCX, and PDF.
For example, you can drag a README.md file to upload the file and then click Vectorstore knowledge in the lower-left corner. A message is displayed when the custom data is loaded.
In the input box in the lower part of the page, enter a sentence to start a dialogue.
How do I switch to another open source foundation model?
EAS provides several open-source foundation models, such as Llama 2, ChatGLM, and Tongyi Qianwen (Qwen). To deploy a service based on one of these models, perform the following steps.
On the EAS-Online Model Services page, find the service that you want to update and click Update Service in the Actions column.
On the Deploy Service page, modify the Command to Run and Instance Type parameters based on the following table. Then, click Deploy.
Model type | Usage | Command to run | Recommended instance types |
Llama2-13b | API and web application | python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16 | 1 × NVIDIA V100 (gn6e), 2 × GU30, or 2 × NVIDIA A10 |
Llama2-7b | API and web application | python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf | 1 × GU30 or 1 × NVIDIA A10 |
ChatGLM2-6B | API and web application | python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm2-6b | 1 × GU30 or 1 × NVIDIA A10 |
Qwen-7b (Tongyi Qianwen-7b) | API and web application | python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat | 1 × GU30 or 1 × NVIDIA A10 |
ChatGLM-6B | API and web application | python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm-6b | 1 × GU30 or 1 × NVIDIA A10 |
Baichuan-13B | API and web application | python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan-13B-Chat | 1 × NVIDIA V100 (gn6e), 2 × GU30, or 2 × NVIDIA A10 |
Falcon-7B | API and web application | python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct | 1 × GU30 or 1 × NVIDIA A10 |
Baichuan2-7B | API and web application | python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat | 1 × GU30 or 1 × NVIDIA A10 |
Baichuan2-13B | API and web application | python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat | 2 × GU30 or 2 × NVIDIA A10 |
Qwen-14b (Tongyi Qianwen-14b) | API and web application | python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-14B-Chat | 1 × NVIDIA V100 (gn6e), 2 × GU30, or 2 × NVIDIA A10 |
How do I mount a custom model?
You can use Object Storage Service (OSS) to mount a custom model. Procedure:
Upload the model and related configuration files to your OSS bucket. For information about how to create a bucket and upload objects, see Create buckets and Upload objects.
The model files typically include the model weight files, the tokenizer files, and the config.json file. The config.json file must be uploaded, and you must configure it based on the Hugging Face model format, as illustrated in the sketch below. For a sample file, see config.json.
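The exact content of config.json depends on your model. The following minimal sketch only illustrates typical Hugging Face-style fields for a hypothetical Llama-architecture model; use the config.json file that ships with your model instead of authoring one by hand:
{
    "architectures": ["LlamaForCausalLM"],
    "model_type": "llama",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "vocab_size": 32000,
    "torch_dtype": "float16"
}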
Click Update Service in the Actions column of the service.
In the Model Service Information section, configure the following parameters and click Deploy.
Parameter
Description
Model Settings
Click Specify Model Settings.
Select Mount OSS Path in the Model Settings section. Set the OSS bucket path to the path of the custom model files, for example, oss://bucket-test/data-oss/.
Set the Mount Path parameter to /data.
Turn off Enable Read-only Mode to disable the read-only mode.
Command to Run
Add the following options to the command, as shown in the sample command below:
--model-path: Set the value to /data, which is the same as the mount path.
--model-type: the type of the model.
For more information about commands to run for different types of models, see Commands to run.
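For example, if a custom Llama 2 model is mounted at /data, the command to run might look like the following. The llama2 value for the --model-type option is an assumption for illustration; use the value that matches your model type:
python webui/webui_server.py --port=8000 --model-path=/data --model-type=llama2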
How do I use API operations to perform model inference?
Obtain the service access endpoint and token.
Go to the EAS-Online Model Services page. For more information, see the Deploy a model service in EAS section of this topic.
Click the name of the service to go to the Service Details tab.
In the Basic Information section, click Invocation Method. On the Public Endpoint tab, obtain the service token and endpoint.
Perform model inference by calling API operations.
Non-streaming mode
You can send standard HTTP requests to the service by running cURL commands. The following request types are supported:
STRING requests
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
Replace $authorization with the service token. Replace $host with the service endpoint. The file chatllm_data.txt is a plain text file that contains the prompt.
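For example, the chatllm_data.txt file can contain a single line of plain text, such as the prompt that is used earlier in this topic:
Give me a plan for learning the basics of personal finance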
Structured requests
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
Use the chatllm_data.json file to configure inference parameters. The following sample code provides an example of the content format of the chatllm_data.json file:
{ "max_new_tokens": 4096, "use_stream_chat": false, "prompt": "How to install it?", "system_prompt": "Act like you are programmer with 5+ years of experience.", "history": [ [ "Can you tell me what's the bladellm?", "BladeLLM is an framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc. , and supporting popular LLMs like OPT, Bloom, LLaMA, etc." ] ], "temperature": 0.8, "top_k": 10, "top_p": 0.8, "do_sample": true, "use_cache": true }
The following table describes the parameters in the preceding code. Configure the parameters based on your business requirements.
Parameter | Description | Default value |
max_new_tokens | The maximum number of output tokens. | 2048 |
use_stream_chat | Specify whether to return the output tokens in streaming mode. | true |
prompt | The user prompt. | "" |
system_prompt | The system prompt. | "" |
history | The dialogue history. The value is in the List[Tuple(str, str)] format. | [()] |
temperature | The randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
top_k | The number of outputs selected from the generated results. | 30 |
top_p | The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. | 0.8 |
do_sample | Specify whether to enable output sampling. | true |
use_cache | Specify whether to enable the KV cache. | true |
You can also implement your own client based on the Python requests package. Example:
import argparse
import json
from typing import Iterable, List

import requests


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_response(response: requests.Response) -> List[str]:
    # The server returns a JSON response that includes the inference result and dialogue history.
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # System prompts can be included in the requests.
    system_prompt = "Act like you are programmer with \
5+ years of experience."
    # Dialogue history can be included in the requests. The client manages the history
    # to implement multi-round dialogues. In most cases, the information from the previous
    # round of dialogue is used. The history is in the List[Tuple(str, str)] format.
    history = []

    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
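If you save the preceding code to a file, for example, chatllm_client.py (a hypothetical name), you can run the client as follows:
python chatllm_client.py --prompt "Give me a plan for learning the basics of personal finance" --temperature 0.7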
Streaming mode
In streaming mode, the HTTP Server-Sent Events (SSE) method is used. Sample code:
import argparse
import json
from typing import Iterable, List

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
    # Each streamed chunk is a JSON object that contains the partial response and the history.
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = ""
    authorization = ""

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are programmer with \
5+ years of experience."
    history = []

    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(
            f" --- stream line: {h} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
The WebSocket protocol can efficiently handle the dialogue history. You can use the WebSocket method to connect to the service and perform one or more rounds of dialogue. Sample code:
import os
import time
import json
import struct
from multiprocessing import Process

import websocket

round = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
              time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
              time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # End the client-side streaming after all questions are answered.
    if message == "<EOS>":
        questions = questions + 1
        if questions == 5:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # End the client-side streaming.
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)


# Stream chat validation test.
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """
    ws.send(raw_req)
    # End the client-side streaming.


# Multi-round query validation test.
def on_open_2(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['system_prompt'] = "Act like you are programmer with 5+ years of experience."
    params_dict['prompt'] = "Hello!"
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please convert the code to Java."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the preceding dialogue."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# LangChain validation test.
def on_open_3(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""
host = "ws://" + ""


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=[
            'Authorization: ' + authorization],
    )
    # Set a ping interval to keep the long connection alive.
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))
        p1.start()
        p2.start()
        p3.start()
        p1.join()
        p2.join()
        p3.join()
In the preceding code:
Set authorization to the service token.
Set host to the service access endpoint. Replace http in the endpoint with ws.
Use the use_stream_chat parameter to specify whether the output is returned to the client in streaming mode. The default value is True, which specifies that the results are returned in streaming mode.
The on_open_2 function is used to implement a multi-round dialogue.
What options are supported in the commands?
The following table describes the options that you can configure in the command to run.
Option | Description | Default |
--model-path | Specify the preset model name or the path of a custom model. | meta-llama/Llama-2-7b-chat-hf |
--cpu | Use the CPU to perform model inference. | By default, the GPU is used for model inference. |
--precision | Specify the precision of the Llama 2 model. Valid values: fp32 and fp16. | The system automatically specifies the precision of the 7B model based on the GPU memory size. |
--port | Specify the listening port of the server. | 8000 |
--api-only | Allow users to access the service only by calling API operations. By default, the service starts both the web application and the API server. | False |
--no-api | Allow users to access the service only by using the web application. By default, the service starts both the web application and the API server. | False |
--max-new-tokens | The maximum number of output tokens. | 2048 |
--temperature | The randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
--max_round | The maximum number of dialogue rounds supported during inference. | 5 |
--top_k | The number of outputs selected from the generated results. The value must be a positive integer. | None |
--top_p | The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. | None |
--no-template | Models such as Llama 2 and Falcon provide a default prompt template. Specify this option if you want to use your own template instead of the default one. | If you do not specify this option, the default prompt template is automatically used. |
--log-level | The log output level. Valid values: DEBUG, INFO, WARNING, and ERROR. | INFO |
--export-history-path | Exports the conversation history. If you specify this option, you must specify an output path when you start the service. In most cases, a mount path of an OSS bucket is used. EAS exports the dialogue records within a specific period of time to a file in this path. | By default, this feature is disabled. |
--export-interval | The period of time during which conversations are recorded. Unit: seconds. For example, if you set --export-interval to 3600, the conversation records of the previous hour are exported to one file. | 3600 |
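For example, the following command combines several of the options in the preceding table to start the 7B model with half precision in API-only mode and with verbose logging. The specific option values are only an illustration:
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf --precision=fp16 --api-only --max-new-tokens=1024 --temperature=0.7 --log-level=DEBUG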
References
For more information about EAS, see EAS overview.
You can also perform vector searches in a local knowledge base. For more information, see Use PAI and a vector database to implement intelligent dialogue based on LLMs.