Llama 2 Chat models are suitable for dialogue use cases. Elastic Algorithm Service (EAS) of Platform for AI (PAI) allows you to easily deploy a chat application that is powered by a Llama 2 Chat model and accessible by using a web page. You can also use LangChain to integrate your business data into the application to ensure that the answers meet your business requirements.
Background information
Llama 2 is a series of open-source large language models (LLMs) provided by Meta. The models are available in three parameter scales: 7B, 13B, and 70B. Llama 2 models are trained on 2 trillion tokens, a 40% increase over Llama 1, and support a maximum context length of 4,096 tokens, which is twice that of Llama 1 models. Llama 2 Chat models are fine-tuned versions of the pre-trained Llama 2 models that are tailored to chat scenarios. Techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are used in the fine-tuning process to enhance model safety and alignment with human preferences. The fine-tuning data includes publicly available instruction datasets and more than 1 million human-labeled samples. Llama 2 Chat can be used as a chat assistant in various natural language generation scenarios. This topic describes how to deploy a Llama 2 model as a chat application in EAS and how to use the web application to perform model inference. This topic also provides answers to frequently asked questions about Llama 2 model deployment. In this example, the Llama2-13b-chat model is used.
Deploy a model service in EAS
Go to the EAS-Online Model Services page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.
In the left-side navigation pane, choose Model Deployment>Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.
On the EAS-Online Model Services page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.
On the Deploy Service page, configure the parameters. The following table describes the parameters.
Parameter
Description
Service Name
The name of the service. In this example, chatllm_llama2_13b is specified.
Deployment Method
Select Deploy Web App by Using Image.
Select Image
Click PAI Image, select chat-llm-webui from the drop-down list, and then select 2.0 for the image version.
Note: The image version is updated frequently. We recommend that you select the latest version.
Command to Run
Specify one of the following commands based on your model. In this example, the 13B model is used.
Command to run if you use the 13B model:
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16
Command to run if you use the 7B model:
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf
Set the port number to 8000.
Resource Group Type
Select Public Resource Group.
Resource Configuration Mode
Select General.
Resource Configuration
Click GPU and then select an instance type from the list. In this example, ecs.gn6e-c12g1.3xlarge is selected.
The 13B model requires an instance type that belongs to the gn6e instance family or higher.
For the 7B model, we recommend that you use an A10 or GU30 instance.
Additional System Disk
Set the value to 50.
Click Deploy. The deployment requires approximately 5 minutes to complete.
When the Model Status changes to Running, the service is deployed.
Use the web application to perform model inference
Find the service that you want to manage and click View Web App in the Service Type column.
Perform model inference on the web application.
Enter a prompt in the input text box, such as "Give me a plan for learning the basics of personal finance", and then click Send.
FAQ
How do I use LangChain to integrate my business data into the application?
What is LangChain?
LangChain is an open source framework that allows AI developers to integrate LLMs such as GPT-4 with external data to improve performance and optimize resource utilization.
How does LangChain work?
LangChain splits a document, such as a 20-page PDF file, into smaller chunks, converts the chunks into numerical vectors by using an embedding model such as BAAI General Embedding (BGE) or text2vec, and stores the vectors in a vector store.
LangChain uses the processed data as the local knowledge base of the LLM. During each inference, LangChain searches the knowledge base for text chunks that are similar to the input question, and then passes the retrieved chunks together with the prompt to the LLM to generate a customized answer, as illustrated in the sketch below.
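The following sketch illustrates this retrieval-augmented flow by using the open-source LangChain and FAISS libraries. It is not the exact implementation that the web application uses; the file name, embedding model, and retrieval parameters are assumptions for illustration, and import paths can differ across LangChain versions.
# A minimal retrieval-augmented generation sketch. Assumes that the langchain,
# faiss-cpu, and sentence-transformers packages are installed.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Split the document into smaller chunks.
document_text = open("README.md", encoding="utf-8").read()  # placeholder document
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(document_text)

# 2. Convert the chunks into vectors and store them in a vector store.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")  # example embedding model
vectorstore = FAISS.from_texts(chunks, embeddings)

# 3. For each question, retrieve similar chunks and build an augmented prompt for the LLM.
question = "How do I install the project?"
similar_chunks = vectorstore.similarity_search(question, k=3)
context = "\n".join(chunk.page_content for chunk in similar_chunks)
prompt = f"Answer the question based on the following context.\n{context}\nQuestion: {question}"
# Send the augmented prompt to the deployed LLM service, for example by using
# the API calls that are described later in this topic.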
How do I integrate my business data?
Click the LangChain tab in the upper-left corner of the web application interface.
Upload custom data in the lower-left corner of the web application interface. You can upload files in the following formats: TXT, MD, DOCX, and PDF.
For example, you can drag a README.md file to upload the file and then click Vectorstore knowledge in the lower-left corner. A message is displayed when the custom data is loaded.
In the input box in the lower part of the page, enter a sentence to start a dialogue.
How do I switch to another open source foundation model?
EAS provides several open-source foundation models, such as Llama 2, ChatGLM, and Tongyi Qianwen (Qwen). To deploy a service based on one of these models, perform the following steps.
On the EAS-Online Model Services page, find the service that you want to update and click Update Service in the Actions column.
On the Deploy Service page, modify the Command to Run and Instance Type parameters based on the following table. Then, click Deploy.
Model type | Usage | Command to run | Recommended instance types |
Llama2-13b | API and web application | python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16 | 1 × NVIDIA V100 (gn6e), 2 × GU30, or 2 × NVIDIA A10 |
Llama2-7b | API and web application | python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf | 1 × GU30 or 1 × NVIDIA A10 |
ChatGLM2-6B | API and web application | python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm2-6b | 1 × GU30 or 1 × NVIDIA A10 |
Qwen-7b (Tongyi Qianwen-7b) | API and web application | python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat | 1 × GU30 or 1 × NVIDIA A10 |
ChatGLM-6B | API and web application | python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm-6b | 1 × GU30 or 1 × NVIDIA A10 |
Baichuan-13B | API and web application | python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan-13B-Chat | 1 × NVIDIA V100 (gn6e), 2 × GU30, or 2 × NVIDIA A10 |
Falcon-7B | API and web application | python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct | 1 × GU30 or 1 × NVIDIA A10 |
Baichuan2-7B | API and web application | python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat | 1 × GU30 or 1 × NVIDIA A10 |
Baichuan2-13B | API and web application | python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat | 2 × GU30 or 2 × NVIDIA A10 |
Qwen-14b (Tongyi Qianwen-14b) | API and web application | python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-14B-Chat | 1 × NVIDIA V100 (gn6e), 2 × GU30, or 2 × NVIDIA A10 |
How do I mount a custom model?
You can use Object Storage Service (OSS) to mount a custom model. Procedure:
Upload the model and related configuration files to your OSS bucket. For information about how to create a bucket and upload objects, see Create buckets and Upload objects.
The model files typically include the model weight files, the tokenizer files, and the config.json file. The config.json file must be uploaded, and you must configure it based on the Hugging Face model format, as illustrated in the sketch below. For a sample file, see config.json.
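The exact content of config.json depends on your model. The following minimal sketch only illustrates typical Hugging Face-style fields for a hypothetical Llama-architecture model; use the config.json file that ships with your model instead of authoring one by hand:
{
    "architectures": ["LlamaForCausalLM"],
    "model_type": "llama",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "vocab_size": 32000,
    "torch_dtype": "float16"
}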
Click Update Service in the Actions column of the service.
In the Model Service Information section, configure the following parameters and click Deploy.
Parameter
Description
Model Settings
Click Specify Model Settings.
Select Mount OSS Path in the Model Settings section. Set the OSS bucket path to the path of the custom model files, for example, oss://bucket-test/data-oss/.
Set the Mount Path parameter to /data.
Turn off Enable Read-only Mode to disable the read-only mode.
Command to Run
Add the following options to the command, as shown in the sample command below:
--model-path: Set the value to /data, which is the same as the mount path.
--model-type: the type of the model.
For more information about commands to run for different types of models, see Commands to run.
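For example, if a custom Llama 2 model is mounted at /data, the command to run might look like the following. The llama2 value for the --model-type option is an assumption for illustration; use the value that matches your model type:
python webui/webui_server.py --port=8000 --model-path=/data --model-type=llama2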
How do I use API operations to perform model inference?
Obtain the service access endpoint and token.
Go to the EAS-Online Model Services page. For more information, see the Deploy a model service in EAS section of this topic.
Click the name of the service to go to the Service Details tab.
In the Basic Information section, click Invocation Method. On the Public Endpoint tab, obtain the service token and endpoint.
Perform model inference by calling API operations.
Non-streaming mode
You can send standard HTTP requests to the service by running cURL commands. The following request types are supported:
STRING requests
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
Replace $authorization with the service token. Replace $host with the service endpoint. The file chatllm_data.txt is a plain text file that contains the prompt.
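For example, the chatllm_data.txt file can contain a single line of plain text, such as the prompt that is used earlier in this topic:
Give me a plan for learning the basics of personal finance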
Structured requests
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
Use the chatllm_data.json file to configure inference parameters. The following sample code provides an example of the content format of the chatllm_data.json file:
{ "max_new_tokens": 4096, "use_stream_chat": false, "prompt": "How to install it?", "system_prompt": "Act like you are programmer with 5+ years of experience.", "history": [ [ "Can you tell me what's the bladellm?", "BladeLLM is an framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc. , and supporting popular LLMs like OPT, Bloom, LLaMA, etc." ] ], "temperature": 0.8, "top_k": 10, "top_p": 0.8, "do_sample": true, "use_cache": true }
The following table describes the parameters in the preceding code. Configure the parameters based on your business requirements.
Parameter | Description | Default value |
max_new_tokens | The maximum number of output tokens. | 2048 |
use_stream_chat | Specify whether to return the output tokens in streaming mode. | true |
prompt | The user prompt. | "" |
system_prompt | The system prompt. | "" |
history | The dialogue history. The value is in the List[Tuple(str, str)] format. | [()] |
temperature | The randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
top_k | The number of outputs selected from the generated results. | 30 |
top_p | The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. | 0.8 |
do_sample | Specify whether to enable output sampling. | true |
use_cache | Specify whether to enable the KV cache. | true |
You can also implement your own client based on the Python requests package. Example:
import argparse
import json
from typing import Iterable, List

import requests


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_response(response: requests.Response) -> List[str]:
    # The server returns a JSON response that includes the inference result and dialogue history.
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # System prompts can be included in the requests.
    system_prompt = "Act like you are programmer with \
5+ years of experience."
    # Dialogue history can be included in the requests. The client manages the history
    # to implement multi-round dialogues. In most cases, the information from the previous
    # round of dialogue is used. The history is in the List[Tuple(str, str)] format.
    history = []

    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
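If you save the preceding code to a file, for example, chatllm_client.py (a hypothetical name), you can run the client as follows:
python chatllm_client.py --prompt "Give me a plan for learning the basics of personal finance" --temperature 0.7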
Streaming mode
In streaming mode, the HTTP Server-Sent Events (SSE) method is used. Sample code:
import argparse
import json
from typing import Iterable, List

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
    # Each streamed chunk is a JSON object that contains the partial response and the history.
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = ""
    authorization = ""

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are programmer with \
5+ years of experience."
    history = []

    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(
            f" --- stream line: {h} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
The WebSocket protocol can efficiently handle the dialogue history. You can use the WebSocket method to connect to the service and perform one or more rounds of dialogue. Sample code:
import os
import time
import json
import struct
from multiprocessing import Process

import websocket

round = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
              time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
              time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # End the client-side streaming after all questions are answered.
    if message == "<EOS>":
        questions = questions + 1
        if questions == 5:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # End the client-side streaming.
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)


# Stream chat validation test.
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """
    ws.send(raw_req)
    # End the client-side streaming.


# Multi-round query validation test.
def on_open_2(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['system_prompt'] = "Act like you are programmer with 5+ years of experience."
    params_dict['prompt'] = "Hello!"
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please convert the code to Java."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the preceding dialogue."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# LangChain validation test.
def on_open_3(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""
host = "ws://" + ""


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=[
            'Authorization: ' + authorization],
    )
    # Set a ping interval to keep the long connection alive.
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))
        p1.start()
        p2.start()
        p3.start()
        p1.join()
        p2.join()
        p3.join()
In the preceding code:
Set authorization to the service token.
Set host to the service access endpoint. Replace http in the endpoint with ws.
Use the use_stream_chat parameter to specify whether the output is returned to the client in streaming mode. The default value is True, which specifies that the results are returned in streaming mode.
The on_open_2 function is used to implement a multi-round dialogue.
What options are supported in the commands?
The following table describes the options that you can configure in the command to run.
Option | Description | Default |
--model-path | Specify the preset model name or the path of a custom model. | meta-llama/Llama-2-7b-chat-hf |
--cpu | Use the CPU to perform model inference. | By default, the GPU is used for model inference. |
--precision | Specify the precision of the Llama 2 model. Valid values: fp32 and fp16. | The system automatically specifies the precision of the 7B model based on the GPU memory size. |
--port | Specify the listening port of the server. | 8000 |
--api-only | Allow users to access the service only by calling API operations. By default, the service starts both the web application and the API server. | False |
--no-api | Allow users to access the service only by using the web application. By default, the service starts both the web application and the API server. | False |
--max-new-tokens | The maximum number of output tokens. | 2048 |
--temperature | The randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
--max_round | The maximum number of dialogue rounds supported during inference. | 5 |
--top_k | The number of outputs selected from the generated results. The value must be a positive integer. | None |
--top_p | The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. | None |
--no-template | Models such as Llama 2 and Falcon provide a default prompt template. Specify this option if you want to use your own template instead of the default one. | If you do not specify this option, the default prompt template is automatically used. |
--log-level | The log output level. Valid values: DEBUG, INFO, WARNING, and ERROR. | INFO |
--export-history-path | Exports the conversation history. If you specify this option, you must specify an output path when you start the service. In most cases, a mount path of an OSS bucket is used. EAS exports the dialogue records within a specific period of time to a file in this path. | By default, this feature is disabled. |
--export-interval | The period of time during which conversations are recorded. Unit: seconds. For example, if you set --export-interval to 3600, the conversation records of the previous hour are exported to one file. | 3600 |
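For example, the following command combines several of the options in the preceding table to start the 7B model with half precision in API-only mode and with verbose logging. The specific option values are only an illustration:
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf --precision=fp16 --api-only --max-new-tokens=1024 --temperature=0.7 --log-level=DEBUG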
References
For more information about EAS, see EAS overview.
You can also perform vector searches in a local knowledge base. For more information, see Use PAI and a vector database to implement intelligent dialogue based on LLMs.