Deploy a Llama 2-based chat application in EAS - Platform for AI

Llama 2 Chat models are suitable for dialogue use cases. Elastic Algorithm Service (EAS) of Platform for AI (PAI) allows you to easily deploy a chat application that is powered by a Llama 2 Chat model and accessible by using a web page. You can also use LangChain to integrate your business data into the application to ensure that the answers are aligned with your business requirements.

Background information

Llama 2 is a series of open-source large language models (LLMs) provided by Meta. The models are available in three parameter scales: 7B, 13B, and 70B. Llama 2 models were trained on 2 trillion tokens, which is a 40% increase over Llama 1. Llama 2 models support a maximum sequence length of 4,096 tokens, which is twice that of Llama 1 models. Llama 2 Chat models are fine-tuned versions of pre-trained Llama 2 models that cater to chat scenarios. Techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are used in the fine-tuning process to improve the safety of the models and their alignment with human preferences. The fine-tuning data includes publicly available instruction datasets and over 1 million human-labeled samples. Llama 2 Chat can be used as a chat assistant in various natural language generation scenarios.

This topic describes how to deploy a Llama 2 model as a chat application in EAS and use the web UI to perform model inference. This topic also provides answers to frequently asked questions about Llama 2 model deployment. In this example, the Llama2-13b-chat model is used.

Deploy a model service in EAS

Go to the EAS-Online Model Services page.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.
3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.
On the EAS-Online Model Services page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.

On the Create Service page, configure the parameters. The following table describes the parameters.

Parameter	Description
Service Name	The name of the service. In this example, chatllm_llama2_13b is specified.
Deployment Method	Select Deploy Web App by Using Image.
Select Image	Click PAI Image, select chat-llm-webui from the drop-down list, and then select 2.0 for the image version. Note: If a version later than 2.0 is available when you deploy the service, select the latest version.
Command to Run	Specify one of the following commands based on your model. In this example, the 13B model is used. Command to run if you use the 13B model: `python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16`. Command to run if you use the 7B model: `python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf`. Set the port number to 8000.
Resource Group Type	Select Public Resource Group.
Resource Configuration Mode	Select General.
Resource Configuration	Click GPU and then select an instance type from the list. In this example, ecs.gn6e-c12g1.3xlarge is selected. The 13B model requires an instance type that belongs to the gn6e instance family or higher. For the 7B model, an A10 or GU30 instance is recommended.
Additional System Disk	Set the value to 50.
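
As a rough guide to why the 13B model calls for a larger GPU than the 7B model, the size of the weights alone can be estimated from the parameter count and the precision. The following sketch is back-of-the-envelope arithmetic rather than an official sizing rule: it uses the nominal parameter counts (7B and 13B), ignores the KV cache, activations, and framework overhead, and the function name is illustrative.

```python
# Rough lower bound on the GPU memory needed just to hold the model weights.
# Assumption: nominal parameter counts; real memory usage is higher.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the size of the model weights in GiB for the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("13B", 13e9)]:
        for precision in ("fp16", "fp32"):
            size = weight_memory_gib(params, precision)
            print(f"{name} {precision}: {size:.1f} GiB")
```

Treat the result as a minimum: the instance that you select needs additional GPU memory for inference beyond the weights.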


Click Deploy. The deployment requires approximately five minutes to complete.
When the Model Status changes to Running, the service is deployed.

Use the web application to perform model inference

Find the service that you want to manage and click View Web App in the Service Type column.
Perform model inference on the web application.
Enter a prompt in the input text box, such as Give me a plan for learning the basics of personal finance. Click Send.

FAQ

How do I use LangChain to integrate my business data into the application?

What is LangChain?
LangChain is an open source framework that allows AI developers to integrate LLMs such as GPT-4 with external data to improve performance and optimize resource utilization.
How does LangChain work?
LangChain divides a document, such as a 20-page PDF file, into smaller chunks, converts the chunks into numerical vectors by using embedding models such as BAAI General Embedding (BGE) or text2vec, and then stores the vectors in a vector store.
LangChain processes the uploaded data and stores it locally as the knowledge base of the LLM. For each inference request, LangChain searches the local knowledge base for content that is similar to the input question, and then passes the retrieved content together with the user input to the LLM to generate a custom answer.
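
The chunk, embed, retrieve, and prompt flow described above can be illustrated with a toy example. This is a minimal sketch that uses only the Python standard library and does not call LangChain: the bag-of-words `embed` function stands in for a real embedding model such as BGE, the in-memory list stands in for a vector store, and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_into_chunks(text: str, chunk_size: int = 12) -> List[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, store: List[Tuple[str, Counter]], k: int = 1) -> List[str]:
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Combine the retrieved chunks and the user question into one LLM prompt."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    chunks = [
        "EAS deploys models as online services.",
        "The web UI listens on port 8000.",
        "OSS buckets hold custom model files.",
    ]
    store = [(chunk, embed(chunk)) for chunk in chunks]
    question = "Which port does the web UI use?"
    print(build_prompt(question, retrieve(question, store)))
```

In a real deployment, the embedding model and vector store replace the toy functions, but the order of the steps stays the same.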
How to integrate your business data
1. Click the LangChain tab in the upper-left corner of the page.
2. Upload custom data in the lower-left corner of the web page. You can upload files in the following formats: TXT, MD, DOCX, and PDF.
  For example, you can drag and drop a README.md file to upload the file and then click Vectorstore knowledge in the lower-left corner. A message is displayed when the custom data is loaded.
3. In the input box in the lower part of the page, enter a sentence to start a dialogue.

How do I switch to another open source foundation model?

EAS comes with several open source foundation models, such as Llama 2, ChatGLM, and Tongyi Qianwen. Perform the following steps to switch to another model and deploy it as a service.

On the Elastic Algorithm Service (EAS) page, find the service that you want to update. Click Update Service in the Actions column of the service.

On the Update Service page, modify the Command to Run and Instance Type parameters based on the following table. Then, click Update.

Model type	Use method	Command to Run	Recommended specification
Llama2-13b	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16`	1 * NVIDIA V100 (gn6e) 2 * GU30 2 * NVIDIA A10
Llama2-7b	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf`	1 * GU30 1 * NVIDIA A10
ChatGLM2-6B	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm2-6b`	1 * GU30 1 * NVIDIA A10
Qwen-7b (Tongyi Qianwen-7b)	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat`	1 * GU30 1 * NVIDIA A10
ChatGLM-6B	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm-6b`	1 * GU30 1 * NVIDIA A10
Baichuan-13B	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan-13B-Chat`	1 * NVIDIA V100 (gn6e) 2 * GU30 2 * NVIDIA A10
Falcon-7B	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct`	1 * GU30 1 * NVIDIA A10
Baichuan2-7B	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat`	1 * GU30 1 * NVIDIA A10
Baichuan2-13B	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat`	2 * GU30 2 * NVIDIA A10
Qwen-14b (Tongyi Qianwen-14b)	API+WebUI	`python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-14B-Chat`	1 * NVIDIA V100 (gn6e) 2 * GU30 2 * NVIDIA A10
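
If you script service updates, a small helper can assemble the Command to Run value so that the options stay consistent. The following sketch only builds the command string from options that appear in this topic (`--port`, `--model-path`, `--model-type`, and `--precision`). The helper name and its validation are assumptions and are not part of EAS.

```python
from typing import Optional


def build_run_command(model_path: str,
                      port: int = 8000,
                      model_type: Optional[str] = None,
                      precision: Optional[str] = None) -> str:
    """Assemble the webui_server.py command line as a single string."""
    parts = [
        "python", "webui/webui_server.py",
        f"--port={port}",
        f"--model-path={model_path}",
    ]
    if model_type is not None:
        # Used when a custom model is mounted, for example --model-type=llama2.
        parts.append(f"--model-type={model_type}")
    if precision is not None:
        if precision not in ("fp16", "fp32"):
            raise ValueError("precision must be fp16 or fp32")
        parts.append(f"--precision={precision}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_run_command("meta-llama/Llama-2-13b-chat-hf", precision="fp16"))
    print(build_run_command("Qwen/Qwen-7B-Chat"))
```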

How do I deploy a custom model?

You can use Object Storage Service (OSS) to mount a custom model. Procedure:

Upload the model and related configuration files to your OSS bucket. For information about how to create a bucket and upload objects, see Get started by using the OSS console.
The following figure shows a sample of the model files that you need to prepare:
The config.json file must be uploaded. You must configure the config.json file based on the Huggingface model format. For more information about the sample file, see config.json.
Click Update Service in the Actions column of the service.

In the Model Service Information section, configure the following parameters and click Update.

Parameter	Description
Model Settings	Click Specify Model Settings. Select Mount OSS Path in the Model Settings section. Set the OSS bucket path to the path of the custom model files. Example: oss://bucket-test/data-oss/. Set Mount Path to /data. Turn off Enable Read-only Mode to disable the read-only mode.
Command to Run	Add the following options to the command: --model-path: Set the value to /data, which is the same as the mount path. --model-type: the type of the model. For more information about commands to run for different types of models, see Command to run.

Command to run

Model type	Use method	Command to Run
Llama2	API+WebUI	python webui/webui_server.py --port=8000 --model-path=/data --model-type=llama2 --precision=fp16
ChatGLM2	API+WebUI	python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm2
Qwen (Tongyi Qianwen)	API+WebUI	python webui/webui_server.py --port=8000 --model-path=/data --model-type=qwen
ChatGLM	API+WebUI	python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm
Falcon-7B	API+WebUI	python webui/webui_server.py --port=8000 --model-path=/data --model-type=falcon
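
Because the config.json file is required, it can help to check the local model directory before you upload it to OSS. The following sketch is a hypothetical pre-upload check and not an EAS tool: it verifies that config.json exists and parses as a JSON object. Hugging Face configurations usually carry a `model_type` field, which the example prints if it is present; treat that field as an assumption about your model.

```python
import json
import tempfile
from pathlib import Path


def load_model_config(model_dir: str) -> dict:
    """Load config.json from a local model directory, or raise if it is unusable."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"{config_path} is missing")
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    if not isinstance(config, dict):
        raise ValueError("config.json must contain a JSON object")
    return config


if __name__ == "__main__":
    # Demonstrate the check against a temporary directory with a sample config.
    with tempfile.TemporaryDirectory() as tmp:
        sample = {"model_type": "llama"}
        (Path(tmp) / "config.json").write_text(json.dumps(sample), encoding="utf-8")
        config = load_model_config(tmp)
        print("model_type:", config.get("model_type", "<not set>"))
```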

How do I use API operations to perform model inference?

Obtain the service access endpoint and token.
1. Go to the Elastic Algorithm Service (EAS) page. For more information, see the Deploy a model service in EAS section in this topic.
2. Click the name of the service to go to the Service Details tab.
3. In the Basic Information section, click Invocation Method. On the Public Endpoint tab, obtain the service token and endpoint.
Perform model inference by calling API operations.

HTTP

Non-streaming mode

You can run cURL commands to send the following types of standard HTTP requests.

STRING requests
```
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
```
Replace $authorization with the service token. Replace $host with the service endpoint. The file chatllm_data.txt is a plain text file that contains the prompt.
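
The same STRING request can be built in Python without third-party packages. The following sketch mirrors the cURL command above by using `urllib.request` from the standard library: the prompt is sent as the raw request body and the token is passed in the Authorization header. The helper name is illustrative, and the endpoint and token in the example are placeholders.

```python
import urllib.request


def build_string_request(host: str, token: str, prompt: str) -> urllib.request.Request:
    """Build a POST request whose body is the plain-text prompt."""
    return urllib.request.Request(
        host,
        data=prompt.encode("utf-8"),
        headers={"Authorization": token},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholders: replace with your service endpoint and token.
    request = build_string_request(
        "http://example.invalid/", "<service token>", "Hello!")
    print(request.get_method(), request.full_url, request.data)
    # To send the request against a real endpoint:
    # with urllib.request.urlopen(request, timeout=60) as response:
    #     print(response.read().decode("utf-8"))
```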

Structured requests

curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"

Use the chatllm_data.json file to configure inference parameters. The following sample code provides an example of the content format of the chatllm_data.json file:

{
    "max_new_tokens": 4096,
    "use_stream_chat": false,
    "prompt": "How to install it?",
    "system_prompt": "Act like you are a programmer with 5+ years of experience.",
    "history": [
        [
            "Can you tell me what's the bladellm?",
            "BladeLLM is a framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc., and supporting popular LLMs like OPT, Bloom, LLaMA, etc."
        ]
    ],
    "temperature": 0.8,
    "top_k": 10,
    "top_p": 0.8,
    "do_sample": true,
    "use_cache": true
}

The following table describes the parameters in the preceding code. Configure the parameters based on your business requirements.

Parameter	Description	Default value
max_new_tokens	The maximum number of output tokens.	2048
use_stream_chat	Specify whether to return the output tokens in streaming mode.	True
prompt	The user prompt.	""
system_prompt	The system prompt.	""
history	The dialogue history. The value is in the List[Tuple(str, str)] format.	[()]
temperature	Specify the randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1.	0.95
top_k	The number of outputs selected from the generated results.	30
top_p	The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1.	0.8
do_sample	Specify whether to enable output sampling.	True
use_cache	Specify whether to enable KV cache.	True
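
To avoid hand-editing JSON, you can generate the request body from Python. The following sketch starts from the default values listed in the preceding table, applies your overrides, and serializes the result with `json.dumps`, which emits the lowercase `true` and `false` literals that JSON requires. The helper name, the range checks, and the use of an empty list as the default history are assumptions made for this example.

```python
import copy
import json

# Default values taken from the parameter table; history defaults to an empty list here.
DEFAULTS = {
    "max_new_tokens": 2048,
    "use_stream_chat": True,
    "prompt": "",
    "system_prompt": "",
    "history": [],
    "temperature": 0.95,
    "top_k": 30,
    "top_p": 0.8,
    "do_sample": True,
    "use_cache": True,
}


def build_payload(prompt: str, **overrides) -> dict:
    """Merge the defaults with the prompt and any overridden parameters."""
    payload = {**copy.deepcopy(DEFAULTS), "prompt": prompt, **overrides}
    for name in ("temperature", "top_p"):
        if not 0 <= payload[name] <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    return payload


if __name__ == "__main__":
    payload = build_payload(
        "How to install it?", use_stream_chat=False, temperature=0.8, top_k=10)
    # Write the body to the file that the cURL command reads.
    with open("chatllm_data.json", "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=4)
    print(json.dumps(payload, indent=4))
```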

You can also implement your own client based on the Python requests package. Example:

import argparse
import json
from typing import Iterable, List

import requests

def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
                It is known for its iconic landmarks, such as the Golden Gate Bridge \
                and Alcatraz Island, as well as its vibrant culture, diverse population, \
                and tech industry. The city is also home to many famous companies and \
                startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response

# The server returns a JSON response that includes the inference result and dialogue history.
def get_response(response: requests.Response) -> tuple:
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")

    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # System prompts can be included in the requests. 
    system_prompt = "Act like you are a programmer with 5+ years of experience."

    # Dialogue history can be included in the requests. The client manages the history to implement multi-round dialogues. In most cases, the previous round of dialogue information is used. The information is in the List[Tuple(str, str)] format. 
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)
    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)


In the preceding code:

Set host to the service access endpoint.
Set authorization to the service token.
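
Because the client manages the dialogue history, a multi-round dialogue amounts to passing the history that is returned by one request into the next request. The following sketch shows that loop in isolation. The `send` callable is a hypothetical transport: `fake_send` stands in for the real HTTP call so that the example runs offline, and it assumes that the service returns the updated history in the `history` field, as the sample client above reads it.

```python
from typing import Callable, List, Tuple


def chat_round(send: Callable[[dict], dict],
               prompt: str,
               history: List) -> Tuple[str, List]:
    """Send one prompt together with the history and return the answer and new history."""
    payload = {"prompt": prompt, "history": history, "use_stream_chat": False}
    data = send(payload)
    return data["response"], data["history"]


def fake_send(payload: dict) -> dict:
    """Offline stand-in for the service: echoes the prompt and extends the history."""
    answer = f"echo: {payload['prompt']}"
    history = [list(pair) for pair in payload["history"]]
    history.append([payload["prompt"], answer])
    return {"response": answer, "history": history}


if __name__ == "__main__":
    history = []
    for prompt in ["Write a sorting algorithm in Python.",
                   "Convert the code to Java."]:
        output, history = chat_round(fake_send, prompt, history)
        print(f"{output!r} (rounds so far: {len(history)})")
```

To use the loop against a deployed service, replace `fake_send` with a function that posts the payload to the endpoint and returns the parsed JSON response.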

Streaming mode

The streaming mode uses the HTTP SSE method. Sample code:

import argparse
import json
from typing import Iterable, List

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
                It is known for its iconic landmarks, such as the Golden Gate Bridge \
                and Alcatraz Island, as well as its vibrant culture, diverse population, \
                and tech industry. The city is also home to many famous companies and \
                startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers,
                             json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are a programmer with 5+ years of experience."
    history = []
    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(
            f" --- stream line: {h} \n --- history: {history}", flush=True)

In the preceding code:

Set host to the service access endpoint.
Set authorization to the service token.
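
The sample client above splits the response stream on the `\0` delimiter and decodes each piece as JSON. The following sketch reproduces that framing logic on plain byte chunks so that it can be tried without a running service. The function name and the sample chunks are made up for illustration, and the frame format is assumed to match what `get_streaming_response` expects.

```python
import json
from typing import Iterable, Iterator


def iter_json_frames(chunks: Iterable[bytes],
                     delimiter: bytes = b"\0") -> Iterator[dict]:
    """Reassemble delimiter-separated JSON frames from arbitrary byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A network chunk may contain zero, one, or several complete frames.
        while delimiter in buffer:
            frame, buffer = buffer.split(delimiter, 1)
            if frame:
                yield json.loads(frame.decode("utf-8"))
    # Handle a final frame that is not followed by a delimiter.
    if buffer.strip():
        yield json.loads(buffer.decode("utf-8"))


if __name__ == "__main__":
    # Frames deliberately split at awkward positions, as a network might do.
    chunks = [
        b'{"response": "He',
        b'llo", "history": []}\x00{"resp',
        b'onse": "Hello!", "history": []}\x00',
    ]
    for frame in iter_json_frames(chunks):
        print(frame["response"])
```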

WebSocket

The WebSocket protocol can efficiently handle the conversation history. You can use the WebSocket method to connect to the service and perform one or more rounds of conversation. Sample code:

import os
import time
import json
import struct
from multiprocessing import Process

import websocket

round = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
              time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
              time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    if message == "<EOS>":
        questions = questions + 1
        if questions == 5:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)

# stream chat validation test
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """

    ws.send(raw_req)
    # end the client-side streaming


# multi-round query validation test
def on_open_2(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['prompt'] = "Hello!"
    # Add the system prompt without discarding the parameters that are set above.
    params_dict['system_prompt'] = (
        "Act like you are a programmer with 5+ years of experience.")
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please convert the code to Java."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the dialogue above."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# Langchain validation test.
def on_open_3(ws):
    global round
    print('Opening Websocket connection to the server ... ')

    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""
host = "ws://" + ""


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=[
            'Authorization: ' + authorization],
    )

    # setup ping interval to keep long connection.
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))

        p1.start()
        p2.start()
        p3.start()

        p1.join()
        p2.join()
        p3.join()

In the preceding code:

Set authorization to the service token.
Set host to the service access endpoint. Replace the http in the endpoint with ws.
Use the use_stream_chat parameter to specify whether the server returns output in streaming mode. The default value is True, which specifies that the server returns streaming data.
The on_open_2 function in the preceding code is used to implement multi-round dialogues.
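
The endpoint conversion described above can be done with a small helper instead of manual string editing. The following sketch rewrites only the URL scheme by using `urllib.parse`. Mapping `https` to `wss` is an assumption that follows from replacing `http` with `ws`, and the function name and example endpoint are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(endpoint: str) -> str:
    """Convert an http(s) service endpoint into the matching ws(s) URL."""
    parts = urlsplit(endpoint)
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return urlunsplit(
        (scheme, parts.netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Placeholder endpoint for illustration only.
    print(to_ws_url("http://example.com/api/predict/chatllm_llama2_13b"))
```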

What options are supported in the commands?

The following table describes the options that you can configure in the command.

Option	Description	Default value
--model-path	Specify the preset model name or a custom model path. Example 1: Load a preset model. You can use a preset model in the meta-llama/Llama-2-* series, including Llama-2-7b-hf, Llama-2-7b-chat-hf, Llama-2-13b-hf, and Llama-2-13b-chat-hf. Example: `python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf`. Example 2: Load an on-premises custom model. Example: `python webui/webui_server.py --port=8000 --model-path=/llama2-7b-chat`.	meta-llama/Llama-2-7b-chat-hf
--cpu	Use CPU to perform model inference. Example: `python webui/webui_server.py --port=8000 --cpu`.	By default, GPU is used for model inference.
--precision	Specify the precision of the Llama2 model. Valid values: fp32 and fp16. Example: `python webui/webui_server.py --port=8000 --precision=fp32`.	The system automatically specifies the precision of the 7B model based on the GPU memory size.
--port	Specify the listening port of the server. Sample code: `python webui/webui_server.py --port=8000`.	8000
--api-only	Allows users to access the service only by calling API operations. By default, the service starts both the web UI and API server. Sample code: `python webui/webui_server.py --api-only`.	False
--no-api	Allows users to access the service only by using the web UI. By default, the service starts both the web UI and API server. Sample code: `python webui/webui_server.py --no-api`.	False
--max-new-tokens	The maximum number of output tokens. Sample code: `python api/api_server.py --port=8000 --max-new-tokens=1024`.	2048
--temperature	Specify the randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. Sample code: `python api/api_server.py --port=8000 --temperature=0.8`.	0.95
--max_round	The number of dialogue rounds supported during inference. Sample code: `python api/api_server.py --port=8000 --max_round=10`.	5
--top_k	The number of outputs selected from the generated results. Sample code: `python api/api_server.py --port=8000 --top_k=10`.	N/A
--top_p	The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. Sample code: `python api/api_server.py --port=8000 --top_p=0.9`.	N/A
--no-template	Models such as Llama2 and falcon provide a default prompt template. You can specify this parameter if you want to use your template instead of the default prompt template. Sample code: `python api/api_server.py --port=8000 --no-template`.	If this parameter is not specified, the default prompt template is automatically used.
--log-level	Specify the log output level. Valid values: DEBUG, INFO, WARNING, and ERROR. Sample code: `python api/api_server.py --port=8000 --log-level=DEBUG`.	INFO
--export-history-path	You can use EAS-LLM to export the conversation history. In this case, you must specify an output path to which you want to export the conversation history when you start the service. In most cases, you can specify a mount path of an OSS bucket. EAS exports the records of the conversation that happened over a specific period of time to a file. Sample code: `python api/api_server.py --port=8000 --export-history-path=/your_mount_path`.	By default, this feature is disabled.
--export-interval	The period of time during which the conversation is recorded. Unit: seconds. For example, if you set the `--export-interval` parameter to 3600, the conversation records of the previous hour are exported into a file.	3600
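
To build intuition for how the `--temperature`, `--top_k`, and `--top_p` options interact, the following sketch applies them to a toy set of token scores. It is a generic illustration of how these sampling controls are commonly implemented and not the service's actual code. The function name and the example scores are made up.

```python
import math
from typing import Dict


def filter_distribution(logits: Dict[str, float],
                        temperature: float = 0.95,
                        top_k: int = 30,
                        top_p: float = 0.8) -> Dict[str, float]:
    """Return the renormalized token distribution after temperature, top-k, and top-p."""
    if temperature <= 0:
        # A temperature of 0 yields a fixed output: always the highest-scoring token.
        return {max(logits, key=logits.get): 1.0}
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())
    weights = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((token, w / total) for token, w in weights.items()),
                    key=lambda item: item[1], reverse=True)
    # top_k keeps the k most likely tokens.
    ranked = ranked[:top_k]
    # top_p keeps the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for _, prob in kept)
    return {token: prob / norm for token, prob in kept}


if __name__ == "__main__":
    scores = {"Paris": 2.0, "London": 1.0, "Rome": 0.0}
    for t in (0, 0.5, 1.0):
        print(t, filter_distribution(scores, temperature=t, top_k=3, top_p=0.8))
```

Lower temperatures concentrate the probability on the top token, while smaller `top_k` and `top_p` values shrink the set of candidates that the model samples from.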

References

For more information about EAS, see EAS overview.
You can also perform vector search in an on-premises knowledge base. For more information, see Use PAI and a vector database to implement intelligent dialogue based on LLMs.