
Platform for AI: Deploy a Llama 2 model as a web application in EAS

Last Updated: Apr 30, 2024

Llama 2 Chat models are suitable for dialogue use cases. Elastic Algorithm Service (EAS) of Platform for AI (PAI) allows you to easily deploy a chat application that is powered by a Llama 2 Chat model and accessible by using a web page. You can also use LangChain to integrate your business data into the application to ensure that the answers are aligned with your business requirements.

Background information

Llama 2 is a series of open-source large language models (LLMs) provided by Meta, available in 7B, 13B, and 70B parameter sizes. Llama 2 models were trained on 2 trillion tokens, a 40% increase over Llama 1, and support a maximum sequence length of 4,096 tokens, which is twice that of Llama 1 models. Llama 2 Chat models are fine-tuned versions of the pre-trained Llama 2 models that cater to chat scenarios. Techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are used in the fine-tuning process to increase the safety of the models and their alignment with human preferences. The fine-tuning data includes publicly available instruction datasets and over 1 million human-labeled samples. Llama 2 Chat can be used as a chat assistant in various natural language generation scenarios. This topic describes how to deploy a Llama 2 model as a chat application in EAS and use the web UI to perform model inference. This topic also provides answers to frequently asked questions about Llama 2 model deployment. In this example, the Llama2-13b-chat model is used.

Deploy a model service in EAS

  1. Go to the EAS-Online Model Services page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.

    3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.

  2. On the EAS-Online Model Services page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.

  3. On the Create Service page, configure the following parameters.

    • Service Name: The name of the service. In this example, chatllm_llama2_13b is specified.

    • Deployment Method: Select Deploy Web App by Using Image.

    • Select Image: Click PAI Image, select chat-llm-webui from the drop-down list, and then select 2.0 as the image version.

      Note: If image versions later than 2.0 are available when you deploy the service, you can select the latest version.

    • Command to Run: Specify one of the following commands based on your model and set the port number to 8000. In this example, the 13B model is used.

      • 13B model: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16

      • 7B model: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf

    • Resource Group Type: Select Public Resource Group.

    • Resource Configuration Mode: Select General.

    • Resource Configuration: Click GPU and select an instance type from the list. In this example, ecs.gn6e-c12g1.3xlarge is selected.

      • The 13B model requires an instance type from the gn6e instance family or higher.

      • For the 7B model, an A10 or GU30 instance is recommended.

    • Additional System Disk: Set the value to 50 GB.


  4. Click Deploy. The deployment requires approximately five minutes to complete.

    When the Model Status changes to Running, the service is deployed.

Use the web application to perform model inference

  1. Find the service that you want to manage and click View Web App in the Service Type column.

  2. Perform model inference on the web application.

    Enter a prompt in the input text box, such as Give me a plan for learning the basics of personal finance. Click Send.

FAQ

How do I use LangChain to integrate my business data into the application?

  • What is LangChain

    LangChain is an open source framework that allows AI developers to integrate LLMs such as GPT-4 with external data to improve performance and optimize resource utilization.

  • How does LangChain work

    LangChain splits a document, such as a 20-page PDF file, into smaller chunks, converts the chunks into numerical vectors by using an embedding model such as BAAI General Embedding (BGE) or text2vec, and then stores the vectors in a vector store.

    LangChain processes the user input and stores your data locally as the knowledge base of the LLM. During each inference, LangChain retrieves the content that is most similar to the input question from the local knowledge base, and then passes the retrieved content together with the user input to the LLM to generate a custom answer. A minimal sketch of this retrieval flow is provided after the steps below.

  • How to integrate your business data

    1. Click the LangChain tab in the upper-left corner of the page.

    2. Upload custom data in the lower-left corner of the web page. You can upload files in the following formats: TXT, MD, DOCX, and PDF.

      For example, you can drag and drop a README.md file to upload the file and then click Vectorstore knowledge in the lower-left corner. A message is displayed when the custom data is loaded.

    3. In the input box in the lower part of the page, enter a sentence to start a dialogue.

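  • Retrieval flow example

    The following Python sketch illustrates the general chunk-embed-retrieve flow described above. It is a minimal example under assumptions that are not part of this topic: a langchain 0.0.x-style API, the sentence-transformers and faiss-cpu packages, and the BAAI/bge-small-en embedding model. It is not the exact implementation inside the chat-llm-webui image.

    # Minimal retrieval-augmented generation (RAG) sketch. Assumptions: langchain 0.0.x-style
    # APIs, sentence-transformers, faiss-cpu, and the BAAI/bge-small-en embedding model.
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS

    # 1. Split a document into smaller chunks.
    document = open("README.md", encoding="utf-8").read()
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(document)

    # 2. Convert the chunks into vectors and store them in a vector store.
    embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")
    vector_store = FAISS.from_texts(chunks, embeddings)

    # 3. Retrieve the chunks that are most similar to the question and build the final prompt.
    question = "How do I install this project?"
    similar_chunks = vector_store.similarity_search(question, k=3)
    context = "\n".join(chunk.page_content for chunk in similar_chunks)
    prompt = f"Answer the question based on the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    print(prompt)  # Send this prompt to the EAS service, for example by using the HTTP API.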

How do I switch to another open source foundation model?

EAS comes with several open-source foundation models, such as Llama 2, ChatGLM, and Tongyi Qianwen (Qwen). Perform the following steps to switch to one of these models.

  1. On the Elastic Algorithm Service (EAS) page, find the service that you want to update. Click Update Service in the Actions column of the service.

  2. On the Update Service page, modify the Command to Run and Instance Type parameters based on the following list. Then, click Update.

    • Llama2-13b (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16

      Recommended specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    • Llama2-7b (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf

      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    • ChatGLM2-6B (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm2-6b

      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    • Qwen-7b (Tongyi Qianwen-7b) (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat

      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    • ChatGLM-6B (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm-6b

      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    • Baichuan-13B (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan-13B-Chat

      Recommended specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    • Falcon-7B (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct

      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    • Baichuan2-7B (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat

      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    • Baichuan2-13B (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat

      Recommended specification: 2 * GU30 or 2 * NVIDIA A10

    • Qwen-14b (Tongyi Qianwen-14b) (API+WebUI)

      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-14B-Chat

      Recommended specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

How do I deploy a custom model?

You can use Object Storage Service (OSS) to mount a custom model. Procedure:

  1. Upload the model and related configuration files to your OSS bucket. For information about how to create a bucket and upload objects, see Get started by using the OSS console. An example that uses the OSS Python SDK (oss2) to upload the files is provided after this procedure.

    The following figure shows a sample of the model files that you need to prepare:

    The config.json file must be uploaded. You must configure the config.json file based on the Hugging Face model format. For more information about the sample file, see config.json.

  2. Click Update Service in the Actions column of the service.

  3. In the Model Service Information section, configure the following parameters and click Update.

    • Model Settings: Click Specify Model Settings and configure the following settings:

      • Select Mount OSS Path in the Model Settings section. Set the OSS path to the path of the custom model files. Example: oss://bucket-test/data-oss/.

      • Set Mount Path to /data.

      • Turn off Enable Read-only Mode to disable the read-only mode.

    • Command to Run: Add the following options to the command:

      • --model-path: Set the value to /data, which is the same as the mount path.

      • --model-type: the type of the model.

      For the commands to run for different types of models, see Commands to run below.

    Commands to run

    • Llama2 (API+WebUI): python webui/webui_server.py --port=8000 --model-path=/data --model-type=llama2 --precision=fp16

    • ChatGLM2 (API+WebUI): python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm2

    • Qwen (Tongyi Qianwen) (API+WebUI): python webui/webui_server.py --port=8000 --model-path=/data --model-type=qwen

    • ChatGLM (API+WebUI): python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm

    • Falcon-7B (API+WebUI): python webui/webui_server.py --port=8000 --model-path=/data --model-type=falcon
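  The following Python sketch shows one way to upload the model files in step 1 by using the OSS Python SDK (oss2) instead of the OSS console. It is a minimal example: the credentials, endpoint, bucket name, and paths are placeholders that you must replace, and you can also use the OSS console or ossutil instead.

    import os

    import oss2

    # Placeholder values; replace them with your own credentials, endpoint, bucket, and paths.
    access_key_id = "<your AccessKey ID>"
    access_key_secret = "<your AccessKey secret>"
    endpoint = "https://oss-cn-hangzhou.aliyuncs.com"
    bucket_name = "bucket-test"
    local_model_dir = "./llama2-7b-chat"  # local directory that contains config.json and the weights
    oss_prefix = "data-oss/"              # target path in the bucket, mounted as /data in EAS

    bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name)

    # Upload every file in the local model directory and preserve the directory layout.
    for root, _, files in os.walk(local_model_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = oss_prefix + os.path.relpath(local_path, local_model_dir).replace(os.sep, "/")
            bucket.put_object_from_file(key, local_path)
            print(f"uploaded {local_path} -> oss://{bucket_name}/{key}")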

How do I use API operations to perform model inference?

  1. Obtain the service access endpoint and token.

    1. Go to the Elastic Algorithm Service (EAS) page. For more information, see the Deploy a model service in EAS section in this topic.

    2. Click the name of the service to go to the Service Details tab.

    3. In the Basic Information section, click Invocation Method. On the Public Endpoint tab, obtain the service token and endpoint.

  2. Perform model inference by calling API operations. You can use HTTP or WebSocket.

    HTTP

    • Non-streaming mode

      You can run cURL commands to send standard HTTP requests of the following types:

      • STRING requests

        curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v

        Replace $authorization with the service token. Replace $host with the service endpoint. The file chatllm_data.txt is a plain text file that contains the prompt.

      • Structured requests

        curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"

        Use the chatllm_data.json file to configure inference parameters. The following sample code provides an example of the content format of the chatllm_data.json file:

        {
            "max_new_tokens": 4096,
            "use_stream_chat": false,
            "prompt": "How to install it?",
            "system_prompt": "Act like you are a programmer with 5+ years of experience.",
            "history": [
                [
                    "Can you tell me what's the bladellm?",
                    "BladeLLM is a framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc., and supporting popular LLMs like OPT, Bloom, LLaMA, etc."
                ]
            ],
            "temperature": 0.8,
            "top_k": 10,
            "top_p": 0.8,
            "do_sample": true,
            "use_cache": true
        }

        The following list describes the parameters in the preceding code. Configure the parameters based on your business requirements.

        • max_new_tokens: The maximum number of output tokens. Default value: 2048.

        • use_stream_chat: Specifies whether to return the output tokens in streaming mode. Default value: true.

        • prompt: The user prompt. Default value: "".

        • system_prompt: The system prompt. Default value: "".

        • history: The dialogue history, in the List[Tuple(str, str)] format. Default value: [()].

        • temperature: The randomness of the model output. A larger value increases randomness. A value of 0 produces a fixed output. Type: Float. Valid values: 0 to 1. Default value: 0.95.

        • top_k: The number of outputs selected from the generated results. Default value: 30.

        • top_p: The proportion of outputs selected from the generated results. Type: Float. Valid values: 0 to 1. Default value: 0.8.

        • do_sample: Specifies whether to enable output sampling. Default value: true.

        • use_cache: Specifies whether to enable the KV cache. Default value: true.

      You can also implement your own client based on the Python requests package. Example:

      import argparse
      import json
      from typing import Iterable, List
      
      import requests
      
      def post_http_request(prompt: str,
                            system_prompt: str,
                            history: list,
                            host: str,
                            authorization: str,
                            max_new_tokens: int = 2048,
                            temperature: float = 0.95,
                            top_k: int = 1,
                            top_p: float = 0.8,
                            langchain: bool = False,
                            use_stream_chat: bool = False) -> requests.Response:
          headers = {
              "User-Agent": "Test Client",
              "Authorization": f"{authorization}"
          }
          if not history:
              history = [
                  (
                      "San Francisco is a",
                      "city located in the state of California in the United States. \
                      It is known for its iconic landmarks, such as the Golden Gate Bridge \
                      and Alcatraz Island, as well as its vibrant culture, diverse population, \
                      and tech industry. The city is also home to many famous companies and \
                      startups, including Google, Apple, and Twitter."
                  )
              ]
          pload = {
              "prompt": prompt,
              "system_prompt": system_prompt,
              "top_k": top_k,
              "top_p": top_p,
              "temperature": temperature,
              "max_new_tokens": max_new_tokens,
              "use_stream_chat": use_stream_chat,
              "history": history
          }
          if langchain:
              pload["langchain"] = langchain
          response = requests.post(host, headers=headers,
                                   json=pload, stream=use_stream_chat)
          return response
      
      # The server returns a JSON response that includes the inference result and dialogue history.
      def get_response(response: requests.Response) -> List[str]:
          data = json.loads(response.content)
          output = data["response"]
          history = data["history"]
          return output, history
      
      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--top-k", type=int, default=4)
          parser.add_argument("--top-p", type=float, default=0.8)
          parser.add_argument("--max-new-tokens", type=int, default=2048)
          parser.add_argument("--temperature", type=float, default=0.95)
          parser.add_argument("--prompt", type=str, default="How can I get there?")
          parser.add_argument("--langchain", action="store_true")
      
          args = parser.parse_args()
      
          prompt = args.prompt
          top_k = args.top_k
          top_p = args.top_p
          use_stream_chat = False
          temperature = args.temperature
          langchain = args.langchain
          max_new_tokens = args.max_new_tokens
      
          host = "<EAS service public endpoint>"
          authorization = "<EAS service public token>"
      
          print(f"Prompt: {prompt!r}\n", flush=True)
          # System prompts can be included in the requests. 
          system_prompt = "Act like you are programmer with \
                      5+ years of experience."
      
          # Dialogue history can be included in the requests. The client manages the history to implement multi-round dialogues. In most cases, the previous round of dialogue information is used. The information is in the List[Tuple(str, str)] format. 
          history = []
          response = post_http_request(
              prompt, system_prompt, history,
              host, authorization,
              max_new_tokens, temperature, top_k, top_p,
              langchain=langchain, use_stream_chat=use_stream_chat)
          output, history = get_response(response)
          print(f" --- output: {output} \n --- history: {history}", flush=True)
      

      In the preceding code:

      • Set host to the service access endpoint.

      • Set authorization to the service token.

    • Streaming mode

      The streaming mode uses HTTP Server-Sent Events (SSE). Sample code:

      import argparse
      import json
      from typing import Iterable, List
      
      import requests
      
      
      def clear_line(n: int = 1) -> None:
          LINE_UP = '\033[1A'
          LINE_CLEAR = '\x1b[2K'
          for _ in range(n):
              print(LINE_UP, end=LINE_CLEAR, flush=True)
      
      
      def post_http_request(prompt: str,
                            system_prompt: str,
                            history: list,
                            host: str,
                            authorization: str,
                            max_new_tokens: int = 2048,
                            temperature: float = 0.95,
                            top_k: int = 1,
                            top_p: float = 0.8,
                            langchain: bool = False,
                            use_stream_chat: bool = False) -> requests.Response:
          headers = {
              "User-Agent": "Test Client",
              "Authorization": f"{authorization}"
          }
          if not history:
              history = [
                  (
                      "San Francisco is a",
                      "city located in the state of California in the United States. \
                      It is known for its iconic landmarks, such as the Golden Gate Bridge \
                      and Alcatraz Island, as well as its vibrant culture, diverse population, \
                      and tech industry. The city is also home to many famous companies and \
                      startups, including Google, Apple, and Twitter."
                  )
              ]
          pload = {
              "prompt": prompt,
              "system_prompt": system_prompt,
              "top_k": top_k,
              "top_p": top_p,
              "temperature": temperature,
              "max_new_tokens": max_new_tokens,
              "use_stream_chat": use_stream_chat,
              "history": history
          }
          if langchain:
              pload["langchain"] = langchain
          response = requests.post(host, headers=headers,
                                   json=pload, stream=use_stream_chat)
          return response
      
      
      def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
          for chunk in response.iter_lines(chunk_size=8192,
                                           decode_unicode=False,
                                           delimiter=b"\0"):
              if chunk:
                  data = json.loads(chunk.decode("utf-8"))
                  output = data["response"]
                  history = data["history"]
                  yield output, history
      
      
      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--top-k", type=int, default=4)
          parser.add_argument("--top-p", type=float, default=0.8)
          parser.add_argument("--max-new-tokens", type=int, default=2048)
          parser.add_argument("--temperature", type=float, default=0.95)
          parser.add_argument("--prompt", type=str, default="How can I get there?")
          parser.add_argument("--langchain", action="store_true")
          args = parser.parse_args()
      
          prompt = args.prompt
          top_k = args.top_k
          top_p = args.top_p
          use_stream_chat = True
          temperature = args.temperature
          langchain = args.langchain
          max_new_tokens = args.max_new_tokens
      
          host = ""
          authorization = ""
      
          print(f"Prompt: {prompt!r}\n", flush=True)
          system_prompt = "Act like you are programmer with \
                      5+ years of experience."
          history = []
          response = post_http_request(
              prompt, system_prompt, history,
              host, authorization,
              max_new_tokens, temperature, top_k, top_p,
              langchain=langchain, use_stream_chat=use_stream_chat)
      
          for h, history in get_streaming_response(response):
              print(
                  f" --- stream line: {h} \n --- history: {history}", flush=True)
      

      In the preceding code:

      • Set host to the service access endpoint.

      • Set authorization to the service token.

    WebSocket

    The WebSocket protocol can efficiently handle the conversation history. You can use the WebSocket method to connect to the service and perform one or more rounds of conversation. Sample code:

    import os
    import time
    import json
    import struct
    from multiprocessing import Process
    
    import websocket
    
    round = 5
    questions = 0
    
    
    def on_message_1(ws, message):
        if message == "<EOS>":
            print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
        else:
            print("{}".format(time.time()))
            print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
    
    
    def on_message_2(ws, message):
        global questions
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        if message == "<EOS>":
            questions = questions + 1
            if questions == 5:
                ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_message_3(ws, message):
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_error(ws, error):
        print('error happened: ', str(error))
    
    
    def on_close(ws, a, b):
        print("### closed ###", a, b)
    
    
    def on_pong(ws, pong):
        print('pong:', pong)
    
    # stream chat validation test
    def on_open_1(ws):
        print('Opening Websocket connection to the server ... ')
        params_dict = {}
        params_dict['prompt'] = """Show me a golang code example: """
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['do_sample'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        # raw_req = f"""To open a Websocket connection to the server: """
    
        ws.send(raw_req)
        # end the client-side streaming
    
    
    # multi-round query validation test
    def on_open_2(ws):
        global round
        print('Opening Websocket connection to the server ... ')
        params_dict = {"max_new_tokens": 6144}
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['use_stream_chat'] = True
        params_dict['prompt'] = "Hello!"
        params_dict = {
            "system_prompt":
            "Act like you are programmer with 5+ years of experience."
        }
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please write a sorting algorithm in Python."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please convert the code to Java."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please introduce yourself."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please summarize the dialogue above."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    # Langchain validation test.
    def on_open_3(ws):
        global round
        print('Opening Websocket connection to the server ... ')
    
        params_dict = {}
        # params_dict['prompt'] = """To open a Websocket connection to the server: """
        params_dict['prompt'] = """Can you tell me what's the MNN?"""
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['use_stream_chat'] = False
        params_dict['langchain'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    authorization = ""
    host = "ws://" + ""
    
    
    def single_call(on_open_func, on_message_func, on_close_func=on_close):
        ws = websocket.WebSocketApp(
            host,
            on_open=on_open_func,
            on_message=on_message_func,
            on_error=on_error,
            on_pong=on_pong,
            on_close=on_close_func,
            header=[
                'Authorization: ' + authorization],
        )
    
        # setup ping interval to keep long connection.
        ws.run_forever(ping_interval=2)
    
    
    if __name__ == "__main__":
        for i in range(5):
            p1 = Process(target=single_call, args=(on_open_1, on_message_1))
            p2 = Process(target=single_call, args=(on_open_2, on_message_2))
            p3 = Process(target=single_call, args=(on_open_3, on_message_3))
    
            p1.start()
            p2.start()
            p3.start()
    
            p1.join()
            p2.join()
            p3.join()

    In the preceding code:

    • Set authorization to the service token.

    • Set host to the service access endpoint. Replace the http in the endpoint with ws.

    • Use the use_stream_chat parameter to specify whether the server returns the output in streaming mode. The default value is True, which specifies that the server returns streaming data.

    • The on_open_2 function in the preceding code is used to implement multi-round dialogues.

What options are supported in the commands?

The following list describes the options that you can configure in the commands.

• --model-path: Specifies a preset model name or a custom model path. Default value: meta-llama/Llama-2-7b-chat-hf.

  • Example 1: Load a preset model. You can use a preset model in the meta-llama/Llama-2-* series, including Llama-2-7b-hf, Llama-2-7b-chat-hf, Llama-2-13b-hf, and Llama-2-13b-chat-hf. Example: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf

  • Example 2: Load an on-premises custom model. Example: python webui/webui_server.py --port=8000 --model-path=/llama2-7b-chat

• --cpu: Uses the CPU to perform model inference. By default, the GPU is used for model inference. Example: python webui/webui_server.py --port=8000 --cpu

• --precision: Specifies the precision of the Llama 2 model. Valid values: fp32 and fp16. By default, the system automatically selects the precision of the 7B model based on the GPU memory size. Example: python webui/webui_server.py --port=8000 --precision=fp32

• --port: Specifies the listening port of the server. Default value: 8000. Example: python webui/webui_server.py --port=8000

• --api-only: Allows users to access the service only by calling API operations. By default, the service starts both the web UI and the API server. Default value: False. Example: python webui/webui_server.py --api-only

• --no-api: Allows users to access the service only by using the web UI. By default, the service starts both the web UI and the API server. Default value: False. Example: python webui/webui_server.py --no-api

• --max-new-tokens: The maximum number of output tokens. Default value: 2048. Example: python api/api_server.py --port=8000 --max-new-tokens=1024

• --temperature: Specifies the randomness of the model output. A larger value increases randomness. A value of 0 produces a fixed output. Type: Float. Valid values: 0 to 1. Default value: 0.95. Example: python api/api_server.py --port=8000 --temperature=0.8

• --max_round: The number of dialogue rounds supported during inference. Default value: 5. Example: python api/api_server.py --port=8000 --max_round=10

• --top_k: The number of outputs selected from the generated results. Type: Int. Default value: N/A. Example: python api/api_server.py --port=8000 --top_k=30

• --top_p: The proportion of outputs selected from the generated results. Type: Float. Valid values: 0 to 1. Default value: N/A. Example: python api/api_server.py --port=8000 --top_p=0.9

• --no-template: Models such as Llama 2 and Falcon provide a default prompt template. Specify this option if you want to use your own template instead of the default one. If this option is not specified, the default prompt template is automatically used. Example: python api/api_server.py --port=8000 --no-template

• --log-level: Specifies the log output level. Valid values: DEBUG, INFO, WARNING, and ERROR. Default value: INFO. Example: python api/api_server.py --port=8000 --log-level=DEBUG

• --export-history-path: Exports the conversation history to the specified path. To use this feature, specify an output path when you start the service. In most cases, you can specify a mount path of an OSS bucket. EAS exports the records of the conversations that happened over a specific period of time to a file in that path. By default, this feature is disabled. Example: python api/api_server.py --port=8000 --export-history-path=/your_mount_path

• --export-interval: The period of time during which conversations are recorded. Unit: seconds. For example, if you set --export-interval to 3600, the conversation records of the previous hour are exported to a file. Default value: 3600.
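For example, the following command combines several of the preceding options. This is an illustrative combination, not a required configuration, and it assumes that all of the listed options are accepted by the entry point that you use (webui/webui_server.py or api/api_server.py); adjust the options to your scenario.

python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16 --max-new-tokens=1024 --log-level=DEBUG --export-history-path=/your_mount_path --export-interval=3600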

References