Platform for AI: Quickly deploy open source LLMs in EAS

Last Updated: May 13, 2024

The Elastic Algorithm Service (EAS) module of Platform for AI (PAI) is a model serving platform for online inference scenarios. You can use EAS to deploy a large language model (LLM) with a few clicks and then call the model by using the Web User Interface (WebUI) or API operations. After you deploy an LLM, you can use the LangChain framework to build a Q&A chatbot that is connected to a custom knowledge base. You can also use the inference acceleration engines provided by EAS, such as BladeLLM and vLLM, to ensure high concurrency and low latency.

Background information

The application of LLMs, such as the Generative Pre-trained Transformer (GPT) and Tongyi Qianwen (Qwen) series of models, has garnered significant attention, especially in inference tasks. You can select from a wide range of open source LLMs based on your business requirements. EAS allows you to quickly deploy mainstream open source LLMs as inference services with a few clicks. Supported LLMs include Qwen, Llama 2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B. This topic describes how to deploy an LLM in EAS and call the model. This topic also provides answers to some frequently asked questions about LLM deployment in EAS.

Prerequisites

Limits

The inference acceleration engines provided by EAS support only the following models: Qwen, Llama 2, Baichuan-13B, and Baichuan2-13B.

Deploy an LLM in EAS

  1. Go to the EAS-Online Model Services page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.

    3. In the left-side navigation pane, choose Model Deployment>Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.

  2. On the EAS-Online Model Services page, click Deploy Service. In the dialog box that appears, select Large Language Model (LLM) and click OK.

  3. On the LLM Deployment page, configure the parameters. The following table describes the required parameters. Retain the default settings for other parameters.

    • Service Name: The name of the service. In this example, the service is named llm_demo001.

    • Model Type: The model that you want to deploy. In this example, Qwen-7b (Tongyi Qianwen-7b) is selected.

    • Resource Configuration: In this example, the Instance Type parameter is set to ml.gu7i.c16m60.1-gu30 for cost efficiency. For information about the recommended instance types for other open source LLMs, see the "How do I switch to another open source LLM?" section of this topic.

      Note: If the resources in the current region are insufficient, you can deploy the model in the Singapore region.

  4. Click Deploy. The deployment requires approximately 5 minutes to complete.

    When the Model Status changes to Running, the service is deployed.

Use the WebUI to perform inference

  1. Find the service that you want to manage and click View Web App in the Service Type column to access the WebUI.

  2. Test the inference performance on the WebUI.

    Enter a sentence in the input text box and click Send to start a conversation. Sample input: Provide a learning plan for personal finance.

FAQ

How do I switch to another open source LLM?

EAS provides the following open source LLMs: Qwen, Llama 2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B. To switch between these models, perform the following steps:

  1. On the EAS-Online Model Services page, find the service that you want to update and click Update Service in the Actions column.

  2. On the Deploy Service page, modify the Command to Run and Instance Type parameters and then click Update. The following table describes the parameter configurations for different models.

    • Qwen-1.8B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-1_8B-Chat
      Recommended GPU specification: 1 * GU30, 1 * NVIDIA A10, 1 * NVIDIA T4, or 1 * NVIDIA V100

    • Qwen-7B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat
      Recommended GPU specification: 1 * GU30 or 1 * NVIDIA A10

    • Qwen-14B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-14B-Chat
      Recommended GPU specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    • Qwen-72B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-72B-Chat
      Recommended GPU specification: 8 * NVIDIA V100 (gn6e)

    • Qwen1.5-7B
      Command to Run: python webui/webui_server.py --model-path=Qwen/Qwen1.5-7B-Chat --port=8000
      Recommended GPU specification: 1 * GU30 or 1 * NVIDIA A10

    • Qwen1.5-14B
      Command to Run: python webui/webui_server.py --model-path=Qwen/Qwen1.5-14B-Chat --port=8000
      Recommended GPU specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    • Qwen1.5-72B
      Command to Run: python webui/webui_server.py --model-path=Qwen/Qwen1.5-72B-Chat --port=8000
      Recommended GPU specification: 8 * NVIDIA V100 (gn6e)

    • Llama2-7B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf
      Recommended GPU specification: 1 * GU30 or 1 * NVIDIA A10

    • Llama2-13B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16
      Recommended GPU specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    • chatglm2-6B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm2-6b
      Recommended GPU specification: 1 * GU30 or 1 * NVIDIA A10

    • chatglm3-6B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm3-6b
      Recommended GPU specification: 1 * GU30 or 1 * NVIDIA A10

    • baichuan-13B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan-13B-Chat
      Recommended GPU specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    • baichuan2-7B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat
      Recommended GPU specification: 1 * GU30 or 1 * NVIDIA A10

    • baichuan2-13B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat
      Recommended GPU specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    • Yi-6B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=01-ai/Yi-6B
      Recommended GPU specification: 1 * GU30 or 1 * NVIDIA A10

    • Mistral-7B
      Command to Run: python webui/webui_server.py --model-path=mistralai/Mistral-7B-Instruct-v0.1
      Recommended GPU specification: 1 * GU30 or 1 * NVIDIA A10

    • falcon-7B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct
      Recommended GPU specification: 1 * GU30 or 1 * NVIDIA A10

How do I use LangChain to integrate my business data?

  • Introduction to LangChain

    LangChain is an open source framework that allows you to combine LLMs such as GPT-4 with external data sources to improve inference performance and optimize resource utilization. LangChain is commonly used to develop Retrieval Augmented Generation (RAG) applications.

  • How LangChain works in RAG applications

    LangChain divides the source data (such as a 20-page PDF file) into smaller chunks, converts the chunks into numerical vectors by using embedding models (such as BGE and text2vec), and then stores the vectors in a vector database.

    This way, the LLM can use the data in the vector database to generate responses. For each user query, LangChain retrieves the chunks that are relevant to the query from the vector database, includes the retrieved content and the query in a prompt, and then sends the prompt to the LLM to generate an answer. A minimal sketch of this workflow is provided at the end of this list.

  • How to configure LangChain for the LLM that you deployed in EAS

    1. On the WebUI of the service that you deployed, click the LangChain tab.

    2. In the lower-left corner of the ChatLLM-LangChain-WebUI page, follow the on-screen instructions to upload a knowledge base. You can upload files in the following formats: TXT, Markdown, DOCX, and PDF.

      For example, you can upload a README.md file and click Vectorstore knowledge. A message indicates that the data in the file is loaded.

    3. Enter a question about the data you uploaded in the input text box and click Send to start a conversation.

      Note

      After you configure LangChain on the WebUI, the configuration also takes effect when you call the model by using API operations. You can also store your knowledge base in an on-premises vector database. For more information, see Use PAI and a vector database to build an LLM-powered chatbot.

      Example input: How to install deepspeed.
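
  • Conceptual example of the RAG workflow

    The following Python sketch illustrates the retrieval workflow that is described above. The sketch is conceptual: it does not call the LangChain library, an embedding model, or the EAS service. The embed function, the sample document, and the sample query are placeholders that you would replace with a real embedding model (such as BGE or text2vec) and your own data.

    import math
    from typing import List, Tuple

    def chunk(text: str, size: int = 200) -> List[str]:
        # Split the source document into fixed-size chunks.
        return [text[i:i + size] for i in range(0, len(text), size)]

    def embed(text: str) -> List[float]:
        # Placeholder embedding. A real application calls an embedding model such as BGE or text2vec.
        vec = [0.0] * 64
        for token in text.lower().split():
            vec[hash(token) % 64] += 1.0
        return vec

    def cosine(a: List[float], b: List[float]) -> float:
        # Cosine similarity between two vectors.
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm if norm else 0.0

    # 1. Build the vector store: a list of (chunk, vector) pairs.
    source_document = "DeepSpeed can be installed by running pip install deepspeed. " * 20
    store: List[Tuple[str, List[float]]] = [(c, embed(c)) for c in chunk(source_document)]

    # 2. Retrieve the chunk that is most relevant to the user query.
    query = "How to install deepspeed?"
    best_chunk = max(store, key=lambda item: cosine(embed(query), item[1]))[0]

    # 3. Combine the retrieved chunk and the query into a prompt for the LLM that is deployed in EAS.
    prompt = f"Answer the question based on the context.\nContext: {best_chunk}\nQuestion: {query}"
    print(prompt)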

How do I improve concurrency and reduce latency for the inference service?

EAS provides BladeLLM and vLLM, which are inference acceleration engines that you can use to ensure high concurrency and low latency. To use the inference acceleration engines, perform the following steps:

  1. On the EAS-Online Model Services page, find the service that you want to update and click Update Service in the Actions column.

  2. In the Model Service Information section, modify the Select Image and Command to Run parameters.

    Important

    The inference acceleration engines support only the following models: Qwen, Llama 2, Baichuan-13B, and Baichuan2-13B.

    • Use BladeLLM

      • Select Image: Select PAI Image. In the drop-down lists that appear, select chat-llm-webui and 3.0-blade.

      • Command to Run: After you configure the image version, the system automatically configures this parameter.

    • Use vLLM

      • Select Image: Select PAI Image. In the drop-down lists that appear, select chat-llm-webui and 3.0-vllm.

      • Command to Run: After you configure the image version, the system automatically configures this parameter.

  3. Update the version of vLLM.

    A recently released model may require a vLLM version that is different from the version used by previously released models. We recommend that you update the vLLM version based on your business requirements. To update the vLLM version, click Specify Third-party Libraries, select Third-party Libraries, and then specify the version of vLLM that you want to use.

  4. Click Update.

How do I mount a custom model?

You can use Object Storage Service (OSS) to mount a custom model. Procedure:

  1. Upload the model and related configuration files to your OSS bucket. For information about how to create a bucket and upload objects, see Create buckets and Upload objects.

    The model files that you need to prepare include the model weight files and the related configuration files.

    The config.json file is required. You must configure the config.json file based on the Hugging Face model format. For more information about the sample file, see config.json. If you want to upload the model files programmatically instead of in the console, see the sketch at the end of this section.

  2. Click Update Service in the Actions column of the service.

  3. In the Model Service Information section, configure the following parameters and then click Update.

    • Model Settings: Click Specify Model Settings and configure the following settings:

      • In the Model Settings section, select Mount OSS Path and then select the OSS path that contains the custom model files. Example: oss://bucket-test/data-oss/.

      • Set the Mount Path parameter to /data.

      • Turn off Enable Read-only Mode.

    • Command to Run: Add the following parameters:

      • --model-path: Set this parameter to /data, which is the value of the Mount Path parameter.

      • --model-type: The type of the model.

      For information about the commands for different types of models, see the "Commands to run" section of this topic.

    Commands to run

    • Llama2: python webui/webui_server.py --port=8000 --model-path=/data --model-type=llama2

    • ChatGLM2: python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm2

    • ChatGLM3: python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm3

    • Qwen: python webui/webui_server.py --port=8000 --model-path=/data --model-type=qwen

    • ChatGLM: python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm

    • Falcon-7B: python webui/webui_server.py --port=8000 --model-path=/data --model-type=falcon
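
If you want to upload the model files to OSS programmatically instead of in the console, the following Python sketch shows one possible approach that uses the oss2 SDK. The AccessKey pair, endpoint, bucket name, and local directory are placeholders that you must replace with your own values.

    import os

    import oss2  # OSS SDK for Python. Install it by running: pip install oss2

    # Placeholders: replace with your own AccessKey pair, OSS endpoint, and bucket name.
    auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "bucket-test")

    local_dir = "./custom-model"  # The local directory that contains config.json and the model weights.
    oss_prefix = "data-oss/"      # The OSS path that is mounted to /data in the EAS service.

    # Upload every file in the local directory to the OSS path and preserve the relative layout.
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = oss_prefix + os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            bucket.put_object_from_file(key, local_path)
            print(f"Uploaded {local_path} to oss://bucket-test/{key}")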

How do I call API operations to perform inference?

  1. Obtain the service access endpoint and token.

    1. Go to the EAS-Online Model Services page. For more information, see the "Deploy an LLM in EAS" section of this topic.

    2. Click the name of the service to go to the Service Details tab.

    3. In the Basic Information section, click Invocation Method. On the Public Endpoint tab, obtain the service token and endpoint.

  2. To call API operations to perform inference, use one of the following methods:

    Use HTTP

    • Non-streaming mode

      You can use cURL commands to send the following types of standard HTTP requests:

      • STRING requests

        curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v

        Replace $authorization with the service token. Replace $host with the service endpoint. The file chatllm_data.txt is a plain text file that contains the prompt.
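
        For reference, the following Python sketch sends the same type of plain-text request by using the requests package. The endpoint, token, and prompt are placeholders that you must replace with your own values.

        import requests

        host = "<EAS service public endpoint>"        # Replace with the service endpoint.
        authorization = "<EAS service public token>"  # Replace with the service token.

        # Send the prompt as a plain-text request body, which is equivalent to the preceding cURL command.
        response = requests.post(host,
                                 headers={"Authorization": authorization},
                                 data="Provide a learning plan for personal finance.".encode("utf-8"))
        print(response.text)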

      • Structured requests

        curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"

        Use the chatllm_data.json file to configure inference parameters. The following sample code provides an example of the content format of the chatllm_data.json file:

        {
          "max_new_tokens": 4096,
          "use_stream_chat": false,
          "prompt": "How to install it?",
          "system_prompt": "Act like you are programmer with 5+ years of experience.",
          "history": [
            [
              "Can you tell me what's the bladellm?",
              "BladeLLM is a framework for LLM serving that integrates acceleration techniques such as quantization and AI compilation, and supports popular LLMs such as OPT, Bloom, and LLaMA."
            ]
          ],
          "temperature": 0.8,
          "top_k": 10,
          "top_p": 0.8,
          "do_sample": true,
          "use_cache": true
        }

        The following parameters are used in the preceding code. Configure the parameters based on your business requirements.

        • max_new_tokens: The maximum number of output tokens. Default value: 2048.

        • use_stream_chat: Specifies whether to return the output tokens in streaming mode. Default value: true.

        • prompt: The user prompt. Default value: "".

        • system_prompt: The system prompt. Default value: "".

        • history: The dialogue history. The value is in the List[Tuple(str, str)] format. Default value: [()].

        • temperature: The randomness of the model output. A larger value specifies higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. Default value: 0.95.

        • top_k: The number of outputs selected from the generated results. Default value: 30.

        • top_p: The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. Default value: 0.8.

        • do_sample: Specifies whether to enable output sampling. Default value: true.

        • use_cache: Specifies whether to enable KV cache. Default value: true.

      You can also implement your own client based on the Python requests package. Example:

      import argparse
      import json
      from typing import Iterable, List
      
      import requests
      
      def post_http_request(prompt: str,
                            system_prompt: str,
                            history: list,
                            host: str,
                            authorization: str,
                            max_new_tokens: int = 2048,
                            temperature: float = 0.95,
                            top_k: int = 1,
                            top_p: float = 0.8,
                            langchain: bool = False,
                            use_stream_chat: bool = False) -> requests.Response:
          headers = {
              "User-Agent": "Test Client",
              "Authorization": f"{authorization}"
          }
          if not history:
              history = [
                  (
                      "San Francisco is a",
                      "city located in the state of California in the United States. \
                      It is known for its iconic landmarks, such as the Golden Gate Bridge \
                      and Alcatraz Island, as well as its vibrant culture, diverse population, \
                      and tech industry. The city is also home to many famous companies and \
                      startups, including Google, Apple, and Twitter."
                  )
              ]
          pload = {
              "prompt": prompt,
              "system_prompt": system_prompt,
              "top_k": top_k,
              "top_p": top_p,
              "temperature": temperature,
              "max_new_tokens": max_new_tokens,
              "use_stream_chat": use_stream_chat,
              "history": history
          }
          if langchain:
              pload["langchain"] = langchain
          response = requests.post(host, headers=headers,
                                   json=pload, stream=use_stream_chat)
          return response
      
      # The server returns a JSON response that includes the inference result and the dialogue history.
      def get_response(response: requests.Response) -> List[str]:
          data = json.loads(response.content)
          output = data["response"]
          history = data["history"]
          return output, history
      
      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--top-k", type=int, default=4)
          parser.add_argument("--top-p", type=float, default=0.8)
          parser.add_argument("--max-new-tokens", type=int, default=2048)
          parser.add_argument("--temperature", type=float, default=0.95)
          parser.add_argument("--prompt", type=str, default="How can I get there?")
          parser.add_argument("--langchain", action="store_true")
      
          args = parser.parse_args()
      
          prompt = args.prompt
          top_k = args.top_k
          top_p = args.top_p
          use_stream_chat = False
          temperature = args.temperature
          langchain = args.langchain
          max_new_tokens = args.max_new_tokens
      
          host = "<EAS service public endpoint>"
          authorization = "<EAS service public token>"
      
          print(f"Prompt: {prompt!r}\n", flush=True)
          # System prompts can be included in the requests. 
          system_prompt = "Act like you are programmer with \
                      5+ years of experience."
      
          # Dialogue history can be included in the requests. The client manages the history to implement multi-round dialogues. In most cases, the previous round of dialogue information is used. The information is in the List[Tuple(str, str)] format. 
          history = []
          response = post_http_request(
              prompt, system_prompt, history,
              host, authorization,
              max_new_tokens, temperature, top_k, top_p,
              langchain=langchain, use_stream_chat=use_stream_chat)
          output, history = get_response(response)
          print(f" --- output: {output} \n --- history: {history}", flush=True)
      

      In the preceding code:

      • Set host to the service access endpoint.

      • Set authorization to the service token.

    • Streaming mode

      In streaming mode, the HTTP Server-Sent Events (SSE) method is used. Sample code:

      import argparse
      import json
      from typing import Iterable, List
      
      import requests
      
      
      def clear_line(n: int = 1) -> None:
          LINE_UP = '\033[1A'
          LINE_CLEAR = '\x1b[2K'
          for _ in range(n):
              print(LINE_UP, end=LINE_CLEAR, flush=True)
      
      
      def post_http_request(prompt: str,
                            system_prompt: str,
                            history: list,
                            host: str,
                            authorization: str,
                            max_new_tokens: int = 2048,
                            temperature: float = 0.95,
                            top_k: int = 1,
                            top_p: float = 0.8,
                            langchain: bool = False,
                            use_stream_chat: bool = False) -> requests.Response:
          headers = {
              "User-Agent": "Test Client",
              "Authorization": f"{authorization}"
          }
          if not history:
              history = [
                  (
                      "San Francisco is a",
                      "city located in the state of California in the United States. \
                      It is known for its iconic landmarks, such as the Golden Gate Bridge \
                      and Alcatraz Island, as well as its vibrant culture, diverse population, \
                      and tech industry. The city is also home to many famous companies and \
                      startups, including Google, Apple, and Twitter."
                  )
              ]
          pload = {
              "prompt": prompt,
              "system_prompt": system_prompt,
              "top_k": top_k,
              "top_p": top_p,
              "temperature": temperature,
              "max_new_tokens": max_new_tokens,
              "use_stream_chat": use_stream_chat,
              "history": history
          }
          if langchain:
              pload["langchain"] = langchain
          response = requests.post(host, headers=headers,
                                   json=pload, stream=use_stream_chat)
          return response
      
      
      def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
          for chunk in response.iter_lines(chunk_size=8192,
                                           decode_unicode=False,
                                           delimiter=b"\0"):
              if chunk:
                  data = json.loads(chunk.decode("utf-8"))
                  output = data["response"]
                  history = data["history"]
                  yield output, history
      
      
      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--top-k", type=int, default=4)
          parser.add_argument("--top-p", type=float, default=0.8)
          parser.add_argument("--max-new-tokens", type=int, default=2048)
          parser.add_argument("--temperature", type=float, default=0.95)
          parser.add_argument("--prompt", type=str, default="How can I get there?")
          parser.add_argument("--langchain", action="store_true")
          args = parser.parse_args()
      
          prompt = args.prompt
          top_k = args.top_k
          top_p = args.top_p
          use_stream_chat = True
          temperature = args.temperature
          langchain = args.langchain
          max_new_tokens = args.max_new_tokens
      
          host = ""
          authorization = ""
      
          print(f"Prompt: {prompt!r}\n", flush=True)
          system_prompt = "Act like you are programmer with \
                      5+ years of experience."
          history = []
          response = post_http_request(
              prompt, system_prompt, history,
              host, authorization,
              max_new_tokens, temperature, top_k, top_p,
              langchain=langchain, use_stream_chat=use_stream_chat)
      
          for h, history in get_streaming_response(response):
              print(
                  f" --- stream line: {h} \n --- history: {history}", flush=True)
      

      In the preceding code:

      • Set host to the service access endpoint.

      • Set authorization to the service token.

    Use WebSocket

    The WebSocket protocol can efficiently handle the dialogue history. You can use the WebSocket method to connect to the service and perform one or more rounds of dialogue. Sample code:

    import os
    import time
    import json
    import struct
    from multiprocessing import Process
    
    import websocket
    
    round = 5
    questions = 0
    
    
    def on_message_1(ws, message):
        if message == "<EOS>":
            print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
        else:
            print("{}".format(time.time()))
            print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
    
    
    def on_message_2(ws, message):
        global questions
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        if message == "<EOS>":
            questions = questions + 1
            if questions == 5:
                ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_message_3(ws, message):
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_error(ws, error):
        print('error happened: ', str(error))
    
    
    def on_close(ws, a, b):
        print("### closed ###", a, b)
    
    
    def on_pong(ws, pong):
        print('pong:', pong)
    
    # stream chat validation test
    def on_open_1(ws):
        print('Opening Websocket connection to the server ... ')
        params_dict = {}
        params_dict['prompt'] = """Show me a golang code example: """
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['do_sample'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        # raw_req = f"""To open a Websocket connection to the server: """
    
        ws.send(raw_req)
        # end the client-side streaming
    
    
    # multi-round query validation test
    def on_open_2(ws):
        global round
        print('Opening Websocket connection to the server ... ')
        params_dict = {"max_new_tokens": 6144}
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['use_stream_chat'] = True
        params_dict['prompt'] = "Hello! "
        # Set the system prompt without discarding the parameters that are configured above.
        params_dict['system_prompt'] = "Act like you are programmer with 5+ years of experience."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please write a sorting algorithm in Python."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please convert the programming language to Java."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please introduce yourself."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please summarize the dialogue above."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    # Langchain validation test.
    def on_open_3(ws):
        global round
        print('Opening Websocket connection to the server ... ')
    
        params_dict = {}
        # params_dict['prompt'] = """To open a Websocket connection to the server: """
        params_dict['prompt'] = """Can you tell me what's the MNN?"""
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['use_stream_chat'] = False
        params_dict['langchain'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    authorization = ""
    host = "ws://" + ""
    
    
    def single_call(on_open_func, on_message_func, on_close_func=on_close):
        ws = websocket.WebSocketApp(
            host,
            on_open=on_open_func,
            on_message=on_message_func,
            on_error=on_error,
            on_pong=on_pong,
            on_close=on_close_func,
            header=[
                'Authorization: ' + authorization],
        )
    
        # setup ping interval to keep long connection.
        ws.run_forever(ping_interval=2)
    
    
    if __name__ == "__main__":
        for i in range(5):
            p1 = Process(target=single_call, args=(on_open_1, on_message_1))
            p2 = Process(target=single_call, args=(on_open_2, on_message_2))
            p3 = Process(target=single_call, args=(on_open_3, on_message_3))
    
            p1.start()
            p2.start()
            p3.start()
    
            p1.join()
            p2.join()
            p3.join()

    In the preceding code:

    • Set the authorization parameter to the service token.

    • Set the host parameter to the service endpoint and replace the http prefix in the endpoint with ws.

    • Use the use_stream_chat parameter to specify whether the output is returned in streaming mode. Default value: True.

    • Refer to the on_open_2 function in the preceding code to implement a multi-round conversation.

Reference