
Platform for AI: Deploy LLM applications in EAS

Last Updated: Feb 18, 2024

You can use Elastic Algorithm Service (EAS) of Platform for AI (PAI) to deploy a large language model (LLM) as an AI-powered web application. After you deploy the model, you can call the application by using the web UI or API operations. You can also use the LangChain framework to integrate an enterprise knowledge base and implement intelligent conversation and automation capabilities. EAS also provides the BladeLLM and vLLM inference acceleration engines, which support high concurrency and low latency.

Background information

As foundation models such as ChatGPT and Tongyi Qianwen gain popularity in the industry, inference applications built on LLMs have come under the spotlight. EAS allows you to choose among the open source foundation models that are available on the market based on their performance and your business requirements. For example, you can quickly launch third-party LLMs such as Qwen, Llama2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B in EAS, and deploy an open source model as an inference application in a few clicks. This topic describes how to deploy an LLM in EAS and call the model service. This topic also provides answers to frequently asked questions.

Prerequisites

Limits

The inference acceleration engine supports only the following model types: Qwen, Llama2, Baichuan-13B, and Baichuan2-13B.

Deploy model service in EAS

  1. Go to the EAS-Online Model Services page.

    1. Log on to the Platform for AI (PAI) console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.

    3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.

  2. On the EAS-Online Model Services page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.

  3. On the Deploy Service page, configure the required parameters. The following table describes key parameters.

    Service Name: The name of the service. The service name llm_demo001 is used in this example.

    Deployment Method: Select Deploy Web App by Using Image.

    Select Image: Click PAI Image, select chat-llm-webui from the drop-down list, and select 2.1 as the image version. Note: You can select the latest version of the image when you deploy the model service.

    Command to Run: After you select an image version, the system automatically sets the command to python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat, which calls the Qwen-7B model. If you want to call another LLM, replace the command. For more information, see How do I switch to another open source foundation model?

    Resource Group Type: Select Public Resource Group.

    Resource Configuration Mode: Select General.

    Resource Configuration: You must select a GPU instance type. For cost-effectiveness, we recommend the ml.gu7i.c16m60.1-gu30 instance type for the Qwen-7B model. For the instance types that we recommend for other open source LLMs, see How do I switch to another open source foundation model?

  4. Click Deploy. The deployment requires several seconds to complete.

    When the Model Status changes to Running, the service is deployed.

Use web UI to perform model inference

  1. Find the service that you want to manage and click View Web App in the Service Type column.

  2. Perform model inference on the web UI page.

    Enter a sentence in the input text box below the dialogue box. For example, enter please provide a financial learning plan. Then, click Send to start the dialogue.

FAQ

How do I switch to another open source foundation model?

EAS allows you to use the following open source foundation models: Qwen, Llama2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B. Perform the following steps to switch the deployed service to one of these models.

  1. On the EAS-Online Model Services page, find the service that you want to update. Click Update Service in the Actions column of the service.

  2. On the Deploy Service page, update the Command to Run and Instance Type parameters based on the following table. Then, click Deploy.

    Qwen-1.8B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-1_8B-Chat
      Recommended specification: 1 * GU30, 1 * NVIDIA A10, 1 * NVIDIA T4, or 1 * NVIDIA V100

    Qwen-7B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat
      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    Qwen-14B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-14B-Chat
      Recommended specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    Qwen-72B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-72B-Chat
      Recommended specification: 8 * NVIDIA V100 (gn6e)

    Llama2-7B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf
      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    Llama2-13B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16
      Recommended specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    ChatGLM2-6B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm2-6b
      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    ChatGLM3-6B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm3-6b
      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    Baichuan-13B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan-13B-Chat
      Recommended specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    Baichuan2-7B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat
      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    Baichuan2-13B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat
      Recommended specification: 1 * NVIDIA V100 (gn6e), 2 * GU30, or 2 * NVIDIA A10

    Yi-6B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=01-ai/Yi-6B
      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    Mistral-7B
      Command to Run: python webui/webui_server.py --model-path=mistralai/Mistral-7B-Instruct-v0.1
      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

    Falcon-7B
      Command to Run: python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct
      Recommended specification: 1 * GU30 or 1 * NVIDIA A10

How do I use LangChain to integrate my business data?

  • What is LangChain:

    LangChain is an open source framework that allows AI developers to integrate LLMs like GPT-4 with external data to improve performance and optimize resource utilization.

  • How does LangChain work:

    LangChain divides a document, such as a 20-page PDF file, into smaller chunks and embeds the chunks into a vector store.

    LangChain processes the uploaded data and stores it locally as the knowledge base of the LLM. During each inference, LangChain retrieves the content that is most similar to the input question from the local knowledge base, and then sends the retrieved content together with the user input to the LLM to generate a customized answer. A minimal sketch of this retrieve-then-generate flow is provided after this list.

  • How to configure LangChain:

    1. On the web UI page, click LangChain to go to the LangChain tab.

    2. Upload custom data in the lower-left corner of the web UI page. You can upload files in the TXT, MD, DOCX, and PDF formats.

      For example, you can drag and drop a README.md file to upload it and click Vectorstore knowledge in the lower-left corner. After the vector store is built, the web UI indicates that the custom data is loaded.

    3. In the input box at the bottom of the web UI page, enter a sentence to start a dialogue.

    4. For example, enter how to install deepspeed in the input box and click Send.

    Note

    After you use LangChain to integrate business data on the web UI page, you can perform model inference with the data by using API operations. You can also perform vector search in an on-premises knowledge base. For more information, see Use PAI and a vector database to implement intelligent dialogue based on LLMs.
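
    The following minimal Python sketch illustrates this retrieve-then-generate flow. It is a conceptual illustration only: it does not call the LangChain library or the EAS service, and the chunking logic, the word-overlap similarity function, and the README.md file name are assumptions made to keep the example self-contained.

      # Conceptual sketch: chunk a document, retrieve the chunks that are most relevant
      # to a question, and prepend them to the prompt that is sent to the LLM.

      def split_into_chunks(text, chunk_size=500):
          """Split a document into fixed-size chunks (the local knowledge base)."""
          return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

      def similarity(question, chunk):
          """Toy similarity score based on word overlap. A real setup uses embeddings."""
          q_words, c_words = set(question.lower().split()), set(chunk.lower().split())
          return len(q_words & c_words) / (len(q_words) or 1)

      def retrieve(question, chunks, top_k=3):
          """Return the top_k chunks that are most similar to the question."""
          return sorted(chunks, key=lambda c: similarity(question, c), reverse=True)[:top_k]

      def build_prompt(question, context_chunks):
          """Combine the retrieved context and the user question into a single prompt."""
          context = "\n".join(context_chunks)
          return f"Answer the question based on the following context.\n{context}\n\nQuestion: {question}"

      # Usage: build the local knowledge base from the uploaded file and augment a question.
      document = open("README.md", encoding="utf-8").read()
      knowledge_base = split_into_chunks(document)
      question = "how to install deepspeed"
      prompt = build_prompt(question, retrieve(question, knowledge_base))
      # Send `prompt` to the deployed EAS service, for example with the HTTP client
      # shown in the next FAQ, to obtain an answer that is grounded in the uploaded data.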

How do I improve concurrency and reduce latency for the inference service?

EAS provides BladeLLM and vLLM inference acceleration engines to ensure high concurrency and low latency for the inference service. Perform the following steps:

  1. On the EAS-Online Model Services page, find the service that you want to update. Click Update Service in the Actions column of the service.

  2. In the Model Service Information section, add --backend=vllm to the Command to Run field and click Deploy. An example of the updated command is provided after this procedure.

    Important

    The inference acceleration engine supports only the following model types: Qwen, Llama2, Baichuan-13B, and Baichuan2-13B.


  3. Update versions of Transformers and vLLM.

    As new models are released, previously released and newly released models may depend on incompatible versions of toolkits such as Transformers and vLLM. To resolve such incompatibility issues, we recommend that you upgrade Transformers and vLLM based on your business requirements. Specify the required toolkit versions in the Third-party Library Settings section.
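
    For example, if the service runs the default Qwen-7B command that is used earlier in this topic, the updated command looks like the following. Adjust the model path to match your own service.

      python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat --backend=vllm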

How do I mount a custom model?

You can use Object Storage Service (OSS) to mount a custom model. Procedure:

  1. Upload the model and related configuration files to your OSS bucket. For more information about how to create a bucket and upload objects, see Create buckets and Upload objects.


    The config.json file is required. Configure the config.json file based on the Hugging Face model format. For more information about the sample file, see config.json. An illustrative sketch of such a file is also provided after the command table below.

  2. Click Update Service in the Actions column of the service.

  3. In the Model Service Information section, specify the required parameters and click Deploy.

    Model Settings: Click Specify Model Settings to configure the model.

    • Select Mount OSS Path in Model Settings and set the OSS path to the path where the custom model files reside. Example: oss://bucket-test/data-oss/.

    • Set Mount Path to /data.

    • Turn off Enable Read-only Mode.

    Command to Run: Add the following parameters to Command to Run:

    • --model-path: Set this parameter to /data, which is the mount path.

    • --model-type: the model type of the custom model.

    For the commands for different model types, see the following Commands to run table.

    Command to run

    Model type

    Command to Run

    llama2

    python webui/webui_server.py --port=8000 --model-path=/data --model-type=llama2

    chatglm2

    python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm

    qwen (Tongyi Qianwen)

    python webui/webui_server.py --port=8000 --model-path=/data --model-type=qwen

    chatglm

    python webui/webui_server.py --port=8000 --model-path=/data --model-type=chatglm

    falcon-7B

    python webui/webui_server.py --port=8000 --model-path=/data --model-type=falcon
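
    As a rough illustration of the Hugging Face model format that the config.json file must follow, the configuration of a Llama-2-7B-style model contains fields similar to the following. This is only a sketch: the exact fields and values depend on your model, so use the config.json file that is shipped with your model whenever possible.

      {
          "architectures": ["LlamaForCausalLM"],
          "model_type": "llama",
          "hidden_size": 4096,
          "intermediate_size": 11008,
          "num_attention_heads": 32,
          "num_hidden_layers": 32,
          "vocab_size": 32000,
          "max_position_embeddings": 4096,
          "torch_dtype": "float16"
      }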

How do I use API operations to perform model inference?

  1. Obtain the service access endpoint and token.

    1. Go to the EAS-Online Model Services page. For more information, see the Deploy model service in EAS section in this topic.

    2. Click the name of the service to go to the Service Details tab.

    3. In the Basic Information section, click Invocation Method. On the Public Endpoint tab, obtain the service token and endpoint.

  2. Perform model inference by calling API operations.

    Call the service by using HTTP

    • Non-streaming mode

      The client sends the following types of standard HTTP requests when you run cURL commands.

      • STRING requests

        curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v

        Replace $authorization with the service token. Replace $host with the service endpoint. The file chatllm_data.txt is a plain text file that contains the prompt.
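
        For example, the chatllm_data.txt file can contain a single plain-text prompt such as the following. The prompt itself is only an illustration; any plain text works.

          Please provide a financial learning plan.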

      • Structured requests

        curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"

        Use the chatllm_data.json file to configure inference parameters. The following code provides an example of the content format of the chatllm_data.json file:

        {
            "max_new_tokens": 4096,
            "use_stream_chat": false,
            "prompt": "How to install it?",
            "system_prompt": "Act like you are programmer with 5+ years of experience.",
            "history": [
                [
                    "Can you tell me what's the bladellm?",
                    "BladeLLM is a framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc., and supporting popular LLMs like OPT, Bloom, LLaMA, etc."
                ]
            ],
            "temperature": 0.8,
            "top_k": 10,
            "top_p": 0.8,
            "do_sample": true,
            "use_cache": true
        }

        The following list describes the parameters. Configure them based on your business requirements.

        • max_new_tokens: The maximum number of output tokens. Default value: 2048.

        • use_stream_chat: Specifies whether to return the output tokens in streaming mode. Default value: True.

        • prompt: The user prompt. Default value: "".

        • system_prompt: The system prompt. Default value: "".

        • history: The dialogue history. The value is of the List[Tuple(str, str)] type. Default value: [()].

        • temperature: The randomness of the model output. A larger value indicates higher randomness. A value of 0 indicates a fixed output. The value is of the Float type and ranges from 0 to 1. Default value: 0.95.

        • top_k: The number of outputs selected from the generated results. Default value: 30.

        • top_p: The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. Default value: 0.8.

        • do_sample: Specifies whether to enable output sampling. Default value: True.

        • use_cache: Specifies whether to enable the KV cache. Default value: True.
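
      The server returns a JSON object that contains at least the inference result and the updated dialogue history, which the sample clients below read as data["response"] and data["history"]. An illustrative (not literal) response looks like the following:

        {
            "response": "You can install it by running pip install deepspeed.",
            "history": [
                [
                    "How to install it?",
                    "You can install it by running pip install deepspeed."
                ]
            ]
        }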

      You can also implement your own client based on the Python requests package. Example:

      import argparse
      import json
      from typing import Iterable, List
      
      import requests
      
      def post_http_request(prompt: str,
                            system_prompt: str,
                            history: list,
                            host: str,
                            authorization: str,
                            max_new_tokens: int = 2048,
                            temperature: float = 0.95,
                            top_k: int = 1,
                            top_p: float = 0.8,
                            langchain: bool = False,
                            use_stream_chat: bool = False) -> requests.Response:
          headers = {
              "User-Agent": "Test Client",
              "Authorization": f"{authorization}"
          }
          if not history:
              history = [
                  (
                      "San Francisco is a",
                      "city located in the state of California in the United States. \
                      It is known for its iconic landmarks, such as the Golden Gate Bridge \
                      and Alcatraz Island, as well as its vibrant culture, diverse population, \
                      and tech industry. The city is also home to many famous companies and \
                      startups, including Google, Apple, and Twitter."
                  )
              ]
          pload = {
              "prompt": prompt,
              "system_prompt": system_prompt,
              "top_k": top_k,
              "top_p": top_p,
              "temperature": temperature,
              "max_new_tokens": max_new_tokens,
              "use_stream_chat": use_stream_chat,
              "history": history
          }
          if langchain:
              pload["langchain"] = langchain
          response = requests.post(host, headers=headers,
                                   json=pload, stream=use_stream_chat)
          return response
      
      # The server returns a JSON response that includes the inference result and the dialogue history.
      def get_response(response: requests.Response) -> List[str]:
          data = json.loads(response.content)
          output = data["response"]
          history = data["history"]
          return output, history
      
      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--top-k", type=int, default=4)
          parser.add_argument("--top-p", type=float, default=0.8)
          parser.add_argument("--max-new-tokens", type=int, default=2048)
          parser.add_argument("--temperature", type=float, default=0.95)
          parser.add_argument("--prompt", type=str, default="How can I get there?")
          parser.add_argument("--langchain", action="store_true")
      
          args = parser.parse_args()
      
          prompt = args.prompt
          top_k = args.top_k
          top_p = args.top_p
          use_stream_chat = False
          temperature = args.temperature
          langchain = args.langchain
          max_new_tokens = args.max_new_tokens
      
          host = "<EAS service public endpoint>"
          authorization = "<EAS service public token>"
      
          print(f"Prompt: {prompt!r}\n", flush=True)
          # System prompts can be included in the requests.
          system_prompt = "Act like you are programmer with \
                      5+ years of experience."
      
          # Dialogue history can be included in the requests. The client manages the history to implement multi-round dialogues. In most cases, the previous round of dialogue information is used. The information is in the List[Tuple(str, str)] format.
          history = []
          response = post_http_request(
              prompt, system_prompt, history,
              host, authorization,
              max_new_tokens, temperature, top_k, top_p,
              langchain=langchain, use_stream_chat=use_stream_chat)
          output, history = get_response(response)
          print(f" --- output: {output} \n --- history: {history}", flush=True)
      

      In the preceding code:

      • Set host to the service access endpoint.

      • Set authorization to the service token.
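
      If you save the preceding code to a file such as client.py (a hypothetical file name), you can run it as follows after you set host and authorization in the code:

        python client.py --prompt "Please provide a financial learning plan." --max-new-tokens 1024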

    • Streaming mode

      The streaming mode uses the HTTP SSE method. Sample code:

      import argparse
      import json
      from typing import Iterable, List
      
      import requests
      
      
      def clear_line(n: int = 1) -> None:
          LINE_UP = '\033[1A'
          LINE_CLEAR = '\x1b[2K'
          for _ in range(n):
              print(LINE_UP, end=LINE_CLEAR, flush=True)
      
      
      def post_http_request(prompt: str,
                            system_prompt: str,
                            history: list,
                            host: str,
                            authorization: str,
                            max_new_tokens: int = 2048,
                            temperature: float = 0.95,
                            top_k: int = 1,
                            top_p: float = 0.8,
                            langchain: bool = False,
                            use_stream_chat: bool = False) -> requests.Response:
          headers = {
              "User-Agent": "Test Client",
              "Authorization": f"{authorization}"
          }
          if not history:
              history = [
                  (
                      "San Francisco is a",
                      "city located in the state of California in the United States. \
                      It is known for its iconic landmarks, such as the Golden Gate Bridge \
                      and Alcatraz Island, as well as its vibrant culture, diverse population, \
                      and tech industry. The city is also home to many famous companies and \
                      startups, including Google, Apple, and Twitter."
                  )
              ]
          pload = {
              "prompt": prompt,
              "system_prompt": system_prompt,
              "top_k": top_k,
              "top_p": top_p,
              "temperature": temperature,
              "max_new_tokens": max_new_tokens,
              "use_stream_chat": use_stream_chat,
              "history": history
          }
          if langchain:
              pload["langchain"] = langchain
          response = requests.post(host, headers=headers,
                                   json=pload, stream=use_stream_chat)
          return response
      
      
      def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
          for chunk in response.iter_lines(chunk_size=8192,
                                           decode_unicode=False,
                                           delimiter=b"\0"):
              if chunk:
                  data = json.loads(chunk.decode("utf-8"))
                  output = data["response"]
                  history = data["history"]
                  yield output, history
      
      
      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--top-k", type=int, default=4)
          parser.add_argument("--top-p", type=float, default=0.8)
          parser.add_argument("--max-new-tokens", type=int, default=2048)
          parser.add_argument("--temperature", type=float, default=0.95)
          parser.add_argument("--prompt", type=str, default="How can I get there?")
          parser.add_argument("--langchain", action="store_true")
          args = parser.parse_args()
      
          prompt = args.prompt
          top_k = args.top_k
          top_p = args.top_p
          use_stream_chat = True
          temperature = args.temperature
          langchain = args.langchain
          max_new_tokens = args.max_new_tokens
      
          host = ""
          authorization = ""
      
          print(f"Prompt: {prompt!r}\n", flush=True)
          system_prompt = "Act like you are programmer with \
                      5+ years of experience."
          history = []
          response = post_http_request(
              prompt, system_prompt, history,
              host, authorization,
              max_new_tokens, temperature, top_k, top_p,
              langchain=langchain, use_stream_chat=use_stream_chat)
      
          for h, history in get_streaming_response(response):
              print(
                  f" --- stream line: {h} \n --- history: {history}", flush=True)
      

      In the preceding code:

      • Set host to the service access endpoint.

      • Set authorization to the service token.

    Call the service by using WebSocket

    The WebSocket protocol is more efficient for handling the conversation history. You can use the WebSocket method to connect to the service and perform one or multiple rounds of conversation. Sample code:

    import os
    import time
    import json
    import struct
    from multiprocessing import Process
    
    import websocket
    
    round = 5
    questions = 0
    
    
    def on_message_1(ws, message):
        if message == "<EOS>":
            print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
        else:
            print("{}".format(time.time()))
            print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
    
    
    def on_message_2(ws, message):
        global questions
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        if message == "<EOS>":
            questions = questions + 1
            if questions == 5:
                ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_message_3(ws, message):
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_error(ws, error):
        print('error happened: ', str(error))
    
    
    def on_close(ws, a, b):
        print("### closed ###", a, b)
    
    
    def on_pong(ws, pong):
        print('pong:', pong)
    
    # stream chat validation test
    def on_open_1(ws):
        print('Opening Websocket connection to the server ... ')
        params_dict = {}
        params_dict['prompt'] = """Show me a golang code example: """
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['do_sample'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        # raw_req = f"""To open a Websocket connection to the server: """
    
        ws.send(raw_req)
        # end the client-side streaming
    
    
    # multi-round query validation test
    def on_open_2(ws):
        global round
        print('Opening Websocket connection to the server ... ')
        params_dict = {"max_new_tokens": 6144}
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['use_stream_chat'] = True
        params_dict['prompt'] = "Hello! "
        params_dict['system_prompt'] = "Act like you are programmer with 5+ years of experience."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please write a sorting algorithm in Python."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please convert to the Java implementation."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please introduce yourself?"
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please summarize the dialogue above."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    # Langchain validation test.
    def on_open_3(ws):
        global round
        print('Opening Websocket connection to the server ... ')
    
        params_dict = {}
        # params_dict['prompt'] = """To open a Websocket connection to the server: """
        params_dict['prompt'] = """Can you tell me what's the MNN?"""
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['use_stream_chat'] = False
        params_dict['langchain'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    authorization = ""
    host = "ws://" + ""
    
    
    def single_call(on_open_func, on_message_func, on_close_func=on_close):
        ws = websocket.WebSocketApp(
            host,
            on_open=on_open_func,
            on_message=on_message_func,
            on_error=on_error,
            on_pong=on_pong,
            on_close=on_close_func,
            header=[
                'Authorization: ' + authorization],
        )
    
        # setup ping interval to keep long connection.
        ws.run_forever(ping_interval=2)
    
    
    if __name__ == "__main__":
        for i in range(5):
            p1 = Process(target=single_call, args=(on_open_1, on_message_1))
            p2 = Process(target=single_call, args=(on_open_2, on_message_2))
            p3 = Process(target=single_call, args=(on_open_3, on_message_3))
    
            p1.start()
            p2.start()
            p3.start()
    
            p1.join()
            p2.join()
            p3.join()

    Parameters:

    • Set authorization to the service token.

    • Set host to the service access endpoint. Replace the http in the endpoint with ws.

    • The use_stream_chat parameter specifies whether the output is returned in streaming mode. The default value is True, which indicates that the server returns streaming data.

    • Refer to the implementation of the on_open_2 function in the preceding sample code to implement multi-round conversations.
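
    For example, a minimal way to derive the WebSocket address from the HTTP endpoint that you obtained earlier is shown below. The endpoint and token values are placeholders, not real values.

      # Placeholders for illustration; use the endpoint and token of your own service.
      http_endpoint = "http://<EAS service public endpoint>"
      authorization = "<EAS service public token>"
      host = http_endpoint.replace("http://", "ws://", 1)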

References