Platform for AI: Quickly deploy LLMs in EAS

Last Updated: Sep 06, 2024

The Elastic Algorithm Service (EAS) module of Platform for AI (PAI) is a model serving platform for online inference scenarios. You can use EAS to deploy a large language model (LLM) with a few clicks and then call the model by using the Web User Interface (WebUI) or API operations. After you deploy an LLM, you can use the LangChain framework to build a Q&A chatbot based on a custom knowledge base. You can also use the inference acceleration engines provided by EAS, such as BladeLLM and vLLM, to ensure high concurrency and low latency. This topic describes how to deploy and call an LLM in EAS and answers some frequently asked questions.

Background information

The application of LLMs, such as the Generative Pre-trained Transformer (GPT) and TongYi Qianwen (Qwen) series of models, has garnered significant attention, especially in inference tasks. You can select from a wide range of open source LLMs based on your business requirements. EAS allows you to quickly deploy mainstream open source LLMs as an inference service with a few clicks. Supported LLMs include Llama 3, Qwen, Llama 2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B.

You can call the models by using the WebUI or API operations. You can also use the LangChain framework to generate custom responses based on your business data.

Prerequisites

Limits

  • The inference acceleration engines provided by EAS support only the following models: Qwen2-7b, Qwen1.5-1.8b, Qwen1.5-7b, Qwen1.5-14b, llama3-8b, llama2-7b, llama2-13b, chatglm3-6b, baichuan2-7b, baichuan2-13b, falcon-7b, yi-6b, mistral-7b-instruct-v0.2, gemma-2b-it, gemma-7b-it, deepseek-coder-7b-instruct-v1.5.

  • The LangChain framework is not supported by the inference acceleration engines.
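The two limits above can be checked in code before you enable inference acceleration. The following is a minimal sketch and not part of EAS: the model names are copied from the list above, and the names `ACCELERATED_MODELS` and `supports_acceleration` are hypothetical helpers introduced for illustration.

```python
# Model types supported by the EAS inference acceleration engines,
# copied from the Limits list above and stored in lowercase.
ACCELERATED_MODELS = frozenset({
    "qwen2-7b", "qwen1.5-1.8b", "qwen1.5-7b", "qwen1.5-14b",
    "llama3-8b", "llama2-7b", "llama2-13b", "chatglm3-6b",
    "baichuan2-7b", "baichuan2-13b", "falcon-7b", "yi-6b",
    "mistral-7b-instruct-v0.2", "gemma-2b-it", "gemma-7b-it",
    "deepseek-coder-7b-instruct-v1.5",
})


def supports_acceleration(model_type: str, use_langchain: bool = False) -> bool:
    """Return True if inference acceleration can be enabled.

    Assumption: names are compared case-insensitively, because this topic
    writes the same model as both "Qwen1.5-7b" and "qwen1.5-7b"-style names.
    """
    if use_langchain:
        # LangChain is not supported by the inference acceleration engines.
        return False
    return model_type.strip().lower() in ACCELERATED_MODELS
```

For example, `supports_acceleration("Qwen1.5-7b")` returns True, and the same call with `use_langchain=True` returns False.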

Deploy an LLM in EAS

  1. Go to the EAS-Online Model Services page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which you want to deploy the model.

    3. In the left-side navigation pane, choose Model Deployment>Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.

  2. On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section, select LLM Deployment.

  3. On the LLM Deployment page, configure the following parameters. Use the default values for other parameters.

    | Parameter | Description |
    | --- | --- |
    | Service Name | Specify a name for the service. In this example, the service is named llm_demo001. |
    | Model Source | Select Open Source Model. |
    | Model Type | The model that you want to deploy. In this example, Qwen1.5-7b is used. EAS provides various model types, such as chatglm3-6b and llama2-13b. You can select a model based on your requirements. |
    | Resource Configuration | After you select a model, the system recommends appropriate resource configurations. Select a configuration from the drop-down list. Note: If resources in the current region are insufficient, you can deploy the service in the Singapore region or select other recommended configurations. |
    | Inference Acceleration | Specifies whether to enable inference acceleration. In this example, Not Accelerated is used. |

  4. Click Deploy. The model deployment requires approximately five minutes.

Use WebUI to perform inference

  1. Find the deployed service and click View Web App in the Service Type column.

  2. Test the inference performance on the WebUI page.

    Enter a sentence in the input text box and click Send to start a conversation. Sample input: Provide a learning plan for personal finance.

  3. Use LangChain to integrate your business data.

    1. On the WebUI page of the service that you deployed, click the LangChain tab.

    2. In the lower-left corner of the ChatLLM-LangChain-WebUI page, follow the on-screen instructions to upload a knowledge base. You can upload files in the following formats: TXT, Markdown, DOCX, and PDF.

      For example, you can upload a README.md file and click Vectorstore knowledge. The following result indicates that the data in the file is loaded. [Figure: result after the file is loaded]

    3. Enter a question about the data you uploaded in the input text box and click Send to start a conversation.

      Sample input: How to install deepspeed.

FAQ

How do I improve concurrency and reduce latency for the inference service?

EAS provides BladeLLM and vLLM, which are inference acceleration engines that you can use to ensure high concurrency and low latency. To use the inference acceleration engines, perform the following steps:

  1. On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update Service in the Actions column.

  2. In the Resource Configuration section, modify the Inference Acceleration parameter. Valid values:

    • BladeLLM Inference Acceleration

    • Open-source vLLM Inference Acceleration

  3. Click Deploy.

How do I mount a custom model?

You can use Object Storage Service (OSS) to mount a custom model. Procedure:

  1. Upload the model and related configuration files to your OSS bucket. For information about how to create a bucket and upload objects, see Create buckets and Upload objects.

    The following figure shows a sample of the model files that you need to prepare: [Figure: sample model files]

    The config.json file must be uploaded. You must configure the config.json file based on the Hugging Face model format. For more information about the sample file, see config.json.

  2. On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update Service in the Actions column.

  3. On the LLM Deployment page, specify the following parameters and click Deploy.

    | Section | Parameter | Description |
    | --- | --- | --- |
    | Basic Information | Model Source | Select Custom Fine-tuned Model. |
    | Basic Information | Model Type | Select Model, Parameter Quantity, and Precision based on your model. |
    | Basic Information | Model Settings | Select Mount OSS for Type and specify the OSS path where the model file is stored. |
    | Resource Configuration | Resource Configuration | After you specify the Model Type parameter, the system automatically configures the Resource Configuration parameter. You can also specify Resource Configuration based on the parameter quantity of your model. |
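Because the config.json file is required (step 1 of the procedure above), it can help to verify the local model directory before you upload it to OSS. The following is a minimal sketch that only checks that config.json exists and parses as a JSON object. It does not validate individual fields, and the function name `check_model_dir` is hypothetical.

```python
import json
from pathlib import Path


def check_model_dir(model_dir: str) -> dict:
    """Return the parsed config.json of a local model directory.

    Raises FileNotFoundError if config.json is missing, and ValueError if
    the file is not valid JSON or does not contain a JSON object.
    """
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"config.json not found in {model_dir}")
    with config_path.open(encoding="utf-8") as f:
        # json.JSONDecodeError is a subclass of ValueError.
        config = json.load(f)
    if not isinstance(config, dict):
        raise ValueError("config.json must contain a JSON object")
    return config
```

Calling `check_model_dir("/path/to/model")` before the upload surfaces a missing or malformed file locally instead of at deployment time.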

How do I call API operations to perform inference?

  1. Obtain the service endpoint and token.

    1. Go to the Elastic Algorithm Service (EAS) page. For more information, see the Deploy an LLM in EAS section of this topic.

    2. Click the name of the service to go to the Service Details tab.

    3. In the Basic Information section, click Invocation Method. On the Public Endpoint tab of the dialog box that appears, obtain the token and endpoint.

  2. To call API operations to perform inference, use one of the following methods:

    Use HTTP

    • Non-streaming mode

      The client sends the following types of standard HTTP requests when cURL commands are run.

      • STRING requests

        curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v

        Replace $authorization with the token. Replace $host with the endpoint. The chatllm_data.txt file is a plain text file that contains the prompt.

      • Structured requests

        curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"

        Use the chatllm_data.json file to configure inference parameters. The following sample code provides a format example of the chatllm_data.json file:

        {
          "max_new_tokens": 4096,
          "use_stream_chat": false,
          "prompt": "How to install it?",
          "system_prompt": "Act like you are a programmer with 5+ years of experience.",
          "history": [
            [
              "Can you tell me what's the bladellm?",
              "BladeLLM is a framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc., and supporting popular LLMs like OPT, Bloom, LLaMA, etc."
            ]
          ],
          "temperature": 0.8,
          "top_k": 10,
          "top_p": 0.8,
          "do_sample": true,
          "use_cache": true
        }

        The following table describes the parameters in the preceding code. Configure the parameters based on your business requirements.

        | Parameter | Description | Default value |
        | --- | --- | --- |
        | max_new_tokens | The maximum number of output tokens. | 2048 |
        | use_stream_chat | Specifies whether to return the output tokens in streaming mode. | true |
        | prompt | The user prompt. | "" |
        | system_prompt | The system prompt. | "" |
        | history | The dialogue history. The value is in the List[Tuple(str, str)] format. | [()] |
        | temperature | The randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. | 0.95 |
        | top_k | The number of highest-probability candidate tokens from which the output is sampled. | 30 |
        | top_p | The cumulative probability threshold of the candidate tokens from which the output is sampled. The value is of the Float type and ranges from 0 to 1. | 0.8 |
        | do_sample | Specifies whether to enable output sampling. | true |
        | use_cache | Specifies whether to enable KV cache. | true |
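The parameter table above can be turned into a small helper that fills in the documented defaults and rejects out-of-range values before a request is sent. This is a sketch under stated assumptions rather than the service's own validation: the helper name `build_payload` is hypothetical, an empty list stands in for the documented `[()]` history default, and the check that max_new_tokens and top_k are positive is an assumption.

```python
# Defaults taken from the parameter table above. An empty list is used
# for "history" in place of the documented "[()]" (assumption).
DEFAULTS = {
    "max_new_tokens": 2048,
    "use_stream_chat": True,
    "prompt": "",
    "system_prompt": "",
    "history": [],
    "temperature": 0.95,
    "top_k": 30,
    "top_p": 0.8,
    "do_sample": True,
    "use_cache": True,
}


def build_payload(prompt: str, **overrides) -> dict:
    """Merge overrides into the documented defaults and validate ranges."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {**DEFAULTS, **overrides, "prompt": prompt}
    # temperature and top_p are documented as floats from 0 to 1.
    for name in ("temperature", "top_p"):
        if not 0 <= payload[name] <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    # Assumption: both counts must be positive integers.
    if payload["max_new_tokens"] <= 0 or payload["top_k"] <= 0:
        raise ValueError("max_new_tokens and top_k must be positive")
    # Each history entry is a (question, answer) pair; JSON stores it as a list.
    payload["history"] = [list(pair) for pair in payload["history"]]
    return payload
```

To produce a chatllm_data.json file for the cURL command, the result of `build_payload("How to install it?", use_stream_chat=False)` can be written to disk with `json.dump`.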

      You can also implement your own client based on the Python requests package. Example:

      import argparse
      import json
      from typing import Iterable, List
      
      import requests
      
      def post_http_request(prompt: str,
                            system_prompt: str,
                            history: list,
                            host: str,
                            authorization: str,
                            max_new_tokens: int = 2048,
                            temperature: float = 0.95,
                            top_k: int = 1,
                            top_p: float = 0.8,
                            langchain: bool = False,
                            use_stream_chat: bool = False) -> requests.Response:
          headers = {
              "User-Agent": "Test Client",
              "Authorization": f"{authorization}"
          }
          if not history:
              # Sample dialogue history that is used when the caller passes none.
              # Adjacent string literals are joined without extra whitespace.
              history = [
                  (
                      "San Francisco is a",
                      "city located in the state of California in the United States. "
                      "It is known for its iconic landmarks, such as the Golden Gate Bridge "
                      "and Alcatraz Island, as well as its vibrant culture, diverse population, "
                      "and tech industry. The city is also home to many famous companies and "
                      "startups, including Google, Apple, and Twitter."
                  )
              ]
          pload = {
              "prompt": prompt,
              "system_prompt": system_prompt,
              "top_k": top_k,
              "top_p": top_p,
              "temperature": temperature,
              "max_new_tokens": max_new_tokens,
              "use_stream_chat": use_stream_chat,
              "history": history
          }
          if langchain:
              pload["langchain"] = langchain
          response = requests.post(host, headers=headers,
                                   json=pload, stream=use_stream_chat)
          return response
      
      # The server returns a JSON response that includes the inference result and dialogue history.
      def get_response(response: requests.Response) -> tuple:
          data = json.loads(response.content)
          output = data["response"]
          history = data["history"]
          return output, history
      
      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--top-k", type=int, default=4)
          parser.add_argument("--top-p", type=float, default=0.8)
          parser.add_argument("--max-new-tokens", type=int, default=2048)
          parser.add_argument("--temperature", type=float, default=0.95)
          parser.add_argument("--prompt", type=str, default="How can I get there?")
          parser.add_argument("--langchain", action="store_true")
      
          args = parser.parse_args()
      
          prompt = args.prompt
          top_k = args.top_k
          top_p = args.top_p
          use_stream_chat = False
          temperature = args.temperature
          langchain = args.langchain
          max_new_tokens = args.max_new_tokens
      
          host = "<EAS service public endpoint>"
          authorization = "<EAS service public token>"
      
          print(f"Prompt: {prompt!r}\n", flush=True)
          # System prompts can be included in the requests. 
          system_prompt = "Act like you are a programmer with 5+ years of experience."
      
          # Dialogue history can be included in the requests. The client manages the history to implement multi-round dialogues. In most cases, the information from the previous round of dialogue is used. The information is in the List[Tuple(str, str)] format. 
          history = []
          response = post_http_request(
              prompt, system_prompt, history,
              host, authorization,
              max_new_tokens, temperature, top_k, top_p,
              langchain=langchain, use_stream_chat=use_stream_chat)
          output, history = get_response(response)
          print(f" --- output: {output} \n --- history: {history}", flush=True)
      

      Take note of the following parameters:

      • Set the host parameter to the service endpoint.

      • Set the authorization parameter to the service token.
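As the comment in the preceding script notes, the client manages the dialogue history to implement multi-round dialogues. The following sketch shows one way to do that. It is an illustration under assumptions, not the service's client: the `Conversation` class and its `send` callable are hypothetical, the history is appended locally instead of being read from the "history" field of the response, and the limit of five rounds is an assumption.

```python
import json
from typing import Callable, List, Tuple


class Conversation:
    """Keeps the dialogue history on the client side between requests."""

    def __init__(self, send: Callable[[dict], str], system_prompt: str = "",
                 max_rounds: int = 5):
        if max_rounds < 1:
            raise ValueError("max_rounds must be at least 1")
        self._send = send  # posts a payload and returns the JSON body as text
        self._system_prompt = system_prompt
        self._max_rounds = max_rounds
        self.history: List[Tuple[str, str]] = []

    def ask(self, prompt: str) -> str:
        payload = {
            "prompt": prompt,
            "system_prompt": self._system_prompt,
            # Send only the most recent rounds to bound the request size.
            "history": self.history[-self._max_rounds:],
            "use_stream_chat": False,
        }
        data = json.loads(self._send(payload))
        # Record the finished round so that the next request includes it.
        self.history.append((prompt, data["response"]))
        return data["response"]
```

With the requests package, `send` could be a function that posts the payload to the service endpoint with the Authorization header and returns `response.text`. Injecting it as a callable keeps the history logic testable without a deployed service.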

    • Streaming mode

      In streaming mode, the HTTP SSE method is used. Sample code:

      import argparse
      import json
      from typing import Iterable, List
      
      import requests
      
      
      def clear_line(n: int = 1) -> None:
          LINE_UP = '\033[1A'
          LINE_CLEAR = '\x1b[2K'
          for _ in range(n):
              print(LINE_UP, end=LINE_CLEAR, flush=True)
      
      
      def post_http_request(prompt: str,
                            system_prompt: str,
                            history: list,
                            host: str,
                            authorization: str,
                            max_new_tokens: int = 2048,
                            temperature: float = 0.95,
                            top_k: int = 1,
                            top_p: float = 0.8,
                            langchain: bool = False,
                            use_stream_chat: bool = False) -> requests.Response:
          headers = {
              "User-Agent": "Test Client",
              "Authorization": f"{authorization}"
          }
          if not history:
              # Sample dialogue history that is used when the caller passes none.
              # Adjacent string literals are joined without extra whitespace.
              history = [
                  (
                      "San Francisco is a",
                      "city located in the state of California in the United States. "
                      "It is known for its iconic landmarks, such as the Golden Gate Bridge "
                      "and Alcatraz Island, as well as its vibrant culture, diverse population, "
                      "and tech industry. The city is also home to many famous companies and "
                      "startups, including Google, Apple, and Twitter."
                  )
              ]
          pload = {
              "prompt": prompt,
              "system_prompt": system_prompt,
              "top_k": top_k,
              "top_p": top_p,
              "temperature": temperature,
              "max_new_tokens": max_new_tokens,
              "use_stream_chat": use_stream_chat,
              "history": history
          }
          if langchain:
              pload["langchain"] = langchain
          response = requests.post(host, headers=headers,
                                   json=pload, stream=use_stream_chat)
          return response
      
      
      def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
          for chunk in response.iter_lines(chunk_size=8192,
                                           decode_unicode=False,
                                           delimiter=b"\0"):
              if chunk:
                  data = json.loads(chunk.decode("utf-8"))
                  output = data["response"]
                  history = data["history"]
                  yield output, history
      
      
      if __name__ == "__main__":
          parser = argparse.ArgumentParser()
          parser.add_argument("--top-k", type=int, default=4)
          parser.add_argument("--top-p", type=float, default=0.8)
          parser.add_argument("--max-new-tokens", type=int, default=2048)
          parser.add_argument("--temperature", type=float, default=0.95)
          parser.add_argument("--prompt", type=str, default="How can I get there?")
          parser.add_argument("--langchain", action="store_true")
          args = parser.parse_args()
      
          prompt = args.prompt
          top_k = args.top_k
          top_p = args.top_p
          use_stream_chat = True
          temperature = args.temperature
          langchain = args.langchain
          max_new_tokens = args.max_new_tokens
      
          host = "<EAS service public endpoint>"
          authorization = "<EAS service public token>"
      
          print(f"Prompt: {prompt!r}\n", flush=True)
          system_prompt = "Act like you are a programmer with 5+ years of experience."
          history = []
          response = post_http_request(
              prompt, system_prompt, history,
              host, authorization,
              max_new_tokens, temperature, top_k, top_p,
              langchain=langchain, use_stream_chat=use_stream_chat)
      
          for h, history in get_streaming_response(response):
              print(
                  f" --- stream line: {h} \n --- history: {history}", flush=True)
      

      Take note of the following parameters:

      • Set the host parameter to the service endpoint.

      • Set the authorization parameter to the service token.
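The streaming sample above splits the response body on a null byte and parses each piece as JSON. The same framing can be sketched without the network so that it can be tested against raw byte chunks. The function name `parse_stream` is hypothetical, and the message layout (a JSON object with "response" and "history" keys, followed by a null byte) is inferred from the sample code rather than from a protocol specification.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_stream(chunks: Iterable[bytes],
                 delimiter: bytes = b"\0") -> Iterator[Tuple[str, list]]:
    """Yield (response, history) for each complete delimited JSON message."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A network chunk may hold several messages or only part of one:
        # everything before the last delimiter is complete, the rest is kept.
        *complete, buffer = buffer.split(delimiter)
        for message in complete:
            if message:
                data = json.loads(message.decode("utf-8"))
                yield data["response"], data["history"]
    if buffer.strip():
        # Trailing message that was not followed by a delimiter.
        data = json.loads(buffer.decode("utf-8"))
        yield data["response"], data["history"]
```

Decoding happens only after a full message is assembled, so a multi-byte UTF-8 character that is split across two chunks is still decoded correctly. With the requests package, the input could be `response.iter_content(chunk_size=8192)`.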

    Use WebSocket

    The WebSocket protocol can efficiently handle the dialogue history. You can use the WebSocket method to connect to the service and perform one or more rounds of dialogue. Sample code:

    import os
    import time
    import json
    import struct
    from multiprocessing import Process
    
    import websocket
    
    round = 5
    questions = 0
    
    
    def on_message_1(ws, message):
        if message == "<EOS>":
            print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
        else:
            print("{}".format(time.time()))
            print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(),
                  time.time(), message), flush=True)
    
    
    def on_message_2(ws, message):
        global questions
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        if message == "<EOS>":
            questions = questions + 1
            if questions == 5:
                ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_message_3(ws, message):
        print('pid-{} --- message received: {}'.format(os.getpid(), message))
        # end the client-side streaming
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    
    
    def on_error(ws, error):
        print('error happened: ', str(error))
    
    
    def on_close(ws, a, b):
        print("### closed ###", a, b)
    
    
    def on_pong(ws, pong):
        print('pong:', pong)
    
    # stream chat validation test
    def on_open_1(ws):
        print('Opening Websocket connection to the server ... ')
        params_dict = {}
        params_dict['prompt'] = """Show me a golang code example: """
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['do_sample'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        # raw_req = f"""To open a Websocket connection to the server: """
    
        ws.send(raw_req)
        # end the client-side streaming
    
    
    # multi-round query validation test
    def on_open_2(ws):
        global round
        print('Opening Websocket connection to the server ... ')
        params_dict = {"max_new_tokens": 6144}
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['use_stream_chat'] = True
        params_dict['prompt'] = "Hello! "
        # Add the system prompt to the existing parameters. Reassigning
        # params_dict here would discard the settings and the prompt above.
        params_dict['system_prompt'] = (
            "Act like you are programmer with 5+ years of experience."
        )
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please write a sorting algorithm in Python."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please convert the programming language to Java."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please introduce yourself."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
        params_dict['prompt'] = "Please summarize the dialogue above."
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    # Langchain validation test.
    def on_open_3(ws):
        global round
        print('Opening Websocket connection to the server ... ')
    
        params_dict = {}
        # params_dict['prompt'] = """To open a Websocket connection to the server: """
        params_dict['prompt'] = """Can you tell me what's the MNN?"""
        params_dict['temperature'] = 0.9
        params_dict['top_p'] = 0.1
        params_dict['top_k'] = 30
        params_dict['max_new_tokens'] = 2048
        params_dict['use_stream_chat'] = False
        params_dict['langchain'] = True
        raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
        ws.send(raw_req)
    
    
    authorization = "<EAS service public token>"
    host = "ws://" + "<EAS service public endpoint without the http:// prefix>"
    
    
    def single_call(on_open_func, on_message_func, on_close_func=on_close):
        ws = websocket.WebSocketApp(
            host,
            on_open=on_open_func,
            on_message=on_message_func,
            on_error=on_error,
            on_pong=on_pong,
            on_close=on_close_func,
            header=[
                'Authorization: ' + authorization],
        )
    
        # setup ping interval to keep long connection.
        ws.run_forever(ping_interval=2)
    
    
    if __name__ == "__main__":
        for i in range(5):
            p1 = Process(target=single_call, args=(on_open_1, on_message_1))
            p2 = Process(target=single_call, args=(on_open_2, on_message_2))
            p3 = Process(target=single_call, args=(on_open_3, on_message_3))
    
            p1.start()
            p2.start()
            p3.start()
    
            p1.join()
            p2.join()
            p3.join()

    Take note of the following parameters:

    • Set the authorization parameter to the service token.

    • Set the host parameter to the service endpoint. Replace the http prefix in the endpoint with ws.

    • Use the use_stream_chat parameter to specify whether the service returns output in streaming mode. Default value: True.

    • Refer to the on_open_2 function in the preceding code to implement a multi-round dialogue.
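The prefix replacement described in the notes above can be done with the standard library instead of string slicing. The following is a minimal sketch: the function name `to_ws_url` is hypothetical, and the mapping of https to wss is an assumption that follows from applying the same prefix replacement to an HTTPS endpoint.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ws_url(endpoint: str) -> str:
    """Convert an http(s) service endpoint to the matching ws(s) URL."""
    parts = urlsplit(endpoint)
    # Assumption: https endpoints map to wss, mirroring http to ws.
    scheme = {"http": "ws", "https": "wss"}.get(parts.scheme)
    if scheme is None:
        raise ValueError(f"expected an http or https endpoint, got {endpoint!r}")
    return urlunsplit(
        (scheme, parts.netloc, parts.path, parts.query, parts.fragment))
```

The result can be assigned to the host variable of the preceding WebSocket sample. An endpoint without a scheme raises ValueError instead of producing a malformed URL.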

How do I configure more parameters?

  1. On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update Service in the Actions column.

  2. On the LLM Deployment page, click Convert to Custom Deployment in the upper-right corner.

  3. In the Model Service Information section, configure the Command to Run parameter by using the options in the following table and click Deploy.

    | Parameter | Description | Default value |
    | --- | --- | --- |
    | --model-path | Specify the preset model name or a custom model path. Example 1: Load a preset model. You can use a preset model in the meta-llama/Llama-2-* series, including Llama-2-7b-hf, Llama-2-7b-chat-hf, Llama-2-13b-hf, and Llama-2-13b-chat-hf. Example: python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf Example 2: Load an on-premises custom model. Example: python webui/webui_server.py --port=8000 --model-path=/llama2-7b-chat | meta-llama/Llama-2-7b-chat-hf |
    | --cpu | Use CPU to perform model inference. Example: python webui/webui_server.py --port=8000 --cpu | By default, GPU is used for model inference. |
    | --precision | Specify the precision of the Llama2 model. Valid values: fp32 and fp16. Example: python webui/webui_server.py --port=8000 --precision=fp32 | The system automatically specifies the precision of the 7B model based on the GPU memory size. |
    | --port | Specify the listening port of the server. Example: python webui/webui_server.py --port=8000 | 8000 |
    | --api-only | Allows users to access the service only by calling API operations. By default, the service starts both WebUI and API server. Example: python webui/webui_server.py --api-only | False |
    | --no-api | Allows users to access the service only by using the WebUI. By default, the service starts both WebUI and API server. Example: python webui/webui_server.py --no-api | False |
    | --max-new-tokens | The maximum number of output tokens. Example: python api/api_server.py --port=8000 --max-new-tokens=1024 | 2048 |
    | --temperature | The randomness of the model output. A larger value specifies a higher randomness. A value of 0 specifies a fixed output. The value is of the Float type and ranges from 0 to 1. Example: python api/api_server.py --port=8000 --temperature=0.8 | 0.95 |
    | --max_round | The maximum number of rounds of dialogue supported during inference. Example: python api/api_server.py --port=8000 --max_round=10 | 5 |
    | --top_k | The number of outputs selected from the generated results. The value is a positive integer. Example: python api/api_server.py --port=8000 --top_k=10 | None |
    | --top_p | The probability threshold of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1. Example: python api/api_server.py --port=8000 --top_p=0.9 | None |
    | --no-template | Models such as Llama 2 and Falcon provide a default prompt template. If you leave this parameter empty, the default prompt template is used. If you configure this parameter, you must specify your own template. Example: python api/api_server.py --port=8000 --no-template | If you do not specify this parameter, the default prompt template is automatically used. |
    | --log-level | The log output level. Valid values: DEBUG, INFO, WARNING, and ERROR. Example: python api/api_server.py --port=8000 --log-level=DEBUG | INFO |
    | --export-history-path | You can use EAS-LLM to export the conversation history. In this case, you must specify an output path to which you want to export the conversation history when you start the service. In most cases, you can specify the mount path of an OSS bucket. EAS exports the records of the conversation that happened over a specific period of time to a file. Example: python api/api_server.py --port=8000 --export-history-path=/your_mount_path | By default, this feature is disabled. |
    | --export-interval | The period of time during which the conversation is recorded. Unit: seconds. For example, if you set the --export-interval parameter to 3600, the conversation records of the previous hour are exported into a file. | 3600 |
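When several of the options above are combined, assembling the Command to Run string by hand is error-prone. The following sketch renders the string from a mapping of flags to values. It is a convenience helper under assumptions, not an EAS tool: the function name `build_command` is hypothetical, and flags must be passed exactly as they are spelled in the table, because some use hyphens (--max-new-tokens) and others use underscores (--max_round).

```python
import shlex


def build_command(script: str, options: dict) -> str:
    """Render a Command to Run string from a mapping of flags to values.

    A value of True emits a bare switch (for example --api-only), a value of
    False or None omits the flag, and any other value becomes --flag=value.
    """
    args = ["python", script]
    for flag, value in options.items():
        if value is True:
            args.append(flag)
        elif value is False or value is None:
            continue
        else:
            args.append(f"{flag}={value}")
    # shlex.join quotes arguments that contain spaces or shell metacharacters.
    return shlex.join(args)
```

For example, `build_command("api/api_server.py", {"--port": 8000, "--temperature": 0.8})` produces a command in the same form as the examples in the table.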

References

  • For more information about EAS, see EAS overview.

  • After you use the LangChain framework on the WebUI page, the knowledge base is also available when you call API operations. We recommend that you store your knowledge base in an on-premises vector database. For more information, see RAG-based LLM chatbot.

  • For more information about the versions of ChatLLM-WebUI, see Release notes for ChatLLM WebUI.