You can use Elastic Algorithm Service (EAS) of Platform for AI (PAI) to deploy a large language model (LLM) as an AI-powered web application. After you deploy the model, you can call the application by using the web UI or API operations. You can also use the LangChain framework to integrate an enterprise knowledge base and implement intelligent conversation and automation capabilities. EAS also provides the BladeLLM and vLLM inference acceleration engines, which support high concurrency and low latency.
Background information
As foundation models such as ChatGPT and Tongyi Qianwen gain popularity, inference applications built on LLMs have come under the spotlight. EAS allows you to choose open source foundation models based on their performance and your business requirements. For example, you can quickly launch third-party model files of LLMs such as Qwen, Llama2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B in EAS, and deploy an open source model as an inference application in a few clicks. This topic describes how to deploy an LLM in EAS and call the model service. This topic also provides answers to frequently asked questions.
Prerequisites
EAS is activated and the default workspace is created. For more information, see Activate PAI and create the default workspace.
If you use a RAM user to deploy the model, make sure that the RAM user is granted the management permissions on EAS. For more information, see Grant the permissions that are required to use EAS.
Limits
The inference acceleration engine supports only the following model types: Qwen, Llama2, Baichuan-13B, and Baichuan2-13B.
Deploy model service in EAS
Go to the EAS-Online Model Services page.
Log on to the Platform for AI (PAI) console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which the model service that you want to manage belongs.
In the left-side navigation pane, choose Model Deployment>Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.
On the EAS-Online Model Services page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.
On the Deploy Service page, configure the required parameters. The following table describes key parameters.
Parameter
Description
Service Name
The name of the service. The service name llm_demo001 is used in this example.
Deployment Method
Select Deploy Web App by Using Image.
Select Image
Click PAI Image, select chat-llm-webui from the drop-down list, and select 2.1 as the image version.
Note: You can select the latest version of the image when you deploy the model service.
Command to Run
After you select an image version, the system automatically sets the command to python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat. This command calls the Qwen-7B model. If you want to call another LLM, replace the command. For more information, see the How do I switch to another open source foundation model? section of this topic.
Resource Group Type
Select Public Resource Group.
Resource Configuration Mode
Select General.
Resource Configuration
You must select a GPU instance type. For cost-effectiveness, we recommend the ml.gu7i.c16m60.1-gu30 instance type to run the Qwen-7B model. For the instance types that are recommended for other open source LLMs, see the How do I switch to another open source foundation model? section of this topic.
Click Deploy. The deployment requires several minutes to complete.
When the Model Status changes to Running, the service is deployed.
Use web UI to perform model inference
Find the service that you want to manage and click View Web App in the Service Type column.
Perform model inference on the web UI page.
In the input text box below the dialog box, enter a sentence, such as please provide a financial learning plan. Then, click Send to start the dialogue.
FAQ
How do I switch to another open source foundation model?
EAS allows you to deploy the following open source foundation models: Qwen, Llama2, ChatGLM, Baichuan, Yi-6B, Mistral-7B, and Falcon-7B. To switch to one of these models, perform the following steps:
On the EAS-Online Model Services page, find the service that you want to update. Click Update Service in the Actions column of the service.
On the Deploy Service page, update the Command to Run and Instance Type parameters based on the following table. Then, click Deploy.
Model type
Command to Run
Recommended specification
Qwen-1.8B
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-1_8B-Chat
1 * GU30
1 * NVIDIA A10
1 * NVIDIA T4
1 * NVIDIA V100
Qwen-7B
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat
1 * GU30
1 * NVIDIA A10
Qwen-14B
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-14B-Chat
1 * NVIDIA V100 (gn6e)
2 * GU30
2 * NVIDIA A10
Qwen-72B
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-72B-Chat
8 * NVIDIA V100 (gn6e)
Llama2-7B
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-7b-chat-hf
1 * GU30
1 * NVIDIA A10
Llama2-13B
python webui/webui_server.py --port=8000 --model-path=meta-llama/Llama-2-13b-chat-hf --precision=fp16
1 * NVIDIA V100 (gn6e)
2 * GU30
2 * NVIDIA A10
chatglm2-6B
python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm2-6b
1 * GU30
1 * NVIDIA A10
chatglm3-6B
python webui/webui_server.py --port=8000 --model-path=THUDM/chatglm3-6b
1 * GU30
1 * NVIDIA A10
baichuan-13B
python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan-13B-Chat
1 * NVIDIA V100 (gn6e)
2 * GU30
2 * NVIDIA A10
baichuan2-7B
python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-7B-Chat
1 * GU30
1 * NVIDIA A10
baichuan2-13B
python webui/webui_server.py --port=8000 --model-path=baichuan-inc/Baichuan2-13B-Chat
1 * NVIDIA V100 (gn6e)
2 * GU30
2 * NVIDIA A10
Yi-6B
python webui/webui_server.py --port=8000 --model-path=01-ai/Yi-6B
1 * GU30
1 * NVIDIA A10
Mistral-7B
python webui/webui_server.py --model-path=mistralai/Mistral-7B-Instruct-v0.1
1 * GU30
1 * NVIDIA A10
falcon-7B
python webui/webui_server.py --port=8000 --model-path=tiiuae/falcon-7b-instruct
1 * GU30
1 * NVIDIA A10
How do I use LangChain to integrate my business data?
What is LangChain:
LangChain is an open source framework that allows AI developers to integrate LLMs like GPT-4 with external data to improve performance and optimize resource utilization.
How does LangChain work:
LangChain divides a document, such as a 20-page PDF file, into smaller chunks, converts the chunks into embeddings, and stores them in a vector store as the local knowledge base of the LLM.
During each inference, LangChain searches the local knowledge base for the content that is most similar to the user question, and then passes the retrieved content together with the user question to the LLM to generate a customized answer.
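The following Python snippet is a minimal sketch of this retrieval workflow, shown outside of EAS for illustration only. It assumes that the langchain, sentence-transformers, and faiss-cpu packages are installed; the file path docs/readme.md, the embedding model name, and the chunk sizes are placeholder values, and the exact module paths can differ across LangChain versions.
# Minimal sketch of the LangChain retrieval flow: split a document into chunks,
# embed the chunks into a vector store, and retrieve the chunks that are most
# similar to a user question. File paths and model names are placeholders.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Load a local document and split it into smaller chunks.
documents = TextLoader("docs/readme.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(documents)

# Embed the chunks and build a local vector store that serves as the knowledge base.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# Retrieve the chunks that are most similar to the user question. The retrieved
# text and the question are then sent to the LLM together to generate the answer.
question = "how to install deepspeed"
related_chunks = vectorstore.similarity_search(question, k=3)
prompt = "\n".join(c.page_content for c in related_chunks) + "\n\nQuestion: " + question
print(prompt)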
How to configure LangChain:
Click LangChain and go to the LangChain tab on the web UI page.
Upload custom data in the lower-left corner of the web UI page. You can upload files in TXT, MD, DOCX, and PDF formats.
For example, you can drag a README.md file into the upload area and then click Vectorstore knowledge in the lower-left corner. After the file is processed, the custom data is loaded as the knowledge base.
In the input box at the bottom of the web UI page, enter a question about the uploaded data, such as how to install deepspeed, and click Send to start the dialogue.
Note: After you use LangChain to integrate business data on the web UI page, you can perform model inference with the data by using API operations. You can also perform vector searches in an on-premises knowledge base. For more information, see Use PAI and a vector database to implement intelligent dialogue based on LLMs.
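The HTTP clients shown later in this topic enable this behavior by adding a langchain field to the request payload. The following Python snippet is a minimal sketch; the endpoint, token, and prompt values are placeholders.
# Minimal sketch: query the knowledge base that was integrated through LangChain
# by setting "langchain" to True in the request payload.
import requests

host = "<EAS service public endpoint>"
authorization = "<EAS service public token>"

payload = {
    "prompt": "how to install deepspeed",  # example question about the uploaded data
    "langchain": True,                      # route the request through the LangChain knowledge base
    "use_stream_chat": False,
}
response = requests.post(host, headers={"Authorization": authorization}, json=payload)
print(response.json()["response"])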
How do I improve concurrency and reduce latency for the inference service?
EAS provides BladeLLM and vLLM inference acceleration engines to ensure high concurrency and low latency for the inference service. Perform the following steps:
On the EAS-Online Model Services page, find the service that you want to update. Click Update Service in the Actions column of the service.
In the Model Service Information section, add the --backend=vllm parameter to the Command to Run parameter and click Deploy.
Important: The inference acceleration engine supports only the following model types: Qwen, Llama2, Baichuan-13B, and Baichuan2-13B.
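For example, if the service runs the Qwen-7B model with the default command, the updated command looks as follows. This example only illustrates where the parameter is appended; keep the rest of your command unchanged.
python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen-7B-Chat --backend=vllm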
Update the versions of Transformers and vLLM.
As new models are released, previously released models and newly released models may require different versions of toolkits such as Transformers and vLLM, which can cause compatibility issues. To resolve these issues, we recommend that you upgrade toolkits such as Transformers and vLLM based on your business requirements. Specify the required toolkit versions in the Third-party Library Settings section.
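For example, you can enter one library per line in the Third-party Library Settings section, typically in the pip requirements format. The version numbers below are placeholders; use the versions that your model requires.
transformers==4.37.0
vllm==0.2.7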
How do I mount a custom model?
You can use Object Storage Service (OSS) to mount a custom model. Procedure:
Upload the model and related configuration files to your OSS bucket. For more information about how to create a bucket and upload objects, see Create buckets and Upload objects.
The model directory must contain the model weight files and the related configuration files. The config.json file is required, and you must configure it based on the Hugging Face model format. For more information about the sample file, see config.json.
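The following listing is a hypothetical example of what such a directory can look like for a model in the Hugging Face format; the exact files vary by model.
data-oss/
├── config.json
├── generation_config.json
├── tokenizer_config.json
├── tokenizer.model
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
└── model.safetensors.index.json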
Click Update Service in the Actions column of the service.
In the Model Service Information section, specify the required parameters and click Deploy.
Parameter
Description
Model Settings
Click Specify Model Settings to configure the model.
Select Mount OSS Path. Set the OSS bucket path to the path in which the custom model files reside, for example, oss://bucket-test/data-oss/.
Set Mount Path to /data.
Enable Read-only Mode: turn off the read-only mode.
Command to Run
Add the following parameters to Command to Run:
--model-path: Set this parameter to /data, which is the mount path that you specified.
--model-type: the model type.
For more information about commands to run for different types of models, see Commands to run.
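For example, assuming that a Llama2 model is mounted to /data, the updated command can look as follows. The --model-type value is only illustrative; use the value for your model as described in Commands to run.
python webui/webui_server.py --port=8000 --model-path=/data --model-type=llama2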
How do I use API operations to perform model inference?
Obtain the service access endpoint and token.
Go to the EAS-Online Model Services page. For more information, see the Deploy model service in EAS section of this topic.
Click the name of the service to go to the Service Details tab.
In the Basic Information section, click Invocation Method. On the Public Endpoint tab, obtain the service token and endpoint.
Perform model inference by calling API operations.
Call the service by using HTTP
Non-streaming mode
When you run cURL commands, the client sends standard HTTP requests of the following types:
STRING requests
curl $host -H 'Authorization: $authorization' --data-binary @chatllm_data.txt -v
Replace $authorization with the service token. Replace $host with the service endpoint. The file chatllm_data.txt is a plain text file that contains the prompt.
Structured requests
curl $host -H 'Authorization: $authorization' -H "Content-type: application/json" --data-binary @chatllm_data.json -v -H "Connection: close"
Use the chatllm_data.json file to configure inference parameters. The following code provides an example of the content format of the chatllm_data.json file:
{ "max_new_tokens": 4096, "use_stream_chat": false, "prompt": "How to install it?", "system_prompt": "Act like you are programmer with 5+ years of experience." "history": [ [ "Can you tell me what's the bladellm?", "BladeLLM is an framework for LLM serving, integrated with acceleration techniques like quantization, ai compilation, etc. , and supporting popular LLMs like OPT, Bloom, LLaMA, etc." ] ], "temperature": 0.8, "top_k": 10, "top_p": 0.8, "do_sample": True, "use_cache": True, }
The following table describes the parameters. Configure the parameters based on your business requirements.
Parameter
Description
Default value
max_new_tokens
The maximum number of output tokens.
2048
use_stream_chat
Specify whether to return the output tokens in streaming mode.
True
prompt
The user prompt.
""
system_prompt
The system prompt.
""
history
The dialogue history. The value is of the List[Tuple(str, str)] type.
[()]
temperature
Specify the randomness of the model output. A larger value indicates a higher randomness. A value of 0 indicates a fixed output. The value is of the Float type and ranges from 0 to 1.
0.95
top_k
The number of outputs selected from the generated results.
30
top_p
The proportion of outputs selected from the generated results. The value is of the Float type and ranges from 0 to 1.
0.8
do_sample
Specify whether to enable output sampling.
True
use_cache
Specify whether to enable KV cache.
True
You can also implement your own client based on the Python requests package. Example:
import argparse
import json
from typing import Iterable, List

import requests


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_response(response: requests.Response) -> List[str]:
    # The server returns a JSON response that includes the inference result and dialogue history.
    data = json.loads(response.content)
    output = data["response"]
    history = data["history"]
    return output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = False
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = "<EAS service public endpoint>"
    authorization = "<EAS service public token>"

    print(f"Prompt: {prompt!r}\n", flush=True)
    # System prompts can be included in the requests.
    system_prompt = "Act like you are programmer with \
5+ years of experience."
    # Dialogue history can be included in the requests. The client manages the history
    # to implement multi-round dialogues. In most cases, the information from the previous
    # round of dialogue is used. The information is in the List[Tuple(str, str)] format.
    history = []

    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)
    output, history = get_response(response)
    print(f" --- output: {output} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
Streaming mode
The streaming mode uses the HTTP SSE method. Sample code:
import argparse
import json
from typing import Iterable, List

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      system_prompt: str,
                      history: list,
                      host: str,
                      authorization: str,
                      max_new_tokens: int = 2048,
                      temperature: float = 0.95,
                      top_k: int = 1,
                      top_p: float = 0.8,
                      langchain: bool = False,
                      use_stream_chat: bool = False) -> requests.Response:
    headers = {
        "User-Agent": "Test Client",
        "Authorization": f"{authorization}"
    }
    if not history:
        history = [
            (
                "San Francisco is a",
                "city located in the state of California in the United States. \
It is known for its iconic landmarks, such as the Golden Gate Bridge \
and Alcatraz Island, as well as its vibrant culture, diverse population, \
and tech industry. The city is also home to many famous companies and \
startups, including Google, Apple, and Twitter."
            )
        ]
    pload = {
        "prompt": prompt,
        "system_prompt": system_prompt,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
        "use_stream_chat": use_stream_chat,
        "history": history
    }
    if langchain:
        pload["langchain"] = langchain
    response = requests.post(host, headers=headers, json=pload, stream=use_stream_chat)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False, delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["response"]
            history = data["history"]
            yield output, history


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--top-p", type=float, default=0.8)
    parser.add_argument("--max-new-tokens", type=int, default=2048)
    parser.add_argument("--temperature", type=float, default=0.95)
    parser.add_argument("--prompt", type=str, default="How can I get there?")
    parser.add_argument("--langchain", action="store_true")
    args = parser.parse_args()

    prompt = args.prompt
    top_k = args.top_k
    top_p = args.top_p
    use_stream_chat = True
    temperature = args.temperature
    langchain = args.langchain
    max_new_tokens = args.max_new_tokens

    host = ""
    authorization = ""

    print(f"Prompt: {prompt!r}\n", flush=True)
    system_prompt = "Act like you are programmer with \
5+ years of experience."
    history = []

    response = post_http_request(
        prompt, system_prompt, history,
        host, authorization,
        max_new_tokens, temperature, top_k, top_p,
        langchain=langchain, use_stream_chat=use_stream_chat)

    for h, history in get_streaming_response(response):
        print(f" --- stream line: {h} \n --- history: {history}", flush=True)
In the preceding code:
Set host to the service access endpoint.
Set authorization to the service token.
Call the service by using WebSocket
The WebSocket protocol is more efficient for handling the conversation history. You can use the WebSocket method to connect to the service and perform one or multiple rounds of conversation. Sample code:
import os
import time
import json
import struct
from multiprocessing import Process

import websocket

round = 5
questions = 0


def on_message_1(ws, message):
    if message == "<EOS>":
        print('pid-{} timestamp-({}) receives end message: {}'.format(os.getpid(), time.time(), message), flush=True)
        ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)
    else:
        print("{}".format(time.time()))
        print('pid-{} timestamp-({}) --- message received: {}'.format(os.getpid(), time.time(), message), flush=True)


def on_message_2(ws, message):
    global questions
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    if message == "<EOS>":
        questions = questions + 1
        if questions == 5:
            ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_message_3(ws, message):
    print('pid-{} --- message received: {}'.format(os.getpid(), message))
    # end the client-side streaming
    ws.send(struct.pack('!H', 1000), websocket.ABNF.OPCODE_CLOSE)


def on_error(ws, error):
    print('error happened: ', str(error))


def on_close(ws, a, b):
    print("### closed ###", a, b)


def on_pong(ws, pong):
    print('pong:', pong)


# stream chat validation test
def on_open_1(ws):
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    params_dict['prompt'] = """Show me a golang code example: """
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['do_sample'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    # raw_req = f"""To open a Websocket connection to the server: """
    ws.send(raw_req)
    # end the client-side streaming


# multi-round query validation test
def on_open_2(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {"max_new_tokens": 6144}
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['use_stream_chat'] = True
    params_dict['prompt'] = "Hello! "
    params_dict['system_prompt'] = "Act like you are programmer with 5+ years of experience."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please write a sorting algorithm in Python."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please convert to the Java implementation."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please introduce yourself?"
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)
    params_dict['prompt'] = "Please summarize the dialogue above."
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


# LangChain validation test
def on_open_3(ws):
    global round
    print('Opening Websocket connection to the server ... ')
    params_dict = {}
    # params_dict['prompt'] = """To open a Websocket connection to the server: """
    params_dict['prompt'] = """Can you tell me what's the MNN?"""
    params_dict['temperature'] = 0.9
    params_dict['top_p'] = 0.1
    params_dict['top_k'] = 30
    params_dict['max_new_tokens'] = 2048
    params_dict['use_stream_chat'] = False
    params_dict['langchain'] = True
    raw_req = json.dumps(params_dict, ensure_ascii=False).encode('utf8')
    ws.send(raw_req)


authorization = ""
host = "ws://" + ""


def single_call(on_open_func, on_message_func, on_close_func=on_close):
    ws = websocket.WebSocketApp(
        host,
        on_open=on_open_func,
        on_message=on_message_func,
        on_error=on_error,
        on_pong=on_pong,
        on_close=on_close_func,
        header=['Authorization: ' + authorization],
    )
    # set up a ping interval to keep the long connection alive
    ws.run_forever(ping_interval=2)


if __name__ == "__main__":
    for i in range(5):
        p1 = Process(target=single_call, args=(on_open_1, on_message_1))
        p2 = Process(target=single_call, args=(on_open_2, on_message_2))
        p3 = Process(target=single_call, args=(on_open_3, on_message_3))
        p1.start()
        p2.start()
        p3.start()
        p1.join()
        p2.join()
        p3.join()
Parameters:
Set authorization to the service token.
Set host to the service access endpoint. Replace the http in the endpoint with ws.
Use the use_stream_chat parameter to specify whether the server returns the output in streaming mode. The default value is True, which indicates that the server returns streaming data.
Refer to the implementation of the on_open_2 function in the preceding sample code to implement multi-round conversations.
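For example, the WebSocket address can be derived from the public endpoint that you obtain on the Invocation Method tab, as shown in the following minimal sketch. The endpoint value is a placeholder.
# Derive the WebSocket address from the HTTP endpoint of the EAS service.
endpoint = "http://<your-eas-endpoint>/api/predict/llm_demo001"  # placeholder
host = endpoint.replace("http://", "ws://", 1)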
References
For more information about EAS, see EAS overview.
You can also perform vector search in an on-premises knowledge base. For more information, see Use PAI and a vector database to implement intelligent dialogue based on LLMs.