Elastic Algorithm Service (EAS) provides simplified deployment methods for different scenarios. You can configure a few parameters to deploy a Retrieval-Augmented Generation (RAG)-based large language model (LLM) chatbot, which significantly shortens the service deployment time. During model inference, the chatbot retrieves relevant information from the knowledge base and combines it with the answers generated by LLM applications to provide accurate and informative answers. This significantly improves the quality of Q&A and the overall performance. The chatbot is suitable for Q&A, summarization, and other natural language processing (NLP) tasks that rely on specific knowledge bases. This topic describes how to deploy a RAG-based LLM chatbot and how to perform model inference.
Background information
LLM applications have limitations in generating accurate and up-to-date responses, and therefore are not suitable for scenarios that require precise information, such as customer service and Q&A scenarios. To resolve these issues, the RAG technique is used to enhance the performance of LLM applications. This significantly improves the quality of Q&A, summarization, and other NLP tasks that rely on specific knowledge bases.
RAG improves answer accuracy and makes answers more informative by combining LLM applications, such as Tongyi Qianwen, with information retrieval components. When a query is initiated, RAG uses an information retrieval component to find documents or information fragments in the knowledge base that are relevant to the query, and integrates the retrieved content with the original query into a prompt for the LLM application. The LLM application then uses its reasoning and generation capabilities to produce a factual answer based on the latest information. You do not need to retrain the LLM application.
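The following minimal Python sketch illustrates this retrieve-then-generate flow. The toy keyword retriever and the stub LLM are illustrative placeholders, not components of EAS or PAI.

# Minimal sketch of the RAG flow: retrieve relevant fragments, build a prompt, generate.
def retrieve(query, knowledge_base, top_k=3):
    # Rank knowledge fragments by naive keyword overlap with the query.
    words = set(query.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda doc: len(words & set(doc.lower().split())),
                    reverse=True)
    return scored[:top_k]

def answer_with_rag(query, knowledge_base, llm):
    fragments = retrieve(query, knowledge_base)
    # Integrate the retrieved content with the original query into a single prompt.
    prompt = ("Answer based on the context below.\n"
              "Context:\n" + "\n".join(fragments) + "\n"
              "Question: " + query)
    return llm(prompt)

kb = ["PAI is the AI platform of Alibaba Cloud.",
      "EAS deploys models as online services."]
print(answer_with_rag("What is PAI?", kb, llm=lambda prompt: prompt))  # stub LLM echoes the prompt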
The chatbot that is deployed in EAS integrates LLM applications with RAG to overcome the limits of LLM applications in terms of accuracy and timeliness. This chatbot provides accurate and informative answers in various Q&A scenarios and helps improve the overall performance and user experience of NLP tasks.
Prerequisites
A virtual private cloud (VPC), vSwitch, and security group are created. For more information, see Create and manage a VPC and Create a security group.
Note: If you use Facebook AI Similarity Search (Faiss) to build the vector database, the preceding prerequisites are not required.
An Object Storage Service (OSS) bucket or network-attached storage (NAS) file system is created for storing fine-tuned model files. This prerequisite must be met if you use a fine-tuned model to deploy the chatbot. For more information, see Get started by using the OSS console or Create a file system.
Note: If you use Faiss to build the vector database, you must prepare an OSS bucket.
Limits
The vector database and EAS must be deployed in the same region.
Precautions
This practice is subject to the maximum number of tokens of an LLM service and is designed to help you understand the basic retrieval feature of a RAG-based LLM chatbot.
The chatbot is limited by the server resources of the LLM service and by the default maximum number of tokens (512). As a result, the supported conversation length is also limited. If you want to try the retrieval feature of the chatbot, we recommend that you enter text with at most 300 words for retrieval.
If you do not need to perform multiple rounds of conversations, we recommend that you disable the with chat history feature of the chatbot. This effectively reduces the possibility of reaching the token limit. For more information, see How do I disable the with chat history feature of the RAG-based chatbot?
Step 1: Prepare a vector database
You can use one of the following services to build the vector database: Faiss, Elasticsearch, Hologres, or AnalyticDB for PostgreSQL. When you build the vector database, save the parameter configurations that are required to connect to it. You need them in Step 2.
Faiss
Faiss streamlines the process of building an on-premises vector database. You do not need to purchase or activate the service.
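For reference, the following sketch builds and queries a small Faiss index on random vectors. The dimension, index type, and data are arbitrary example choices; the sketch assumes that the faiss-cpu and numpy packages are installed.

import numpy as np
import faiss  # pip install faiss-cpu

d = 128  # example embedding dimension
doc_vectors = np.random.rand(1000, d).astype('float32')  # stand-in document embeddings
index = faiss.IndexFlatL2(d)  # exact L2 search, the simplest index type
index.add(doc_vectors)  # build the on-premises vector store
query = np.random.rand(1, d).astype('float32')
distances, ids = index.search(query, 3)  # retrieve the top-3 nearest vectors
print(ids[0], distances[0])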
Elasticsearch
Create an Elasticsearch cluster. For more information, see Create an Alibaba Cloud Elasticsearch cluster.
Take note of the following parameters:
Set the Instance Type parameter to Standard Edition.
Copy the values of the Username and Password parameters and save them to your on-premises machine.
Click the name of the cluster to go to the Basic Information page. Copy the values of the Internal Endpoint and Internal Port parameters and save them to your on-premises machine.
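To verify the connection details that you saved, you can send an authenticated request to the cluster from a machine in the same VPC. A minimal sketch follows; the endpoint, port, and credentials are placeholders:

import requests

# Placeholders: substitute the internal endpoint, port, username, and password from this step.
ES_URL = 'http://es-cn-xxxx.elasticsearch.aliyuncs.com:9200'
resp = requests.get(ES_URL, auth=('elastic', '<your_password>'))
print(resp.status_code, resp.json().get('cluster_name'))  # 200 and the cluster name indicate success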
Hologres
Purchase a Hologres instance and create a database. For more information, see Purchase a Hologres instance and Create a database. You must save the name of the database to your on-premises machine.
View the invocation information in the Hologres console.
Click the name of the instance to go to the Instance Details page.
In the Network Information section, find Select VPC, click Copy in the Endpoint column, and then save the part of the endpoint before :80 to your on-premises machine.
In the left-side navigation pane, click Account Management to create a custom account. Save the account and password to your on-premises machine. This information is used for subsequent connections to the Hologres instance. For information about how to create a custom account, see the "Create a custom account" section in Manage users.
Set the Select Member Role parameter to Super Administrator (SuperUser).
AnalyticDB for PostgreSQL
Create an instance in the AnalyticDB for PostgreSQL console. For more information, see Create an instance.
Set the Vector Engine Optimization parameter to Enabled.
Click the name of the instance to go to the Basic Information page. In the Database Connection Information section, copy the internal and public endpoints of the instance and save them to your on-premises machine.
Note: If no public endpoint is available, click Apply for Public Endpoint. For more information, see Manage public endpoints.
If the instance resides in the same VPC as EAS, you need only the internal endpoint.
Create a database account. Save the database account and password to your on-premises machine. This information is used for subsequent connections to the database. For more information, see Create a database account.
Configure a whitelist that consists of trusted IP addresses. For more information, see Configure an IP address whitelist.
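Both Hologres and AnalyticDB for PostgreSQL are compatible with the PostgreSQL protocol, so you can verify the endpoint, account, and password that you saved with a standard PostgreSQL client before you deploy the chatbot. A minimal sketch that uses psycopg2 follows; all connection values are placeholders:

import psycopg2  # pip install psycopg2-binary

# Placeholders: substitute the endpoint, database name, account, and password from this step.
conn = psycopg2.connect(host='<instance-endpoint>',
                        port=80,  # 80 for Hologres VPC endpoints; 5432 for AnalyticDB for PostgreSQL
                        dbname='<database>',
                        user='<account>',
                        password='<password>')
with conn.cursor() as cur:
    cur.execute('SELECT version();')  # simple connectivity check
    print(cur.fetchone()[0])
conn.close()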
Step 2: Deploy the RAG-based chatbot
Go to the EAS-Online Model Services page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which you want to deploy the model.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment section, click RAG-based Smart Dialogue Deployment.
On the RAG-based LLM Chatbot Deployment page, configure the parameters. The following tables describe the key parameters in different sections.
Basic Information
Parameter | Description |
Service Name | The name of the service. |
Model Source | Valid values: Open Source Model and Custom Fine-tuned Model. |
Model Type | Select a model type based on your business requirements. If you set Model Source to Custom Fine-tuned Model, you must also specify the parameter quantity and precision of the model. |
Model Settings | If you set Model Source to Custom Fine-tuned Model, specify the path in which the fine-tuned model file is stored. Make sure that the model file format is compatible with Hugging Face Transformers. Valid values: Mount OSS: Select the OSS path in which the fine-tuned model file is stored. Mount NAS: Select the NAS file system in which the fine-tuned model file is stored and the source path of the NAS file system. |
Resource Configuration
Parameter | Description |
Resource Configuration | If you set Model Source to Open Source Model, the system automatically selects an instance type that matches the selected model type. If you set Model Source to Custom Fine-tuned Model, select an instance type that matches the model. For more information, see Deploy LLM applications in EAS. |
Inference Acceleration | Inference acceleration can be enabled for Qwen, Llama2, ChatGLM, and Baichuan2 models that are deployed on A10 or GU30 instances. Valid values: BladeLLM Inference Acceleration: The BladeLLM inference acceleration engine provides high concurrency and low latency, and accelerates LLM inference in a cost-effective manner. Open-source vLLM Inference Acceleration. |
Vector Database Settings
Select a service to build a vector database based on your business requirements.
FAISS
Parameter | Description |
Vector Database Type | Select FAISS. |
Mount Path | Select an OSS path. |
Index Name | Enter the name of the index. Example: faiss_index. |
Elasticsearch
Parameter | Description |
Vector Database Type | Select Elasticsearch. |
Private Endpoint and Port | Enter the private endpoint and port number that you obtained in Step 1, in the http://<private endpoint>:<port number> format. |
Index Name | Enter the name of the index. |
Account | Enter the username that you configured when you created the Elasticsearch cluster in Step 1. |
Password | Enter the password that you configured when you created the Elasticsearch cluster in Step 1. |
Hologres
Parameter | Description |
Vector Database Type | Select Hologres. |
Invocation Information | Enter the Hologres invocation information that you obtained in Step 1. |
Database | Enter the name of the database that you created in Step 1. |
Account | Enter the custom account that you created in Step 1. |
Password | Enter the password of the custom account that you created in Step 1. |
Database Table | Enter the name of the database table. Example: test_table. |
AnalyticDB
Parameter | Description |
Vector Database Type | Select AnalyticDB. |
Database Endpoint | Enter the public endpoint of the database that you obtained in Step 1. Note: If the instance resides in the same VPC as EAS, you need only the internal endpoint. |
Database Name | To view the name of the database, log on to the database. For more information, see Connect to a database. |
Account | Enter the database account that you created in Step 1. |
Password | Enter the password of the database account that you created in Step 1. |
Database Folder Name | Enter the name of the database folder. Example: test_db. |
Delete Table | Select a policy for handling an existing database table that has the same name. Valid values: Delete: deletes the existing table that has the same name and creates a new one. If no table with the same name exists, a new table is created directly. Do not delete: retains the existing table that has the same name and appends new data to it. |
VPC Configuration
Parameter | Description |
VPC, vSwitch, and Security Group Name | If you use Hologres, AnalyticDB for PostgreSQL, or Elasticsearch to build the vector database, select the VPC, vSwitch, and security group in which the vector database resides. If you use Faiss to build the vector database, you do not need to configure these parameters. |
Click Deploy.
When the value in the Service Status column changes to Running, the RAG-based chatbot is deployed.
Step 3: Perform model inference on the web UI
This section describes how to debug the RAG-based chatbot on the web UI. After you test the Q&A performance of the RAG-based chatbot on the web UI, you can call API operations provided by Platform for AI (PAI) to apply the RAG-based chatbot to your business system. For more information, see Step 4: Call API operations to perform model inference in this topic.
1. Configure the RAG-based chatbot
After you deploy the RAG-based chatbot, click View Web App in the Service Type column to enter the web UI.
Configure the machine learning model.
Embedding Model: Five models are available. By default, the optimal model is selected.
Embedding Dimension: After you configure the Embedding Model parameter, the system automatically configures this parameter.
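The embedding dimension is fixed by the embedding model itself, which is why the system fills it in for you. The following sketch shows this with the open source sentence-transformers library; the model name is an arbitrary example and not necessarily one of the five models offered in the UI:

from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer('BAAI/bge-small-zh-v1.5')  # example embedding model
vector = model.encode('What is PAI?')
print(vector.shape)  # the embedding dimension is a fixed property of the model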
Check whether the vector database is connected.
The system automatically recognizes and applies the vector database settings that are configured when you deploy the chatbot. The settings cannot be modified. If you use Hologres to build the vector database, click Connect Hologres to check whether the vector database in Hologres is connected.
2. Upload specified business data files
On the Upload tab, upload the specified business data files.
Configure semantic-based chunking parameters.
File type | Example value | Description |
text | | Configure the following parameters to control the granularity at which files are split into chunks. Chunk Size: the size of each chunk. Default value: 200. Unit: bytes. Chunk Overlap: the overlap between adjacent chunks. Default value: 0. |
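To make the two parameters concrete, the following sketch splits text into fixed-size chunks with a configurable overlap. It is a simplification: it counts characters rather than bytes and does not perform the semantic-based chunking that the service applies.

def split_into_chunks(text, chunk_size=200, chunk_overlap=0):
    # Each chunk holds chunk_size characters; adjacent chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap  # assumes chunk_overlap < chunk_size
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_into_chunks('PAI ' * 200, chunk_size=200, chunk_overlap=20)
print(len(chunks), len(chunks[0]))  # 5 chunks; the first is 200 characters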
On the Files tab, upload one or more business data files. You can also upload a directory that contains the business data files on the Directory tab.
Click Upload. The system performs data cleansing, which includes text extraction and hyperlink replacement, and semantic-based chunking on the business data files before the files are uploaded.
3. Configure model inference parameters
Configure Q&A policies for retrieval-based queries
On the Chat tab, configure Q&A policies for retrieval-based queries.
Parameter | Description |
Top K | The number of the most relevant results that are returned from the vector database. |
Similarity Distance Threshold | A smaller value indicates a higher level of similarity between vectors. If the similarity distance of two vectors is less than the threshold, the two vectors are similar. We recommend that you retain the default value of this parameter. |
Re-Rank Model | Most vector databases compromise data accuracy to provide high computing efficiency. As a result, the top K results that are returned from the vector database may not be the most relevant. In this case, you can use the open source model BAAI/bge-reranker-base or BAAI/bge-reranker-large to perform a higher-precision re-rank operation on the top K results that are returned from the vector database to obtain more relevant and accurate knowledge files. |
Keyword Retrieval | In most scenarios, vector database-based retrieval delivers good performance. However, in vertical fields where corpora are scarce, or in scenarios that require exact matching, vector database-based retrieval may not perform as well as traditional keyword-based sparse retrieval, which is simpler and more efficient because it calculates the keyword overlap between user queries and knowledge files. PAI provides keyword-based retrieval algorithms, such as BM25, for this purpose. The two retrieval methods have their own advantages and disadvantages, and combining their results can improve the overall accuracy and efficiency. The reciprocal rank fusion (RRF) algorithm computes a total score for each file by taking a weighted sum of the file's ranks across the different retrieval methods. If you select Keyword Ensembled for Keyword Retrieval, PAI uses the RRF algorithm by default to combine the results of vector database-based retrieval and keyword-based retrieval. |
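The following sketch shows how RRF fuses two ranked result lists. The constant k=60 is the common default from the original RRF paper and is an assumption here, not a documented PAI setting:

def rrf_fuse(rank_lists, k=60):
    # Each input list ranks document IDs from most to least relevant for one retrieval method.
    scores = {}
    for ranks in rank_lists:
        for position, doc_id in enumerate(ranks, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ['doc2', 'doc1', 'doc5']   # e.g. results from the vector database
keyword_hits = ['doc1', 'doc3', 'doc2']  # e.g. results from BM25
print(rrf_fuse([vector_hits, keyword_hits]))  # ['doc1', 'doc2', 'doc3', 'doc5']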
Configure Q&A policies for RAG-based queries
On the Chat tab, configure Q&A policies for RAG-based queries. The RAG-based chatbot combines the retrieval method with the generation capabilities of LLM applications. PAI provides various prompt policies: you can select a predefined prompt template or specify a custom prompt template for better inference results.
4. Perform model inference
Retrieval
The chatbot returns the top K relevant results from the vector database.
LLM
The chatbot uses only the LLM application to generate an answer.
RAG (retrieval + LLM)
The chatbot fills the selected prompt template with the query and the results returned from the vector database, and then sends the prompt to the LLM application to generate an answer.
Step 4: Call API operations to perform model inference
Obtain the invocation information of the RAG-based chatbot.
Click the name of the RAG-based chatbot to go to the Service Details page.
In the Basic Information section, click View Endpoint Information.
On the Public Endpoint tab, obtain the service endpoint and token.
Connect to the vector database on the web UI and upload business data files. For more information, see 1. Configure the RAG-based chatbot and 2. Upload specified business data files in this topic.
Call API operations to perform model inference.
PAI allows you to call the RAG-based chatbot by using the chat/retrieval, chat/llm, or chat/rag API. Sample code:
cURL command
Method 1: chat/retrieval
curl -X 'POST' '<service_url>chat/retrieval' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?","score_threshold": 900, "vector_topk": 3}' # Replace <service_url> with the service endpoint that you obtained in Step 1, and <service_token> with the service token that you obtained in Step 1.
Method 2: chat/llm
curl -X 'POST' '<service_url>chat/llm' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?"}' # Replace <service_url> with the service endpoint that you obtained in Step 1, and <service_token> with the service token that you obtained in Step 1.
You can add other adjustable inference parameters, such as {"question": "What is PAI?", "topk": 3, "topp": 0.8, "temperature": 0.9}.
Method 3: chat/rag
curl -X 'POST' '<service_url>chat/rag' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"question": "What is PAI?","score_threshold": 900, "vector_topk": 3}' # Replace <service_url> with the service endpoint that you obtained in Step 1, and <service_token> with the service token that you obtained in Step 1.
You can add other adjustable inference parameters, such as {"question": "What is PAI?", "score_threshold": 900, "vector_topk": 3, "topk": 3, "topp": 0.8, "temperature": 0.9}.
Python script
import requests

EAS_URL = 'http://chatbot-langchain.xx.cn-beijing.pai-eas.aliyuncs.com'

def test_post_api_chat():
    url = EAS_URL + '/chat/retrieval'
    # url = EAS_URL + '/chat/llm'
    # url = EAS_URL + '/chat/rag'
    headers = {
        'accept': 'application/json',
        'Content-Type': 'application/json',
        'Authorization': 'xxxxx==',
    }
    data = {
        "question": "What is PAI?",
        "score_threshold": 900,
        "vector_topk": 3
    }
    # The chat/llm and chat/rag APIs support other adjustable inference parameters.
    """
    data = {
        "question": "What is PAI?",
        "topk": 3,
        "topp": 0.8,
        "temperature": 0.9
    }
    """
    response = requests.post(url, headers=headers, json=data)
    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = response.json()
    return ans['response']

print(test_post_api_chat())
Set EAS_URL to the endpoint of the RAG-based chatbot and the Authorization header to the token of the RAG-based chatbot.
References
You can also use EAS to deploy the following items:
You can deploy an LLM application that can be called by using the web UI or API operations. After the LLM application is deployed, use the LangChain framework to integrate enterprise knowledge bases into the LLM application to implement intelligent Q&A and automation features. For more information, see Quickly deploy open source LLMs in EAS.
You can deploy an AI video generation model service by using ComfyUI and Stable Video Diffusion models. This helps you complete tasks such as short video generation and animation on social media platforms. For more information, see Use ComfyUI to deploy an AI video generation model service.
FAQ
How do I disable the with chat history feature of the RAG-based chatbot?
On the web UI page of the RAG-based chatbot, set the With Chat History parameter to No.