
RAG-based LLM chatbot

Last Updated: Aug 18, 2025

Retrieval-Augmented Generation (RAG) technology enhances the capability of large language models (LLMs) in private domain knowledge Q&A by retrieving relevant information from external knowledge bases and merging it with user inputs. EAS provides scenario-based deployment methods that support flexible selection of large language models and vector databases, enabling the rapid construction and deployment of RAG chatbots. This topic describes how to deploy a RAG-based chatbot and perform model inference.

Step 1: Deploy the RAG service

  1. Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Model Online Service (EAS) page, click Deploy Service. In the Scenario-based Model Deployment area, click Deploy RAG-based LLM Chatbot.


  3. On the Deploy RAG-based LLM Chatbot page, configure the parameters and then click Deploy. When the Service Status changes to Running, the service is successfully deployed. Key parameter descriptions are as follows.

    • Basic Information

      • Version: Two deployment versions are supported:

        • Integrated LLM Deployment: Deploy the LLM service and the RAG service within the same service.

        • Separate LLM Deployment: Deploy only the RAG service. Within the RAG service, you can freely switch and connect to LLM services for greater flexibility.

      • Model Type: When you select Integrated LLM Deployment, you need to choose the LLM to deploy. You can select an open-source model based on your specific use case.

    • Resource Deployment

      • Deployment Resources: After you select the model type, the system automatically matches suitable resource specifications. Switching to other resource specifications may cause the model service to fail to start.

    • Vector Database Settings

      RAG supports building vector databases using Faiss (Facebook AI Similarity Search), Elasticsearch, Hologres, OpenSearch, or RDS PostgreSQL. Select a version type based on your scenario to serve as the vector database.

      FAISS

      Faiss lets you quickly build a local vector database without purchasing or activating an online vector database service.

      • Version Type: Select FAISS.

      • OSS Address: Select an OSS storage path created in the current region to store the uploaded knowledge base files. If no storage path is available, see Quick Start for Console to create one.

      Note

      If you choose to use a custom fine-tuned model deployment service, ensure that the selected OSS storage path does not overlap with the path of the custom fine-tuned model to avoid conflicts.

      Elasticsearch

      Specify the connection information of an Alibaba Cloud Elasticsearch instance. For information on how to create an Elasticsearch instance and prepare configuration items, see Prepare the vector database Elasticsearch.

      • Version Type: Select Elasticsearch.

      • Private Network Address/Port: The private network address and port of the Elasticsearch instance, in the format http://<private network address>:<private network port>. For information about how to obtain the private network address and port number of an Elasticsearch instance, see View the basic information of an instance.

      • Index Name: Enter a new or existing index name. For an existing index, the index schema must meet the requirements of the RAG-based chatbot. For example, you can enter the name of the index that is automatically created when you deploy the RAG-based chatbot by using EAS.

      • Account: The logon name that you configured when you created the Elasticsearch instance. The default logon name is elastic.

      • Password: The password that you configured when you created the Elasticsearch instance. If you forget the password, you can reset the instance access password.

      Hologres

      Specify the connection information of a Hologres instance. If you have not activated a Hologres instance, you can refer to Purchase Hologres for more information.

      • Version Type: Select Hologres.

      • Invocation Information: The host information of the designated VPC. On the instance details page of the Hologres Management Console, in the Network Information area, click Copy next to Designated VPC and use the part of the endpoint before :80 as the host information.

      • Database Name: The name of the database in the Hologres instance. For information about how to create a database, see Create a Database.

      • Account: The custom account that you created. For more information, see Create a Custom User, where you set Select Member Role to Superuser.

      • Password: The password of the custom account that you created.

      • Table Name: Enter a new or existing table name. For an existing table, the table schema must meet the requirements of the RAG-based chatbot. For example, you can enter the name of the Hologres table that is automatically created when you deploy the RAG-based chatbot by using EAS.

      OpenSearch

      Specify the connection information of an OpenSearch instance of Vector Search Edition. For information on how to create an OpenSearch instance and prepare configuration items, see Prepare the vector database OpenSearch.

      • Version Type: Select OpenSearch.

      • Access Address: The public endpoint of the OpenSearch Vector Search Edition instance. You must enable public access for the instance. For more information, see Prepare the vector database OpenSearch.

      • Instance ID: Obtain the instance ID from the OpenSearch Vector Search Edition instance list.

      • Username and Password: The username and password that you specified when you created the OpenSearch Vector Search Edition instance.

      • Table Name: The name of the index table that you created in the OpenSearch Vector Search Edition instance. For more information about how to prepare the index table, see Prepare the vector database OpenSearch.

      RDS PostgreSQL

      Specify the connection information of the ApsaraDB RDS for PostgreSQL instance. For information on how to create an ApsaraDB RDS for PostgreSQL instance and prepare configuration items, see Prepare the vector database RDS PostgreSQL.

      • Version Type: Select RDS PostgreSQL.

      • Host Address: The internal endpoint of the ApsaraDB RDS for PostgreSQL instance. You can view it on the Database Connection page of the instance in the ApsaraDB RDS console.

      • Port: The default value is 5432. Enter the value based on your actual configuration.

      • Database: The name of the database that you created. For information about how to create a database and an account, see Create an Account and a Database. Note the following points:

        • When you create the account, set Account Type to Privileged Account.

        • When you create the database, set Authorized Account to the privileged account that you created.

      • Table Name: The name of the database table.

      • Account and Password: The privileged account and its password that you created. For information about how to create a privileged account, see Create an Account and a Database and set Account Type to Privileged Account.

    • Virtual Private Cloud Configuration

      • VPC, Switch, and Security Group Name: Configure the VPC, switch, and security group that the RAG service uses:

        • If you selected Separate LLM Deployment when deploying the RAG service, make sure that the RAG service can connect to the LLM service:

          • Public access: Associate a virtual private cloud (VPC) with the RAG service and configure Internet access.

          • Private access: Associate the same VPC with both the RAG service and the LLM service.

        • If you need to use a model from Model Studio or perform Q&A with online search, you must configure a VPC and enable public access. For more information, see Configure Public Network Connection.

        • Network requirements for the vector database:

          • The Faiss vector database does not require network access.

          • Hologres, Elasticsearch, and RDS PostgreSQL can be accessed by EAS over the public network or a private network. Private network access is recommended and requires that the VPC configured in EAS be the same as the VPC of the vector database. For more information about how to create a VPC, a switch, and a security group, see Create and Manage a VPC and Create a Security Group.

          • EAS can access OpenSearch only over the public network. For more information about how to configure the access method, see Step 2: Prepare Configuration Items.

Step 2: Debug on the WebUI page

After the RAG service is successfully deployed, click View Web Application in the Service Method column to launch the WebUI page.

Follow the steps below to upload your knowledge base file on the WebUI page and test the Q&A chatbot.

1. Vector database and large language model settings

On the Settings tab, you can modify the embedding-related parameters and the large language model in use. It is recommended to use the default configuration.

Note

To use Dashscope, you must configure a public network connection for EAS and configure the API key for Bailian. Bailian model calls are billed separately. For more information, see Bailian Billing Items Description.

  • Index parameter descriptions:

    • Index Name: The system supports updating existing indexes. You can select New from the drop-down list to add an index and isolate knowledge base data by index name. For more information, see How to Use RAG Service for Knowledge Base Data Isolation?.

    • EmbeddingType: Two model sources are supported: Huggingface and Dashscope.

      • Huggingface: The system provides built-in embedding models for you to choose from.

      • Dashscope: Uses Bailian models. The default model is text-embedding-v2. For more information, see Embedding.

    • Embedding Dimension: The output vector dimension. The dimension directly affects model performance. After you select an embedding model, the system automatically sets this value. No manual configuration is required.

    • Embedding Batch Size: The batch size used when computing embeddings.

  • Large Language Model parameter descriptions

    If you select Separate LLM Deployment, deploy the LLM service by referring to Deploy an LLM. Then, click the LLM service name, and in the Basic Information area, click View Invocation Information to obtain the service access address and token. Configure the following parameters:

    • LLM Base URL and API Key:

      • When you use separate LLM deployment, specify the access address and token of the LLM service that you obtained.

      • When you use integrated LLM deployment, the system configures these parameters by default and no modification is required.

    • Model Name: If the LLM is deployed in the accelerated deployment (vLLM) mode, be sure to specify the model name, such as qwen2-72b-instruct. For other deployment modes, set the model name to default.

2. Upload knowledge base files

On the Upload tab, you can upload knowledge base files. The system automatically stores the knowledge base in the vector database in the PAI-RAG format. Supported file types include .html, .htm, .txt, .pdf, .pptx, .md, Excel (.xlsx or .xls), .jsonl, .jpeg, .jpg, .png, .csv, and Word (.docx), for example, rag_chatbot_test_doc.txt. The supported upload methods are as follows:

  • Upload local files (multiple files supported) or a directory (on the Files or Directory tab)

  • Upload from OSS (Aliyun OSS tab)

Before uploading, you can modify the concurrency control and semantic chunking parameters. The parameter descriptions are as follows:

  • Number of workers to parallelize data-loading over: The concurrency control parameter. The default value is 4, which means that the system can start four processes in parallel to upload files. It is recommended to set the concurrency based on the available GPU memory. For example, if the current GPU memory is 24 GB, the concurrency can be set to 4.

  • Chunk Size: The size of each chunk. Unit: bytes. Default value: 500.

  • Chunk Overlap: The overlap between adjacent chunks. Default value: 10.

  • Process with MultiModal: Use a multimodal model to process images in PDF, Word, and MD files. If you use a multimodal LLM, turn on this switch.

  • Process PDF with OCR: Use OCR mode to parse PDF files.

3. Model inference verification

On the Chat tab, select the knowledge base index (Index Name) to use, configure the Q&A strategy, and perform Q&A tests. The following four Q&A strategies are supported:

  • Retrieval: Directly retrieve and return the top K similar results from the vector database.

  • LLM: Directly use the LLM to answer.

  • Chat(Web Search): Automatically determine whether online search is needed based on the user's question. If online search is needed, the search results and the user's question are sent to the LLM service. To use online search, you must configure a public network connection for EAS.

  • Chat(Knowledge Base): Merge the results returned from the vector database retrieval with the user's question and fill them into the selected prompt template. Then, input them into the LLM service for processing to obtain the Q&A results.


More inference parameter descriptions are as follows:

  • General Parameters

    • Streaming Output: If you select Streaming Output, the system returns results in a streaming manner.

    • Need Citation: Specifies whether to include references in the answer.

    • Inference with multi-modal LLM: Specifies whether to display images when a multimodal LLM is used.

  • Vector Retrieval Parameters

    Retrieval Mode: The following three retrieval methods are supported:

    • Embedding Only: Vector database-based retrieval.

    • Keyword Only: Keyword-based retrieval.

    • Hybrid: Hybrid retrieval that combines vector database-based retrieval and keyword-based retrieval.

    Note

    In most complex scenarios, vector database-based retrieval delivers good performance. However, in some vertical fields that lack information or in scenarios that require exact matching, vector database-based retrieval may not perform as well as traditional keyword-based (sparse) retrieval, which is simpler and more efficient because it computes the keyword overlap between user queries and knowledge files.

    PAI provides keyword-based retrieval algorithms, such as BM25. Vector database-based retrieval and keyword-based retrieval each have their own advantages and disadvantages, and combining the results of the two methods can improve overall accuracy and efficiency.

    The reciprocal rank fusion (RRF) algorithm computes a total score for each file as a weighted sum of the ranks at which the file appears in the different retrieval methods. If you set Retrieval Mode to Hybrid, PAI uses the RRF algorithm by default to combine the results returned by vector database-based retrieval and keyword-based retrieval. A minimal illustration of this fusion is shown after this parameter list.

  • Online Search Parameters

    The supported search engine is bing (Bing search). Configure the following parameters:

    • Bing API Key: The API key used to access Bing search. For more information about how to obtain a Bing API key, see Bing Web Search API.

    • Search Count: The number of web pages to search. The default value is 10.

    • Language: The search language. You can select zh-CN (Chinese) or en-US (English).

  • LLM Parameters

    Temperature: Controls the randomness of the generated content. The lower the temperature, the more fixed the output result. The higher the temperature, the more diverse and creative the output result.
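
To illustrate the RRF combination described under Vector Retrieval Parameters, the following minimal Python sketch fuses two ranked result lists. The ranked lists, document IDs, and the constant k are illustrative assumptions used only for demonstration; they do not represent the internal PAI implementation.

from collections import defaultdict

def rrf(rankings, k=60):
    # Combine several ranked lists of document IDs into one fused ranking.
    # Each document receives a contribution of 1 / (k + rank) from every list in which it appears.
    scores = defaultdict(float)
    for ranked_docs in rankings:
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_3", "doc_1", "doc_7"]    # hypothetical ranking from vector database-based retrieval
keyword_hits = ["doc_1", "doc_9", "doc_3"]   # hypothetical ranking from keyword-based retrieval (for example, BM25)
print(rrf([vector_hits, keyword_hits]))      # documents ranked high in both lists are placed first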

Step 3: API call

The following content describes how to call the RAG API.

Important

The query and upload APIs can specify index_name to switch the knowledge base. If the index_name parameter is omitted, the default knowledge base is default_index. For more information, see How to Use RAG Service for Knowledge Base Data Isolation?.
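
For example, one way to pass the parameter is in the JSON request body of the query API, such as {"question": "What is PAI?", "index_name": "INDEX_1"}. This is a hedged example: INDEX_1 is a placeholder index name, and you should verify against your service version how index_name is accepted.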

Obtain invocation information

  1. Click the RAG service name to go to the Service Details page.

  2. In the Basic Information area, click View Invocation Information.

  3. On the Invocation Information dialog box, click the Public Address Call tab to obtain the service access address and token.

Upload knowledge base files

You can upload local knowledge base files through the API. You can query the status of the file upload task based on the task_id returned by the upload interface.

In the following example, replace <service_url> with the access address of the RAG service and <service_token> with the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.

  • Upload data

    curl -X 'POST' '<service_url>api/v1/upload_data' -H 'Authorization: <service_token>' -H 'Content-Type: multipart/form-data' -F 'files=@<file_path>'
    # Return: {"task_id": "****557733764fdb9fefa063538914da"}
  • Query the upload task status

    curl '<service_url>api/v1/get_upload_state?task_id=****557733764fdb9fefa063538914da' -H 'Authorization: <service_token>'
    # Return: {"task_id":"****557733764fdb9fefa063538914da","status":"completed"}

Single-round conversation request

CURL command

Note: In the following example, replace <service_url> with the access address of the RAG service and <service_token> with the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.

  • Retrieval: api/v1/query/retrieval

    curl -X 'POST'  '<service_url>api/v1/query/retrieval' -H 'Authorization: <service_token>' -H 'accept: application/json' -H 'Content-Type: application/json'  -d '{"question": "What is PAI?"}'
  • LLM: api/v1/query/llm

    curl -X 'POST'  '<service_url>api/v1/query/llm' -H 'Authorization: <service_token>' -H 'accept: application/json'  -H 'Content-Type: application/json'  -d '{"question": "What is PAI?"}'

    Supports adding other adjustable inference parameters, such as {"question":"What is PAI?", "temperature": 0.9}.

  • Chat(Knowledge Base): api/v1/query

    curl -X 'POST'  '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json'  -H 'Content-Type: application/json'  -d '{"question": "What is PAI?"}'

    Supports adding other adjustable inference parameters, such as {"question":"What is PAI?", "temperature": 0.9}.

  • Chat(Web Search): api/v1/query/search

    curl --location '<service_url>api/v1/query/search' \
    --header 'Authorization: <service_token>' \
    --header 'Content-Type: application/json' \
    --data '{"question":"China movie box office ranking", "stream": true}'

Python script

Note: In the following example, SERVICE_URL is configured as the access address of the RAG service and Authorization is configured as the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.

import requests

SERVICE_URL = 'http://xxxx.****.cn-beijing.pai-eas.aliyuncs.com/'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'MDA5NmJkNzkyMGM1Zj****YzM4M2YwMDUzZTdiZmI5YzljYjZmNA==',
}

def test_post_api_query(url):
    data = {
       "question":"What is PAI?" 
    }
    response = requests.post(url, headers=headers, json=data)

    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())

    print(f"======= Question =======\n {data['question']}")
    if 'answer' in ans.keys():
        print(f"======= Answer =======\n {ans['answer']}")
    if 'docs' in ans.keys():
        print(f"======= Retrieved Docs =======\n {ans['docs']}\n\n")
 
# LLM 
test_post_api_query(SERVICE_URL + 'api/v1/query/llm')
# Retrieval
test_post_api_query(SERVICE_URL + 'api/v1/query/retrieval')
# Chat (Knowledge Base)
test_post_api_query(SERVICE_URL + 'api/v1/query')
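
The Chat(Web Search) curl example above sets "stream": true to receive the answer incrementally. The following sketch reads such a streaming response in Python, reusing the SERVICE_URL and headers defined in the script above. It assumes the streamed body is delivered as chunked text lines; adjust the parsing to the actual response format of your service.

response = requests.post(
    SERVICE_URL + 'api/v1/query/search',
    headers=headers,
    json={'question': 'China movie box office ranking', 'stream': True},
    stream=True,
)
response.raise_for_status()
for chunk in response.iter_lines(decode_unicode=True):
    if chunk:
        print(chunk)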

Multi-round conversation request

LLM and Chat (Knowledge Base) support sending multi-round conversation requests. The following code example shows how to do this:

CURL command

Note: In the following example, replace <service_url> with the access address of the RAG service and <service_token> with the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.

The following example shows how to perform RAG conversation:

# Send a request. 
curl -X 'POST'  '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json'  -H 'Content-Type: application/json'  -d '{"question": "What is PAI?"}'

# Provide the session ID returned for the request. This ID uniquely identifies a conversation in the conversation history. After the session ID is provided, the corresponding conversation is stored and is automatically included in subsequent requests to call an LLM.
curl -X 'POST'  '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json'  -H 'Content-Type: application/json'  -d '{"question": "What are the benefits of PAI?","session_id": "ed7a80e2e20442eab****"}'

# Provide the chat_history parameter, which contains the conversation history between you and the chatbot. The parameter value is a list in which each element indicates a single round of conversation in the {"user":"Inputs","bot":"Outputs"} format. Multiple conversations are sorted in chronological order.
curl -X 'POST'  '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json'  -H 'Content-Type: application/json'  -d '{"question":"What are its features?", "chat_history": [{"user":"What is PAI?", "bot":"PAI is the Alibaba Cloud platform for AI..."}]}'

# If you provide both the session_id and chat_history parameters, the conversation history is appended to the conversation that corresponds to the specified session ID. 
curl -X 'POST'  '<service_url>api/v1/query' -H 'Authorization: <service_token>' -H 'accept: application/json'  -H 'Content-Type: application/json'  -d '{"question":"What are its features?", "chat_history": [{"user":"What is PAI?", "bot":"PAI is the Alibaba Cloud platform for AI..."}], "session_id": "1702ffxxad3xxx6fxxx97daf7c"}'

Python

Note: In the following example, SERVICE_URL is configured as the access address of the RAG service and Authorization is configured as the token of the RAG service. For more information about how to obtain the access address and token, see Obtain Invocation Information.

import requests

SERVICE_URL = 'http://xxxx.****.cn-beijing.pai-eas.aliyuncs.com/'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
    'Authorization': 'MDA5NmJkN****jNlMDgzYzM4M2YwMDUzZTdiZmI5YzljYjZmNA==',
}

def test_post_api_query_with_chat_history(url):
    # Round 1 query
    data = {
       "question": "What is PAI?"
    }
    response = requests.post(url, headers=headers, json=data)

    if response.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response.status_code}')
    ans = dict(response.json())
    print(f"=======Round 1: Question =======\n {data['question']}")
    if 'answer' in ans.keys():
        print(f"=======Round 1: Answer =======\n {ans['answer']} session_id: {ans['session_id']}")
    if 'docs' in ans.keys():
        print(f"=======Round 1: Retrieved Docs =======\n {ans['docs']}")
   
    # Round 2 query
    data_2 = {
       "question": "What are the benefits of PAI?",
       "session_id": ans['session_id']
    }
    response_2 = requests.post(url, headers=headers, json=data_2)

    if response_2.status_code != 200:
        raise ValueError(f'Error post to {url}, code: {response_2.status_code}')
    ans_2 = dict(response_2.json())
    print(f"=======Round 2: Question =======\n {data_2['question']}")
    if 'answer' in ans_2.keys():
        print(f"=======Round 2: Answer =======\n {ans_2['answer']} session_id: {ans_2['session_id']}")
    if 'docs' in ans_2.keys():
        print(f"=======Round 2: Retrieved Docs =======\n {ans_2['docs']}")
    print("\n")

# LLM
test_post_api_query_with_chat_history(SERVICE_URL + "api/v1/query/llm")
# Chat (Knowledge Base)
test_post_api_query_with_chat_history(SERVICE_URL + "api/v1/query")

Precautions

This practice is subject to the maximum number of tokens of an LLM service and is designed to help you understand the basic retrieval feature of a RAG-based LLM chatbot:

  • The chatbot is limited by the server resources of the LLM service and the default maximum number of tokens. Therefore, the length of a conversation that the chatbot supports is also limited.

  • If multi-round conversation is not required, it is recommended to disable the chat history feature to reduce the possibility of reaching this limit.

    WebUI operation method: On the Chat tab of the RAG service WebUI page, deselect the Chat history check box.


FAQ

How to use RAG service for knowledge base data isolation?

When different departments or individuals use their own independent knowledge bases, you can achieve effective data isolation by using the following methods:

  1. On the Settings tab of the WebUI page, configure the following parameters and then click Add Index.

    • Index Name: Select NEW.

    • New Index Name: Customize a new index name. For example, INDEX_1.

    • Path: If you select Faiss as the VectorStore, update the Path value so that the index name at the end of the path matches the new index name.


  2. When you upload knowledge base files on the Upload tab, you can select Index Name. After the upload is complete, the files are saved under the selected index.

  3. When you perform a conversation on the Chat tab, select the corresponding index name. The system uses the knowledge base files under the index for knowledge Q&A, thereby achieving data isolation for different knowledge bases.

References

You can also use EAS to deploy the following items:

  • Deploy an LLM application that can be called by using the web UI or API operations. After the LLM application is deployed, use the LangChain framework to integrate enterprise knowledge bases into the LLM application and implement intelligent Q&A and automation features. For more information, see Deploy an LLM Application in 5 Minutes by Using EAS.

  • Deploy an AI video generation model service by using ComfyUI and Stable Video Diffusion models. This helps you complete tasks such as short video generation and animation on social media platforms. For more information, see AI Video Generation - ComfyUI Deployment.

  • Deploy a model service based on Stable Diffusion WebUI by configuring a few parameters. For more information, see AI Painting - SDWebUI Deployment.

  • When you deploy a RAG-based LLM chatbot, only the fees related to EAS resources are charged. For more information about billing, see Model Online Service (EAS) Billing Description. If you use other products such as Bailian, vector databases, or Object Storage Service (OSS) during the use of the RAG-based LLM chatbot, they are billed separately in the corresponding products.