Hologres:Use Hologres, DeepSeek, and PAI EAS to deploy a RAG-based chatbot

Last Updated:May 27, 2025

Hologres is deeply integrated with Proxima, a high-performance vector computing software library independently developed by DAMO Academy, to provide high-performance, low-latency, and easy-to-use vector computing capabilities. This topic describes how to use Elastic Algorithm Service (EAS) of Platform for AI (PAI) and DeepSeek to deploy a Retrieval-Augmented Generation (RAG)-based chatbot and associate this chatbot with a Hologres instance. This topic also provides an overview of the basic features of the RAG-based chatbot and the high-performance vector capabilities of Hologres.

Background information

Limits of LLMs

  • Domain knowledge limits: In most cases, large language models (LLMs) are trained on large-scale general-purpose datasets. As a result, LLMs struggle to provide in-depth and targeted processing for specialized vertical fields.

  • Information update delay: The static nature of the training datasets prevents LLMs from accessing and incorporating real-time information and knowledge updates.

  • Misleading outputs: LLMs are prone to hallucinations, producing outputs that appear plausible but are factually incorrect. This is attributed to factors such as data bias and inherent model limits.

RAG technology

With the rapid development of AI technology, generative AI has made remarkable achievements in fields such as text generation and image generation. RAG technology emerged to address the limits of LLMs and to enhance their capabilities and accuracy.

Technology characteristics: By integrating external knowledge bases, RAG significantly reduces the amount of fictional content that LLMs generate and enhances their ability to access and apply up-to-date knowledge. This enables more personalized and precise customization of LLMs.
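The retrieve-then-generate loop that RAG adds on top of a plain LLM can be sketched as follows. This is a conceptual illustration only: the word-overlap retriever and the echo "LLM" are stand-in stubs, not the actual PAI-RAG or DeepSeek components.

```python
def retrieve(question, knowledge_base, top_k=2):
    """Toy retriever: rank knowledge-base entries by word overlap with the question.
    A real RAG service ranks by vector similarity of embeddings instead."""
    q_words = set(question.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def rag_answer(question, knowledge_base, llm):
    """RAG core loop: retrieve context, then let the LLM answer using that context."""
    context = retrieve(question, knowledge_base)
    prompt = "Context:\n" + "\n".join(context) + f"\nQuestion: {question}"
    return llm(prompt)

kb = [
    "Hologres integrates Proxima for vector search.",
    "Serverless Computing shares computing resources.",
]
echo_llm = lambda prompt: prompt  # stand-in for a real LLM call
print(rag_answer("What is Serverless Computing?", kb, echo_llm))
```

Because the retrieved context is injected into the prompt, the LLM can ground its answer in knowledge it was never trained on, which is exactly what mitigates the three limits listed above.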

Core components of a RAG-based chatbot

  • Hologres: A one-stop real-time data warehousing service developed by Alibaba Cloud. It supports online analytical processing (OLAP) of large amounts of data, as well as high-concurrency and low-latency online data serving. Hologres is also deeply integrated with Proxima, a high-performance vector computing software library independently developed by DAMO Academy, to provide high-performance, low-latency, and easy-to-use vector computing capabilities. For more information, see Proxima-based vector processing.

  • EAS of PAI: Provides scenario-specific deployment methods. You can configure parameters to deploy a RAG-based chatbot that is integrated with an LLM. This significantly shortens the service deployment duration. When you use the chatbot to perform model inference, the chatbot effectively retrieves relevant information from the knowledge base and combines the retrieved information with answers from the LLM to provide accurate and informative answers. This significantly improves the quality of Q&A and overall performance. The chatbot is suitable for Q&A, summarization, and other natural language processing (NLP) tasks that rely on specific knowledge bases.

  • DeepSeek: An LLM developed by DeepSeek based on the mixture of experts (MoE) architecture. It is designed to support efficient inference and retrieval tasks, helping users quickly build and scale LLM applications. EAS of PAI currently supports one-click deployment of DeepSeek.

Case effect

After you build a RAG-based chatbot by using Hologres, DeepSeek, and EAS of PAI, you can converse with it by using different strategies. The following examples show the question and answer for each strategy.

  • Retrieval

    Question: Serverless Computing

    Answer: The content related to Serverless Computing in the vector database is returned item by item.

  • LLM

    Question: Can a Fixed Plan use Hologres Serverless Computing for acceleration?

    Answer: A Fixed Plan is typically a basic plan that may not include advanced features such as acceleration with Hologres Serverless Computing. If you need to use Hologres Serverless Computing for acceleration, it is recommended to upgrade to a higher-tier plan, such as Pro Plan or Enterprise Plan, which usually offers more features and higher performance. If you have further questions or need more information, feel free to let me know.

  • Chat (Knowledge Base)

    Question: Same question as the LLM strategy.

    Answer: Hologres Serverless Computing cannot directly accelerate online service scenarios of a Fixed Plan, such as point queries in a Fixed Plan. This is because the computing resources in Serverless Computing are shared and unable to support real-time online service scenarios. For latency-sensitive scenarios like point queries in a Fixed Plan, it is recommended to use exclusive computing resources.

For the same question, the RAG-based chatbot leverages the domain-specific information in the Hologres vector database: it retrieves relevant content from the knowledge base and combines it with the LLM's response. This way, it generates accurate, information-rich answers, significantly improving the quality of responses and overall performance.

Key advantages of Hologres in building a RAG-based chatbot:

  • Real-time vector retrieval capability

    Deeply integrated with Proxima, Hologres supports efficient vector similarity searches.

  • Unified multimodal data processing

    Hologres supports joint queries of traditional structured data and vector data, enabling hybrid retrieval in the knowledge base.

  • Support for ultra-large-scale knowledge bases

    Hologres is capable of handling storage and retrieval needs for enterprise-level ultra-large-scale knowledge bases, with dynamic scaling to adapt to rapid business growth.

  • Enterprise-level reliability

    Hologres uses distributed transactions to guarantee atomic and consistent knowledge base updates, ensuring high availability and meeting SLA requirements in production environments.

  • Seamless ecosystem integration

    Hologres is deeply integrated with EAS of PAI and seamlessly connected with Alibaba Cloud PAI to support end-to-end RAG pipeline construction.
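Conceptually, the vector similarity search that Proxima accelerates boils down to ranking stored embeddings by their similarity to a query embedding. The brute-force pure-Python sketch below is for illustration only: Proxima replaces this O(n) linear scan with approximate nearest-neighbor indexes, which is what makes real-time retrieval over large knowledge bases feasible.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query, vectors, k=1):
    """Brute-force nearest-neighbor search; an ANN index avoids scanning every vector."""
    ranked = sorted(vectors.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy 2-dimensional embeddings; real embedding dimensions are in the hundreds.
embeddings = {"doc_a": [1.0, 0.0], "doc_b": [0.7, 0.7], "doc_c": [0.0, 1.0]}
print(top_k([0.9, 0.1], embeddings, k=2))  # → ['doc_a', 'doc_b']
```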

Prerequisites

A virtual private cloud (VPC), a vSwitch, and a security group are created. For more information, see Create a VPC with an IPv4 CIDR block and Create a security group.

Note

The Hologres instance and the RAG-based chatbot must be deployed in the same VPC.

Procedure

Step 1: Prepare a Hologres vector database

  1. Purchase a Hologres instance and create a database. For more information, see Purchase a Hologres instance and Create a database.

    Note
    • After you create an account, you must grant database-related permissions to the account. For more information, see Hologres permission models. You can check whether the permissions are granted by following the instructions in Connect to HoloWeb and perform queries.

    • We recommend that you use the simple permission model (SPM) to grant the permissions at or above the developer level to the account.

  2. Obtain the Hologres instance endpoint, which is required in Step 2.

    1. Log on to the Hologres console.

    2. In the left-side navigation pane, click Instances.

    3. On the page that appears, find the desired instance and click the name of the instance. In the Network Information section of the Instance Details page, find Select VPC and click Copy in the Endpoint column to copy the Hologres instance endpoint.

Step 2: Deploy a RAG-based chatbot that uses DeepSeek

  1. Log on to the PAI console.

  2. In the left-side navigation pane, click Workspaces.

  3. On the Workspaces page, click Create Workspace. On the Create Workspace page, configure parameters to create a workspace. For more information, see Create and manage a workspace.

  4. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.

  5. On the Inference Service tab, click Deploy Service. On the Deploy Service page, click RAG-based Smart Dialogue Deployment in the Scenario-based Model Deployment section.

  6. On the RAG-based LLM Chatbot Deployment page, configure the parameters. The following table describes the parameters.

    Basic Information

    • Version: The deployment version. Valid values:

      • LLM-Integrated Deployment: Deploy a RAG service that is integrated with an LLM.

      • LLM-Separated Deployment: Deploy only a RAG service. The RAG service allows for easy connection to and flexible replacement of an LLM. This deployment version provides higher flexibility.

      In this example, LLM-Integrated Deployment is selected. You can select a deployment version based on your business requirements.

    • Model Type: In this example, a DeepSeek series model is used. You can select an open source model based on your business scenario.

    Resource Deployment

    • Resource Configuration: The system recommends appropriate resource specifications based on the selected model type. If you use other resource specifications, the model service may fail to start.

    Vector Database Settings

    • Vector Database Type: Select Hologres.

    • Invocation Information: The host information of the specified VPC. Go to the Instance Details page in the Hologres console. In the Network Information section, find Select VPC and click Copy in the Endpoint column to copy the instance endpoint. The host information is the instance endpoint with :80 removed from the end.

    • Database Name: The name of the database in the Hologres instance. For more information about how to create a database, see Create a database.

    • Account: The custom account that you created. For more information, see User management.

    • Password: The password of the custom account that you created.

    • Table Name: The name of the table. You can enter a new table name or an existing table name. A test table named feature_tb is used in this example.

      • If you enter a table name that does not exist, PAI-RAG automatically creates the corresponding vector index table. We recommend this option.

      • If you enter an existing table name, the table schema must meet the requirements of PAI-RAG. For example, you can enter the name of a Hologres table that was automatically created when you previously deployed a RAG-based chatbot by using EAS.

    VPC

    • VPC, vSwitch, and Security Group Name: We recommend that you use a private network for access. To do so, make sure that the VPC configured in EAS is the same as the VPC where the Hologres instance is deployed. For more information about how to create a VPC, a vSwitch, and a security group, see Create and manage a VPC and Create a security group.

  7. After you configure the parameters, click Deploy. When the service status changes to Running, the RAG-based chatbot is deployed.
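The Invocation Information parameter above expects the instance endpoint with the trailing :80 removed. A tiny helper makes that rule concrete; the function name is illustrative, and the endpoint string below is made up to match the documented format.

```python
def host_from_endpoint(endpoint: str) -> str:
    """Strip a trailing ':80' port so the value matches what the
    Invocation Information field expects (helper name is illustrative)."""
    return endpoint[:-3] if endpoint.endswith(":80") else endpoint

# A made-up endpoint in the documented format:
print(host_from_endpoint("hgprecn-cn-example-vpc-st.hologres.aliyuncs.com:80"))
# → hgprecn-cn-example-vpc-st.hologres.aliyuncs.com
```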


Step 3: Perform model inference on the web UI

After the RAG-based chatbot is deployed, click the Inference Service tab on the Elastic Algorithm Service (EAS) page. Find the service and click View Web App in the Service Type column to start the web UI.

Perform the following operations to debug the service on the web UI.

  1. Configure parameters related to the vector database and LLMs.

    On the Settings tab, you can modify embedding-related parameters and LLM-related parameters. After you configure the parameters, click Update Index to save the configurations.

    • Embedding-related parameters

      • Index Name: The system can update existing indexes. You can select New from the Index Name drop-down list and specify an index name in the New Index Name field to isolate data of different knowledge bases. For more information, see How to Isolate Knowledge Base Data Using the RAG Service?

      • Embedding Type: The huggingface and dashscope models are supported. In this example, huggingface is selected.

        • huggingface: The system provides built-in embedding models for you to choose.

        • dashscope: Models supported by Alibaba Cloud Model Studio are available. By default, the text-embedding-v2 model is used. For more information, see Embedding.

        Note: If you select dashscope, you must configure a public network connection for EAS and configure an API key for Model Studio. You are separately billed for calling models supported by Model Studio. For more information, see Billable items.

      • Embedding Dimension: The output vector dimension, which directly affects model performance. After you select an embedding model, the system automatically configures this parameter. No manual operation is required.

      • Embedding Batch Size: The batch size used when computing embeddings, that is, how many chunks are processed per batch.

    • LLM-related parameters

      • If you select LLM-Integrated Deployment for Version when you deploy the RAG-based chatbot, an LLM is integrated in the RAG service. In this case, you do not need to configure LLM-related parameters.

      • If you select LLM-Separated Deployment for Version, perform the following steps to obtain the endpoint and token. Then, configure the required parameters.

        1. On the Inference Service tab of the Elastic Algorithm Service (EAS) page, click the name of the service that you deployed.

        2. On the Overview tab, click View Endpoint Information in the Basic Information section.

        3. In the Invocation Method dialog box, obtain the endpoint and token based on the connection method.
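The Embedding Batch Size setting above controls how many chunks are embedded per call. The batching loop can be sketched as follows; the stub embedder and its 3-dimension output are illustrative assumptions standing in for the real embedding model.

```python
def embed_in_batches(chunks, embed_fn, batch_size=4):
    """Embed text chunks batch by batch, as controlled by Embedding Batch Size."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_fn(chunks[i:i + batch_size]))
    return vectors

# Stub embedder: maps each chunk to a fixed-dimension vector (dimension 3 here;
# the real Embedding Dimension is set automatically from the chosen model).
stub_embed = lambda batch: [[float(len(text)), 0.0, 0.0] for text in batch]
print(embed_in_batches(["a", "bb", "ccc", "dddd", "e"], stub_embed, batch_size=2))
# five vectors, one per chunk, produced in batches of two
```

Larger batches reduce per-call overhead at the cost of memory, which is why the service exposes this as a tunable parameter.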

  2. Upload a business data file.

    On the Upload tab, you can configure the semantic-based chunking parameters and upload your business data file. The system automatically completes embedding and stores the embedded data in the Hologres vector database.

    1. Configure semantic-based chunking parameters.

      You can configure the following parameters to specify the chunk size and extract Q&A information.

      • Chunk Size: The size of each chunk. Unit: bytes. Default value: 500.

      • Chunk Overlap: The portion of overlap between adjacent chunks. Default value: 10.

      • Process with MultiModal: Specifies whether to use multimodal model processing. If you select this option, images in PDF, Word, and Markdown files can be processed. A file in the TXT format is used in this example, so you do not need to select this option.

      • Process PDF with OCR: Specifies whether to parse PDF files in OCR mode.

    2. Upload a business data file.

      In this example, the test file rag_hologres.txt is used.

      • The following file formats are supported: TXT, PDF, XLSX, XLS, CSV, DOCX, DOC, MD, and HTML.

      • You can upload local files, local directories, or objects from OSS.

      After you upload the business data file, the system performs data cleansing (including text extraction and hyperlink replacement) and semantic-based chunking, and then stores the processed data in the Hologres vector database.

      The following figure shows the data in the Hologres vector database. You can log on to the HoloWeb console to query the data. For more information, see Connect to HoloWeb and perform queries.

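The Chunk Size and Chunk Overlap parameters in this step govern how text is split before embedding. The real service performs semantic-based chunking; the character-level sketch below (function name illustrative) only shows how size and overlap interact.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=10):
    """Split text into fixed-size chunks where adjacent chunks share
    `chunk_overlap` characters, mirroring the Chunk Size / Chunk Overlap settings.
    Real semantic chunking also respects sentence and section boundaries."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("x" * 1200, chunk_size=500, chunk_overlap=10)
print([len(c) for c in chunks])  # → [500, 500, 220]
```

The overlap ensures that a sentence falling on a chunk boundary still appears intact in at least one chunk, which improves retrieval recall.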

  3. Configure model inference parameters.

    On the Chat tab, you can configure parameters for vector retrieval and model inference.

    • Policy parameters

      • Retrieval: Returns the top K relevant results from the vector database.

      • LLM: Uses the answer from the LLM.

      • Chat (Web Search): Determines whether an Internet search is needed based on the user question. If Internet search is required, the chatbot imports both the search results and the user question into the LLM. To support Internet search, you must configure a public network connection for EAS.

      • Chat (Knowledge Base): Merges the results retrieved from the vector database with the user question, populates them into the selected prompt template, and then submits the prompt to the LLM to generate an answer.

    • General parameters

      • Streaming Output: Specifies whether to generate results in streaming mode. Selected by default.

      • Need Citation: Specifies whether to include citations in answers.

      • Inference with multi-modal LLM: Specifies whether to display images when you use a multimodal LLM.

    • Vector retrieval parameters. The following vector retrieval methods are supported:

      • Embedding Only: Uses vector database-based retrieval.

      • Keyword Only: Uses keyword-based retrieval.

      • Hybrid: Uses the multi-channel recall fusion of vector database-based retrieval and keyword-based retrieval.

    • LLM-related parameters

      Temperature: Controls the randomness of the generated content. Lower temperature values result in more deterministic and fixed outputs, while higher temperatures lead to more diverse and creative results.
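The Hybrid retrieval method above merges vector-based and keyword-based rankings. The fusion formula used by PAI-RAG is not specified here; reciprocal rank fusion (RRF) is one common way to combine multiple ranked lists, sketched below as an assumption-labeled illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists: each document scores 1 / (k + rank)
    in every list it appears in, and the summed scores give the fused order.
    k=60 is the conventional default from the RRF literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc2", "doc1", "doc3"]   # ranking from embedding retrieval
keyword_hits = ["doc1", "doc4"]          # ranking from keyword retrieval
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# → ['doc1', 'doc2', 'doc4', 'doc3']
```

doc1 wins because it appears near the top of both channels, which is the intuition behind multi-channel recall fusion.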

  4. Perform model inference.

    On the Chat tab, select a knowledge base index by specifying the Index Name parameter and configure a Q&A policy. You can use multiple methods to perform inference verification on the LLM and optimize the chatbot.

    • Retrieval: Enter the keyword Serverless Computing as a question.


    • LLM: Enter Can I use Hologres Serverless Computing to accelerate queries when using a Fixed Plan? as a question.


    • Chat (Knowledge Base): Enter Can I use Hologres Serverless Computing to accelerate queries when using a Fixed Plan? as a question.


Step 4: Call an API operation to perform model inference

After you test the Q&A performance of the RAG-based chatbot on the web UI, you can perform the following steps to call an API operation provided by PAI to apply the RAG-based chatbot to your business system.

  1. Obtain the invocation information of the RAG-based chatbot.

    1. Click the name of the RAG-based chatbot to go to the Overview tab.

    2. In the Basic Information section, click View Endpoint Information.

    3. In the Invocation Method dialog box, obtain the endpoint and token of the model service based on your network environment.

  2. Call an API operation. For more information, see Call the API of a RAG-based LLM chatbot.
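Once you have the endpoint and token, a call from your business system looks roughly like the following. The /service/query path and the JSON body shape are illustrative assumptions, not the documented contract; use the request format from the API reference linked above. The sketch only builds the request so it stays runnable offline.

```python
import json
import urllib.request

def build_rag_request(endpoint, token, question):
    """Build (but do not send) an HTTP request to the RAG service.
    The '/service/query' path and payload shape are illustrative assumptions;
    check the service's API reference for the real format."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/service/query",
        data=payload,
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )

req = build_rag_request("http://example-rag.pai-eas.example.com", "<EAS token>",
                        "Can a Fixed Plan use Serverless Computing?")
print(req.full_url, req.get_method())
# To actually send it: urllib.request.urlopen(req) — requires network access to the service.
```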

References