ApsaraDB RDS: Build an LLM-driven dedicated chatbot on top of ApsaraDB RDS for PostgreSQL

Last Updated: May 11, 2024

With the rise of ChatGPT, large language models (LLMs) and generative artificial intelligence (AI) have started to play important roles in various fields, such as writing, image generation, code optimization, and information retrieval. LLMs are helpful to both individual users and enterprises: they accelerate ecosystem innovation and lead the way in building super applications. This topic describes how to build a dedicated chatbot on top of ApsaraDB RDS for PostgreSQL.

Background information

An increasing number of enterprises and individual users want to use LLMs and generative AI to build AI-powered applications that are dedicated to specific business scenarios. LLMs deliver excellent performance on general questions. However, due to the limits of their training corpora and of how large models generate text, LLMs cannot fully meet requirements for timeliness and domain expertise. In the information age, enterprises frequently update their knowledge bases. The knowledge bases in industry verticals, which include documents, images, audio files, and videos, may be confidential or cannot be disclosed to the public. If an enterprise wants to use LLMs to build AI-powered applications for industry verticals, the enterprise must constantly feed its knowledge base into the LLMs for training.

The following list describes the common training approaches:

  • Fine-tuning: New datasets are provided to fine-tune the weights of an existing model, and the datasets are constantly updated to obtain the required effect. This approach is suitable for task-specific models or models that are trained on small datasets. However, the approach is costly and requires intensive training.

  • Prompt-tuning: Prompts instead of model weights are adjusted to obtain the required effect. Prompt-tuning incurs lower computing costs, requires fewer resources and less training time, and is more flexible than fine-tuning.

The following list describes the advantages of building a chatbot on top of ApsaraDB RDS for PostgreSQL:

  • ApsaraDB RDS for PostgreSQL provides the pgvector extension, which can be used to translate real-time data or expertise in industry verticals into vector embeddings. ApsaraDB RDS for PostgreSQL stores the embeddings and accelerates vector-based searches, which improves the accuracy of Q&A about a specific private domain. A minimal pgvector usage sketch is provided after this list.

  • ApsaraDB RDS for PostgreSQL is an advanced open source online transaction processing (OLTP) database system that can simultaneously execute online database transactions and data storage tasks. For example, it can store chat interactions, chat histories, and the time spent in chats alongside the vector data. ApsaraDB RDS for PostgreSQL is a popular database system that provides various features, which facilitates the building of private domain services and keeps their architectures easy to use.

  • The pgvector extension has been widely adopted in the developer community and in open source PostgreSQL databases. In addition, tools such as ChatGPT Retrieval Plugin added support for PostgreSQL databases early on. ApsaraDB RDS for PostgreSQL has been extensively adopted, fully supports vector-based searches, and provides abundant tools and resources.
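
The following minimal sketch shows how pgvector storage and cosine-distance search work. It uses psycopg2, toy 3-dimensional vectors, and a hypothetical demo_items table; the connection parameters are placeholders that you must replace with the values of your RDS instance.

    import psycopg2

    # Placeholder connection parameters of your RDS instance.
    conn = psycopg2.connect(database="<Database name>",
                            host="<Endpoint of the RDS instance>",
                            user="<Username>",
                            password="<Password>",
                            port="<Database port number>")
    conn.autocommit = True
    cur = conn.cursor()

    # Enable the extension and create a small table with a 3-dimensional vector column.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("CREATE TABLE IF NOT EXISTS demo_items (id serial PRIMARY KEY, embedding vector(3));")

    # Store two toy embeddings.
    cur.execute("INSERT INTO demo_items (embedding) VALUES ('[1,0,0]'), ('[0,1,0]');")

    # <=> is the cosine distance operator, so 1 - distance is the cosine similarity.
    cur.execute("SELECT id, 1 - (embedding <=> '[1,0,0]') AS similarity "
                "FROM demo_items ORDER BY embedding <=> '[1,0,0]' LIMIT 1;")
    print(cur.fetchone())  # The row that stores [1,0,0] is returned with a similarity of 1.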

This topic provides an example of how to build a dedicated chatbot by using the pgvector extension and OpenAI embeddings. The pgvector extension is provided by ApsaraDB RDS for PostgreSQL to perform vector similarity searches. For more information about the extension, see Use the pgvector extension to perform high-dimensional vector similarity searches.

Prerequisites

  • An ApsaraDB RDS for PostgreSQL instance that meets the following requirements is created:

    • The RDS instance runs PostgreSQL 14 or later.

    • The minor engine version of the RDS instance is 20230430 or later.

    Note

    For more information about how to upgrade the major engine version or update the minor engine version of an RDS instance, see Upgrade the major engine version or Update the minor engine version.

  • You are familiar with the usage and basic terms of the pgvector extension. For more information, see Use the pgvector extension to perform high-dimensional vector similarity searches.

  • An OpenAI secret API key is obtained, and the OpenAI service is accessible from your network environment. The sample code in this topic is deployed on an Elastic Compute Service (ECS) instance in the Singapore region.

  • A Python development environment is deployed. In this topic, Python 3.11.4 and PyCharm 2023.1.2 are used.

Terms

embedding

Embedding is a method that translates high-dimensional data into a low-dimensional space. In machine learning and natural language processing (NLP), embedding is a common method that is used to represent sparse symbols or objects as continuous vectors.

In NLP, word embedding is a method that represents words as real-valued vectors so that computers can better understand text. Word embedding translates the semantics and grammatical relations of words into a vector space.

OpenAI provides an API that generates embeddings for text.
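
As a quick illustration, the following sketch embeds two short texts and computes their cosine similarity; semantically related texts yield a value close to 1. It assumes the pre-1.0 openai Python SDK that is used in the samples of this topic, together with numpy.

    import numpy as np
    import openai

    openai.api_key = '<Secret API Key>'

    # Embed two short texts with the same model that is used later in this topic.
    resp = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=["How do I create an RDS instance?",
               "Steps to create an ApsaraDB RDS for PostgreSQL instance"],
    )
    v1 = np.array(resp['data'][0]['embedding'])
    v2 = np.array(resp['data'][1]['embedding'])

    # Cosine similarity of the two embeddings.
    similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print(similarity)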

Implementation principles

This section describes the phases to build a dedicated chatbot:

Phase 1: Data preparation

  1. Extract the text from a domain-specific knowledge base and split it into chunks. For example, you can split a large piece of text into paragraphs or sentences, and extract keywords or entities. This facilitates the arrangement and management of the data in the knowledge base.

  2. Call an LLM operation, such as the embedding operation provided by OpenAI, to generate text embeddings from the text chunks. The embeddings capture the semantic and contextual information of the text for subsequent search and matching.

  3. Store the generated text embeddings, text chunks, and text metadata to your RDS instance.

Phase 2: Q&A

  1. Raise a question.

  2. Call the embedding operation that is provided by OpenAI to create an embedding for the question.

  3. Use the pgvector extension to search the RDS instance for text chunks whose similarity to the question is greater than a specified threshold and obtain the result.

The following figure shows the process.


Procedure

Phase 1: Data preparation

In this example, the text of Create an ApsaraDB RDS for PostgreSQL instance is split and stored in your RDS instance. In actual scenarios, you must prepare your own domain-specific knowledge base.

In this phase, the most important task is to translate the domain-specific knowledge base into text embeddings, and to store and match the embeddings in an efficient manner. This way, you can obtain domain-specific, high-quality answers and suggestions by leveraging the powerful semantic understanding capabilities of LLMs. Some existing open source frameworks, such as LangChain and the open source ChatGPT Retrieval Plugin from OpenAI, allow you to upload and parse knowledge base files in formats such as URL, Markdown, PDF, and Word in a convenient manner. LangChain and ChatGPT Retrieval Plugin can use PostgreSQL databases that support the pgvector extension as backend vector databases, which facilitates their integration with the RDS instance. After the integration, you can prepare the domain-specific knowledge base in an efficient manner and make the most of the vector indexes and similarity searches that are provided by the pgvector extension to implement efficient text matching and queries.

  1. Connect to the RDS instance.

  2. Create a test database. Example: rds_pgvector_test.

    CREATE DATABASE rds_pgvector_test;
  3. Go to the test database and create the pgvector extension.

    CREATE EXTENSION IF NOT EXISTS vector;
  4. Create a test table to store the data of the knowledge base. Example: rds_pg_help_docs.

    CREATE TABLE rds_pg_help_docs (
      id bigserial PRIMARY KEY, 
      title text,			-- Title
      description text, 		-- Description
      doc_chunk text, 		-- Text chunk
      token_size int, 		-- Number of tokens in a text chunk
      embedding vector(1536));	-- Text embedding
  5. Create an index for the embedding column to optimize and accelerate queries.

    CREATE INDEX ON rds_pg_help_docs USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
    Note

    For more information about how to create indexes for vectors, see Use the pgvector extension to perform high-dimensional vector similarity searches.

  6. Create a project in PyCharm, open the terminal, and then run the following command to install the required dependencies:

    pip install openai psycopg2 tiktoken requests beautifulsoup4 numpy
  7. Create a .py file to split the data in the knowledge base and store the split data in the RDS instance. In this example, the knowledge_chunk_storage.py file is created. Sample code:

    Note

    In the following sample code, the custom splitting method only splits the data in the knowledge base into chunks with a fixed number of characters. You can also use the splitting methods that are provided by LangChain or by the open source ChatGPT Retrieval Plugin from OpenAI. The quality of the data in the knowledge base and the chunking result have a great impact on the final result. A token-based splitting sketch is provided after this procedure.

    import openai
    import psycopg2
    import tiktoken
    import requests
    from bs4 import BeautifulSoup
    
    # text-embedding-ada-002 produces 1536-dimensional embeddings, which matches vector(1536) in the table definition.
    EMBEDDING_MODEL = "text-embedding-ada-002"
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Connect to an RDS instance.
    conn = psycopg2.connect(database="<Database name>",
                            host="<Endpoint of the RDS instance>",
                            user="<Username>",
                            password="<Password>",
                            port="<Database port number>")
    conn.autocommit = True
    
    # API key of OpenAI
    openai.api_key = '<Secret API Key>'
    
    # Custom splitting method (The following method is an example.)
    def get_text_chunks(text, max_chunk_size):
        chunks_ = []
        soup_ = BeautifulSoup(text, 'html.parser')
    
        content = ''.join(soup_.strings).strip()
    
        length = len(content)
        start = 0
        while start < length:
            end = start + max_chunk_size
            if end >= length:
                end = length
    
            chunk_ = content[start:end]
            chunks_.append(chunk_)
    
            start = end
    
        return chunks_
    
    
    # Specify the web page to be split.
    url = 'https://help.aliyun.com/document_detail/148038.html'
    
    response = requests.get(url)
    if response.status_code == 200:
        # Obtain the data on the web page.
        web_html_data = response.text
        soup = BeautifulSoup(web_html_data, 'html.parser')
        # Obtain the title with the H1 tag.
        title = soup.find('h1').text.strip()
    
        # Obtain the description (The description is enclosed in p tags and belongs to the shortdesc class.)
        description = soup.find('p', class_='shortdesc').text.strip()
    
        # Split and store the data.
        chunks = get_text_chunks(web_html_data, 500)
        for chunk in chunks:
            doc_item = {
                'title': title,
                'description': description,
                'doc_chunk': chunk,
                'token_size': len(tokenizer.encode(chunk))
            }
    
            query_embedding_response = openai.Embedding.create(
                model=EMBEDDING_MODEL,
                input=chunk,
            )
    
            doc_item['embedding'] = query_embedding_response['data'][0]['embedding']
    
            cur = conn.cursor()
            insert_query = '''
            INSERT INTO rds_pg_help_docs 
                (title, description, doc_chunk, token_size, embedding) VALUES (%s, %s, %s, %s, %s);
            '''
    
            cur.execute(insert_query, (
                doc_item['title'], doc_item['description'], doc_item['doc_chunk'], doc_item['token_size'],
                doc_item['embedding']))
    
            conn.commit()
    
    else:
        print('Failed to fetch web page')
  8. Run the Python program.

  9. Log on to the database and execute the following statement to check whether the data in the knowledge base is split and stored as vectors:

    SELECT * FROM rds_pg_help_docs;

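As mentioned in the note of Step 7, the fixed-length character splitter is only an example. The following sketch shows one possible token-based alternative that reuses the tiktoken tokenizer installed for this topic. The get_text_chunks_by_token name and the 400-token chunk size are arbitrary choices that you can adapt.

    import tiktoken

    tokenizer = tiktoken.get_encoding("cl100k_base")

    def get_text_chunks_by_token(text, max_tokens=400):
        # Split plain text into chunks that each contain at most max_tokens tokens.
        tokens = tokenizer.encode(text)
        chunks = []
        for start in range(0, len(tokens), max_tokens):
            chunks.append(tokenizer.decode(tokens[start:start + max_tokens]))
        return chunks

Splitting by tokens keeps the token_size column consistent with how the embedding model counts input length, which makes it easier to stay within model limits.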

Phase 2: Q&A

  1. In the Python project, create a .py file, define a question in the file, and compare the similarity between the question and the data in the knowledge base of the RDS instance. In this example, the chatbot.py file is created. Sample code:

    import openai
    import psycopg2
    from psycopg2.extras import DictCursor
    
    GPT_MODEL = "gpt-3.5-turbo"
    EMBEDDING_MODEL = "text-embedding-ada-002"
    GPT_COMPLETIONS_MODEL = "text-davinci-003"
    MAX_TOKENS = 1024
    
    # API key of OpenAI
    openai.api_key = '<Secret API Key>'
    
    prompt = 'Create an RDS instance.'
    
    prompt_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=prompt,
    )
    prompt_embedding = prompt_response['data'][0]['embedding']
    
    # Connect to an RDS instance.
    conn = psycopg2.connect(database="<Database name>",
                            host="<Endpoint of the RDS instance>",
                            user="<Username>",
                            password="<Password>",
                            port="<Database port number>")
    conn.autocommit = True
    
    
    def answer(prompt_doc, prompt):
        improved_prompt = f"""
        Answer the following questions based on the following file and steps:
        a. Check whether the content in the file is related to the question.
        b. Reference the content in the file to answer the question and provide the detailed answer as a Markdown file.
        c. Use "I don't know because the question is beyond my scope" to answer the questions that are irrelevant to ApsaraDB RDS for PostgreSQL.
    
        File:
        \"\"\"
        {prompt_doc}
        \"\"\"
    
        Question: {prompt}
        """
    
        response = openai.Completion.create(
            model=GPT_COMPLETIONS_MODEL,
            prompt=improved_prompt,
            temperature=0.2,
            max_tokens=MAX_TOKENS
        )
    
        print(f"{response['choices'][0]['text']}\n")
    
    
    similarity_threshold = 0.78
    max_matched_doc_counts = 8
    
    # Use the pgvector extension to search for text chunks whose similarity is greater than the specified threshold.
    # The <=> operator returns the cosine distance, so 1 - (embedding <=> ...) is the cosine similarity.
    similarity_search_sql = f'''
    SELECT doc_chunk, token_size, 1 - (embedding <=> '{prompt_embedding}') AS similarity 
    FROM rds_pg_help_docs WHERE 1 - (embedding <=> '{prompt_embedding}') > {similarity_threshold} ORDER BY id LIMIT {max_matched_doc_counts};
    '''
    
    cur = conn.cursor(cursor_factory=DictCursor)
    cur.execute(similarity_search_sql)
    matched_docs = cur.fetchall()
    
    total_tokens = 0
    prompt_doc = ''
    print('Answer: \n')
    for matched_doc in matched_docs:
        if total_tokens + matched_doc['token_size'] <= 1000:
            prompt_doc += f"\n---\n{matched_doc['doc_chunk']}"
            total_tokens += matched_doc['token_size']
            continue
    
        answer(prompt_doc,prompt)
    
        total_tokens = 0
        prompt_doc = ''
    
    answer(prompt_doc,prompt)
  2. Run the Python program and view the answer.

    Note

    To obtain a more accurate and complete answer, you can optimize the splitting method and the question prompts. A variant of the answer() function that calls the chat completions endpoint is sketched after this procedure.

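The answer() function above calls the legacy text-davinci-003 completions model. If you prefer the chat completions endpoint, for example the gpt-3.5-turbo model that is already defined as GPT_MODEL in chatbot.py, a minimal variant might look like the following sketch. It assumes that it is added to chatbot.py, so that openai, GPT_MODEL, and MAX_TOKENS are already defined, and it still uses the pre-1.0 openai Python SDK.

    def answer_with_chat(prompt_doc, prompt):
        improved_prompt = f"""
        Answer the following questions based on the following file and steps:
        a. Check whether the content in the file is related to the question.
        b. Reference the content in the file to answer the question and provide the detailed answer as a Markdown file.
        c. Use "I don't know because the question is beyond my scope" to answer the questions that are irrelevant to ApsaraDB RDS for PostgreSQL.

        File:
        \"\"\"
        {prompt_doc}
        \"\"\"

        Question: {prompt}
        """

        # The chat completions endpoint takes a list of messages instead of a single prompt string.
        response = openai.ChatCompletion.create(
            model=GPT_MODEL,  # "gpt-3.5-turbo"
            messages=[{"role": "user", "content": improved_prompt}],
            temperature=0.2,
            max_tokens=MAX_TOKENS,
        )
        print(f"{response['choices'][0]['message']['content']}\n")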

Summary

If a dedicated knowledge base is not connected, the answer from OpenAI to the question "How do I create an ApsaraDB RDS for PostgreSQL instance?" is irrelevant to Alibaba Cloud.


After a dedicated knowledge base that is stored in your RDS instance is connected, the answer to the question "How do I create an ApsaraDB RDS for PostgreSQL instance?" is closely relevant to ApsaraDB RDS for PostgreSQL instances.

In summary, ApsaraDB RDS for PostgreSQL is fully capable of supporting an LLM-driven dedicated knowledge base for industry verticals.