All Products
Search
Document Center

ApsaraDB RDS:Use ApsaraDB RDS for PostgreSQL to create a RAG application

Last Updated:Mar 28, 2026

ApsaraDB RDS for PostgreSQL combines vector storage with full-text search, making it a capable vector database for retrieval-augmented generation (RAG) applications. This guide walks through building a ticket chatbot to demonstrate how multi-path recall, result merging, and Q&A generation work together in a single PostgreSQL instance.

After completing this guide, you will understand:

  • How to ingest and structure documents for a RAG pipeline

  • How each of the four recall methods works and when to use each

  • How to merge and rerank results using the reciprocal rank fusion (RRF) algorithm and the bce-reranker-base_v1 model

How it works

The ticket chatbot pipeline has four stages:

  1. Data processing — Split source documents (help docs, knowledge bases, historical tickets) into chunks and generate embeddings. Store both in the RDS instance.

  2. Multi-path recall — Run four parallel searches against user questions: document keyword-based, content keyword-based, BM25-based, and embedding-based.

  3. Result merging — Merge and rerank results from all four paths using the RRF algorithm and the bce-reranker-base_v1 model.

  4. Q&A analysis — Score Q&A pairs during testing to evaluate and compare retrieval policies.

Prerequisites

Before you begin, make sure you have:

  • An ApsaraDB RDS for PostgreSQL instance

  • A privileged database account to install extensions

  • An Alibaba Cloud Model Studio account and API key (for the text embedding model). See Get an API key

  • A NAT gateway configured for the virtual private cloud (VPC) where the RDS instance resides, so the instance can reach external models. See Use the rds_embedding extension to generate vectors

Data processing

Acquire source data

Collect data based on the purpose of your RAG application. The ticket chatbot in this guide uses three sources:

  • Help documentation

  • Knowledge base articles

  • Historical support tickets

Segment and embed documents

Split documents into chunks before embedding them. Use the LangChain HTMLHeaderTextSplitter class to segment help documents by HTML hierarchy (H1, H2). Configure the chunk size and overlap to balance retrieval precision and recall. See LangChain text splitters for the full list of splitter options.

For Markdown documents, use MarkdownHeaderTextSplitter with heading flags such as # and ## for hierarchical splitting.

Store data

Store the processed data in two core tables: document and embedding.

SQL statements for creating a trigger

The document table stores document-level metadata and keywords used for full-text matching.

ColumnTypeDescription
idbigintPrimary key, auto-incremented
titlevarchar(255)Document title (unique — used to deduplicate on update)
urlvarchar(255)Source URL
key_wordvarchar(255)Keywords for full-text matching
tagvarchar(255)Source tag (e.g., direct, aone); controls tsvector weight
createdtimestampCreation time
modifiedtimestampLast modification time
key_word_tsvectortsvectorWeighted tsvector built from key_word; used in keyword-based recall
product_namevarchar(255)Product name for filtering

Indexes:

IndexTypePurpose
document_pkeybtree (id)Primary key
document_title_keybtree (title), UNIQUEDeduplication — documents are updated by title
document_product_name_keybtree (product_name)Product-scoped filtering
document_key_word_tsvector_ginGIN (key_word_tsvector)Fast keyword matching

Trigger: trigger_update_tsvector runs on INSERT or UPDATE and rebuilds key_word_tsvector automatically.

The trigger assigns weights A–D based on the tag column, so direct documents rank above aone documents in keyword-based recall. Weights A, B, C, D are in descending priority:

CREATE OR REPLACE FUNCTION update_tsvector()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
        IF NEW.tag = 'direct' THEN
            NEW.key_word_tsvector := setweight(to_tsvector('jiebacfg', NEW.key_word), 'A');
        ELSIF NEW.tag = 'aone' THEN
            NEW.key_word_tsvector := setweight(to_tsvector('jiebacfg', NEW.key_word), 'B');
        ELSIF NEW.tag IS NOT NULL THEN
            NEW.key_word_tsvector := setweight(to_tsvector('jiebacfg', NEW.key_word), 'C');
        ELSE
            NEW.key_word_tsvector := setweight(to_tsvector('jiebacfg', NEW.key_word), 'D');
        END IF;
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE FUNCTION

CREATE TRIGGER trigger_update_tsvector
BEFORE INSERT OR UPDATE ON document
FOR EACH ROW
EXECUTE FUNCTION update_tsvector();
CREATE TRIGGER

SQL statements for creating a trigger

The embedding table stores the document chunks and their vector representations.

ColumnTypeDescription
idbigintPrimary key, auto-incremented
doc_idintegerForeign key to document.id
content_chunktextText chunk after segmentation
content_embeddingvector(1536)Embedding generated from content_chunk
createdtimestampCreation time
modifiedtimestampLast modification time
ts_vector_extratsvectortsvector built from content_chunk; used in content keyword-based recall

Indexes:

IndexTypePurpose
embedding_pkeybtree (id)Primary key
embedding_doc_id_keybtree (doc_id)Document join lookup
embedding_content_embedding_idxHNSW (content_embedding, vector_cosine_ops), m=16, ef_construction=64Approximate nearest neighbor (ANN) vector search
embedding_rumidxRUM (ts_vector_extra)Fast full-text search with ranking

Trigger: embedding_tsvector_update rebuilds ts_vector_extra on INSERT or UPDATE:

CREATE TRIGGER embedding_tsvector_update
BEFORE INSERT OR UPDATE ON embedding
FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger('ts_vector_extra','public.jiebacfg','content_chunk');

Multi-path recall

When to use each recall method

Different recall methods suit different query characteristics. Use this table to choose the right strategy for your use case:

MethodBest forLimitations
Document keyword-based recallQueries that match known product keywords or tags; fast and precise for exact termsMisses semantic meaning; fails on typos or synonyms
Content keyword-based recallQueries targeting document body text; benefits from RUM index for fast full-text rankingSame limitations as keyword matching
BM25-based recallQueries where term frequency and document-level importance matter; complements ts_rankPurely statistical; no semantic understanding
Embedding-based recallQueries expressed in natural language; handles synonyms and paraphrasesRequires embedding model; may miss exact keyword matches

For most RAG applications, combine all four methods and merge results using the RRF algorithm. This produces better recall than any single method alone.

Document keyword-based recall

Document keyword-based recall matches user questions against keywords stored in the document table and returns the top N documents by similarity.

Step 1: Index document keywords as weighted tsvector values.

Use to_tsvector to segment keywords and setweight to assign weights. The Chinese word segmentation extensions pg_jieba and zhparser handle Chinese text. See Manage extensions for installation instructions.

SELECT setweight(to_tsvector('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库'), 'A');
                                   setweight
-------------------------------------------------------------------------------
 'postgresql':1A '世界':3A '先进':5A '关系':8A '型':9A '开源':7A '数据库':10A

Step 2: Convert the user question to a tsquery and match against keywords.

SELECT
    id,
    title,
    url,
    key_word,
    ts_rank(
        key_word_tsvector,
        to_tsquery(replace(text(plainto_tsquery('jiebacfg', '%s')), '&', '|'))
    ) AS score
FROM
    public.document
WHERE
    key_word_tsvector @@ to_tsquery(replace(text(plainto_tsquery('jiebacfg', '%s')), '&', '|'))
    AND product_name = '%s'
ORDER BY
    score DESC
LIMIT 1;

Key functions:

to_tsquery converts a question into a tsquery value:

SELECT to_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库');
                                   to_tsquery
--------------------------------------------------------------------------------
 'postgresql' <2> '世界' <2> '先进' <2> '开源' <-> '关系' <-> '型' <-> '数据库'

<2> indicates the distance between words. <-> indicates adjacent words. Stopwords such as and are removed automatically. & means AND, | means OR, and ! means NOT.

Use plainto_tsquery instead of to_tsquery when user input may contain invalid operators:

SELECT to_tsquery('jiebacfg','日志|&堆积');
ERROR:  syntax error in tsquery: "日志|&堆积"

SELECT plainto_tsquery('jiebacfg','日志|&堆积');
 plainto_tsquery
-----------------
 '日志' & '堆积'

Use the text function to convert the plainto_tsquery result to a string, then replace to change & to | for OR matching:

-- Use the plainto_tsquery function.
SELECT plainto_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库');
                          plainto_tsquery
--------------------------------------------------------------------
 'postgresql' & '世界' & '先进' & '开源' & '关系' & '型' & '数据库'

-- Use the plainto_tsquery, text, and replace functions.
SELECT replace(text(plainto_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库')), '&', '|');
                            replace
--------------------------------------------------------------------
 'postgresql' | '世界' | '先进' | '开源' | '关系' | '型' | '数据库'

-- Use the to_tsquery, plainto_tsquery, text, and replace functions.
SELECT to_tsquery(replace(text(plainto_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库')), '&', '|'));
                           to_tsquery
--------------------------------------------------------------------
 'postgresql' | '世界' | '先进' | '开源' | '关系' | '型' | '数据库'

Add custom terms to the pg_jieba dictionary to improve segmentation of domain-specific phrases:

-- Automatically add 关系型 to custom dictionary 0 with a weight of 100000.
INSERT INTO jieba_user_dict VALUES ('关系型',0,100000);

-- Load custom dictionary 0. The first 0 is the dictionary sequence number; the second 0 loads the default dictionary.
SELECT jieba_load_user_dict(0,0);
 jieba_load_user_dict
----------------------

-- Convert a user question into a tsquery value.
SELECT to_tsquery(replace(text(plainto_tsquery('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库')), '&', '|'));
                           to_tsquery
--------------------------------------------------------------------
 'postgresql' | '世界' | '先进' | '开源' | '关系型' | '数据库'

Before adding 关系型 to the dictionary, the term splits into '关系' & '型'. After adding it, the dictionary returns the compound term '关系型' as a single token.

Operators and scoring:

Use @@ to check whether a tsvector matches a tsquery. Weight matching affects results:

SELECT to_tsvector('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库') @@ to_tsquery('jiebacfg', 'postgresql:A');
 ?column?
----------
 f

SELECT setweight(to_tsvector('jiebacfg', 'PostgreSQL是世界上先进的开源关系型数据库'),'A') @@ to_tsquery('jiebacfg', 'postgresql:A');
 ?column?
----------
 t

The query looks for postgresql with weight A. The first statement returns false because no weight is assigned to the tsvector. After applying setweight, the match succeeds.

Use ts_rank to score how well a tsvector matches a tsquery:

WITH sentence AS (
    SELECT 'PostgreSQL是世界上先进的开源关系型数据库' AS content
    UNION ALL
    SELECT 'MySQL是应用广泛的开源关系数据库'
    UNION ALL
    SELECT 'MySQL在全球非常流行'
)
SELECT content,
       ts_rank(to_tsvector('jiebacfg', content), to_tsquery('jiebacfg', 'postgresql | 开源')) AS score
FROM sentence
WHERE to_tsvector('jiebacfg', content) @@ to_tsquery('jiebacfg', 'postgresql | 开源')
ORDER BY score DESC;

                  content                   |    score
--------------------------------------------+-------------
 PostgreSQL是世界上先进的开源关系型数据库 |  0.06079271
 MySQL是应用广泛的开源关系数据库          | 0.030396355

The first sentence matches both postgresql and 开源, scoring higher than the second sentence, which only matches 开源. The third sentence is filtered out by @@ because it matches neither term.

Handling segmentation failures:

If word segmentation fails or the input contains unexpected characters, keyword matching may return no results. Use the pg_bigm extension for fuzzy matching as a fallback:

WITH sentence AS (
    SELECT 'PostgreSQL是世界上先进的开源关系型数据库' AS content
    UNION ALL
    SELECT 'MySQL是应用广泛的开源关系数据库'
    UNION ALL
    SELECT 'MySQL在全球非常流行'
)
SELECT
    content,
    bigm_similarity(content, 'postgres | 开源产品') AS score
FROM
    sentence
ORDER BY
    score DESC;

                  content                   |   score
--------------------------------------------+------------
  PostgreSQL是世界上先进的开源关系型数据库 | 0.23076923
  MySQL是应用广泛的开源关系数据库          | 0.05263158
  MySQL在全球非常流行                      |        0.0
(3 rows)

bigm_similarity converts both strings to bigrams (pairs of consecutive characters) and calculates overlap. The result ranges from 0 to 1, where 1 is an exact match. This works well when there are typos, abbreviations, or segmentation errors. For more information, see Use the pg_bigm extension to perform fuzzy match-based queries.

Content keyword-based recall

Content keyword-based recall searches the content_chunk text in the embedding table using the same full-text search approach as document keyword-based recall. Because document chunks are longer than keyword fields, the RUM extension provides a significant performance advantage.

The following three query plans compare the same full-text search query using different index approaches. All use the query 'wal日志堆积怎么办'.

Use the RUM extension to run full-text searches

Option 1: RUM index (recommended)

The RUM index stores word positions and timestamps alongside the inverted index entries, so it can sort results by similarity without a separate sort step. Query time: 3.2 ms.

EXPLAIN ANALYZE
SELECT
    id,
    doc_id,
    content_chunk,
    ts_vector_extra <=> to_tsquery(
        REPLACE(
            TEXT(plainto_tsquery('jiebacfg', 'wal日志堆积怎么办')),
            '&',
            '|'
        )
    ) AS similarity
FROM
    embedding
WHERE
    ts_vector_extra @@ to_tsquery(
        REPLACE(
            TEXT(plainto_tsquery('jiebacfg', 'wal日志堆积怎么办')),
            '&',
            '|'
        )
    )
ORDER BY
    similarity
LIMIT
    10;
                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10.15..22.14 rows=10 width=521) (actual time=3.117..3.182 rows=10 loops=1)
   ->  Index Scan using embedding_rumidx on embedding  (cost=10.15..6574.53 rows=5474 width=521) (actual time=3.115..3.179 rows=10 loops=1)
         Index Cond: (ts_vector_extra @@ to_tsquery('''wal'' | ''日志'' | ''堆积'''::text))
         Order By: (ts_vector_extra <=> to_tsquery('''wal'' | ''日志'' | ''堆积'''::text))
 Planning Time: 0.296 ms
 Execution Time: 3.219 ms
(6 rows)

Use native GIN indexes to accelerate queries

Option 2: GIN index on the ts_vector_extra column

The GIN index does not store word positions, so PostgreSQL must perform a bitmap heap scan followed by a sort. Query time: 14.2 ms.

EXPLAIN ANALYZE
SELECT
    id,
    doc_id,
    content_chunk,
    ts_rank(
        ts_vector_extra,
        to_tsquery(
            REPLACE(
                TEXT(plainto_tsquery('jiebacfg', 'wal日志堆积怎么办')),
                '&',
                '|'
            )
        )
    ) AS similarity
FROM
    embedding
WHERE
    ts_vector_extra @@ to_tsquery(
        REPLACE(
            TEXT(plainto_tsquery('jiebacfg', 'wal日志堆积怎么办')),
            '&',
            '|'
        )
    )
ORDER BY
    similarity
LIMIT
    10;
                                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7178.59..7179.76 rows=10 width=520) (actual time=10.526..14.192 rows=10 loops=1)
   ->  Gather Merge  (cost=7178.59..7718.33 rows=4626 width=520) (actual time=10.525..14.189 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=6178.57..6184.35 rows=2313 width=520) (actual time=6.879..6.880 rows=10 loops=3)
               Sort Key: (ts_rank(ts_vector_extra, to_tsquery('''wal'' | ''日志'' | ''堆积'''::text)))
               Sort Method: top-N heapsort  Memory: 37kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 39kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 40kB
               ->  Parallel Bitmap Heap Scan on embedding  (cost=56.47..6128.59 rows=2313 width=520) (actual time=0.567..6.367 rows=1637 loops=3)
                     Recheck Cond: (ts_vector_extra @@ to_tsquery('''wal'' | ''日志'' | ''堆积'''::text))
                     Heap Blocks: exact=1515
                     ->  Bitmap Index Scan on embedding_ts_vector_gin  (cost=0.00..55.08 rows=5551 width=0) (actual time=0.794..0.794 rows=4910 loops=1)
                           Index Cond: (ts_vector_extra @@ to_tsquery('''wal'' | ''日志'' | ''堆积'''::text))
 Planning Time: 0.291 ms
 Execution Time: 14.234 ms

GIN indexes built on to_tsvector('jiebacfg'::regconfig, content_chunk)

Option 3: GIN index on an expression (not recommended)

Building a GIN index on to_tsvector('jiebacfg'::regconfig, content_chunk) means PostgreSQL must re-run to_tsvector on every matched row to compute ts_rank because GIN does not store word location information. Query time: 1,081 ms — over 300x slower than the RUM index.

EXPLAIN ANALYZE
SELECT
    id,
    doc_id,
    content_chunk,
    ts_rank(
        to_tsvector('jiebacfg', content_chunk),
        to_tsquery(
            REPLACE(
                TEXT(plainto_tsquery('jiebacfg', 'wal日志堆积怎么办')),
                '&',
                '|'
            )
        )
    ) AS similarity
FROM
    embedding
WHERE
    to_tsvector('jiebacfg', content_chunk) @@ to_tsquery(
        REPLACE(
            TEXT(plainto_tsquery('jiebacfg', 'wal日志堆积怎么办')),
            '&',
            '|'
        )
    )
ORDER BY
    similarity
LIMIT
    10;
                                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8253.58..8254.75 rows=10 width=521) (actual time=1079.289..1081.510 rows=10 loops=1)
   ->  Gather Merge  (cost=8253.58..8786.55 rows=4568 width=521) (actual time=1079.287..1081.508 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=7253.56..7259.27 rows=2284 width=521) (actual time=1073.189..1073.191 rows=10 loops=3)
               Sort Key: (ts_rank(to_tsvector('jiebacfg'::regconfig, content_chunk), to_tsquery('''wal'' | ''日志'' | ''堆积'''::text)))
               Sort Method: top-N heapsort  Memory: 43kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 42kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 37kB
               ->  Parallel Bitmap Heap Scan on embedding  (cost=55.93..7204.20 rows=2284 width=521) (actual time=2.127..1072.159 rows=1637 loops=3)
                     Recheck Cond: (to_tsvector('jiebacfg'::regconfig, content_chunk) @@ to_tsquery('''wal'' | ''日志'' | ''堆积'''::text))
                     Heap Blocks: exact=1028
                     ->  Bitmap Index Scan on embedding_content_gin  (cost=0.00..54.56 rows=5481 width=0) (actual time=0.808..0.809 rows=4910 loops=1)
                           Index Cond: (to_tsvector('jiebacfg'::regconfig, content_chunk) @@ to_tsquery('''wal'' | ''日志'' | ''堆积'''::text))
 Planning Time: 0.459 ms
 Execution Time: 1081.547 ms
(16 rows)

Index comparison summary:

Index typeExecution timeHow sorting works
RUM on ts_vector_extra3.2 msIndex scan with built-in ordering — no sort step
GIN on ts_vector_extra14.2 msBitmap scan + parallel sort
GIN on to_tsvector(content_chunk)1,081 msBitmap scan + per-row tsvector recomputation + sort

Use the RUM index (embedding_rumidx) for content keyword-based recall. It is 4x faster than the GIN-on-stored-column approach and 300x faster than the GIN-on-expression approach.

BM25-based recall

BM25 is a classic text matching algorithm that scores documents based on term frequency (TF) and inverse document frequency (IDF). A high TF score indicates a word appears frequently in a document. A high IDF score indicates a word appears across few documents and is therefore more distinctive.

BM25 optimizes the TF-IDF model with additional parameters to improve retrieval quality. Its output complements ts_rank-based keyword recall, since both operate on term statistics but weight factors differently.

Embedding-based recall

ApsaraDB RDS for PostgreSQL supports the pgvector extension for vector storage and similarity search, and the rds_embedding extension to generate embeddings from text.

The pgvector extension supports two indexing methods for approximate nearest neighbor (ANN) search:

  • HNSW (Hierarchical Navigable Small World) — builds the index incrementally as data is inserted; no training required; faster query speed

  • IVFFlat (Inverted File with Flat Compression) — requires training on existing data before building; slightly lower query speed

This guide uses HNSW, which works without pre-training and provides faster retrieval. For IVFFlat best practices, see Build an LLM-driven dedicated chatbot on top of ApsaraDB RDS for PostgreSQL.

SELECT
    embedding.id,
    doc_id,
    content_chunk,
    content_embedding <=> '%s' AS similarity
FROM
    public.embedding
LEFT JOIN
    document ON document.id = embedding.doc_id
WHERE
    product_name = '%s'
ORDER BY
    similarity
LIMIT %s;

For pgvector performance benchmarks, see Use the pgvector extension to perform high-dimensional vector similarity searches.

Result merging

Merge and rerank results from all four recall paths using the RRF algorithm and the bce-reranker-base_v1 model.

RRF algorithm:

RRF assigns a score to each document based on its rank across multiple retrieval systems:

RRF(d) = Σ 1/(k + r_i(d))

Where d is the document, r_i(d) is the document's rank in system i, and k is a smoothing constant (commonly set to 60). Documents that rank highly across multiple systems accumulate higher RRF scores, producing a merged ranking that reflects consensus across all recall paths.

bce-reranker-base_v1 model:

bce-reranker-base_v1 is a cross-language semantic reranking model supporting Chinese, English, Japanese, and Korean. It produces more accurate rankings than RRF alone, but takes longer when processing many chunks.

Choose between approaches based on your latency requirements:

  • For fast responses, use RRF alone or apply bce-reranker-base_v1 only on the top RRF results.

  • For highest accuracy, run bce-reranker-base_v1 directly on the merged result set.

Q&A analysis

The ticket chatbot applies different policies based on data source to control how each content type flows into the large language model (LLM) prompt:

  • Knowledge base content — provided directly to the user without LLM processing

  • Help documentation — summarized and formatted by the LLM, because segmented HTML may contain layout artifacts and duplicate text

  • LLM fallback — the LLM answers directly only when the knowledge base and documentation cannot provide a relevant result

  • Historical tickets — provided as a title and URL only, pointing users to the full ticket

Each answer includes links to the source documents. If the answer does not fully resolve the issue, users can follow the links for more complete information.

During testing, score Q&A pairs to assess retrieval policy effectiveness. Write multiple policy variants and compare scores to determine the most effective recall configuration:

prompt = f\'\'\'请整理并格式化下面的内容并整理输出格式,
        ```
        {prompt_content}
        ```
        基于自己的能力做出回答,我的问题是:{question}。
        \'\'\'

Connect to a DingTalk chatbot

You can use Streamlit to create a web app or connect to the DingTalk chatbot. A web app is used for self-testing and document management. The DingTalk chatbot allows all users to access the service.

In each Q&A session in a DingTalk group, a new connection needs to be established at the database level. Frequent short-lived connections consume time and memory. If connections are not released promptly, the instance may hit its connection limit. Use a connection pool to prevent this — either implement one in your application or use the built-in PgBouncer connection pool of ApsaraDB RDS for PostgreSQL.

Example

This example demonstrates multi-path recall using a sample dataset about PostgreSQL, MySQL, and SQL Server. The query is 介绍一下postgresql.

Prepare data

  1. Install the required extensions using a privileged account:

    Important
    • Before installing pg_jieba, add pg_jieba to the value of the shared_preload_libraries parameter. For more information about how to modify the shared_preload_libraries parameter, see Set instance parameters.

    • You can execute the SELECT * FROM pg_extension; statement to view the installed extensions.

    CREATE EXTENSION IF NOT EXISTS pg_jieba;
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE EXTENSION IF NOT EXISTS rum;
    CREATE EXTENSION IF NOT EXISTS rds_embedding;
  2. Create a NAT gateway for the virtual private cloud (VPC) in which the RDS instance resides to allow the RDS instance to access external models. For more information, see Use the rds_embedding extension to generate vectors.

    Note

    By default, you cannot connect to an RDS instance over the Internet. When you use an external large model, such as the text embedding model provided by Alibaba Cloud Model Studio, you must create a NAT gateway for the VPC in which the RDS instance resides for the RDS instance to access external models.

  3. Execute the following SQL statements on the required database to create the test tables named doc and embed, and create indexes for the tables:

    --Create the doc test table and an index for the table.
    DROP TABLE IF EXISTS doc;
    
    CREATE TABLE doc (
        id bigserial PRIMARY KEY,
        title character varying(255) UNIQUE,
        key_word character varying(255) DEFAULT \'\'
    );
    
    CREATE INDEX doc_gin ON doc
    USING GIN (to_tsvector(\'jiebacfg\', key_word));
    
    --Create the embed test table and an index for the table.
    DROP TABLE IF EXISTS embed;
    
    CREATE TABLE embed (
        id bigserial PRIMARY KEY,
        doc_id integer,
        content text,
        embedding vector(1536),
        ts_vector_extra tsvector
    );
    
    CREATE INDEX ON embed
    USING hnsw (embedding vector_cosine_ops)
    WITH (
        m = 16,
        ef_construction = 64
    );
  4. Execute the following SQL statements to create a trigger. When a row in the embed table is updated or data is inserted into the row, the ts_vector_extra column is automatically updated.

    -- Convert the text into a tsvector value for full-text search based on keywords.
    CREATE TRIGGER embed_tsvector_update
    BEFORE UPDATE OR INSERT
    ON embed
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(\'ts_vector_extra\', \'public.jiebacfg\', \'content\');
  5. Execute the following SQL statements. Each time an insert or update is performed on the embed table, an embedding is generated based on the inserted or updated content and stored in the embedding column.

    Important

    In this example, the text embedding model provided by Alibaba Cloud Model Studio is used. You must activate Alibaba Cloud Model Studio and obtain the required API key. For more information, see Get an API key.

    -- Convert the question into an embedding. Specify the api_key parameter based on your business requirements.
    CREATE OR REPLACE FUNCTION update_embedding()
    RETURNS TRIGGER AS $$
    BEGIN
        NEW.embedding := rds_embedding.get_embedding_by_model(\'dashscope\', \'sk-****\', NEW.content)::real[];
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    CREATE TRIGGER set_embedding BEFORE INSERT OR UPDATE ON embed FOR EACH ROW EXECUTE FUNCTION update_embedding();
  6. Insert test data into the table.

    INSERT INTO doc(id, title, key_word) VALUES
    (1, \'PostgreSQL介绍\', \'PostgreSQL 插件\'),
    (2, \'MySQL介绍\', \'MySQL MGR\'),
    (3, \'SQL Server介绍\', \'SQL Server Microsoft\');
    
    INSERT INTO embed(doc_id, content) VALUES
    (1, \'PostgreSQL是以加州大学伯克利分校计算机系开发的POSTGRES,版本 4.2为基础的对象关系型数据库管理系统(ORDBMS)。 POSTGRES领先的许多概念在很久以后才出现在一些商业数据库系统中\'),
    (1, \'PostgreSQL是最初的伯克利代码的开源继承者。 它支持大部分SQL标准并且提供了许多现代特性:复杂查询、外键、触发器、可更新视图、事务完整性、多版本并发控制,同样,PostgreSQL可以用许多方法扩展,比如,通过增加新的:数据类型、函数、操作符、聚集函数、索引方法、过程语言\'),
    (1, \'并且,因为自由宽松的许可证,任何人都可以以任何目的免费使用、修改和分发PostgreSQL,不管是私用、商用还是学术研究目的。 \'),
    (1, \'Ganos插件和PostGIS插件不能安装在同一个Schema下\'),
    (1, \'丰富的生态系统:有大量现成的插件和扩展可供使用,比如PostGIS(地理信息处理)、TimescaleDB(时间序列数据库)、pg_stat_statements(性能监控)等,能够满足不同场景的需要\');
    
    INSERT INTO embed(doc_id, content) VALUES
    (2, \'MySQL名称的起源不明。 10多年来,我们的基本目录以及大量库和工具均采用了前缀"my"。 不过,共同创办人Monty Widenius的女儿名字也叫"My"。 时至今日,MySQL名称的起源仍是一个迷,即使对我们也一样\'),
    (2, \'MySQL软件采用双许可方式。 用户可根据GNU通用公共许可(http://www.fsf.org/licenses/)条款,将MySQL软件作为开放源码产品使用,或从MySQL AB公司购买标准的商业许可证。 关于我方许可策略的更多信息,请参见http://www.mysql.com/company/legal/licensing/。 \'),
    (2, \'组复制MySQL Group Replication(简称MGR)是MySQL官方在已有的Binlog复制框架之上,基于Paxos协议实现的一种分布式复制形态。 RDS MySQL集群系列实例支持组复制。 本文介绍如何使复制方式为组复制。 使用了组复制的MySQL集群能够基于分布式Paxos协议自我管理,具有很强的数据可靠性和数据一致性。 相比传统主备复制方式,组复制具有以下优势:数据的强一致性,数据的强可靠性,全局事务强一致性\');
    
    INSERT INTO embed(doc_id, content) VALUES
    (3, \'Microsoft SQL Server是一种关系数据库管理系统 (RDBMS)。 应用程序和工具连接到SQL Server实例或数据库,并使用Transact-SQL (T-SQL)进行通信。 \'),
    (3, \'SQL Server 2022 (16.x)在早期版本的基础上构建,旨在将SQL Server发展成一个平台,以提供开发语言、数据类型、本地或云环境以及操作系统选项。 \'),
    (3, \'SQL Server在企业级应用中广受欢迎,与其他Microsoft产品(如Excel、Power BI)无缝集成,便于数据分析\');

Run multi-path recall

Execute the following SQL statements to retrieve the query text of 介绍一下postgresql from multiple methods and sort relevant documents by similarity.

-- The text that you want to query.
WITH query AS (
    SELECT \'介绍一下postgresql\' AS query_text
),
-- Convert the question into an embedding. Replace sk-**** with the API key obtained from Alibaba Cloud Model Studio.
query_embedding AS (
    SELECT rds_embedding.get_embedding_by_model(\'dashscope\', \'sk-****\', query.query_text)::real[]::vector AS embedding
    FROM query
),
-- Document keyword search: score by ts_rank (higher = better match).
first_method AS (
    SELECT
        id,
        title,
        ts_rank(to_tsvector(\'jiebacfg\', doc.key_word),
                to_tsquery(replace(text(plainto_tsquery(\'jiebacfg\', (SELECT query_text FROM query))), \'&\', \'|\'))) AS score,
        \'doc_key_word\' AS method
    FROM doc
    WHERE
        to_tsvector(\'jiebacfg\', doc.key_word) @@
        to_tsquery(replace(text(plainto_tsquery(\'jiebacfg\', (SELECT query_text FROM query))), \'&\', \'|\'))
    ORDER BY
        score DESC
    LIMIT 3
),
-- Content keyword search: score by RUM <=> operator (lower = better match).
second_method AS (
    SELECT
        id,
        doc_id,
        content,
        to_tsvector(\'jiebacfg\', content) <=>
        to_tsquery(replace(text(plainto_tsquery(\'jiebacfg\', (SELECT query_text FROM query))), \'&\', \'|\')) AS score,
        \'content_key_word\' AS method
    FROM embed
    WHERE
        to_tsvector(\'jiebacfg\', content) @@
        to_tsquery(replace(text(plainto_tsquery(\'jiebacfg\', (SELECT query_text FROM query))), \'&\', \'|\'))
    ORDER BY
        score
    LIMIT 3
),
-- Embedding search: score by cosine distance <=> (lower = better match).
third_method AS (
    SELECT
        embed.id,
        embed.doc_id,
        embed.content,
        embedding <=> (SELECT embedding FROM query_embedding LIMIT 1) AS score,
        \'embedding\' AS method
    FROM embed
    ORDER BY score
    LIMIT 3
)
-- Join to retrieve document titles and combine results from all three methods.
SELECT
    first_method.title,
    embed.id AS chunk_id,
    SUBSTRING(embed.content FROM 1 FOR 30),
    first_method.score,
    first_method.method
FROM first_method
LEFT JOIN embed ON first_method.id = embed.doc_id
UNION
SELECT
    doc.title,
    second_method.id AS chunk_id,
    SUBSTRING(second_method.content FROM 1 FOR 30),
    second_method.score,
    second_method.method
FROM second_method
LEFT JOIN doc ON second_method.doc_id = doc.id
UNION
SELECT
    doc.title,
    third_method.id AS chunk_id,
    SUBSTRING(third_method.content FROM 1 FOR 30),
    third_method.score,
    third_method.method
FROM third_method
LEFT JOIN doc ON third_method.doc_id = doc.id
ORDER BY method, score;

Output:

     title      | chunk_id |                          substring                           |        score         |      method
----------------+----------+--------------------------------------------------------------+----------------------+------------------
 PostgreSQL介绍 |        3 | PostgreSQL是最初的伯克利代码的开源继承者。 它支持大           |   13.159472465515137 | content_key_word
 PostgreSQL介绍 |        2 | PostgreSQL是以加州大学伯克利分校计算机系开发的PO             |     16.4493408203125 | content_key_word
 PostgreSQL介绍 |        4 | 并且,因为自由宽松的许可证,任何人都可以以任何目的免费使用、 |     16.4493408203125 | content_key_word
 PostgreSQL介绍 |        6 | 丰富的生态系统:有大量现成的插件和扩展可供使用,比如Post     | 0.020264236256480217 | doc_key_word
 PostgreSQL介绍 |        5 | Ganos插件和PostGIS插件不能安装在同一个Schem                  | 0.020264236256480217 | doc_key_word
 PostgreSQL介绍 |        3 | PostgreSQL是最初的伯克利代码的开源继承者。 它支持大           | 0.020264236256480217 | doc_key_word
 PostgreSQL介绍 |        2 | PostgreSQL是以加州大学伯克利分校计算机系开发的PO             | 0.020264236256480217 | doc_key_word
 PostgreSQL介绍 |        4 | 并且,因为自由宽松的许可证,任何人都可以以任何目的免费使用、 | 0.020264236256480217 | doc_key_word
 PostgreSQL介绍 |        2 | PostgreSQL是以加州大学伯克利分校计算机系开发的PO             |   0.2546271233144539 | embedding
 PostgreSQL介绍 |        3 | PostgreSQL是最初的伯克利代码的开源继承者。 它支持大           |  0.28679098231865074 | embedding
 PostgreSQL介绍 |        6 | 丰富的生态系统:有大量现成的插件和扩展可供使用,比如Post     |  0.41783296077761967 | embedding

All three recall methods return results for PostgreSQL介绍, confirming that keyword-based, content-based, and embedding-based approaches all identify the correct document.

Related topics

For more information about the best practices of ApsaraDB RDS for PostgreSQL in RAG, see the following topics: