Perform two-way retrieval based on vector search and full-text search in V7.0 - AnalyticDB

In most cases, the vector search method alone can achieve a high recall rate for vector similarity retrieval. However, in specific scenarios such as poorly performing embedding models or complex queries, generated vectors may not align well with the data to be retrieved. In this case, the use of the vector search method alone cannot achieve the desired results. To increase the recall rate, you can use the two-way retrieval method based on vector search and full-text search.

The two-way retrieval method in AnalyticDB for PostgreSQL separately retrieves subsets of data by using vector search and full-text search, and then reranks and post-processes the merged retrieval results for better effects.

Vector search: Based on dense vector representations, AnalyticDB for PostgreSQL uses the approximate nearest neighbor search (ANNS) algorithm to capture semantic relevance and retrieve TopK similar items.
Full-text search: AnalyticDB for PostgreSQL performs precise matching based on statistical features such as term frequency and inverse document frequency to supplement results with strong keyword relevance. In AnalyticDB for PostgreSQL V6.0, full-text search relies on GIN indexes. In V7.0, the pgsearch-based BM25 indexes are used to improve retrieval efficiency and relevance.
Reranking and post-processing: AnalyticDB for PostgreSQL merges the two-way retrieved data and performs further ranking and processing to ensure the relevance and accuracy of the final results.

This topic describes how to perform two-way retrieval based on vector search and full-text search in AnalyticDB for PostgreSQL V7.0. If you use an AnalyticDB for PostgreSQL V6.0 instance, refer to Two-way retrieval based on vector search and full-text search.

Supported versions

AnalyticDB for PostgreSQL V7.0 instances of V7.2.1.0 or later.

Note

For information about how to view the minor version of an AnalyticDB for PostgreSQL instance, see View the minor version of an instance. If your instance does not meet the preceding requirements, update the minor version of the instance. For more information, see UpgradeDBVersion.

Prerequisites

The vector search engine optimization feature is enabled for an AnalyticDB for PostgreSQL instance.
The pgsearch extension is installed. If you have installed the pgsearch extension, a schema named pgsearch is included in the schema list. If you have not installed the pgsearch extension, submit a ticket for installation assistance. After the pgsearch extension is installed, you must restart your instance.

Procedure

Step 1: Create a sample table

Create a sample table named documents and insert 5 rows of data into the table.

-- The vector column specifies the vector data.
CREATE TABLE IF NOT EXISTS documents(
                id TEXT,
                docname TEXT,
                title TEXT,
                vector real[],
                text TEXT);
-- Set the storage mode of the vector column to PLAIN.
ALTER TABLE documents ALTER COLUMN vector SET STORAGE PLAIN;
-- Insert sample data.
INSERT INTO documents (id, docname, title, vector, text) VALUES
('1', 'doc_1', 'Exploring the Universe', 
'{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}', 
'The universe is vast, filled with mysteries and astronomical wonders waiting to be discovered.'),

('2', 'doc_2', 'The Art of Cooking', 
'{0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1}', 
'Cooking combines ingredients artfully, creating flavors that nourish and bring people together.'),

('3', 'doc_3', 'Technology and Society', 
'{0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2}', 
'Technology transforms society, reshaping communication, work, and our daily interactions significantly.'),

('4', 'doc_4', 'Psychology of Happiness', 
'{0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3}', 
'Happiness is complex, influenced by relationships, gratitude, and the pursuit of meaningful experiences.'),

('5', 'doc_5', 'Sustainable Living Practices', 
'{0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4}', 
'Sustainable living involves eco-friendly choices, reducing waste, and promoting environmental awareness.');

Step 2: Create indexes

Create a vector index for the vector column.

CREATE INDEX documents_idx ON documents USING ann(vector) WITH (dim = 10, algorithm = hnswflat, distancemeasure = L2, vector_include = 0);

Create a full-text index for the text column.

CALL pgsearch.create_bm25(
    index_name => 'documents_bm25_idx',
    table_name => 'documents',
    text_fields => '{text: {}}'
);

Step 3: Perform two-way retrieval

The first temporary table t1 retrieves 1 row of data by using full-text search, and the second temporary table t2 retrieves 5 rows of data by using vector search. Use FULL OUTER JOIN to obtain the final score based on the BM25 and vector similarity scores, and then return the results sorted by the final score.

WITH t1 AS (
    SELECT
            id,
            docname,
            title,
            text,
            text @@@ pgsearch.config('text:astronomical') AS score,
            2 AS source
    FROM
        documents
    ORDER BY score
    LIMIT 10
),
t2 AS (
    SELECT
        id,
        docname,
        title,
        text,
        cosine_similarity(vector,ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]::real[]) AS score,
        1 AS source
    FROM
        documents
    ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
    LIMIT 10
)
SELECT t2.*, COALESCE(ABS(t1.score), 0.0) * 0.2 + COALESCE(t2.score, 0.0) * 0.8 AS hybrid_score
-- The weight distribution for scores is for demonstration only. You can choose appropriate parameters and calculation methods based on your business requirements.
FROM t1
FULL OUTER JOIN t2 ON t1.id = t2.id 
ORDER BY  hybrid_score DESC;

You can also use reciprocal rank fusion (RRF) to calculate the final score. RRF determines the final ranking of retrieval results based on the rankings from vector search and full-text search. Typically, if a retrieval result ranks high in both search methods, its final score is higher.

The parameter k in the RRF formula is used to smooth the impact of rankings on the final score. A larger k value reduces the score difference between different rankings, achieving a better smoothing effect. By default, the value of k is set to 60.

RRF example

WITH bm25 AS (
        SELECT
            id,
            docname,
            title,
            text,
            text @@@ pgsearch.config('text:astronomical') AS score,
            2 AS source,
            ROW_NUMBER() OVER () AS rank_bm25
        FROM
            documents
        ORDER BY score
        LIMIT 10
), hnsw AS (
        SELECT
            id,
            docname,
            title,
            text,
            vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1] AS score,
            1 AS source,
            ROW_NUMBER() OVER (ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]) AS rank_hnsw
        FROM
            documents
        ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
        LIMIT 10
)
SELECT 
    COALESCE(bm25.id, hnsw.id) AS id,
    COALESCE(bm25.docname, hnsw.docname) AS docname,
    COALESCE(bm25.title, hnsw.title) AS title,
    COALESCE(bm25.text, hnsw.text) as text,
    CASE 
        WHEN bm25.rank_bm25 > 0 AND hnsw.rank_hnsw > 0 THEN 
            COALESCE(1.0 / (60 + bm25.rank_bm25), 0) + COALESCE(1.0 / (60 + hnsw.rank_hnsw), 0)
        WHEN bm25.rank_bm25 > 0 THEN 
            COALESCE(1.0 / (60 + bm25.rank_bm25), 0)
        WHEN hnsw.rank_hnsw > 0 THEN 
            COALESCE(1.0 / (60 + hnsw.rank_hnsw), 0)
        ELSE 0
    END AS hybrid_score
FROM 
    bm25
FULL OUTER JOIN hnsw ON bm25.id = hnsw.id 
ORDER BY hybrid_score DESC;

Step 4: (Optional) Encapsulate and invoke a function

Encapsulate the query in Step 3 as a function to simplify invocation.

Encapsulate the query as a function.

CREATE OR REPLACE FUNCTION search_documents(
    table_name TEXT,
    vector_column TEXT,
    text_column TEXT,
    search_keyword TEXT,
    search_vector REAL[],
    limit_size INT,
    hnsw_weight FLOAT8 DEFAULT 0.8  -- Default weight for hnsw
)
RETURNS TABLE (
    id TEXT,
    docname TEXT,
    title TEXT,
    text TEXT,
    hybrid_score FLOAT8
) AS $$
DECLARE
    query_string TEXT;
    bm25_weight FLOAT8;
BEGIN
    bm25_weight := 1.0 - hnsw_weight;

    query_string := 'WITH t1 AS (
                            SELECT
                                id,
                                docname,
                                title,
                                ' || text_column || ',
                                ' || text_column || ' @@@ pgsearch.config(''' || search_keyword || ''') AS score,
                                2 AS source
                            FROM
                                ' || table_name || '
                            ORDER BY score	
                            LIMIT ' || limit_size || '
                    ), t2 AS (
                            SELECT
                                id,
                                docname,
                                title,
                                ' || text_column || ',
                                cosine_similarity(' || vector_column || ', $1) AS score,
                                1 AS source
                            FROM
                                ' || table_name || '
                            ORDER BY ' || vector_column || ' <-> $1
                            LIMIT ' || limit_size || '
                    )
                    SELECT t2.id, t2.docname, t2.title, t2.' || text_column || ', 
                    COALESCE(ABS(t1.score), 0.0) * ' || bm25_weight || ' + 
                    COALESCE(t2.score, 0.0) * ' || hnsw_weight || ' AS hybrid_score
                    FROM t1
                    FULL OUTER JOIN t2 ON t1.id = t2.id 
                    ORDER BY hybrid_score DESC;';
    
 RETURN QUERY EXECUTE query_string USING search_vector;
END; $$
LANGUAGE plpgsql;

Invoke the encapsulated search_documents() function.

SELECT * 
FROM search_documents(
    'documents', 
    'vector',
    'text', 
    'astronomical', 
    ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], 
    10,
    0.8
);

References

Use hybrid search to perform two-way retrieval based on vector search and full-text search.