In most cases, the vector search method alone can achieve a high recall rate for vector similarity retrieval. However, poorly performing embedding models or complex queries in specific scenarios may cause generated vectors to be far away from the data to be retrieved. In this case, the use of the vector search method alone cannot achieve the desired results. To increase the recall rate, you can use the two-way retrieval method based on vector search and full-text search.
The two-way retrieval method in AnalyticDB for PostgreSQL separately retrieves subsets of data by using vector search and full-text search, and then reranks and post-processes the merged retrieval results for better effects.
Vector search: Based on dense vector representations, AnalyticDB for PostgreSQL uses the approximate nearest neighbor search (ANNS) algorithm to capture semantic relevance and retrieve TopK similar items.
Full-text search: AnalyticDB for PostgreSQL performs precise matching based on statistical features such as term frequency and inverse document frequency to supplement results with strong keyword relevance. In AnalyticDB for PostgreSQL V6.0, full-text search relies on GIN indexes. In V7.0, the pgsearch-based BM25 indexes are used to improve retrieval efficiency and relevance.
Reranking and post-processing: AnalyticDB for PostgreSQL merges the two-way retrieved data and performs further ranking and processing to ensure the relevance and accuracy of the final results.
This topic describes how to perform two-way retrieval based on vector search and full-text search in AnalyticDB for PostgreSQL V7.0. If you use an AnalyticDB for PostgreSQL V6.0 instance, refer to two-way retrieval in V6.0.
Supported versions
AnalyticDB for PostgreSQL V7.0 instances of V7.2.1.0 or later.
For information about how to view the minor version of an AnalyticDB for PostgreSQL instance, see View the minor version of an instance. If your instance does not meet the preceding requirements, update the minor version of the instance. For more information, see UpgradeDBVersion.
Prerequisites
The vector search engine optimization feature is enabled for an AnalyticDB for PostgreSQL instance.
The pgsearch extension is installed. If you have installed the pgsearch extension, a schema named pgsearch is included in the schema list. If you have not installed the pgsearch extension, submit a ticket for installation assistance. After the pgsearch extension is installed, you must restart your instance.
Procedure
Step 1: Create a sample table
Create a sample table named documents
and insert 5 rows of data into the table.
-- The vector column specifies the vector data.
CREATE TABLE IF NOT EXISTS documents(
id TEXT,
docname TEXT,
title TEXT,
vector real[],
text TEXT);
-- Set the storage mode of the vector column to PLAIN.
ALTER TABLE documents ALTER COLUMN vector SET STORAGE PLAIN;
-- Insert sample data.
INSERT INTO documents (id, docname, title, vector, text) VALUES
('1', 'doc_1', 'Exploring the Universe',
'{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}',
'The universe is vast, filled with mysteries and astronomical wonders waiting to be discovered.'),
('2', 'doc_2', 'The Art of Cooking',
'{0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1}',
'Cooking combines ingredients artfully, creating flavors that nourish and bring people together.'),
('3', 'doc_3', 'Technology and Society',
'{0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2}',
'Technology transforms society, reshaping communication, work, and our daily interactions significantly.'),
('4', 'doc_4', 'Psychology of Happiness',
'{0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3}',
'Happiness is complex, influenced by relationships, gratitude, and the pursuit of meaningful experiences.'),
('5', 'doc_5', 'Sustainable Living Practices',
'{0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4}',
'Sustainable living involves eco-friendly choices, reducing waste, and promoting environmental awareness.');
Step 2: Create indexes
Create a vector index for the vector column.
CREATE INDEX documents_idx ON documents USING ann(vector) WITH (dim = 10, algorithm = hnswflat, distancemeasure = L2, vector_include = 0);
Create a full-text index for the text column.
CALL pgsearch.create_bm25( index_name => 'documents_bm25_idx', table_name => 'documents', text_fields => '{text: {}}' );
Step 3: Perform two-way retrieval
The first temporary table t1 retrieves 1 row of data by using full-text search, and the second temporary table t2 retrieves 5 rows of data by using vector search. Use FULL OUTER JOIN
to obtain the final score based on the BM25 and vector similarity scores, and then return the results sorted by the final score.
WITH t1 AS (
SELECT
id,
docname,
title,
text,
text @@@ pgsearch.config('text:astronomical') AS score,
2 AS source
FROM
documents
ORDER BY score
LIMIT 10
),
t2 AS (
SELECT
id,
docname,
title,
text,
cosine_similarity(vector,ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]::real[]) AS score,
1 AS source
FROM
documents
ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
LIMIT 10
)
SELECT t2.*, COALESCE(ABS(t1.score), 0.0) * 0.2 + COALESCE(t2.score, 0.0) * 0.8 AS hybrid_score
-- The weight distribution for scores is for demonstration only. You can choose appropriate parameters and calculation methods based on your business requirements.
FROM t1
FULL OUTER JOIN t2 ON t1.id = t2.id
ORDER BY hybrid_score DESC;
You can also use reciprocal rank fusion (RRF) to calculate the final score. RRF determines the final ranking of retrieval results based on the rankings from vector search and full-text search. Typically, if a retrieval result ranks high in both search methods, its final score is higher.
The parameter k
in the RRF formula is used to smooth the impact of rankings on the final score. A larger k
value reduces the score difference between different rankings, achieving a better smoothing effect. By default, the value of k
is set to 60.
Step 4: (Optional) Encapsulate and invoke a function
Encapsulate the query in Step 3 as a function to simplify invocation.
Encapsulate the query as a function.
CREATE OR REPLACE FUNCTION search_documents( table_name TEXT, vector_column TEXT, text_column TEXT, search_keyword TEXT, search_vector REAL[], limit_size INT, hnsw_weight FLOAT8 DEFAULT 0.8 -- Default weight for hnsw ) RETURNS TABLE ( id TEXT, docname TEXT, title TEXT, text TEXT, hybrid_score FLOAT8 ) AS $$ DECLARE query_string TEXT; bm25_weight FLOAT8; BEGIN bm25_weight := 1.0 - hnsw_weight; query_string := 'WITH t1 AS ( SELECT id, docname, title, ' || text_column || ', ' || text_column || ' @@@ pgsearch.config(''' || search_keyword || ''') AS score, 2 AS source FROM ' || table_name || ' ORDER BY score LIMIT ' || limit_size || ' ), t2 AS ( SELECT id, docname, title, ' || text_column || ', cosine_similarity(' || vector_column || ', $1) AS score, 1 AS source FROM ' || table_name || ' ORDER BY ' || vector_column || ' <-> $1 LIMIT ' || limit_size || ' ) SELECT t2.id, t2.docname, t2.title, t2.' || text_column || ', COALESCE(ABS(t1.score), 0.0) * ' || bm25_weight || ' + COALESCE(t2.score, 0.0) * ' || hnsw_weight || ' AS hybrid_score FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id ORDER BY hybrid_score DESC;'; RETURN QUERY EXECUTE query_string USING search_vector; END; $$ LANGUAGE plpgsql;
Invoke the encapsulated
search_documents()
function.SELECT * FROM search_documents( 'documents', 'vector', 'text', 'astronomical', ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], 10, 0.8 );
References
Two-way retrieval based on vector search and full-text search in V6.0