Dual-path recall with vector search and full-text search - AnalyticDB cloud-native data warehouse - Alibaba Cloud

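In most scenarios, vector search alone achieves a high recall rate for similarity retrieval. In some cases, however, such as when the embedding model performs poorly or a complex query produces a vector that lies far from the stored data, vector similarity alone may not deliver the expected results. To raise the recall rate in these cases, you can use a dual-path recall strategy that combines vector search with full-text search.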

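Dual-path recall in AnalyticDB for PostgreSQL retrieves one set of candidates through vector search and another through full-text search, merges the two sets, and then applies fine ranking and post-processing to improve the recall result. The steps are as follows.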

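Vector search: uses dense embedding representations and approximate nearest neighbor (ANN) search to capture semantic relevance and recall the top-K similar items.
Full-text search: performs exact matching based on statistical features such as term frequency and inverse document frequency, which adds results that are strongly related to the keywords. In AnalyticDB for PostgreSQL V6.0, full-text search relies on GIN indexes. V7.0 upgrades this to a pgsearch-based BM25 index, which further improves retrieval efficiency and relevance.
Fine ranking and post-processing: merges the data recalled by both paths, then sorts and processes it further to ensure the relevance and accuracy of the final results.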
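
The term-frequency and inverse-document-frequency statistics behind the full-text path can be illustrated with a minimal Python sketch of the textbook Okapi BM25 formula. This is not the pgsearch implementation: the `tokenize` helper, the function name `bm25_scores`, and the parameters `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration, and pgsearch may tokenize and weight terms differently.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on spaces, drop . and ,"""
    return [word.strip(".,").lower() for word in text.split()]


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one textbook Okapi BM25 score per tokenized document."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # A term absent from the document contributes nothing.
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization dampens scores of longer-than-average documents.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Illustrative texts for the sketch.
texts = [
    "The universe is vast, filled with mysteries and astronomical wonders waiting to be discovered.",
    "Cooking combines ingredients artfully, creating flavors that nourish and bring people together.",
    "Technology transforms society, reshaping communication, work, and our daily interactions significantly.",
]
scores = bm25_scores(["astronomical"], [tokenize(t) for t in texts])
```

Only the first text contains the query term, so it is the only one that receives a nonzero score. This is the kind of keyword-exact signal that the full-text path adds to the vector path.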

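This topic describes dual-path recall with vector search and full-text search in AnalyticDB for PostgreSQL V7.0. If your instance runs V6.0, see the V6.0 topic on dual-path recall with vector search and full-text search.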

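Version limits

An AnalyticDB for PostgreSQL V7.0 instance with kernel version 7.2.1.0 or later.

Note

You can view the minor kernel version on the Basic Information page of the instance in the console. If the instance does not meet this version requirement, update the minor kernel version.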

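Prerequisites

Vector engine optimization is enabled for the instance.
The pgsearch extension is installed. If it is installed, pgsearch appears in the schema list of the database. If it is not, submit a ticket to ask technical support to install it (the instance must be restarted).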

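Procedure

Step 1: Create a sample table

Create a sample table named documents and insert five rows of test data.

-- The vector column stores the embedding vector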
CREATE TABLE IF NOT EXISTS documents(
                id TEXT,
                docname TEXT,
                title TEXT,
                vector real[],
                text TEXT);
-- Store the vector column inline (PLAIN storage)
ALTER TABLE documents ALTER COLUMN vector SET STORAGE PLAIN;
-- Insert sample data
INSERT INTO documents (id, docname, title, vector, text) VALUES
('1', 'doc_1', 'Exploring the Universe', 
'{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}', 
'The universe is vast, filled with mysteries and astronomical wonders waiting to be discovered.'),

('2', 'doc_2', 'The Art of Cooking', 
'{0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1}', 
'Cooking combines ingredients artfully, creating flavors that nourish and bring people together.'),

('3', 'doc_3', 'Technology and Society', 
'{0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2}', 
'Technology transforms society, reshaping communication, work, and our daily interactions significantly.'),

('4', 'doc_4', 'Psychology of Happiness', 
'{0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3}', 
'Happiness is complex, influenced by relationships, gratitude, and the pursuit of meaningful experiences.'),

('5', 'doc_5', 'Sustainable Living Practices', 
'{0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4}', 
'Sustainable living involves eco-friendly choices, reducing waste, and promoting environmental awareness.');

Step 2: Create indexes

Create a vector index on the vector column.

CREATE INDEX documents_idx ON documents USING ann(vector) WITH (dim = 10, algorithm = hnswflat, distancemeasure = L2, vector_include = 0);
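
The ANN index approximates an exact nearest-neighbor ranking under the configured distance measure (L2 here). As a rough illustration of what it approximates, the following Python sketch ranks the five sample vectors by brute-force L2 distance to a query vector. The names `docs`, `query`, and `l2_distance` are hypothetical, and the ordering is the same whether an engine reports the distance or its square.

```python
import math

# The five sample vectors from Step 1, keyed by document id.
docs = {
    "1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "2": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1],
    "3": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2],
    "4": [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3],
    "5": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4],
}

# Query vector: identical to the vector of document 1.
query = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def l2_distance(a, b):
    """Exact Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Brute-force ranking: smallest distance first.
ranked = sorted(docs, key=lambda doc_id: l2_distance(docs[doc_id], query))
```

Document 1 has distance 0 to the query and ranks first, and the other documents follow in order of increasing distance. An exhaustive scan like this becomes costly on large tables, which is the cost an ANN index avoids in exchange for approximate results.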

Create a full-text index on the text column.

CALL pgsearch.create_bm25(
    index_name => 'documents_bm25_idx',
    table_name => 'documents',
    text_fields => '{text: {}}'
);

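Step 3: Run a dual-path recall query

The first common table expression, t1, recalls results through full-text search (one row for the sample data), and the second, t2, recalls results through vector search (five rows). A FULL OUTER JOIN then combines the BM25 score and the vector similarity score into a total score, and the results are returned sorted by that total score.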

WITH t1 AS (
    SELECT
            id,
            docname,
            title,
            text,
            text @@@ pgsearch.config('text:astronomical') AS score,
            2 AS source
    FROM
        documents
    ORDER BY score
    LIMIT 10
),
t2 AS (
    SELECT
        id,
        docname,
        title,
        text,
        cosine_similarity(vector,ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]::real[]) AS score,
        1 AS source
    FROM
        documents
    ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
    LIMIT 10
)
SELECT t2.*, COALESCE(ABS(t1.score), 0.0) * 0.2 + COALESCE(t2.score, 0.0) * 0.8 AS hybrid_score
-- The score weights here are for demonstration only; choose parameters and a scoring method that fit your workload.
FROM t1
FULL OUTER JOIN t2 ON t1.id = t2.id 
ORDER BY  hybrid_score DESC;
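
The weighted fusion in the query above can be sketched outside the database as follows. This is a minimal Python illustration, not part of the product: the function name `weighted_fusion` and all scores in the example are hypothetical, and the 0.2 and 0.8 weights simply mirror the demonstration values in the SQL.

```python
def weighted_fusion(bm25_hits, vector_hits, bm25_weight=0.2, vector_weight=0.8):
    """Merge two {id: score} maps like FULL OUTER JOIN plus COALESCE(..., 0.0)."""
    # The union of ids plays the role of the FULL OUTER JOIN.
    all_ids = set(bm25_hits) | set(vector_hits)
    fused = {}
    for doc_id in all_ids:
        # A missing score defaults to 0.0, as COALESCE does in the SQL.
        bm25_score = abs(bm25_hits.get(doc_id, 0.0))
        vector_score = vector_hits.get(doc_id, 0.0)
        fused[doc_id] = bm25_score * bm25_weight + vector_score * vector_weight
    # Highest hybrid score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for illustration only.
bm25_hits = {"1": -1.5}                          # full-text path: one hit
vector_hits = {"1": 1.0, "2": 0.99, "3": 0.98}   # vector path: cosine similarity
ranked = weighted_fusion(bm25_hits, vector_hits)
```

With these made-up inputs, document 1 appears in both paths and scores about 1.1, ahead of documents that appear only in the vector path. Because BM25 scores and cosine similarities live on different scales, the weights usually need tuning on real data.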

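You can also use Reciprocal Rank Fusion (RRF) to calculate the final score. RRF determines the final ranking of the recalled results by combining their ranks from vector search and full-text search. In general, a result that ranks near the top in both retrieval methods also receives a higher combined score.

The parameter k in the RRF formula smooths the influence of rank on the final score. A larger k reduces the score difference between ranks, which produces a smoother result. By default, k is 60.

RRF example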

WITH bm25 AS (
        SELECT
            id,
            docname,
            title,
            text,
            text @@@ pgsearch.config('text:astronomical') AS score,
            2 AS source,
            ROW_NUMBER() OVER () AS rank_bm25
        FROM
            documents
        ORDER BY score
        LIMIT 10
), hnsw AS (
        SELECT
            id,
            docname,
            title,
            text,
            vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1] AS score,
            1 AS source,
            ROW_NUMBER() OVER (ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]) AS rank_hnsw
        FROM
            documents
        ORDER BY vector <-> ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
        LIMIT 10
)
SELECT 
    COALESCE(bm25.id, hnsw.id) AS id,
    COALESCE(bm25.docname, hnsw.docname) AS docname,
    COALESCE(bm25.title, hnsw.title) AS title,
    COALESCE(bm25.text, hnsw.text) as text,
    CASE 
        WHEN bm25.rank_bm25 > 0 AND hnsw.rank_hnsw > 0 THEN 
            COALESCE(1.0 / (60 + bm25.rank_bm25), 0) + COALESCE(1.0 / (60 + hnsw.rank_hnsw), 0)
        WHEN bm25.rank_bm25 > 0 THEN 
            COALESCE(1.0 / (60 + bm25.rank_bm25), 0)
        WHEN hnsw.rank_hnsw > 0 THEN 
            COALESCE(1.0 / (60 + hnsw.rank_hnsw), 0)
        ELSE 0
    END AS hybrid_score
FROM 
    bm25
FULL OUTER JOIN hnsw ON bm25.id = hnsw.id 
ORDER BY hybrid_score DESC;
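
The scoring in the RRF query above follows the usual RRF form, in which each result list contributes 1 / (k + rank) for every document it contains. The Python sketch below reproduces that arithmetic for the sample data. The function name `rrf_fuse` is hypothetical, and the two input rankings are assumptions about what each path returns for the sample query.

```python
def rrf_fuse(rankings, k=60):
    """Sum 1 / (k + rank) over every ranking in which a document appears."""
    scores = {}
    for ranking in rankings:
        # Ranks start at 1, matching ROW_NUMBER() in the SQL.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Assumed rankings for the sample query (best match first).
bm25_ranking = ["1"]                      # full-text path: one hit
hnsw_ranking = ["1", "2", "3", "4", "5"]  # vector path: five hits
fused = rrf_fuse([bm25_ranking, hnsw_ranking])

# Effect of k: the score gap between rank 1 and rank 2 shrinks as k grows.
gap_small_k = 1.0 / (1 + 1) - 1.0 / (1 + 2)
gap_default_k = 1.0 / (60 + 1) - 1.0 / (60 + 2)
```

Document 1 ranks first in both lists, so it receives 1/61 + 1/61, while every other document receives a single term from the vector path. The two gap values show the smoothing role of k: the gap between the first two ranks is far smaller at k = 60 than at k = 1.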

Step 4: Wrap the query in a function and call it (optional)

Wrap the query from Step 3 in a function to simplify calls.

Create the function.

CREATE OR REPLACE FUNCTION search_documents(
    table_name TEXT,
    vector_column TEXT,
    text_column TEXT,
    search_keyword TEXT,
    search_vector REAL[],
    limit_size INT,
    hnsw_weight FLOAT8 DEFAULT 0.8  -- Default weight for hnsw
)
RETURNS TABLE (
    id TEXT,
    docname TEXT,
    title TEXT,
    text TEXT,
    hybrid_score FLOAT8
) AS $$
DECLARE
    query_string TEXT;
    bm25_weight FLOAT8;
BEGIN
    bm25_weight := 1.0 - hnsw_weight;

    query_string := 'WITH t1 AS (
                            SELECT
                                id,
                                docname,
                                title,
                                ' || text_column || ',
                                ' || text_column || ' @@@ pgsearch.config(' || quote_literal(search_keyword) || ') AS score,
                                2 AS source
                            FROM
                                ' || table_name || '
                            ORDER BY score	
                            LIMIT ' || limit_size || '
                    ), t2 AS (
                            SELECT
                                id,
                                docname,
                                title,
                                ' || text_column || ',
                                cosine_similarity(' || vector_column || ', $1) AS score,
                                1 AS source
                            FROM
                                ' || table_name || '
                            ORDER BY ' || vector_column || ' <-> $1
                            LIMIT ' || limit_size || '
                    )
                    SELECT t2.id, t2.docname, t2.title, t2.' || text_column || ', 
                    COALESCE(ABS(t1.score), 0.0) * ' || bm25_weight || ' + 
                    COALESCE(t2.score, 0.0) * ' || hnsw_weight || ' AS hybrid_score
                    FROM t1
                    FULL OUTER JOIN t2 ON t1.id = t2.id 
                    ORDER BY hybrid_score DESC;';
    
 RETURN QUERY EXECUTE query_string USING search_vector;
END; $$
LANGUAGE plpgsql;

Call the search_documents function to run the query.

SELECT * 
FROM search_documents(
    'documents', 
    'vector',
    'text', 
    'astronomical', 
    ARRAY[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], 
    10,
    0.8
);
