Build AI search with hybrid vector and full-text retrieval on PolarDB for PostgreSQL

PolarDB for PostgreSQL supports several retrieval methods: dense retrieval, sparse retrieval, and hybrid retrieval.

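Background

Dense retrieval: uses semantic context to understand the meaning behind a query.
Sparse retrieval: emphasizes text matching and finds results by specific terms; it is equivalent to full-text search.
Hybrid retrieval: combines dense and sparse retrieval to capture both the full context and specific keywords, which yields more complete search results.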

Prepare the base data

Use a privileged account to create the extensions required for retrieval.
```
CREATE EXTENSION IF NOT EXISTS rum;
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS polar_ai;
```
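The extensions serve the following purposes:
- rum (full-text search acceleration): supports full-text search and relevance ranking.
- vector (vector retrieval): supports vector retrieval.
- polar_ai: used to create the models that convert text into vectors.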
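
Before continuing, it can help to confirm that all three extensions were actually created. The following is a minimal check against the standard `pg_extension` catalog; it assumes nothing beyond the extension names listed above.

```sql
-- List the three extensions and their installed versions.
-- A missing row means the corresponding CREATE EXTENSION did not succeed.
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('rum', 'vector', 'polar_ai')
ORDER BY extname;
```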

Create a table and insert test data.

CREATE TABLE t_chunk(id serial, chunk text, embedding vector(1536), v tsvector);

INSERT INTO t_chunk(chunk) VALUES('Unlock the Power of AI 1 million free tokens 88% Price Reduction Activate Now AI Search Contact Sales English Cart Console Log In Why Us Pricing Products Solutions Marketplace Developers Partners Documentation Services Model Studio PolarDB Filter in menu Product Overview Benefits Billing Announcements and Updates Getting Started User Guide Use Cases Developer Reference Support Home Page PolarDBProduct OverviewSearch for Help ContentProduct OverviewUpdated at: 2025-01-06 08:50ProductCommunityWhat is PolarDB?PolarDB is a new-generation database service that is developed by Alibaba Cloud. This service decouples computing from storage and uses integrated software and hardware. PolarDB is a secure and reliable database service that provides auto scaling within seconds, high performance, and mass storage. PolarDB is 100% compatible with MySQL and PostgreSQL and highly compatible with Oracle.');
INSERT INTO t_chunk(chunk) VALUES('PolarDB provides three engines: PolarDB for MySQL, PolarDB for PostgreSQL, and PolarDB-X. Years of best practices in Double 11 events prove that PolarDB can offer the flexibility of open source ecosystems and the high performance and security of commercial cloud-native databases.Database engine Ecosystem Compatibility Architecture Platform Scenario PolarDB for MySQL MySQL 100% compatible with MySQL Shared storage and compute-storage decoupled architecture Public cloud, Apsara Stack Enterprise Edition, DBStack');
INSERT INTO t_chunk(chunk) VALUES('PolarDB for PostgreSQL PostgreSQL and Oracle 100% compatible with MySQL and highly compatible with Oracle Shared storage and compute-storage decoupled architecture Public cloud, Apsara Stack Enterprise Edition, DBStack Cloud-native databases in the PostgreSQL ecosystem PolarDB-X MySQL Standard Edition is 100% compatible with MySQL and Enterprise Edition is highly compatible with MySQL shared nothing and distributed architecture Public cloud, Apsara Stack Enterprise Edition, DBStack');
INSERT INTO t_chunk(chunk) VALUES('Architecture of PolarDB for MySQL and PolarDB for PostgreSQL PolarDB for MySQL and PolarDB for PostgreSQL both use an architecture of shared storage and compute-storage decoupling. They are featured by cloud-native architecture, integrated software and hardware, and shared distributed storage. Physical replication and RDMA are used between, the primary node and read-only nodes to reduce latency and accelerate data synchronization. This resolves the issue of non-strong data consistency caused by asynchronous replication and ensures zero data loss in case of single point of failure (SPOF). The architecture also enables node scaling within seconds.');
INSERT INTO t_chunk(chunk) VALUES('Core components PolarProxy PolarDB uses PolarProxy to provide external services for the applications. PolarProxy forwards the requests from the applications to database nodes. You can use the proxy to perform authentication, data protection, and session persistence. The proxy parses SQL statements, sends write requests to the primary node, and evenly distributes read requests to multiple read-only nodes.Compute nodes A cluster contains one primary node and multiple read-only nodes. A cluster of Multi-master Cluster Edition (only for PolarDB for MySQL) supports multiple primary nodes and multiple read-only nodes. Compute nodes can be either general-purpose or dedicated.Shared storage Multiple nodes in a cluster share storage resources. A single cluster supports up to 500 TB of storage capacity.');
INSERT INTO t_chunk(chunk) VALUES('Architecture benefits Large storage capacity The maximum storage capacity of a cluster is 500 TB. You do not need to purchase clusters for database sharding due to the storage limit of a single host. This simplifies application development and reduces the O&M workload.Cost-effectiveness PolarDB decouples computing and storage. You are charged only for the computing resources when you add read-only nodes to a PolarDB cluster. In traditional database solutions, you are charged for both computing and storage resources when you add nodes.Elastic scaling within minutes PolarDB supports rapid scaling for computing resources. This is based on container virtualization, shared storage, and compute-storage decoupling. It requires only 5 minutes to add or remove a node. The storage capability is automatically scaled up. During the scale-up process, your services are not interrupted.');
INSERT INTO t_chunk(chunk) VALUES('Read consistency PolarDB uses log sequence numbers (LSNs) for cluster endpoints that have read/write splitting enabled. This ensures global consistency for read operations and prevents the inconsistency that is caused by the replication delay between the primary node and read-only nodes.Millisecond-level latency in physical replication PolarDB performs physical replication from the primary node to read-only nodes based on redo logs. The physical replication replaces the logical replication that is based on binary logs. This way, the replication efficiency and stability are improved. No delays occur even if you perform DDL operations on large tables, such as adding indexes or fields.Data backup within seconds Snapshots that are implemented based on the distributed storage can back up a database with terabytes of data in a few minutes. During the entire backup process, no locks are required, which ensures high efficiency and minimized impacts on your business. Data can be backed up anytime.');
INSERT INTO t_chunk(chunk) VALUES('Architecture of PolarDB-X PolarDB-X uses an architecture of shared nothing and compute-storage decoupling. This architecture allows you to achieve hierarchical capacity planning based on your business requirements and implement mass scaling.Core components Global meta service (GMS): provides distributed metadata and a global timestamp distributor named Timestamp Oracle (TSO) and maintains meta information such as tables, schemas, and statistics. GMS also maintains security information such as accounts and permissions.Compute node (CN): provides a distributed SQL engine that contains core optimizers and executors. A CN uses a stateless SQL engine to provide distributed routing and computing and uses the two-phase commit protocol (2PC) to coordinate distributed transactions. A CN also executes DDL statements in a distributed manner and maintains global indexes.Data node (DN): provides a data storage engine. A data node uses Paxos to provide highly reliable storage services and uses multiversion concurrency control (MVCC) for distributed transactions. A data node also provides the pushdown computation feature to push down operators such as Project, Filter, Join, and Agg in distributed systems, and supports local SSDs and shared storage.Change data capture (CDC): provides a primary/secondary replication protocol that is compatible with MySQL. The primary/secondary replication protocol is compatible with the protocols and data formats that are supported by MySQL binary logging. CDC uses the primary/secondary replication protocol to exchange data.');

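Generate the vector data. You can create a custom model and call it to convert text into vectors.

-- Run the embedding
UPDATE t_chunk SET embedding = <custom model invocation function>('<custom model name>', chunk);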
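
After the update, a quick sanity check can show whether every row received an embedding of the expected size. This sketch assumes only the `t_chunk` table defined above and the `vector_dims` function from the vector extension.

```sql
-- Rows that still have no embedding (0 is expected once the UPDATE has run).
SELECT count(*) AS missing_embeddings
FROM t_chunk
WHERE embedding IS NULL;

-- Dimension of the stored vectors; it should match the vector(1536) column type.
SELECT DISTINCT vector_dims(embedding) AS dims
FROM t_chunk
WHERE embedding IS NOT NULL;
```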

Create the indexes required for retrieval.

Vector index. This example uses L2 distance; adjust as needed.
```
CREATE INDEX ON t_chunk using hnsw(embedding vector_l2_ops);
```
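
If L2 distance does not suit your embeddings, the vector extension also provides operator classes for cosine distance and inner product. The sketch below shows the alternative index definitions; which one is appropriate depends on the embedding model, so treat it as an illustration rather than a recommendation. Queries must use the operator that matches the index.

```sql
-- Cosine distance: build the index with vector_cosine_ops
-- and query with the <=> operator on vector operands.
CREATE INDEX ON t_chunk USING hnsw (embedding vector_cosine_ops);

-- Inner product: build the index with vector_ip_ops and query with <#>,
-- which returns the negative inner product so that smaller is still better.
CREATE INDEX ON t_chunk USING hnsw (embedding vector_ip_ops);
```

Note that `<=>` between two vectors (cosine distance) is a different operator from `<=>` between a tsvector and a tsquery (RUM distance); PostgreSQL picks one by operand type.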

Full-text index.

UPDATE t_chunk SET v = to_tsvector('english', chunk);

CREATE INDEX ON t_chunk USING rum (v rum_tsvector_ops);
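
The `UPDATE` above fills `v` only for rows that exist at that moment. One possible way to keep `v` in sync for later inserts and updates is PostgreSQL's built-in `tsvector_update_trigger` function. The trigger name `t_chunk_tsv_update` is a hypothetical name chosen for this sketch.

```sql
-- Recompute v from chunk whenever a row is inserted or updated.
-- The text search configuration name must be schema-qualified.
CREATE TRIGGER t_chunk_tsv_update
BEFORE INSERT OR UPDATE ON t_chunk
FOR EACH ROW
EXECUTE FUNCTION tsvector_update_trigger(v, 'pg_catalog.english', chunk);
```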

Retrieval

Dense retrieval

Retrieve by vector only. A smaller distance means higher similarity.

SELECT chunk, embedding <-> polar_ai.ai_text_embedding('What database engines does PolarDB provide')::vector(1536) AS dist
FROM t_chunk
ORDER BY dist ASC
LIMIT 5;

Sparse retrieval

Retrieve by full text only. A smaller distance means higher similarity.

SELECT chunk, v <=> to_tsquery('english', 'PolarDB|PostgreSQL|efficiency') AS rank
FROM t_chunk
WHERE v @@ to_tsquery('english', 'PolarDB|PostgreSQL|efficiency')
ORDER BY rank ASC
LIMIT 5;
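
To see why a chunk does or does not match, it can be useful to inspect how the query string and the stored text are normalized. This sketch uses only standard PostgreSQL text search functions and the `t_chunk` table from above.

```sql
-- Show the normalized query: terms are lowercased and stemmed, and | means OR.
SELECT to_tsquery('english', 'PolarDB|PostgreSQL|efficiency');

-- Show the lexemes stored for one chunk.
SELECT id, v FROM t_chunk ORDER BY id LIMIT 1;

-- To require all terms instead of any of them, combine them with & (AND).
SELECT to_tsquery('english', 'PolarDB & PostgreSQL');
```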

Hybrid retrieval

Merge the results of the two query methods to achieve multi-path recall.

WITH t AS (
  SELECT chunk, embedding <-> polar_ai.ai_text_embedding('What database engines does PolarDB provide')::vector(1536) AS dist
  FROM t_chunk
  ORDER BY dist ASC
  LIMIT 5
),
t2 AS (
  SELECT chunk, v <=> to_tsquery('english', 'PolarDB|PostgreSQL|efficiency') AS rank
  FROM t_chunk
  WHERE v @@ to_tsquery('english', 'PolarDB|PostgreSQL|efficiency')
  ORDER BY rank ASC
  LIMIT 5
)
SELECT * FROM t
UNION ALL
SELECT * FROM t2;

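Because the two distance measures are not directly comparable, use the RRF model to produce a unified ranking. RRF (Reciprocal Rank Fusion) combines multiple result sets that have different relevance metrics into a single result set. It requires no tuning, and the relevance metrics do not need to be related to one another to produce high-quality results. The basic steps are as follows: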

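Collect rankings in the recall stage
Multiple retrievers (one per recall path) each produce a ranked result list for the query.
Fuse the rankings
A simple scoring function (such as a sum of reciprocals) combines the rank positions from all retrievers. The formula is:
$RRF\_score(d) = \sum_{i=1}^{N} \frac{1}{k + rank_{i}(d)}$
where $N$ is the number of recall paths, $rank_{i}(d)$ is the rank position of document $d$ in the results of retriever $i$, and $k$ is a smoothing parameter, typically 60.
Final ordering
Documents are re-sorted by the fused score to produce the final result.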
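
The three steps can be tried on a toy example that needs no extensions. In this sketch the document labels A to D and their rank positions are made up for illustration; `dense` and `sparse` stand in for two recall paths, and $k$ is 60.

```sql
-- Two hypothetical ranked lists (row_num = rank position, 1 is best).
WITH dense(doc, row_num) AS (
  VALUES ('A', 1), ('B', 2), ('C', 3)
),
sparse(doc, row_num) AS (
  VALUES ('B', 1), ('D', 2), ('A', 3)
),
-- One reciprocal-rank score per (document, recall path) pair.
scored AS (
  SELECT doc, 1.0 / (60 + row_num) AS score FROM dense
  UNION ALL
  SELECT doc, 1.0 / (60 + row_num) FROM sparse
)
-- Sum the scores per document and sort by the fused score.
SELECT doc, round(sum(score), 5) AS rrf_score
FROM scored
GROUP BY doc
ORDER BY rrf_score DESC;
-- Result order: B (0.03252), A (0.03227), D (0.01613), C (0.01587)
```

B comes first because it ranks high in both lists (1/62 + 1/61), while C and D each appear in only one list and therefore receive a single term.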

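If you are not satisfied with the order of the fused results, adjust the parameter $k$ to change it. Using the $k$ passed at query time, each document in the top-K results returned separately by full-text retrieval and vector retrieval is scored as $\frac{1}{k + rank_{i}(d)}$, where $rank_{i}(d)$ is the position of document $d$ in result set $i$. If a document in the full-text top-K results does not appear in the dense vector top-K results, it receives only one score; the same applies to a document that appears only in the dense vector results. If a document appears in both top-K result sets, its two scores are added together.

Note

The smoothing parameter $k$ determines how much the documents in each individual result set influence the final ranking. The higher the value, the more influence lower-ranked documents have on the final ranking.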
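
The effect of $k$ can be seen by comparing the score of rank 1 with the score of rank 5 for a few values. The values 1, 60, and 1000 below are arbitrary choices for illustration.

```sql
-- For each k, compare the score at rank 1 with the score at rank 5.
-- The last column is score(rank 5) / score(rank 1) = (k + 1) / (k + 5).
SELECT k,
       round(1.0 / (k + 1), 5) AS score_rank1,
       round(1.0 / (k + 5), 5) AS score_rank5,
       round((k + 1.0) / (k + 5), 3) AS rank5_to_rank1
FROM (VALUES (1), (60), (1000)) AS p(k)
ORDER BY k;
```

The ratio is about 0.333 for $k = 1$, 0.938 for $k = 60$, and 0.996 for $k = 1000$. As $k$ grows, a document at rank 5 contributes almost as much as one at rank 1.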

-- Dense vector recall: top 5 chunks by L2 distance
WITH t1 AS (
  SELECT chunk, embedding <-> polar_ai.ai_text_embedding('What database engines does PolarDB provide')::vector(1536) AS dist
  FROM t_chunk
  ORDER BY dist ASC
  LIMIT 5
),
-- Rank positions of the dense results (1 = closest)
t2 AS (
  SELECT ROW_NUMBER() OVER (ORDER BY dist ASC) AS row_num, chunk
  FROM t1
),
-- Sparse (full-text) recall: top 5 chunks by RUM distance
t3 AS (
  SELECT chunk, v <=> to_tsquery('english', 'PolarDB|PostgreSQL|efficiency') AS rank
  FROM t_chunk
  WHERE v @@ to_tsquery('english', 'PolarDB|PostgreSQL|efficiency')
  ORDER BY rank ASC
  LIMIT 5
),
-- Rank positions of the sparse results; ascending, because a smaller distance is a better match
t4 AS (
  SELECT ROW_NUMBER() OVER (ORDER BY rank ASC) AS row_num, chunk
  FROM t3
),
-- Compute the RRF score for each result set separately (k = 60)
t5 AS (
  SELECT 1.0/(60+row_num) AS score, chunk FROM t2
  UNION ALL
  SELECT 1.0/(60+row_num), chunk FROM t4
)
-- Merge the scores
SELECT sum(score) AS score, chunk
FROM t5
GROUP BY chunk
ORDER BY score DESC;

Weighted fusion

You can also assign a different weight to each result set, for example 0.8 for dense retrieval and 0.2 for sparse retrieval.

-- Dense vector recall: top 5 chunks by L2 distance
WITH t1 AS (
  SELECT chunk, embedding <-> polar_ai.ai_text_embedding('What database engines does PolarDB provide')::vector(1536) AS dist
  FROM t_chunk
  ORDER BY dist ASC
  LIMIT 5
),
-- Rank positions of the dense results (1 = closest)
t2 AS (
  SELECT ROW_NUMBER() OVER (ORDER BY dist ASC) AS row_num, chunk
  FROM t1
),
-- Sparse (full-text) recall: top 5 chunks by RUM distance
t3 AS (
  SELECT chunk, v <=> to_tsquery('english', 'PolarDB|PostgreSQL|efficiency') AS rank
  FROM t_chunk
  WHERE v @@ to_tsquery('english', 'PolarDB|PostgreSQL|efficiency')
  ORDER BY rank ASC
  LIMIT 5
),
-- Rank positions of the sparse results; ascending, because a smaller distance is a better match
t4 AS (
  SELECT ROW_NUMBER() OVER (ORDER BY rank ASC) AS row_num, chunk
  FROM t3
),
-- Compute the RRF score for each result set separately, with weights 0.8 and 0.2
t5 AS (
  SELECT (1.0/(60+row_num)) * 0.8 AS score, chunk FROM t2
  UNION ALL
  SELECT (1.0/(60+row_num)) * 0.2, chunk FROM t4
)
-- Merge the scores
SELECT sum(score) AS score, chunk
FROM t5
GROUP BY chunk
ORDER BY score DESC;
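
To see how weights change the outcome, the following extension-free sketch applies the 0.8 and 0.2 weights to two made-up ranked lists. The document labels and rank positions are hypothetical; `dense` and `sparse` stand in for the two recall paths.

```sql
-- Two hypothetical ranked lists (row_num = rank position, 1 is best).
WITH dense(doc, row_num) AS (
  VALUES ('A', 1), ('B', 2), ('C', 3)
),
sparse(doc, row_num) AS (
  VALUES ('B', 1), ('D', 2), ('A', 3)
),
-- Weighted reciprocal-rank scores: 0.8 for dense, 0.2 for sparse.
scored AS (
  SELECT doc, (1.0 / (60 + row_num)) * 0.8 AS score FROM dense
  UNION ALL
  SELECT doc, (1.0 / (60 + row_num)) * 0.2 FROM sparse
)
SELECT doc, round(sum(score), 5) AS weighted_score
FROM scored
GROUP BY doc
ORDER BY weighted_score DESC;
-- Result order: A (0.01629), B (0.01618), C (0.01270), D (0.00323)
```

With equal weights B would rank first on these lists, since it is near the top of both. With the dense path weighted at 0.8, A (the top dense result) moves ahead, and C, which appears only in the dense list, overtakes D.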