Use hybrid search - AnalyticDB - Alibaba Cloud Documentation Center

If you want to search for structured and unstructured data at the same time, you can use the hybrid search feature of AnalyticDB for PostgreSQL vector databases. Hybrid search supports structured field filtering, semi-structured field filtering, and two-way retrieval based on vector search and full-text search.

This topic is suitable for two-way retrieval in AnalyticDB for PostgreSQL V6.0 and V7.0. However, Two-way retrieval based on vector search and full-text search in V7.0 is more suitable for V7.0.

Introduction to hybrid search

Approximate nearest neighbor search (ANNS) vector indexes can handle only unstructured data search. In actual production environments, you may be required to search structured and unstructured data at the same time. For example, you want to query whether specific persons appear in a specific area within a specific time range. The time range and area range may be structured data, and human faces may be stored as unstructured data. To meet these requirements, AnalyticDB for PostgreSQL provides the hybrid search feature for structured and unstructured data.

Most hybrid search solutions in the industry store structured data in database systems and unstructured data in vector search systems. Search results from the two types of systems are intersected to find the common matches. Then, the intersected results are aggregated to obtain the final result. This method extracts amplified results (quantity = amplification factor × top K) from vector search systems and then uses structured indexes to filter the results. However, structured indexes may filter out a large number of results, failing to meet the requirements for retrieving the top K results. The following section describes how hybrid search works by using structured and unstructured fields.

How hybrid search works

AnalyticDB for PostgreSQL integrates the FastANN vector search engine into databases as an index extension and allows you to perform vector search by using structured and semi-structured indexes at the same time. You can use the ranked-biased overlap (RBO) cost estimation rules of the optimizer to generate different execution plans to search for structured and unstructured data at the same time.

The following example shows how hybrid search works:

Assume that you want to search for the top 100 items that have the highest similarity to a specified image, fall within a price range of USD 100 to USD 200, and were put on sale within the last month.

Sample statements to create a table and indexes:

CREATE TABLE products (
    id serial primary key, 
    name varchar(256), 
    price real, 
    inTime timestamp, 
    url varchar(256),
    feature real[]
);

-- Set the storage mode of the vector column to PLAIN. 
ALTER TABLE products ALTER COLUMN feature SET STORAGE PLAIN;

-- Create a B-tree index for structured columns. 
CREATE INDEX ON products(price, intime);

-- Create a vector index for the vector column. 
CREATE INDEX ON products USING ann(feature) WITH (dim=512);

-- Collect statistics to generate an optimal execution plan for hybrid search. 
ANALYZE products;

Sample statement to perform hybrid search:

SELECT id, price FROM products WHERE 
    price > 100 AND price <=200 
    AND inTime > '2019-03-01 00:00:00' AND inTime <= '2019-03-31 00:00:00' 
ORDER BY 
    feature <-> array[10,2.0,..., 512.0] 
LIMIT 100;

AnalyticDB for PostgreSQL can generate three categories of execution plans based on whether an index is created for the structured condition columns (price and inTime) and the selectivity of structured conditions. To meet performance and retrieval requirements, AnalyticDB for PostgreSQL selects an optimal execution plan.

Category 1: Brute-force query

The system obtains all rows that meet the structured conditions, sorts the rows by vector distance, and then displays 100 rows of data that have the smallest distance. This execution plan provides a recall rate of 100% but returns query results at a low speed, especially when large amounts of data are queried or a large number of rows meet the structured conditions. Sample execution plan:

                                       QUERY PLAN
-----------------------------------------------------------------------------------
 LIMIT
     ->  Gather Motion 3:1  (slice1; segments: 3)   
         Merge Key: (l2_squared_distance($0, feature))     
             ->  LIMIT
                 ->  Sort
                     Sort Key: (l2_squared_distance($0, feature))
                     ->  Index Scan using products_price_idx on products
                         Index Cond: the filter condition for the price column.
                         Filter: the filter condition for the time column.

 Optimizer: Postgres query optimizer

Category 2: Pure vector search and structured filtering

The system uses a vector index to query N rows of data that have the highest similarity to a specified image, and then filters the data based on the structured conditions. This execution plan provides a fast query speed. However, if the structured conditions result in a low selectivity, the final output may contain fewer rows than expected. Sample execution plan:

                                       QUERY PLAN
-----------------------------------------------------------------------------------
 LIMIT 
    ->  Gather Motion 3:1  (slice1; segments: 3) 
        Merge Key: ((feature <-> $0))     
            ->  LIMIT
                ->  Ann Index Scan using products_feature_idx on products
                    ORDER BY: (feature <-> $0)
                    Filter: the filter condition for the price and time columns.

 Optimizer: Postgres query optimizer

Category 3: Hybrid search that combines vector search and structured filtering
This execution plan combines the advantages of the preceding two execution plans. It uses indexes to query data and prevents the possibility of fewer data results. If a structured condition column is indexed and the index type supports bitmaps, this execution plan can use other indexes to generate bitmaps. This accelerates hybrid search. Sample execution plan:
```
                                       QUERY PLAN
-----------------------------------------------------------------------------------
 LIMIT
     ->  Gather Motion 3:1  (slice1; segments: 3)   
         Merge Key: ((feature <-> $0))   
         ->  LIMIT
             ->  Fusion Ann Scan
                 ->  Bitmap Index Scan on products_price_idx
                     Index Cond: the filter condition for the price column.
                     ->  Ann Index Scan with filter using products_feature_idx on products
                         ORDER BY: (feature <-> $0)
                         Filter: the filter condition for the time column.

 Optimizer: Postgres query optimizer
```
The display of the Fusion Ann Scan or Ann Index Scan with filter node indicates that hybrid search is used in this execution plan.
- Ann Index Scan with filter: This node pushes filter conditions down to a vector index so that you can use the filter conditions and the index.
- Fusion Ann Scan: This node is displayed when specific structured condition columns are indexed. This node can accelerate the calculation of structured conditions by generating a bitmap as a left subtree and pushing the bitmap down to the vector index as a right subtree. When a Fusion Ann Scan node is displayed, the right subtree may be Ann Index Scan or Ann Index Scan with filter. If the Ann Index Scan node is displayed, only the bitmap is pushed down. If the Ann Index Scan with filter node is displayed, filter conditions other than the bitmap are pushed down.

Usage methods

Hybrid search of AnalyticDB for PostgreSQL vector databases supports structured field filtering, semi-structured field filtering, and two-way retrieval based on vector search and full-text search. You can use one of the following methods to implement hybrid search:

Vector search combined with structured field filtering. Data types of all structured fields are supported, including BIGINT, BOOLEAN, BYTEA, CHAR, VARCHAR, INTEGER, FLOAT, DOUBLE, DATE, SMALLINT, TIMESTAMP, and SERIAL. The structured fields can help accelerate hybrid search by creating default B-tree indexes. For more information, see Data types.
Vector search combined with semi-structured field filtering. Data types of all semi-structured fields are supported, including JSON, JSONB, and ARRAY. The structured fields can help accelerate hybrid search by creating GIN indexes. For more information, see Manage JSON and JSON-B data and ARRAY type.
Two-way retrieval based on vector search and full-text search.

For information about how structured fields accelerate queries, see the "How hybrid search works" section of this topic. The following examples demonstrate how to perform hybrid search by using semi-structured field filtering and how to perform two-way retrieval based on vector search and full-text search.

Use semi-structured field filtering in hybrid search

Assume that a securities company has a database named stock_analysis_chunks that contains stock analysis articles. The following table describes the fields that are contained in the database.

Field	Data type	Description
id	serial	The serial number.
chunk	varchar(1024)	The text chunk that is a part of the stock analysis article.
release_time	timestamp	The release time of the stock analysis article.
stock_id_list	char(10)[]	The IDs of the stocks that are involved in the stock analysis article.
url	varchar(1024)	The URL of the article to which the text chunk belongs.
feature	real[]	The embedding vector of the text block.

Create a text database table named stock_analysis_chunks.

CREATE TABLE stock_analysis_chunks (
    id serial primary key,
    chunk varchar(1024),
    release_time timestamp,
    stock_id_list integer[],
    url varchar(1024),
    feature real[]
) distributed by (id);

Set the storage mode of the vector column to PLAIN.

ALTER TABLE stock_analysis_chunks ALTER COLUMN feature SET STORAGE PLAIN;

Create a vector index for the vector column.

CREATE INDEX ON stock_analysis_chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1, hnsw_ef_construction=128);

Create indexes for the structured and semi-structured columns associated with hybrid search.

-- Create a B-tree index for the structured column.
CREATE INDEX ON stock_analysis_chunks(release_time);
-- Create a GIN index for the semi-structured column.
CREATE INDEX ON stock_analysis_chunks USING gin(stock_id_list);

Collect statistics.
```
ANALYZE stock_analysis_chunks;
```

Perform hybrid search.

To query the stock analysis articles that were released within the last month and contain specific text content, execute the following statement:

SELECT id, url, cosine_similarity(feature, array[1.0, 2.0, ..., 1536.0]::real[]) as score
FROM stock_analysis_chunks
WHERE release_time >= '2023-07-18 00:00:00' AND release_time <= '2023-08-18 00:00:00' AND stock_id_list @> ARRAY['BABA', 'AAPL', 'MSFT', 'AMZN'] 
ORDER BY feature <=> array[1.0, 2.0, ..., 1536.0]::real[]
LIMIT 10;

Two-way retrieval based on vector search and full-text search

Two-way retrieval refers to a method that retrieves data by using vector search and full-text search at the same time. In most cases, the vector search method alone can achieve a high recall rate for vector similarity retrieval. However, in specific scenarios such as poorly performing embedding models or complex queries, generated vectors may not align well with the data to be retrieved. In this case, the use of the vector search method alone cannot achieve the desired results. To increase the recall rate, you can use the two-way retrieval method based on vector search and full-text search.

The two-way retrieval method in AnalyticDB for PostgreSQL separately retrieves subsets of data by using vector search and full-text search, and then reranks and post-processes the merged retrieval results for better effects.

Procedure

Create a table.

-- vector: the vector column. to_tsvector: the full-text tokenizer column.
CREATE TABLE IF NOT EXISTS documents(
                id TEXT,
                docname TEXT,
                title TEXT,
                vector real[],
                text TEXT,
                to_tsvector TSVECTOR);

Set the storage mode of the vector column to PLAIN.

ALTER TABLE documents ALTER COLUMN vector SET STORAGE PLAIN;

Create indexes.

Create a vector index for the vector column.

CREATE INDEX ON documents USING ann(vector) WITH (dim=1024, distancemeasure=cosine, hnsw_m=64, pq_enable=1, hnsw_ef_construction=128);

Create a GIN index for the full-text tokenizer column.
```
CREATE INDEX ON documents USING gin(to_tsvector);
```

Create a dependency function.

CREATE OR REPLACE FUNCTION public.to_tsquery_from_text(txt text, lang regconfig DEFAULT 'english'::regconfig)
 RETURNS tsquery
 LANGUAGE sql
 IMMUTABLE STRICT
AS $function$ SELECT to_tsquery(lang, COALESCE(string_agg(split_part(word, ':', 1), ' | '), '')) FROM (SELECT UNNEST(STRING_TO_ARRAY(TO_TSVECTOR(lang, txt)::text, ' ')) AS TEXT) AS words_only;$function$

Perform two-way retrieval.

 WITH combined AS (
   (
     SELECT
        id,
        docname,
        text,
        cosine_similarity(densevec,ARRAY{embedding}::real[]) AS similarity,
        1 AS source
     FROM
        documents
     ORDER BY vector <=> ARRAY[10,2.0,…, 1024.0]
     LIMIT 10
   )
   UNION ALL
   (
     SELECT
        id,
        docname,
        text,
        ts_rank(to_tsvector, to_tsquery_from_text('{query.query}', 'zh_cn'), 32) AS similarity,
        2 AS source
     FROM
        documents
     WHERE to_tsvector@@to_tsquery_from_text('{query.query}', 'zh_cn')
     ORDER BY similarity DESC	
     LIMIT 10
   )
)
SELECT id, docname, title, text, MAX(similarity) AS similarity,
                BIT_OR(source) AS source
                FROM combined
                GROUP BY id
                ORDER BY similarity DESC;

In the UNION clause, the first part uses vector search to retrieve 10 rows of data and the second part uses full-text search to retrieve 10 rows of data. Then, the data results are deduplicated based on the id column. Finally, the preceding two-way retrieval statement returns 20 or fewer rows of data.