Combine vector search with full-text search and field filtering for hybrid search - AnalyticDB

AnalyticDB for PostgreSQL vector databases support hybrid search, which combines structured field filtering, semi-structured field filtering, and two-way retrieval based on vector search and full-text search. This topic applies to V6.0 and V7.0. For V7.0-specific guidance, see Two-way retrieval based on vector search and full-text search in V7.0.

When to use hybrid search

Choose the retrieval method based on your query characteristics:

Method	Best for	Example scenario
Vector search only	Semantic similarity — intent matters more than exact keywords	Finding conceptually related documents
Full-text search only	Exact keyword matching — specific terms must appear	Searching for a specific error message or product code
Vector search + structured filtering	Semantic similarity within a filtered dataset	Finding similar products within a price range
Two-way retrieval (vector + full-text)	Queries where both meaning and exact keywords matter, or when embedding quality is inconsistent	RAG pipelines, e-commerce search, mixed-query workloads

Use two-way retrieval when vector search alone does not achieve the recall rate you need — for example, when embedding models produce poor representations for certain query types, or when your corpus contains keyword-critical terms that must appear verbatim.

How hybrid search works

Approximate nearest neighbor search (ANNS) vector indexes handle only unstructured data. In production environments, you often need to search structured and unstructured data together — for example, finding whether a person appears in a specific location within a time window, where the time range and location are structured data and face images are unstructured.

Most hybrid search implementations store structured data in a database and unstructured data in a vector search system, then intersect results from both. To prevent vector search from returning too few rows after structured filtering, they over-fetch by an amplification factor before applying filters — but high-selectivity filters can still eliminate enough results to fall short of the required top-K count.

AnalyticDB for PostgreSQL takes a different approach: it integrates the FastANN vector search engine directly as an index extension and uses the ranked-biased overlap (RBO) cost estimation rules of the optimizer to generate execution plans that search structured and unstructured data together.

Execution plan categories

Based on whether indexed structured condition columns exist and how selective those conditions are, AnalyticDB for PostgreSQL selects one of three execution plans.

The following example uses a product table to demonstrate all three plans.

Create the table and indexes:

CREATE TABLE products (
    id serial primary key,
    name varchar(256),
    price real,
    inTime timestamp,
    url varchar(256),
    feature real[]
);

-- Set the storage mode of the vector column to PLAIN.
ALTER TABLE products ALTER COLUMN feature SET STORAGE PLAIN;

-- Create a B-tree index for structured columns.
CREATE INDEX ON products(price, intime);

-- Create a vector index for the vector column.
CREATE INDEX ON products USING ann(feature) WITH (dim=512);

-- Collect statistics to generate an optimal execution plan for hybrid search.
ANALYZE products;

Query: find the top 100 items most similar to a given image, priced between USD 100 and USD 200, added in March 2019:

SELECT id, price FROM products WHERE
    price > 100 AND price <=200
    AND inTime > '2019-03-01 00:00:00' AND inTime <= '2019-03-31 00:00:00'
ORDER BY
    feature <-> array[10,2.0,..., 512.0]
LIMIT 100;

Category 1: brute-force query

The optimizer fetches all rows matching the structured conditions, sorts them by vector distance, and returns the closest 100. This guarantees 100% recall but is slow when the structured conditions match a large number of rows.

QUERY PLAN
-----------------------------------------------------------------------------------
 LIMIT
     ->  Gather Motion 3:1  (slice1; segments: 3)
         Merge Key: (l2_squared_distance($0, feature))
             ->  LIMIT
                 ->  Sort
                     Sort Key: (l2_squared_distance($0, feature))
                     ->  Index Scan using products_price_idx on products
                         Index Cond: the filter condition for the price column.
                         Filter: the filter condition for the time column.

 Optimizer: Postgres query optimizer

Category 2: pure vector search and structured filtering

The optimizer queries the vector index for the N most similar rows, then filters by the structured conditions. This is fast, but when structured conditions are highly selective, the final result may contain fewer rows than requested.

QUERY PLAN
-----------------------------------------------------------------------------------
 LIMIT
    ->  Gather Motion 3:1  (slice1; segments: 3)
        Merge Key: ((feature <-> $0))
            ->  LIMIT
                ->  Ann Index Scan using products_feature_idx on products
                    ORDER BY: (feature <-> $0)
                    Filter: the filter condition for the price and time columns.

 Optimizer: Postgres query optimizer

Category 3: hybrid search (vector + structured indexes)

This plan combines both approaches. When structured condition columns are indexed and their index type supports bitmaps, the optimizer generates a bitmap from the structured index and pushes it into the vector index scan. This accelerates filtering while preserving recall.

QUERY PLAN
-----------------------------------------------------------------------------------
 LIMIT
     ->  Gather Motion 3:1  (slice1; segments: 3)
         Merge Key: ((feature <-> $0))
         ->  LIMIT
             ->  Fusion Ann Scan
                 ->  Bitmap Index Scan on products_price_idx
                     Index Cond: the filter condition for the price column.
                     ->  Ann Index Scan with filter using products_feature_idx on products
                         ORDER BY: (feature <-> $0)
                         Filter: the filter condition for the time column.

 Optimizer: Postgres query optimizer

Two execution plan nodes indicate that hybrid search is in use:

Ann Index Scan with filter: pushes filter conditions down to the vector index, combining filtering with the index scan.
Fusion Ann Scan: used when structured condition columns have a supported index. It generates a bitmap from the structured index (left subtree) and pushes the bitmap into the vector index scan (right subtree). The right subtree is either Ann Index Scan (bitmap only) or Ann Index Scan with filter (bitmap plus additional filter conditions).

Hybrid search methods

AnalyticDB for PostgreSQL supports three hybrid search methods:

Method	Index types used	Supported data types
Vector search + structured field filtering	B-tree index	BIGINT, BOOLEAN, BYTEA, CHAR, VARCHAR, INTEGER, FLOAT, DOUBLE, DATE, SMALLINT, TIMESTAMP, SERIAL
Vector search + semi-structured field filtering	GIN index	JSON, JSONB, ARRAY
Two-way retrieval: vector + full-text search	Vector index + GIN index	—

For more information on supported structured data types, see Data types.

The following sections show how to set up semi-structured field filtering and two-way retrieval. Structured field filtering with B-tree indexes is demonstrated in the Execution plan categories section above.

Use semi-structured field filtering

This example uses a stock analysis database where each record contains an article chunk, its release time, the stock IDs it mentions, and an embedding vector.

Table schema

Field	Data type	Description
id	serial	Serial number
chunk	varchar(1024)	A text chunk from a stock analysis article
release_time	timestamp	The article's release time
stock_id_list	char(10)[]	The IDs of stocks mentioned in the article
url	varchar(1024)	The URL of the source article
feature	real[]	The embedding vector of the text chunk

Prerequisites

Before you begin, ensure that you have:

An AnalyticDB for PostgreSQL V6.0 or V7.0 instance
The permission to create tables and indexes

Set up semi-structured field filtering

Create the table.

CREATE TABLE stock_analysis_chunks (
    id serial primary key,
    chunk varchar(1024),
    release_time timestamp,
    stock_id_list integer[],
    url varchar(1024),
    feature real[]
) distributed by (id);

Set the storage mode of the vector column to PLAIN.

ALTER TABLE stock_analysis_chunks ALTER COLUMN feature SET STORAGE PLAIN;

Create a vector index.

CREATE INDEX ON stock_analysis_chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1, hnsw_ef_construction=128);

Create indexes for the structured and semi-structured columns.

-- Create a B-tree index for the structured column.
CREATE INDEX ON stock_analysis_chunks(release_time);
-- Create a GIN index for the semi-structured column.
CREATE INDEX ON stock_analysis_chunks USING gin(stock_id_list);

Collect statistics.
```
ANALYZE stock_analysis_chunks;
```

Run a hybrid search query. The following query returns stock analysis articles published between July 18 and August 18, 2023, that mention all four specified stocks, ranked by semantic similarity to a query vector:

SELECT id, url, cosine_similarity(feature, array[1.0, 2.0, ..., 1536.0]::real[]) as score
FROM stock_analysis_chunks
WHERE release_time >= '2023-07-18 00:00:00' AND release_time <= '2023-08-18 00:00:00' AND stock_id_list @> ARRAY['BABA', 'AAPL', 'MSFT', 'AMZN']
ORDER BY feature <=> array[1.0, 2.0, ..., 1536.0]::real[]
LIMIT 10;

Use two-way retrieval (vector + full-text search)

Two-way retrieval runs vector search and full-text search in parallel, then merges the results. Use this method when vector search alone does not achieve sufficient recall — for example, when embedding models underperform on certain query types or when the corpus contains terms that require exact keyword matching.

Prerequisites

Before you begin, ensure that you have:

An AnalyticDB for PostgreSQL V6.0 or V7.0 instance
The permission to create tables, indexes, and functions

Set up two-way retrieval

Create the table. The to_tsvector column stores precomputed full-text search tokens.

Populate the to_tsvector column when inserting rows using PostgreSQL's to_tsvector() function, for example: to_tsvector('english', text). Keep this column in sync with the text column when updating rows.

-- vector: the vector column. to_tsvector: the full-text tokenizer column.
CREATE TABLE IF NOT EXISTS documents(
                id TEXT,
                docname TEXT,
                title TEXT,
                vector real[],
                text TEXT,
                to_tsvector TSVECTOR);

Set the storage mode of the vector column to PLAIN.

ALTER TABLE documents ALTER COLUMN vector SET STORAGE PLAIN;

Create indexes.

Create a vector index for the vector column.

CREATE INDEX ON documents USING ann(vector) WITH (dim=1024, distancemeasure=cosine, hnsw_m=64, pq_enable=1, hnsw_ef_construction=128);

Create a GIN index for the full-text tokenizer column.
```
CREATE INDEX ON documents USING gin(to_tsvector);
```

Create a helper function to convert query text into a tsquery.

CREATE OR REPLACE FUNCTION public.to_tsquery_from_text(txt text, lang regconfig DEFAULT 'english'::regconfig)
 RETURNS tsquery
 LANGUAGE sql
 IMMUTABLE STRICT
AS $function$ SELECT to_tsquery(lang, COALESCE(string_agg(split_part(word, ':', 1), ' | '), '')) FROM (SELECT UNNEST(STRING_TO_ARRAY(TO_TSVECTOR(lang, txt)::text, ' ')) AS TEXT) AS words_only;$function$

Run a two-way retrieval query. The query uses a common table expression (CTE) to run both searches and merge results. The first branch retrieves the top 10 rows by vector similarity; the second retrieves the top 10 rows by full-text relevance. UNION ALL combines both sets, and the outer SELECT deduplicates by id, taking the higher similarity score for any row found by both methods.

Result columns

Column	Description
`similarity`	The higher score from either the vector or full-text search for this row
`source`	A bitmask indicating which method(s) found the row: `1` = vector only, `2` = full-text only, `3` = both

WITH combined AS (
   (
     SELECT
        id,
        docname,
        text,
        cosine_similarity(densevec,ARRAY{embedding}::real[]) AS similarity,
        1 AS source
     FROM
        documents
     ORDER BY vector <=> ARRAY[10,2.0,…, 1024.0]
     LIMIT 10
   )
   UNION ALL
   (
     SELECT
        id,
        docname,
        text,
        ts_rank(to_tsvector, to_tsquery_from_text('{query.query}', 'zh_cn'), 32) AS similarity,
        2 AS source
     FROM
        documents
     WHERE to_tsvector@@to_tsquery_from_text('{query.query}', 'zh_cn')
     ORDER BY similarity DESC
     LIMIT 10
   )
)
SELECT id, docname, title, text, MAX(similarity) AS similarity,
                BIT_OR(source) AS source
                FROM combined
                GROUP BY id
                ORDER BY similarity DESC;

The query returns at most 20 rows (10 from vector search + 10 from full-text search, deduplicated by id).

Tuning the balance between vector and full-text search

By default, the query gives equal weight to vector similarity scores and full-text relevance scores. To favor one method over the other, multiply the similarity value in either branch before the UNION ALL. For example, to double the weight of full-text results, replace ts_rank(...) AS similarity with ts_rank(...) * 2 AS similarity in the second branch.

AnalyticDB:Dual-source recall with vector search and full-text search

When to use hybrid search

How hybrid search works

Execution plan categories

Category 1: brute-force query

Category 2: pure vector search and structured filtering

Category 3: hybrid search (vector + structured indexes)

Hybrid search methods

Use semi-structured field filtering

Prerequisites

Set up semi-structured field filtering

Use two-way retrieval (vector + full-text search)

Prerequisites

Set up two-way retrieval

What's next