All Products
Search
Document Center

AnalyticDB:Use AnalyticDB for PostgreSQL sparse vectors

Last Updated:Mar 28, 2026

Sparse vectors store only non-zero values, making them efficient for high-dimensional data where most dimensions are zero — such as keyword-based text representations generated by BM25. This guide covers how to create sparse vector tables and indexes, import data, run exact and approximate searches, and combine sparse with dense vectors for hybrid search.

When to use sparse vectors

Sparse vectorsDense vectors
Data representationMost values are zero; only a few dimensions are non-zeroAll dimensions have non-zero values
Computational efficiencyHigher, especially for zero-element operationsLower, since all elements are processed
Information densityFocuses on key featuresCaptures nuanced semantic relationships
Typical use casesKeyword search (BM25), hybrid searchSemantic search, RAG, general machine learning

Use sparse vectors when your data is naturally sparse — for example, when each document is represented as a term-frequency vector over a large vocabulary. Most dimensions will be zero, and only the terms that appear in the document have non-zero values.

Prerequisites

Before you begin, ensure that you have:

How it works

Sparse vectors in AnalyticDB for PostgreSQL use the SVECTOR (Sparse Vector) data type. Instead of storing all dimensions, SVECTOR stores only the non-zero dimensions using a JSON format with two fields:

  • indices: the positions (non-negative integers) of non-zero values

  • values: the corresponding floating-point values

For example, the 20-dimensional vector [0, 0, 1.1, 0, 0, 0, 2.2, 0, 0, 3.3, 0, ...] is stored as:

{"indices":[2, 6, 9], "values":[1.1, 2.2, 3.3]}

The figure below shows the structural difference between dense and sparse vectors.

image

Use sparse vectors

The typical workflow has four steps: create a table, create indexes, import data, and run searches.

Step 1: Create a sparse vector table

CREATE TABLE <table_name> (
    id int PRIMARY KEY,
    description text,
    sparse_vector svector(<MAX_DIM>)
) DISTRIBUTED BY (id);

-- Set the storage mode of the sparse vector column to PLAIN.
ALTER TABLE <table_name> ALTER COLUMN sparse_vector SET STORAGE PLAIN;

Parameters:

ParameterDescription
<table_name>Name of the sparse vector table
<MAX_DIM>Maximum number of dimensions, not the count of non-zero values

Why PLAIN storage? PostgreSQL uses The Oversized-Attribute Storage Technique (TOAST) to store large field values. PLAIN mode disables TOAST compression and splitting, keeping each vector in a single row without being moved to an external table. This avoids the overhead that TOAST introduces for sparse vector access patterns.

Example:

-- Create a table with a sparse vector column.
CREATE TABLE svector_test (
    id bigint,
    type int,
    tag varchar(10),
    document text,
    sparse_features svector(250000)
) DISTRIBUTED BY (id);

-- Set PLAIN storage to prevent TOAST overhead.
ALTER TABLE svector_test ALTER COLUMN sparse_features SET STORAGE PLAIN;

Step 2: Create indexes

Sparse vector indexes have two constraints:

  • Distance metric: inner product (IP) only — do not use other metrics to avoid errors

  • No support for product quantization (PQ) or memory mapping

-- Create B-tree indexes for structured columns used in filtered search.
CREATE INDEX ON svector_test (type);
CREATE INDEX ON svector_test (tag);

-- Create an HNSW (Hierarchical Navigable Small World) index on the sparse vector column.
-- DISTANCEMEASURE must be IP (inner product). Using other metrics causes errors.
-- pq_enable=0 and external_storage=0 are required for sparse vectors.
CREATE INDEX ON svector_test USING ANN (sparse_features)
WITH (DISTANCEMEASURE=IP, HNSW_M=64, pq_enable=0, external_storage=0);

For HNSW_M and HNSW_EF_CONSTRUCTION tuning, see Create a vector index.

Step 3: Import sparse vector data

Use INSERT or COPY to load data. Cast the JSON string to svector using ::svector.

-- Insert rows with sparse vector values.
INSERT INTO svector_test VALUES (1, 1, 'a', 'document one',
    '{"indices":[2, 6, 9], "values":[1.1, 2.2, 3.3]}'::svector);
INSERT INTO svector_test VALUES (2, 2, 'b', 'document two',
    '{"indices":[50, 100, 200], "values":[2.1, 3.2, 4.3]}'::svector);
INSERT INTO svector_test VALUES (3, 3, 'b', 'document three',
    '{"indices":[150, 60, 90], "values":[5, 1e3, 1e-3]}'::svector);

To verify the cast:

SELECT '{"indices":[2, 6, 9], "values":[1.1, 2.2, 3.3]}'::svector;
-- Output: {"indices":[2,6,9],"values":[1.1,2.2,3.3]}

Step 4: Search sparse vectors

Choose between exact search and approximate search based on your precision and latency requirements:

Exact searchApproximate search
AlgorithmBrute-force comparison of all vectorsHNSW index traversal
Recall rateUp to 100%Somewhat reduced
SpeedSlowerFaster
Use whenPrecision is critical; dataset is smallLow latency is required; some precision loss is acceptable

Exact search

Exact search compares every vector in the table. Use negative_inner_product in the ORDER BY clause to rank results by descending inner product score.

SELECT id,
       inner_product(sparse_features,
           '{"indices":[2,60,50], "values":[2, 0.01, 0.02]}'::svector) AS score
FROM svector_test
ORDER BY negative_inner_product(sparse_features,
    '{"indices":[2,60,50], "values":[2, 0.01, 0.02]}'::svector)
LIMIT 3;

Approximate search

Approximate search uses the HNSW index. Replace negative_inner_product in ORDER BY with the <#> operator to trigger index-based retrieval.

SELECT id,
       sparse_features,
       '{"indices":[2,60,50], "values":[2, 0.01, 0.02]}' <#> sparse_features AS score
FROM svector_test
ORDER BY score
LIMIT 3;

Example output:

 id |                 sparse_features                 |        score
----+-------------------------------------------------+---------------------
  3 | {"indices":[60,90,150],"values":[1000,0.001,5]} |                 -10
  1 | {"indices":[2,6,9],"values":[1.1,2.2,3.3]}      |   -2.20000004768372
  2 | {"indices":[50,100,200],"values":[2.1,3.2,4.3]} | -0.0419999957084656
(3 rows)
Note

Adjust recall rate for approximate search using fastann.sparse_hnsw_max_scan_points and fastann.sparse_hnsw_ef_search. See Engine parameters for details.

Hybrid search

Hybrid search combines sparse and dense vectors to improve retrieval quality — sparse vectors capture keyword relevance while dense vectors capture semantic meaning.

Create the hybrid search table and indexes

-- Create a table with both dense and sparse vector columns.
CREATE TABLE IF NOT EXISTS hybrid_search_test (
    id integer PRIMARY KEY,
    corpus text,
    filter integer,
    dense_vector float4[],
    sparse_vector svector(250003)
) DISTRIBUTED BY (id);

-- Set PLAIN storage for both vector columns.
ALTER TABLE hybrid_search_test ALTER COLUMN dense_vector SET STORAGE PLAIN;
ALTER TABLE hybrid_search_test ALTER COLUMN sparse_vector SET STORAGE PLAIN;
-- Structured index for filter pushdown.
CREATE INDEX ON hybrid_search_test (filter);

-- Dense vector HNSW index (DIM=1024, inner product, external storage enabled).
CREATE INDEX ON hybrid_search_test USING ANN (dense_vector)
WITH (DISTANCEMEASURE=IP, DIM=1024, HNSW_M=64, HNSW_EF_CONSTRUCTION=600, EXTERNAL_STORAGE=1);

-- Sparse vector HNSW index (inner product only, no external storage).
CREATE INDEX ON hybrid_search_test USING ANN (sparse_vector)
WITH (DISTANCEMEASURE=IP, HNSW_M=64, HNSW_EF_CONSTRUCTION=600);

Run search queries

All three examples below use FILTER IN (...) for pre-filtering results by a structured column before vector ranking.

-- Dense vector search with filter.
SELECT id, corpus
FROM hybrid_search_test
WHERE filter IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900)
ORDER BY dense_vector <#> ARRAY[1,2,3,...,1024]::float4[]
LIMIT 100;

-- Sparse vector search with filter.
SELECT id, corpus
FROM hybrid_search_test
WHERE filter IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900)
ORDER BY sparse_vector <#> '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector
LIMIT 100;

-- Hybrid search with filter.
-- Strategy: retrieve the top 100 candidates from each index independently,
-- then re-rank by combining both distance scores (dist1 - dist2).
-- This avoids score-scale incompatibility between sparse and dense distances.
-- Using MAX(dist1 - dist2) per id deduplicates candidates from the UNION ALL.
WITH combined AS (
    -- Branch 1: top 100 by sparse score.
    -- dist1 = dense distance (secondary signal for re-ranking).
    -- dist2 = sparse distance (primary retrieval signal for this branch).
    (SELECT id, corpus,
            inner_product_distance(dense_vector, ARRAY[1,2,3,...,1024]::float4[]) AS dist1,
            sparse_vector <#> '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector AS dist2
     FROM hybrid_search_test
     WHERE filter IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900)
     ORDER BY dist2 LIMIT 100)
    UNION ALL
    -- Branch 2: top 100 by dense score.
    -- dist1 = sparse distance (secondary signal for re-ranking).
    -- dist2 = dense distance (primary retrieval signal for this branch).
    (SELECT id, corpus,
            inner_product(sparse_vector, '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector) AS dist1,
            dense_vector <#> ARRAY[1,2,3,...,1024]::float4[] AS dist2
     FROM hybrid_search_test
     WHERE filter IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900)
     ORDER BY dist2 LIMIT 100)
)
SELECT DISTINCT
    first_value(id) OVER (PARTITION BY id) AS id,
    first_value(corpus) OVER (PARTITION BY id) AS corpus,
    MAX(dist1 - dist2) OVER (PARTITION BY id) AS dist
FROM combined
ORDER BY dist DESC
LIMIT 100;

Appendix

Engine parameters

Both parameters apply per session. Use them to tune the trade-off between recall rate and query speed for approximate sparse vector search.

ParameterDescriptionDefaultValid range
fastann.sparse_hnsw_max_scan_pointsMaximum number of scan points during HNSW sparse vector search. Lower values terminate search earlier, trading recall for speed.8000[1, 8000000]
fastann.sparse_hnsw_ef_searchSize of the search candidate set during HNSW traversal. Higher values increase recall but slow down queries.400[10, 10000]

Supported functions

All functions below operate on the SVECTOR data type.

FunctionReturn typeDescription
l2_distanceDOUBLE PRECISIONEuclidean (L2) distance between two sparse vectors
inner_productDOUBLE PRECISIONInner product; equivalent to cosine similarity after vector normalization
dp_distanceDOUBLE PRECISIONDot product distance; same as inner_product
cosine_similarityDOUBLE PRECISIONCosine similarity; range: -1 to 1
svector_addSVECTORElement-wise sum of two sparse vectors
svector_subSVECTORElement-wise difference between two sparse vectors
svector_mulSVECTORElement-wise product of two sparse vectors
svector_normDOUBLE PRECISIONL2 norm of a sparse vector
svector_angleDOUBLE PRECISIONAngle between two sparse vectors
l2_squared_distanceDOUBLE PRECISIONSquared Euclidean distance; faster than l2_distance for sorting
negative_inner_productDOUBLE PRECISIONNegative inner product; use in ORDER BY to sort by descending inner product
cosine_distanceDOUBLE PRECISIONCosine distance; range: 0 to 2

Examples:

-- Euclidean distance
SELECT l2_distance(
    '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector,
    '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

-- Inner product
SELECT inner_product(
    '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector,
    '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

-- Cosine similarity
SELECT cosine_similarity(
    '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector,
    '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

-- Vector arithmetic
SELECT svector_add(
    '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector,
    '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

SELECT svector_sub(
    '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector,
    '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

SELECT svector_mul(
    '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector,
    '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

-- Sorting functions
SELECT l2_squared_distance(
    '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector,
    '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

SELECT negative_inner_product(
    '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector,
    '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

SELECT cosine_distance(
    '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector,
    '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

Supported operators

All operators below work with the SVECTOR type.

OperatorCalculationSorting behavior
<->Squared Euclidean distance (equivalent to l2_squared_distance)Not supported
<#>Negative inner product (equivalent to negative_inner_product)Sorts in descending order by dot product
<=>Cosine distance (equivalent to cosine_distance)Not supported
+Element-wise sumNot supported
-Element-wise differenceNot supported
*Element-wise productNot supported

Examples:

-- Squared Euclidean distance
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector
    <-> '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS score;

-- Negative inner product (use in ORDER BY for approximate search)
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector
    <#> '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS score;

-- Cosine distance
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector
    <=> '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS score;

-- Vector arithmetic
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector
    + '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS value;

SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector
    - '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS value;

SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector
    * '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS value;