All Products
Search
Document Center

AnalyticDB:Use AnalyticDB for PostgreSQL sparse vectors

Last Updated:Apr 09, 2025

If you want to store vectors whose most elements are zero in an efficient manner, you can create a sparse vector table instead of a dense vector table, to reduce storage space. This topic describes how to use sparse vectors in an AnalyticDB for PostgreSQL vector database. You can create a sparse vector table, create a sparse vector index, use sparse vectors to perform search, and use sparse and dense vectors to perform hybrid search.

Background information

In most cases, vectors are classified into dense vectors and sparse vectors in vector databases. A sparse vector contains a large number of zero values. Among tens of thousands of dimensions, only few dimensions have non-zero values. Sparse vectors can be used to specify various types of data, such as text and images. When you perform keyword-based searches, each sparse vector represents a document, in which each dimension specifies a keyword in a dictionary and each dimension value specifies the importance of a keyword. If the sparse vectors are generated by using the BM25 algorithm, the dimension values contain the number of keyword matches, keyword occurrence frequency, and other text relevance factors. Compared with ordinary arrays or lists, sparse vectors can efficiently store and process high-dimensional data and significantly reduce the usage of storage space and computing resources. The data structure of sparse vectors is suitable for sparse feature scenarios to implement machine learning and natural language processing.

The following figure shows the differences between dense vectors and sparse vectors.

image

Prerequisites

Use sparse vectors

Create a sparse vector table

Syntax

CREATE TABLE <SparseVectorTable> 
(  
    id int PRIMARY KEY,
    description text, 
    ...,
    sparse_vector svector(MAX_DIM), 
)DISTRIBUTED BY (id);

Parameters:

  • SparseVectorTable: the name of the sparse vector table.

  • sparse_vector: the name of the sparse vector column. The values of the column are of the SVECTOR type.

  • MAX_DIM: the maximum number of sparse vector dimensions, which is not the number of dimensions that contain non-zero values.

Example

Create a sparse vector table named svector_test.

-- Create a table that contains a sparse vector column.
CREATE TABLE <svector_test>(  
    id bigint,
    type int,
    tag varchar(10),
    document text,
    sparse_features svector(250000)
) DISTRIBUTED BY (id);

-- Set the storage mode of the sparse vector column to PLAIN.
ALTER TABLE <svector_test> ALTER COLUMN sparse_features SET storage plain;
Note

PLAIN mode: PostgreSQL uses The Oversized-Attribute Storage Technique (TOAST) to store large objects, large field values, or data that cannot be contained within a regular data page. TOAST facilitates efficient storage and access to data. If you set the storage mode of a column to PLAIN, data in the column is not compressed or broken up. Data is stored within a single row and not moved to a TOAST table. In most cases, the PLAIN storage mode is used to process small field values to prevent additional overheads caused by TOAST. You can also use the PLAIN storage mode to process data types that support efficient storage, such as integers and small text fields.

Create a sparse vector index

Syntax

Sparse vector indexes support only the inner product distance metric. Sparse vector indexes do not support the product quantization (PQ) feature or the memory mapping technology. Syntax:

CREATE INDEX <idx_sparse_vector>
ON MyTable USING vector_index (sparse_vector_column);
WITH (DISTANCEMEASURE=IP,
      HNSW_M=$M,
      HNSW_EF_CONSTRUCTION=$EF_CONSTURCTION);

Parameters:

  • idx_sparse_vector: the name of the sparse vector index.

  • DISTANCEMEASURE: the name of the distance metric. To prevent errors, set the value to IP.

  • HNSW_M and HNSW_EF_CONSTRUCTION: For information about the parameters, see Create a vector index.

Example

In this example, the svector_test sparse vector table is used to create sparse vector indexes.

-- Create B-tree indexes for structured fields in hybrid search. If a structured field is of the ARRAY type, you can also create a generalized inverted index (GIN) index for the field.
CREATE INDEX svector_test(type);
CREATE INDEX svector_test(tag);

-- Create a Hierarchical Navigable Small World (HNSW) index for the sparse vector column.
CREATE INDEX ON svector_test USING ANN(sparse_features) WITH(DISTANCEMEASURE=IP,HNSW_M=64,pq_enable=0,external_storage=0);

Representation of sparse vector data

AnalyticDB for PostgreSQL allows you to use the SVECTOR type to represent sparse vector data. The SVECTOR type supports strings in the JSON format. You can use the indices field to specify the index array (non-negative integers) and the values field to specify the value array (floating-point numbers). The following example describes a 20-dimensional sparse vector:

Example of sparse vector data

[0, 0, 1.1, 0, 0, 0, 2.2, 0, 0, 3.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Representation of sparse vector data

{"indices":[2, 6, 9], "values":[1.1, 2.2, 3.3]}

Parameters:

  • indices:[2, 6, 9] specifies that the dimension numbers of non-zero values in the sparse vector are 2, 6, and 9.

  • values:[1.1, 2.2, 3.3] specifies that the non-zero values in the sparse vector are 1.1, 2.2, and 3.3.

In an AnalyticDB for PostgreSQL vector database, you can use svector to convert a string into the SVECTOR type. Sample statement:

postgres=# SELECT '{"indices":[2, 6, 9], "values":[1.1, 2.2, 3.3]}'::svector;
svector
--------------------------------------------
{"indices":[2,6,9],"values":[1.1,2.2,3.3]}
(1 ROW)

Import sparse vector data

You can use the INSERT or COPY statement to import sparse vector data. In this example, the svector_test sparse vector table and the INSERT statement are used to import sparse vector data.

-- Insert sparse vector data.
INSERT INTO svector_test VALUES (1, 1, 'a', 'xxx', '{"indices":[2, 6, 9], "values":[1.1, 2.2, 3.3]}'::svector);
INSERT INTO svector_test VALUES (2, 2, 'b', 'xxx', '{"indices":[50, 100, 200], "values":[2.1, 3.2, 4.3]}'::svector);
INSERT INTO svector_test VALUES (3, 3, 'b', 'xxx', '{"indices":[150, 60, 90], "values":[5, 1e3, 1e-3]}'::svector);

Use sparse vectors to perform search

Sparse vector search can be divided into exact search and approximate search.

  • Exact search: a brute-force search that strictly sorts results by similarity distance. This method requires comparing every vector, which results in a low retrieval speed but can achieve a recall rate of up to 100%. It is suitable for scenarios where precision is critical but timeliness is not a priority.

  • Approximate search: a method that uses optimized algorithms to quickly find approximate results. Although the precision is somewhat reduced, the retrieval speed is significantly faster. This method is suitable for scenarios where timeliness is crucial but absolute precision is less important.

Syntax

For sparse vector tables, sparse vector indexes support only the inner product distance metric. Syntax of exact search and approximate search:

Exact search

SELECT id, inner_product(<sparse_vector>, '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector) 
AS score FROM <SparseVectorTable> ORDER BY negative_inner_product(<sparse_vector>, 
'{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector) LIMIT <topk>;

Parameters:

  • sparse_vector: the name of the sparse vector column.

  • SparseVectorTable: the name of the sparse vector table.

  • topk: the top k elements from the result set that you want to obtain.

Approximate search

SELECT id, inner_product(<sparse_vector>, '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector) 
AS score FROM <SparseVectorTable> ORDER BY <sparse_vector>
'{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector LIMIT <topk>;

Parameters:

  • sparse_vector: the name of the sparse vector column.

  • SparseVectorTable: the name of the sparse vector table.

  • topk: the top k elements from the result set that you want to obtain.

Example

In this example, the svector_test sparse vector table is used to perform sparse vector search.

Sparse vector search

SELECT id, sparse_features, 
'{"indices":[2,60,50], "values":[2,0.01, 0.02]}' <#> sparse_features AS score 
FROM svector_test
ORDER BY score
LIMIT 3;

Search result

   id   |                 sparse_features                 |        score
--------+-------------------------------------------------+---------------------
      3 | {"indices":[60,90,150],"values":[1000,0.001,5]} |                 -10
      1 | {"indices":[2,6,9],"values":[1.1,2.2,3.3]}      |   -2.20000004768372
      2 | {"indices":[50,100,200],"values":[2.1,3.2,4.3]} | -0.0419999957084656
(3 ROWS)
Note

You can use the fastann.sparse_hnsw_max_scan_points and fastann.sparse_hnsw_ef_search parameters to modify the recall rate of sparse vector search. For more information, see the "Appendixes" section of this topic.

Use dense and sparse vectors to perform hybrid search

Create a table named HYBRID_SEARCH_TEST that contains dense and sparse vectors to perform hybrid search. Sample statements for creating the table:

-- Create a table for hybrid search.
CREATE TABLE IF NOT EXISTS HYBRID_SEARCH_TEST (
   id integer PRIMARY key,
   corpus text,
   filter integer,
   dense_vector float4[],
   sparse_vector svector(250003)
) distributed BY (id);

-- Set the storage mode of the sparse vector column to PLAIN.
ALTER TABLE hybrid_search_test ALTER COLUMN dense_vector SET storage plain;
ALTER TABLE hybrid_search_test ALTER COLUMN sparse_vector SET storage plain;

Sample statements for creating indexes:

-- Create a structured index.
CREATE INDEX ON hybrid_search_test(FILTER);

-- Create a dense vector index.
CREATE INDEX ON hybrid_search_test USING ANN(dense_vector) WITH(DISTANCEMEASURE=IP, DIM=1024, HNSW_M=64, HNSW_EF_CONSTRUCTION=600, EXTERNAL_STORAGE=1);

-- Create a sparse vector index.
CREATE INDEX ON hybrid_search_test USING ANN(sparse_vector) WITH(DISTANCEMEASURE=IP, HNSW_M=64, HNSW_EF_CONSTRUCTION=600);

Sample statements for performing dense vector search, sparse vector search, and hybrid search:

-- Dense vector search + filter
SELECT id, corpus FROM hybrid_search_test WHERE FILTER IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900) ORDER BY dense_vector <#> ARRAY[1,2,3,...,1024]::float4[] LIMIT 100;

-- Sparse vector search + filter
SELECT id, corpus FROM hybrid_search_test WHERE FILTER IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900) ORDER BY sparse_vector <#> '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector LIMIT 100;
  
-- Hybrid search + filter
WITH combined AS (
(SELECT id, corpus, inner_product_distance(dense_vector, ARRAY[1,2,3,...,1024]::float4[]) AS dist1, sparse_vector <#> '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector AS dist2 FROM hybrid_search_test WHERE FILTER IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900) ORDER BY dist2 LIMIT 100)
  UNION ALL
(SELECT id, corpus, inner_product(sparse_vector, '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector) AS dist1, dense_vector <#> ARRAY[1,2,3,...,1024]::float4[] AS dist2 FROM hybrid_search_test WHERE FILTER IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900) ORDER BY dist2 LIMIT 100 )
) SELECT DISTINCT first_value(id) OVER (PARTITION BY id) AS id,  first_value(corpus) OVER (PARTITION BY id) AS corpus, MAX(dist1 - dist2) OVER (PARTITION BY id) AS dist FROM combined ORDER BY dist DESC LIMIT 100;

Appendixes

Engine parameters related to sparse vector search

Engine parameter related to sparse vector search

Description

Default value

Valid value

fastann.sparse_hnsw_max_scan_points

The maximum number of scan points when you use HNSW indexes to perform sparse vector search. This parameter terminates search in advance. You can use this parameter to test the recall rate.

8000

[1, 8000000]

fastann.sparse_hnsw_ef_search

The size of search candidate sets when you use HNSW indexes to perform sparse vector search. You can use this parameter to test the recall rate.

400

[10, 10000]

Note

You can configure the preceding parameters for sessions.

Supported vector functions for sparse vectors

Purpose

Vector function

Data type of the return value

Description

Supported data type

l2_distance

DOUBLE PRECISION

Calculates the rooted Euclidean distance between two sparse vectors. This function is used to measure the distance between two sparse vectors.

SVECTOR

Calculation

inner_product

DOUBLE PRECISION

Calculates the inner product distance between two vectors, which is equivalent to cosine similarity after vector normalization. This function is used to replace cosine similarity after vector normalization.

SVECTOR

dp_distance

DOUBLE PRECISION

Calculates the dot product distance, which is the same as the inner product distance.

SVECTOR

cosine_similarity

DOUBLE PRECISION

Calculates the cosine similarity between two sparse vectors. This function is used to measure the similarity between two sparse vectors based on the direction regardless of the actual length of the sparse vectors. Valid values of the cosine similarity: -1 to 1.

SVECTOR

svector_add

SVECTOR

Calculates the sum of two sparse vectors.

SVECTOR

svector_sub

SVECTOR

Calculates the difference between two sparse vectors.

SVECTOR

svector_mul

SVECTOR

Calculates the product of two sparse vectors.

SVECTOR

svector_norm

DOUBLE PRECISION

Calculates the norm of a sparse vector.

SVECTOR

svector_angle

DOUBLE PRECISION

Calculates the angle between two sparse vectors.

SVECTOR

Sorting

l2_squared_distance

DOUBLE PRECISION

Sorts vectors based on the squared Euclidean distance. This function requires fewer calculation than the sorting method based on the rooted Euclidean distance.

SVECTOR

negative_inner_product

DOUBLE PRECISION

Sorts vectors based on the negative inner product distance. This function is used to sort vectors in descending order based on the inner product distance.

SVECTOR

cosine_distance

DOUBLE PRECISION

Sorts vectors based on the cosine distance. This function is used to sort vectors in descending order based on the cosine similarity. Valid values of the cosine distance: 0 to 2.

SVECTOR

Examples of vector functions

-- Euclidean distance
SELECT l2_distance('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Inner product distance
SELECT inner_product('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Cosine similarity
SELECT l2_distance('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Vector sum
SELECT svector_add('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Vector subtraction
SELECT svector_sub('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Vector multiplication
SELECT svector_mul('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Squared Euclidean distance
SELECT l2_squared_distance('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Negative inner product distance
SELECT negative_inner_product('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Cosine distance
SELECT cosine_distance('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);

Supported vector operators for sparse vectors

Operator

Calculation

Sorting

Supported data type

<->

Calculates the squared Euclidean distance between two vectors. The result is equivalent to the result of the l2_squared_distance function.

None

SVECTOR

<#>

Calculates the negative inner product distance between two vectors. The result is equivalent to the result of the negative_inner_product function.

Sorts vectors in descending order based on the dot product distance.

SVECTOR

<#>

Calculates the cosine distance between two vectors. The result is equivalent to the result of the cosine_distance function.

None

SVECTOR

+

Calculates the sum of two sparse vectors.

None

SVECTOR

-

Calculates the difference between two sparse vectors.

None

SVECTOR

*

Calculates the product of two sparse vectors.

None

SVECTOR

Examples of vector operators

-- Squared Euclidean distance
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector <-> '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS score;

-- Negative inner product distance
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector <#> '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS score;

-- Cosine distance
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector <=> '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS score;

-- Sum of two vectors
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector + '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS value;

-- Difference between two vectors
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector + '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS value;

-- Product of two vectors
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector * '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS value;