If you want to store vectors whose most elements are zero in an efficient manner, you can create a sparse vector table instead of a dense vector table, to reduce storage space. This topic describes how to use sparse vectors in an AnalyticDB for PostgreSQL vector database. You can create a sparse vector table, create a sparse vector index, use sparse vectors to perform search, and use sparse and dense vectors to perform hybrid search.
Background information
In most cases, vectors are classified into dense vectors and sparse vectors in vector databases. A sparse vector contains a large number of zero values. Among tens of thousands of dimensions, only few dimensions have non-zero values. Sparse vectors can be used to specify various types of data, such as text and images. When you perform keyword-based searches, each sparse vector represents a document, in which each dimension specifies a keyword in a dictionary and each dimension value specifies the importance of a keyword. If the sparse vectors are generated by using the BM25 algorithm, the dimension values contain the number of keyword matches, keyword occurrence frequency, and other text relevance factors. Compared with ordinary arrays or lists, sparse vectors can efficiently store and process high-dimensional data and significantly reduce the usage of storage space and computing resources. The data structure of sparse vectors is suitable for sparse feature scenarios to implement machine learning and natural language processing.
The following figure shows the differences between dense vectors and sparse vectors.

Prerequisites
An AnalyticDB for PostgreSQL instance of V6.6.2.3 or later is created. For more information, see Create an instance.
The vector search engine optimization feature is enabled for the AnalyticDB for PostgreSQL instance. For more information, see Enable or disable vector search engine optimization.
Use sparse vectors
Create a sparse vector table
Syntax
CREATE TABLE <SparseVectorTable>
(
id int PRIMARY KEY,
description text,
...,
sparse_vector svector(MAX_DIM),
)DISTRIBUTED BY (id);Parameters:
SparseVectorTable: the name of the sparse vector table.
sparse_vector: the name of the sparse vector column. The values of the column are of the SVECTOR type.
MAX_DIM: the maximum number of sparse vector dimensions, which is not the number of dimensions that contain non-zero values.
Example
Create a sparse vector table named svector_test.
-- Create a table that contains a sparse vector column.
CREATE TABLE <svector_test>(
id bigint,
type int,
tag varchar(10),
document text,
sparse_features svector(250000)
) DISTRIBUTED BY (id);
-- Set the storage mode of the sparse vector column to PLAIN.
ALTER TABLE <svector_test> ALTER COLUMN sparse_features SET storage plain;PLAIN mode: PostgreSQL uses The Oversized-Attribute Storage Technique (TOAST) to store large objects, large field values, or data that cannot be contained within a regular data page. TOAST facilitates efficient storage and access to data. If you set the storage mode of a column to PLAIN, data in the column is not compressed or broken up. Data is stored within a single row and not moved to a TOAST table. In most cases, the PLAIN storage mode is used to process small field values to prevent additional overheads caused by TOAST. You can also use the PLAIN storage mode to process data types that support efficient storage, such as integers and small text fields.
Create a sparse vector index
Syntax
Sparse vector indexes support only the inner product distance metric. Sparse vector indexes do not support the product quantization (PQ) feature or the memory mapping technology. Syntax:
CREATE INDEX <idx_sparse_vector>
ON MyTable USING vector_index (sparse_vector_column);
WITH (DISTANCEMEASURE=IP,
HNSW_M=$M,
HNSW_EF_CONSTRUCTION=$EF_CONSTURCTION);Parameters:
idx_sparse_vector: the name of the sparse vector index.
DISTANCEMEASURE: the name of the distance metric. To prevent errors, set the value to IP.
HNSW_M and HNSW_EF_CONSTRUCTION: For information about the parameters, see Create a vector index.
Example
In this example, the svector_test sparse vector table is used to create sparse vector indexes.
-- Create B-tree indexes for structured fields in hybrid search. If a structured field is of the ARRAY type, you can also create a generalized inverted index (GIN) index for the field.
CREATE INDEX svector_test(type);
CREATE INDEX svector_test(tag);
-- Create a Hierarchical Navigable Small World (HNSW) index for the sparse vector column.
CREATE INDEX ON svector_test USING ANN(sparse_features) WITH(DISTANCEMEASURE=IP,HNSW_M=64,pq_enable=0,external_storage=0);Representation of sparse vector data
AnalyticDB for PostgreSQL allows you to use the SVECTOR type to represent sparse vector data. The SVECTOR type supports strings in the JSON format. You can use the indices field to specify the index array (non-negative integers) and the values field to specify the value array (floating-point numbers). The following example describes a 20-dimensional sparse vector:
Example of sparse vector data
[0, 0, 1.1, 0, 0, 0, 2.2, 0, 0, 3.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]Representation of sparse vector data
{"indices":[2, 6, 9], "values":[1.1, 2.2, 3.3]}Parameters:
indices:[2, 6, 9] specifies that the dimension numbers of non-zero values in the sparse vector are 2, 6, and 9.
values:[1.1, 2.2, 3.3] specifies that the non-zero values in the sparse vector are 1.1, 2.2, and 3.3.
In an AnalyticDB for PostgreSQL vector database, you can use svector to convert a string into the SVECTOR type. Sample statement:
postgres=# SELECT '{"indices":[2, 6, 9], "values":[1.1, 2.2, 3.3]}'::svector;
svector
--------------------------------------------
{"indices":[2,6,9],"values":[1.1,2.2,3.3]}
(1 ROW)Import sparse vector data
You can use the INSERT or COPY statement to import sparse vector data. In this example, the svector_test sparse vector table and the INSERT statement are used to import sparse vector data.
-- Insert sparse vector data.
INSERT INTO svector_test VALUES (1, 1, 'a', 'xxx', '{"indices":[2, 6, 9], "values":[1.1, 2.2, 3.3]}'::svector);
INSERT INTO svector_test VALUES (2, 2, 'b', 'xxx', '{"indices":[50, 100, 200], "values":[2.1, 3.2, 4.3]}'::svector);
INSERT INTO svector_test VALUES (3, 3, 'b', 'xxx', '{"indices":[150, 60, 90], "values":[5, 1e3, 1e-3]}'::svector);Use sparse vectors to perform search
Sparse vector search can be divided into exact search and approximate search.
Exact search: a brute-force search that strictly sorts results by similarity distance. This method requires comparing every vector, which results in a low retrieval speed but can achieve a recall rate of up to 100%. It is suitable for scenarios where precision is critical but timeliness is not a priority.
Approximate search: a method that uses optimized algorithms to quickly find approximate results. Although the precision is somewhat reduced, the retrieval speed is significantly faster. This method is suitable for scenarios where timeliness is crucial but absolute precision is less important.
Syntax
For sparse vector tables, sparse vector indexes support only the inner product distance metric. Syntax of exact search and approximate search:
Exact search
SELECT id, inner_product(<sparse_vector>, '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector)
AS score FROM <SparseVectorTable> ORDER BY negative_inner_product(<sparse_vector>,
'{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector) LIMIT <topk>;Parameters:
sparse_vector: the name of the sparse vector column.
SparseVectorTable: the name of the sparse vector table.
topk: the top k elements from the result set that you want to obtain.
Approximate search
SELECT id, inner_product(<sparse_vector>, '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector)
AS score FROM <SparseVectorTable> ORDER BY <sparse_vector>
'{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector LIMIT <topk>;Parameters:
sparse_vector: the name of the sparse vector column.
SparseVectorTable: the name of the sparse vector table.
topk: the top k elements from the result set that you want to obtain.
Example
In this example, the svector_test sparse vector table is used to perform sparse vector search.
Sparse vector search
SELECT id, sparse_features,
'{"indices":[2,60,50], "values":[2,0.01, 0.02]}' <#> sparse_features AS score
FROM svector_test
ORDER BY score
LIMIT 3;Search result
id | sparse_features | score
--------+-------------------------------------------------+---------------------
3 | {"indices":[60,90,150],"values":[1000,0.001,5]} | -10
1 | {"indices":[2,6,9],"values":[1.1,2.2,3.3]} | -2.20000004768372
2 | {"indices":[50,100,200],"values":[2.1,3.2,4.3]} | -0.0419999957084656
(3 ROWS)You can use the fastann.sparse_hnsw_max_scan_points and fastann.sparse_hnsw_ef_search parameters to modify the recall rate of sparse vector search. For more information, see the "Appendixes" section of this topic.
Use dense and sparse vectors to perform hybrid search
Create a table named HYBRID_SEARCH_TEST that contains dense and sparse vectors to perform hybrid search. Sample statements for creating the table:
-- Create a table for hybrid search.
CREATE TABLE IF NOT EXISTS HYBRID_SEARCH_TEST (
id integer PRIMARY key,
corpus text,
filter integer,
dense_vector float4[],
sparse_vector svector(250003)
) distributed BY (id);
-- Set the storage mode of the sparse vector column to PLAIN.
ALTER TABLE hybrid_search_test ALTER COLUMN dense_vector SET storage plain;
ALTER TABLE hybrid_search_test ALTER COLUMN sparse_vector SET storage plain;Sample statements for creating indexes:
-- Create a structured index.
CREATE INDEX ON hybrid_search_test(FILTER);
-- Create a dense vector index.
CREATE INDEX ON hybrid_search_test USING ANN(dense_vector) WITH(DISTANCEMEASURE=IP, DIM=1024, HNSW_M=64, HNSW_EF_CONSTRUCTION=600, EXTERNAL_STORAGE=1);
-- Create a sparse vector index.
CREATE INDEX ON hybrid_search_test USING ANN(sparse_vector) WITH(DISTANCEMEASURE=IP, HNSW_M=64, HNSW_EF_CONSTRUCTION=600);Sample statements for performing dense vector search, sparse vector search, and hybrid search:
-- Dense vector search + filter
SELECT id, corpus FROM hybrid_search_test WHERE FILTER IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900) ORDER BY dense_vector <#> ARRAY[1,2,3,...,1024]::float4[] LIMIT 100;
-- Sparse vector search + filter
SELECT id, corpus FROM hybrid_search_test WHERE FILTER IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900) ORDER BY sparse_vector <#> '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector LIMIT 100;
-- Hybrid search + filter
WITH combined AS (
(SELECT id, corpus, inner_product_distance(dense_vector, ARRAY[1,2,3,...,1024]::float4[]) AS dist1, sparse_vector <#> '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector AS dist2 FROM hybrid_search_test WHERE FILTER IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900) ORDER BY dist2 LIMIT 100)
UNION ALL
(SELECT id, corpus, inner_product(sparse_vector, '{"indices":[1,2,..], "values":[1.1,2.2...]}'::svector) AS dist1, dense_vector <#> ARRAY[1,2,3,...,1024]::float4[] AS dist2 FROM hybrid_search_test WHERE FILTER IN (0, 100, 200, 300, 400, 500, 600, 700, 800, 900) ORDER BY dist2 LIMIT 100 )
) SELECT DISTINCT first_value(id) OVER (PARTITION BY id) AS id, first_value(corpus) OVER (PARTITION BY id) AS corpus, MAX(dist1 - dist2) OVER (PARTITION BY id) AS dist FROM combined ORDER BY dist DESC LIMIT 100;Appendixes
Engine parameters related to sparse vector search
Engine parameter related to sparse vector search | Description | Default value | Valid value |
fastann.sparse_hnsw_max_scan_points | The maximum number of scan points when you use HNSW indexes to perform sparse vector search. This parameter terminates search in advance. You can use this parameter to test the recall rate. | 8000 | [1, 8000000] |
fastann.sparse_hnsw_ef_search | The size of search candidate sets when you use HNSW indexes to perform sparse vector search. You can use this parameter to test the recall rate. | 400 | [10, 10000] |
You can configure the preceding parameters for sessions.
Supported vector functions for sparse vectors
Purpose | Vector function | Data type of the return value | Description | Supported data type |
l2_distance | DOUBLE PRECISION | Calculates the rooted Euclidean distance between two sparse vectors. This function is used to measure the distance between two sparse vectors. | SVECTOR | |
Calculation | inner_product | DOUBLE PRECISION | Calculates the inner product distance between two vectors, which is equivalent to cosine similarity after vector normalization. This function is used to replace cosine similarity after vector normalization. | SVECTOR |
dp_distance | DOUBLE PRECISION | Calculates the dot product distance, which is the same as the inner product distance. | SVECTOR | |
cosine_similarity | DOUBLE PRECISION | Calculates the cosine similarity between two sparse vectors. This function is used to measure the similarity between two sparse vectors based on the direction regardless of the actual length of the sparse vectors. Valid values of the cosine similarity: -1 to 1. | SVECTOR | |
svector_add | SVECTOR | Calculates the sum of two sparse vectors. | SVECTOR | |
svector_sub | SVECTOR | Calculates the difference between two sparse vectors. | SVECTOR | |
svector_mul | SVECTOR | Calculates the product of two sparse vectors. | SVECTOR | |
svector_norm | DOUBLE PRECISION | Calculates the norm of a sparse vector. | SVECTOR | |
svector_angle | DOUBLE PRECISION | Calculates the angle between two sparse vectors. | SVECTOR | |
Sorting | l2_squared_distance | DOUBLE PRECISION | Sorts vectors based on the squared Euclidean distance. This function requires fewer calculation than the sorting method based on the rooted Euclidean distance. | SVECTOR |
negative_inner_product | DOUBLE PRECISION | Sorts vectors based on the negative inner product distance. This function is used to sort vectors in descending order based on the inner product distance. | SVECTOR | |
cosine_distance | DOUBLE PRECISION | Sorts vectors based on the cosine distance. This function is used to sort vectors in descending order based on the cosine similarity. Valid values of the cosine distance: 0 to 2. | SVECTOR |
Examples of vector functions
-- Euclidean distance
SELECT l2_distance('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Inner product distance
SELECT inner_product('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Cosine similarity
SELECT l2_distance('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Vector sum
SELECT svector_add('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Vector subtraction
SELECT svector_sub('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Vector multiplication
SELECT svector_mul('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Squared Euclidean distance
SELECT l2_squared_distance('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Negative inner product distance
SELECT negative_inner_product('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);
-- Cosine distance
SELECT cosine_distance('{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector, '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector);Supported vector operators for sparse vectors
Operator | Calculation | Sorting | Supported data type |
<-> | Calculates the squared Euclidean distance between two vectors. The result is equivalent to the result of the l2_squared_distance function. | None | SVECTOR |
<#> | Calculates the negative inner product distance between two vectors. The result is equivalent to the result of the negative_inner_product function. | Sorts vectors in descending order based on the dot product distance. | SVECTOR |
<#> | Calculates the cosine distance between two vectors. The result is equivalent to the result of the cosine_distance function. | None | SVECTOR |
+ | Calculates the sum of two sparse vectors. | None | SVECTOR |
- | Calculates the difference between two sparse vectors. | None | SVECTOR |
* | Calculates the product of two sparse vectors. | None | SVECTOR |
Examples of vector operators
-- Squared Euclidean distance
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector <-> '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS score;
-- Negative inner product distance
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector <#> '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS score;
-- Cosine distance
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector <=> '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS score;
-- Sum of two vectors
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector + '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS value;
-- Difference between two vectors
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector + '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS value;
-- Product of two vectors
SELECT '{"indices":[1,2,3,4,5,6], "values":[1,2,3,4,5,6]}'::svector * '{"indices":[1,2,3,4,5,6], "values":[2,3,4,5,6,7]}'::svector AS value;