Create a vector index in AnalyticDB for PostgreSQL - AnalyticDB

You can create a vector index to accelerate vector searches. This is useful for large datasets or scenarios that require fast data access and retrieval. Common use cases include database query optimization, machine learning, data mining, image and video retrieval, and spatial data queries. A vector index improves query performance, accelerates data analytics, and optimizes search tasks. This improves system efficiency and response speed.

Background information

The FastANN vector search engine in cloud-native data warehouse AnalyticDB for PostgreSQL implements the popular Hierarchical Navigable Small World (HNSW) algorithm. The engine is built on the segmented page storage of PostgreSQL. It stores only pointers to vector columns in the index, which significantly reduces storage space. The FastANN vector search engine also supports product quantization (PQ) to reduce the dimensions of high-dimensional vectors. Storing these reduced-dimension vectors in the index minimizes table lookups during vector insertions and queries. This improves vector search performance.

Syntax

CREATE INDEX [INDEX_NAME]
ON [SCHEMA_NAME].[TABLE_NAME]   
USING ANN(COLUMN_NAME) 
WITH (DIM=<DIMENSION>,
      DISTANCEMEASURE=<MEASURE>,
      HNSW_M=<M>,
      HNSW_EF_CONSTRUCTION=<EF_CONSTURCTION>,
      PQ_ENABLE=<PQ_ENABLE>,
      PQ_SEGMENTS=<PQ_SEGMENTS>,
      PQ_CENTERS=<PQ_CENTERS>,
      EXTERNAL_STORAGE=<EXTERNAL_STORAGE>);

The following table describes the fields.

INDEX_NAME: The index name.
SCHEMA_NAME: The schema (namespace) name.
TABLE_NAME: The table name.
COLUMN_NAME: The vector index column name.

Other vector index parameters:

Vector index parameter	Required	Description	Default value	Value range
DIM	Yes	The vector dimensions. This parameter is used for checks during vector insertion. If the dimensions do not match, the system returns an error message.	0	[1, 8192]
DISTANCEMEASURE	No	The supported similarity distance measures: L2: Builds an index using the squared Euclidean distance function. This is typically used for image similarity search scenarios. Formula: IP: Builds an index using the negative inner product distance function. This is typically used as a substitute for cosine similarity after vector normalization. Formula: COSINE: Builds an index using the cosine distance function. This is typically used for text similarity search scenarios. Formula: Note IP and COSINE require an engine version of 6.3.10.18 or later. Before you use them, make sure your engine version meets this requirement. To view and upgrade the engine version, see Upgrade versions. Normalize vector data before you import it. Then, use the IP distance as the similarity measure for the index to achieve optimal performance. When you use IP distance as the similarity measure for an index, you can also directly obtain the Euclidean distance and cosine similarity during queries. For more information about how to use this feature, see Vector search. For more information about user-defined functions (UDFs) for vector search, see Related reference.	L2	(L2, IP, COSINE)
HNSW_M	No	The maximum number of neighbors in the HNSW algorithm.	32	[10, 1000]
HNSW_EF_CONSTRUCTION	No	The size of the candidate set during index building for the HNSW algorithm.	600	[40, 4000]
PQ_ENABLE	No	Specifies whether to enable the PQ vector dimension reduction feature. PQ vector dimension reduction depends on training with existing vector sample data. Do not set this parameter if the data volume is less than 500,000.	0	[0, 1]
PQ_SEGMENTS	No	This parameter takes effect when PQ_ENABLE is set to 1. It specifies the number of segments in the k-means clustering algorithm used for PQ vector dimension reduction. If DIM is divisible by 8, you do not need to specify this parameter. Otherwise, you must set it manually. PQ_SEGMENTS must be greater than 0, and DIM must be divisible by PQ_SEGMENTS.	0	[1, 256]
PQ_CENTERS	No	This parameter takes effect when PQ_ENABLE is set to 1. It specifies the number of centroids in the k-means clustering algorithm used for PQ vector dimension reduction.	2048	[256, 8192]
EXTERNAL_STORAGE	No	Specifies whether to use mmap to build the HNSW index. If set to 0, the index is built using segmented page storage by default. This mode can use the shared_buffer in PostgreSQL for caching and supports operations such as delete and update. If set to 1, the index is built using mmap. Starting from version v6.6.2.3, the index supports mark-for-delete, which allows a small number of delete and update operations on the data table. Important The external_storage parameter is supported only in version 6.0. It is not supported in version 7.0.	0	[0, 1]

Examples

Assume that you have a text knowledge base. You can split articles into chunks, convert them into embedding vectors, and store them in a database. The generated `chunks` table contains the following fields:

Field	Type	Description
id	serial	The ID.
chunk	varchar(1024)	The text chunk after the article is split.
intime	timestamp	The time when the article was imported into the database.
url	varchar(1024)	The link to the article to which the text chunk belongs.
feature	real[]	The embedding vector of the text chunk.

Create a database table to store the vectors.

CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    chunk VARCHAR(1024),
    intime TIMESTAMP,
    url VARCHAR(1024),
    feature REAL[]
) DISTRIBUTED BY (id);

Set the storage mode of the vector column to PLAIN to reduce row scan costs and improve performance.
```
ALTER TABLE chunks ALTER COLUMN feature SET STORAGE PLAIN;
```
Create a vector index on the vector column.
```
-- Create a vector index that uses the Euclidean distance measure.
CREATE INDEX idx_chunks_feature_l2 ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=l2, hnsw_m=64, pq_enable=1);
-- Create a vector index that uses the inner product distance measure.
CREATE INDEX idx_chunks_feature_ip ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=ip, hnsw_m=64, pq_enable=1);
-- Create a vector index that uses the cosine similarity measure.
CREATE INDEX idx_chunks_feature_cosine ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);
```
Note
- Currently, you can create multiple vector columns in a vector table and multiple vector indexes on each vector column. Create vector indexes only as needed to avoid creating invalid ones.
- The vector query method must correspond to the distance measure of the vector index. For example, the <-> operator in a vector query corresponds to a vector index that uses the Euclidean distance measure. The <#> operator corresponds to an inner product distance index, and the <=> operator corresponds to a cosine similarity index. If a corresponding vector index does not exist, the vector query degrades to an exact search (brute-force search).
- Use the ANN access method provided by the FastANN vector search plugin. When you enable the Vector Search Engine Optimization feature for an instance, the FastANN vector search plugin is automatically created in that instance.
- If the error message You must specify an operator class or define a default operator class for the data type is returned, the Vector Search Engine Optimization feature is not enabled for the instance. To resolve this issue, enable the feature and retry the operation. For more information, see Enable or disable vector search engine optimization.
Create indexes on frequently used structured columns to improve the performance of hybrid queries.
```
CREATE INDEX ON chunks(intime);
```

Vector index building methods

Vector indexes can be built in two main ways:

Stream building
First, you create an empty table and then build a vector index on it. When you import vector data, the index is built in real time as a stream. This method is suitable for real-time vector search scenarios but can slow down the data import process.
Asynchronous building
First, you create an empty table and import the data without a vector index. Then, you can build the vector index on the existing vector data. This method is suitable for large-scale vector data import scenarios.

Note

AnalyticDB for PostgreSQL provides a concurrent building capability for the asynchronous vector index building method. You can set the degree of parallelism for index building using the `fastann.build_parallel_processes` Grand Unified Configuration (GUC) parameter.

For example, you can build an index asynchronously on DMS as follows:

-- Assume the operation is performed on an instance of the 8-core 32 GB instance type.
-- Set statement_timeout and idle_in_transaction_session_timeout to 0 to prevent timeout interruptions.
-- Set fastann.build_parallel_processes to 8 to ensure full CPU utilization on compute nodes. You must set this value based on the segment specifications. For example, set it to 4 for a 4-core 16 GB instance type, 8 for an 8-core 32 GB instance type, and 16 for a 16-core 64 GB instance type.
-- On DMS, the parameter settings must be on the same line as the CREATE INDEX SQL statement to take effect.
SET statement_timeout = 0;SET idle_in_transaction_session_timeout = 0;SET fastann.build_parallel_processes = 8;CREATE INDEX ON chunks USING ann(feature) 
WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);

To build an index asynchronously using psql, perform the following steps:

-- Assume the operation is performed on an 8-core 32 GB instance, the same as the DMS operation above.

-- Step 1: Save the following content to a file named create_index.sql.
SET statement_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET fastann.build_parallel_processes = 8;
CREATE INDEX ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);

-- Step 2: Run the following command to start building the index in parallel.
psql -h gp-xxxx-master.gpdb.rds.aliyuncs.com -p 5432 -d dbname -U username -f create_index.sql

Important

If the current instance is handling business workloads, do not set the number of parallel processes for index building to match the instance specifications. Otherwise, the performance of running business workloads may be affected.

Related reference

Supported vector functions

Function purpose	Vector function	Return value type	Meaning	Supported data types
Calculation	l2_distance	double precision	Euclidean distance (square root value). It measures the magnitude of two vectors and represents the distance between them. It is commonly used for image similarity search. Important In versions V6.3.10.17 and earlier, `l2_distance` returns the squared Euclidean distance. In versions V6.3.10.18 and later, it returns the Euclidean distance (square root value).	smallint[], float2[], float4[], real[]
	inner_product_distance	double precision	Inner product distance. After vector normalization, it is equal to cosine similarity. It is often used as a substitute for cosine similarity after normalization. Formula:	smallint[], float2[], float4[], real[]
	cosine_similarity	double precision	Cosine similarity. The value range is [-1, 1]. It is typically used to measure the directional similarity between two vectors, regardless of their actual length. Formula:	smallint[], float2[], float4[], real[]
	dp_distance	double precision	Dot product distance. It is identical to the inner product distance. Formula:	smallint[], float2[], float4[], real[]
	hm_distance	integer	Hamming distance. Formula: bitwise operation	int[]
	vector_add	smallint[], float2[], or float4[]	Calculates the sum of two vector arrays.	smallint[], float2[], float4[], real[]
	vector_sub	smallint[], float2[], or float4[]	Calculates the difference between two vector arrays.	smallint[], float2[], float4[], real[]
	vector_mul	smallint[], float2[], or float4[]	Calculates the product of two vector arrays.	smallint[], float2[], float4[], real[]
	vector_div	smallint[], float2[], or float4[]	Calculates the quotient of two vector arrays.	smallint[], float2[], float4[], real[]
	vector_sum	int or double precision	Calculates the sum of all elements in a vector array.	smallint[], int[], float2[], float4[], real[], float8[]
	vector_min	int or double precision	Finds the minimum value among all elements in a vector array.	smallint[], int[], float2[], float4[], real[], float8[]
	vector_max	int or double precision	Finds the maximum value among all elements in a vector array.	smallint[], int[], float2[], float4[], real[], float8[]
	vector_avg	int or double precision	Calculates the average value of all elements in a vector array.	smallint[], int[], float2[], float4[], real[], float8[]
	vector_norm	double precision	Calculates the norm (length) of a vector array.	smallint[], int[], float2[], float4[], real[], float8[]
	vector_angle	double precision	Calculates the angle between two vector arrays.	smallint[], float2[], float4[], real[]
Sorting	l2_squared_distance	double precision	Squared Euclidean distance. Because it avoids the square root calculation of the Euclidean distance, it is mainly used for sorting by Euclidean distance to reduce computational load. Formula:	smallint[], float2[], float4[], real[]
	negative_inner_product_distance	double precision	Negative inner product distance. This is the negated result of the inner product distance. It is mainly used for sorting by inner product distance to ensure the results are sorted in descending order of inner product distance. Formula:	smallint[], float2[], float4[], real[]
	cosine_distance	double precision	Cosine distance. The value range is [0, 2]. It is mainly used for sorting by cosine similarity to ensure the results are sorted in descending order of cosine similarity. Formula:	smallint[], float2[], float4[], real[]

Important

The `vector_add`, `vector_sub`, `vector_mul`, `vector_div`, `vector_sum`, `vector_min`, `vector_max`, `vector_avg`, `vector_norm`, and `vector_angle` functions are supported only in v6.6.1.0 and later versions. To view the instance version, see View the minor engine version.

Examples of vector functions:

-- Euclidean distance
SELECT l2_distance(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Inner product distance
SELECT inner_product_distance(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Cosine similarity
SELECT cosine_similarity(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Dot product distance
SELECT dp_distance(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Hamming distance
SELECT hm_distance(array[1,0,1,0]::int[], array[0,1,0,1]::int[]);
-- Vector array addition
SELECT vector_add(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Vector array subtraction
SELECT vector_sub(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Vector array multiplication
SELECT vector_mul(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Vector array division
SELECT vector_div(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Sum of vector array elements
SELECT vector_sum(array[1,1,1,1]::real[]);
-- Minimum value in a vector array
SELECT vector_min(array[1,1,1,1]::real[]);
-- Maximum value in a vector array
SELECT vector_max(array[1,1,1,1]::real[]);
-- Average value of a vector array
SELECT vector_avg(array[1,1,1,1]::real[]);
-- Norm of a vector array
SELECT vector_norm(array[1,1,1,1]::real[]);
-- Angle between two vectors
SELECT vector_angle(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);

Supported vector operators

Vector operator	Calculation meaning	Sorting meaning	Supported data types
<->	Gets the squared Euclidean distance. The result is equivalent to `l2_squared_distance`.	Sorts by squared Euclidean distance in ascending order.	smallint[], float2[], float4[], real[]
<#>	Gets the negative inner product. The result is equivalent to `negative_inner_product_distance`.	Sorts by dot product distance in descending order.	smallint[], float2[], float4[], real[]
<=>	Gets the cosine distance. The result is equivalent to `cosine_distance`.	Sorts by cosine similarity in descending order.	smallint[], float2[], float4[], real[]
+	Calculates the sum of two vectors.	None	smallint[], float2[], float4[], real[]
-	Calculates the difference between two vectors.	None	smallint[], float2[], float4[], real[]
*	Calculates the product of two vectors.	None	smallint[], float2[], float4[], real[]
/	Calculates the quotient of two vectors.	None	smallint[], float2[], float4[], real[]

Important

The +, -, *, and / vector operators are supported only in v6.6.1.0 and later versions. To view the instance version, see View the minor engine version.

Examples of vector operators:

-- Euclidean distance
SELECT array[1,1,1,1]::real[] <-> array[2,2,2,2]::real[] AS score;
-- Inner product distance
SELECT array[1,1,1,1]::real[] <#> array[2,2,2,2]::real[] AS score;
-- Cosine distance
SELECT array[1,1,1,1]::real[] <=> array[2,2,2,2]::real[] AS score;

-- Addition
SELECT array[1,1,1,1]::real[] + array[2,2,2,2]::real[] AS value;
-- Subtraction
SELECT array[1,1,1,1]::real[] - array[2,2,2,2]::real[] AS value;
-- Multiplication
SELECT array[1,1,1,1]::real[] * array[2,2,2,2]::real[] AS value;
-- Division
SELECT array[1,1,1,1]::real[] / array[2,2,2,2]::real[] AS value;