Create a vector index in AnalyticDB for PostgreSQL - AnalyticDB for PostgreSQL

In search scenarios that involve large amounts of data or require fast data access, such as database query optimization, machine learning, data mining, image and video search, and spatial data query, you can create vector indexes to accelerate vector search. This way, you can accelerate data analysis, optimize search jobs, and improve the query performance and response efficiency of the system.

Background information

AnalyticDB for PostgreSQL vector databases use the FastANN vector search engine to implement the mainstream Hierarchical Navigable Small World (HNSW) algorithm. The FastANN vector search engine uses the segmented paging storage of PostgreSQL and stores only the vector column pointers in indexes to significantly reduce the storage that is occupied by vector indexes. The vector search engine uses the product quantization (PQ) feature to reduce the dimensionality of high-dimensional vectors. This way, low-dimensional vectors are stored in indexes to reduce index lookups when you insert or query vectors and improve vector search performance.

Syntax

CREATE INDEX [INDEX_NAME]
ON [SCHEMA_NAME].[TABLE_NAME]   
USING ANN(COLUMN_NAME) 
WITH (DIM=<DIMENSION>,
      DISTANCEMEASURE=<MEASURE>,
      HNSW_M=<M>,
      HNSW_EF_CONSTRUCTION=<EF_CONSTURCTION>,
      PQ_ENABLE=<PQ_ENABLE>,
      PQ_SEGMENTS=<PQ_SEGMENTS>,
      PQ_CENTERS=<PQ_CENTERS>,
      EXTERNAL_STORAGE=<EXTERNAL_STORAGE>);

Parameters:

INDEX_NAME: the name of the index.
SCHEMA_NAME: the name of the schema or namespace.
TABLE_NAME: the name of the table.
COLUMN_NAME: the name of the vector index column.

The following table describes other vector index parameters.

Vector index parameter	Required	Description	Default value	Valid value
DIM	Yes	The number of vector dimensions. This parameter is used to check vectors when vectors are inserted. If the number of vector dimensions does not match, the system displays an error message.	0	[1, 8192]
DISTANCEMEASURE	No	The distance or similarity measurement. Valid values: L2: uses the squared Euclidean distance function to create indexes. This measurement is suitable for image similarity search scenarios. Formula: IP: uses the negative inner product distance function to create indexes. This measurement is used to replace cosine similarity after vector normalization. Formula: COSINE: uses the cosine distance function to create indexes. This measurement is suitable for text similarity search scenarios. Formula: Note If you set the DISTANCEMEASURE parameter to IP or COSINE, make sure that the minor version of the AnalyticDB for PostgreSQL instance is 6.3.10.18 or later. For information about how to view and update the minor version, see Update the minor engine version. To achieve the best performance, we recommend that you store the normalized vector data in databases and use the inner product distance measurement to create indexes. When you use the inner product distance measure to create indexes, you can also query the Euclidean distance and cosine similarity. For more information, see Perform vector search. For information about user-defined functions (UDFs) for vector search, see the "References" section of this topic.	L2	(L2, IP, COSINE)
HNSW_M	No	The maximum number of neighbors for the HNSW algorithm.	50	[1, 1000]
HNSW_EF_CONSTRUCTION	No	The size of the candidate set when the HNSW algorithm is used to create indexes.	600	[1, 4000]
PQ_ENABLE	No	Specifies whether to enable the PQ feature for dimensionality reduction. The PQ feature uses existing vector sample data for training. If the number of rows that are occupied by data is less than 500,000, we recommend that you do not specify this parameter.	0	[0, 1]
PQ_SEGMENTS	No	The number of segments that are used in the K-means clustering algorithm after you enable the PQ feature. This parameter takes effect only if you set the PQ_ENABLE parameter to 1. If the DIM parameter divided by eight is an integer, you do not need to specify this parameter. Otherwise, you must specify this parameter. The value of the PQ_SEGMENTS parameter must be greater than 0. The DIM parameter divided by the PQ_SEGMENTS parameter must be an integer.	0	[1, 256]
PQ_CENTERS	No	The number of centroids that are used in the K-means clustering algorithm after you enable the PQ feature. This parameter takes effect only if you set the PQ_ENABLE parameter to 1.	2048	[256, 8192]
EXTERNAL_STORAGE	No	Specifies whether to use the memory mapping technology to create HNSW indexes. Valid values: 0: uses segmented paging storage to create indexes. This method uses the shared buffer of PostgreSQL for caching and supports the delete and update operations. 1: uses the memory mapping technology to create indexes. This method does not support the delete or update operation.	0	[0, 1]

Example

In this example, a text knowledge base is used. An article in the knowledge base is split into chunks, converted into embedding vectors, and then stored in a database. The following table describes the fields that are contained in a table named chunks.

Field	Data type	Description
id	Serial	The serial number.
chunk	Varchar(1024)	The text chunk.
intime	Timestamp	The time when the article is stored in the database.
url	Varchar(1024)	The URL of the article to which the text chunk belongs.
feature	Real[]	The embedding vector of the text chunk.

Create a table to store vectors.

CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    chunk VARCHAR(1024),
    intime TIMESTAMP,
    url VARCHAR(1024),
    feature REAL[]
) DISTRIBUTED BY (id);

Set the storage mode of the vector column to PLAIN to reduce data scanning costs and achieve higher performance.
```
ALTER TABLE chunks ALTER COLUMN feature SET STORAGE PLAIN;
```
Create a vector index for the vector column.
```
-- Create a vector index based on the Euclidean distance. 
CREATE INDEX idx_chunks_feature_l2 ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=l2, hnsw_m=64, pq_enable=1);
-- Create a vector index based on the inner product distance. 
CREATE INDEX idx_chunks_feature_ip ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=ip, hnsw_m=64, pq_enable=1);
-- Create a vector index based on the cosine similarity. 
CREATE INDEX idx_chunks_feature_cosine ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);
```
Note
- You can create multiple vector columns in each vector table and multiple vector indexes on each vector column. To avoid invalid indexes, you must create vector indexes based on your business requirements.
- The vector search methods must match the vector indexes that are created by using the corresponding distance measurements. For example, the <-> operator works with the vector index created by using the Euclidean distance, the <#> operator works with the vector index created by using the inner product distance, and the <=> operator works with the vector index created by using the cosine similarity. If the operators do not work with the corresponding vector indexes, the vector search method is downgraded to exact search (brute-force search).
- Use the ANN access method that is provided by the FastANN vector search extension. If you enable the vector search engine optimization feature for an AnalyticDB for PostgreSQL instance, the FastANN vector search extension is automatically created in the AnalyticDB for PostgreSQL instance.
- If the You must specify an operator class or define a default operator class for the data type error message is displayed, the vector search engine optimization feature is not enabled for the AnalyticDB for PostgreSQL instance. In this case, enable the feature and try again. For more information, see Enable or disable vector search engine optimization.
Create an index for structured data columns to accelerate hybrid search.
```
CREATE INDEX ON chunks(intime);
```

Index creation modes

You can create vector indexes in one of the following modes:

Streaming mode
Create an empty table and then create a vector index for the table. This way, the streaming index is created in real time when you import vector data. This mode is suitable for real-time vector search scenarios. However, you cannot quickly import vector data in this mode.
Asynchronous mode
Create an empty table, import vector data to the table, and then create a vector index for the imported vector data. This mode is suitable for scenarios in which you want to import large amounts of vector data.

Note

AnalyticDB for PostgreSQL allows you to concurrently create vector indexes in asynchronous mode. You can use the Grand Unified Configuration (GUC) parameter fastann.build_parallel_processes to specify the number of concurrent index creation processes.

For example, you can create an index in asynchronous mode by using Data Management (DMS). Sample code:

-- For example, an AnalyticDB for PostgreSQL instance whose compute node specifications are 8 cores and 32 GB memory is used. 
-- To prevent timeout errors, set the statement_timeout and idle_in_transaction_session_timeout parameters to 0. 
-- Set the fastann.build_parallel_processes parameter to 8 to ensure that the CPU resources of compute nodes are fully utilized. The value of the fastann.build_parallel_processes parameter must be the same as the number of CPU cores in the compute node specifications. 
-- In DMS, you must specify values for parameters in the same line as the index creation statement. 
SET statement_timeout = 0;SET idle_in_transaction_session_timeout = 0;SET fastann.build_parallel_processes = 8;CREATE INDEX ON chunks USING ann(feature) 
WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);

You can create an index in asynchronous mode by using the psql tool. Sample code:

-- For example, an AnalyticDB for PostgreSQL instance whose compute node specifications are 8 cores and 32 GB memory is used. 

-- Step 1: Store the following content in the create_index.sql file: 
SET statement_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET fastann.build_parallel_processes = 8;
CREATE INDEX ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);

-- Step 2: Run the following command to create an index in parallel: 
psql -h gp-xxxx-master.gpdb.rds.aliyuncs.com -p 5432 -d dbname -U username -f create_index.sql

Important

If other services are running on the AnalyticDB for PostgreSQL instance, you cannot set the number of concurrent index creation processes to the number of CPU cores in the compute node specifications.

References

Supported vector functions

Purpose	Vector function	Type of the return value	Description	Supported data type
Calculation	l2_distance	Double precision	Calculates the rooted Euclidean distance between two vectors. This function is used to measure the distance between two vectors and suitable for image similarity search scenarios. Important For AnalyticDB for PostgreSQL instances of V6.3.10.17 or earlier, the l2_distance function calculates the squared Euclidean distance. For AnalyticDB for PostgreSQL instances of V6.3.10.18 or later, the l2_distance function calculates the rooted Euclidean distance.	Smallint[], float2[], float4[], and real[]
	inner_product_distance	Double precision	Calculates the inner product distance between two vectors, which is equivalent to cosine similarity after vector normalization. This function is used to replace cosine similarity after vector normalization. Formula:	Smallint[], float2[], float4[], and real[]
	cosine_similarity	Double precision	Calculates the cosine similarity between two vectors. This function is used to measure the similarity between two vectors based on the direction regardless of the actual length of the vectors. Valid values of the cosine similarity: -1 to 1. Formula:	Smallint[], float2[], float4[], and real[]
	dp_distance	Double precision	Calculates the dot product distance, which is the same as the inner product distance. Formula:	Smallint[], float2[], float4[], and real[]
	hm_distance	Integer	Calculates the Hamming distance, which is the number of bit positions in which the two bits are different.	Int[]
	vector_add	Smallint[], float2[], or float4[]	Calculates the sum of two vector arrays.	Smallint[], float2[], float4[], and real[]
	vector_sub	Smallint[], float2[], or float4[]	Calculates the difference between two vector arrays.	Smallint[], float2[], float4[], and real[]
	vector_mul	Smallint[], float2[], or float4[]	Calculates the product of two vector arrays.	Smallint[], float2[], float4[], and real[]
	vector_div	Smallint[], float2[], or float4[]	Calculates the quotient of two vector arrays.	Smallint[], float2[], float4[], and real[]
	vector_sum	Int or double precision	Calculates the sum of all elements in a vector array.	Smallint[], int[], float2[], float4[], real[], and float8[]
	vector_min	Int or double precision	Calculate the smallest value among all elements in a vector array.	Smallint[], int[], float2[], float4[], real[], and float8[]
	vector_max	Int or double precision	Calculates the largest value among all elements in a vector array.	Smallint[], int[], float2[], float4[], real[], and float8[]
	vector_avg	Int or double precision	Calculates the average value of all elements in a vector array.	Smallint[], int[], float2[], float4[], real[], and float8[]
	vector_norm	Double precision	Calculates the norm of a vector array.	Smallint[], int[], float2[], float4[], real[], and float8[]
	vector_angle	Double precision	Calculates the angle between two vector arrays.	Smallint[], float2[], float4[], and real[]
Sorting	l2_squared_distance	Double precision	Sorts vectors based on the squared Euclidean distance. This function requires fewer calculation than the sorting method based on the rooted Euclidean distance. Formula:	Smallint[], float2[], float4[], and real[]
	negative_inner_product_distance	Double precision	Sorts vectors based on the negative inner product distance. This function is used to sort vectors based on the inner product distance in descending order. Formula:	Smallint[], float2[], float4[], and real[]
	cosine_distance	Double precision	Sorts vectors based on the cosine distance. This function is used to sort vectors based on the cosine similarity in descending order. Valid values of the cosine distance: 0 to 2. Formula:	Smallint[], float2[], float4[], and real[]

Important

Only AnalyticDB for PostgreSQL instances of V6.6.1.0 or later support the following vector functions: vector_add, vector_sub, vector_mul, vector_div, vector_sum, vector_min, vector_max, vector_avg, vector_norm, and vector_angle. For information about how to view the minor version of an AnalyticDB for PostgreSQL instance, see View the minor engine version.

Examples of vector functions:

-- Euclidean distance
SELECT l2_distance(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Inner product distance
SELECT inner_product_distance(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Cosine similarity
SELECT cosine_similarity(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Dot product distance
SELECT dp_distance(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Hamming distance
SELECT hm_distance(array[1,0,1,0]::int[], array[0,1,0,1]::int[]);
-- Sum of two vector arrays
SELECT vector_add(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Difference between two vector arrays
SELECT vector_sub(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Product of two vector arrays
SELECT vector_mul(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Quotient of two vector arrays
SELECT vector_div(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);
-- Sum of all elements in a vector array
SELECT vector_sum(array[1,1,1,1]::real[]);
-- Smallest value among all elements in a vector array
SELECT vector_min(array[1,1,1,1]::real[]);
-- Largest value among all elements in a vector array
SELECT vector_max(array[1,1,1,1]::real[]);
-- Average value of all elements in a vector array
SELECT vector_avg(array[1,1,1,1]::real[]);
-- Norm of a vector array
SELECT vector_norm(array[1,1,1,1]::real[]);
-- Angle between two vector arrays
SELECT vector_angle(array[1,1,1,1]::real[], array[2,2,2,2]::real[]);

Supported vector operators

Vector operator	Calculation	Sorting	Supported data type
<->	Calculates the squared Euclidean distance between two vectors. The result is equivalent to the result of the l2_squared_distance function.	Sorts vectors based on the squared Euclidean distance in ascending order.	Smallint[], float2[], float4[], and real[]
<#>	Calculates the negative inner product distance between two vectors. The result is equivalent to the result of the negative_inner_product_distance function.	Sorts vectors based on the dot product distance in descending order.	Smallint[], float2[], float4[], and real[]
<=>	Calculates the cosine distance between two vectors. The result is equivalent to the result of the cosine_distance function.	Sorts vectors based on the cosine similarity in descending order.	Smallint[], float2[], float4[], and real[]
+	Calculates the sum of two vectors.	N/A	Smallint[], float2[], float4[], and real[]
-	Calculates the difference between two vectors.	N/A	Smallint[], float2[], float4[], and real[]
*	Calculates the product of two vectors.	N/A	Smallint[], float2[], float4[], and real[]
/	Calculates the quotient of two vectors.	N/A	Smallint[], float2[], float4[], and real[]

Important

Only AnalyticDB for PostgreSQL instances of V6.6.1.0 or later support the following vector operators: +, -, *, and /. For information about how to view the minor version of an AnalyticDB for PostgreSQL instance, see View the minor engine version.

Examples of vector operators:

-- Euclidean distance
SELECT array[1,1,1,1]::real[] <-> array[2,2,2,2]::real[] AS score;
-- Inner product distance
SELECT array[1,1,1,1]::real[] <#> array[2,2,2,2]::real[] AS score;
-- Cosine distance
SELECT array[1,1,1,1]::real[] <=> array[2,2,2,2]::real[] AS score;

-- Sum of two vectors
SELECT array[1,1,1,1]::real[] + array[2,2,2,2]::real[] AS value;
-- Difference between two vectors
SELECT array[1,1,1,1]::real[] - array[2,2,2,2]::real[] AS value;
-- Product of two vectors
SELECT array[1,1,1,1]::real[] * array[2,2,2,2]::real[] AS value;
-- Quotient of two vectors
SELECT array[1,1,1,1]::real[] / array[2,2,2,2]::real[] AS value;