You can create a vector index to accelerate vector searches. This is useful for large datasets or scenarios that require fast data access and retrieval. Common use cases include database query optimization, machine learning, data mining, image and video retrieval, and spatial data queries. A vector index improves query performance, accelerates data analytics, and optimizes search tasks. This improves system efficiency and response speed.
Background information
The FastANN vector search engine in cloud-native data warehouse AnalyticDB for PostgreSQL implements the popular Hierarchical Navigable Small World (HNSW) algorithm. The engine is built on the segmented page storage of PostgreSQL. It stores only pointers to vector columns in the index, which significantly reduces storage space. The FastANN vector search engine also supports product quantization (PQ) to reduce the dimensions of high-dimensional vectors. Storing these reduced-dimension vectors in the index minimizes table lookups during vector insertions and queries. This improves vector search performance.
Syntax
CREATE INDEX [INDEX_NAME]
ON [SCHEMA_NAME].[TABLE_NAME]
USING ANN(COLUMN_NAME)
WITH (DIM=<DIMENSION>,
DISTANCEMEASURE=<MEASURE>,
HNSW_M=<M>,
HNSW_EF_CONSTRUCTION=<EF_CONSTURCTION>,
PQ_ENABLE=<PQ_ENABLE>,
PQ_SEGMENTS=<PQ_SEGMENTS>,
PQ_CENTERS=<PQ_CENTERS>,
EXTERNAL_STORAGE=<EXTERNAL_STORAGE>);The following table describes the fields.
INDEX_NAME: The index name.
SCHEMA_NAME: The schema (namespace) name.
TABLE_NAME: The table name.
COLUMN_NAME: The vector index column name.
Other vector index parameters:
Vector index parameter
Required
Description
Default value
Value range
DIM
Yes
The vector dimensions. This parameter is used for checks during vector insertion. If the dimensions do not match, the system returns an error message.
0
[1, 8192]
DISTANCEMEASURE
No
The supported similarity distance measures:
L2: Builds an index using the squared Euclidean distance function. This is typically used for image similarity search scenarios.
Formula:

IP: Builds an index using the negative inner product distance function. This is typically used as a substitute for cosine similarity after vector normalization.
Formula:

COSINE: Builds an index using the cosine distance function. This is typically used for text similarity search scenarios.
Formula:

NoteIP and COSINE require an engine version of 6.3.10.18 or later. Before you use them, make sure your engine version meets this requirement. To view and upgrade the engine version, see Upgrade versions.
Normalize vector data before you import it. Then, use the IP distance as the similarity measure for the index to achieve optimal performance.
When you use IP distance as the similarity measure for an index, you can also directly obtain the Euclidean distance and cosine similarity during queries. For more information about how to use this feature, see Vector search.
For more information about user-defined functions (UDFs) for vector search, see Related reference.
L2
(L2, IP, COSINE)
HNSW_M
No
The maximum number of neighbors in the HNSW algorithm.
32
[10, 1000]
HNSW_EF_CONSTRUCTION
No
The size of the candidate set during index building for the HNSW algorithm.
600
[40, 4000]
PQ_ENABLE
No
Specifies whether to enable the PQ vector dimension reduction feature. PQ vector dimension reduction depends on training with existing vector sample data. Do not set this parameter if the data volume is less than 500,000.
0
[0, 1]
PQ_SEGMENTS
No
This parameter takes effect when PQ_ENABLE is set to 1. It specifies the number of segments in the k-means clustering algorithm used for PQ vector dimension reduction.
If DIM is divisible by 8, you do not need to specify this parameter. Otherwise, you must set it manually.
PQ_SEGMENTS must be greater than 0, and DIM must be divisible by PQ_SEGMENTS.
0
[1, 256]
PQ_CENTERS
No
This parameter takes effect when PQ_ENABLE is set to 1. It specifies the number of centroids in the k-means clustering algorithm used for PQ vector dimension reduction.
2048
[256, 8192]
EXTERNAL_STORAGE
No
Specifies whether to use mmap to build the HNSW index.
If set to 0, the index is built using segmented page storage by default. This mode can use the shared_buffer in PostgreSQL for caching and supports operations such as delete and update.
If set to 1, the index is built using mmap. Starting from version v6.6.2.3, the index supports mark-for-delete, which allows a small number of delete and update operations on the data table.
ImportantThe external_storage parameter is supported only in version 6.0. It is not supported in version 7.0.
0
[0, 1]
Examples
Assume that you have a text knowledge base. You can split articles into chunks, convert them into embedding vectors, and store them in a database. The generated `chunks` table contains the following fields:
Field | Type | Description |
id | serial | The ID. |
chunk | varchar(1024) | The text chunk after the article is split. |
intime | timestamp | The time when the article was imported into the database. |
url | varchar(1024) | The link to the article to which the text chunk belongs. |
feature | real[] | The embedding vector of the text chunk. |
Create a database table to store the vectors.
CREATE TABLE chunks ( id SERIAL PRIMARY KEY, chunk VARCHAR(1024), intime TIMESTAMP, url VARCHAR(1024), feature REAL[] ) DISTRIBUTED BY (id);Set the storage mode of the vector column to PLAIN to reduce row scan costs and improve performance.
ALTER TABLE chunks ALTER COLUMN feature SET STORAGE PLAIN;Create a vector index on the vector column.
-- Create a vector index that uses the Euclidean distance measure. CREATE INDEX idx_chunks_feature_l2 ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=l2, hnsw_m=64, pq_enable=1); -- Create a vector index that uses the inner product distance measure. CREATE INDEX idx_chunks_feature_ip ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=ip, hnsw_m=64, pq_enable=1); -- Create a vector index that uses the cosine similarity measure. CREATE INDEX idx_chunks_feature_cosine ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);NoteCurrently, you can create multiple vector columns in a vector table and multiple vector indexes on each vector column. Create vector indexes only as needed to avoid creating invalid ones.
The vector query method must correspond to the distance measure of the vector index. For example, the
<->operator in a vector query corresponds to a vector index that uses the Euclidean distance measure. The<#>operator corresponds to an inner product distance index, and the<=>operator corresponds to a cosine similarity index. If a corresponding vector index does not exist, the vector query degrades to an exact search (brute-force search).Use the
ANNaccess method provided by the FastANN vector search plugin. When you enable the Vector Search Engine Optimization feature for an instance, the FastANN vector search plugin is automatically created in that instance.If the error message
You must specify an operator class or define a default operator class for the data typeis returned, the Vector Search Engine Optimization feature is not enabled for the instance. To resolve this issue, enable the feature and retry the operation. For more information, see Enable or disable vector search engine optimization.
Create indexes on frequently used structured columns to improve the performance of hybrid queries.
CREATE INDEX ON chunks(intime);
Vector index building methods
Vector indexes can be built in two main ways:
Stream building
First, you create an empty table and then build a vector index on it. When you import vector data, the index is built in real time as a stream. This method is suitable for real-time vector search scenarios but can slow down the data import process.
Asynchronous building
First, you create an empty table and import the data without a vector index. Then, you can build the vector index on the existing vector data. This method is suitable for large-scale vector data import scenarios.
AnalyticDB for PostgreSQL provides a concurrent building capability for the asynchronous vector index building method. You can set the degree of parallelism for index building using the `fastann.build_parallel_processes` Grand Unified Configuration (GUC) parameter.
For example, you can build an index asynchronously on DMS as follows:
-- Assume the operation is performed on an instance of the 8-core 32 GB instance type.
-- Set statement_timeout and idle_in_transaction_session_timeout to 0 to prevent timeout interruptions.
-- Set fastann.build_parallel_processes to 8 to ensure full CPU utilization on compute nodes. You must set this value based on the segment specifications. For example, set it to 4 for a 4-core 16 GB instance type, 8 for an 8-core 32 GB instance type, and 16 for a 16-core 64 GB instance type.
-- On DMS, the parameter settings must be on the same line as the CREATE INDEX SQL statement to take effect.
SET statement_timeout = 0;SET idle_in_transaction_session_timeout = 0;SET fastann.build_parallel_processes = 8;CREATE INDEX ON chunks USING ann(feature)
WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);To build an index asynchronously using psql, perform the following steps:
-- Assume the operation is performed on an 8-core 32 GB instance, the same as the DMS operation above.
-- Step 1: Save the following content to a file named create_index.sql.
SET statement_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET fastann.build_parallel_processes = 8;
CREATE INDEX ON chunks USING ann(feature) WITH (dim=1536, distancemeasure=cosine, hnsw_m=64, pq_enable=1);
-- Step 2: Run the following command to start building the index in parallel.
psql -h gp-xxxx-master.gpdb.rds.aliyuncs.com -p 5432 -d dbname -U username -f create_index.sqlIf the current instance is handling business workloads, do not set the number of parallel processes for index building to match the instance specifications. Otherwise, the performance of running business workloads may be affected.

