
ApsaraDB RDS:Use the pgvector extension to perform high-dimensional vector similarity searches

Last Updated:Oct 13, 2023

pgvector is an extension supported by ApsaraDB RDS for PostgreSQL. It adds a vector data type to the database and allows you to perform vector similarity searches in an efficient manner.

Background information

ApsaraDB RDS for PostgreSQL supports the pgvector extension. The extension can be used to store vector data, perform vector similarity searches, and enrich the underlying data for AI services.

The pgvector extension provides the following benefits:

  • It enables the system to support vector data. You can use the pgvector extension to store and search for vector data.

  • It supports exact and approximate nearest neighbor (ANN) searches. This allows you to calculate the distances or similarities between vectors in a vector space by L2 Euclidean distance, cosine similarity, or inner product. It supports the construction of Hierarchical Navigable Small World (HNSW) indexes and IVFFLAT parallel indexes, element-by-element multiplication of vectors, L1 distance functions, and sum aggregation.

  • It allows you to create vectors with up to 16,000 dimensions and create indexes for vectors with up to 2,000 dimensions.

Terms and implementation principles

embedding

Embedding is a technique that translates high-dimensional data into a low-dimensional space. In machine learning and natural language processing (NLP), embedding is a common method that is used to represent sparse symbols or objects as continuous vectors.

In NLP, word embedding is an approach that is used to represent words in the form of real-valued vectors. This allows computers to better understand text-based content. Word embedding is used to translate the semantics and grammatical relations of words into a vector space.

Note

For more information, see the official documentation of common embedding tools and libraries.

Implementation principles

  1. Embedding is a technique that abstracts data such as words, images, audio content, and videos from multiple dimensions and translates the data into vectors.

  2. The pgvector extension supports vector data, which allows ApsaraDB RDS for PostgreSQL to store vector data.

  3. The pgvector extension can be used to perform exact and ANN searches on stored vector data.

Assume that you need to store three objects (an apple, a banana, and a cat) in an RDS instance and use the pgvector extension to calculate the similarities among the objects. You can perform the following steps:

  1. Use embeddings to translate the apple, banana, and cat objects into vectors. The following section provides an example of two-dimensional embeddings:

    Apple: embedding [1,1]
    Banana: embedding [1.2,0.8]
    Cat: embedding [6,0.4]
  2. Store the vector data in the RDS instance. For more information, see Examples.

    The following figure shows the distribution of the apple, banana, and cat objects in a two-dimensional space.


The apple and the banana are closer because they are both fruits and are more similar to each other. The cat is an animal and is located far from the fruits.
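These distances can be checked by hand. The following Python sketch (illustrative only; the SQL examples later in this topic perform the same calculation inside the database) computes the cosine similarities between the example embeddings:

```python
# Cosine similarity between the example embeddings, computed outside the database.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

apple, banana, cat = [1, 1], [1.2, 0.8], [6, 0.4]
print(round(cosine_similarity(apple, banana), 4))  # 0.9806: the fruits are very similar
print(round(cosine_similarity(banana, cat), 4))    # 0.8671: the cat is less similar
```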

The attributes of fruits can be grouped in terms of dimensions such as color, place of origin, and taste. If you group the fruits by finer-grained dimensions, the similarity results become more accurate.

Scenarios

  • You want to store vector data.

  • You want to perform vector similarity searches.

Prerequisites

The RDS instance meets the following requirements:

Extension management

  • Create the extension.

    CREATE EXTENSION IF NOT EXISTS vector;
  • Delete the extension.

    DROP EXTENSION vector;
  • Update the extension.

    ALTER EXTENSION vector UPDATE [ TO new_version ]
    Note

    In the preceding statement, the new_version parameter indicates the version of the pgvector extension. For more information about the latest version and related features of the pgvector extension, see official documentation of pgvector.

Examples

This section provides an example on how to use the pgvector extension. For more information, see official documentation of pgvector.

  1. Create a table named items of the vector type to store embeddings.

    CREATE TABLE items (
      id bigserial PRIMARY KEY, 
      item text, 
      embedding vector(2)
    );
    Note

    In the preceding statements, two-dimensional vectors are used. The pgvector extension allows you to create vectors with up to 16,000 dimensions.

  2. Insert the vectors into the created table.

    INSERT INTO
      items (item, embedding)
    VALUES
      ('Apple', '[1,1]'),
      ('Banana', '[1.2,0.8]'),
      ('Cat', '[6,0.4]');
  3. Use the cosine distance operator (<=>) to calculate the similarities between the banana, apple, and cat objects.

    SELECT
      item,
      1 - (embedding <=> '[1.2, 0.8]') AS cosine_similarity
    FROM
      items
    ORDER BY
      cosine_similarity DESC;
    Note
    • In the preceding statements, the cosine_similarity = 1 - cosine_distance formula is used to calculate similarities. The shorter the distance between vectors, the more similar the objects are.

    • You can also use the Euclidean distance operator (<->) or the negative inner product operator (<#>) to calculate similarities.

    Sample output:

      item  | cosine_similarity
    --------+-------------------
     Banana |                 1
     Apple  | 0.980580680748848
     Cat    | 0.867105556566985

    In the preceding output:

    • A value of 1 for the banana indicates an exact match.

    • A value of 0.98 for the apple indicates that the apple has a high similarity with the banana.

    • A value of approximately 0.87 for the cat indicates that the cat has a low similarity with the banana.

    Note

    You can specify a similarity threshold based on your business requirements to exclude results with low similarities.

  4. Create an index for vectors to improve the efficiency of vector similarity searches by using the pgvector extension. You can execute the following statement to create an index on the embedding column:

    CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

    The statement contains the following parameters:

    • items: the name of the table on which the index is created.

    • embedding: the name of the column on which the index is created.

    • vector_cosine_ops: the operator class of the index, which must match the distance operator that your queries use:

      • For vector similarity searches by cosine similarity, use vector_cosine_ops.

      • For vector similarity searches by Euclidean distance, use vector_l2_ops.

      • For vector similarity searches by inner product, use vector_ip_ops.

    • lists = 100: the number of lists into which the dataset is divided. A larger value divides the dataset into more and smaller subsets, which makes index searches faster. However, as the value of the lists parameter increases, the search recall rate may decrease.

    Note
    • The recall rate is the ratio of the number of relevant samples that are retrieved to the total number of relevant samples in an information retrieval or classification task. It measures the capability of the system to identify all relevant samples.

    • Index creation requires a large number of memory resources. If the value of the lists parameter is larger than 2000, the ERROR: memory required is xxx MB, maintenance_work_mem is xxx MB error message is displayed. In this case, you must change the value of the maintenance_work_mem parameter to a larger value. However, an excessively large value may cause out of memory (OOM) errors. For more information about how to change the value of the parameter, see Modify the parameters of an ApsaraDB RDS for PostgreSQL instance.

    • You can change the value of the lists parameter to make a trade-off between the search speed and the recall rate. This helps meet your requirements in specific scenarios.

  5. Use one of the following methods to configure the ivfflat.probes parameter. The parameter specifies the number of lists that you want to search in the index. To search more lists and improve the recall rate of the search result, you can increase the value of the ivfflat.probes parameter.

    • Session level

      SET ivfflat.probes = 10;
    • Transaction level

      BEGIN;
      SET LOCAL ivfflat.probes = 10;
      SELECT ...
      COMMIT;

    A large value of the ivfflat.probes parameter indicates a high recall rate of the search result but a slow search speed. To ensure the optimal search performance and recall rate, you must adjust the values of the ivfflat.probes and lists parameters based on your business requirements and the characteristics of the dataset.

    Note

    If the value of the ivfflat.probes parameter is the same as the value of the lists parameter that was specified during index creation, the vector index is ignored and a full table scan is performed, which may degrade search performance.
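The lists/probes trade-off can be illustrated with a toy, pure-Python IVFFlat-style sketch. This is a hypothetical simplification, not pgvector's actual implementation (real IVFFlat chooses centroids with k-means): vectors are partitioned into lists around centroids, and a query scans only the probes closest lists.

```python
# Toy IVFFlat-style index (hypothetical simplification, not pgvector's code).
# Vectors are partitioned into `lists` buckets around centroids; a query scans
# only the `probes` buckets whose centroids are closest, so a larger probes
# value raises recall at the cost of scanning more vectors.
import math
import random

def l2(a, b):
    return math.dist(a, b)

class ToyIVF:
    def __init__(self, vectors, lists):
        # Crude centroid choice: sample existing vectors (real IVFFlat runs k-means).
        self.centroids = random.sample(vectors, lists)
        self.buckets = [[] for _ in range(lists)]
        for v in vectors:
            nearest = min(range(lists), key=lambda c: l2(v, self.centroids[c]))
            self.buckets[nearest].append(v)

    def search(self, query, k, probes):
        # Scan only the `probes` buckets nearest to the query vector.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: l2(query, self.centroids[c]))
        candidates = [v for c in order[:probes] for v in self.buckets[c]]
        return sorted(candidates, key=lambda v: l2(query, v))[:k]

random.seed(0)
data = [[random.random() for _ in range(8)] for _ in range(2000)]
query = [random.random() for _ in range(8)]
exact = sorted(data, key=lambda v: l2(query, v))[:10]  # ground-truth top 10

index = ToyIVF(data, lists=50)

def recall_at10(probes):
    found = index.search(query, 10, probes)
    return sum(1 for v in found if v in exact) / 10

for probes in (1, 5, 50):
    print(f"probes={probes:2d}  recall@10={recall_at10(probes):.1f}")
```

When probes equals lists (50 here), every bucket is scanned and the result matches the exact search, which mirrors the behavior described in the note above.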

Performance data

When you create indexes for vectors, you must make a trade-off between the search speed and the recall rate based on your business requirements. The following test results provide examples of performance optimization.

This section describes the storage that is separately occupied by vectors and indexes for different volumes of data. This section also describes the impacts on the search speed and the recall rate when you adjust the values of the lists and probes parameters.

Test data preparation

  1. Create a test database.

    CREATE DATABASE testdb;
  2. Install the extension.

    CREATE EXTENSION IF NOT EXISTS vector;
  3. Generate random vectors with a fixed length and use the vectors as the test data.

    CREATE OR REPLACE FUNCTION random_array(dim integer) 
        RETURNS DOUBLE PRECISION[] 
    AS $$ 
        SELECT array_agg(random()) 
        FROM generate_series(1, dim); 
    $$ 
    LANGUAGE SQL 
    VOLATILE 
    COST 1;
  4. Create a table that can store vectors with 1,536 dimensions.

    CREATE TABLE vtest(id BIGINT, v VECTOR(1536));
  5. Insert data into a table.

    INSERT INTO vtest SELECT i, random_array(1536)::VECTOR(1536) FROM generate_series(1, 100000) AS i;
  6. Create an index.

    CREATE INDEX ON vtest USING ivfflat(v vector_cosine_ops) WITH(lists = 100);
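If you prefer to generate test data on the client side instead of inside the database, a hypothetical Python helper (the function name is illustrative) can format a random vector as the text literal form that the VECTOR type accepts:

```python
# Hypothetical client-side counterpart of the random_array SQL function above:
# build a pgvector text literal such as '[0.123456,0.654321,...]'.
import random

def random_vector_literal(dim):
    return "[" + ",".join(f"{random.random():.6f}" for _ in range(dim)) + "]"

literal = random_vector_literal(1536)
print(literal[:32] + "...")
```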

Procedure

To prevent factors such as network latency from affecting the test results, we recommend that you use the internal endpoint of the RDS instance. In this example, the test is performed on an Elastic Compute Service (ECS) instance that resides in the same region and virtual private cloud (VPC) as the RDS instance.

  1. Use a random vector to calculate the similarities against the data in the vtest table and return the most similar records. The LIMIT clause uses FLOOR(RANDOM() * 50), so each run returns a random number of the most similar rows (up to 49).

    You must create an SQL file and write the following content to the file for subsequent stress testing:

    WITH tmp AS (
        SELECT random_array(1536)::VECTOR(1536) AS vec
    )
    SELECT id
    FROM vtest
    ORDER BY v <=> (SELECT vec FROM tmp)
    LIMIT FLOOR(RANDOM() * 50);
  2. Use pgbench to perform stress testing.

    pgbench is a program that runs benchmark tests in PostgreSQL. You must run the following commands in the CLI. Make sure that the PostgreSQL client is installed. In this example, PostgreSQL 15.1 is used. For more information, see PostgreSQL official documentation.

    pgbench -f ./test.sql -c6 -T60 -P5 -U testuser -h pgm-bp****.pg.rds.aliyuncs.com -p 5432 -d testdb

    The following list describes the parameters:

    • -f ./test.sql: the path and name of the test script file. ./test.sql is an example. Modify the value based on your business requirements.

    • -c6: the number of concurrent client connections. In this example, six concurrent clients are used.

    • -T60: the test duration in seconds. In this example, the test runs for 60 seconds.

    • -P5: the progress report interval in seconds. In this example, a progress report is displayed every 5 seconds.

    • -U testuser: the database username. Replace testuser with your database username.

    • -h pgm-bp****.pg.rds.aliyuncs.com: the internal endpoint of the RDS instance.

    • -p 5432: the internal port of the RDS instance.

    • -d testdb: the database to which you want to connect. In this example, testdb is used.

Test results

The following table shows the relationship between the data volume and the storage separately occupied by vectors and indexes, the latency, and the transactions per second (TPS).

| Data volume (unit: 10,000 rows) | Storage used by vectors (unit: MB) | Storage used by indexes (unit: MB) | Latency (unit: milliseconds) | TPS |
|---|---|---|---|---|
| 10 | 796 | 782 | 15.7 | 380 |
| 30 | 2388 | 2345 | 63 | 94 |
| 50 | 3979 | 3907 | 74 | 80 |
| 80 | 6367 | 6251 | 90 | 66 |
| 100 | 7958 | 7813 | 105 | 56 |
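As a sanity check on the measurements above, the measured TPS closely tracks what Little's law predicts for six concurrent pgbench clients (TPS ≈ clients / latency); a quick sketch:

```python
# Sanity check: with 6 concurrent clients, TPS ~= clients / latency.
measurements = [  # (latency in seconds, measured TPS)
    (0.0157, 380),
    (0.063, 94),
    (0.074, 80),
    (0.090, 66),
    (0.105, 56),
]
for latency, tps in measurements:
    predicted = 6 / latency
    print(f"latency={latency * 1000:5.1f} ms  predicted TPS~{predicted:5.1f}  measured={tps}")
```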

Impacts on the search speed and the recall rate when you adjust the value of the probes parameter and retain the value of the lists parameter

Assume that the value of the lists parameter is fixed at 2000 and the table contains 1 million data rows. In this case, a larger value of the probes parameter results in a higher recall rate but a lower TPS.


Impacts on the search speed and the recall rate when you adjust the value of the lists parameter and retain the value of the probes parameter

Assume that the value of the probes parameter is fixed at 20 and the table contains 1 million data rows. In this case, a larger value of the lists parameter results in a lower recall rate but a higher TPS.


Conclusions

  • The value of the lists parameter has only a slight impact on the storage that is occupied by indexes. The occupied storage is determined mainly by the data volume in the table.

  • The values of the lists and probes parameters have opposite impacts on the search speed and the recall rate. Appropriate settings of the two parameters help make a trade-off between the search speed and the recall rate.


    We recommend that you configure the lists and probes parameters based on the number of data rows in the table. The following section provides sample configuration settings.

    • If the number of data rows in the table is less than or equal to 1 million, you can calculate the values of the lists and probes parameters by using the following formulas: lists = Number of data rows/1,000 and probes = Value of the lists parameter/10

    • If the number of data rows in the table is greater than 1 million, you can calculate the values of the lists and probes parameters by using the following formulas: lists = sqrt(Number of data rows) and probes = sqrt(Value of the lists parameter).

      Note

      sqrt indicates the square root function.
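The rules of thumb above can be written down directly; a small Python sketch (the function name is illustrative):

```python
# Rule-of-thumb lists/probes settings based on the number of rows in the table.
import math

def suggest_lists_probes(rows):
    if rows <= 1_000_000:
        lists = rows / 1000
        probes = lists / 10
    else:
        lists = math.sqrt(rows)
        probes = math.sqrt(lists)
    return round(lists), round(probes)

print(suggest_lists_probes(100_000))    # (100, 10): matches the index used in the test above
print(suggest_lists_probes(1_000_000))  # (1000, 100)
print(suggest_lists_probes(4_000_000))  # (2000, 45): sqrt(4e6) = 2000, sqrt(2000) ~ 44.7
```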

Best practices

Build an LLM-driven dedicated chatbot on top of ApsaraDB RDS for PostgreSQL