This topic describes how to use the pgvector extension to test the performance of an ApsaraDB RDS for PostgreSQL instance based on Hierarchical Navigable Small World (HNSW) indexes. The ANN-Benchmarks tool is used to evaluate key metrics, such as the recall rate, queries per second (QPS), and index creation time.
Test environment
An RDS instance and an Elastic Compute Service (ECS) instance must reside in the same virtual private cloud (VPC) and belong to the same vSwitch to prevent errors caused by network fluctuations.
| Test instance and test tool | Description |
| --- | --- |
| RDS instance specifications | |
| ECS instance specifications | |
| Test tool | ANN-Benchmarks. Important: By default, the ANN-Benchmarks tool tests single-threaded performance. If you want to test concurrency performance, follow the instructions provided in Use the pgvector extension to test performance based on IVF indexes. |
Preparations
RDS instance
Create a privileged account named ann_testuser and a database named ann_testdb. For more information, see Create a database and an account.
Install the pgvector extension in the ann_testdb database. The pgvector extension is named vector in the system. For more information, see Manage extensions.
ECS instance
Install Docker. For more information, see Install Docker.
Run the following commands to download the ANN-Benchmarks tool:
cd ~
git clone https://github.com/erikbern/ann-benchmarks.git
Run the following commands to create and activate a virtual environment named ann_test by using Conda and install Python 3.10.6:
yum install git
yum install conda
conda create -n ann_test python=3.10.6
conda init bash
source /usr/etc/profile.d/conda.sh
conda activate ann_test
Run the following commands to install the dependencies of the ANN-Benchmarks tool:
cd ~/ann-benchmarks/
pip install -r requirements.txt
Test procedure
All steps in the test procedure are performed in the ann_test virtual environment. If you are logged out due to an error such as a timeout, run the conda activate ann_test command to activate the environment again.
Step 1: Configure connection information for the test tool
Append the following connection settings to the ~/ann-benchmarks/ann_benchmarks/algorithms/pgvector/module.py file of the test tool and configure the settings based on your business requirements:
# Configure the parameters to connect to the RDS instance.
os.environ['ANN_BENCHMARKS_PG_USER'] = 'ann_testuser' # Specifies the username of the account that is used to connect to the RDS instance.
os.environ['ANN_BENCHMARKS_PG_PASSWORD'] = 'testPassword' # Specifies the password of the account that is used to connect to the RDS instance.
os.environ['ANN_BENCHMARKS_PG_DBNAME'] = 'ann_testdb' # Specifies the name of the required database on the RDS instance.
os.environ['ANN_BENCHMARKS_PG_HOST'] = 'pgm-****.pg.rds.aliyuncs.com' # Specifies the internal endpoint of the RDS instance.
os.environ['ANN_BENCHMARKS_PG_PORT'] = '5432' # Specifies the internal port of the RDS instance.
os.environ['ANN_BENCHMARKS_PG_START_SERVICE'] = 'false' # Disables automatic startup.
Step 2: Configure test parameters for the ANN-Benchmarks tool
Modify the ~/ann-benchmarks/ann_benchmarks/algorithms/pgvector/config.yml
file of the test tool based on your business requirements. Examples:
float:
  any:
    - base_args: ['@metric']
      constructor: PGVector
      disabled: false
      docker_tag: ann-benchmarks-pgvector
      module: ann_benchmarks.algorithms.pgvector
      name: pgvector
      run_groups:
        M-16(100):
          arg_groups: [{M: 16, efConstruction: 100}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 800]]
        M-16(200):
          arg_groups: [{M: 16, efConstruction: 200}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 800]]
        M-24(200):
          arg_groups: [{M: 24, efConstruction: 200}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 800]]
This test is performed on the following groups: M-16(100), M-16(200), and M-24(200). For each group, arg_groups specifies the parameters that are used to create the HNSW index, and query_args specifies the parameters that are used to query the index.
| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| arg_groups | M | Corresponds to the m parameter of the HNSW index, which specifies the maximum number of neighbors of each node at each layer when the index is created. A larger value indicates a denser graph. In most cases, a denser graph increases the recall rate but extends the time required to create and query the index. |
| | efConstruction | Corresponds to the ef_construction parameter of the HNSW index, which specifies the size of the candidate set when the index is created. The candidate set contains the candidate nodes that are retained to select optimal connections. In most cases, a larger value increases the recall rate but extends the time required to create and query the index. |
| query_args | ef_search | Specifies the size of the candidate set that is maintained during a query. In most cases, a larger value increases the recall rate but extends the time required to query the index. |
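For reference, these parameters correspond to the HNSW index options that pgvector accepts in SQL. The following sketch only assembles the statements as strings and prints them; the table name items and column name embedding are hypothetical.

```python
# Sketch of the SQL that the HNSW parameters above map to in pgvector.
# The table name (items) and column name (embedding) are hypothetical.
def hnsw_index_sql(m: int, ef_construction: int) -> str:
    return (
        "CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

def ef_search_sql(ef_search: int) -> str:
    # hnsw.ef_search is set per session at query time.
    return f"SET hnsw.ef_search = {ef_search};"

for m, ef_c in [(16, 100), (16, 200), (24, 200)]:
    print(hnsw_index_sql(m, ef_c))
print(ef_search_sql(40))
```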
Step 3: Create a test Docker image
Optional. Modify the ~/ann-benchmarks/ann_benchmarks/algorithms/pgvector/Dockerfile file of the test tool based on the following content to skip the community edition of PostgreSQL that is automatically deployed for the tool:

FROM ann-benchmarks
USER root
RUN pip install psycopg[binary] pgvector
Specify --algorithm pgvector and run the install.py script to create a test Docker image.

cd ~/ann-benchmarks/
python install.py --algorithm pgvector
Note: You can run the python install.py --help command to view the supported configuration parameters.
Step 4: Obtain a dataset
When you run a test script, the specified public test dataset is automatically downloaded.
In this topic, the nytimes-256-angular dataset that uses the Angular distance formula is used for similarity retrieval. For more information, see ann-benchmarks.
Dataset | Dimension | Number of data rows | Number of test vectors | Top N nearest neighbors | Distance formula |
NYTimes | 256 | 290,000 | 10,000 | 100 | Angular |
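As a reference for how the Angular distance used by this dataset is computed, the following minimal sketch uses 1 minus the cosine similarity, which is the same formula as in the custom-dataset script in Appendix 2; the input vectors are made up.

```python
import numpy as np

def angular_distance(a, b):
    # 1 - cosine similarity: 0 for identical directions, 1 for orthogonal vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(angular_distance([1.0, 0.0], [2.0, 0.0]))  # same direction -> 0.0
print(angular_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 1.0
```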
If the public dataset does not meet the test requirements, we recommend that you convert your vector data to a standard format, such as Hierarchical Data Format version 5 (HDF5), to test retrieval performance. For more information, see Appendix 2: Custom test datasets.
Step 5: Perform the test and obtain the results
Run the following commands to execute the test script:
cd ~/ann-benchmarks
nohup python run.py --dataset nytimes-256-angular -k 10 --algorithm pgvector --runs 1 > ann_benchmark_test.log 2>&1 &
tail -f ann_benchmark_test.log
| Parameter | Description |
| --- | --- |
| --dataset | The dataset that you want to test. |
| -k | The LIMIT value of the SQL statement, which specifies the number of records to return. |
| --algorithm | The algorithm for the vector database that you want to test. In this example, the parameter is set to pgvector. |
| --runs | The number of test runs from which the optimal result set is selected. |
| --parallelism | The degree of parallelism. Default value: 1. |
Run the following commands to obtain the test results:

cd ~/ann-benchmarks
python plot.py --dataset nytimes-256-angular --recompute
Note: The following table describes the test results.

| m | ef_construction | ef_search | Recall rate | QPS |
| --- | --- | --- | --- | --- |
| 16 | 100 | 10 | 0.630 | 1423.985 |
| 16 | 100 | 20 | 0.741 | 1131.941 |
| 16 | 100 | 40 | 0.820 | 836.017 |
| 16 | 100 | 80 | 0.871 | 574.733 |
| 16 | 100 | 120 | 0.894 | 440.076 |
| 16 | 100 | 200 | 0.918 | 297.267 |
| 16 | 100 | 400 | 0.945 | 162.759 |
| 16 | 100 | 800 | 0.969 | 84.268 |
| 16 | 200 | 10 | 0.683 | 1299.667 |
| 16 | 200 | 20 | 0.781 | 1094.968 |
| 16 | 200 | 40 | 0.849 | 790.838 |
| 16 | 200 | 80 | 0.895 | 533.826 |
| 16 | 200 | 120 | 0.914 | 405.975 |
| 16 | 200 | 200 | 0.933 | 272.591 |
| 16 | 200 | 400 | 0.956 | 148.688 |
| 16 | 200 | 800 | 0.977 | 76.555 |
| 24 | 200 | 10 | 0.767 | 1182.747 |
| 24 | 200 | 20 | 0.840 | 922.770 |
| 24 | 200 | 40 | 0.887 | 639.899 |
| 24 | 200 | 80 | 0.920 | 411.140 |
| 24 | 200 | 120 | 0.936 | 303.323 |
| 24 | 200 | 200 | 0.953 | 199.752 |
| 24 | 200 | 400 | 0.973 | 105.506 |
| 24 | 200 | 800 | 0.988 | 53.904 |
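The recall rate in the results is the overlap between the neighbors returned by the index and the exact top-k neighbors. The following is a minimal sketch of recall@k, assuming hypothetical lists of ground-truth and retrieved neighbor IDs:

```python
def recall_at_k(true_neighbors, retrieved_neighbors, k):
    # Fraction of the exact top-k neighbors that the index actually returned.
    hits = 0
    for truth, retrieved in zip(true_neighbors, retrieved_neighbors):
        hits += len(set(truth[:k]) & set(retrieved[:k]))
    return hits / (k * len(true_neighbors))

# Two queries, k = 2: the first result is perfect, the second misses one neighbor.
truth = [[5, 9], [1, 3]]
retrieved = [[9, 5], [1, 7]]
print(recall_at_k(truth, retrieved, 2))  # -> 0.75
```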
Optional. Run the following commands to obtain test result details, including the recall rate, QPS, response time (RT), index creation time, and index size:

cd ~/ann-benchmarks
python data_export.py --out res.csv
For example, you can query the relationship among m, ef_construction, and the index creation time.

| m | ef_construction | build (index creation time in seconds) |
| --- | --- | --- |
| 16 | 100 | 33.35161 |
| 16 | 200 | 57.66014 |
| 24 | 200 | 87.22608 |
Test conclusions
When you create an HNSW index, the following conclusions can be drawn:
- If you increase the values of the m, ef_construction, and ef_search parameters, the recall rate increases but the QPS decreases.
- If you increase the values of the m and ef_construction parameters, the recall rate increases, the QPS decreases, and the index creation time is extended.
- If you have high requirements for the recall rate, we recommend that you do not use the default values of the index parameters. The default values of the m, ef_construction, and ef_search parameters are 16, 64, and 40, respectively.
Appendix 1: Impacts of the parameter configurations of an RDS instance and vectors on index creation
You can configure different parameter values for the ANN-Benchmarks tool, perform the test, and then analyze the test results to determine the impacts of the parameter configurations of an RDS instance and vectors on index creation.
Impacts of the parameter configurations of an RDS instance on index creation
If you set the m parameter to 16 and the efConstruction parameter to 64 for the ANN-Benchmarks tool, the parameter configurations of the RDS instance have the following impacts on index creation:
maintenance_work_mem
This parameter specifies the maximum amount of memory that can be used for maintenance operations, such as VACUUM and CREATE INDEX. Unit: KB. If the value of the maintenance_work_mem parameter is less than the size of the test data, you can increase the value to shorten the time required to create an index. If the value of the maintenance_work_mem parameter exceeds the size of the test data, the index creation time does not decrease.
For example, your RDS instance uses the pg.x8.2xlarge.2c instance type that provides 16 cores and 128 GB of memory, you set the max_parallel_maintenance_workers parameter to the default value 8, and the nytimes-256-angular dataset contains approximately 324 MB of data. In this case, the configuration of the maintenance_work_mem parameter has the following impacts on index creation.
| maintenance_work_mem | Index creation time (seconds) |
| --- | --- |
| 64 MB (65,536 KB) | 52.82 |
| 128 MB (131,072 KB) | 46.79 |
| 256 MB (262,144 KB) | 36.40 |
| 512 MB (524,288 KB) | 18.90 |
| 1 GB (1,048,576 KB) | 19.06 |
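The plateau between 512 MB and 1 GB is consistent with the raw size of the dataset: once maintenance_work_mem exceeds the data size, further increases stop helping. The following back-of-the-envelope estimate assumes 4-byte floats and ignores per-row and index overhead, which is why it is somewhat below the roughly 324 MB cited above:

```python
# Raw vector payload of nytimes-256-angular: rows x dimensions x 4-byte floats.
rows, dims, bytes_per_float = 290_000, 256, 4
raw_bytes = rows * dims * bytes_per_float
print(f"{raw_bytes / 1024 ** 2:.0f} MB")  # -> 283 MB of raw vector data
```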
max_parallel_maintenance_workers
This parameter specifies the maximum number of parallel workers that can be started by a single CREATE INDEX operation. The index creation time decreases as the value of the max_parallel_maintenance_workers parameter increases.
For example, your RDS instance uses the pg.x8.2xlarge.2c instance type that provides 16 cores and 128 GB of memory, you set the maintenance_work_mem parameter to the default value 2048 (equivalent to 2 GB), and the nytimes-256-angular dataset contains approximately 324 MB of data. In this case, the configuration of the max_parallel_maintenance_workers parameter has the following impacts on index creation.
| max_parallel_maintenance_workers | Index creation time (seconds) |
| --- | --- |
| 1 | 76.00 |
| 2 | 51.34 |
| 4 | 32.49 |
| 8 | 19.66 |
| 12 | 14.44 |
| 16 | 13.07 |
| 24 | 13.15 |
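Derived from the table above, the speedup relative to a single worker is sub-linear and levels off once the worker count reaches the number of cores (16 on this instance type):

```python
# Index creation times (seconds) taken from the table above.
times = {1: 76.00, 2: 51.34, 4: 32.49, 8: 19.66, 12: 14.44, 16: 13.07, 24: 13.15}
for workers, seconds in times.items():
    print(f"{workers:>2} workers: {times[1] / seconds:.2f}x speedup")
```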
Impacts of the parameter configurations of vectors on index creation
If you set the maintenance_work_mem parameter to 8 GB (equivalent to 8,388,608 KB) and the max_parallel_maintenance_workers parameter to 16 for the RDS instance, the parameter configurations of vectors have the following impacts on index creation.
Vector dimension
The GloVe dataset that contains 1,183,514 rows of data is used. If you set the m parameter to 16, the efConstruction parameter to 64, and the ef_search parameter to 40 for the ANN-Benchmarks tool, the following results are returned. The test results show that the index creation time increases, the recall rate and QPS decrease, and the query latency increases as the vector dimension increases.
| Dimension | Index creation time (seconds) | Recall rate | QPS | P99 (ms) |
| --- | --- | --- | --- | --- |
| 25 | 195.10 | 0.99985 | 192.94 | 7.84 |
| 50 | 236.92 | 0.99647 | 152.36 | 9.69 |
| 100 | 319.36 | 0.97231 | 126.89 | 11.14 |
| 200 | 529.33 | 0.93186 | 95.05 | 15.11 |
Note: P99 is the 99th percentile of latency after the RTs of all query requests are sorted in ascending order. It indicates that 99% of query requests have RTs lower than this value.
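The P99 definition above can be reproduced with numpy's percentile function; the latency values below are hypothetical:

```python
import numpy as np

# Hypothetical per-query response times in milliseconds.
latencies_ms = np.arange(1, 101, dtype=float)  # 1 ms .. 100 ms
p99 = np.percentile(latencies_ms, 99)
print(p99)  # 99% of requests completed faster than this value
```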
Rows of vectors
The dbpedia-openai-{n}k-angular dataset is used, and the number of rows of vectors that is specified by n ranges from 100 to 1,000. If you set the m parameter to 48, the efConstruction parameter to 256, and the ef_search parameter to 200 for the ANN-Benchmarks tool, the following results are returned. The test results show that the index creation time increases nonlinearly, the recall rate and QPS decrease, and the query latency increases as the number of rows of vectors increases.
| n | Number of rows (×10,000) | Index creation time (seconds) | Recall rate | QPS | P99 (ms) |
| --- | --- | --- | --- | --- | --- |
| 100 | 10 | 54.05 | 0.9993 | 171.74 | 8.93 |
| 200 | 20 | 137.23 | 0.99901 | 146.78 | 10.81 |
| 500 | 50 | 436.68 | 0.999 | 118.55 | 13.94 |
| 1000 | 100 | 957.26 | 0.99879 | 101.60 | 16.35 |
Appendix 2: Custom test datasets
The following dataset is generated based on the format of the nytimes-256-angular dataset and is provided only for reference.
Run the following commands to create a custom dataset:
Note: In this example, the rds_ai extension is installed. For more information, see Use the AI capabilities provided by the rds_ai extension.
import h5py
import numpy as np
import psycopg2
import pgvector.psycopg2

# Assume the following connection information.
conn_info = {
    'host': 'pgm-****.rds.aliyuncs.com',
    'user': 'ann_testuser',
    'password': '****',
    'port': '5432',
    'dbname': 'ann_testdb'
}
embedding_len = 1024
distance_top_n = 100
query_batch_size = 100

try:
    # Connect to the RDS instance.
    with psycopg2.connect(**conn_info) as connection:
        pgvector.psycopg2.register_vector(connection)
        with connection.cursor() as cur:
            # Obtain the vector data.
            cur.execute("select count(1) from test_rag")
            count = cur.fetchone()[0]
            train_embeddings = []
            for start in range(0, count, query_batch_size):
                query = f"SELECT embedding FROM test_rag ORDER BY id OFFSET {start} LIMIT {query_batch_size}"
                cur.execute(query)
                res = [embedding[0] for embedding in cur.fetchall()]
                train_embeddings.extend(res)
            train = np.array(train_embeddings)

            # Obtain the query data and calculate the embeddings.
            with open('query.txt', 'r', encoding='utf-8') as file:
                queries = [query.strip() for query in file]
            test = []
            # Install the rds_ai extension or use the Alibaba Cloud Model Studio SDK.
            for query in queries:
                cur.execute(f"SELECT rds_ai.embed('{query.strip()}')::vector(1024)")
                test.extend([cur.fetchone()[0]])
            test = np.array(test)

    # Calculate the distances between the top N nearest neighbors and each query vector.
    dot_product = np.dot(test, train.T)
    norm_test = np.linalg.norm(test, axis=1, keepdims=True)
    norm_train = np.linalg.norm(train, axis=1, keepdims=True)
    similarity = dot_product / (norm_test * norm_train.T)
    distance_matrix = 1 - similarity
    neighbors = np.argsort(distance_matrix, axis=1)[:, :distance_top_n]
    distances = np.take_along_axis(distance_matrix, neighbors, axis=1)

    # Write the dataset in the HDF5 layout that ANN-Benchmarks expects.
    with h5py.File('custom_dataset.hdf5', 'w') as f:
        f.create_dataset('distances', data=distances)
        f.create_dataset('neighbors', data=neighbors)
        f.create_dataset('test', data=test)
        f.create_dataset('train', data=train)
        f.attrs.update({
            "type": "dense",
            "distance": "angular",
            "dimension": embedding_len,
            "point_type": "float"
        })
    print("The HDF5 file is created and the dataset is added.")
except (Exception, psycopg2.DatabaseError) as error:
    print(f"Error: {error}")
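The neighbor computation in the script above can be sanity-checked in isolation with a tiny numpy example; the 2-dimensional vectors are made up:

```python
import numpy as np

# Three 2-D "train" vectors and one "test" vector (made-up values).
train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
test = np.array([[1.0, 0.1]])

# Same formula as the script: angular distance = 1 - cosine similarity.
similarity = (test @ train.T) / (
    np.linalg.norm(test, axis=1, keepdims=True)
    * np.linalg.norm(train, axis=1, keepdims=True).T
)
distance_matrix = 1 - similarity
neighbors = np.argsort(distance_matrix, axis=1)[:, :2]
print(neighbors)  # indices of the 2 nearest train vectors, nearest first
```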
In the DATASETS section of the ~/ann-benchmarks/ann_benchmarks/datasets.py file, add the custom dataset:

DATASETS: Dict[str, Callable[[str], None]] = {
    ......,
    "<custom_dataset>": None,
}
Upload the custom dataset to the ~/ann-benchmarks directory and use the dataset to run the run.py test script.