Test environment
Resource | Specifications |
---|---|
AnalyticDB for PostgreSQL instance | |
Elastic Compute Service (ECS) instance | |
Preparations
- Download the open source data set ANN_GIST1M or ANN_SIFT1B from the following URL: http://corpus-texmex.irisa.fr/. In this example, ANN_GIST1M is used to demonstrate how to perform vector queries and integrated queries. Vector queries are performed on vector data sets, while integrated queries are performed on both structured and vector data.
- Download the test tool from the following link: Test tool.
Create a data table and an index
Execute the following statements to create the table and indexes that are required for the test.
```sql
CREATE SCHEMA IF NOT EXISTS vec;

CREATE TABLE IF NOT EXISTS "vec"."vector_test_for_gist" (
    "id" bigint NOT NULL,
    "shot_time" timestamp,
    "device_id" bigint,
    "feature_data" "float4"[],
    PRIMARY KEY (id)
) DISTRIBUTED BY (id);

-- Store the columns inline (PLAIN) instead of in TOAST storage.
ALTER TABLE vec.vector_test_for_gist ALTER COLUMN shot_time SET STORAGE PLAIN;
ALTER TABLE vec.vector_test_for_gist ALTER COLUMN device_id SET STORAGE PLAIN;
ALTER TABLE vec.vector_test_for_gist ALTER COLUMN feature_data SET STORAGE PLAIN;

-- B-tree indexes on the structured columns used by integrated queries.
CREATE INDEX idx_vector_test_for_gist_device_id ON vec.vector_test_for_gist (device_id);
CREATE INDEX idx_vector_test_for_gist_shot_time ON vec.vector_test_for_gist (shot_time);

-- ANN index on the vector column; see the parameter table below.
CREATE INDEX idx_vector_test_for_gist_feature_data ON vec.vector_test_for_gist USING ann(feature_data) WITH (dim=960,pq_segments=64,hnsw_m=64,external_storage=1);
```
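For reference, the feature_data column is a plain PostgreSQL float4[] array, so each vector is written in the standard array-literal text form when inserted. A minimal sketch of that formatting (the helper name is hypothetical; the test scripts handle this for you):

```python
def to_pg_float_array(vec):
    """Format a list of floats as a PostgreSQL array literal, e.g.
    [0.1, 0.25, 3.0] -> '{0.1,0.25,3.0}', the text form accepted by a
    float4[] column such as feature_data."""
    return "{" + ",".join(repr(float(x)) for x in vec) + "}"

print(to_pg_float_array([0.1, 0.25, 3.0]))  # -> {0.1,0.25,3.0}
```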
Parameters for indexes:
Data dimension | Deletions/updates performed? | Vector index parameters |
---|---|---|
256 | Yes | dim=256,pq_segments=16,hnsw_m=16 |
256 | No | dim=256,pq_segments=16,hnsw_m=16,external_storage=1 |
512 | Yes | dim=512,pq_segments=64,hnsw_m=64 |
512 | No | dim=512,pq_segments=64,hnsw_m=64,external_storage=1 |
960 | Yes | dim=960,pq_segments=64,hnsw_m=64 |
960 | No | dim=960,pq_segments=64,hnsw_m=64,external_storage=1 |
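One sanity check worth applying to these settings (an assumption based on how product quantization generally works, not stated explicitly here): dim should divide evenly by pq_segments, because PQ splits each vector into pq_segments equal sub-vectors. The combinations in the table all pass:

```python
# (dim, pq_segments) combinations from the index-parameter table above
configs = [(256, 16), (512, 64), (960, 64)]

for dim, pq_segments in configs:
    # each PQ sub-vector covers dim // pq_segments components
    assert dim % pq_segments == 0, (dim, pq_segments)
    print(dim, pq_segments, dim // pq_segments)
```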
Prepare test data
Enter the directory that contains the test script and run the following commands:
- Generate table data.
python generateTableData.py -s gist -t fvecs -l 1000000 -i /home/greenplum/data/vector_data/gist/gist_base.fvecs -o /home/greenplum/data/vector_data/gist_data/table_data -n 100
- Import table data. Note: Before you import data, modify the dbconf file to specify the names of the database and table used for the test. The import may take a long time; you can run the task in the background.
python insertTableData.py -n 10 -b 10 --file_path /home/greenplum/data/vector_data/gist_data/table_data --file_num 100
- Generate a vector query data set.
python generateQueryData.py -s gist -t fvecs -l 1000 -i /home/greenplum/data/vector_data/gist/gist_base.fvecs -o /home/greenplum/data/vector_data/gist_data/query_data
- Generate a vector query result set.
python generateGroundTruth.py --limit 1000 --input_file /home/greenplum/data/vector_data/gist_data/query_data/gist_query_data.txt --output_file /home/greenplum/data/vector_data/gist_data/groundtruth/groundtruth.txt --topk 10
- Generate an integrated query data set.
python generateQueryData.py -s gist -t fvecs -l 1000 -i /home/greenplum/data/vector_data/gist/gist_query.fvecs -o /home/greenplum/data/vector_data/gist_data/query_data --fusion
- Generate an integrated query result set.
python generateGroundTruth.py --limit 1000 --input_file /home/greenplum/data/vector_data/gist_data/query_data/gist_fusion_query_data.txt --output_file /home/greenplum/data/vector_data/gist_data/groundtruth/fusion_groundtruth.txt --topk 10 --fusion
Note After the result sets are generated, you must copy the files to the groundtruth directory of the script.
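Conceptually, a ground-truth result set is a brute-force exact-nearest-neighbor answer: for every query vector, all base vectors are ranked by exact L2 distance and the top-k IDs are kept. A minimal sketch of that idea (illustrative only; generateGroundTruth.py is the authoritative implementation):

```python
def l2(a, b):
    """Exact squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ground_truth_topk(query, base_vectors, topk):
    """Return the IDs (indices) of the topk base vectors closest to query,
    ranked by exact L2 distance -- the reference answer used to score recall."""
    ranked = sorted(range(len(base_vectors)), key=lambda i: l2(query, base_vectors[i]))
    return ranked[:topk]

base = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(ground_truth_topk([0.9, 0.9], base, topk=2))  # -> [1, 0]
```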
Vector query test
- Calibrate recall parameters
- To ensure that the recall rate of the query meets the 95% target set for this test, recall tests are first performed to calibrate the parameters required for the vector query, and the performance test is conducted afterward. Note: Adjust the following parameters in the parameterConf file to tune the recall rate: fastann.hnsw_max_scan_points, fastann.hnsw_ef_search, and fastann.pq_amp. Of these, fastann.hnsw_max_scan_points has the largest impact. Run the following command to perform the recall test for vector queries:
python recallTest.py -n 10 --limit 1000 --input_file /home/greenplum/data/vector_data/gist_data/query_data/gist_query_data.txt --topk 10
- Performance tests
- Adjust the number of concurrent processes in the QPS test to stress test the database. Run the following command to perform the QPS test for vector queries:
python qpsTest.py -n 1 --input_file /home/greenplum/data/vector_data/gist_data/query_data/gist_query_data.txt --topk 10
- Test the response time of a single query. No concurrency is required. Run the following command to perform the response time test for vector queries:
python rtTest.py -n 1 --limit 1000 --input_file /home/greenplum/data/vector_data/gist_data/query_data/gist_query_data.txt --topk 10
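Calibration compares the IDs returned by the ANN index against the ground-truth result set: the recall rate of one query is the fraction of ground-truth top-k IDs the index recovers, and the 95% target applies to the average over all queries. A minimal sketch of that metric (recallTest.py may compute it differently):

```python
def recall_at_k(ann_ids, truth_ids):
    """Fraction of ground-truth IDs recovered by the ANN query."""
    return len(set(ann_ids) & set(truth_ids)) / len(truth_ids)

def average_recall(results):
    """results: list of (ann_ids, truth_ids) pairs, one per query."""
    return sum(recall_at_k(a, t) for a, t in results) / len(results)

# two queries: one perfect, one recovering 9 of 10 ground-truth IDs
queries = [
    (list(range(10)), list(range(10))),
    (list(range(1, 11)), list(range(10))),
]
print(average_recall(queries))  # -> 0.95
```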
Integrated query test
- Calibrate recall parameters
- To ensure that the recall rate of the query meets the 95% target set for this test, recall tests are performed to calibrate the parameters required for the integrated query. Then, the performance test is conducted. Note
- You must run the analyze vec.vector_test_for_gist command before you perform the recall test.
- Adjust the following parameters in the parameterConf file to tune the recall rate: fastann.hnsw_max_scan_points, fastann.hnsw_ef_search, and fastann.pq_amp. fastann.hnsw_max_scan_points is the most important parameter.
Run the following command to perform the recall test for integrated queries:
python recallTest.py -n 10 --limit 1000 --input_file /home/greenplum/data/vector_data/gist_data/query_data/gist_fusion_query_data.txt --topk 10 --fusion
- Performance tests
- Adjust the number of concurrent processes in the QPS test to stress test the database. Run the following command to perform the QPS test for integrated queries:
python qpsTest.py -n 1 --input_file /home/greenplum/data/vector_data/gist_data/query_data/gist_fusion_query_data.txt --topk 10 --fusion
- Test the response time of a single query. No concurrency is required. Run the following command to perform the response time test for integrated queries:
python rtTest.py -n 1 --limit 1000 --input_file /home/greenplum/data/vector_data/gist_data/query_data/gist_fusion_query_data.txt --topk 10 --fusion
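For context, the two performance metrics reduce to simple arithmetic over per-query timings: QPS is queries completed divided by wall-clock time across all concurrent workers, and response time is the mean per-query latency. A small self-contained harness in that spirit (run_query is a placeholder for the real client call; qpsTest.py and rtTest.py are the authoritative tools):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_qps_test(run_query, queries, concurrency):
    """Issue all queries from `concurrency` worker threads and return
    (qps, avg_response_time) over the run."""
    latencies = []  # list.append is thread-safe in CPython

    def timed(q):
        t0 = time.perf_counter()
        run_query(q)
        latencies.append(time.perf_counter() - t0)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed, queries))
    wall = time.perf_counter() - start
    return len(latencies) / wall, sum(latencies) / len(latencies)

# dummy "query" that sleeps 1 ms in place of a real database call
qps, avg_rt = run_qps_test(lambda q: time.sleep(0.001), range(100), concurrency=10)
```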
Test report: vector queries
- GIST dataset

Item | Description |
---|---|
Dataset | 1 million entries of 960 dimensions |
Configured top-k | 10 |
Configured recall rate | 95% |
Parameter configurations | fastann.hnsw_max_scan_points=140, fastann.hnsw_ef_search=400, fastann.pq_amp=10 |
Average response time (seconds) | 0.005 |
QPS (zero concurrency) | 190 |
Largest QPS (30 concurrent requests) | 3500 |

- SIFT dataset

Item | Description |
---|---|
Dataset | 100 million entries of 128 dimensions |
Configured top-k | 10 |
Configured recall rate | 95% |
Parameter configurations | fastann.hnsw_max_scan_points=180, fastann.hnsw_ef_search=400, fastann.pq_amp=1 |
Average response time (seconds) | 0.002 |
QPS (zero concurrency) | 415 |
Largest QPS (30 concurrent requests) | 5015 |

- An image dataset

Item | Description |
---|---|
Dataset | 10 million entries of 512 dimensions |
Configured top-k | 10 |
Configured recall rate | 95% |
Parameter configurations | fastann.hnsw_max_scan_points=30000, fastann.hnsw_ef_search=400, fastann.pq_amp=10 |
Average response time (seconds) | 0.015 |
QPS (zero concurrency) | 65 |
Largest QPS (30 concurrent requests) | 705 |
Test methods for third-party datasets
The test procedure and test tool described in this topic can also be used with third-party datasets. You only need to convert the dataset into the format required by the test tool.