TairVector is a Tair-based in-house data structure that provides high-performance real-time storage and retrieval for vectors. This topic describes the method used to test the performance of TairVector. It also provides the test results obtained by Alibaba Cloud.
TairVector supports the approximate nearest neighbor (ANN) search algorithm. You can use TairVector for semantic retrieval of unstructured data and personalized recommendation engines. For more information, see TairVector.
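To make the workflow concrete, the following sketch assembles the argument lists of the core TairVector commands (the command names TVS.CREATEINDEX, TVS.HSET, and TVS.KNNSEARCH come from the TairVector documentation; the helper functions and the vector serialization format shown here are illustrative assumptions, not part of any client library):

```python
# Illustrative helpers that assemble TairVector command arguments.
# The command names follow the TairVector documentation; the helper
# functions and the "[x,y,z]" vector serialization are assumptions
# made for this sketch only.

def _serialize(vector):
    # Assumed vector wire format; check the TairVector docs for the exact one.
    return "[" + ",".join(f"{v:g}" for v in vector) + "]"

def create_index_cmd(index, dims, algorithm="HNSW", distance="L2",
                     m=24, ef_construct=500):
    """Build the arguments for TVS.CREATEINDEX."""
    return ["TVS.CREATEINDEX", index, str(dims), algorithm, distance,
            "M", str(m), "ef_construct", str(ef_construct)]

def hset_cmd(index, key, vector):
    """Build the arguments for TVS.HSET, which writes one vector."""
    return ["TVS.HSET", index, key, "VECTOR", _serialize(vector)]

def knnsearch_cmd(index, top_k, vector):
    """Build the arguments for TVS.KNNSEARCH (top_k nearest neighbors)."""
    return ["TVS.KNNSEARCH", index, str(top_k), _serialize(vector)]

print(create_index_cmd("demo", 128))
print(hset_cmd("demo", "doc:1", [0.1, 0.2, 0.3]))
print(knnsearch_cmd("demo", 10, [0.1, 0.2, 0.3]))
```

In a real client such as tair-py, these commands are wrapped by library methods; the sketch only shows the shape of the calls that the benchmark below exercises.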
Test description
Test environment
- A Tair DRAM-based instance that is compatible with Redis 6.0 and has a storage capacity of 16 GB.
- An Elastic Compute Service (ECS) instance that is deployed in the same virtual private cloud (VPC) as the Tair instance and is connected to the Tair instance over the VPC.
- A Linux operating system.
- A Python (version 3.7 or later) development environment.
Test data
The Sift-128-euclidean, Gist-960-euclidean, Glove-200-angular, and Deep-image-96-angular datasets are used to test the Hierarchical Navigable Small World (HNSW) indexing algorithm. The Random-s-100-euclidean and Mnist-784-euclidean datasets are used to test the Flat Search indexing algorithm.
Dataset | Description | Vector dimension | Number of vectors | Number of queries | Data volume | Distance formula |
---|---|---|---|---|---|---|
Sift-128-euclidean | Image feature vectors that are generated by using the Texmex dataset and the scale-invariant feature transform (SIFT) algorithm. | 128 | 1,000,000 | 10,000 | 488 MB | L2 |
Gist-960-euclidean | Image feature vectors that are generated by using the Texmex dataset and the GIST (global image descriptor) algorithm. | 960 | 1,000,000 | 1,000 | 3.57 GB | L2 |
Glove-200-angular | Word vectors that are generated by applying the GloVe algorithm to text data from the Internet. | 200 | 1,183,514 | 10,000 | 902 MB | IP |
Deep-image-96-angular | Vectors that are extracted from the output layer of the GoogLeNet neural network with the ImageNet training dataset. | 96 | 9,990,000 | 10,000 | 3.57 GB | IP |
Random-s-100-euclidean | A dataset that is randomly generated by using test tools. No download URLs are available. | 100 | 90,000 | 10,000 | 34 MB | L2 |
Mnist-784-euclidean | A dataset from the MNIST database. | 784 | 60,000 | 10,000 | 179 MB | L2 |
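The two distance formulas in the table can be sketched in a few lines of plain Python: L2 (Euclidean) distance for the *-euclidean datasets, and inner product (IP) for the *-angular datasets. The function names are illustrative:

```python
import math

def l2_distance(a, b):
    # Euclidean (L2) distance, used by the *-euclidean datasets;
    # smaller values mean closer vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ip_distance(a, b):
    # Inner-product (IP) similarity, used by the *-angular datasets;
    # larger values mean closer vectors.
    return sum(x * y for x, y in zip(a, b))

print(l2_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0 (a 3-4-5 triangle)
print(ip_distance([1.0, 2.0], [3.0, 4.0]))  # 11.0
```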
Test tools and methods
- Install tair-py on the test server. Use one of the following methods to perform the installation:
  - Use pip:
    ```shell
    pip install tair
    ```
  - Install from source:
    ```shell
    git clone https://github.com/alibaba/tair-py.git
    cd tair-py
    python setup.py install
    ```
- Download Ann-benchmarks and run the following command to decompress it:
  ```shell
  tar -zxvf ann-benchmarks.tar.gz
  ```
- Configure the endpoint, port number, username, and password of the Tair instance in the algos.yaml file. Open the algos.yaml file, search for tairvector to find the corresponding configuration items, and then configure the following parameters of base-args:
  - host: the endpoint used to connect to the Tair instance.
  - port: the port number used to connect to the Tair instance.
  - password: the username and password of the Tair instance in the user:password format. For more information, see Connect to a Tair instance.
  - parallelism: the number of concurrent threads. Default value: 4.

  Example:
  ```json
  {"host": "r-****0d7f.redis.zhangbei.rds.aliyuncs.com", "port": "6379", "password": "testaccount:Rp829dlwa", "parallelism": 4}
  ```
- Run the run.py script to start the test.

  Important: After you run the run.py script, the entire test runs end to end: it creates an index, writes data to the index, and then queries the index and records the results. Do not repeatedly run the script on a single test dataset.

  Example:
  ```shell
  # Run a single-threaded test by using the Sift-128-euclidean dataset and the HNSW indexing algorithm.
  python run.py --local --runs 3 --algorithm tairvector --dataset sift-128-euclidean
  # Run a multi-threaded test by using the Sift-128-euclidean dataset and the HNSW indexing algorithm.
  python run.py --local --runs 3 --algorithm tairvector --dataset sift-128-euclidean --batch
  # Run a single-threaded test by using the Mnist-784-euclidean dataset and the Flat Search indexing algorithm.
  python run.py --local --runs 3 --algorithm tairvector-flat --dataset mnist-784-euclidean
  # Run a multi-threaded test by using the Mnist-784-euclidean dataset and the Flat Search indexing algorithm.
  python run.py --local --runs 3 --algorithm tairvector-flat --dataset mnist-784-euclidean --batch
  ```
- Run the data_export.py script to export the results.

  Example:
  ```shell
  # Single thread.
  python data_export.py --output out.csv
  # Multiple threads.
  python data_export.py --output out.csv --batch
  ```
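The recall figures reported below are computed in the usual ann-benchmarks style: for each query, the fraction of the true top-k nearest neighbors that appear in the returned top-k results. A minimal sketch of this metric (the function name is illustrative, not from the benchmark's source):

```python
def recall_at_k(ground_truth, retrieved, k=10):
    """Fraction of the true top-k neighbors found in the retrieved top-k.

    ground_truth and retrieved are lists of neighbor IDs, closest first.
    """
    return len(set(ground_truth[:k]) & set(retrieved[:k])) / k

# The ANN search found 8 of the 10 true neighbors: recall@10 = 0.8.
truth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
result = [1, 2, 3, 4, 5, 6, 7, 8, 11, 12]
print(recall_at_k(truth, result))  # 0.8
```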
Test results
HNSW indexes
- Write performance

  In this test, eight threads concurrently write data to the Tair instance, and the write throughput is recorded.

  The following figures show the write performance of the HNSW indexing algorithm at different values of the M parameter when ef_construct is set to 500. The M parameter specifies the maximum number of outgoing neighbors on each layer of the graph index structure. The write throughput of the HNSW indexing algorithm decreases as the value of the M parameter increases.

  Figure 1. Sift-128-euclidean
  Figure 2. Gist-960-euclidean
  Figure 3. Glove-200-angular
  Figure 4. Deep-image-96-angular
- ANN query performance
  In this test, the latency of single-threaded ANN queries and the throughput of multi-threaded concurrent queries are recorded.
  - Latency of single-threaded ANN queries

    The following figures show the latency of single-threaded ANN queries in TairVector when ef_construct is set to 500 and M and ef_search are set to different values. The latency of single-threaded ANN queries increases in proportion to the value of ef_search, which specifies the amount of data to be traversed. The higher the latency, the higher the recall rate. For more information, see the "Recall rate" section of this topic.

    Figure 5. Sift-128-euclidean
    Figure 6. Gist-960-euclidean
    Figure 7. Glove-200-angular
    Figure 8. Deep-image-96-angular
  - Throughput of multi-threaded queries
    The following figures show the throughput of single-threaded queries and four-thread concurrent queries when ef_construct is set to 500, M is set to 24, and ef_search is set to different values.

    Figure 9. Sift-128-euclidean
    Figure 10. Gist-960-euclidean
    Figure 11. Glove-200-angular
    Figure 12. Deep-image-96-angular
- Recall rate

  The recall rate of queries that use HNSW indexes is closely related to the parameter settings. The following figures show the top 10 recall rates at different values of M and ef_search for different datasets. The query latency increases in proportion to the values of M and ef_search. For more information, see the "ANN query performance" section of this topic.

  Note: You can modify the relevant parameters based on your business needs to balance query performance with the recall rate.

  Figure 13. Sift-128-euclidean
  Figure 14. Gist-960-euclidean
  Figure 15. Glove-200-angular
  Figure 16. Deep-image-96-angular
- Memory efficiency

  The memory efficiency indicates the memory amplification factor, which is the ratio of index memory usage to the original size of the vectors. The following figure shows the memory amplification factor of TairVector at different values of the M parameter.
  The memory usage of an HNSW index grows in proportion to the value of the M parameter, whereas the memory usage of the vectors themselves grows in proportion to the vector dimension. As the memory occupied by the vectors increases, the index overhead accounts for a smaller share of the total memory, and the memory amplification factor decreases.
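As a sanity check on these numbers, the raw size of a float32 dataset is count × dimension × 4 bytes, and the amplification factor divides the measured index memory by this raw size. A sketch using the Sift-128-euclidean figures from the table above (the measured index memory passed in at the end is a hypothetical placeholder, not a benchmark result):

```python
def raw_vector_mb(count, dims, bytes_per_float=4):
    # Raw size of the float32 vectors, in MiB.
    return count * dims * bytes_per_float / (1024 * 1024)

def amplification_factor(index_mb, raw_mb):
    # Ratio of index memory usage to the raw vector size.
    return index_mb / raw_mb

# Sift-128-euclidean: 1,000,000 vectors x 128 dimensions x 4 bytes.
sift_raw = raw_vector_mb(1_000_000, 128)
print(round(sift_raw, 1))  # 488.3, matching the 488 MB in the dataset table

# Hypothetical measured index memory, for illustration only:
print(round(amplification_factor(700.0, sift_raw), 2))
```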
FLAT indexes
- Write performance

  The following figure shows the write throughput of FLAT indexes.
- ANN query performance

  The following figure shows the throughput of single-threaded and multi-threaded ANN queries by using FLAT indexes.
- Memory efficiency

  The following figure shows the memory amplification factor of FLAT indexes in TairVector.
The memory amplification factor of the Random-s-100-euclidean dataset is relatively high because the total size of the dataset is smaller than that of other datasets.
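One way to see why smaller datasets show higher amplification is a simple model: if each stored vector carries a roughly fixed per-key overhead on top of its raw float32 payload, that overhead weighs more on small, low-dimensional vectors. The 128-byte overhead below is a hypothetical figure chosen only to illustrate the shape of the effect, not a measured TairVector value:

```python
def flat_amplification(count, dims, per_key_overhead_bytes=128,
                       bytes_per_float=4):
    # Model: index memory = raw vectors + a fixed per-key overhead.
    # The 128-byte overhead is a hypothetical figure, not a measurement.
    raw = count * dims * bytes_per_float
    index = raw + count * per_key_overhead_bytes
    return index / raw

# Random-s-100-euclidean shape: 90,000 vectors x 100 dimensions.
small = flat_amplification(90_000, 100)
# Mnist-784-euclidean shape: 60,000 vectors x 784 dimensions.
large = flat_amplification(60_000, 784)
print(small > large)  # True: fixed overhead weighs more on small vectors
```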