TairVector is an in-house data structure of Tair that provides high-performance, real-time storage and retrieval of vectors. This topic describes the method used to test the performance of TairVector and provides the test results obtained by Alibaba Cloud.

TairVector supports the approximate nearest neighbor (ANN) search algorithm. You can use TairVector for semantic retrieval of unstructured data and personalized recommendation engines. For more information, see TairVector.
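Before the test details, a minimal usage sketch may help illustrate what is being benchmarked. The sketch below uses the tvs_create_index, tvs_hset, and tvs_knnsearch commands of the tair-py client; the index name, dimension, vectors, and connection parameters are placeholders, and the exact signatures should be checked against the tair-py documentation.

    from tair import Tair

    # Connect to a Tair instance (placeholder endpoint and credentials).
    client = Tair(host="r-****.redis.rds.aliyuncs.com", port=6379, password="user:password")

    # Create an HNSW index for 4-dimensional vectors with Euclidean (L2) distance.
    client.tvs_create_index("example_index", 4, distance_type="L2", index_type="HNSW")

    # Write a vector under a key.
    client.tvs_hset("example_index", "key1", vector="[0.1, 0.2, 0.3, 0.4]")

    # Retrieve the top 2 approximate nearest neighbors of a query vector.
    print(client.tvs_knnsearch("example_index", 2, vector="[0.1, 0.2, 0.3, 0.4]"))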

Test description

Test environment

Instance: A Tair DRAM-based instance that is compatible with Redis 6.0 and has a memory capacity of 16 GB.

Server that is used to run tests:
  • An Elastic Compute Service (ECS) instance that is deployed in the same virtual private cloud (VPC) as the Tair instance and is connected to the Tair instance over the VPC.
  • Linux operating system.
  • A Python (version 3.7 or later) development environment.

Test data

The Sift-128-euclidean, Gist-960-euclidean, Glove-200-angular, and Deep-image-96-angular datasets are used to test the Hierarchical Navigable Small World (HNSW) indexing algorithm. The Random-s-100-euclidean and Mnist-784-euclidean datasets are used to test the Flat Search indexing algorithm.

The following list describes the datasets.

• Sift-128-euclidean: Image feature vectors generated from the Texmex dataset by using the scale-invariant feature transform (SIFT) algorithm. Vector dimension: 128. Number of vectors: 1,000,000. Number of queries: 10,000. Data volume: 488 MB. Distance formula: L2.
• Gist-960-euclidean: Image feature vectors generated from the Texmex dataset by using the GIST global image descriptor algorithm. Vector dimension: 960. Number of vectors: 1,000,000. Number of queries: 1,000. Data volume: 3.57 GB. Distance formula: L2.
• Glove-200-angular: Word vectors generated by applying the GloVe algorithm to text data collected from the Internet. Vector dimension: 200. Number of vectors: 1,183,514. Number of queries: 10,000. Data volume: 902 MB. Distance formula: IP.
• Deep-image-96-angular: Vectors extracted from the output layer of the GoogLeNet neural network trained on the ImageNet dataset. Vector dimension: 96. Number of vectors: 9,990,000. Number of queries: 10,000. Data volume: 3.57 GB. Distance formula: IP.
• Random-s-100-euclidean: A dataset that is randomly generated by using test tools. No download URLs are available. Vector dimension: 100. Number of vectors: 90,000. Number of queries: 10,000. Data volume: 34 MB. Distance formula: L2.
• Mnist-784-euclidean: A dataset from the MNIST database. Vector dimension: 784. Number of vectors: 60,000. Number of queries: 10,000. Data volume: 179 MB. Distance formula: L2.

Test tools and methods

  1. Install tair-py on the test server.
    Use one of the following methods to perform the installation:
    • Use pip:
      pip install tair
    • Install from source:
      git clone https://github.com/alibaba/tair-py.git
      cd tair-py
      python setup.py install
  2. Download and decompress Ann-benchmarks.
    Run the following command to decompress Ann-benchmarks:
    tar -zxvf ann-benchmarks.tar.gz
  3. Configure the endpoint, port number, username, and password of the Tair instance in the algos.yaml file.
    Open the algos.yaml file, search for tairvector to find the corresponding configuration items, and then configure the following parameters of base-args:
    • host: the endpoint used to connect to the Tair instance.
    • port: the port number used to connect to the Tair instance.
    • password: the username and password of the Tair instance in the user:password format. For more information, see Connect to a Tair instance.
    • parallelism: the number of concurrent threads. Default value: 4.
    Example:
    {"host": "r-****0d7f.redis.zhangbei.rds.aliyuncs.com", "port": "6379", "password": "testaccount:Rp829dlwa", "parallelism": 4}
  4. Run the run.py script to start the test.
    Important After you run the run.py script, the entire test runs end to end: it creates an index, writes data to the index, and then runs queries and records the results. Do not run the script repeatedly on the same dataset.
    Example:
    # Run a single-threaded test by using the Sift-128-euclidean dataset and HNSW indexing algorithm. 
    python run.py --local --runs 3 --algorithm tairvector --dataset sift-128-euclidean
    # Run a multi-threaded test by using the Sift-128-euclidean dataset and HNSW indexing algorithm. 
    python run.py --local --runs 3 --algorithm tairvector --dataset sift-128-euclidean --batch
    
    # Run a single-threaded test by using the Mnist-784-euclidean dataset and Flat Search indexing algorithm. 
    python run.py --local --runs 3 --algorithm tairvector-flat --dataset mnist-784-euclidean
    # Run a multi-threaded test by using the Mnist-784-euclidean dataset and Flat Search indexing algorithm. 
    python run.py --local --runs 3 --algorithm tairvector-flat --dataset mnist-784-euclidean --batch
  5. Run the data_export.py script to export the results.
    Example:
    # Single thread.
    python data_export.py --output out.csv
    # Multiple threads.
    python data_export.py --output out.csv --batch
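    The exported CSV can be inspected with pandas to compare parameter settings, as in the following sketch. The column names used here (such as k-nn for the recall metric and qps for throughput) follow common Ann-benchmarks exports and may differ across versions; adjust them to match your out.csv.
      import pandas as pd

      # Load the results exported by data_export.py.
      df = pd.read_csv("out.csv")

      # Inspect the recall/throughput trade-off for each parameter setting.
      cols = ["algorithm", "parameters", "k-nn", "qps"]
      print(df[cols].sort_values(by=["k-nn", "qps"], ascending=False).head(10))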

Test results

HNSW indexes

  • Write performance

    In this test, eight threads concurrently write data to the Tair instance. Take note of the write throughput.

    The following figures show the write performance of the HNSW indexing algorithm at different values of the M parameter when ef_construct is set to 500. The M parameter specifies the maximum number of outgoing neighbors on each layer of the graph index structure. The write throughput of the HNSW indexing algorithm decreases as the value of the M parameter increases, because each inserted vector must be connected to more neighbors. A sketch of how to specify these parameters when you create an index follows the figures below.

    Figure 1. Sift-128-euclidean
    Sift-128-euclidean
    Figure 2. Gist-960-euclidean
    Gist-960-euclidean
    Figure 3. Glove-200-angular
    Glove-200-angular
    Figure 4. Deep-image-96-angular
    Deep-image-96-angular
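    In the benchmark, M and ef_construct are set through the tairvector configuration in Ann-benchmarks. If you create an index directly with tair-py, they can be passed as additional keyword arguments, as in the following sketch (the connection values, index name, and dimension are placeholders; verify the parameter names against the TairVector documentation):
      from tair import Tair

      client = Tair(host="r-****.redis.rds.aliyuncs.com", port=6379, password="user:password")

      # Create an HNSW index for the Sift-128-euclidean dataset (dimension 128).
      client.tvs_create_index(
          "sift_index", 128,
          distance_type="L2",
          index_type="HNSW",
          M=24,               # maximum number of outgoing neighbors per layer
          ef_construct=500,   # size of the candidate list during index construction
      )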
  • ANN query performance

    Take note of the latency of single-threaded ANN queries and the throughput of multi-threaded concurrent queries.

    • Latency of single-threaded ANN queries

      The following figures show the latency of single-threaded ANN queries in TairVector when ef_construct is set to 500 and M and ef_search are set to different values. The latency of single-threaded ANN queries increases with the value of ef_search, which specifies the size of the candidate set that is traversed during a query. Larger ef_search values increase latency but also improve the recall rate. For more information, see the "Recall rate" section of this topic. A sketch that sets ef_search in a query follows the figures below.

      Figure 5. Sift-128-euclidean
      Sift-128-euclidean
      Figure 6. Gist-960-euclidean
      Gist-960-euclidean
      Figure 7. Glove-200-angular
      Glove-200-angular
      Figure 8. Deep-image-96-angular
      Deep-image-96-angular
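      ef_search is a query-time parameter. With tair-py, it can be passed as a keyword argument to tvs_knnsearch, as in the following sketch (the connection values, index name, and query vector are placeholders; verify the parameter name against the TairVector documentation):
        from tair import Tair

        client = Tair(host="r-****.redis.rds.aliyuncs.com", port=6379, password="user:password")

        # Build a placeholder 128-dimensional query vector as a string.
        query = "[" + ", ".join(["0.1"] * 128) + "]"

        # Return the top 10 neighbors; a larger ef_search raises recall and latency.
        print(client.tvs_knnsearch("sift_index", 10, vector=query, ef_search=200))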
    • Throughput of multi-threaded queries

      The following figures show the throughput of single-threaded queries and four-thread concurrent queries when ef_construct is set to 500, M is set to 24, and ef_search is set to different values.

      Figure 9. Sift-128-euclidean
      Sift-128-euclidean
      Figure 10. Gist-960-euclidean
      Gist-960-euclidean
      Figure 11. Glove-200-angular
      Glove-200-angular
      Figure 12. Deep-image-96-angular
      Deep-image-96-angular
  • Recall rate
    The recall rate of queries against HNSW indexes is closely related to the parameter settings. The following figures show the top-10 recall rates at different values of M and ef_search for each dataset. The query latency increases with the values of M and ef_search. For more information, see the "ANN query performance" section of this topic. A sketch of how the top-10 recall rate is computed follows the figures below.
    Note You can adjust these parameters based on your business requirements to balance query performance against the recall rate.
    Figure 13. Sift-128-euclidean
    Sift-128-euclidean
    Figure 14. Gist-960-euclidean
    Gist-960-euclidean
    Figure 15. Glove-200-angular
    Glove-200-angular
    Figure 16. Deep-image-96-angular
    Deep-image-96-angular
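    For reference, the top-10 recall rate compares the neighbors returned by a query against the ground-truth neighbors that ship with each dataset. The following is a minimal sketch of the computation; the function name and inputs are illustrative:
      def topk_recall(returned: list, ground_truth: list, k: int = 10) -> float:
          """Fraction of the true top-k neighbors found in the returned top-k."""
          hits = len(set(returned[:k]) & set(ground_truth[:k]))
          return hits / k

      # Example: 9 of the 10 true neighbors were returned, so the recall is 0.9.
      print(topk_recall(list(range(9)) + [99], list(range(10))))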
  • Memory efficiency
    The memory efficiency indicates the memory amplification factor, which is the ratio of the index memory usage to the original size of the vectors. The following figure shows the memory amplification factor of TairVector at different values of the M parameter.
    Figure: Memory usage of TairVector
    The memory usage of the HNSW graph structure increases in proportion to the value of the M parameter and does not depend on the vector dimension, whereas the memory usage of the vectors themselves increases in proportion to their dimension. As the vector dimension grows, the graph structure accounts for a smaller share of the total memory, so the memory amplification factor decreases. A worked example of the amplification calculation follows.
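    The raw size of a dataset is the number of vectors multiplied by the vector dimension and by 4 bytes per float32 component. The following sketch reproduces the 488 MB figure for Sift-128-euclidean and derives an amplification factor from a measured index size; the measured value here is a placeholder, not a benchmark result:
      # Raw size of Sift-128-euclidean: 1,000,000 float32 vectors of dimension 128.
      raw_bytes = 1_000_000 * 128 * 4
      print(raw_bytes / 2**20)  # ~488 MiB, matching the dataset list above

      # Memory amplification factor = measured index memory / raw vector size.
      measured_index_bytes = 700 * 2**20  # placeholder measurement only
      print(measured_index_bytes / raw_bytes)  # ~1.43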

FLAT indexes

  • Write performance
    The following figure shows the write throughput of FLAT indexes.
    Figure: Write throughput of FLAT indexes
  • ANN query performance
    The following figure shows the throughput of single-threaded and multi-threaded ANN queries by using FLAT indexes.
    Figure: Query performance of FLAT indexes
  • Memory efficiency
    The following figure shows the memory amplification factor of FLAT indexes in TairVector.
    Figure: Memory amplification factor of FLAT indexes
    The memory amplification factor of the Random-s-100-euclidean dataset is relatively high because the total size of the dataset is smaller than that of the other datasets.