MaxCompute: Hash sharding testing

Last Updated: Jun 19, 2023

Hash sharding testing checks the correctness of the end-to-end features of Proxima CE. This topic describes the test conclusion and the test procedure.

Test conclusion

The recall result that is obtained by using hash sharding of Proxima CE is basically the same as the recall result that is obtained by using the recall tool. The correctness test meets expectations.

Test procedure

  1. Design the test method.
    1. Data preparation: Prepare random datasets of different data types, such as FLOAT, BINARY, and INT8. For Proxima CE, the datasets must be converted into the related MaxCompute tables. For the C++ baseline, the data must be processed by using the bench performance test tool that is provided in the Proxima kernel.
      Note: C++ baseline indicates that the performance data measured by the Proxima kernel is used as the test benchmark. The Proxima kernel is written in C++.
    2. Algorithm comparison: For each dataset, run different search methods, such as graph search, hierarchical clustering search, and linear search, to obtain the test results of Proxima CE and the C++ baseline. Then, compare the recall rates of the two. In the test, Top K is set to 100. The recall tool of Proxima CE calculates the recall rate based on the 100 sample data records in the query table. The comparison between the recall rates that are obtained by using Proxima CE and the recall tool of Proxima CE is based mainly on the linear search method. The recall tool of Proxima2 works on the same principle and is used in the same way as the recall tool of Proxima CE.
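The recall comparison described in step 1 can be illustrated with a minimal sketch. It assumes the common definition of recall (the share of exact top-K neighbors that a search result also returns, averaged over all queries); the source does not state the exact formula used by the recall tool. The function names are hypothetical and are not part of Proxima CE.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def linear_search(docs, query, top_k, distance=squared_euclidean):
    """Exact search: score every doc and return the indices of the top_k nearest.

    sorted() is stable, so ties are broken by the lower doc index.
    """
    ranked = sorted(range(len(docs)), key=lambda i: distance(docs[i], query))
    return ranked[:top_k]


def recall_rate(result_ids, exact_ids):
    """Fraction of exact neighbors that also appear in the search results.

    Both arguments hold one list of doc indices per query.
    This definition is an assumption, not the documented tool formula.
    """
    hits = sum(len(set(r) & set(e)) for r, e in zip(result_ids, exact_ids))
    total = sum(len(e) for e in exact_ids)
    return hits / total
```

In the setup described above, `exact_ids` would hold the linear search results with `top_k=100` for each of the 100 query records, and `result_ids` would hold the results of the search method under test.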
  2. Make test preparations.
    • Prepare data.
      Prepare random datasets by data type. The following table describes the basic information of the datasets. For each dataset, the query table consists of 100 data records that are extracted from the doc table.
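      | Data type | Number of dimensions | Number of data records | Value range |
      |-----------|----------------------|------------------------|-------------|
      | FLOAT     | 128                  | 100,000                | (0,1)       |
      | INT8      | 128                  | 100,000                | (-128,127)  |
      | BINARY    | 512                  | 100,000                | 0/1         |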
    • Configure parameters.
      | Search method                  | Parameter                                       |
      |--------------------------------|-------------------------------------------------|
      | Graph search                   | proxima.hnsw.searcher.ef: 400                   |
      |                                | proxima.hnsw.builder.efconstruction: 400        |
      |                                | proxima.hnsw.builder.max_neighbor_count: 100    |
      | Hierarchical clustering search | proxima.hc.builder.centroids_count: 2000        |
      |                                | proxima.hc.searcher.max_scan_count: 80000       |
      | Satellite System Graph (SSG)   | proxima.hnsw.searcher.ef: 400                   |
      |                                | proxima.hnsw.builder.efconstruction: 400        |
      |                                | proxima.hnsw.builder.max_neighbor_count: 100    |
      | Graph clustering (GC)          | proxima.gc.builder.centroid_count: 1000         |
      |                                | proxima.gc.searcher.scan_ratio: 0.8             |
      | Quantized clustering (QC)      | proxima.qc.builder.centroid_count: 1000         |
      |                                | proxima.qc.searcher.scan_ratio: 0.8             |
      | Linear search                  | -                                               |
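The data preparation in step 2 can be sketched as follows. This is an illustration under stated assumptions, not the script used in the test: the function names are hypothetical, the INT8 range is treated as inclusive, and `random.random()` draws from [0, 1) rather than the open interval (0,1). Converting the generated records into MaxCompute tables is not shown.

```python
import random

# Dimensions per data type, taken from the dataset table in step 2.
DIMENSIONS = {"FLOAT": 128, "INT8": 128, "BINARY": 512}


def make_doc_table(data_type, count, rng):
    """Generate `count` random vectors of the given data type."""
    dim = DIMENSIONS[data_type]
    if data_type == "FLOAT":
        # [0, 1): the value 0.0 is possible but very unlikely.
        return [[rng.random() for _ in range(dim)] for _ in range(count)]
    if data_type == "INT8":
        # Assumption: both endpoints -128 and 127 are allowed.
        return [[rng.randint(-128, 127) for _ in range(dim)] for _ in range(count)]
    # BINARY: each component is 0 or 1.
    return [[rng.randint(0, 1) for _ in range(dim)] for _ in range(count)]


def make_query_table(docs, query_count, rng):
    """Extract `query_count` distinct records from the doc table."""
    return rng.sample(docs, query_count)
```

For the sizes used in the test, a call would look like `docs = make_doc_table("FLOAT", 100_000, random.Random(0))` followed by `make_query_table(docs, 100, random.Random(1))`. Fixed seeds make the datasets reproducible.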
  3. View test results.
    • The following table describes the test result when the data type is FLOAT and the distance measure type is SquaredEuclidean.
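      | Search method                  | Recall rate of Proxima CE | Recall rate of the recall tool |
      |--------------------------------|---------------------------|--------------------------------|
      | Graph search                   | 89.03%                    | 88.62%                         |
      | Hierarchical clustering search | 98.91%                    | 98.14%                         |
      | Satellite System Graph (SSG)   | 96.00%                    | 95.76%                         |
      | Graph clustering (GC)          | 97.87%                    | 97.64%                         |
      | Quantized clustering (QC)      | 97.70%                    | 97.77%                         |
      | Linear search                  | 100%                      | 100%                           |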
    • The following table describes the test result when the data type is INT8 and the distance measure type is SquaredEuclidean.
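      | Search method                  | Recall rate of Proxima CE | Recall rate of the recall tool |
      |--------------------------------|---------------------------|--------------------------------|
      | Graph search                   | 89.89%                    | 89.93%                         |
      | Hierarchical clustering search | 98.27%                    | 97.69%                         |
      | Satellite System Graph (SSG)   | 95.58%                    | 95.75%                         |
      | Graph clustering (GC)          | 97.72%                    | 97.36%                         |
      | Quantized clustering (QC)      | 97.68%                    | 97.71%                         |
      | Linear search                  | 100%                      | 100%                           |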
    • The following table describes the test result when the data type is BINARY and the distance measure type is Hamming.
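      | Search method                  | Recall rate of Proxima CE | Recall rate of the recall tool |
      |--------------------------------|---------------------------|--------------------------------|
      | Graph search                   | 85.33%                    | 88.09%                         |
      | Hierarchical clustering search | 91.45%                    | 95.27%                         |
      | Satellite System Graph (SSG)   | 75.89%                    | 77.83%                         |
      | Graph clustering (GC)          | 90.01%                    | 93.99%                         |
      | Quantized clustering (QC)      | 90.51%                    | 93.78%                         |
      | Linear search                  | 100%                      | 100%                           |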
  4. Analyze the results.

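    For each search method and data type, the recall rate of Proxima CE is close to the recall rate that is obtained by using the recall tool. For FLOAT and INT8 data, the two rates differ by less than 1 percentage point for every search method. For BINARY data, the difference is larger, up to about 4 percentage points. Linear search reaches 100% in all cases.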
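The size of the gap between the two recall rates can be read directly from the result tables in step 3. The sketch below copies the reported values and computes the largest absolute difference per data type, in percentage points. The variable and function names are illustrative only.

```python
# (Proxima CE recall %, recall tool recall %) per search method,
# copied from the result tables in step 3.
RESULTS = {
    "FLOAT": {
        "Graph search": (89.03, 88.62),
        "Hierarchical clustering search": (98.91, 98.14),
        "Satellite System Graph (SSG)": (96.00, 95.76),
        "Graph clustering (GC)": (97.87, 97.64),
        "Quantized clustering (QC)": (97.70, 97.77),
        "Linear search": (100.0, 100.0),
    },
    "INT8": {
        "Graph search": (89.89, 89.93),
        "Hierarchical clustering search": (98.27, 97.69),
        "Satellite System Graph (SSG)": (95.58, 95.75),
        "Graph clustering (GC)": (97.72, 97.36),
        "Quantized clustering (QC)": (97.68, 97.71),
        "Linear search": (100.0, 100.0),
    },
    "BINARY": {
        "Graph search": (85.33, 88.09),
        "Hierarchical clustering search": (91.45, 95.27),
        "Satellite System Graph (SSG)": (75.89, 77.83),
        "Graph clustering (GC)": (90.01, 93.99),
        "Quantized clustering (QC)": (90.51, 93.78),
        "Linear search": (100.0, 100.0),
    },
}


def max_gap(results):
    """Largest absolute recall difference per data type, in percentage points."""
    return {
        dtype: round(max(abs(ce - tool) for ce, tool in rows.values()), 2)
        for dtype, rows in results.items()
    }


print(max_gap(RESULTS))  # {'FLOAT': 0.77, 'INT8': 0.58, 'BINARY': 3.98}
```

The gap stays below 1 percentage point for FLOAT and INT8 and is widest for BINARY data with the Hamming distance.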