The hash sharding test verifies the correctness of the end-to-end features of Proxima CE. This topic describes the conclusion and the procedure of the test.
Test conclusion
The recall rate that Proxima CE achieves with hash sharding is close to the recall rate that the recall tool reports. The correctness test meets expectations.
Test procedure
- Design the test method.
- Data preparation: Prepare random datasets of different data types, such as FLOAT, BINARY, and INT8. For Proxima CE, convert the datasets into the corresponding MaxCompute tables. For the C++ baseline, process the data by using the bench performance test tool that is provided in the Proxima kernel.
Note: The C++ baseline means that the performance data measured by the Proxima kernel, which is written in C++, is used as the test benchmark.
- Algorithm comparison: For each dataset, run different search methods, such as graph search, hierarchical clustering search, and linear search, to obtain the results of Proxima CE and of the C++ baseline. Then, compare the recall rates of the two. In the test, Top K is set to 100. The recall tool of Proxima CE calculates the recall rate based on 100 sample records in the query table. The comparison between the recall rate of Proxima CE and the recall rate of the recall tool is mainly based on the linear search method. The recall tool of Proxima2 works and is used in the same way as the recall tool of Proxima CE.
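The recall comparison described above can be sketched in a few lines. This is a minimal illustration that assumes the common definition of recall (the share of the true Top K neighbors that also appear in the returned Top K, averaged over the queries). The source does not show how the recall tool is implemented, so the function names `recall_at_k` and `mean_recall` are hypothetical.

```python
from typing import Dict, List, Sequence


def recall_at_k(truth: Sequence[int], found: Sequence[int], k: int) -> float:
    """Fraction of the true top-k neighbors that appear in the returned top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth_top = set(truth[:k])
    if not truth_top:
        return 0.0
    hits = len(truth_top.intersection(found[:k]))
    return hits / len(truth_top)


def mean_recall(truth_by_query: Dict[int, List[int]],
                found_by_query: Dict[int, List[int]],
                k: int = 100) -> float:
    """Average recall@k over all queries that have ground truth.

    k defaults to 100 because the test sets Top K to 100.
    """
    if not truth_by_query:
        raise ValueError("no queries to evaluate")
    total = 0.0
    for query_id, truth in truth_by_query.items():
        # A query with no returned results counts as zero recall.
        total += recall_at_k(truth, found_by_query.get(query_id, []), k)
    return total / len(truth_by_query)


# Toy example with two queries and k=4 (illustrative IDs, not test data).
truth = {0: [1, 2, 3, 4], 1: [5, 6, 7, 8]}
found = {0: [1, 2, 3, 9], 1: [5, 6, 7, 8]}
print(mean_recall(truth, found, k=4))  # 0.875
```

In the toy example, the first query recovers 3 of 4 true neighbors and the second recovers all 4, so the mean recall is (0.75 + 1.0) / 2.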
- Make test preparations.
- Prepare data. Prepare a random dataset for each data type. The following table describes the basic information of the datasets. For each dataset, the query table consists of 100 data records that are extracted from the doc table.
| Data type | Number of dimensions | Number of data records | Value range |
| --- | --- | --- | --- |
| FLOAT | 128 | 100,000 | (0,1) |
| INT8 | 128 | 100,000 | (-128,127) |
| BINARY | 512 | 100,000 | 0/1 |

- Configure parameters.
| Search method | Parameter | Value |
| --- | --- | --- |
| Graph search | proxima.hnsw.searcher.ef | 400 |
| Graph search | proxima.hnsw.builder.efconstruction | 400 |
| Graph search | proxima.hnsw.builder.max_neighbor_count | 100 |
| Hierarchical clustering search | proxima.hc.builder.centroids_count | 2000 |
| Hierarchical clustering search | proxima.hc.searcher.max_scan_count | 80000 |
| Satellite System Graph (SSG) | proxima.hnsw.searcher.ef | 400 |
| Satellite System Graph (SSG) | proxima.hnsw.builder.efconstruction | 400 |
| Satellite System Graph (SSG) | proxima.hnsw.builder.max_neighbor_count | 100 |
| Graph clustering (GC) | proxima.gc.builder.centroid_count | 1000 |
| Graph clustering (GC) | proxima.gc.searcher.scan_ratio | 0.8 |
| Quantized clustering (QC) | proxima.qc.builder.centroid_count | 1000 |
| Quantized clustering (QC) | proxima.qc.searcher.scan_ratio | 0.8 |
| Linear search | - | - |
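The data preparation step can be sketched as follows. This is only an illustration of generating random vectors per data type and drawing the query records from the doc records. The function names are hypothetical, the range endpoints are assumed to be included for INT8, and the conversion into MaxCompute tables is not shown because the source does not describe it.

```python
import random
from typing import List, Tuple


def make_vector(data_type: str, dim: int, rng: random.Random) -> list:
    """Generate one random vector for the given data type."""
    if data_type == "FLOAT":
        # random() returns values in [0.0, 1.0); the table lists the range as (0,1).
        return [rng.random() for _ in range(dim)]
    if data_type == "INT8":
        # Assumption: both endpoints of (-128,127) are valid values.
        return [rng.randint(-128, 127) for _ in range(dim)]
    if data_type == "BINARY":
        return [rng.randint(0, 1) for _ in range(dim)]
    raise ValueError(f"unsupported data type: {data_type}")


def make_dataset(data_type: str, dim: int, doc_count: int,
                 query_count: int = 100, seed: int = 0
                 ) -> Tuple[List[list], List[Tuple[int, list]]]:
    """Return (docs, queries); queries are (doc index, vector) pairs drawn from docs."""
    rng = random.Random(seed)
    docs = [make_vector(data_type, dim, rng) for _ in range(doc_count)]
    # Sample distinct doc records to form the query table.
    query_ids = rng.sample(range(doc_count), query_count)
    queries = [(i, docs[i]) for i in query_ids]
    return docs, queries


# Small run for illustration; the test itself uses 100,000 records per dataset.
docs, queries = make_dataset("BINARY", dim=512, doc_count=1000)
print(len(docs), len(queries))  # 1000 100
```

Fixing the seed keeps the generated dataset reproducible, which makes it possible to feed identical data to both sides of a comparison.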
- View test results.
- The following table describes the test result when the data type is FLOAT and the distance measure type is SquaredEuclidean.
| Search method | Recall rate of Proxima CE | Recall rate of the recall tool |
| --- | --- | --- |
| Graph search | 89.03% | 88.62% |
| Hierarchical clustering search | 98.91% | 98.14% |
| Satellite System Graph (SSG) | 96.00% | 95.76% |
| Graph clustering (GC) | 97.87% | 97.64% |
| Quantized clustering (QC) | 97.70% | 97.77% |
| Linear search | 100% | 100% |

- The following table describes the test result when the data type is INT8 and the distance measure type is SquaredEuclidean.

| Search method | Recall rate of Proxima CE | Recall rate of the recall tool |
| --- | --- | --- |
| Graph search | 89.89% | 89.93% |
| Hierarchical clustering search | 98.27% | 97.69% |
| Satellite System Graph (SSG) | 95.58% | 95.75% |
| Graph clustering (GC) | 97.72% | 97.36% |
| Quantized clustering (QC) | 97.68% | 97.71% |
| Linear search | 100% | 100% |

- The following table describes the test result when the data type is BINARY and the distance measure type is Hamming.

| Search method | Recall rate of Proxima CE | Recall rate of the recall tool |
| --- | --- | --- |
| Graph search | 85.33% | 88.09% |
| Hierarchical clustering search | 91.45% | 95.27% |
| Satellite System Graph (SSG) | 75.89% | 77.83% |
| Graph clustering (GC) | 90.01% | 93.99% |
| Quantized clustering (QC) | 90.51% | 93.78% |
| Linear search | 100% | 100% |
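The two distance measures named in the results, SquaredEuclidean and Hamming, and the linear search method can be illustrated with a short sketch. It uses the textbook definitions and hypothetical function names, and is not the Proxima kernel code. Because a linear search scans every record and keeps the exact nearest ones, it is expected to reach a recall rate of 100%, which matches the last row of each table.

```python
from typing import Callable, List, Sequence


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared per-dimension differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two binary vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def linear_search(query: Sequence, docs: List[Sequence],
                  distance: Callable, k: int) -> List[int]:
    """Return the indices of the k nearest docs found by an exhaustive scan."""
    # Ties are broken by the lower index so that the result is deterministic.
    ranked = sorted(range(len(docs)), key=lambda i: (distance(query, docs[i]), i))
    return ranked[:k]


float_docs = [[0, 0], [1, 1], [3, 3]]
print(linear_search([1, 0], float_docs, squared_euclidean, k=2))  # [0, 1]

binary_docs = [[0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
print(linear_search([1, 1, 0, 1], binary_docs, hamming, k=2))  # [1, 2]
```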
- Analyze the results.
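For each search method and data type, the recall rate of Proxima CE is similar to the recall rate that the recall tool reports. For the FLOAT and INT8 datasets, the two differ by less than 1 percentage point. For the BINARY dataset, the recall rate of Proxima CE is about 2 to 4 percentage points lower for the approximate search methods. Linear search reaches 100% in all cases.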
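The comparison can be quantified directly from the result tables. The following sketch copies the reported recall rates and computes, for each data type, the search method with the largest gap between Proxima CE and the recall tool. The helper names are hypothetical and the script is only a convenience for reading the tables.

```python
# (Proxima CE recall %, recall tool recall %) per search method, from the result tables.
RESULTS = {
    "FLOAT": {
        "Graph search": (89.03, 88.62),
        "Hierarchical clustering search": (98.91, 98.14),
        "Satellite System Graph (SSG)": (96.00, 95.76),
        "Graph clustering (GC)": (97.87, 97.64),
        "Quantized clustering (QC)": (97.70, 97.77),
        "Linear search": (100.0, 100.0),
    },
    "INT8": {
        "Graph search": (89.89, 89.93),
        "Hierarchical clustering search": (98.27, 97.69),
        "Satellite System Graph (SSG)": (95.58, 95.75),
        "Graph clustering (GC)": (97.72, 97.36),
        "Quantized clustering (QC)": (97.68, 97.71),
        "Linear search": (100.0, 100.0),
    },
    "BINARY": {
        "Graph search": (85.33, 88.09),
        "Hierarchical clustering search": (91.45, 95.27),
        "Satellite System Graph (SSG)": (75.89, 77.83),
        "Graph clustering (GC)": (90.01, 93.99),
        "Quantized clustering (QC)": (90.51, 93.78),
        "Linear search": (100.0, 100.0),
    },
}


def recall_gaps(rows: dict) -> dict:
    """Proxima CE recall minus recall tool recall, in percentage points."""
    return {method: round(ce - tool, 2) for method, (ce, tool) in rows.items()}


def largest_gap(rows: dict) -> tuple:
    """Return (method, gap) for the method with the largest absolute gap."""
    gaps = recall_gaps(rows)
    method = max(gaps, key=lambda m: abs(gaps[m]))
    return method, gaps[method]


for data_type, rows in RESULTS.items():
    method, gap = largest_gap(rows)
    print(f"{data_type}: largest gap {gap:+.2f} points ({method})")
# FLOAT: largest gap +0.77 points (Hierarchical clustering search)
# INT8: largest gap +0.58 points (Hierarchical clustering search)
# BINARY: largest gap -3.98 points (Graph clustering (GC))
```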