Hash sharding testing

Hash sharding testing verifies the end-to-end correctness of Proxima CE. This topic describes the test conclusion and the test procedure.

Test conclusion

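The recall rates that Proxima CE obtains with hash sharding closely match the recall rates reported by the recall tool. The gap is under 1 percentage point for FLOAT and INT8 data and roughly 2 to 4 percentage points for BINARY data, and linear search reaches 100% in every case. The correctness test meets expectations.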

Test procedure

Design the test method.
1. Data preparation: Prepare random datasets of different data types, such as FLOAT, BINARY, and INT8. For Proxima CE, convert the datasets into the corresponding MaxCompute tables. For the C++ baseline, process the data by using the bench performance test tool that is provided in the Proxima kernel.
  Note: The C++ baseline uses the performance data measured by the Proxima kernel, which is written in C++, as the test benchmark.
2. Algorithm comparison: For each dataset, run different search methods, such as graph search, hierarchical clustering search, and linear search, to obtain the test results of Proxima CE and the C++ baseline. Then, compare the recall rates of Proxima CE and the C++ baseline. In the test, Top K is set to 100, and the recall tool of Proxima CE calculates the recall rate based on the 100 sample records in the query table. The comparison between Proxima CE and its recall tool is mainly based on the linear search method. The recall tool of Proxima2 works and is used in the same way as the recall tool of Proxima CE.
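The recall comparison in step 2 can be illustrated with a minimal Python sketch. It is not the recall tool of Proxima CE: the function names, the in-memory dictionaries, and the per-query averaging are assumptions made for illustration. The sketch treats an exhaustive linear search as the ground truth and measures how many of the true top K neighbors another search result contains.

```python
import heapq


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def linear_search(docs, query, k, distance=squared_euclidean):
    """Exact top-k search: scan every doc and return doc ids, nearest first."""
    return heapq.nsmallest(k, docs, key=lambda doc_id: distance(docs[doc_id], query))


def recall_rate(retrieved, ground_truth):
    """Average, over all queries, of the fraction of true top-k ids retrieved."""
    total = 0.0
    for query_id, truth in ground_truth.items():
        hits = len(set(retrieved.get(query_id, [])) & set(truth))
        total += hits / len(truth)
    return total / len(ground_truth)


# Toy data (hypothetical): 10 two-dimensional docs and 2 queries, K = 3.
docs = {i: [float(i), 0.0] for i in range(10)}
queries = {"q0": [0.2, 0.0], "q1": [8.9, 0.0]}

# Ground truth comes from the exhaustive scan.
truth = {qid: linear_search(docs, vec, k=3) for qid, vec in queries.items()}

# Pretend an approximate index returned these ids (one miss for q0).
approx = {"q0": [0, 1, 5], "q1": [9, 8, 7]}

print(f"recall = {recall_rate(approx, truth):.2%}")
```

In the test described above, K would be 100 and the ground truth would cover the 100 records of the query table; the toy sizes here only keep the example fast.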

Make test preparations.

Prepare data.
Prepare random datasets by data type. The following table describes the basic information of the datasets. For each dataset, the query table consists of 100 records extracted from the doc table.
Data type	Number of dimensions	Number of data records	Value range
FLOAT	128	100,000	`(0,1)`
INT8	128	100,000	`(-128,127)`
BINARY	512	100,000	0/1
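As a rough illustration of the data preparation, the following Python sketch generates a random doc table with the shapes listed above and extracts 100 of its records as the query table. The function names, the integer primary keys, and the in-memory dictionaries are hypothetical. Loading the data into MaxCompute tables is not shown, and whether the endpoints of each value range are included is an assumption.

```python
import random


def random_vector(data_type, dims, rng):
    """Generate one random vector for the given data type."""
    if data_type == "FLOAT":
        return [rng.random() for _ in range(dims)]  # values in [0.0, 1.0)
    if data_type == "INT8":
        return [rng.randint(-128, 127) for _ in range(dims)]  # endpoints included (assumption)
    if data_type == "BINARY":
        return [rng.randint(0, 1) for _ in range(dims)]  # each element is 0 or 1
    raise ValueError(f"unsupported data type: {data_type}")


def make_tables(data_type, records, dims, query_count=100, seed=0):
    """Build a doc table and a query table sampled from it, keyed by primary key."""
    rng = random.Random(seed)  # fixed seed so the datasets are reproducible
    doc = {pk: random_vector(data_type, dims, rng) for pk in range(records)}
    query_keys = rng.sample(range(records), query_count)
    query = {pk: doc[pk] for pk in query_keys}
    return doc, query


# Reduced record count so the example runs quickly; the test uses 100,000.
doc, query = make_tables("INT8", records=1_000, dims=128)
print(len(doc), len(query))
```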

Configure parameters.

Search method	Parameter
Graph search	`proxima.hnsw.searcher.ef: 400` `proxima.hnsw.builder.efconstruction: 400` `proxima.hnsw.builder.max_neighbor_count: 100`
Hierarchical clustering search	`proxima.hc.builder.centroids_count: 2000` `proxima.hc.searcher.max_scan_count: 80000`
Satellite System Graph (SSG)	`proxima.hnsw.searcher.ef: 400` `proxima.hnsw.builder.efconstruction: 400` `proxima.hnsw.builder.max_neighbor_count: 100`
Graph clustering (GC)	`proxima.gc.builder.centroid_count: 1000` `proxima.gc.searcher.scan_ratio: 0.8`
Quantized clustering (QC)	`proxima.qc.builder.centroid_count: 1000` `proxima.qc.searcher.scan_ratio: 0.8`
Linear search	-

View test results.

The following table describes the test result when the data type is FLOAT and the distance measure type is SquaredEuclidean.

Search method	Recall rate of Proxima CE	Recall rate of the recall tool
Graph search	89.03%	88.62%
Hierarchical clustering search	98.91%	98.14%
Satellite System Graph (SSG)	96.00%	95.76%
Graph clustering (GC)	97.87%	97.64%
Quantized clustering (QC)	97.70%	97.77%
Linear search	100%	100%

The following table describes the test result when the data type is INT8 and the distance measure type is SquaredEuclidean.

Search method	Recall rate of Proxima CE	Recall rate of the recall tool
Graph search	89.89%	89.93%
Hierarchical clustering search	98.27%	97.69%
Satellite System Graph (SSG)	95.58%	95.75%
Graph clustering (GC)	97.72%	97.36%
Quantized clustering (QC)	97.68%	97.71%
Linear search	100%	100%

The following table describes the test result when the data type is BINARY and the distance measure type is Hamming.

Search method	Recall rate of Proxima CE	Recall rate of the recall tool
Graph search	85.33%	88.09%
Hierarchical clustering search	91.45%	95.27%
Satellite System Graph (SSG)	75.89%	77.83%
Graph clustering (GC)	90.01%	93.99%
Quantized clustering (QC)	90.51%	93.78%
Linear search	100%	100%
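For reference, the Hamming distance used for BINARY data counts the positions at which two vectors differ. The following Python sketch shows the definition on 0/1 lists and an equivalent computation on bit-packed integers. The packing scheme is an assumption for illustration and is not necessarily how Proxima CE stores binary vectors.

```python
def hamming(a, b):
    """Number of positions at which two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def pack_bits(bits):
    """Pack a 0/1 list into one integer, first element as the highest bit."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def hamming_packed(a, b):
    """Hamming distance of two packed vectors: count the set bits of a XOR b."""
    return bin(a ^ b).count("1")


u = [1, 0, 1, 1, 0, 0, 1, 0]
v = [1, 1, 1, 0, 0, 0, 1, 1]
print(hamming(u, v), hamming_packed(pack_bits(u), pack_bits(v)))
```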

Analyze the results.
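For every search method and data type, the recall rate of Proxima CE is close to the recall rate that is obtained by using the recall tool. The gap is at most 0.77 percentage points for FLOAT data and at most 0.58 percentage points for INT8 data. For BINARY data, the recall rate of Proxima CE is 1.94 to 3.98 percentage points lower than that of the recall tool for every method except linear search. Linear search reaches 100% in all cases.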
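The comparison can be reproduced from the result tables with a short Python sketch. The recall rates below are copied from the tables in this topic; the dictionary layout and the function names are illustrative only.

```python
# (Proxima CE recall, recall tool recall) in percent, copied from the result tables.
RESULTS = {
    "FLOAT": {
        "Graph search": (89.03, 88.62),
        "Hierarchical clustering search": (98.91, 98.14),
        "Satellite System Graph (SSG)": (96.00, 95.76),
        "Graph clustering (GC)": (97.87, 97.64),
        "Quantized clustering (QC)": (97.70, 97.77),
        "Linear search": (100.0, 100.0),
    },
    "INT8": {
        "Graph search": (89.89, 89.93),
        "Hierarchical clustering search": (98.27, 97.69),
        "Satellite System Graph (SSG)": (95.58, 95.75),
        "Graph clustering (GC)": (97.72, 97.36),
        "Quantized clustering (QC)": (97.68, 97.71),
        "Linear search": (100.0, 100.0),
    },
    "BINARY": {
        "Graph search": (85.33, 88.09),
        "Hierarchical clustering search": (91.45, 95.27),
        "Satellite System Graph (SSG)": (75.89, 77.83),
        "Graph clustering (GC)": (90.01, 93.99),
        "Quantized clustering (QC)": (90.51, 93.78),
        "Linear search": (100.0, 100.0),
    },
}


def gaps(rows):
    """Proxima CE recall minus recall tool recall, in percentage points."""
    return {method: round(ce - tool, 2) for method, (ce, tool) in rows.items()}


def largest_gap(rows):
    """Return the (method, gap) pair with the largest absolute gap."""
    return max(gaps(rows).items(), key=lambda item: abs(item[1]))


for data_type, rows in RESULTS.items():
    method, gap = largest_gap(rows)
    print(f"{data_type}: largest gap {gap:+.2f} points ({method})")
```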
