MaxCompute: End-to-end test for hash sharding

Last Updated: Mar 26, 2026

The end-to-end test for hash sharding validates the correctness of Proxima CE by comparing the recall rates reported by Proxima CE with those reported by the Proxima CE recall tool across multiple search methods and data types.

Test conclusion

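Across all tested data types and search methods, Proxima CE recall rates closely match the recall tool results: within 1 percentage point for FLOAT and INT8 data, and within about 4 percentage points for BINARY data. The correctness test meets expectations.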

Test procedure

Design the test method

Data preparation

  • Prepare random datasets of three data types: FLOAT, BINARY, and INT8.

  • For Proxima CE: convert datasets into MaxCompute tables.

  • For the C++ baseline: process data using the bench performance test tool provided in the Proxima kernel.

Note: The C++ baseline refers to the performance data measured by the Proxima kernel, which is written in C++. This data is used as the test benchmark.

Algorithm comparison

Run graph search, hierarchical clustering search, Satellite System Graph (SSG), graph clustering (GC), quantized clustering (QC), and linear search against each dataset, and collect the recall rates for both Proxima CE and the C++ baseline. Set Top K to 100. The Proxima CE recall tool calculates the recall rate from the 100 sample records in the query table. The primary comparison between Proxima CE and the recall tool is based on the linear search method.

Note: The Proxima2 recall tool is used in the same way and works on the same principle as the Proxima CE recall tool.
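The comparison above can be illustrated with a minimal sketch. It assumes the common definition of recall (the share of the exact top-K neighbors that an approximate search also returns) and uses a brute-force linear search as the exact reference. The source does not publish the recall tool's formula, so the function names `linear_search` and `recall_at_k` and the data layout are hypothetical, not the Proxima CE API.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def squared_euclidean(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance, the metric used for the FLOAT and INT8 tests."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hamming(a: Vector, b: Vector) -> int:
    """Hamming distance, the metric used for the BINARY test."""
    return sum(x != y for x, y in zip(a, b))


def linear_search(
    query: Vector,
    docs: Sequence[Tuple[int, Vector]],
    k: int,
    distance: Callable[[Vector, Vector], float],
) -> List[int]:
    """Exhaustively scan (pk, vector) pairs and return the keys of the k nearest."""
    nearest = heapq.nsmallest(k, docs, key=lambda item: distance(query, item[1]))
    return [pk for pk, _ in nearest]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int], k: int = 100) -> float:
    """Fraction of the exact top-k keys that also appear in the approximate top-k."""
    exact = set(exact_ids[:k])
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact.intersection(approx_ids[:k])) / len(exact)
```

Under this assumption, the recall rate for one search method would be the mean of `recall_at_k` over the 100 query records, with `k=100`.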

Prepare data and parameters

Datasets

Each query table contains 100 records extracted from the corresponding doc table.

| Data type | Dimensions | Records | Value range |
| --- | --- | --- | --- |
| FLOAT | 128 | 100,000 | (0,1) |
| INT8 | 128 | 100,000 | (-128,127) |
| BINARY | 512 | 100,000 | 0/1 |
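As a rough illustration of the dataset shapes listed above, the following sketch generates a random doc table and draws 100 query records from it. It is not the procedure the test actually used: the `DATASET_SPECS` and `make_tables` names are hypothetical, the random distribution is assumed to be uniform, and the treatment of range endpoints (for example, whether 127 is included for INT8) is an assumption. Loading the rows into MaxCompute tables is not shown.

```python
import random
from typing import List, Optional, Tuple

# Dimensions and record counts copied from the dataset table.
DATASET_SPECS = {
    "FLOAT": {"dim": 128, "records": 100_000},
    "INT8": {"dim": 128, "records": 100_000},
    "BINARY": {"dim": 512, "records": 100_000},
}


def random_vector(data_type: str, dim: int, rng: random.Random) -> list:
    """Draw one vector; uniform sampling is an assumption, not stated in the source."""
    if data_type == "FLOAT":
        return [rng.random() for _ in range(dim)]        # values in [0, 1)
    if data_type == "INT8":
        return [rng.randint(-128, 127) for _ in range(dim)]  # endpoints included
    if data_type == "BINARY":
        return [rng.randint(0, 1) for _ in range(dim)]   # each element is 0 or 1
    raise ValueError(f"unsupported data type: {data_type}")


def make_tables(
    data_type: str,
    records: Optional[int] = None,
    query_count: int = 100,
    seed: int = 0,
) -> Tuple[List[tuple], List[tuple]]:
    """Return (doc, query) lists of (pk, vector) rows; query rows are sampled from doc."""
    spec = DATASET_SPECS[data_type]
    n = spec["records"] if records is None else records
    rng = random.Random(seed)
    doc = [(pk, random_vector(data_type, spec["dim"], rng)) for pk in range(n)]
    query = rng.sample(doc, query_count)
    return doc, query
```

Passing a smaller `records` value, for example `make_tables("BINARY", records=1000)`, keeps a dry run fast. Pure-Python generation of the full 100,000 records is slow.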

Search parameters

| Search method | Parameter | Value |
| --- | --- | --- |
| Graph search | proxima.hnsw.searcher.ef | 400 |
| Graph search | proxima.hnsw.builder.efconstruction | 400 |
| Graph search | proxima.hnsw.builder.max_neighbor_count | 100 |
| Hierarchical clustering search | proxima.hc.builder.centroids_count | 2000 |
| Hierarchical clustering search | proxima.hc.searcher.max_scan_count | 80000 |
| Satellite System Graph (SSG) | proxima.hnsw.searcher.ef | 400 |
| Satellite System Graph (SSG) | proxima.hnsw.builder.efconstruction | 400 |
| Satellite System Graph (SSG) | proxima.hnsw.builder.max_neighbor_count | 100 |
| Graph clustering (GC) | proxima.gc.builder.centroid_count | 1000 |
| Graph clustering (GC) | proxima.gc.searcher.scan_ratio | 0.8 |
| Quantized clustering (QC) | proxima.qc.builder.centroid_count | 1000 |
| Quantized clustering (QC) | proxima.qc.searcher.scan_ratio | 0.8 |
| Linear search | (none listed) | |

View test results

FLOAT, SquaredEuclidean distance

| Search method | Proxima CE recall rate | Recall tool recall rate |
| --- | --- | --- |
| Graph search | 89.03% | 88.62% |
| Hierarchical clustering search | 98.91% | 98.14% |
| Satellite System Graph (SSG) | 96.00% | 95.76% |
| Graph clustering (GC) | 97.87% | 97.64% |
| Quantized clustering (QC) | 97.70% | 97.77% |
| Linear search | 100% | 100% |

INT8, SquaredEuclidean distance

| Search method | Proxima CE recall rate | Recall tool recall rate |
| --- | --- | --- |
| Graph search | 89.89% | 89.93% |
| Hierarchical clustering search | 98.27% | 97.69% |
| Satellite System Graph (SSG) | 95.58% | 95.75% |
| Graph clustering (GC) | 97.72% | 97.36% |
| Quantized clustering (QC) | 97.68% | 97.71% |
| Linear search | 100% | 100% |

BINARY, Hamming distance

| Search method | Proxima CE recall rate | Recall tool recall rate |
| --- | --- | --- |
| Graph search | 85.33% | 88.09% |
| Hierarchical clustering search | 91.45% | 95.27% |
| Satellite System Graph (SSG) | 75.89% | 77.83% |
| Graph clustering (GC) | 90.01% | 93.99% |
| Quantized clustering (QC) | 90.51% | 93.78% |
| Linear search | 100% | 100% |

Analyze the results

For the FLOAT and INT8 data types, Proxima CE recall rates are consistent with the recall tool results across all search methods: every difference is within 1 percentage point. For BINARY data, the differences are larger, up to approximately 4 percentage points (graph clustering and hierarchical clustering search), and Proxima CE reports the lower recall rate for every approximate search method. Linear search reaches 100% on both sides for all three data types. Overall, the Proxima CE recall result for each search method and data type is similar to the result obtained by using the recall tool.
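The gaps quoted in this analysis can be checked with a short script. The figures are copied from the three result tables; the `RESULTS` layout and the `max_abs_gap` helper are illustrative names, not part of any Proxima tool.

```python
# (Proxima CE recall rate, recall tool recall rate) in percent, from the result tables.
RESULTS = {
    "FLOAT": {
        "Graph search": (89.03, 88.62),
        "Hierarchical clustering search": (98.91, 98.14),
        "Satellite System Graph (SSG)": (96.00, 95.76),
        "Graph clustering (GC)": (97.87, 97.64),
        "Quantized clustering (QC)": (97.70, 97.77),
        "Linear search": (100.0, 100.0),
    },
    "INT8": {
        "Graph search": (89.89, 89.93),
        "Hierarchical clustering search": (98.27, 97.69),
        "Satellite System Graph (SSG)": (95.58, 95.75),
        "Graph clustering (GC)": (97.72, 97.36),
        "Quantized clustering (QC)": (97.68, 97.71),
        "Linear search": (100.0, 100.0),
    },
    "BINARY": {
        "Graph search": (85.33, 88.09),
        "Hierarchical clustering search": (91.45, 95.27),
        "Satellite System Graph (SSG)": (75.89, 77.83),
        "Graph clustering (GC)": (90.01, 93.99),
        "Quantized clustering (QC)": (90.51, 93.78),
        "Linear search": (100.0, 100.0),
    },
}


def max_abs_gap(rows: dict) -> float:
    """Largest absolute difference, in percentage points, between the two columns."""
    return max(abs(ce - tool) for ce, tool in rows.values())


for data_type, rows in RESULTS.items():
    print(f"{data_type}: largest gap {max_abs_gap(rows):.2f} percentage points")
# FLOAT: largest gap 0.77 percentage points
# INT8: largest gap 0.58 percentage points
# BINARY: largest gap 3.98 percentage points
```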