Proxima CE cluster sharding meets correctness expectations across datasets of different sizes, data types, and parameter combinations. This page presents test results and helps you choose the right parameters for your workload.
How to use these results
Tuning cluster sharding always involves a trade-off: higher recall means longer search time, and shorter search time means lower recall. Before reading the data tables, decide which metric matters more for your use case:
Recall-first: Prioritize the recall rate columns. Look for parameter combinations that keep recall at your target threshold (for example, ≥ 0.99 at top K = 50).
Latency-first: Prioritize the total search time in the logs. Choose the lowest centroid access rate that still meets your minimum recall requirement.
The three parameters that affect both metrics are:
| Parameter | Effect on recall | Effect on search time |
|---|---|---|
| Number of cluster centroids | Positive — more centroids, higher recall | Positive — more centroids, longer search time |
| Centroid access rate | Positive — higher rate, higher recall | Positive — higher rate, longer search time |
| Number of index shards | Negative — more shards, lower recall | Positive — more shards, longer search time |
Test conclusions
Five conclusions hold across all datasets:
More cluster centroids → higher recall. The number of cluster centroids is positively correlated with recall rate.
Higher centroid access rate → higher recall. The centroid access rate is positively correlated with recall rate.
More index shards → lower recall. The number of index shards is negatively correlated with recall rate.
Search time scales with all three parameters. A larger number of cluster centroids, a higher centroid access rate, and a larger number of index shards each increase total search time.
Top K = 1 recall is accurate in all scenarios. Across all tested combinations of centroids, shards, and access rates, top K = 1 recall meets expectations.
When the number of cluster centroids and shard count are fixed, increasing the centroid access rate raises the number of shards actually accessed — which increases both recall and search time.
Test procedure
20 million FLOAT records, 512 dimensions
Test configuration: sampling rate 50%, 1,000 cluster centroids, 10 index shards, Squared Euclidean distance, HNSW search method.
| Centroid access rate | Accessed index shards | Top K = 1 | Top K = 50 | Top K = 100 | Top K = 200 |
|---|---|---|---|---|---|
| 0.1 | 7.30 | 0.999 | 0.9992 | 0.9987 | 0.9974 |
| 0.05 | 6.35 | 0.999 | 0.9987 | 0.9979 | 0.9963 |
| 0.02 | 4.72 | 0.997 | 0.9925 | 0.9912 | 0.9889 |
| 0.01 | 3.49 | 0.992 | 0.9773 | 0.9743 | 0.9697 |
| 0.001 | 1 | 0.762 | 0.7011 | 0.6883 | 0.6762 |
The following log shows worker times when the centroid access rate is 0.1 (total: 250 minutes):
Vector search Data type:4 , Vector dimension:512 , Search method:hnsw , Measure:SquaredEuclidean , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao , Partition:20210707 , Number of data records in the doc table:19990000 , Vector delimiter:
Information about the query table Table name: query_table_pailitao , Partition:20210707 , Number of data records in the query table:100000 , Vector delimiter:
Information about the output table Table name: output_table_pailitao_cluster_2000w , Partition:20210707
Row and column information Number of rows: 10 , Number of columns:10 , Number of data records in the doc table of each column for index building:1999000
Whether to clear volume indexes:false
Time required for each worker node (seconds):
SegmentationWorker: 7
TmpTableWorker: 1
KmeansGraphWorker: 2419
BuildJobWorker: 9927
SeekJobWorker: 1026
TmpResultJoinWorker: 0
RecallWorker: 1675
CleanUpWorker: 4
Total time required (minutes):250
Sample commands:
jar -resources kmeans_center_resource_cl,proxima_ce_pailitao.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/pailitao proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao
-doc_table_partition 20210707
-query_table query_table_pailitao
-query_table_partition 20210707
-output_table output_table_pailitao_cluster_2000w
-output_table_partition 20210707
-data_type float
-dimension 512
-app_id 201220
-vector_separator blank
-pk_type int64
-row_num 10
-column_num 10
-clean_build_volume false
-job_mode build:seek:recall
-topk 1,50,100,200
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl
-kmeans_ratio 50
-kmeans_cluster_num 1000;100 million BINARY records, 512 dimensions
All three sub-tests use BINARY data, Hamming distance, and HNSW search. They vary by sampling rate, cluster centroid count, and shard count to demonstrate how each parameter affects recall and search time.
Sub-test 1: 30% sampling, 10,000 centroids, 400 shards
Test configuration: sampling rate 30%, 10,000 cluster centroids, 400 index shards, 20 rows × 400 columns, 250,000 records per column.
| Centroid access rate | Accessed index shards | Top K = 1 | Top K = 50 | Top K = 100 | Top K = 200 |
|---|---|---|---|---|---|
| 0.03 | 114.28 | 0.9995 | 0.9993 | 0.9994 | 0.9992 |
| 0.02 | 87.05 | 0.9995 | 0.9924 | 0.9922 | 0.9901 |
| 0.01 | 53.01 | 0.9995 | 0.9330 | 0.9220 | 0.9062 |
| 0.005 | 31.18 | 0.9995 | 0.7870 | 0.7561 | 0.7275 |
| 0.001 | 8.23 | 0.9995 | 0.4029 | 0.3572 | 0.3233 |
The following log shows worker times when the centroid access rate is 0.001 (total: 708 minutes):
Vector search Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:20210712 , Number of data records in the query table: 1010000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_0_001 , Partition:20210712
Row and column information Number of rows: 20 , Number of columns:400 , Number of data records in the doc table of each column for index building:250000
Whether to clear volume indexes:false
Time required for each worker node (seconds):
SegmentationWorker: 10
TmpTableWorker: 1
KmeansGraphWorker: 38636
BuildJobWorker: 1085
SeekJobWorker: 1845
TmpResultJoinWorker: 0
RecallWorker: 939
CleanUpWorker: 4
Total time required (minutes):708
Sample commands:
jar -resources kmeans_center_resource_cl_binary2,proxima_ce_g2.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary2/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_10000_0_001
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 20
-column_num 400
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary2
-kmeans_ratio 30
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.001;Sub-test 2: 50% sampling, 10,000 centroids, 100 shards
Test configuration: sampling rate 50%, 10,000 cluster centroids, 100 index shards, 10 rows × 100 columns, 1,000,000 records per column.
| Centroid access rate | Accessed index shards | Top K = 1 | Top K = 50 | Top K = 100 | Top K = 200 |
|---|---|---|---|---|---|
| 0.03 | 61.93 | 1.0 | 0.9999 | 1.0 | 1.0 |
| 0.02 | 51.43 | 1.0 | 0.9999 | 1.0 | 0.9999 |
| 0.01 | 35.59 | 1.0 | 0.9960 | 0.9962 | 0.9943 |
| 0.005 | 23.26 | 1.0 | 0.9494 | 0.9429 | 0.9309 |
Compared with sub-test 1, reducing the shard count from 400 to 100 (with the same 10,000 centroids) substantially improves recall at all centroid access rates. At a centroid access rate of 0.03, recall reaches at least 0.9999 across all top K values.
The following log shows worker times when the centroid access rate is 0.03 (total: 443 minutes):
Vector search Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:query_table_pailitao_binary3 , Number of data records in the query table:100000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_100_0_03 , Partition:20210712
Row and column information Number of rows: 10 , Number of columns:100 , Number of data records in the doc table of each column for index building:1000000
Whether to clear volume indexes:false
Time required for each worker node (seconds):
SegmentationWorker: 10
TmpTableWorker: 1
KmeansGraphWorker: 23760
BuildJobWorker: 1510
SeekJobWorker: 556
TmpResultJoinWorker: 0
RecallWorker: 787
CleanUpWorker: 4
Total time required (minutes):443
Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712_10w
-output_table output_table_pailitao_binary_cluster_10000_100_0_03
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 100
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 50
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.03;Sub-test 3: 100% sampling, 1,000 centroids, 20 shards
Test configuration: sampling rate 100%, 1,000 cluster centroids, 20 index shards, 10 rows × 20 columns, 5,000,000 records per column.
| Centroid access rate | Accessed index shards | Top K = 1 | Top K = 50 | Top K = 100 | Top K = 200 |
|---|---|---|---|---|---|
| 0.1 | 14.26 | 1.0 | 0.9829 | 0.9801 | 0.9587 |
| 0.02 | 8.43 | 1.0 | 0.7898 | 0.7760 | 0.7623 |
Compared with sub-tests 1 and 2, reducing the centroid count from 10,000 to 1,000 significantly lowers recall at higher top K values, even with full sampling and fewer shards.
The following log shows worker times when the centroid access rate is 0.02 (total: 266 minutes):
Vector search Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary2 , Partition:20210712 , Data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary2 , Partition:20210712 , Number of data records in the query table:1000000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_1000 , Partition:20210712
Row and column information Number of rows: 10 , Number of columns:20 , Number of data records in the doc table of each column for index building:5000000
Whether to clear volume indexes:false
Time required for each worker node (seconds):
SegmentationWorker: 2
TmpTableWorker: 1
KmeansGraphWorker: 4996
BuildJobWorker: 8727
SeekJobWorker: 1425
TmpResultJoinWorker: 0
RecallWorker: 857
CleanUpWorker: 4
Total time required (minutes):266
Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary2
-doc_table_partition 20210712
-query_table query_table_pailitao_binary2
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_1000
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 20
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 100
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 1000
-kmeans_seek_ratio 0.02;Cross-test analysis
Comparing the three BINARY sub-tests confirms all five conclusions stated above. The effect of centroid count is especially pronounced: cutting from 10,000 to 1,000 centroids reduces top K = 200 recall from ~0.999 (sub-test 1 at 0.03 access rate) to ~0.959 (sub-test 3 at 0.1 access rate) — even though sub-test 3 uses full sampling and far fewer shards. To achieve high recall at large top K, prioritize a sufficient centroid count over reducing shard count.