All Products
Search
Document Center

MaxCompute:Cluster sharding testing

Last Updated:Mar 26, 2026

Proxima CE cluster sharding meets correctness expectations across datasets of different sizes, data types, and parameter combinations. This page presents test results and helps you choose the right parameters for your workload.

How to use these results

Tuning cluster sharding always involves a trade-off: higher recall means longer search time, and shorter search time means lower recall. Before reading the data tables, decide which metric matters more for your use case:

  • Recall-first: Prioritize the recall rate columns. Look for parameter combinations that keep recall at your target threshold (for example, ≥ 0.99 at top K = 50).

  • Latency-first: Prioritize the total search time in the logs. Choose the lowest centroid access rate that still meets your minimum recall requirement.

The three parameters that affect both metrics are:

ParameterEffect on recallEffect on search time
Number of cluster centroidsPositive — more centroids, higher recallPositive — more centroids, longer search time
Centroid access ratePositive — higher rate, higher recallPositive — higher rate, longer search time
Number of index shardsNegative — more shards, lower recallPositive — more shards, longer search time

Test conclusions

Five conclusions hold across all datasets:

  1. More cluster centroids → higher recall. The number of cluster centroids is positively correlated with recall rate.

  2. Higher centroid access rate → higher recall. The centroid access rate is positively correlated with recall rate.

  3. More index shards → lower recall. The number of index shards is negatively correlated with recall rate.

  4. Search time scales with all three parameters. A larger number of cluster centroids, a higher centroid access rate, and a larger number of index shards each increase total search time.

  5. Top K = 1 recall is accurate in all scenarios. Across all tested combinations of centroids, shards, and access rates, top K = 1 recall meets expectations.

When the number of cluster centroids and shard count are fixed, increasing the centroid access rate raises the number of shards actually accessed — which increases both recall and search time.

Test procedure

20 million FLOAT records, 512 dimensions

Test configuration: sampling rate 50%, 1,000 cluster centroids, 10 index shards, Squared Euclidean distance, HNSW search method.

Centroid access rateAccessed index shardsTop K = 1Top K = 50Top K = 100Top K = 200
0.17.300.9990.99920.99870.9974
0.056.350.9990.99870.99790.9963
0.024.720.9970.99250.99120.9889
0.013.490.9920.97730.97430.9697
0.00110.7620.70110.68830.6762

The following log shows worker times when the centroid access rate is 0.1 (total: 250 minutes):

Vector search  Data type:4 , Vector dimension:512 , Search method:hnsw , Measure:SquaredEuclidean , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao , Partition:20210707 , Number of data records in the doc table:19990000 , Vector delimiter:
Information about the query table Table name: query_table_pailitao , Partition:20210707 , Number of data records in the query table:100000 , Vector delimiter:
Information about the output table Table name: output_table_pailitao_cluster_2000w , Partition:20210707
Row and column information  Number of rows: 10 , Number of columns:10 , Number of data records in the doc table of each column for index building:1999000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:          7
TmpTableWorker:              1
KmeansGraphWorker:           2419
BuildJobWorker:              9927
SeekJobWorker:               1026
TmpResultJoinWorker:         0
RecallWorker:                1675
CleanUpWorker:               4
Total time required (minutes):250

Sample commands:
jar -resources kmeans_center_resource_cl,proxima_ce_pailitao.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/pailitao proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar   com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao
-doc_table_partition 20210707
-query_table query_table_pailitao
-query_table_partition 20210707
-output_table output_table_pailitao_cluster_2000w
-output_table_partition 20210707
-data_type float
-dimension 512
-app_id 201220
-vector_separator blank
-pk_type int64
-row_num 10
-column_num 10
-clean_build_volume false
-job_mode build:seek:recall
-topk 1,50,100,200
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl
-kmeans_ratio 50
-kmeans_cluster_num 1000;

100 million BINARY records, 512 dimensions

All three sub-tests use BINARY data, Hamming distance, and HNSW search. They vary by sampling rate, cluster centroid count, and shard count to demonstrate how each parameter affects recall and search time.

Sub-test 1: 30% sampling, 10,000 centroids, 400 shards

Test configuration: sampling rate 30%, 10,000 cluster centroids, 400 index shards, 20 rows × 400 columns, 250,000 records per column.

Centroid access rateAccessed index shardsTop K = 1Top K = 50Top K = 100Top K = 200
0.03114.280.99950.99930.99940.9992
0.0287.050.99950.99240.99220.9901
0.0153.010.99950.93300.92200.9062
0.00531.180.99950.78700.75610.7275
0.0018.230.99950.40290.35720.3233

The following log shows worker times when the centroid access rate is 0.001 (total: 708 minutes):

Vector search  Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:20210712 , Number of data records in the query table: 1010000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_0_001 , Partition:20210712
Row and column information  Number of rows: 20 , Number of columns:400 , Number of data records in the doc table of each column for index building:250000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:        10
TmpTableWorker:        1
KmeansGraphWorker:        38636
BuildJobWorker:        1085
SeekJobWorker:        1845
TmpResultJoinWorker:        0
RecallWorker:        939
CleanUpWorker:        4
Total time required (minutes):708

Sample commands:
jar -resources kmeans_center_resource_cl_binary2,proxima_ce_g2.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary2/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar  com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_10000_0_001
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 20
-column_num 400
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary2
-kmeans_ratio 30
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.001;

Sub-test 2: 50% sampling, 10,000 centroids, 100 shards

Test configuration: sampling rate 50%, 10,000 cluster centroids, 100 index shards, 10 rows × 100 columns, 1,000,000 records per column.

Centroid access rateAccessed index shardsTop K = 1Top K = 50Top K = 100Top K = 200
0.0361.931.00.99991.01.0
0.0251.431.00.99991.00.9999
0.0135.591.00.99600.99620.9943
0.00523.261.00.94940.94290.9309

Compared with sub-test 1, reducing the shard count from 400 to 100 (with the same 10,000 centroids) substantially improves recall at all centroid access rates. At a centroid access rate of 0.03, recall reaches at least 0.9999 across all top K values.

The following log shows worker times when the centroid access rate is 0.03 (total: 443 minutes):

Vector search  Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:query_table_pailitao_binary3 , Number of data records in the query table:100000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_100_0_03 , Partition:20210712
Row and column information  Number of rows: 10 , Number of columns:100 , Number of data records in the doc table of each column for index building:1000000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:        10
TmpTableWorker:        1
KmeansGraphWorker:        23760
BuildJobWorker:        1510
SeekJobWorker:        556
TmpResultJoinWorker:        0
RecallWorker:        787
CleanUpWorker:        4
Total time required (minutes):443

Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar  com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712_10w
-output_table output_table_pailitao_binary_cluster_10000_100_0_03
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 100
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 50
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.03;

Sub-test 3: 100% sampling, 1,000 centroids, 20 shards

Test configuration: sampling rate 100%, 1,000 cluster centroids, 20 index shards, 10 rows × 20 columns, 5,000,000 records per column.

Centroid access rateAccessed index shardsTop K = 1Top K = 50Top K = 100Top K = 200
0.114.261.00.98290.98010.9587
0.028.431.00.78980.77600.7623

Compared with sub-tests 1 and 2, reducing the centroid count from 10,000 to 1,000 significantly lowers recall at higher top K values, even with full sampling and fewer shards.

The following log shows worker times when the centroid access rate is 0.02 (total: 266 minutes):

Vector search  Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary2 , Partition:20210712 , Data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary2 , Partition:20210712 , Number of data records in the query table:1000000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_1000 , Partition:20210712
Row and column information  Number of rows: 10 , Number of columns:20 , Number of data records in the doc table of each column for index building:5000000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:        2
TmpTableWorker:        1
KmeansGraphWorker:        4996
BuildJobWorker:        8727
SeekJobWorker:        1425
TmpResultJoinWorker:        0
RecallWorker:        857
CleanUpWorker:        4
Total time required (minutes):266

Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar  com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary2
-doc_table_partition 20210712
-query_table query_table_pailitao_binary2
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_1000
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 20
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 100
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 1000
-kmeans_seek_ratio 0.02;

Cross-test analysis

Comparing the three BINARY sub-tests confirms all five conclusions stated above. The effect of centroid count is especially pronounced: cutting from 10,000 to 1,000 centroids reduces top K = 200 recall from ~0.999 (sub-test 1 at 0.03 access rate) to ~0.959 (sub-test 3 at 0.1 access rate) — even though sub-test 3 uses full sampling and far fewer shards. To achieve high recall at large top K, prioritize a sufficient centroid count over reducing shard count.