This topic describes the conclusions and procedure of the cluster sharding tests for Proxima CE.
Test conclusion
Each dataset is tested on Proxima CE with a different sampling rate, number of cluster centroids, and number of index shards. The measured recall rates and search periods indicate that cluster sharding in Proxima CE meets correctness expectations. The following conclusions can be drawn from the tests:
- The number of cluster centroids is positively correlated with the recall rate: more cluster centroids yield a higher recall rate.
- The centroid access rate is positively correlated with the recall rate: a higher centroid access rate yields a higher recall rate.
- The number of index shards is negatively correlated with the recall rate: more index shards yield a lower recall rate.
- The search period is positively correlated with the number of cluster centroids, the number of index shards, and the centroid access rate: increasing any of these lengthens the search period.
- Across all tested combinations of cluster centroid count, index shard count, and centroid access rate, the recall rate at top K = 1 remains high (0.9995 to 1.0 in the reported results).
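The tests report recall rates without stating how they are computed. A common definition, assumed here, is the fraction of the exact top-K neighbors that the approximate search also returns, averaged over all queries. The function names below are hypothetical and are not part of Proxima CE.

```python
from typing import Sequence


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int], k: int) -> float:
    """Fraction of the exact top-k neighbors that also appear in the approximate top-k.

    Assumed definition for illustration; the source does not specify the formula.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    hits = len(exact_top.intersection(approx_ids[:k]))
    return hits / len(exact_top)


def mean_recall_at_k(approx: Sequence[Sequence[int]],
                     exact: Sequence[Sequence[int]],
                     k: int) -> float:
    """Average recall@k over all queries."""
    if len(approx) != len(exact) or not approx:
        raise ValueError("need the same non-zero number of result lists")
    return sum(recall_at_k(a, e, k) for a, e in zip(approx, exact)) / len(approx)
```

Under this definition, a recall rate of 1.0 at top K = 1 means that the nearest neighbor found through the accessed shards matches the exact nearest neighbor for every query.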
Test procedure
Test on 20 million data records of the FLOAT data type with 512 dimensions
In the test, the sampling rate is 50%, 1,000 clusters exist, and 10 index shards are built.
| Centroid access rate | Number of accessed index shards | Recall rates when Top K is 1, 50, 100, and 200 |
| --- | --- | --- |
| 0.1 | 7.30 | |
| 0.05 | 6.35 | |
| 0.02 | 4.72 | |
| 0.01 | 3.49 | |
| 0.001 | 1 | |
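One plausible reading of the table, stated here as an assumption rather than documented behavior: the centroid access rate is the fraction of cluster centroids that each query visits, and the number of accessed index shards is the count of distinct shards that hold those centroids. With 1,000 centroids, a rate of 0.001 visits a single centroid and therefore a single shard, which matches the last row. The sketch below illustrates this routing idea with hypothetical function names; it is not the Proxima CE implementation.

```python
from typing import List, Sequence, Set


def centroids_to_visit(num_centroids: int, access_rate: float) -> int:
    """Number of nearest centroids a query visits (at least one)."""
    return max(1, round(num_centroids * access_rate))


def accessed_shards(query: Sequence[float],
                    centroids: List[Sequence[float]],
                    centroid_shard: List[int],
                    access_rate: float) -> Set[int]:
    """Distinct shards that own the nearest centroids of a query.

    centroid_shard[i] is the shard that stores cluster i (assumed layout).
    """
    n = centroids_to_visit(len(centroids), access_rate)

    def squared_euclidean(c: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(query, c))

    nearest = sorted(range(len(centroids)),
                     key=lambda i: squared_euclidean(centroids[i]))[:n]
    return {centroid_shard[i] for i in nearest}
```

Because several of the nearest centroids can live in the same shard, the average number of accessed shards grows more slowly than the number of visited centroids.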
The following log shows the search period when the centroid access rate is 0.1.
Vector search Data type:4 , Vector dimension:512 , Search method:hnsw , Measure:SquaredEuclidean , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao , Partition:20210707 , Number of data records in the doc table:19990000 , Vector delimiter:
Information about the query table Table name: query_table_pailitao , Partition:20210707 , Number of data records in the query table:100000 , Vector delimiter:
Information about the output table Table name: output_table_pailitao_cluster_2000w , Partition:20210707
Row and column information Number of rows: 10 , Number of columns:10 , Number of data records in the doc table of each column for index building:1999000
Whether to clear volume indexes:false
Time required for each worker node (seconds):
SegmentationWorker: 7
TmpTableWorker: 1
KmeansGraphWorker: 2419
BuildJobWorker: 9927
SeekJobWorker: 1026
TmpResultJoinWorker: 0
RecallWorker: 1675
CleanUpWorker: 4
Total time required (minutes):250
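The per-worker times in this log can be checked against the reported total. The numbers suggest that the total is the sum of the worker times in seconds, converted to whole minutes by integer division; this is an observation from the log, not a documented rule.

```python
# Worker times in seconds, copied from the log above.
worker_seconds = {
    "SegmentationWorker": 7,
    "TmpTableWorker": 1,
    "KmeansGraphWorker": 2419,
    "BuildJobWorker": 9927,
    "SeekJobWorker": 1026,
    "TmpResultJoinWorker": 0,
    "RecallWorker": 1675,
    "CleanUpWorker": 4,
}

total_seconds = sum(worker_seconds.values())  # 15059
total_minutes = total_seconds // 60           # 250, matches the log

# Share of the total run time spent in each worker, largest first.
for name, seconds in sorted(worker_seconds.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds} s ({seconds / total_seconds:.1%})")
print(f"Total: {total_seconds} s = {total_minutes} min")
```

In this run, index building (BuildJobWorker) and k-means clustering (KmeansGraphWorker) account for most of the elapsed time, while the seek stage (SeekJobWorker) takes 1,026 seconds.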
Sample commands:
jar -resources kmeans_center_resource_cl,proxima_ce_pailitao.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/pailitao/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao
-doc_table_partition 20210707
-query_table query_table_pailitao
-query_table_partition 20210707
-output_table output_table_pailitao_cluster_2000w
-output_table_partition 20210707
-data_type float
-dimension 512
-app_id 201220
-vector_separator blank
-pk_type int64
-row_num 10
-column_num 10
-clean_build_volume false
-job_mode build:seek:recall
-topk 1,50,100,200
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl
-kmeans_ratio 50
-kmeans_cluster_num 1000;
Test on 100 million data records of the BINARY data type with 512 dimensions
- In the test, the sampling rate is 30%, 10,000 clusters exist, and 400 index shards are built.
| Centroid access rate | Number of accessed index shards | Recall rate (Top K = 1) | Recall rate (Top K = 50) | Recall rate (Top K = 100) | Recall rate (Top K = 200) |
| --- | --- | --- | --- | --- | --- |
| 0.03 | 114.28 | 0.9995 | 0.9992600000000001 | 0.9993500000000002 | 0.9991800000000003 |
| 0.02 | 87.05 | 0.9995 | 0.9923800000000012 | 0.9921900000000013 | 0.9900700000000019 |
| 0.01 | 53.01 | 0.9995 | 0.9330400000000014 | 0.9219700000000022 | 0.9062475000000048 |
| 0.005 | 31.18 | 0.9995 | 0.7870400000000013 | 0.7560500000000001 | 0.7274675000000013 |
| 0.001 | 8.23 | 0.9995 | 0.4029200000000014 | 0.3572000000000005 | 0.32333000000000034 |
The following log shows the search period when the centroid access rate is 0.001.
Vector search Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:20210712 , Number of data records in the query table: 1010000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_0_001 , Partition:20210712
Row and column information Number of rows: 20 , Number of columns:400 , Number of data records in the doc table of each column for index building:250000
Whether to clear volume indexes:false
Time required for each worker node (seconds):
SegmentationWorker: 10
TmpTableWorker: 1
KmeansGraphWorker: 38636
BuildJobWorker: 1085
SeekJobWorker: 1845
TmpResultJoinWorker: 0
RecallWorker: 939
CleanUpWorker: 4
Total time required (minutes):708
Sample commands:
jar -resources kmeans_center_resource_cl_binary2,proxima_ce_g2.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary2/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_10000_0_001
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 20
-column_num 400
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary2
-kmeans_ratio 30
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.001;
Result analysis: When the number of cluster centroids and the number of shards are fixed, a high centroid access rate increases the number of index shards that are actually accessed and causes a high recall rate and a long search period.
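The recall part of this analysis can be checked directly against the reported figures for the test with 10,000 clusters and 400 index shards. The sketch below copies the values from that test and verifies that both the number of accessed shards and the recall rate at every top K never decrease as the centroid access rate rises. The search period is not checked, because only one log is given for this configuration. The helper name is hypothetical.

```python
# (centroid access rate, accessed shards, {top K: recall rate})
# Values copied from the test with 10,000 clusters and 400 index shards.
rows = [
    (0.03, 114.28, {1: 0.9995, 50: 0.9992600000000001, 100: 0.9993500000000002, 200: 0.9991800000000003}),
    (0.02, 87.05, {1: 0.9995, 50: 0.9923800000000012, 100: 0.9921900000000013, 200: 0.9900700000000019}),
    (0.01, 53.01, {1: 0.9995, 50: 0.9330400000000014, 100: 0.9219700000000022, 200: 0.9062475000000048}),
    (0.005, 31.18, {1: 0.9995, 50: 0.7870400000000013, 100: 0.7560500000000001, 200: 0.7274675000000013}),
    (0.001, 8.23, {1: 0.9995, 50: 0.4029200000000014, 100: 0.3572000000000005, 200: 0.32333000000000034}),
]


def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Order the rows by ascending centroid access rate before comparing.
by_rate = sorted(rows, key=lambda r: r[0])

shards_grow = is_non_decreasing([r[1] for r in by_rate])
recall_grows = {
    k: is_non_decreasing([r[2][k] for r in by_rate]) for k in (1, 50, 100, 200)
}

print("accessed shards grow with access rate:", shards_grow)
for k, ok in recall_grows.items():
    print(f"recall at top K = {k} grows with access rate:", ok)
```

The recall rate at top K = 1 is flat at 0.9995 across all rates, while the recall rates at larger top K values drop sharply once fewer than about 50 shards are accessed.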
- In the test, the sampling rate is 50%, 10,000 clusters exist, and 100 index shards are built.
| Centroid access rate | Number of accessed index shards | Recall rate (Top K = 1) | Recall rate (Top K = 50) | Recall rate (Top K = 100) | Recall rate (Top K = 200) |
| --- | --- | --- | --- | --- | --- |
| 0.03 | 61.93 | 1.0 | 0.9999199999999999 | 1.0 | 1.0 |
| 0.02 | 51.43 | 1.0 | 0.99986 | 1.0 | 0.999985 |
| 0.01 | 35.59 | 1.0 | 0.9960400000000004 | 0.9961900000000005 | 0.9942699999999994 |
| 0.005 | 23.26 | 1.0 | 0.9493600000000024 | 0.9429200000000031 | 0.9308524999999989 |
The following log shows the search period when the centroid access rate is 0.03.
Vector search Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:query_table_pailitao_binary3 , Number of data records in the query table:100000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_100_0_03 , Partition:20210712
Row and column information Number of rows: 10 , Number of columns:100 , Number of data records in the doc table of each column for index building:1000000
Whether to clear volume indexes:false
Time required for each worker node (seconds):
SegmentationWorker: 10
TmpTableWorker: 1
KmeansGraphWorker: 23760
BuildJobWorker: 1510
SeekJobWorker: 556
TmpResultJoinWorker: 0
RecallWorker: 787
CleanUpWorker: 4
Total time required (minutes):443
Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712_10w
-output_table output_table_pailitao_binary_cluster_10000_100_0_03
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 100
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 50
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.03;
- In the test, the sampling rate is 100%, 1,000 clusters exist, and 20 index shards are built.
| Centroid access rate | Number of accessed index shards | Recall rate (Top K = 1) | Recall rate (Top K = 50) | Recall rate (Top K = 100) | Recall rate (Top K = 200) |
| --- | --- | --- | --- | --- | --- |
| 0.1 | 14.26 | 1.0 | 0.9828800000000085 | 0.9801000000000099 | 0.9586999999999933 |
| 0.02 | 8.43 | 1.0 | 0.7897500000000025 | 0.7759649999999999 | 0.7622724999999989 |
The following log shows the search period when the centroid access rate is 0.02.
Vector search Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary2 , Partition:20210712 , Data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary2 , Partition:20210712 , Number of data records in the query table:1000000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_1000 , Partition:20210712
Row and column information Number of rows: 10 , Number of columns:20 , Number of data records in the doc table of each column for index building:5000000
Whether to clear volume indexes:false
Time required for each worker node (seconds):
SegmentationWorker: 2
TmpTableWorker: 1
KmeansGraphWorker: 4996
BuildJobWorker: 8727
SeekJobWorker: 1425
TmpResultJoinWorker: 0
RecallWorker: 857
CleanUpWorker: 4
Total time required (minutes):266
Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary2
-doc_table_partition 20210712
-query_table query_table_pailitao_binary2
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_1000
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 20
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 100
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 1000
-kmeans_seek_ratio 0.02;
Result analysis: The following conclusions can be drawn from the comparison of the preceding tests:
- The number of cluster centroids is positively correlated with the recall rate: more cluster centroids yield a higher recall rate.
- The centroid access rate is positively correlated with the recall rate: a higher centroid access rate yields a higher recall rate.
- The number of index shards is negatively correlated with the recall rate: more index shards yield a lower recall rate.
- The search period is positively correlated with the number of cluster centroids, the number of index shards, and the centroid access rate: increasing any of these lengthens the search period.
- Across all tested combinations of cluster centroid count, index shard count, and centroid access rate, the recall rate at top K = 1 remains high (0.9995 to 1.0 in the reported results).
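The comparison across configurations can be made concrete at a centroid access rate of 0.02, the only rate reported for all three BINARY tests. The sketch below copies the recall rates at top K = 50 from those tests. Note that the sampling rate also differs between the configurations, so the comparison does not isolate a single factor; the variable names are hypothetical.

```python
# Recall rate at top K = 50 for a centroid access rate of 0.02,
# copied from the three BINARY tests (100 million records, 512 dimensions).
configs = [
    {"clusters": 10000, "shards": 400, "sampling": 30, "recall_top50": 0.9923800000000012},
    {"clusters": 10000, "shards": 100, "sampling": 50, "recall_top50": 0.99986},
    {"clusters": 1000, "shards": 20, "sampling": 100, "recall_top50": 0.7897500000000025},
]

# Same cluster count, fewer shards: recall is higher with 100 shards than with 400.
fewer_shards_better = configs[1]["recall_top50"] > configs[0]["recall_top50"]

# Ten times more clusters: both 10,000-cluster runs beat the 1,000-cluster run.
more_clusters_better = all(
    c["recall_top50"] > configs[2]["recall_top50"] for c in configs[:2]
)

for c in sorted(configs, key=lambda c: c["recall_top50"], reverse=True):
    print(f'{c["clusters"]} clusters, {c["shards"]} shards, '
          f'{c["sampling"]}% sampling: recall@50 = {c["recall_top50"]:.5f}')
```

At the same access rate, a configuration with 10,000 clusters visits ten times as many centroids as one with 1,000 clusters, which is consistent with the higher recall rates observed for the larger cluster counts.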