Cluster sharding testing - MaxCompute - Alibaba Cloud Documentation Center

This topic describes the test conclusion and procedure of cluster sharding testing.

Test conclusion

The sampling rate, the number of cluster centroids, and the number of index shards that are configured for each dataset are different on Proxima CE. The test results on the recall rates and search period indicate that cluster sharding of Proxima CE meets the correctness expectations. The following conclusions can be drawn from the tests:

The number of cluster centroids is positively correlated with the recall rate. A large number of cluster centroids causes a high recall rate.
The centroid access rate is positively correlated with the recall rate. A high centroid access rate causes a high recall rate.
The number of index shards is negatively correlated with the recall rate. A large number of index shards causes a low recall rate.
The search period is positively correlated with the number of cluster centroids, the number of index shards, and the centroid access rate. A large number of cluster centroids, a high centroid access rate, and a large number of index shards cause a long search period.
For cluster sharding in scenarios that have different numbers of cluster centroids, different numbers of index shards, and different centroid access rates, the recall rates that are obtained when top K is 1 are accurate.

Test procedure

Test on 20 million data records of the FLOAT data type with 512 dimensions

In the test, the sampling rate is 50%, 1,000 clusters exist, and 10 index shards are built.

Centroid access rate	Number of accessed index shards	Recall rates when Top K is 1, 50, 100, and 200
0.1	7.30	1: 0.999 50: 0.9992400000000005 100: 0.9987400000000008 200: 0.9974424999999909
0.05	6.35	1: 0.999 50: 0.998660000000001 100: 0.9979400000000015 200: 0.9963449999999905
0.02	4.72	1 : 0.997 50 : 0.9924600000000039 100 : 0.9912400000000047 200 : 0.9889074999999895
0.01	3.49	1: 0.992 50: 0.9773200000000073 100: 0.9742900000000084 200: 0.9696999999999908
0.001	1	1 : 0.762 50 : 0.7010600000000025 100 : 0.6882600000000016 200 : 0.676167499999999

The following log shows the search period when the centroid access rate is 0.1.

Vector search  Data type:4 , Vector dimension:512 , Search method:hnsw , Measure:SquaredEuclidean , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao , Partition:20210707 , Number of data records in the doc table:19990000 , Vector delimiter:
Information about the query table Table name: query_table_pailitao , Partition:20210707 , Number of data records in the query table:100000 , Vector delimiter:
Information about the output table Table name: output_table_pailitao_cluster_2000w , Partition:20210707
Row and column information  Number of rows: 10 , Number of columns:10 , Number of data records in the doc table of each column for index building:1999000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:          7
TmpTableWorker:              1
KmeansGraphWorker:           2419
BuildJobWorker:              9927
SeekJobWorker:               1026
TmpResultJoinWorker:         0
RecallWorker:                1675
CleanUpWorker:               4
Total time required (minutes):250

Sample commands:
jar -resources kmeans_center_resource_cl,proxima_ce_pailitao.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/pailitao proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar   com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao
-doc_table_partition 20210707
-query_table query_table_pailitao
-query_table_partition 20210707
-output_table output_table_pailitao_cluster_2000w
-output_table_partition 20210707
-data_type float
-dimension 512
-app_id 201220
-vector_separator blank
-pk_type int64
-row_num 10
-column_num 10
-clean_build_volume false
-job_mode build:seek:recall
-topk 1,50,100,200
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl
-kmeans_ratio 50
-kmeans_cluster_num 1000;

Test on 100 million data records of the BINARY data type with 512 dimensions

In the test, the sampling rate is 30%, 10,000 clusters exist, and 400 index shards are built.

Centroid access rate	Number of accessed index shards	Recall rates when Top K is 1, 50, 100, and 200
0.03	114.28	1: 0.9995 50: 0.9992600000000001 100: 0.9993500000000002 200: 0.9991800000000003
0.02	87.05	1: 0.9995 50: 0.9923800000000012 100: 0.9921900000000013 200: 0.9900700000000019
0.01	53.01	1: 0.9995 50: 0.9330400000000014 100: 0.9219700000000022 200: 0.9062475000000048
0.005	31.18	1: 0.9995 50: 0.7870400000000013 100: 0.7560500000000001 200: 0.7274675000000013
0.001	8.23	1: 0.9995 50: 0.4029200000000014 100: 0.3572000000000005 200: 0.32333000000000034

The following log shows the search period when the centroid access rate is 0.001.

Vector search  Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:20210712 , Number of data records in the query table: 1010000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_0_001 , Partition:20210712
Row and column information  Number of rows: 20 , Number of columns:400 , Number of data records in the doc table of each column for index building:250000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:        10
TmpTableWorker:        1
KmeansGraphWorker:        38636
BuildJobWorker:        1085
SeekJobWorker:        1845
TmpResultJoinWorker:        0
RecallWorker:        939
CleanUpWorker:        4
Total time required (minutes):708

Sample commands:
jar -resources kmeans_center_resource_cl_binary2,proxima_ce_g2.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary2/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar  com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_10000_0_001
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 20
-column_num 400
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary2
-kmeans_ratio 30
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.001;

Result analysis: When the number of cluster centroids and the number of shards are fixed, a high centroid access rate increases the number of index shards that are actually accessed and causes a high recall rate and a long search period.

In the test, the sampling rate is 50%, 10,000 clusters exist, and 100 index shards are built.

Centroid access rate	Number of accessed index shards	Recall rates when Top K is 1, 50, 100, and 200
0.03	61.93	1: 1.0 50: 0.9999199999999999 100: 1.0 200: 1.0
0.02	51.43	1: 1.0 50: 0.99986 100: 1.0 200: 0.999985
0.01	35.59	1: 1.0 50: 0.9960400000000004 100: 0.9961900000000005 200: 0.9942699999999994
0.005	23.26	1: 1.0 50: 0.9493600000000024 100: 0.9429200000000031 200: 0.9308524999999989

The following log shows the search period when the centroid access rate is 0.03.

Vector search  Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:query_table_pailitao_binary3 , Number of data records in the query table:100000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_100_0_03 , Partition:20210712
Row and column information  Number of rows: 10 , Number of columns:100 , Number of data records in the doc table of each column for index building:1000000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:        10
TmpTableWorker:        1
KmeansGraphWorker:        23760
BuildJobWorker:        1510
SeekJobWorker:        556
TmpResultJoinWorker:        0
RecallWorker:        787
CleanUpWorker:        4
Total time required (minutes):443

Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar  com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712_10w
-output_table output_table_pailitao_binary_cluster_10000_100_0_03
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 100
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 50
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.03;

In the test, the sampling rate is 100%, 1,000 clusters exist, and 20 index shards are built.

Centroid access rate	Number of accessed index shards	Recall rates when Top K is 1, 50, 100, and 200
0.1	14.26	1: 1.0 50: 0.9828800000000085 100: 0.9801000000000099 200: 0.9586999999999933
0.02	8.43	1: 1.0 50: 0.7897500000000025 100: 0.7759649999999999 200: 0.7622724999999989

The following log shows the search period when the centroid access rate is 0.02.

Vector search  Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary2 , Partition:20210712 , Data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary2 , Partition:20210712 , Number of data records in the query table:1000000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_1000 , Partition:20210712
Row and column information  Number of rows: 10 , Number of columns:20 , Number of data records in the doc table of each column for index building:5000000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:        2
TmpTableWorker:        1
KmeansGraphWorker:        4996
BuildJobWorker:        8727
SeekJobWorker:        1425
TmpResultJoinWorker:        0
RecallWorker:        857
CleanUpWorker:        4
Total time required (minutes):266

Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar  com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary2
-doc_table_partition 20210712
-query_table query_table_pailitao_binary2
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_1000
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 20
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 100
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 1000
-kmeans_seek_ratio 0.02;

Result analysis: The following conclusions can be drawn from the comparison of the preceding tests:

The number of cluster centroids is positively correlated with the recall rate. A large number of cluster centroids causes a high recall rate.
The centroid access rate is positively correlated with the recall rate. A high centroid access rate causes a high recall rate.
The number of index shards is negatively correlated with the recall rate. A large number of index shards causes a low recall rate.
The search period is positively correlated with the number of cluster centroids, the number of index shards, and the centroid access rate. A large number of cluster centroids, a high centroid access rate, and a large number of index shards cause a long search period.
For cluster sharding in scenarios that have different numbers of cluster centroids, different numbers of index shards, and different centroid access rates, the recall rates that are obtained when top K is 1 are accurate.