Cluster sharding test - MaxCompute - Alibaba Cloud Documentation Center

This topic describes the conclusions and the procedure of the cluster sharding test.

Test conclusions

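The sampling rate, the number of cluster centroids, and the number of index shards configured in Proxima CE differ for each dataset. The recall rate and search duration results show that cluster sharding in Proxima CE meets the accuracy expectations. The following conclusions can be drawn from the tests: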

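The number of cluster centroids is positively correlated with the recall rate. More cluster centroids result in a higher recall rate.
The centroid access rate is positively correlated with the recall rate. A higher centroid access rate results in a higher recall rate.
The number of index shards is negatively correlated with the recall rate. More index shards result in a lower recall rate.
The search duration is positively correlated with the number of cluster centroids, the number of index shards, and the centroid access rate. More cluster centroids, a higher centroid access rate, and more index shards all result in a longer search duration.
For cluster sharding in scenarios with different numbers of cluster centroids, different numbers of index shards, and different centroid access rates, the recall rate obtained when top K is 1 is accurate.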

Test procedure

Test on 20 million data records of the FLOAT data type with 512 dimensions

In this test, the sampling rate is 50%, 1,000 clusters exist, and 10 index shards are created.

Centroid access rate	Number of index shards accessed	Recall rate (top K = 1)	Recall rate (top K = 50)	Recall rate (top K = 100)	Recall rate (top K = 200)
0.1	7.30	0.999	0.9992400000000005	0.9987400000000008	0.9974424999999909
0.05	6.35	0.999	0.998660000000001	0.9979400000000015	0.9963449999999905
0.02	4.72	0.997	0.9924600000000039	0.9912400000000047	0.9889074999999895
0.01	3.49	0.992	0.9773200000000073	0.9742900000000084	0.9696999999999908
0.001	1	0.762	0.7010600000000025	0.6882600000000016	0.676167499999999
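The recall rates in the table can be read as recall at top K. The following is a minimal sketch, assuming that recall at top K is the fraction of the exact top K neighbors that the sharded search also returns. The source does not give the exact formula that Proxima CE uses, and the function name and sample primary keys are hypothetical.

```python
def recall_at_k(retrieved_ids, exact_ids, k):
    """Fraction of the exact top-k neighbors that appear in the retrieved top-k.

    Assumed definition; the formula used by Proxima CE's RecallWorker
    is not stated in the source.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & exact_top)
    return hits / len(exact_top)


# Hypothetical primary keys for a single query.
exact = [11, 42, 7, 93, 5]       # ground truth from an exhaustive search
retrieved = [11, 7, 42, 18, 5]   # result returned by the sharded index

print(recall_at_k(retrieved, exact, 1))  # the top-1 result matches
print(recall_at_k(retrieved, exact, 5))  # 4 of the 5 exact neighbors are found
```

Averaging this value over all query vectors would give one cell of the table under the stated assumption.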

The following log shows the search duration when the centroid access rate is 0.1.

Vector search  Data type:4 , Vector dimension:512 , Search method:hnsw , Measure:SquaredEuclidean , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao , Partition:20210707 , Number of data records in the doc table:19990000 , Vector delimiter:
Information about the query table Table name: query_table_pailitao , Partition:20210707 , Number of data records in the query table:100000 , Vector delimiter:
Information about the output table Table name: output_table_pailitao_cluster_2000w , Partition:20210707
Row and column information  Number of rows: 10 , Number of columns:10 , Number of data records in the doc table of each column for index building:1999000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:          7
TmpTableWorker:              1
KmeansGraphWorker:           2419
BuildJobWorker:              9927
SeekJobWorker:               1026
TmpResultJoinWorker:         0
RecallWorker:                1675
CleanUpWorker:               4
Total time required (minutes):250
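The per-worker times in this log add up to 15,059 seconds, which is 250 minutes when rounded down and agrees with the logged total. The sketch below parses the timing lines and finds the dominant stage. The parsing helper is hypothetical and only assumes the `Name: seconds` layout shown in the log.

```python
import re

# Timing section copied from the log above.
LOG = """\
SegmentationWorker:          7
TmpTableWorker:              1
KmeansGraphWorker:           2419
BuildJobWorker:              9927
SeekJobWorker:               1026
TmpResultJoinWorker:         0
RecallWorker:                1675
CleanUpWorker:               4
Total time required (minutes):250
"""


def parse_worker_seconds(log_text):
    """Return {worker name: seconds} for lines such as 'SeekJobWorker:   1026'."""
    pattern = re.compile(r"^(\w+Worker):[ \t]*(\d+)[ \t]*$", re.MULTILINE)
    return {name: int(seconds) for name, seconds in pattern.findall(log_text)}


timings = parse_worker_seconds(LOG)
total_seconds = sum(timings.values())
slowest = max(timings, key=timings.get)

print(total_seconds)        # 15059
print(total_seconds // 60)  # 250
print(slowest)              # BuildJobWorker
```

In this run, index building (BuildJobWorker) accounts for most of the time, while the search stage itself (SeekJobWorker) takes 1,026 seconds.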

Sample commands:
jar -resources kmeans_center_resource_cl,proxima_ce_pailitao.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/pailitao/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar   com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao
-doc_table_partition 20210707
-query_table query_table_pailitao
-query_table_partition 20210707
-output_table output_table_pailitao_cluster_2000w
-output_table_partition 20210707
-data_type float
-dimension 512
-app_id 201220
-vector_separator blank
-pk_type int64
-row_num 10
-column_num 10
-clean_build_volume false
-job_mode build:seek:recall
-topk 1,50,100,200
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl
-kmeans_ratio 50
-kmeans_cluster_num 1000;
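In this command, the -column_num value (10) equals the number of index shards stated for the test, and the log reports 1,999,000 doc records per column, which is the doc table size divided by 10. The sketch below checks that relation against the four logs in this topic, assuming an even split of records across columns. The helper name is hypothetical and is not part of Proxima CE.

```python
def records_per_column(doc_records, column_num):
    """Doc-table records that each index column receives, assuming an even split."""
    if column_num <= 0:
        raise ValueError("column_num must be positive")
    return doc_records // column_num


# (label, records in the doc table, -column_num, per-column count reported in the log)
RUNS = [
    ("FLOAT, 1,000 clusters", 19_990_000, 10, 1_999_000),
    ("BINARY, 10,000 clusters", 100_000_000, 400, 250_000),
    ("BINARY, 10,000 clusters", 100_000_000, 100, 1_000_000),
    ("BINARY, 1,000 clusters", 100_000_000, 20, 5_000_000),
]

for label, doc_records, column_num, logged in RUNS:
    computed = records_per_column(doc_records, column_num)
    status = "matches log" if computed == logged else "differs from log"
    print(f"{label}: {column_num} columns -> {computed} records per column ({status})")
```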

Test on 100 million data records of the BINARY data type with 512 dimensions

In this test, the sampling rate is 30%, 10,000 clusters exist, and 400 index shards are created.

Centroid access rate	Number of index shards accessed	Recall rate (top K = 1)	Recall rate (top K = 50)	Recall rate (top K = 100)	Recall rate (top K = 200)
0.03	114.28	0.9995	0.9992600000000001	0.9993500000000002	0.9991800000000003
0.02	87.05	0.9995	0.9923800000000012	0.9921900000000013	0.9900700000000019
0.01	53.01	0.9995	0.9330400000000014	0.9219700000000022	0.9062475000000048
0.005	31.18	0.9995	0.7870400000000013	0.7560500000000001	0.7274675000000013
0.001	8.23	0.9995	0.4029200000000014	0.3572000000000005	0.32333000000000034

The following log shows the search duration when the centroid access rate is 0.001.

Vector search  Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:20210712 , Number of data records in the query table: 1010000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_0_001 , Partition:20210712
Row and column information  Number of rows: 20 , Number of columns:400 , Number of data records in the doc table of each column for index building:250000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:        10
TmpTableWorker:        1
KmeansGraphWorker:        38636
BuildJobWorker:        1085
SeekJobWorker:        1845
TmpResultJoinWorker:        0
RecallWorker:        939
CleanUpWorker:        4
Total time required (minutes):708

Sample commands:
jar -resources kmeans_center_resource_cl_binary2,proxima_ce_g2.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary2/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar  com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_10000_0_001
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 20
-column_num 400
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary2
-kmeans_ratio 30
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.001;

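Result analysis: When the number of cluster centroids and the number of shards are fixed, a higher centroid access rate increases the number of index shards that are actually accessed, which raises the recall rate and lengthens the search duration.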
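The relationship can be made concrete with the figures of the test that uses 10,000 clusters and 400 index shards. The sketch below assumes that the centroid access rate is the fraction of cluster centroids probed per query, so that the rate multiplied by the number of clusters gives the number of probed centroids. This reading is an assumption, although it is consistent with the tables: the number of accessed shards never exceeds that product. The helper name is hypothetical.

```python
CLUSTERS = 10_000  # -kmeans_cluster_num
SHARDS = 400       # -column_num

# (centroid access rate, number of index shards accessed), copied from the table
# of the test with 10,000 clusters and 400 index shards.
ROWS = [(0.03, 114.28), (0.02, 87.05), (0.01, 53.01), (0.005, 31.18), (0.001, 8.23)]


def probed_centroids(rate, clusters):
    """Assumed meaning of the access rate: fraction of centroids probed per query."""
    return round(rate * clusters)


for rate, accessed in ROWS:
    probes = probed_centroids(rate, CLUSTERS)
    share = accessed / SHARDS
    print(f"rate={rate}: about {probes} centroids probed, "
          f"{accessed} of {SHARDS} shards accessed ({share:.1%})")
```

Under this assumption, lowering the rate from 0.03 to 0.001 reduces the probed centroids from about 300 to about 10, and the accessed shards fall from 114.28 to 8.23, which matches the drop in recall at top K = 50 and above.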

In this test, the sampling rate is 50%, 10,000 clusters exist, and 100 index shards are created.

Centroid access rate	Number of index shards accessed	Recall rate (top K = 1)	Recall rate (top K = 50)	Recall rate (top K = 100)	Recall rate (top K = 200)
0.03	61.93	1.0	0.9999199999999999	1.0	1.0
0.02	51.43	1.0	0.99986	1.0	0.999985
0.01	35.59	1.0	0.9960400000000004	0.9961900000000005	0.9942699999999994
0.005	23.26	1.0	0.9493600000000024	0.9429200000000031	0.9308524999999989

The following log shows the search duration when the centroid access rate is 0.03.

Vector search  Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary3 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary3 , Partition:query_table_pailitao_binary3 , Number of data records in the query table:100000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_10000_100_0_03 , Partition:20210712
Row and column information  Number of rows: 10 , Number of columns:100 , Number of data records in the doc table of each column for index building:1000000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:        10
TmpTableWorker:        1
KmeansGraphWorker:        23760
BuildJobWorker:        1510
SeekJobWorker:        556
TmpResultJoinWorker:        0
RecallWorker:        787
CleanUpWorker:        4
Total time required (minutes):443

Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar  com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary3
-doc_table_partition 20210712
-query_table query_table_pailitao_binary3
-query_table_partition 20210712_10w
-output_table output_table_pailitao_binary_cluster_10000_100_0_03
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 100
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 50
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 10000
-kmeans_seek_ratio 0.03;

In this test, the sampling rate is 100%, 1,000 clusters exist, and 20 index shards are created.

Centroid access rate	Number of index shards accessed	Recall rate (top K = 1)	Recall rate (top K = 50)	Recall rate (top K = 100)	Recall rate (top K = 200)
0.1	14.26	1.0	0.9828800000000085	0.9801000000000099	0.9586999999999933
0.02	8.43	1.0	0.7897500000000025	0.7759649999999999	0.7622724999999989

The following log shows the search duration when the centroid access rate is 0.02.

Vector search  Data type:1 , Vector dimension:512 , Search method:hnsw , Measure:Hamming , Building mode:build:seek:recall
Information about the doc table Table name: doc_table_pailitao_binary2 , Partition:20210712 , Number of data records in the doc table:100000000 , Vector delimiter:~
Information about the query table Table name: query_table_pailitao_binary2 , Partition:20210712 , Number of data records in the query table:1000000 , Vector delimiter:~
Information about the output table Table name: output_table_pailitao_binary_cluster_1000 , Partition:20210712
Row and column information  Number of rows: 10 , Number of columns:20 , Number of data records in the doc table of each column for index building:5000000
Whether to clear volume indexes:false

Time required for each worker node (seconds):
SegmentationWorker:        2
TmpTableWorker:        1
KmeansGraphWorker:        4996
BuildJobWorker:        8727
SeekJobWorker:        1425
TmpResultJoinWorker:        0
RecallWorker:        857
CleanUpWorker:        4
Total time required (minutes):266

Sample commands:
jar -resources kmeans_center_resource_cl_binary,proxima_ce_g.jar
-classpath /data/jiliang.ljl/project/proxima2-java/proxima-ce/target/binary/proxima-ce-0.1-SNAPSHOT-jar-with-dependencies.jar  com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao_binary2
-doc_table_partition 20210712
-query_table query_table_pailitao_binary2
-query_table_partition 20210712
-output_table output_table_pailitao_binary_cluster_1000
-output_table_partition 20210712
-data_type binary
-dimension 512
-app_id 201220
-pk_type int64
-clean_build_volume false
-distance_method Hamming
-binary_to_int true
-row_num 10
-column_num 20
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl_binary
-kmeans_ratio 100
-job_mode build:seek:recall
-topk 1,50,100,200
-kmeans_cluster_num 1000
-kmeans_seek_ratio 0.02;

Result analysis: Comparing the preceding tests leads to the following conclusions:

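The number of cluster centroids is positively correlated with the recall rate. More cluster centroids result in a higher recall rate.
The centroid access rate is positively correlated with the recall rate. A higher centroid access rate results in a higher recall rate.
The number of index shards is negatively correlated with the recall rate. More index shards result in a lower recall rate.
The search duration is positively correlated with the number of cluster centroids, the number of index shards, and the centroid access rate. More cluster centroids, a higher centroid access rate, and more index shards all result in a longer search duration.
For cluster sharding in scenarios with different numbers of cluster centroids, different numbers of index shards, and different centroid access rates, the recall rate obtained when top K is 1 is accurate.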
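To see how the three BINARY configurations compare, the sketch below places their recall rates at top K = 50 side by side for a shared centroid access rate. The values are copied from the tables in this topic and rounded to five decimal places. The three runs also differ in sampling rate (30%, 50%, and 100%), so this is a side-by-side reading and not a controlled comparison. The helper name is hypothetical.

```python
# Recall rate at top K = 50 from the BINARY tables, rounded to 5 decimal places.
# Key: (number of clusters, number of index shards)
# Value: {centroid access rate: recall rate}
RECALL_AT_50 = {
    (10_000, 400): {0.03: 0.99926, 0.02: 0.99238, 0.01: 0.93304,
                    0.005: 0.78704, 0.001: 0.40292},
    (10_000, 100): {0.03: 0.99992, 0.02: 0.99986, 0.01: 0.99604, 0.005: 0.94936},
    (1_000, 20): {0.1: 0.98288, 0.02: 0.78975},
}


def compare_at_rate(table, rate):
    """Return (clusters, shards, recall) for every run measured at `rate`, best first."""
    rows = [(clusters, shards, recalls[rate])
            for (clusters, shards), recalls in table.items()
            if rate in recalls]
    return sorted(rows, key=lambda row: row[2], reverse=True)


for clusters, shards, recall in compare_at_rate(RECALL_AT_50, 0.02):
    print(f"{clusters:>6} clusters, {shards:>3} shards: recall at top 50 = {recall:.5f}")
```

At an access rate of 0.02, the run with 10,000 clusters and 100 shards ranks first, followed by 10,000 clusters with 400 shards, and then 1,000 clusters with 20 shards. This ordering agrees with the first and third conclusions above.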