Proxima CE supports cluster sharding in vector search tasks. This topic describes how to use the cluster sharding feature and provides examples.
Prerequisites
The Proxima CE package is installed. For more information, see Install the Proxima CE package.
Basic principles
Proxima CE supports hash sharding and cluster sharding during vector search. You can configure the -sharding_mode parameter to specify an index sharding mode. If you set this parameter to hash, hash sharding is performed. If you set this parameter to cluster, cluster sharding is performed. By default, hash sharding is performed.
Hash sharding: When Proxima CE builds indexes, it hashes the full set of data in the doc table into a fixed number of index shards. The number of index shards is specified by the column_num parameter. During vector search, every query is run against all index shards, and the recall results are merged.
Cluster sharding: Proxima CE clusters the data in the doc table and assigns data records that are close to each other to the same index shard. During vector search, Proxima CE calculates the distance between the query and each cluster centroid and searches only the index shards that correspond to the nearest cluster centroids. The routing difference between the two modes is sketched below.
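For intuition, the following minimal Python sketch (illustrative only, not Proxima CE code) contrasts the two routing strategies: under hash sharding the target shard is derived from the primary key, so every query must fan out to all shards, whereas under cluster sharding the target shard is derived from the vector itself.

import numpy as np

COLUMN_NUM = 4  # corresponds to the column_num parameter

def hash_shard(pk: str) -> int:
    # Hash sharding: the shard depends only on the primary key, so a
    # query must search all COLUMN_NUM shards and merge the results.
    return hash(pk) % COLUMN_NUM

def cluster_shard(vec, centroids, centroid_to_shard) -> int:
    # Cluster sharding: a record lands in the shard that holds the
    # centroid closest to its vector, so nearby vectors share a shard.
    nearest = int(np.argmin(np.linalg.norm(centroids - vec, axis=1)))
    return int(centroid_to_shard[nearest])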
Cluster sharding on indexes aims to improve query performance. When you use this feature, Proxima CE searches only specific index shards and can still recall near-optimal results, without the need to query all index shards. Cluster sharding on indexes involves the following phases, which are illustrated by the sketch after the query phase steps:
Index building phase
When Proxima CE builds indexes, it performs k-means clustering on the data set in the doc table to generate a specific number of cluster centroids. The number of cluster centroids is specified by the kmeans_cluster_num parameter.
Then, Proxima CE divides the cluster centroids into a specific number of sets based on spatial distance. The number of sets is specified by the column_num parameter. This step assigns the cluster centroids to the specified number of index shards.
Finally, Proxima CE assigns each data record in the doc table to the index shard that corresponds to the cluster centroid closest to that record.
Index query phase
Proxima CE calculates the distance between the query and each cluster centroid.
Proxima CE selects the index shards that correspond to the nearest cluster centroids and searches only those shards. The proportion of index shards that are selected is specified by the kmeans_seek_ratio parameter.
Proxima CE merges the results that are retrieved from the selected index shards.
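The following minimal NumPy sketch illustrates both phases. The parameter names mirror the Proxima CE flags (kmeans_cluster_num, column_num, and kmeans_seek_ratio), but the code is only an illustration of the idea, not the product's implementation; in particular, the round-robin grouping of centroids in step 2 stands in for Proxima CE's distance-based grouping.

import numpy as np

def build(docs, kmeans_cluster_num=8, column_num=4, iters=10):
    # docs: (N, D) float array of doc table vectors.
    rng = np.random.default_rng(0)
    # 1. k-means on the doc vectors to obtain kmeans_cluster_num centroids.
    centroids = docs[rng.choice(len(docs), kmeans_cluster_num, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(docs[:, None] - centroids[None], axis=2), axis=1)
        for c in range(kmeans_cluster_num):
            if np.any(labels == c):
                centroids[c] = docs[labels == c].mean(axis=0)
    # 2. Group the centroids into column_num index shards (round-robin
    #    here for brevity; Proxima CE groups them by spatial distance).
    centroid_to_shard = np.arange(kmeans_cluster_num) % column_num
    # 3. Assign every record to the shard of its closest centroid.
    shards = [np.where(centroid_to_shard[labels] == s)[0]
              for s in range(column_num)]
    return centroids, centroid_to_shard, shards

def seek(query, centroids, centroid_to_shard, kmeans_seek_ratio=0.5):
    # Rank centroids by distance to the query, keep the nearest fraction,
    # and search only the shards behind them; results are then merged.
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))
    keep = order[: max(1, int(len(order) * kmeans_seek_ratio))]
    return sorted(set(int(s) for s in centroid_to_shard[keep]))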
Use scenarios
Cluster sharding is suitable for scenarios with extremely large data sets, such as billions of data records. This feature is especially suitable when the number of queries is also extremely large.
Cluster sharding is suitable for scenarios in which an index is built once and then queried multiple times.
Note: Cluster sharding on indexes requires k-means clustering on the data set in the doc table, which takes a certain amount of time. In addition, a recall loss is incurred because only some of the index shards are retrieved. Therefore, this feature is not suitable for all vector search scenarios.
Cluster sharding does not support multi-category search, and it supports only Euclidean distance and Hamming distance as distance measure types.
Use logic
1. Set the -sharding_mode parameter to cluster.
2. Add the name of the table that stores the initial cluster centroids to the -resources option in the JAR command.
Note: This is not a command-line parameter of Proxima CE, but an option that the JAR command requires. The name of the table that stores the initial cluster centroids is a custom name and must be unique. For example, the name can be foo_init_center_resource. When Proxima CE runs, a MaxCompute table with this name is created to store the cluster centroids. Due to the mechanism of MaxCompute resources, you must manually specify the name of this table.
3. Make sure that the value of the -kmeans_resource_name parameter is the same as the table name that you added to the -resources option. Proxima CE cannot directly read the value of the -resources option, so you must specify the -kmeans_resource_name command-line parameter to pass that value, as shown in the sketch after these steps.
4. Configure other parameters as needed. The other parameters are optional. For more information, see the parameters whose names start with kmeans_ in Optional parameters.
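The following hypothetical helper (not part of Proxima CE) assembles these settings and enforces the constraint from step 3. The name foo_init_center_resource is the placeholder from the note above.

def cluster_sharding_args(jar: str, centroid_resource: str) -> dict:
    # Assemble the cluster-sharding settings described in the steps above.
    resources = [jar, centroid_resource]  # value of the -resources option
    args = {
        "-sharding_mode": "cluster",
        "-kmeans_resource_name": centroid_resource,
    }
    # Proxima CE cannot read the -resources option itself, so the resource
    # name is passed again and the two values must match (step 3).
    assert args["-kmeans_resource_name"] in resources
    return {"-resources": ",".join(resources), **args}

# Example:
# cluster_sharding_args("proxima-ce-aliyun-1.0.0.jar",
#                       "foo_init_center_resource")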
Create input tables and import data
You can execute the following statements on an SQL node of DataWorks:
-- Note: The origin_table table is a 128-dimensional FLOAT vector data table of an Alibaba Cloud service.
-- Prepare a doc table.
CREATE TABLE cluster_10kw_128f_doc(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
ALTER TABLE cluster_10kw_128f_doc ADD PARTITION (pt='20221111');
INSERT OVERWRITE TABLE cluster_10kw_128f_doc PARTITION (pt='20221111') SELECT pk, vector FROM origin_table WHERE pt='20221111';
-- Prepare a query table.
CREATE TABLE cluster_10kw_128f_query(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
ALTER TABLE cluster_10kw_128f_query ADD PARTITION (pt='20221111');
INSERT OVERWRITE TABLE cluster_10kw_128f_query PARTITION (pt='20221111') SELECT pk, vector FROM origin_table WHERE pt='20221111';
Use DataWorks to run Proxima CE
In this example, DataWorks is used to run Proxima CE and an external volume is created in advance.
For details about the parameter configuration used in the following example code, see Reference: Proxima CE parameters.
Sample command:
--@resource_reference{"proxima-ce-aliyun-1.0.0.jar"}
jar -resources proxima-ce-aliyun-1.0.0.jar,kmeans_center_resource_xxx -- The uploaded Proxima CE JAR package and the name of the table that stores the initial cluster centroids.
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner -- The classpath and the entry class that contains the main function.
-doc_table cluster_10kw_128f_doc
-doc_table_partition 20221111
-query_table cluster_10kw_128f_query
-query_table_partition 20221111
-output_table cluster_10kw_128f_output
-output_table_partition 20221111
-algo_model hnsw
-data_type float
-pk_type int64
-dimension 128
-column_num 50
-row_num 50
-vector_separator ,
-topk 1,50,100,200 -- Obtain the recall rates when top K is set to 1, 50, 100, and 200.
-job_mode train:build:seek:recall
-- -clean_build_volume false -- Retain the index for subsequent use. If you retain the index, you need to set the job_mode parameter only to seek (or seek:recall) in subsequent runs.
-external_volume_name udf_proxima_ext
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_xxx -- Manually specify the name of the kmeans resource. In this example, kmeans_center_resource_xxx is used.
-kmeans_cluster_num 1000
-- -kmeans_sample_ratio 0.05 -- Use the default value.
-- -kmeans_seek_ratio 0.1 -- Use the default value.
-- -kmeans_iter_num 30 -- Use the default value.
-- -kmeans_init_center_method "" -- Use the default value.
-- -kmeans_worker_num 0 -- Use the default value.
;
Execution result
The output table contains a large amount of data. Therefore, this example provides only the actual run logs and does not list the data records in the output table. The schema of the output table is the same as that of a standard execution result.
Vector search: Data type: 4, Vector dimension: 128, Search method: HNSW, Measure: SquaredEuclidean, Building mode: train:build:seek:recall
Doc table information: Table name: cluster_10kw_128f_doc, Partition: 20221111, Number of data records in the doc table: 100000000, Vector delimiter: ,
Query table information: Table name: cluster_10kw_128f_query, Partition: 20221111, Number of data records in the query table: 100000000, Vector delimiter: ,
Output table information: Table name: cluster_10kw_128f_output, Partition: 20221111
Row and column information: Number of rows: 50, Number of columns: 50, Number of data records in the doc table of each column for index building: 2000000
Whether to clear the volume index: true
Time required for each worker node (seconds):
SegmentationWorker: 3
TmpTableWorker: 1
KmeansGraphWorker: 2243
BuildJobWorker: 4973
SeekJobWorker: 5922
TmpResultJoinWorker: 0
RecallWorker: 986
CleanUpWorker: 6
Total time required (minutes): 235
Actual recall rate
Recall@1: 0.999
Recall@50: 0.9941600000000027
Recall@100: 0.9902300000000046
Recall@200: 0.9816199999999914
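For reference, Recall@K values like the ones above can be computed as shown in the following minimal sketch. It assumes that Recall@K is the average fraction of each query's ground-truth top-K neighbors that appear in the returned top-K list; the exact definition used by Proxima CE is not shown in the logs.

import numpy as np

def recall_at_k(returned, truth, k):
    # returned, truth: per-query lists of neighbor IDs (each of length >= k).
    hits = [len(set(r[:k]) & set(t[:k])) / k for r, t in zip(returned, truth)]
    return float(np.mean(hits))

# for k in (1, 50, 100, 200):
#     print(k, recall_at_k(returned, truth, k))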