Proxima CE supports cluster sharding in vector search tasks. This topic describes how to use the cluster sharding feature and provides examples.
Prerequisites
The Proxima CE package is installed. For more information, see Install the Proxima CE package.
Basic principles
Proxima CE supports hash sharding and cluster sharding during vector search. You can configure the -sharding_mode parameter to specify an index sharding mode. If you set this parameter to hash, hash sharding is performed. If you set this parameter to cluster, cluster sharding is performed. By default, hash sharding is performed.
Hash sharding: When Proxima CE builds indexes, it hashes the full set of data in the doc table into a fixed number of index shards. The number of index shards is specified by the column_num parameter. During vector search, every query is run against all index shards, and the recall results are merged.
Cluster sharding: Proxima CE clusters the data in the doc table and assigns data records that are close to each other to the same index shard. During vector search, Proxima CE calculates the distance between the query and each cluster centroid and searches only the index shards that correspond to the nearest cluster centroids. The routing difference between the two modes is sketched below.
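For intuition, the following minimal Python sketch (illustrative only, not Proxima CE code) contrasts the two routing strategies: under hash sharding the target shard is derived from the primary key, so every query must fan out to all shards, whereas under cluster sharding the target shard is derived from the vector itself.

import numpy as np

COLUMN_NUM = 4  # corresponds to the column_num parameter

def hash_shard(pk: str) -> int:
    # Hash sharding: the shard depends only on the primary key, so a
    # query must search all COLUMN_NUM shards and merge the results.
    return hash(pk) % COLUMN_NUM

def cluster_shard(vec, centroids, centroid_to_shard) -> int:
    # Cluster sharding: a record lands in the shard that holds the
    # centroid closest to its vector, so nearby vectors share a shard.
    nearest = int(np.argmin(np.linalg.norm(centroids - vec, axis=1)))
    return int(centroid_to_shard[nearest])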
Cluster sharding on indexes aims to improve query performance. When you use this feature, Proxima CE searches only specific index shards and can still recall near-optimal results, without the need to query all index shards. Cluster sharding on indexes involves the following phases, which are illustrated by the sketch after the query phase steps:
Index building phase
When Proxima CE builds indexes, it performs k-means clustering on the data set in the doc table to generate a specific number of cluster centroids. The number of cluster centroids is specified by the kmeans_cluster_num parameter.
Then, Proxima CE divides the cluster centroids into a specific number of sets based on spatial distance. The number of sets is specified by the column_num parameter. This step assigns the cluster centroids to the specified number of index shards.
Finally, Proxima CE assigns each data record in the doc table to the index shard that corresponds to the cluster centroid closest to that record.
Index query phase
Proxima CE calculates the distance between the query and each cluster centroid.
Proxima CE selects the index shards that correspond to the nearest cluster centroids and searches only those shards. The proportion of index shards that are selected is specified by the kmeans_seek_ratio parameter.
Proxima CE merges the results that are retrieved from the selected index shards.
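The following minimal NumPy sketch illustrates both phases. The parameter names mirror the Proxima CE flags (kmeans_cluster_num, column_num, and kmeans_seek_ratio), but the code is only an illustration of the idea, not the product's implementation; in particular, the round-robin grouping of centroids in step 2 stands in for Proxima CE's distance-based grouping.

import numpy as np

def build(docs, kmeans_cluster_num=8, column_num=4, iters=10):
    # docs: (N, D) float array of doc table vectors.
    rng = np.random.default_rng(0)
    # 1. k-means on the doc vectors to obtain kmeans_cluster_num centroids.
    centroids = docs[rng.choice(len(docs), kmeans_cluster_num, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(docs[:, None] - centroids[None], axis=2), axis=1)
        for c in range(kmeans_cluster_num):
            if np.any(labels == c):
                centroids[c] = docs[labels == c].mean(axis=0)
    # 2. Group the centroids into column_num index shards (round-robin
    #    here for brevity; Proxima CE groups them by spatial distance).
    centroid_to_shard = np.arange(kmeans_cluster_num) % column_num
    # 3. Assign every record to the shard of its closest centroid.
    shards = [np.where(centroid_to_shard[labels] == s)[0]
              for s in range(column_num)]
    return centroids, centroid_to_shard, shards

def seek(query, centroids, centroid_to_shard, kmeans_seek_ratio=0.5):
    # Rank centroids by distance to the query, keep the nearest fraction,
    # and search only the shards behind them; results are then merged.
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))
    keep = order[: max(1, int(len(order) * kmeans_seek_ratio))]
    return sorted(set(int(s) for s in centroid_to_shard[keep]))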
Use scenarios
Cluster sharding is suitable for scenarios with extremely large data sets, such as billions of data records. This feature is especially suitable when the number of queries is also extremely large.
Cluster sharding is suitable for scenarios in which an index is built once and then queried multiple times.
Note: Cluster sharding on indexes requires k-means clustering on the data set in the doc table, which takes a certain amount of time. In addition, a recall loss is incurred because only some of the index shards are retrieved. Therefore, this feature is not suitable for all vector search scenarios.
Cluster sharding does not support multi-category search, and it supports only Euclidean distance and Hamming distance as distance measure types.
Use logic
1. Set the -sharding_mode parameter to cluster.
2. Add the name of the table that stores the initial cluster centroids to the -resources option in the JAR command.
Note: This is not a command-line parameter of Proxima CE, but an option that the JAR command requires. The name of the table that stores the initial cluster centroids is a custom name and must be unique. For example, the name can be foo_init_center_resource. When Proxima CE runs, a MaxCompute table with this name is created to store the cluster centroids. Due to the mechanism of MaxCompute resources, you must manually specify the name of this table.
3. Make sure that the value of the -kmeans_resource_name parameter is the same as the table name that you added to the -resources option. Proxima CE cannot directly read the value of the -resources option, so you must specify the -kmeans_resource_name command-line parameter to pass that value, as shown in the sketch after these steps.
4. Configure other parameters as needed. The other parameters are optional. For more information, see the parameters whose names start with kmeans_ in Optional parameters.
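The following hypothetical helper (not part of Proxima CE) assembles these settings and enforces the constraint from step 3. The name foo_init_center_resource is the placeholder from the note above.

def cluster_sharding_args(jar: str, centroid_resource: str) -> dict:
    # Assemble the cluster-sharding settings described in the steps above.
    resources = [jar, centroid_resource]  # value of the -resources option
    args = {
        "-sharding_mode": "cluster",
        "-kmeans_resource_name": centroid_resource,
    }
    # Proxima CE cannot read the -resources option itself, so the resource
    # name is passed again and the two values must match (step 3).
    assert args["-kmeans_resource_name"] in resources
    return {"-resources": ",".join(resources), **args}

# Example:
# cluster_sharding_args("proxima-ce-aliyun-1.0.0.jar",
#                       "foo_init_center_resource")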
Create input tables and import data
You can execute the following statements on an SQL node of DataWorks:
-- Note: The origin_table table is a 128-dimensional FLOAT vector data table of an Alibaba Cloud service.
-- Prepare a doc table.
CREATE TABLE cluster_10kw_128f_doc(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
ALTER TABLE cluster_10kw_128f_doc ADD PARTITION (pt='20221111');
INSERT OVERWRITE TABLE cluster_10kw_128f_doc PARTITION (pt='20221111') SELECT pk, vector FROM origin_table WHERE pt='20221111';
-- Prepare a query table.
CREATE TABLE cluster_10kw_128f_query(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
ALTER TABLE cluster_10kw_128f_query ADD PARTITION (pt='20221111');
INSERT OVERWRITE TABLE cluster_10kw_128f_query PARTITION (pt='20221111') SELECT pk, vector FROM origin_table WHERE pt='20221111';
Use DataWorks to run Proxima CE
In this example, DataWorks is used to run Proxima CE and an external volume is created in advance.
For details about the parameter configuration used in the following example code, see Reference: Proxima CE parameters.
Sample command:
--@resource_reference{"proxima-ce-aliyun-1.0.0.jar"}
jar -resources proxima-ce-aliyun-1.0.0.jar,kmeans_center_resource_xxx -- The uploaded Proxima CE JAR package and the name of the table that stores the initial cluster centroids.
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner -- The classpath and the entry class that contains the main function.
-doc_table cluster_10kw_128f_doc
-doc_table_partition 20221111
-query_table cluster_10kw_128f_query
-query_table_partition 20221111
-output_table cluster_10kw_128f_output
-output_table_partition 20221111
-algo_model hnsw
-data_type float
-pk_type int64
-dimension 128
-column_num 50
-row_num 50
-vector_separator ,
-topk 1,50,100,200 -- Obtain the recall rates when top K is set to 1, 50, 100, and 200.
-job_mode train:build:seek:recall
-- -clean_build_volume false -- Retain the index for subsequent use. If you retain the index, you need to set the job_mode parameter only to seek (or seek:recall) in subsequent runs.
-external_volume_name udf_proxima_ext
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_xxx -- Manually specify the name of the kmeans resource. In this example, kmeans_center_resource_xxx is used.
-kmeans_cluster_num 1000
-- -kmeans_sample_ratio 0.05 -- Use the default value.
-- -kmeans_seek_ratio 0.1 -- Use the default value.
-- -kmeans_iter_num 30 -- Use the default value.
-- -kmeans_init_center_method "" -- Use the default value.
-- -kmeans_worker_num 0 -- Use the default value.
;
Execution result
The output table contains a large amount of data. Therefore, this example provides only the actual run logs and does not list the data records in the output table. The schema of the output table is the same as that of a standard execution result.
Vector search: Data type: 4, Vector dimension: 128, Search method: HNSW, Measure: SquaredEuclidean, Building mode: train:build:seek:recall
Doc table information: Table name: cluster_10kw_128f_doc, Partition: 20221111, Number of data records in the doc table: 100000000, Vector delimiter: ,
Query table information: Table name: cluster_10kw_128f_query, Partition: 20221111, Number of data records in the query table: 100000000, Vector delimiter: ,
Output table information: Table name: cluster_10kw_128f_output, Partition: 20221111
Row and column information: Number of rows: 50, Number of columns: 50, Number of data records in the doc table of each column for index building: 2000000
Whether to clear the volume index: true
Time required for each worker node (seconds):
SegmentationWorker: 3
TmpTableWorker: 1
KmeansGraphWorker: 2243
BuildJobWorker: 4973
SeekJobWorker: 5922
TmpResultJoinWorker: 0
RecallWorker: 986
CleanUpWorker: 6
Total time required (minutes): 235
Actual recall rate
Recall@1: 0.999
Recall@50: 0.9941600000000027
Recall@100: 0.9902300000000046
Recall@200: 0.9816199999999914
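For reference, Recall@K values like the ones above can be computed as shown in the following minimal sketch. It assumes that Recall@K is the average fraction of each query's ground-truth top-K neighbors that appear in the returned top-K list; the exact definition used by Proxima CE is not shown in the logs.

import numpy as np

def recall_at_k(returned, truth, k):
    # returned, truth: per-query lists of neighbor IDs (each of length >= k).
    hits = [len(set(r[:k]) & set(t[:k])) / k for r, t in zip(returned, truth)]
    return float(np.mean(hits))

# for k in (1, 50, 100, 200):
#     print(k, recall_at_k(returned, truth, k))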