Proxima CE supports cluster sharding in vector search tasks. This topic describes how to use the cluster sharding feature and provides examples.
Prerequisites
Before you begin, ensure that you have:
- The Proxima CE package installed. For more information, see Install the Proxima CE package.
When to use cluster sharding
Proxima CE supports two index sharding modes, controlled by the `-sharding_mode` parameter: hash sharding (default) and cluster sharding.
| Mode | How it works | When to use |
|---|---|---|
| Hash sharding (default) | Distributes all records across a fixed number of shards. Every query searches all shards and merges the results. | General-purpose workloads |
| Cluster sharding | Groups similar vectors into the same shard using k-means clustering. Queries search only the shards closest to the query vector. | Billion-scale datasets with high query volume |
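The difference between the two modes comes down to how a record is assigned to a shard. The following toy Python contrast illustrates the idea; the names, data, and hashing scheme are illustrative assumptions, not Proxima CE's actual implementation:

```python
# Toy contrast of the two shard-assignment rules, assuming a string primary key.
import numpy as np

num_shards = 4
# For simplicity, one centroid per shard; Proxima CE groups many centroids per shard.
centroids = np.array([[0.0, 0.0], [0.0, 10.0], [10.0, 0.0], [10.0, 10.0]])

def hash_shard(pk: str) -> int:
    # Hash sharding: the shard depends only on the key, so similar vectors
    # scatter across shards and every query must search all of them.
    return hash(pk) % num_shards

def cluster_shard(vector: np.ndarray) -> int:
    # Cluster sharding: the shard is that of the nearest centroid, so similar
    # vectors co-locate and a query can probe only the nearby shards.
    return int(np.argmin(((centroids - vector) ** 2).sum(-1)))

print(hash_shard("doc-42"))                        # varies per run
print(cluster_shard(np.array([9.0, 1.0])))         # 2 (nearest to [10, 0])
```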
Use cluster sharding when:
- Your dataset contains billions of records.
- You expect a high volume of queries.
- You build the index once and query it many times.
Limitations:
- Index building takes longer because k-means clustering runs on the full doc table.
- Recall is not 100%: only a subset of shards is searched per query, so some results may be missed. See Execution result for representative recall rates.
- Multi-category search is not supported.
- Only Euclidean distance and Hamming distance are supported as distance functions. Both are sketched below.
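For intuition, the two supported measures can be sketched in Python as follows. This is illustrative only; Proxima CE computes distances internally, and the example run log later in this topic reports SquaredEuclidean as the measure for FLOAT vectors:

```python
# Illustrative sketches of the two distance functions cluster sharding supports.
def squared_euclidean(a, b):
    # For FLOAT vectors; the example run log reports Measure:SquaredEuclidean.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hamming(a, b):
    # For binary vectors: the number of bit positions where a and b differ.
    return bin(a ^ b).count("1")

print(squared_euclidean([1.0, 2.0], [3.0, 1.0]))  # 5.0
print(hamming(0b1011, 0b0010))                    # 2
```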
How it works
Cluster sharding runs in two phases:
1. Cluster the doc table using k-means. The number of cluster centroids is set by `kmeans_cluster_num`.
2. Partition the cluster centroids into groups based on spatial distance. The number of groups (index shards) is set by `column_num`.
3. Assign each record to the index shard of its nearest cluster centroid.
4. Calculate the distance between the query vector and every cluster centroid.
5. Select the index shards that correspond to the nearest centroids. The fraction of shards to search is set by `kmeans_seek_ratio`.
6. Merge the results from the selected shards.
Steps 1–3 run during index building. Steps 4–6 run during each query.
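The following is a minimal end-to-end sketch of these six steps in Python with NumPy, on toy data. The parameter names mirror the Proxima CE options; the seed-based centroid grouping and everything else here are illustrative assumptions, not the production MaxCompute implementation:

```python
import math

import numpy as np

rng = np.random.default_rng(0)

kmeans_cluster_num = 8   # number of k-means centroids
column_num = 4           # number of index shards (centroid groups)
kmeans_seek_ratio = 0.5  # fraction of index shards searched per query

docs = rng.random((1000, 16))  # stand-in doc table: 1,000 16-dim vectors

# --- Index building (steps 1-3) ---
# Step 1: cluster the doc table with k-means (a few Lloyd iterations).
centroids = docs[rng.choice(len(docs), kmeans_cluster_num, replace=False)].copy()
for _ in range(10):
    nearest = np.argmin(((docs[:, None] - centroids) ** 2).sum(-1), axis=1)
    for c in range(kmeans_cluster_num):
        if (nearest == c).any():
            centroids[c] = docs[nearest == c].mean(axis=0)

# Step 2: partition the centroids into column_num groups by spatial distance.
# Here each centroid joins the group of its nearest randomly chosen seed.
seeds = centroids[rng.choice(kmeans_cluster_num, column_num, replace=False)]
centroid_shard = np.argmin(((centroids[:, None] - seeds) ** 2).sum(-1), axis=1)

# Step 3: each record lands in the shard of its nearest centroid.
doc_shard = centroid_shard[
    np.argmin(((docs[:, None] - centroids) ** 2).sum(-1), axis=1)]
shards = {s: docs[doc_shard == s] for s in range(column_num)}

# --- Query (steps 4-6) ---
def seek(query, topk):
    # Step 4: distance from the query vector to every centroid.
    dist_to_centroid = ((centroids - query) ** 2).sum(-1)
    # Step 5: keep only the shards owning the nearest centroids.
    n_probe = max(1, math.ceil(kmeans_seek_ratio * column_num))
    shard_rank = sorted(range(column_num),
                        key=lambda s: dist_to_centroid[centroid_shard == s].min())
    # Step 6: search the selected shards and merge the results.
    hits = np.concatenate(
        [((shards[s] - query) ** 2).sum(-1) for s in shard_rank[:n_probe]])
    return np.sort(hits)[:topk]  # top-k distances from the probed shards

print(seek(rng.random(16), topk=5))
```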
Configure cluster sharding
1. Set `-sharding_mode` to `cluster`.
2. Add the name of the table that stores the initial cluster centroids to the `-resources` option in the JAR command. Note that `-resources` is a JAR command option, not a command-line parameter. The table name is custom and must be unique, for example, `foo_init_center_resource`. MaxCompute creates this table at runtime to persist the cluster centroids, so you must specify the name explicitly.
3. Set `-kmeans_resource_name` to the same table name that you added to `-resources`. Proxima CE cannot read the value of `-resources` directly, so this parameter passes the table name explicitly.
4. (Optional) Tune the k-means parameters. For the full parameter reference, see Reference: Proxima CE parameters.
| Parameter | What it controls | Default | Trade-off |
|---|---|---|---|
| `kmeans_cluster_num` | Number of cluster centroids generated during index building | None | More centroids improve recall but increase build time and memory usage |
| `kmeans_seek_ratio` | Fraction of index shards searched per query | 0.1 | Higher values improve recall but increase query latency |
| `kmeans_sample_ratio` | Fraction of data sampled for k-means training | 0.05 | Higher values improve centroid quality but increase training time |
| `kmeans_iter_num` | Number of k-means iterations | 30 | More iterations produce better centroids at the cost of build time |
| `kmeans_worker_num` | Number of parallel workers for k-means | 0 (auto) | Increase to speed up clustering on large datasets |
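As a worked example: with `-column_num 50`, as in the example below, and the default `kmeans_seek_ratio` of 0.1, each query searches roughly 0.1 × 50 = 5 of the 50 index shards.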
Example
This example uses DataWorks to run Proxima CE on a 100-million-record, 128-dimensional FLOAT vector dataset. An external volume (udf_proxima_ext in this example) is created in advance.
Create input tables
Run the following SQL on a DataWorks SQL node to prepare the doc table and query table:
-- The origin_table table is a 128-dimensional FLOAT vector data table of an Alibaba Cloud service.
-- Prepare a doc table.
CREATE TABLE cluster_10kw_128f_doc(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
ALTER TABLE cluster_10kw_128f_doc ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE cluster_10kw_128f_doc PARTITION (pt='20221111')
SELECT pk, vector FROM origin_table WHERE pt='20221111';
-- Prepare a query table.
CREATE TABLE cluster_10kw_128f_query(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
ALTER TABLE cluster_10kw_128f_query ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE cluster_10kw_128f_query PARTITION (pt='20221111')
SELECT pk, vector FROM origin_table WHERE pt='20221111';
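In this example the query table is a full copy of the doc table, so recall is evaluated over every indexed vector. Before running the job, you can optionally sanity-check the partitions; this check is a suggestion, not part of the required workflow:

```sql
-- Optional: confirm that both partitions are populated before building.
SELECT COUNT(*) FROM cluster_10kw_128f_doc WHERE pt='20221111';
SELECT COUNT(*) FROM cluster_10kw_128f_query WHERE pt='20221111';
```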
Run Proxima CE
For details about every parameter in the following command, see Reference: Proxima CE parameters.
--@resource_reference{"proxima-ce-aliyun-1.0.0.jar"}
jar -resources proxima-ce-aliyun-1.0.0.jar,kmeans_center_resource_xxx -- JAR package plus the cluster centroid resource table
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table cluster_10kw_128f_doc
-doc_table_partition 20221111
-query_table cluster_10kw_128f_query
-query_table_partition 20221111
-output_table cluster_10kw_128f_output
-output_table_partition 20221111
-algo_model hnsw
-data_type float
-pk_type int64
-dimension 128
-column_num 50
-row_num 50
-vector_separator ,
-topk 1,50,100,200 -- Evaluate recall at top-K = 1, 50, 100, and 200
-job_mode train:build:seek:recall
-- -clean_build_volume false -- Set to false to retain the index for subsequent queries.
-- When the index is retained, later runs can use -job_mode seek (with :recall optional).
-external_volume_name udf_proxima_ext
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_xxx -- Must match the resource table name in -resources
-kmeans_cluster_num 1000
-- -kmeans_sample_ratio 0.05 -- Default value
-- -kmeans_seek_ratio 0.1 -- Default value
-- -kmeans_iter_num 30 -- Default value
-- -kmeans_init_center_method "" -- Default value
-- -kmeans_worker_num 0 -- Default value (auto)
;
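If you retain the index by setting `-clean_build_volume false`, a later run can skip training and building. The following is a hedged sketch of such a follow-up command; it reuses the tables and resource names above and changes only the job mode, though the exact parameter set a seek-only run requires may differ:

```
--@resource_reference{"proxima-ce-aliyun-1.0.0.jar"}
jar -resources proxima-ce-aliyun-1.0.0.jar,kmeans_center_resource_xxx
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table cluster_10kw_128f_doc
-doc_table_partition 20221111
-query_table cluster_10kw_128f_query
-query_table_partition 20221111
-output_table cluster_10kw_128f_output
-output_table_partition 20221111
-algo_model hnsw
-data_type float
-pk_type int64
-dimension 128
-column_num 50
-row_num 50
-vector_separator ,
-topk 1,50,100,200
-job_mode seek:recall -- Reuse the retained index; the recall stage is optional
-clean_build_volume false
-external_volume_name udf_proxima_ext
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_xxx
-kmeans_cluster_num 1000
;
```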
Execution result
The output table contains a large amount of data. This example shows the run logs only. The output table schema follows the same structure as the standard execution result.
Vector search Data type:4 , Vector dimension:128 , Search method:HNSW , Measure:SquaredEuclidean , Building mode:train:build:seek:recall
Information about the doc table Table name: cluster_10kw_128f_doc , Partition:20221111 , Number of data records in the doc table:100000000 , Vector delimiter:,
Information about the query table Table name: cluster_10kw_128f_query , Partition:20221111 , Number of data records in the query table:100000000 , Vector delimiter:,
Information about the output table Table name:cluster_10kw_128f_output , Partition:20221111
Row and column information Number of rows: 50 , Number of columns:50 , Number of data records in the doc table of each column for index building:2000000
Whether to clear the volume index:true
Time required for each worker node (seconds):
SegmentationWorker: 3
TmpTableWorker: 1
KmeansGraphWorker: 2243
BuildJobWorker: 4973
SeekJobWorker: 5922
TmpResultJoinWorker: 0
RecallWorker: 986
CleanUpWorker: 6
Total time required (minutes):235
Actual recall rate
Recall@1: 0.999
Recall@50: 0.9941600000000027
Recall@100: 0.9902300000000046
Recall@200: 0.9816199999999914
The KmeansGraphWorker time (2,243 s) is the extra build cost that k-means clustering adds compared to hash sharding; BuildJobWorker (4,973 s), the index build itself, runs in either mode. The recall rates above 0.98 across all top-K values show that the recall loss from searching only a subset of shards is minimal in practice.
What's next
- Reference: Proxima CE parameters (full parameter reference, including all `kmeans_*` options)
- Execution result schema (output table structure and field descriptions)