MaxCompute: Cluster sharding

Last Updated: Mar 26, 2026

Proxima CE supports cluster sharding in vector search tasks. This topic describes how to use the cluster sharding feature and provides examples.

Prerequisites

Before you begin, ensure that you have:

  • Obtained the Proxima CE package (for example, proxima-ce-aliyun-1.0.0.jar) and uploaded it as a MaxCompute resource.

  • Created an external volume to store the index data. The example in this topic uses a volume named udf_proxima_ext.

When to use cluster sharding

Proxima CE supports two index sharding modes, controlled by the -sharding_mode parameter: hash sharding (default) and cluster sharding.

  • Hash sharding (default): Distributes all records across a fixed number of shards. Every query searches all shards and merges the results. Suited to general-purpose workloads.

  • Cluster sharding: Groups similar vectors into the same shard using k-means clustering. A query searches only the shards closest to the query vector. Suited to billion-scale datasets with high query volume.
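
The key difference is how each record is assigned to a shard. The following Python sketch is for illustration only; the function names, the use of the primary key as the hash input, and the one-shard-per-centroid simplification are assumptions rather than Proxima CE behavior (the grouping of centroids into shards is described in How it works below).

import numpy as np

def hash_shard(pk: str, num_shards: int) -> int:
    # Hash sharding: the shard depends only on a hash of the record
    # (the primary key is used here purely as an illustration), so every
    # query has to search all shards.
    return hash(pk) % num_shards

def cluster_shard(vector: np.ndarray, centroids: np.ndarray) -> int:
    # Cluster sharding: the record goes to the shard that owns its nearest
    # k-means centroid, so nearby vectors end up together and a query only
    # needs to visit the shards around it.
    distances = np.linalg.norm(centroids - vector, axis=1)
    return int(np.argmin(distances))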

Use cluster sharding when:

  • Your dataset contains billions of records

  • You expect a high volume of queries

  • You build the index once and query it many times

Limitations:

  • Index building takes longer because k-means clustering runs on the full doc table.

  • Recall is not 100%: only a subset of shards is searched per query, so some results may be missed. See Execution result for representative recall rates.

  • Multi-category search is not supported.

  • Only the Euclidean distance and Hamming distance functions are supported.

How it works

Cluster sharding runs in two phases, index building and querying:

  1. Cluster the doc table using k-means. The number of cluster centroids is set by kmeans_cluster_num.

  2. Partition the cluster centroids into groups based on spatial distance. The number of groups (index shards) is set by column_num.

  3. Assign each record to the index shard of its nearest cluster centroid.

  4. Calculate the distance between the query vector and every cluster centroid.

  5. Select the index shards that correspond to the nearest centroids. The fraction of shards to search is set by kmeans_seek_ratio.

  6. Merge the results from the selected shards.

Steps 1–3 run during index building. Steps 4–6 run during each query.
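
The following Python sketch models the procedure end to end. It is a simplified illustration of the steps above, not Proxima CE code: the helper names, the use of scikit-learn's KMeans, the second k-means used to group centroids into shards, and the brute-force search inside each shard are all assumptions made to keep the example short.

import numpy as np
from sklearn.cluster import KMeans

# --- Index-building phase (steps 1-3) ------------------------------------
def build_cluster_shards(doc_vectors, kmeans_cluster_num, column_num):
    # Step 1: cluster the doc vectors; kmeans_cluster_num sets the centroid count.
    km = KMeans(n_clusters=kmeans_cluster_num, n_init=10).fit(doc_vectors)
    centroids = km.cluster_centers_

    # Step 2: partition the centroids into column_num groups (the index shards).
    # Grouping nearby centroids with a second, coarser k-means is one simple
    # way to model "partition by spatial distance".
    centroid_to_shard = KMeans(n_clusters=column_num, n_init=10).fit(centroids).labels_

    # Step 3: assign every record to the shard of its nearest centroid.
    nearest_centroid = km.predict(doc_vectors)
    shards = [[] for _ in range(column_num)]
    for doc_id, c in enumerate(nearest_centroid):
        shards[centroid_to_shard[c]].append(doc_id)
    return centroids, centroid_to_shard, shards

# --- Query phase (steps 4-6) ----------------------------------------------
def seek(query, centroids, centroid_to_shard, shards, doc_vectors,
         kmeans_seek_ratio=0.1, topk=10):
    # Step 4: distance from the query vector to every centroid.
    dist_to_centroids = np.linalg.norm(centroids - query, axis=1)

    # Step 5: select the shards that own the nearest centroids;
    # kmeans_seek_ratio controls how many shards are searched.
    num_to_search = max(1, int(len(shards) * kmeans_seek_ratio))
    selected = []
    for c in np.argsort(dist_to_centroids):
        s = int(centroid_to_shard[c])
        if s not in selected:
            selected.append(s)
        if len(selected) == num_to_search:
            break

    # Step 6: search the selected shards (brute force here) and merge the results.
    candidates = [doc_id for s in selected for doc_id in shards[s]]
    dists = np.linalg.norm(doc_vectors[candidates] - query, axis=1)
    order = np.argsort(dists)[:topk]
    return [(candidates[i], float(dists[i])) for i in order]

# Example usage on small random data (sizes shrunk for illustration):
docs = np.random.rand(10_000, 16).astype(np.float32)
centroids, c2s, shards = build_cluster_shards(docs, kmeans_cluster_num=100, column_num=10)
print(seek(docs[0], centroids, c2s, shards, docs, kmeans_seek_ratio=0.1, topk=5))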

Configure cluster sharding

  1. Set -sharding_mode to cluster.

  2. Add the name of the table that stores the initial cluster centroids to the -resources option in the JAR command.

    -resources is an option of the MaxCompute jar command, not a Proxima CE parameter. The table name is custom and must be unique, for example, foo_init_center_resource. MaxCompute creates this table at runtime to persist the cluster centroids, so you must specify the name explicitly.

  3. Set -kmeans_resource_name to the same table name that you added to -resources. Proxima CE cannot read the value of -resources directly, so this parameter passes the table name explicitly.

  4. (Optional) Tune the k-means parameters. For the full parameter reference, see Reference: Proxima CE parameters.

    • kmeans_cluster_num: Number of cluster centroids generated during index building. More centroids improve recall but increase build time and memory usage.
    • kmeans_seek_ratio: Fraction of index shards searched per query. Default: 0.1. Higher values improve recall but increase query latency.
    • kmeans_sample_ratio: Fraction of data sampled for k-means training. Default: 0.05. Higher values improve centroid quality but increase training time.
    • kmeans_iter_num: Number of k-means iterations. Default: 30. More iterations produce better centroids at the cost of build time.
    • kmeans_worker_num: Number of parallel workers for k-means. Default: 0 (auto). Increase to speed up clustering on large datasets.
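
As a rough illustration of how these parameters interact, the arithmetic below uses the values from the example in this topic (column_num 50, kmeans_cluster_num 1000, the default kmeans_seek_ratio of 0.1, and a 100-million-record doc table). The formulas are inferred from the parameter descriptions above and assume records and centroids spread evenly across shards; they are estimates, not Proxima CE internals.

# Hypothetical back-of-the-envelope estimates; all formulas are assumptions.
doc_count = 100_000_000       # records in the doc table
column_num = 50               # number of index shards
kmeans_cluster_num = 1000     # k-means centroids
kmeans_seek_ratio = 0.1       # fraction of shards searched per query

records_per_shard = doc_count // column_num                      # ~2,000,000 per shard
centroids_per_shard = kmeans_cluster_num // column_num           # ~20 centroids per shard
shards_searched = max(1, int(column_num * kmeans_seek_ratio))    # ~5 of 50 shards per query
print(records_per_shard, centroids_per_shard, shards_searched)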

Example

This example uses DataWorks to run Proxima CE on a 100-million-record, 128-dimensional FLOAT vector dataset. An external volume (udf_proxima_ext in the following command) is created in advance.

Create input tables

Run the following SQL on a DataWorks SQL node to prepare the doc table and query table:

-- origin_table is an existing 128-dimensional FLOAT vector table provided by an Alibaba Cloud service.
-- Prepare a doc table.
CREATE TABLE cluster_10kw_128f_doc(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
ALTER TABLE cluster_10kw_128f_doc ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE cluster_10kw_128f_doc PARTITION (pt='20221111')
  SELECT pk, vector FROM origin_table WHERE pt='20221111';

-- Prepare a query table.
CREATE TABLE cluster_10kw_128f_query(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
ALTER TABLE cluster_10kw_128f_query ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE cluster_10kw_128f_query PARTITION (pt='20221111')
  SELECT pk, vector FROM origin_table WHERE pt='20221111';
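
The vector column in both tables is a STRING that holds the 128 FLOAT components joined by the separator later passed as -vector_separator (a comma in this example). The snippet below is a hypothetical illustration of that format, not part of the Proxima CE tooling:

import random

# Build one hypothetical 128-dimensional vector and serialize it the way
# the doc and query tables expect: components joined by the separator.
dimension = 128
vector = [round(random.uniform(-1.0, 1.0), 6) for _ in range(dimension)]
vector_str = ",".join(str(x) for x in vector)

print(vector_str[:60])                 # first characters, e.g. "0.123456,-0.654321,..."
print(len(vector_str.split(",")))      # 128, matching -dimension 128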

Run Proxima CE

For details about every parameter in the following command, see Reference: Proxima CE parameters.
--@resource_reference{"proxima-ce-aliyun-1.0.0.jar"}
jar -resources proxima-ce-aliyun-1.0.0.jar,kmeans_center_resource_xxx  -- Proxima CE JAR package and the centroid table name
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner
-doc_table cluster_10kw_128f_doc
-doc_table_partition 20221111
-query_table cluster_10kw_128f_query
-query_table_partition 20221111
-output_table cluster_10kw_128f_output
-output_table_partition 20221111
-algo_model hnsw
-data_type float
-pk_type int64
-dimension 128
-column_num 50
-row_num 50
-vector_separator ,
-topk 1,50,100,200           -- Evaluate recall at top-K = 1, 50, 100, and 200
-job_mode train:build:seek:recall
-- -clean_build_volume false -- Set to false to retain the index for subsequent queries.
                             -- When the index is retained, use job_mode seek (optionally seek:recall).
-external_volume_name udf_proxima_ext
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_xxx  -- Must match the centroid table name added to -resources
-kmeans_cluster_num 1000
-- -kmeans_sample_ratio 0.05  -- Default value
-- -kmeans_seek_ratio 0.1     -- Default value
-- -kmeans_iter_num 30        -- Default value
-- -kmeans_init_center_method ""  -- Default value
-- -kmeans_worker_num 0       -- Default value (auto)
;

Execution result

The output table contains a large amount of data. This example shows the run logs only. The output table schema follows the same structure as the standard execution result.
Vector search  Data type:4 , Vector dimension:128 , Search method:HNSW , Measure:SquaredEuclidean , Building mode:train:build:seek:recall
Information about the doc table Table name: cluster_10kw_128f_doc , Partition:20221111 , Number of data records in the doc table:100000000 , Vector delimiter:,
Information about the query table Table name: cluster_10kw_128f_query , Partition:20221111 , Number of data records in the query table:100000000 , Vector delimiter:,
Information about the output table Table name:cluster_10kw_128f_output , Partition:20221111
Row and column information  Number of rows: 50 , Number of columns:50 , Number of data records in the doc table of each column for index building:2000000
Whether to clear the volume index:true

Time required for each worker node (seconds):
   SegmentationWorker:      3
   TmpTableWorker:      1
   KmeansGraphWorker:       2243
   BuildJobWorker:      4973
   SeekJobWorker:       5922
   TmpResultJoinWorker:     0
   RecallWorker:        986
   CleanUpWorker:       6
Total time required (minutes):235

Actual recall rate
    Recall@1:   0.999
    Recall@50:  0.9941600000000027
    Recall@100: 0.9902300000000046
    Recall@200: 0.9816199999999914

The KmeansGraphWorker (2,243 s) and BuildJobWorker (4,973 s) reflect the extra build time that k-means clustering adds compared to hash sharding. The recall rates above 0.98 across all top-K values show that the recall loss from searching a subset of shards is minimal in practice.
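
For reference, Recall@K here is typically the fraction of the exact top-K neighbors that also appear in the returned top-K results, averaged over all queries. The sketch below shows that computation in generic form; it is an assumption about the metric's definition, not the RecallWorker implementation.

from typing import Dict, List

def recall_at_k(exact: Dict[str, List[str]], approx: Dict[str, List[str]], k: int) -> float:
    # exact:  query pk -> true top-k neighbor pks (for example, from a brute-force search)
    # approx: query pk -> top-k neighbor pks returned by the index
    total = 0.0
    for pk, truth in exact.items():
        hits = len(set(truth[:k]) & set(approx.get(pk, [])[:k]))
        total += hits / k
    return total / len(exact)

# Tiny hypothetical example: 1 of the 2 true neighbors is found, so Recall@2 = 0.5
print(recall_at_k({"q1": ["a", "b"]}, {"q1": ["a", "c"]}, k=2))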

What's next