Proxima CE supports category-filtered vector search (KNN) across large datasets in MaxCompute. This topic describes the test results and procedures that verify correctness across six configurations: mixed category sizes, three distance measures (Euclidean distance, inner product, Hamming distance), two data types (FLOAT and BINARY), and multi-label query scenarios.
Test conclusion: Proxima CE correctly handles both search by category and multi-label search by category in multi-category scenarios.
How categories are classified
Proxima CE classifies each category as small or large based on the category threshold and the number of documents in the category.
| Category type | Condition | Processing method |
|---|---|---|
| Small category | Document count is below the category threshold | Block matrix approach, controlled by category_row_num and category_col_num |
| Large category | Document count is at or above the category threshold | Parallel processing, controlled by category_thread_num |
Setting the category threshold high enough (for example, 1,000,000) forces all categories to be treated as small categories. Setting it low (for example, 15) causes every category with 15 or more documents, such as the 20- and 30-document categories used in these tests, to be treated as a large category.
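The classification rule above can be sketched in a few lines of Python. This is an illustration of the rule as stated in the table, not Proxima CE code; the function name `classify_categories` and the `doc_counts` mapping are hypothetical.

```python
def classify_categories(doc_counts, category_threshold):
    """Split category IDs into (small, large) lists by document count."""
    small, large = [], []
    for category_id, count in sorted(doc_counts.items()):
        # Below the threshold: small category. At or above it: large category.
        (small if count < category_threshold else large).append(category_id)
    return small, large


# Category ID -> document count, mirroring the test data (IDs equal counts).
doc_counts = {1: 1, 5: 5, 10: 10, 20: 20, 30: 30}

print(classify_categories(doc_counts, 15))
print(classify_categories(doc_counts, 1_000_000))
```

With a threshold of 15, the sketch places categories 1, 5, and 10 in the small list and categories 20 and 30 in the large list. With a threshold of 1,000,000, every category is small.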
Parameter reference
The following parameters control multi-category search behavior. Pass these parameters as flags to the ProximaCERunner JAR.
| Parameter | Description | Example |
|---|---|---|
| -category_threshold | Threshold that separates small categories from large categories | 15 |
| -category_row_num | Number of rows per block for small-category matrix computation | 2 |
| -category_col_num | Number of columns per block for small-category matrix computation | 3 |
| -category_thread_num | Degree of parallelism for large-category search | 3 |
| -topk | Number of nearest neighbors to return per query | 5 |
| -data_type | Vector data type: int8, float, or binary | int8 |
| -dimension | Number of dimensions in the vector | 2 |
| -distance_method | Distance measure: Euclidean distance (default), InnerProduct, or hamming | InnerProduct |
| -query_multi_label | Set to true to enable multi-label search by category | true |
| -app_id | Proxima CE application ID | 201220 |
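When the same run is repeated with different flag values, it can help to assemble the command from a dictionary. The following sketch only concatenates strings in the shape of the odpscmd invocations used in this topic. The helper name `build_command` is hypothetical, and the table names and JAR file name are the ones used in the test cases.

```python
def build_command(jar, params):
    """Assemble an odpscmd invocation for ProximaCERunner from a flag dict."""
    # Each entry becomes "-name value"; dict insertion order is preserved.
    flags = " ".join(f"-{name} {value}" for name, value in params.items())
    return (
        f'odpscmd -e "jar -resources {jar} -classpath ./{jar} '
        f'com.alibaba.proxima2.ce.ProximaCERunner {flags};"'
    )


params = {
    "doc_table": "cat_doc_table",
    "doc_table_partition": "20210712",
    "query_table": "cat_query_table",
    "query_table_partition": "20210712",
    "output_table": "cat_result_table",
    "output_table_partition": "20210712",
    "data_type": "float",
    "dimension": 2,
    "app_id": 201220,
    "category_row_num": 2,
    "category_col_num": 3,
    "topk": 5,
    "category_thread_num": 3,
    "category_threshold": 15,
}

command = build_command("proxima-ce-xl-222.jar", params)
print(command)
```

Building the string this way keeps the opening and closing quotation marks paired, which is easy to get wrong when editing a long command by hand.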
Test methods
The following six cases verify result correctness under different category configurations. Each case uses the data format key+category ID-idx,idx~idx,category ID (for example, key1-1,1~1,1), and all test cases use partition date 20210712.
Case 1: Mixed small and large categories (FLOAT, Euclidean distance)
Purpose: Verify result correctness when small and large categories coexist.
With a threshold of 15, categories with 1, 5, and 10 documents are small categories (3 total), and categories with 20 and 30 documents are large categories (2 total).
| Parameter | Value |
|---|---|
| Doc table — documents per category | 1, 5, 10, 20, 30 |
| Query table — queries per category | 5, 10, 20, 5, 10 |
| Category threshold | 15 |
| category_row_num | 2 |
| category_col_num | 3 |
| Degree of parallelism | 3 |
| Top K | 5 |
| Data type | FLOAT |
| Dimensions | 2 |
| Distance measure | Euclidean distance |
Case 2: All small categories (FLOAT, Euclidean distance)
Purpose: Verify that small-category block matrix processing produces the same results as Case 1.
Same data as Case 1. The category threshold is 1,000,000, which forces all categories to be treated as small categories.
| Parameter | Value |
|---|---|
| Doc table — documents per category | 1, 5, 10, 20, 30 |
| Query table — queries per category | 5, 10, 20, 5, 10 |
| Category threshold | 1,000,000 |
| category_row_num | 2 |
| category_col_num | 3 |
| Degree of parallelism | N/A |
| Top K | 5 |
| Data type | FLOAT |
| Dimensions | 2 |
| Distance measure | Euclidean distance |
Case 3: All small categories (FLOAT, inner product)
Purpose: Verify result correctness using inner product distance when all categories are small. With inner product, a higher score indicates greater similarity.
Same data and threshold as Case 2, with distance measure changed to inner product.
| Parameter | Value |
|---|---|
| Doc table — documents per category | 1, 5, 10, 20, 30 |
| Query table — queries per category | 5, 10, 20, 5, 10 |
| Category threshold | 1,000,000 |
| category_row_num | 2 |
| category_col_num | 3 |
| Degree of parallelism | N/A |
| Top K | 5 |
| Data type | FLOAT |
| Dimensions | 2 |
| Distance measure | Inner product |
Case 4: All small categories (BINARY, Hamming distance)
Purpose: Verify result correctness for binary vectors using Hamming distance. Because doc and query data are identical, exact matches (score 0.0) appear among the top results.
| Parameter | Value |
|---|---|
| Doc table — documents per category | 1, 5, 10 |
| Query table — queries per category | 1, 5, 10 |
| Category threshold | 1,000,000 |
| category_row_num | 2 |
| category_col_num | 3 |
| Degree of parallelism | N/A |
| Top K | 5 |
| Data type | BINARY |
| Dimensions | 32 |
| Distance measure | Hamming distance |
Case 5: All large categories (FLOAT, Euclidean distance)
Purpose: Verify that large-category parallel processing produces the same results as Case 1 for categories 20 and 30.
Categories 1, 5, and 10 are removed from the doc table. Only categories 20 and 30 remain, both above the threshold of 15.
| Parameter | Value |
|---|---|
| Doc table — documents per category | 20, 30 |
| Query table — queries per category | 5, 10 |
| Category threshold | 15 |
| category_row_num | 2 |
| category_col_num | 3 |
| Degree of parallelism | 3 |
| Top K | 5 |
| Data type | FLOAT |
| Dimensions | 2 |
| Distance measure | Euclidean distance |
Case 6: Multi-label queries (FLOAT, Euclidean distance)
Purpose: Verify that a query assigned to multiple categories returns top K results drawn from all of those categories.
Doc table is the same as Case 1. Query vectors include multiple category labels using a semicolon-separated format: key;vector;category1,category2,.... For example, key1-1;1~1;1,5,10 assigns query key1-1 to categories 1, 5, and 10.
| Parameter | Value |
|---|---|
| Doc table — documents per category | 1, 5, 10, 20, 30 |
| Query table — queries per category | Multi-category data |
| Category threshold | 15 |
| category_row_num | 2 |
| category_col_num | 3 |
| Degree of parallelism | 3 |
| Top K | 5 |
| Data type | FLOAT |
| Dimensions | 2 |
| Distance measure | Euclidean distance |
Comparison tests
Each comparison test includes the steps to prepare data, run the JAR command, and verify results. All test cases use partition date 20210712.
Case 1: Mixed small and large categories (FLOAT, Euclidean distance)
Input data:
- Doc table: categories 1, 5, 10, 20, and 30, with document counts of 1, 5, 10, 20, and 30
- Query table: categories 1, 5, 10, 20, and 30, with query counts of 5, 10, 20, 5, and 10
- Data type: FLOAT, 2 dimensions, Euclidean distance
Step 1: Prepare data.
The doc table contains the following data:
key1-1,1~1,1
key5-1,1~1,5
... ...
key5-5,5~5,5
key10-1,1~1,10
key10-2,2~2,10
... ...
key10-9,9~9,10
key10-10,10~10,10
key20-1,1~1,20
... ...
key20-20,20~20,20
key30-1,1~1,30
... ...
key30-30,30~30,30
The query table contains the following data:
key1-1,1~1,1
... ...
key1-5,5~5,1
key5-1,1~1,5
... ...
key5-10,10~10,5
key10-1,1~1,10
... ...
key10-20,20~20,10
key20-1,1~1,20
... ...
key20-5,5~5,20
key30-1,1~1,30
... ...
key30-10,10~10,30
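The elided rows follow the pattern key+category ID-idx,idx~idx,category ID, so the full tables can be regenerated with a short script. This is a sketch under that assumption; the function name `make_rows` is hypothetical, and loading the rows into MaxCompute tables is not shown.

```python
def make_rows(counts_by_category):
    """Generate rows of the form key<cat>-<idx>,<idx>~<idx>,<cat>."""
    rows = []
    for category_id, count in counts_by_category.items():
        for idx in range(1, count + 1):
            rows.append(f"key{category_id}-{idx},{idx}~{idx},{category_id}")
    return rows


# Documents per category and queries per category, as listed for Case 1.
doc_rows = make_rows({1: 1, 5: 5, 10: 10, 20: 20, 30: 30})
query_rows = make_rows({1: 5, 5: 10, 10: 20, 20: 5, 30: 10})

print(len(doc_rows), doc_rows[0], doc_rows[-1])
print(len(query_rows), query_rows[0], query_rows[-1])
```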
Step 2: Run the JAR command.
With category threshold 15, category_row_num 2, category_col_num 3, degree of parallelism 3, and top K 5: categories 1, 5, and 10 are small categories (3 total) and categories 20 and 30 are large categories (2 total).
odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type float -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 15;"
Step 3: Verify results.
Each query returns the top 5 nearest neighbors within its category. Results are ranked by ascending score, where 0.0 indicates an identical vector. The scores shown match the squared Euclidean distance: for example, query [1,1] against doc [2,2] scores 2.0. The result meets the expectation:
+----+------------+-------+------------+----+
| pk | knn_result | score | category | pt |
+----+------------+-------+------------+----+
| key30-1 | key30-1 | 0.0 | 30 | 20210712 |
| key30-1 | key30-2 | 2.0 | 30 | 20210712 |
| key30-1 | key30-3 | 8.0 | 30 | 20210712 |
| key30-1 | key30-4 | 18.0 | 30 | 20210712 |
| key30-1 | key30-5 | 32.0 | 30 | 20210712 |
... ...
| key20-5 | key20-5 | 0.0 | 20 | 20210712 |
| key20-5 | key20-6 | 2.0 | 20 | 20210712 |
| key20-5 | key20-4 | 2.0 | 20 | 20210712 |
| key20-5 | key20-3 | 8.0 | 20 | 20210712 |
| key20-5 | key20-7 | 8.0 | 20 | 20210712 |
+----+------------+-------+------------+----+
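The expected scores in this table can be recomputed locally with a brute-force search. The sketch below assumes that the score is the squared Euclidean distance, which is what the sample values are consistent with. It does not reproduce how Proxima CE performs the search, and the helper names are hypothetical.

```python
import heapq


def squared_euclidean(a, b):
    """Sum of squared component differences, returned as a float."""
    return float(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query, docs, k):
    """Return the k (key, score) pairs with the smallest scores."""
    scored = ((key, squared_euclidean(query, vec)) for key, vec in docs.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Doc vectors of category 20: key20-1 -> (1, 1) ... key20-20 -> (20, 20).
docs_20 = {f"key20-{i}": (i, i) for i in range(1, 21)}

result = top_k((5, 5), docs_20, 5)  # query key20-5
for key, score in result:
    print(key, score)
```

Neighbors with equal scores (for example, key20-4 and key20-6 at 2.0) can appear in either order, so compare scores rather than row order when checking against the output table.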
Case 2: All small categories (FLOAT, Euclidean distance)
Purpose: Verify that switching all categories to small-category processing (threshold 100) produces the same nearest-neighbor rankings as Case 1.
Step 1: Prepare data. Same data as Case 1.
Step 2: Run the JAR command.
The category threshold is set to 100, which exceeds the largest category size (30 documents), so all categories are processed as small categories.
odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type float -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 100;"
Step 3: Verify results.
The expected result is the same as Case 1. Switching from large-category parallel processing to small-category block matrix processing produces identical nearest-neighbor rankings. The result meets the expectation:
+----+------------+-------+------------+----+
| pk | knn_result | score | category | pt |
+----+------------+-------+------------+----+
| key10-18 | key10-10 | 128.0 | 10 | 20210712 |
| key10-18 | key10-9 | 162.0 | 10 | 20210712 |
| key10-18 | key10-8 | 200.0 | 10 | 20210712 |
| key10-18 | key10-7 | 242.0 | 10 | 20210712 |
| key10-18 | key10-6 | 288.0 | 10 | 20210712 |
...
| key30-8 | key30-8 | 0.0 | 30 | 20210712 |
| key30-8 | key30-9 | 2.0 | 30 | 20210712 |
| key30-8 | key30-7 | 2.0 | 30 | 20210712 |
| key30-8 | key30-6 | 8.0 | 30 | 20210712 |
| key30-8 | key30-10 | 8.0 | 30 | 20210712 |
| key5-5 | key5-5 | 0.0 | 5 | 20210712 |
| key5-5 | key5-4 | 2.0 | 5 | 20210712 |
| key5-5 | key5-3 | 8.0 | 5 | 20210712 |
| key5-5 | key5-2 | 18.0 | 5 | 20210712 |
| key5-5 | key5-1 | 32.0 | 5 | 20210712 |
+----+------------+-------+------------+----+
Case 3: All small categories (FLOAT, inner product)
Purpose: Verify that switching from Euclidean distance to inner product produces correct rankings. With inner product, a higher score indicates greater similarity, so results are ranked by descending score.
Step 1: Prepare data. Same data as Case 1.
Step 2: Run the JAR command.
odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type float -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 100 -distance_method InnerProduct;"
Step 3: Verify results.
With inner product, a higher score indicates greater similarity. Each score equals the inner product of the query vector and the returned doc vector: for query key10-10 with vector [10, 10], doc vector [1,1] scores 20.0, [2,2] scores 40.0, and so on up to [5,5] at 100.0. The result meets the expectation:
+----+------------+-------+------------+----+
| pk | knn_result | score | category | pt |
+----+------------+-------+------------+----+
| key10-10 | key10-1 | 20.0 | 10 | 20210712 |
| key10-10 | key10-2 | 40.0 | 10 | 20210712 |
| key10-10 | key10-3 | 60.0 | 10 | 20210712 |
| key10-10 | key10-4 | 80.0 | 10 | 20210712 |
| key10-10 | key10-5 | 100.0 | 10 | 20210712 |
| key10-19 | key10-1 | 38.0 | 10 | 20210712 |
| key10-19 | key10-2 | 76.0 | 10 | 20210712 |
| key10-19 | key10-3 | 114.0 | 10 | 20210712 |
| key10-19 | key10-4 | 152.0 | 10 | 20210712 |
| key10-19 | key10-5 | 190.0 | 10 | 20210712 |
... ...
| key10-17 | key10-1 | 34.0 | 10 | 20210712 |
| key10-17 | key10-2 | 68.0 | 10 | 20210712 |
| key10-17 | key10-3 | 102.0 | 10 | 20210712 |
| key10-17 | key10-4 | 136.0 | 10 | 20210712 |
| key10-17 | key10-5 | 170.0 | 10 | 20210712 |
| key30-8 | key30-16 | 256.0 | 30 | 20210712 |
| key30-8 | key30-17 | 272.0 | 30 | 20210712 |
| key30-8 | key30-18 | 288.0 | 30 | 20210712 |
| key30-8 | key30-19 | 304.0 | 30 | 20210712 |
| key30-8 | key30-20 | 320.0 | 30 | 20210712 |
+----+------------+-------+------------+----+
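The scores in this table can be checked by computing the inner products directly. The following sketch recomputes the values for query key10-10 and one row for query key30-8. It only verifies the arithmetic, and the helper name `inner_product` is hypothetical.

```python
def inner_product(a, b):
    """Sum of component-wise products, returned as a float."""
    return float(sum(x * y for x, y in zip(a, b)))


# Query key10-10 has vector (10, 10); doc key10-i has vector (i, i).
query = (10, 10)
scores = {f"key10-{i}": inner_product(query, (i, i)) for i in range(1, 6)}
for key, score in scores.items():
    print(key, score)

# Query key30-8 (8, 8) against doc key30-16 (16, 16).
print(inner_product((8, 8), (16, 16)))
```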
Case 4: All small categories (BINARY, Hamming distance)
Purpose: Verify correctness for binary vectors. Because doc and query data are identical, exact matches (Hamming distance 0.0) appear in the top results.
Step 1: Prepare data.
Vectors are 32-bit binary vectors written as 32 tilde-separated bits. The test data uses three patterns: all zeros, all ones, and 15 zeros followed by 17 ones. The doc table contains:
key1-1,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,1
key5-1,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0,5
key5-2,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0,5
key5-3,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,5
key5-4,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,5
key5-5,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,5
key10-1,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0,10
key10-2,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0,10
key10-3,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-4,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-5,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-6,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-7,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-8,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-9,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-10,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
Step 2: Run the JAR command.
The -dimension flag is set to 32 (number of bits).
odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type binary -dimension 32 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 100 -distance_method hamming;"
Step 3: Verify results.
For query key10-9 (an all-ones vector), all top 5 results are all-ones vectors with Hamming distance 0.0, confirming that exact matches rank highest. For query key5-4, the two all-ones vectors score 0.0, the mixed vector key5-3 scores 15.0, and the all-zeros vectors score 32.0. The result meets the expectation:
+----+------------+-------+------------+----+
| pk | knn_result | score | category | pt |
+----+------------+-------+------------+----+
| key10-9 | key10-8 | 0.0 | 10 | 20210712 |
| key10-9 | key10-9 | 0.0 | 10 | 20210712 |
| key10-9 | key10-10 | 0.0 | 10 | 20210712 |
| key10-9 | key10-5 | 0.0 | 10 | 20210712 |
| key10-9 | key10-7 | 0.0 | 10 | 20210712 |
| key5-4 | key5-5 | 0.0 | 5 | 20210712 |
| key5-4 | key5-4 | 0.0 | 5 | 20210712 |
| key5-4 | key5-3 | 15.0 | 5 | 20210712 |
| key5-4 | key5-1 | 32.0 | 5 | 20210712 |
| key5-4 | key5-2 | 32.0 | 5 | 20210712 |
...
| key10-3 | key10-7 | 15.0 | 10 | 20210712 |
| key1-1 | key1-1 | 0.0 | 1 | 20210712 |
| key10-4 | key10-4 | 0.0 | 10 | 20210712 |
| key10-4 | key10-3 | 0.0 | 10 | 20210712 |
| key10-4 | key10-7 | 15.0 | 10 | 20210712 |
| key10-4 | key10-6 | 15.0 | 10 | 20210712 |
| key10-4 | key10-5 | 15.0 | 10 | 20210712 |
+----+------------+-------+------------+----+
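The Hamming scores above are the number of bit positions in which two vectors differ. A minimal sketch that parses the tilde-separated encoding and recomputes the three distinct scores follows. The helper names are hypothetical, and the mixed pattern (15 zeros followed by 17 ones) is taken from rows such as key5-3.

```python
def parse_bits(field):
    """Parse a tilde-separated bit string such as '0~1~1' into a list of ints."""
    return [int(bit) for bit in field.split("~")]


def hamming(a, b):
    """Count the positions where two equal-length bit vectors differ."""
    return float(sum(x != y for x, y in zip(a, b)))


all_ones = parse_bits("~".join(["1"] * 32))
all_zeros = parse_bits("~".join(["0"] * 32))
mixed = parse_bits("~".join(["0"] * 15 + ["1"] * 17))

print(hamming(all_ones, all_ones))   # identical vectors
print(hamming(all_ones, mixed))      # differs in the 15 leading bits
print(hamming(all_ones, all_zeros))  # differs in every bit
```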
Case 5: All large categories (FLOAT, Euclidean distance)
Purpose: Verify that results for categories 20 and 30 match those from Case 1 when only large categories are present in the doc table.
Step 1: Prepare data.
Query table data is the same as Case 1. The doc table contains only categories 20 and 30:
key20-1,1~1,20
key20-2,2~2,20
... ...
key20-19,19~19,20
key20-20,20~20,20
key30-1,1~1,30
key30-2,2~2,30
... ...
key30-28,28~28,30
key30-29,29~29,30
key30-30,30~30,30
Step 2: Run the JAR command.
odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type float -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 15;"
Step 3: Verify results.
The top 5 nearest neighbors for each query match the results from Case 1 for categories 20 and 30, confirming that large-category parallel processing produces the same output as the mixed-category case. The result meets the expectation:
+----+------------+-------+------------+----+
| pk | knn_result | score | category | pt |
+----+------------+-------+------------+----+
| key30-1 | key30-1 | 0.0 | 30 | 20210712 |
| key30-1 | key30-2 | 2.0 | 30 | 20210712 |
| key30-1 | key30-3 | 8.0 | 30 | 20210712 |
| key30-1 | key30-4 | 18.0 | 30 | 20210712 |
| key30-1 | key30-5 | 32.0 | 30 | 20210712 |
| key30-2 | key30-2 | 0.0 | 30 | 20210712 |
| key30-2 | key30-3 | 2.0 | 30 | 20210712 |
| key30-2 | key30-1 | 2.0 | 30 | 20210712 |
| key30-2 | key30-4 | 8.0 | 30 | 20210712 |
| key30-2 | key30-5 | 18.0 | 30 | 20210712 |
... ...
| key20-1 | key20-1 | 0.0 | 20 | 20210712 |
| key20-1 | key20-2 | 2.0 | 20 | 20210712 |
| key20-1 | key20-3 | 8.0 | 20 | 20210712 |
| key20-1 | key20-4 | 18.0 | 20 | 20210712 |
| key20-1 | key20-5 | 32.0 | 20 | 20210712 |
... ...
| key20-5 | key20-5 | 0.0 | 20 | 20210712 |
| key20-5 | key20-6 | 2.0 | 20 | 20210712 |
| key20-5 | key20-4 | 2.0 | 20 | 20210712 |
| key20-5 | key20-7 | 8.0 | 20 | 20210712 |
| key20-5 | key20-3 | 8.0 | 20 | 20210712 |
+----+------------+-------+------------+----+
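When comparing this output with Case 1, rows with tied scores can appear in a different order (for example, key20-3 and key20-7 at 8.0 for query key20-5). One way to compare two runs without depending on tie order is to count (query key, score) pairs. This is a sketch, and the function name `score_profile` is hypothetical; the rows are copied from the Case 1 and Case 5 output tables.

```python
from collections import Counter


def score_profile(rows):
    """Count (query key, score) pairs, ignoring which tied neighbor is listed."""
    return Counter((pk, score) for pk, _neighbor, score in rows)


case1_rows = [
    ("key20-5", "key20-5", 0.0),
    ("key20-5", "key20-6", 2.0),
    ("key20-5", "key20-4", 2.0),
    ("key20-5", "key20-3", 8.0),
    ("key20-5", "key20-7", 8.0),
]
case5_rows = [
    ("key20-5", "key20-5", 0.0),
    ("key20-5", "key20-6", 2.0),
    ("key20-5", "key20-4", 2.0),
    ("key20-5", "key20-7", 8.0),
    ("key20-5", "key20-3", 8.0),
]

same = score_profile(case1_rows) == score_profile(case5_rows)
print(same)
```

This check is intentionally loose: it confirms that each query has the same multiset of scores in both runs, but it does not require the same neighbor among tied candidates.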
Case 6: Multi-label queries (FLOAT, Euclidean distance)
Purpose: Verify that a query assigned to multiple categories returns top K results drawn from all of those categories.
Step 1: Prepare data.
Doc table data is the same as Case 1. The query table uses a semicolon-separated format where the last field lists multiple category IDs: key;vector;category1,category2,...
key1-1;1~1;1,5,10
key1-2;2~2;1,5,10
key1-3;3~3;1
key1-4;4~4;1
key1-5;5~5;1
Queries key1-1 and key1-2 are assigned to categories 1, 5, and 10. Queries key1-3 through key1-5 are assigned to category 1 only.
Step 2: Run the JAR command.
odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type float -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 15 -query_multi_label true;"
Step 3: Verify results.
For key1-1 (assigned to categories 1, 5, and 10), the 5 returned results are drawn from all three categories: key1-1 from category 1, key5-1 from category 5, and key10-1 from category 10 each appear with score 0.0 (exact matches), followed by key10-2 and key5-2 at 2.0. Queries assigned to only one category (key1-3 through key1-5) return results only from category 1, which contains a single document. The result meets the expectation:
+----+------------+-------+------------+----+
| pk | knn_result | score | category | pt |
+----+------------+-------+------------+----+
| key1-1 | key1-1 | 0.0 | 1 | 20210712 |
| key1-1 | key5-1 | 0.0 | 5 | 20210712 |
| key1-1 | key10-1 | 0.0 | 10 | 20210712 |
| key1-1 | key10-2 | 2.0 | 10 | 20210712 |
| key1-1 | key5-2 | 2.0 | 5 | 20210712 |
| key1-2 | key5-2 | 0.0 | 5 | 20210712 |
| key1-2 | key10-2 | 0.0 | 10 | 20210712 |
| key1-2 | key5-3 | 2.0 | 5 | 20210712 |
| key1-2 | key5-1 | 2.0 | 5 | 20210712 |
| key1-2 | key10-1 | 2.0 | 10 | 20210712 |
| key1-3 | key1-1 | 8.0 | 1 | 20210712 |
| key1-4 | key1-1 | 18.0 | 1 | 20210712 |
| key1-5 | key1-1 | 32.0 | 1 | 20210712 |
+----+------------+-------+------------+----+
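The rows for key1-1 can be reproduced with a brute-force sketch that parses the semicolon-separated query format, pools the documents of every listed category, and keeps the 5 lowest scores. Pooling across categories is an assumption inferred from the sample output, not a statement about how Proxima CE is implemented. The score is assumed to be the squared Euclidean distance, and the helper names are hypothetical.

```python
def parse_query(line):
    """Parse 'key;v1~v2;c1,c2,...' into (key, vector, category list)."""
    key, vector, categories = line.split(";")
    vec = tuple(int(v) for v in vector.split("~"))
    return key, vec, [int(c) for c in categories.split(",")]


def search(line, docs_by_category, k):
    """Return the k lowest-scoring (score, category, doc key) candidates."""
    _key, vec, categories = parse_query(line)
    candidates = []
    for category in categories:
        for doc_key, doc_vec in docs_by_category[category].items():
            score = float(sum((x - y) ** 2 for x, y in zip(vec, doc_vec)))
            candidates.append((score, category, doc_key))
    candidates.sort(key=lambda c: c[0])  # ascending score, stable for ties
    return candidates[:k]


# Doc table of Case 1: category c holds keyc-1 ... keyc-c with vectors (i, i).
docs_by_category = {
    c: {f"key{c}-{i}": (i, i) for i in range(1, c + 1)} for c in (1, 5, 10, 20, 30)
}

multi = search("key1-1;1~1;1,5,10", docs_by_category, 5)
single = search("key1-3;3~3;1", docs_by_category, 5)
print(multi)
print(single)
```

Because category 1 holds a single document, the single-label query returns one row, which matches the key1-3 line in the output table.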