Query vector data by category in Proxima CE - MaxCompute

Vector embeddings alone cannot express category boundaries. If your dataset spans multiple categories and you want to confine each search to a specific category, add a category field to your tables so Proxima CE searches only the vectors in the requested category.

Two patterns are supported:

Pattern	Doc table field	Query table field	Required parameter
Single category per query row	`category BIGINT`	`category BIGINT`	—
Multiple categories per query row	`category BIGINT`	`multicategory STRING`	`-query_multi_label true`

Prerequisites

Before you begin, ensure that you have:

The Proxima CE package installed. See Install the Proxima CE package.
An Object Storage Service (OSS) directory created for the external volume. The search task fails if the OSS directory does not exist when the task runs.

Query vectors by single category

Use this pattern when each query vector targets exactly one category. Proxima CE searches only the doc vectors in the matching category.

Data model

Both the doc and query tables include a category BIGINT column. This is the only schema difference compared to basic vector search.

-- Doc table
CREATE TABLE category_doc_table_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);

-- Query table
CREATE TABLE category_query_table_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);

Important

The number and values of categories in the doc table and query table must be identical. If they differ, the search task fails. Manually reconcile the category values in both tables before rerunning.

Load data

Run the following SQL statements on a SQL node in the DataWorks console.

-- Doc table: 10 vectors across categories 1-4
CREATE TABLE IF NOT EXISTS category_doc_table_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);
ALTER TABLE category_doc_table_float_smoke ADD IF NOT EXISTS PARTITION(pt='20221111');
INSERT OVERWRITE TABLE category_doc_table_float_smoke PARTITION (pt='20221111') VALUES
('1.nid','1~1~1~1~1~1~1~1', 1),
('2.nid','2~2~2~2~2~2~2~2', 1),
('3.nid','3~3~3~3~3~3~3~3', 1),
('4.nid','4~4~4~4~4~4~4~4', 2),
('5.nid','5~5~5~5~5~5~5~5', 2),
('6.nid','6~6~6~6~6~6~6~6', 2),
('7.nid','7~7~7~7~7~7~7~7', 3),
('8.nid','8~8~8~8~8~8~8~8', 3),
('9.nid','9~9~9~9~9~9~9~9', 3),
('10.nid','10~10~10~10~10~10~10~10', 4);
-- SELECT * FROM category_doc_table_float_smoke WHERE pt='20221111';

-- Query table: 3 query vectors, one each for categories 1, 2, and 3
CREATE TABLE IF NOT EXISTS category_query_table_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);
ALTER TABLE category_query_table_float_smoke ADD IF NOT EXISTS PARTITION(pt='20221111');
INSERT OVERWRITE TABLE category_query_table_float_smoke PARTITION (pt='20221111') VALUES
('q1.nid','1~1~1~1~2~2~2~2', 1),
('q2.nid','4~4~4~4~3~3~3~3', 2),
('q3.nid','9~9~9~9~5~5~5~5', 3);
-- SELECT * FROM category_query_table_float_smoke WHERE pt='20221111';
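
Because the search task fails when the categories in the two tables differ, it can help to compare them before launching the task. The following query is a minimal sketch, assuming both tables are loaded into partition 20221111 as shown above. The aliases doc_cnt and query_cnt are illustrative names, not Proxima CE fields. A NULL in either count column marks a category that exists in only one table.

```sql
-- Compare the categories of the doc and query tables side by side.
SELECT COALESCE(d.category, q.category) AS category,
       d.doc_cnt,     -- number of doc vectors in the category (NULL if none)
       q.query_cnt    -- number of query vectors in the category (NULL if none)
FROM (
    SELECT category, COUNT(*) AS doc_cnt
    FROM category_doc_table_float_smoke
    WHERE pt = '20221111'
    GROUP BY category
) d
FULL OUTER JOIN (
    SELECT category, COUNT(*) AS query_cnt
    FROM category_query_table_float_smoke
    WHERE pt = '20221111'
    GROUP BY category
) q
ON d.category = q.category
ORDER BY category
LIMIT 1000;
```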

Run a search task

Use DataWorks to run the search task. Create an external volume backed by OSS before running.

For parameter descriptions, see Reference: Proxima CE parameters.

--@resource_reference{"proxima-ce-aliyun-1.0.0.jar"}
jar -resources proxima-ce-aliyun-1.0.0.jar  -- The Proxima CE JAR file that is uploaded.
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner  -- The entry class of the main function.
-doc_table category_doc_table_float_smoke  -- The name of the doc table.
-doc_table_partition 20221111  -- The name of the partition in the doc table.
-query_table category_query_table_float_smoke  -- The name of the query table.
-query_table_partition 20221111  -- The name of the partition in the query table.
-output_table category_output_table_float_smoke  -- The name of the output table.
-output_table_partition 20221111  -- The name of the partition in the output table.
-data_type float  -- The vector data type.
-dimension 8  -- The vector dimension.
-topk 1  -- The value of K for top-K search.
-job_mode train:build:seek:recall  -- The job mode. Default: train:build:seek. Adding :recall enables recall rate calculation.
-external_volume_name udf_proxima_ext  -- The external volume name (OSS-backed). Create the OSS directory before running.
-owner_id 123456  -- The owner ID. Must be unique.
-- -category_row_num 1  -- Rows indexed per small category. Retain the default in most cases.
-- -category_col_num 1  -- Columns indexed per small category. Retain the default in most cases.
-- -category_thread_num 10  -- Concurrency for large-category search. Retain the default in most cases.
;

Output

The output table contains one row per query vector, showing the top-1 nearest neighbor within the searched category.

+------------+------------+------------+------------+------------+
| pk         | knn_result | score      | category   | pt         |
+------------+------------+------------+------------+------------+
| q1.nid     | 1.nid      | 4.0        | 1          | 20221111   |
| q2.nid     | 4.nid      | 4.0        | 2          | 20221111   |
| q3.nid     | 7.nid      | 32.0       | 3          | 20221111   |
+------------+------------+------------+------------+------------+
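
The scores in this table are consistent with the squared Euclidean distance between the query vector and the matched doc vector. That metric is inferred from the sample values, not stated by the parameters shown in this topic, so treat the following arithmetic as a check on the sample data rather than a definition.

```latex
\[
\text{score}(q, d) = \sum_{i=1}^{8} (q_i - d_i)^2
\]
\[
\text{score}(\texttt{q1.nid}, \texttt{1.nid}) = 4\,(1-1)^2 + 4\,(2-1)^2 = 4
\]
\[
\text{score}(\texttt{q2.nid}, \texttt{4.nid}) = 4\,(4-4)^2 + 4\,(3-4)^2 = 4
\]
\[
\text{score}(\texttt{q3.nid}, \texttt{7.nid}) = 4\,(9-7)^2 + 4\,(5-7)^2 = 32
\]
```

Under the same assumption, 2.nid is also at distance 4 from q1.nid, so the top-1 result for q1.nid is a tie between 1.nid and 2.nid.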

Query vectors across multiple categories

Use this pattern when a single query vector needs to search across two or more categories at once. A query row can also list just one category, as q2.nid does in the sample data, so this pattern covers the single-category case as well.

The schema differs from single-category search only in the query table:

The doc table still uses category BIGINT.
The query table uses multicategory STRING instead of category BIGINT. Store multiple category values as a comma-separated string, for example '1,2'.

Set -query_multi_label true in the search task.

Data model

-- Doc table (same schema as single-category)
CREATE TABLE category_doc_table_mc_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);

-- Query table (multicategory STRING replaces category BIGINT)
CREATE TABLE category_query_table_mc_float_smoke(pk STRING, vector STRING, multicategory STRING) PARTITIONED BY (pt STRING);

Important

The categories defined in the doc table must match those referenced in the multicategory field of the query table. A mismatch causes the search task to fail.

Load data

-- Doc table: 10 vectors across categories 1-4
CREATE TABLE IF NOT EXISTS category_doc_table_mc_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);
ALTER TABLE category_doc_table_mc_float_smoke ADD IF NOT EXISTS PARTITION(pt='20221111');
INSERT OVERWRITE TABLE category_doc_table_mc_float_smoke PARTITION (pt='20221111') VALUES
('1.nid','1~1~1~1~1~1~1~1', 1),
('2.nid','2~2~2~2~2~2~2~2', 1),
('3.nid','3~3~3~3~3~3~3~3', 1),
('4.nid','4~4~4~4~4~4~4~4', 2),
('5.nid','1~1~1~1~1~1~1~1', 2),
('6.nid','7~7~7~7~7~7~7~7', 2),
('7.nid','7~7~7~7~7~7~7~7', 3),
('8.nid','8~8~8~8~8~8~8~8', 3),
('9.nid','9~9~9~9~9~9~9~9', 3),
('10.nid','10~10~10~10~10~10~10~10', 4);
-- SELECT * FROM category_doc_table_mc_float_smoke WHERE pt='20221111';

-- Query table: 3 query vectors; q1 and q3 target two categories each, q2 targets one
CREATE TABLE IF NOT EXISTS category_query_table_mc_float_smoke(pk STRING, vector STRING, multicategory STRING) PARTITIONED BY (pt STRING);
ALTER TABLE category_query_table_mc_float_smoke ADD IF NOT EXISTS PARTITION(pt='20221111');
INSERT OVERWRITE TABLE category_query_table_mc_float_smoke PARTITION (pt='20221111') VALUES
('q1.nid','1~1~1~1~2~2~2~2', '1,2'),
('q2.nid','4~4~4~4~3~3~3~3', '2'),
('q3.nid','9~9~9~9~5~5~5~5', '2,3');
-- SELECT * FROM category_query_table_mc_float_smoke WHERE pt='20221111';
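
The multicategory strings can be checked against the doc table before the search runs. The following sketch assumes comma-separated BIGINT values, optionally padded with spaces, and splits each string into one row per category. The aliases c and cat are illustrative. An empty result means every category referenced by the query table has doc vectors; for the sample data above, the query returns no rows.

```sql
-- List query rows that reference a category with no doc vectors.
SELECT q.pk, q.cat
FROM (
    -- One output row per (query pk, requested category).
    SELECT pk, CAST(TRIM(c) AS BIGINT) AS cat
    FROM category_query_table_mc_float_smoke
    LATERAL VIEW EXPLODE(SPLIT(multicategory, ',')) t AS c
    WHERE pt = '20221111'
) q
LEFT OUTER JOIN (
    SELECT DISTINCT category
    FROM category_doc_table_mc_float_smoke
    WHERE pt = '20221111'
) d
ON q.cat = d.category
WHERE d.category IS NULL;
```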

Run a search task

This task differs from the single-category task in three ways: the table names reference the multi-category tables, -topk is raised from 1 to 2, and -query_multi_label true is added. The -query_multi_label true parameter is required for multi-category search.

For parameter descriptions, see Reference: Proxima CE parameters.

--@resource_reference{"proxima-ce-aliyun-1.0.0.jar"}
jar -resources proxima-ce-aliyun-1.0.0.jar  -- The Proxima CE JAR file that is uploaded.
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner  -- The entry class of the main function.
-doc_table category_doc_table_mc_float_smoke  -- The name of the doc table.
-doc_table_partition 20221111  -- The name of the partition in the doc table.
-query_table category_query_table_mc_float_smoke  -- The name of the query table.
-query_table_partition 20221111  -- The name of the partition in the query table.
-output_table category_output_table_mc_float_smoke  -- The name of the output table.
-output_table_partition 20221111  -- The name of the partition in the output table.
-data_type float  -- The vector data type.
-dimension 8  -- The vector dimension.
-topk 2  -- The value of K for top-K search.
-job_mode train:build:seek:recall  -- The job mode. Default: train:build:seek. Adding :recall enables recall rate calculation.
-external_volume_name udf_proxima_ext  -- The external volume name (OSS-backed). Create the OSS directory before running.
-owner_id 123456  -- The owner ID. Must be unique.
-query_multi_label true  -- Required for multi-category search. Set to true.
-- -category_row_num 1  -- Rows indexed per small category. Retain the default in most cases.
-- -category_col_num 1  -- Columns indexed per small category. Retain the default in most cases.
-- -category_thread_num 10  -- Concurrency for large-category search. Retain the default in most cases.
;

Output

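With -topk 2, each query vector in this example returns its two nearest doc vectors, chosen from all the categories it lists. The category column shows the category of the matched doc vector. For example, q1.nid targets categories 1 and 2 and gets one match from each, while q2.nid targets only category 2 and gets both matches from that category.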

+------------+------------+------------+------------+------------+
| pk         | knn_result | score      | category   | pt         |
+------------+------------+------------+------------+------------+
| q1.nid     | 2.nid      | 4.0        | 1          | 20221111   |
| q1.nid     | 5.nid      | 4.0        | 2          | 20221111   |
| q2.nid     | 4.nid      | 4.0        | 2          | 20221111   |
| q2.nid     | 5.nid      | 52.0       | 2          | 20221111   |
| q3.nid     | 6.nid      | 32.0       | 2          | 20221111   |
| q3.nid     | 7.nid      | 32.0       | 3          | 20221111   |
+------------+------------+------------+------------+------------+
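
To read each result next to the categories its query requested, the output table can be joined back to the query table. This is a sketch, assuming the output table exposes the columns shown above (pk, knn_result, score, category, pt).

```sql
-- Show each match together with the categories the query row asked for.
SELECT o.pk,
       q.multicategory,   -- categories requested by the query row
       o.knn_result,
       o.category,        -- category of the matched doc vector
       o.score
FROM (
    SELECT pk, knn_result, score, category
    FROM category_output_table_mc_float_smoke
    WHERE pt = '20221111'
) o
JOIN (
    SELECT pk, multicategory
    FROM category_query_table_mc_float_smoke
    WHERE pt = '20221111'
) q
ON o.pk = q.pk
ORDER BY pk, score
LIMIT 100;
```

Each value in the category column should appear in the multicategory string on the same row.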

Usage notes

Topic	Detail
OSS directory	Create the OSS directory for the external volume before running any search task. The task fails immediately if the directory does not exist.
Category consistency	The categories defined in the doc table and referenced in the query table must match in both count and values. Manually reconcile any mismatch before rerunning.
Optional index parameters	`-category_row_num`, `-category_col_num`, and `-category_thread_num` are commented out in the examples because their defaults work for most workloads. Adjust them only if you observe index performance issues with very small or very large category sizes.
Recall rate	Setting `-job_mode train:build:seek:recall` adds a recall evaluation phase after the search. The default mode `train:build:seek` skips this phase.