Proxima CE supports multi-category search. This topic describes how to use the multi-category search feature and provides examples.
Prerequisites
The Proxima CE package is installed. For more information, see Install the Proxima CE package.
Query vectors by category
If multiple categories of vector data exist, you can use the multi-category search feature to query vector data by category.
Table creation statements
CREATE TABLE doc_table_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);
CREATE TABLE query_table_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);
Import data into input tables
The fields of the doc and query tables for basic vector search are similar to those for multi-category search. The only difference is that the category field of the BIGINT type is added to input tables for multi-category search. You can execute the following statements on an SQL node in the DataWorks console.
The doc and query tables must contain the same number of categories with the same values. Otherwise, the search task fails to run. If this occurs, manually adjust the categories to resolve the mismatch.
CREATE TABLE category_doc_table_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);
ALTER TABLE category_doc_table_float_smoke ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE category_doc_table_float_smoke PARTITION (pt='20221111') VALUES
('1.nid','1~1~1~1~1~1~1~1', 1),
('2.nid','2~2~2~2~2~2~2~2', 1),
('3.nid','3~3~3~3~3~3~3~3', 1),
('4.nid','4~4~4~4~4~4~4~4', 2),
('5.nid','5~5~5~5~5~5~5~5', 2),
('6.nid','6~6~6~6~6~6~6~6', 2),
('7.nid','7~7~7~7~7~7~7~7', 3),
('8.nid','8~8~8~8~8~8~8~8', 3),
('9.nid','9~9~9~9~9~9~9~9', 3),
('10.nid','10~10~10~10~10~10~10~10', 4);
-- SELECT * FROM category_doc_table_float_smoke;
CREATE TABLE category_query_table_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);
ALTER TABLE category_query_table_float_smoke ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE category_query_table_float_smoke PARTITION (pt='20221111') VALUES
('q1.nid','1~1~1~1~2~2~2~2', 1),
('q2.nid','4~4~4~4~3~3~3~3', 2),
('q3.nid','9~9~9~9~5~5~5~5', 3);
-- SELECT * FROM category_query_table_float_smoke;
Use DataWorks to run a search task
This example uses DataWorks to run the search task. The external volume is created before the search task is run.
For details about the parameter configuration used in the following example code, see Reference: Proxima CE parameters.
Sample code:
--@resource_reference{"proxima-ce-aliyun-1.0.0.jar"}
jar -resources proxima-ce-aliyun-1.0.0.jar -- The Proxima CE JAR file that is uploaded.
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner -- The entry class of the main function.
-doc_table category_doc_table_float_smoke -- The name of the doc table.
-doc_table_partition 20221111 -- The name of the partition in the doc table.
-query_table category_query_table_float_smoke -- The name of the query table.
-query_table_partition 20221111 -- The name of the partition in the query table.
-output_table category_output_table_float_smoke -- The name of the output table.
-output_table_partition 20221111 -- The name of the partition in the output table.
-data_type float -- The vector data type.
-dimension 8 -- The vector dimension.
-topk 1 -- The value of K for top K search.
-job_mode train:build:seek:recall -- The mode in which a search task runs. Default value: train:build:seek. If you set this parameter to train:build:seek:recall, the recall rate of the search task can be calculated.
-external_volume_name udf_proxima_ext -- The name of the external volume whose data source is Object Storage Service (OSS). You must create the OSS directory before you run a search task. Otherwise, the search task fails to run.
-owner_id 123456 -- The owner ID, which must be unique.
-- -category_row_num 1 -- The number of rows for which indexes are created in a small category. In most cases, you can retain the default value.
-- -category_col_num 1 -- The number of columns for which indexes are created in a small category. In most cases, you can retain the default value.
-- -category_thread_num 10 -- The concurrency of large-category search tasks that can be run. In most cases, you can retain the default value.
;
Output result
+------------+------------+------------+------------+------------+
| pk | knn_result | score | category | pt |
+------------+------------+------------+------------+------------+
| q1.nid | 1.nid | 4.0 | 1 | 20221111 |
| q2.nid | 4.nid | 4.0 | 2 | 20221111 |
| q3.nid | 7.nid | 32.0 | 3 | 20221111 |
+------------+------------+------------+------------+------------+
Query data from multiple categories
If a record in the query table belongs to multiple categories, the search must cover each of those categories. This feature does not conflict with the single-category query feature.
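To illustrate the data layout, the following minimal Python sketch parses one query row from the sample data used later in this section. It assumes that vectors are stored as '~'-separated strings and that the multicategory field is a comma-separated string such as '1,2', as in the sample rows. The helper names parse_vector, parse_multicategory, and check_query_row are hypothetical and are not part of Proxima CE.

```python
from typing import List, Tuple


def parse_vector(text: str) -> List[float]:
    """Split a '~'-separated vector string into a list of floats."""
    return [float(part) for part in text.split("~")]


def parse_multicategory(text: str) -> List[int]:
    """Split a comma-separated category string such as '1,2' into integers."""
    return [int(part) for part in text.split(",") if part.strip()]


def check_query_row(row: Tuple[str, str, str], dimension: int) -> Tuple[str, List[float], List[int]]:
    """Parse a (pk, vector, multicategory) row and check the vector length.

    `dimension` mirrors the -dimension parameter of the search task.
    """
    pk, vector_text, category_text = row
    vector = parse_vector(vector_text)
    if len(vector) != dimension:
        raise ValueError(f"{pk}: expected {dimension} values, got {len(vector)}")
    return pk, vector, parse_multicategory(category_text)


# Sample row taken from the query table defined later in this section.
pk, vector, categories = check_query_row(("q1.nid", "1~1~1~1~2~2~2~2", "1,2"), dimension=8)
print(pk, vector, categories)
```

A check of this kind can be run on exported sample rows before a search task is submitted. It is only a local sanity check and does not reflect how Proxima CE validates its input.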
Table creation statements
CREATE TABLE doc_table_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);
CREATE TABLE query_table_float_smoke(pk STRING, vector STRING, multicategory STRING) PARTITIONED BY (pt STRING);
Import data into input tables
The fields of the doc and query tables for basic vector search are similar to those for multi-category search. The differences are that the category field of the BIGINT type is added to the doc table for multi-category search and the multicategory field of the STRING type is added to the query table for multi-category search. You can execute the following statements on an SQL node in the DataWorks console.
CREATE TABLE category_doc_table_mc_float_smoke(pk STRING, vector STRING, category BIGINT) PARTITIONED BY (pt STRING);
ALTER TABLE category_doc_table_mc_float_smoke ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE category_doc_table_mc_float_smoke PARTITION (pt='20221111') VALUES
('1.nid','1~1~1~1~1~1~1~1', 1),
('2.nid','2~2~2~2~2~2~2~2', 1),
('3.nid','3~3~3~3~3~3~3~3', 1),
('4.nid','4~4~4~4~4~4~4~4', 2),
('5.nid','1~1~1~1~1~1~1~1', 2),
('6.nid','7~7~7~7~7~7~7~7', 2),
('7.nid','7~7~7~7~7~7~7~7', 3),
('8.nid','8~8~8~8~8~8~8~8', 3),
('9.nid','9~9~9~9~9~9~9~9', 3),
('10.nid','10~10~10~10~10~10~10~10', 4);
-- SELECT * FROM category_doc_table_mc_float_smoke WHERE pt='20221111';
CREATE TABLE category_query_table_mc_float_smoke(pk STRING, vector STRING, multicategory STRING) PARTITIONED BY (pt STRING);
ALTER TABLE category_query_table_mc_float_smoke ADD PARTITION(pt='20221111');
INSERT OVERWRITE TABLE category_query_table_mc_float_smoke PARTITION (pt='20221111') VALUES
('q1.nid','1~1~1~1~2~2~2~2', '1,2'),
('q2.nid','4~4~4~4~3~3~3~3', '2'),
('q3.nid','9~9~9~9~5~5~5~5', '2,3');
-- SELECT * FROM category_query_table_mc_float_smoke WHERE pt='20221111';
Use DataWorks to run a search task
This example uses DataWorks to run the search task. The external volume is created before the search task is run.
For details about the parameter configuration used in the following example code, see Reference: Proxima CE parameters.
Sample code:
--@resource_reference{"proxima-ce-aliyun-1.0.0.jar"}
jar -resources proxima-ce-aliyun-1.0.0.jar -- The Proxima CE JAR file that is uploaded.
-classpath proxima-ce-aliyun-1.0.0.jar com.alibaba.proxima2.ce.ProximaCERunner -- The entry class of the main function.
-doc_table category_doc_table_mc_float_smoke -- The name of the doc table.
-doc_table_partition 20221111 -- The name of the partition in the doc table.
-query_table category_query_table_mc_float_smoke -- The name of the query table.
-query_table_partition 20221111 -- The name of the partition in the query table.
-output_table category_output_table_mc_float_smoke -- The name of the output table.
-output_table_partition 20221111 -- The name of the partition in the output table.
-data_type float -- The vector data type.
-dimension 8 -- The vector dimension.
-topk 2 -- The value of K for top K search.
-job_mode train:build:seek:recall -- The mode in which a search task runs. Default value: train:build:seek. If you set this parameter to train:build:seek:recall, the recall rate of the search task can be calculated.
-external_volume_name udf_proxima_ext -- The name of the external volume whose data source is OSS. You must create the OSS directory before you run a search task. Otherwise, the search task fails to run.
-owner_id 123456 -- The owner ID, which must be unique.
-query_multi_label true -- This parameter is required for multi-category search. Set this parameter to true.
-- -category_row_num 1 -- The number of rows for which indexes are created in a small category. In most cases, you can retain the default value.
-- -category_col_num 1 -- The number of columns for which indexes are created in a small category. In most cases, you can retain the default value.
-- -category_thread_num 10 -- The concurrency of large-category search tasks that can be run. In most cases, you can retain the default value.
;
Output result
+------------+------------+------------+------------+------------+
| pk | knn_result | score | category | pt |
+------------+------------+------------+------------+------------+
| q1.nid | 2.nid | 4.0 | 1 | 20221111 |
| q1.nid | 5.nid | 4.0 | 2 | 20221111 |
| q2.nid | 4.nid | 4.0 | 2 | 20221111 |
| q2.nid | 5.nid | 52.0 | 2 | 20221111 |
| q3.nid | 6.nid | 32.0 | 2 | 20221111 |
| q3.nid | 7.nid | 32.0 | 3 | 20221111 |
+------------+------------+------------+------------+------------+
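The scores in the sample output can be cross-checked by hand. The following Python sketch is a brute-force reference, not the Proxima CE algorithm. It rests on two assumptions inferred from the sample output rather than from documented behavior: the score is the squared Euclidean distance, and the top K results are taken over the union of the categories listed for a query. The function name brute_force_search is hypothetical.

```python
from typing import Dict, List, Tuple

# Rows copied from category_doc_table_mc_float_smoke: pk -> (vector, category).
DOCS: Dict[str, Tuple[str, int]] = {
    "1.nid": ("1~1~1~1~1~1~1~1", 1),
    "2.nid": ("2~2~2~2~2~2~2~2", 1),
    "3.nid": ("3~3~3~3~3~3~3~3", 1),
    "4.nid": ("4~4~4~4~4~4~4~4", 2),
    "5.nid": ("1~1~1~1~1~1~1~1", 2),
    "6.nid": ("7~7~7~7~7~7~7~7", 2),
    "7.nid": ("7~7~7~7~7~7~7~7", 3),
    "8.nid": ("8~8~8~8~8~8~8~8", 3),
    "9.nid": ("9~9~9~9~9~9~9~9", 3),
    "10.nid": ("10~10~10~10~10~10~10~10", 4),
}


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance (assumed scoring, inferred from the sample output)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query_vector: str, multicategory: str, topk: int) -> List[Tuple[str, float, int]]:
    """Return up to topk (pk, score, category) tuples from the listed categories."""
    wanted = {int(c) for c in multicategory.split(",")}
    query = [float(v) for v in query_vector.split("~")]
    scored = []
    for pk, (vector_text, category) in DOCS.items():
        if category in wanted:
            doc = [float(v) for v in vector_text.split("~")]
            scored.append((pk, squared_l2(query, doc), category))
    # Stable sort: rows with equal scores keep their table order.
    scored.sort(key=lambda item: item[1])
    return scored[:topk]


print(brute_force_search("4~4~4~4~3~3~3~3", "2", 2))    # q2.nid
print(brute_force_search("9~9~9~9~5~5~5~5", "2,3", 2))  # q3.nid
print(brute_force_search("1~1~1~1~2~2~2~2", "1,2", 2))  # q1.nid
```

For q2.nid and q3.nid, the sketch returns the same primary keys and scores as the output table. For q1.nid, three doc vectors (1.nid, 2.nid, and 5.nid) share the score 4.0, so the two rows that are returned depend on how ties are broken, and only the scores are expected to match. Because the category 3 rows are identical in both doc tables, calling brute_force_search("9~9~9~9~5~5~5~5", "3", 1) also reproduces the q3.nid row of the single-category output.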