Multi-category testing

This topic describes the test conclusion and procedure of multi-category testing.

Test conclusion

Proxima CE is suitable for both search by category and multi-label search by category in multi-category scenarios.

Test procedure

Test methods

Case 1: This test case generates a batch of multi-category data. The numbers of documents in each category in the doc table are 1, 5, 10, 20, and 30. The numbers of documents in each category in the query table are 5, 10, 20, 5, and 10. The category threshold is set to 15. The number of rows for small categories is 2 and the number of columns for small categories is 3. The degree of parallelism for large categories is 3. The value of top K is 5. The data type is FLOAT. The number of dimensions is 2. The distance measure type is Euclidean distance. This test case aims to check the correctness of test results.

Number of documents in each category in the doc table	Number of documents in each category in the query table	Category threshold	Number of rows for small categories	Number of columns for small categories	Degree of parallelism for large categories	Top K	Data type	Number of dimensions	Distance measure type
1, 5, 10, 20, and 30	5, 10, 20, 5, and 10	15	2	3	3	5	FLOAT	2	Euclidean distance
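The effect of the category threshold on the Case 1 data can be sketched as follows. This is a minimal illustration, not the Proxima CE implementation: the helper name `split_categories` is hypothetical, and it assumes that a category is classified by its document count in the doc table and counts as large when that count exceeds the threshold. The source does not state whether the comparison is strict, and the Case 1 counts give the same split either way.

```python
# Hypothetical helper: split categories into small and large by doc count.
# Assumption: a category is "large" when its doc count exceeds the threshold.
def split_categories(doc_counts, threshold):
    small = {c: n for c, n in doc_counts.items() if n <= threshold}
    large = {c: n for c, n in doc_counts.items() if n > threshold}
    return small, large


# Case 1: category ID -> number of documents in the doc table.
doc_counts = {1: 1, 5: 5, 10: 10, 20: 20, 30: 30}

small, large = split_categories(doc_counts, threshold=15)
print(sorted(small), sorted(large))  # [1, 5, 10] [20, 30]
```

Under the same assumption, a very high threshold such as 1000000 leaves every category in the small group, which is the setup used in Case 2 through Case 4.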

Case 2: This test case aims to check the correctness of test results when only small categories exist. The test data is the same as the test data of Case 1. The category threshold is 1000000. This value indicates that only small categories exist. The distance measure type is Euclidean distance.

Number of documents in each category in the doc table	Number of documents in each category in the query table	Category threshold	Number of rows for small categories	Number of columns for small categories	Degree of parallelism for large categories	Top K	Data type	Number of dimensions	Distance measure type
1, 5, 10, 20, and 30	5, 10, 20, 5, and 10	1000000	2	3	-	5	FLOAT	2	Euclidean distance

Case 3: The test data is the same as the test data of Case 2. The distance measure type is inner product.

Number of documents in each category in the doc table	Number of documents in each category in the query table	Category threshold	Number of rows for small categories	Number of columns for small categories	Degree of parallelism for large categories	Top K	Data type	Number of dimensions	Distance measure type
1, 5, 10, 20, and 30	5, 10, 20, 5, and 10	1000000	2	3	-	5	FLOAT	2	Inner product

Case 4: This test case aims to check the correctness of test results when only small categories exist. The numbers of documents in each category in the doc table are 1, 5, and 10. The numbers of documents in each category in the query table are 1, 5, and 10. The data type is BINARY. The distance measure type is Hamming distance. The category threshold is 1000000.

Number of documents in each category in the doc table	Number of documents in each category in the query table	Category threshold	Number of rows for small categories	Number of columns for small categories	Degree of parallelism for large categories	Top K	Data type	Number of dimensions	Distance measure type
1, 5, and 10	1, 5, and 10	1000000	2	3	-	5	BINARY	4	Hamming distance

Case 5: This test case aims to check the correctness of test results when only large categories exist. The test data is the same as the test data of Case 1 with categories 1, 5, and 10 removed.

Number of documents in each category in the doc table	Number of documents in each category in the query table	Category threshold	Number of rows for large categories	Number of columns for large categories	Degree of parallelism for large categories	Top K	Data type	Number of dimensions	Distance measure type
20 and 30	5 and 10	15	2	3	3	5	FLOAT	2	Euclidean distance

Case 6: This test case checks multi-label queries, in which a row in the query table can belong to multiple categories. Data in the doc table of this test case is the same as data in the doc table of Case 1.

Number of documents in each category in the doc table	Number of documents in each category in the query table	Category threshold	Number of rows for small categories	Number of columns for small categories	Degree of parallelism for large categories	Top K	Data type	Number of dimensions	Distance measure type
1, 5, 10, 20, and 30	Multi-category data	15	2	3	3	5	FLOAT	2	Euclidean distance

Comparison tests

Case 1: Input data: The numbers of documents in categories 1, 5, 10, 20, and 30 in the doc table are 1, 5, 10, 20, and 30. The numbers of documents in categories 1, 5, 10, 20, and 30 in the query table are 5, 10, 20, 5, and 10. The data type is FLOAT and the number of dimensions is 2. The distance measure type is Euclidean distance.

Prepare data.

The multi-category data is generated in the key+category ID-idx,idx~idx,category ID format, such as key1-1,1~1,1.

The doc table contains the following data:

 key1-1,1~1,1
 key5-1,1~1,5
 ... ...
 key5-5,5~5,5
 key10-1,1~1,10
 key10-2,2~2,10
 ... ...
 key10-9,9~9,10
 key10-10,10~10,10
 key20-1,1~1,20
 ... ...
 key20-20,20~20,20
 key30-1,1~1,30
 ... ...
 key30-30,30~30,30

The query table contains the following data:

 key1-1,1~1,1
 ... ...
 key1-5,5~5,1
 key5-1,1~1,5
 ... ...
 key5-10,10~10,5
 key10-1,1~1,10
 ... ...
 key10-20,20~20,10
 key20-1,1~1,20
 ... ...
 key20-5,5~5,20
 key30-1,1~1,30
 ... ...
 key30-10,10~10,30
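The doc and query rows listed above follow a regular pattern, so they can be regenerated with a short script. The following is a sketch only: the function name `make_rows` is hypothetical, and loading the rows into MaxCompute tables is not covered.

```python
def make_rows(counts):
    """Build rows in the 'key<category>-<idx>,<idx>~<idx>,<category>' format.

    counts maps a category ID to the number of rows to generate for it.
    """
    rows = []
    for category, n in counts.items():
        for idx in range(1, n + 1):
            rows.append(f"key{category}-{idx},{idx}~{idx},{category}")
    return rows


# Row counts per category taken from Case 1.
doc_rows = make_rows({1: 1, 5: 5, 10: 10, 20: 20, 30: 30})
query_rows = make_rows({1: 5, 5: 10, 10: 20, 20: 5, 30: 10})

print(doc_rows[0], len(doc_rows))        # key1-1,1~1,1 66
print(query_rows[-1], len(query_rows))   # key30-10,10~10,30 50
```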

Obtain the test result.

In this example, the category threshold is 15, the number of rows for small categories is 2, the number of columns for small categories is 3, the degree of parallelism for large categories is 3, and the value of top K is 5. With these settings, the number of small categories is 3 and the number of large categories is 2. The test returns the following result, which meets the expectation.

+----+------------+-------+------------+----+
| pk | knn_result | score | category   | pt |
+----+------------+-------+------------+----+
| key30-1 | key30-1    | 0.0   | 30         | 20210712 |
| key30-1 | key30-2    | 2.0   | 30         | 20210712 |
| key30-1 | key30-3    | 8.0   | 30         | 20210712 |
| key30-1 | key30-4    | 18.0  | 30         | 20210712 |
| key30-1 | key30-5    | 32.0  | 30         | 20210712 |
... ...
| key20-5 | key20-5    | 0.0   | 20         | 20210712 |
| key20-5 | key20-6    | 2.0   | 20         | 20210712 |
| key20-5 | key20-4    | 2.0   | 20         | 20210712 |
| key20-5 | key20-3    | 8.0   | 20         | 20210712 |
| key20-5 | key20-7    | 8.0   | 20         | 20210712 |
+----+------------+-------+------------+----+

JAR command: odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type int8 -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 15;"
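The expected values for Case 1 can be cross-checked with a brute-force search inside one category. The sketch below assumes that the score column holds the squared Euclidean distance (without the square root). That reading is inferred from the values in the result table and is not stated explicitly. The function names are hypothetical.

```python
def squared_euclidean(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, docs, k):
    """docs is a list of (key, vector); return the k nearest as (key, score)."""
    scored = [(key, float(squared_euclidean(query, vec))) for key, vec in docs]
    return sorted(scored, key=lambda item: item[1])[:k]


# Doc vectors of categories 30 and 20, following the idx~idx pattern.
docs_30 = [(f"key30-{i}", (i, i)) for i in range(1, 31)]
docs_20 = [(f"key20-{i}", (i, i)) for i in range(1, 21)]

# Query key30-1 has vector (1, 1).
print(top_k((1, 1), docs_30, 5))
# [('key30-1', 0.0), ('key30-2', 2.0), ('key30-3', 8.0), ('key30-4', 18.0), ('key30-5', 32.0)]

# Query key20-5 has vector (5, 5). The order among equal scores may differ
# from the table, so only the scores are compared.
print([score for _, score in top_k((5, 5), docs_20, 5)])
# [0.0, 2.0, 2.0, 8.0, 8.0]
```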

Case 2: All categories are small categories. The distance measure type is Euclidean distance. The expected result of this test case is the same as the expected result of Case 1.

Prepare data. The data of this test case is the same as the data of Case 1.

Obtain the test result.

The test result is correct and meets the expectation. The following result is returned:

+----+------------+-------+------------+----+
| pk | knn_result | score | category   | pt |
+----+------------+-------+------------+----+
| key10-18 | key10-10   | 128.0 | 10         | 20210712 |
| key10-18 | key10-9    | 162.0 | 10         | 20210712 |
| key10-18 | key10-8    | 200.0 | 10         | 20210712 |
| key10-18 | key10-7    | 242.0 | 10         | 20210712 |
| key10-18 | key10-6    | 288.0 | 10         | 20210712 |
...
| key30-8 | key30-8    | 0.0   | 30         | 20210712 |
| key30-8 | key30-9    | 2.0   | 30         | 20210712 |
| key30-8 | key30-7    | 2.0   | 30         | 20210712 |
| key30-8 | key30-6    | 8.0   | 30         | 20210712 |
| key30-8 | key30-10   | 8.0   | 30         | 20210712 |
| key5-5 | key5-5     | 0.0   | 5          | 20210712 |
| key5-5 | key5-4     | 2.0   | 5          | 20210712 |
| key5-5 | key5-3     | 8.0   | 5          | 20210712 |
| key5-5 | key5-2     | 18.0  | 5          | 20210712 |
| key5-5 | key5-1     | 32.0  | 5          | 20210712 |
+----+------------+-------+------------+----+

JAR command: odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type int8 -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 100;"

Case 3: The input data of this test case is the same as the input data of Case 2. The distance measure type is inner product in Case 3.

Prepare data. The data of this test case is the same as the data of Case 1.

Obtain the test result. The result meets the expectation. The following result is returned:

+----+------------+-------+------------+----+
| pk | knn_result | score | category   | pt |
+----+------------+-------+------------+----+
| key10-10 | key10-1    | 20.0  | 10         | 20210712 |
| key10-10 | key10-2    | 40.0  | 10         | 20210712 |
| key10-10 | key10-3    | 60.0  | 10         | 20210712 |
| key10-10 | key10-4    | 80.0  | 10         | 20210712 |
| key10-10 | key10-5    | 100.0 | 10         | 20210712 |
| key10-19 | key10-1    | 38.0  | 10         | 20210712 |
| key10-19 | key10-2    | 76.0  | 10         | 20210712 |
| key10-19 | key10-3    | 114.0 | 10         | 20210712 |
| key10-19 | key10-4    | 152.0 | 10         | 20210712 |
| key10-19 | key10-5    | 190.0 | 10         | 20210712 |
... ...
| key10-17 | key10-1    | 34.0  | 10         | 20210712 |
| key10-17 | key10-2    | 68.0  | 10         | 20210712 |
| key10-17 | key10-3    | 102.0 | 10         | 20210712 |
| key10-17 | key10-4    | 136.0 | 10         | 20210712 |
| key10-17 | key10-5    | 170.0 | 10         | 20210712 |
| key30-8 | key30-16   | 256.0 | 30         | 20210712 |
| key30-8 | key30-17   | 272.0 | 30         | 20210712 |
| key30-8 | key30-18   | 288.0 | 30         | 20210712 |
| key30-8 | key30-19   | 304.0 | 30         | 20210712 |
| key30-8 | key30-20   | 320.0 | 30         | 20210712 |
+----+------------+-------+------------+----+

JAR command: odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type float -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 100 -distance_method InnerProduct;"
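The score values in the Case 3 result are consistent with the plain inner product of the query vector and the doc vector. The following sketch recomputes a few of them. The function name is hypothetical, and the sketch checks only the score values, not how Proxima CE ranks the results.

```python
def inner_product(a, b):
    """Dot product of two equal-length vectors, returned as a float."""
    return float(sum(x * y for x, y in zip(a, b)))


# Query key10-10 has vector (10, 10); docs key10-1 to key10-5 have (i, i).
for i in range(1, 6):
    print(f"key10-{i}", inner_product((10, 10), (i, i)))
# key10-1 20.0
# key10-2 40.0
# key10-3 60.0
# key10-4 80.0
# key10-5 100.0

# Query key30-8 against doc key30-16.
print(inner_product((8, 8), (16, 16)))  # 256.0
```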

Case 4: Input data: The numbers of documents in categories 1, 5, and 10 in the doc table are 1, 5, and 10. The numbers of documents in categories 1, 5, and 10 in the query table are 1, 5, and 10. The data type is BINARY and the number of dimensions is 4. The distance measure type is Hamming distance. Data in the doc table is consistent with data in the query table.

Prepare data.

key1-1,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,1
key5-1,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0,5
key5-2,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0,5
key5-3,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,5
key5-4,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,5
key5-5,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,5
key10-1,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0,10
key10-2,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~0,10
key10-3,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-4,0~0~0~0~0~0~0~0~0~0~0~0~0~0~0~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-5,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-6,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-7,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-8,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-9,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10
key10-10,1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1~1,10

Obtain the test result. The result meets the expectation. The following result is returned:

+----+------------+-------+------------+----+
| pk | knn_result | score | category   | pt |
+----+------------+-------+------------+----+
| key10-9 | key10-8    | 0.0   | 10         | 20210712 |
| key10-9 | key10-9    | 0.0   | 10         | 20210712 |
| key10-9 | key10-10   | 0.0   | 10         | 20210712 |
| key10-9 | key10-5    | 0.0   | 10         | 20210712 |
| key10-9 | key10-7    | 0.0   | 10         | 20210712 |
| key5-4 | key5-5     | 0.0   | 5          | 20210712 |
| key5-4 | key5-4     | 0.0   | 5          | 20210712 |
| key5-4 | key5-3     | 15.0  | 5          | 20210712 |
| key5-4 | key5-1     | 32.0  | 5          | 20210712 |
| key5-4 | key5-2     | 32.0  | 5          | 20210712 |
...
| key10-3 | key10-7    | 15.0  | 10         | 20210712 |
| key1-1 | key1-1     | 0.0   | 1          | 20210712 |
| key10-4 | key10-4    | 0.0   | 10         | 20210712 |
| key10-4 | key10-3    | 0.0   | 10         | 20210712 |
| key10-4 | key10-7    | 15.0  | 10         | 20210712 |
| key10-4 | key10-6    | 15.0  | 10         | 20210712 |
| key10-4 | key10-5    | 15.0  | 10         | 20210712 |
+----+------------+-------+------------+----+

JAR command: odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type binary -dimension 32 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 100 -distance_method hamming;"
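The Case 4 scores can be reproduced by counting the positions at which two bit vectors differ. The sketch below is illustrative only. It assumes that each `~`-separated value is one bit, and it builds the vectors from the patterns in the prepared data: all ones for key5-4, all zeros for key5-1, and 15 zeros followed by 17 ones for key5-3.

```python
def parse_bits(field):
    """Turn a '0~1~1~...' field into a list of integer bits."""
    return [int(bit) for bit in field.split("~")]


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return float(sum(x != y for x, y in zip(a, b)))


key5_1 = parse_bits("~".join(["0"] * 32))               # all zeros
key5_3 = parse_bits("~".join(["0"] * 15 + ["1"] * 17))  # 15 zeros, 17 ones
key5_4 = parse_bits("~".join(["1"] * 32))               # all ones

print(hamming(key5_4, key5_4))  # 0.0
print(hamming(key5_4, key5_3))  # 15.0
print(hamming(key5_4, key5_1))  # 32.0
```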

Case 5: All categories are large categories. The category threshold is 15. Data in the query table of this test case is the same as data in the query table of Case 1. Categories 1, 5, and 10 are removed from the doc table, and only categories 20 and 30 are retained. The expected result of this test case is the same as the expected result of Case 1.

Prepare data.

Data in the query table of this test case is the same as data in the query table of Case 1. The doc table contains the following data:

 key20-1,1~1,20
 key20-2,2~2,20
 ... ...
 key20-19,19~19,20
 key20-20,20~20,20
 key30-1,1~1,30
 key30-2,2~2,30
 ... ...
 key30-28,28~28,30
 key30-29,29~29,30
 key30-30,30~30,30

Obtain the test result. The result meets the expectation. The following result is returned:

+----+------------+-------+------------+----+
| pk | knn_result | score | category   | pt |
+----+------------+-------+------------+----+
| key30-1 | key30-1    | 0.0   | 30         | 20210712 |
| key30-1 | key30-2    | 2.0   | 30         | 20210712 |
| key30-1 | key30-3    | 8.0   | 30         | 20210712 |
| key30-1 | key30-4    | 18.0  | 30         | 20210712 |
| key30-1 | key30-5    | 32.0  | 30         | 20210712 |
| key30-2 | key30-2    | 0.0   | 30         | 20210712 |
| key30-2 | key30-3    | 2.0   | 30         | 20210712 |
| key30-2 | key30-1    | 2.0   | 30         | 20210712 |
| key30-2 | key30-4    | 8.0   | 30         | 20210712 |
| key30-2 | key30-5    | 18.0  | 30         | 20210712 |
... ...
| key20-1 | key20-1    | 0.0   | 20         | 20210712 |
| key20-1 | key20-2    | 2.0   | 20         | 20210712 |
| key20-1 | key20-3    | 8.0   | 20         | 20210712 |
| key20-1 | key20-4    | 18.0  | 20         | 20210712 |
| key20-1 | key20-5    | 32.0  | 20         | 20210712 |
... ...
| key20-5 | key20-5    | 0.0   | 20         | 20210712 |
| key20-5 | key20-6    | 2.0   | 20         | 20210712 |
| key20-5 | key20-4    | 2.0   | 20         | 20210712 |
| key20-5 | key20-7    | 8.0   | 20         | 20210712 |
| key20-5 | key20-3    | 8.0   | 20         | 20210712 |
+----+------------+-------+------------+----+

JAR command: odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type int8 -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 15;"

Case 6: This test case checks multi-label queries, in which a row in the query table can belong to multiple categories. Data in the doc table of this test case is the same as data in the doc table of Case 1.

Prepare data.

Data in the doc table of this test case is the same as data in the doc table of Case 1. The query table contains the following data:
```
key1-1;1~1;1,5,10
key1-2;2~2;1,5,10
key1-3;3~3;1
key1-4;4~4;1
key1-5;5~5;1
```

Obtain the test result. The result meets the expectation. The following result is returned:

+----+------------+-------+------------+----+
| pk | knn_result | score | category   | pt |
+----+------------+-------+------------+----+
| key1-1 | key1-1     | 0.0   | 1          | 20210712 |
| key1-1 | key5-1     | 0.0   | 5          | 20210712 |
| key1-1 | key10-1    | 0.0   | 10         | 20210712 |
| key1-1 | key10-2    | 2.0   | 10         | 20210712 |
| key1-1 | key5-2     | 2.0   | 5          | 20210712 |
| key1-2 | key5-2     | 0.0   | 5          | 20210712 |
| key1-2 | key10-2    | 0.0   | 10         | 20210712 |
| key1-2 | key5-3     | 2.0   | 5          | 20210712 |
| key1-2 | key5-1     | 2.0   | 5          | 20210712 |
| key1-2 | key10-1    | 2.0   | 10         | 20210712 |
| key1-3 | key1-1     | 8.0   | 1          | 20210712 |
| key1-4 | key1-1     | 18.0  | 1          | 20210712 |
| key1-5 | key1-1     | 32.0  | 1          | 20210712 |
+----+------------+-------+------------+----+

JAR command: odpscmd -e "jar -resources proxima-ce-xl-222.jar -classpath ./proxima-ce-xl-222.jar com.alibaba.proxima2.ce.ProximaCERunner -doc_table cat_doc_table -doc_table_partition 20210712 -query_table cat_query_table -query_table_partition 20210712 -output_table cat_result_table -output_table_partition 20210712 -data_type int8 -dimension 2 -app_id 201220 -category_row_num 2 -category_col_num 3 -topk 5 -category_thread_num 3 -category_threshold 15 -query_multi_label true;"
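The Case 6 result is consistent with a search that pools the documents of every category listed on the query row and then keeps the top K across the pool. The sketch below reproduces that behavior by brute force. It is an illustration under assumptions, not the Proxima CE algorithm: the function names are hypothetical, and the score is assumed to be the squared Euclidean distance, as inferred from the result tables.

```python
def parse_query(row):
    """Parse 'key;v1~v2;c1,c2,...' into (key, vector, category list)."""
    key, vec, cats = row.split(";")
    vector = tuple(int(v) for v in vec.split("~"))
    return key, vector, [int(c) for c in cats.split(",")]


def search(vector, categories, docs, k):
    """Brute-force top K over all docs whose category is in categories."""
    hits = [
        (key, float(sum((q - d) ** 2 for q, d in zip(vector, vec))), cat)
        for key, vec, cat in docs
        if cat in categories
    ]
    return sorted(hits, key=lambda hit: hit[1])[:k]


# Doc table of Case 1: (key, vector, category).
doc_counts = {1: 1, 5: 5, 10: 10, 20: 20, 30: 30}
docs = [(f"key{c}-{i}", (i, i), c)
        for c, n in doc_counts.items() for i in range(1, n + 1)]

key, vector, cats = parse_query("key1-1;1~1;1,5,10")
hits = search(vector, cats, docs, k=5)
print([score for _, score, _ in hits])  # [0.0, 0.0, 0.0, 2.0, 2.0]

# A single-label query in category 1 can match only key1-1.
_, vector3, cats3 = parse_query("key1-3;3~3;1")
print(search(vector3, cats3, docs, k=5))  # [('key1-1', 8.0, 1)]
```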