Before you use Proxima CE, you must install the Proxima CE package. This topic describes how to prepare the environment, obtain and upload the installation package, and prepare input data for Proxima CE.
Prepare the environment
Create a MaxCompute project. For more information, see Create a MaxCompute project.
Create a DataWorks workspace and associate the created MaxCompute project with the workspace. For more information about how to create a DataWorks workspace, see Create a workspace.
Apply for the external volume feature. After the application is approved, create an external volume.
After you create an index in Proxima CE, you must write the index to an external volume in the MaxCompute project. For more information about how to apply for the external volume feature, see Apply for trial use of new features. For more information about how to create an external volume, see External volume operations.
Note: You may use an external volume in subsequent Proxima CE operations, so we recommend that you create one in advance. If you do not create an external volume, you must specify the AccessKey pair of your Alibaba Cloud account or the Alibaba Cloud Resource Name (ARN) of a RAM role as a mandatory startup parameter. Both methods pose security risks, and we recommend that you do not use them.
Obtain the Proxima CE installation package
Click Proxima CE to download the installation package.
The Proxima CE installation package contains the executable JAR file of Proxima CE. After you upload the JAR file to the MaxCompute project as a resource, you can call the JAR file to run Proxima CE tasks.
Upload the JAR file as a resource
You can use the MaxCompute client (odpscmd) or DataWorks to upload the JAR file to a MaxCompute project as a resource. In this example, DataWorks is used to upload and publish resources. For more information about how to upload a JAR file as a MaxCompute resource by using odpscmd, see Resource operations.
On the DataStudio page in the DataWorks console, upload the JAR file as a resource.
Submit the resource.
Publish the resource.
For more information about how to upload resources by using DataWorks, see "Step 1: Create a resource or upload an existing resource" in the Create and use MaxCompute resources topic.
Prepare input tables
Before you run a Proxima CE task, you must prepare the following input tables:
Doc table: the base table.
Query table: the query table.
Table creation statements
-- Create a doc table.
create table doc_table_float_smoke(pk string,vector string) partitioned by (pt string);
-- Create a query table.
create table query_table_float_smoke(pk string,vector string) partitioned by (pt string);
Input table format
Table names
The names of input tables for a task cannot contain the tmp_ string. Otherwise, the task fails to run.
The names of input tables and the names of partitions in input tables must be 1 to 64 characters in length. Otherwise, the task fails to run.
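The naming rules above can be checked before a task is submitted. The following is a minimal sketch, not part of Proxima CE: the function names check_length and check_input_table_name are hypothetical, and the sketch assumes that "characters" means Python string length.

```python
from typing import List


def check_length(name: str) -> List[str]:
    """Length rule that applies to both table names and partition names."""
    if 1 <= len(name) <= 64:
        return []
    return ["name must be 1 to 64 characters in length"]


def check_input_table_name(name: str) -> List[str]:
    """Return the rules that an input table name violates (empty list if none)."""
    problems = check_length(name)
    # Hypothetical helper: the 'tmp_' rule is taken from the text above.
    if "tmp_" in name:
        problems.append("table name must not contain the 'tmp_' string")
    return problems


if __name__ == "__main__":
    for candidate in ("doc_table_float_smoke", "tmp_doc_table", "d" * 65):
        print(candidate, check_input_table_name(candidate))
```

An empty list means that the name passes both rules. Use check_length alone for partition names, because the tmp_ rule is stated only for table names.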
Fields
Fixed field | Description | Data type |
pk | The primary key value used for queries. | The default data type is STRING. Primary key values can be strings such as 1.nid,2.nid,3.nid,... or numeric values of the INT64 type such as 123,456,789,.... If all values in the primary key column are of the INT64 type, you can specify the data type of the column as BIGINT. If you also set -pk_type to INT64, query performance is improved. |
vector | The vector. | The default data type is STRING. |
category | The category used in multi-category scenarios. This field is required only for multi-category search. | The default data type is BIGINT. |
pt | The partition. | The default data type is STRING. |
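Whether the pk column qualifies for BIGINT (and therefore for setting -pk_type to INT64) depends on every value being an INT64 integer. The following sketch shows one way to check a sample of primary keys offline. The helper name all_pks_are_int64 is hypothetical, and the sketch assumes INT64 means the signed 64-bit range.

```python
import re

# Signed 64-bit integer bounds (assumption: INT64 in the text means this range).
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

# Plain decimal integers only, with an optional leading minus sign.
_INTEGER = re.compile(r"-?[0-9]+")


def all_pks_are_int64(pk_values):
    """Return True if there is at least one value and every value is an INT64 integer."""
    values = list(pk_values)
    if not values:
        return False
    for value in values:
        if not _INTEGER.fullmatch(value):
            return False
        if not INT64_MIN <= int(value) <= INT64_MAX:
            return False
    return True


if __name__ == "__main__":
    print(all_pks_are_int64(["123", "456", "789"]))
    print(all_pks_are_int64(["1.nid", "2.nid", "3.nid"]))
```

If the check fails for any value, keep the default STRING data type for the pk column.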
Sample input tables
Doc table
pk | vector | pt |
id1 | 0~1~1~5 | 20190322 |
id2 | 0~1~1~2 | 20190322 |
id3 | 3~2~1~1 | 20190322 |
... | ... | ... |
Query table
pk | vector | pt |
id8 | 0~1~1~5 | 20190322 |
id9 | 0~1~1~2 | 20190322 |
id10 | 3~2~1~1 | 20190322 |
... | ... | ... |
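The sample rows store each vector as a string whose elements are separated by tildes (~). The following sketch parses such strings and confirms that all rows share one dimension. It assumes the ~ separator seen in the samples and assumes that all vectors in a task must have the same dimension. The function names are hypothetical and not part of Proxima CE.

```python
def parse_vector(text, separator="~"):
    """Split a vector string such as '0~1~1~5' into a list of floats."""
    return [float(part) for part in text.split(separator)]


def shared_dimension(rows):
    """rows: iterable of (pk, vector_string) pairs.

    Returns the dimension shared by all vectors, or raises ValueError
    if any row has a different dimension or no rows are given.
    """
    dimension = None
    for pk, vector_text in rows:
        current = len(parse_vector(vector_text))
        if dimension is None:
            dimension = current
        elif current != dimension:
            raise ValueError(f"row {pk} has dimension {current}, expected {dimension}")
    if dimension is None:
        raise ValueError("no rows given")
    return dimension


if __name__ == "__main__":
    doc_rows = [("id1", "0~1~1~5"), ("id2", "0~1~1~2"), ("id3", "3~2~1~1")]
    print(parse_vector(doc_rows[0][1]))
    print(shared_dimension(doc_rows))
```

Running the same check on both the doc table and the query table catches malformed rows before a search task is started.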
Use the vector search features
Scenario | Key feature | References |
Basic vector search | Supports search for the top K results from millions of data records. | |
Multi-category search | Supports multi-category scenarios, including scenarios where the query table and doc table belong to different categories and scenarios where a single query table belongs to multiple categories. | |
Cluster sharding | Allows you to create indexes based on cluster sharding. This method can help reduce the amount of data to be computed and accelerate index-based queries. | |
Inner product and cosine distance | Supports inner-product search. | |
Converters | Allows you to use converters. In most cases, converters can help improve performance and decrease the index size. The retrieval loss varies with actual situations. | |
After the vector search is performed, the system automatically generates an output table in MaxCompute. You do not need to create another table. You only need to set the -output_table parameter in the Proxima CE code to the name of the output table. For more information about the format of the generated output table, see the Output table format section in this topic.
Output table format
After the vector search is performed, the system automatically generates an output table in MaxCompute. This section describes the format of the generated output table.
Table name: the name of the output table that you specified in the code of Proxima CE.
The name of the output table cannot contain periods (.), which are considered special characters in MaxCompute. If the name contains periods (.), the MaxCompute table fails to be parsed.
The name of the output table generated for a task cannot contain the tmp_ string. Otherwise, the task fails to run.
The names of output tables and the names of partitions in output tables must be 1 to 64 characters in length. Otherwise, the task fails to run.
Fields
Fixed field | Description | Data type |
pk | The primary key value that corresponds to each query in a query table. | The default data type is STRING. Primary key values can be strings such as 1.nid,2.nid,3.nid,... or numeric values of the INT64 type such as 123,456,789,.... If all values in the primary key column are of the INT64 type, you can specify the data type of the column as BIGINT. If you also set -pk_type to INT64, query performance is improved. |
knn_result | The primary key value of the doc table that is retrieved for queries. | The default data type is STRING. |
score | The similarity score of the retrieved documents. | The default data type is STRING. Proxima CE lists the output results in descending order of similarity. |
category | The category used in multi-category scenarios. This field is required only for multi-category search. | The default data type is BIGINT. |
pt | The partition. | The default data type is STRING. |
Note: For distances calculated by using inner products or the MIPS squared euclidean algorithm in the Proxima 2 kernel, a longer distance indicates higher similarity. For distances calculated by using other algorithms, a shorter distance indicates higher similarity. Proxima CE lists the output results in a uniform way, in descending order of similarity:
For distances calculated by using inner products or the MIPS squared euclidean algorithm, Proxima CE lists the output results based on the scores in descending order.
For distances calculated by using other algorithms, Proxima CE lists the output results based on the scores in ascending order.
The method is the same as the method used in the Proxima 2 kernel.
Sample output table
pk | knn_result | score | pt |
id8 | id1 | 0.1 | 20190322 |
id8 | id2 | 0.2 | 20190322 |
id9 | id1 | 0.1 | 20190322 |
id9 | id3 | 0.3 | 20190322 |
... | ... | ... | ... |
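The ordering rule for the score column can be reproduced when output rows are processed outside MaxCompute. The following sketch groups rows by query pk and sorts each group so that the most similar result comes first. The distance method labels and the function name rank_results are hypothetical and are not Proxima CE parameter values.

```python
from collections import defaultdict

# Hypothetical labels for the methods where a higher score means higher similarity.
HIGHER_SCORE_IS_MORE_SIMILAR = {"inner_product", "mips_squared_euclidean"}


def rank_results(rows, distance_method):
    """rows: iterable of (pk, knn_result, score) tuples, with score as a string.

    Returns {pk: [(knn_result, score_as_float), ...]} with the most similar
    result first: descending score for inner product and MIPS squared
    euclidean, ascending score for every other method.
    """
    descending = distance_method in HIGHER_SCORE_IS_MORE_SIMILAR
    grouped = defaultdict(list)
    for pk, knn_result, score in rows:
        # The score column is STRING by default, so convert before sorting.
        grouped[pk].append((knn_result, float(score)))
    return {
        pk: sorted(hits, key=lambda hit: hit[1], reverse=descending)
        for pk, hits in grouped.items()
    }


if __name__ == "__main__":
    sample = [
        ("id8", "id2", "0.2"),
        ("id8", "id1", "0.1"),
        ("id9", "id3", "0.3"),
        ("id9", "id1", "0.1"),
    ]
    print(rank_results(sample, "squared_euclidean"))
```

With a method other than inner product or MIPS squared euclidean, the sketch orders the sample rows by ascending score, which matches the sample output table above.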