Before you use Proxima CE, you must install the Proxima CE package. This topic describes how to prepare the environment, obtain and upload the installation package, and prepare input data for Proxima CE.
Prerequisites
You must complete Preparations.
Get the Proxima CE installation package
Click Proxima CE package to download the installation package.
The Proxima CE installation package contains the executable JAR file of Proxima CE. After you upload the JAR file to the MaxCompute project as a resource, you can call the JAR file to run Proxima CE tasks.
Upload the installation package as a MaxCompute resource
You can use the MaxCompute client (odpscmd) or DataWorks to upload the JAR file to a MaxCompute project as a resource. In this example, DataWorks is used to upload and publish resources. For more information about how to upload resources by using odpscmd, see Add resources.
On the Data Development page of DataWorks, upload the installation package as a JAR resource by using the visual interface.
Note: Take note of the following items:
If you create or upload a resource that has never been uploaded to MaxCompute, you must select Upload to MaxCompute. If the resource has already been uploaded to MaxCompute, clear Upload to MaxCompute. Otherwise, an error is reported when you upload the resource.
If you select Upload to MaxCompute when you create or upload a resource, the resource is stored in both DataWorks and MaxCompute. If you later run a command to delete the resource from MaxCompute, the copy stored in DataWorks still exists and is still displayed.
The resource name can be different from the name of the uploaded file.

Commit and deploy the resource.
After you create a resource, click the commit icon in the top toolbar on the configuration tab of the resource to commit the resource to the development environment.
Note: If nodes in the production environment need to use the resource, you must also deploy the resource to the production environment. For more information, see Deploy nodes.
Prepare input tables
Before you run a Proxima CE task, you must prepare the following input tables:
doc table: the base table.
query table: the query table.
Table creation commands
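-- Create a doc table. For multi-category search, also add the category column: (pk STRING, vector STRING, category BIGINT).
CREATE TABLE doc_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
-- Create a query table. For multi-category search, also add the category column: (pk STRING, vector STRING, category BIGINT).
CREATE TABLE query_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);
Format requirements for input tables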
Table name
The name of an input table cannot contain the tmp_ string. Otherwise, the task fails to run.
The names of input tables and the partition values in input tables must be 1 to 64 characters in length. Otherwise, the task fails to run.
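The naming rules above can be checked before a task is submitted. The following Python sketch is illustrative only and is not part of Proxima CE. The function names are hypothetical, and the sketch assumes that the tmp_ rule is a literal, case-sensitive substring match, which this topic does not specify.

```python
MAX_NAME_LENGTH = 64          # upper bound stated in this topic
FORBIDDEN_SUBSTRING = "tmp_"  # assumption: matched literally and case-sensitively


def table_name_problems(name):
    """Return the rule violations for an input table name (hypothetical helper)."""
    problems = []
    if not 1 <= len(name) <= MAX_NAME_LENGTH:
        problems.append("length must be 1 to 64 characters")
    if FORBIDDEN_SUBSTRING in name:
        problems.append("must not contain 'tmp_'")
    return problems


def partition_value_problems(value):
    """Return the rule violations for a partition value (hypothetical helper)."""
    if 1 <= len(value) <= MAX_NAME_LENGTH:
        return []
    return ["length must be 1 to 64 characters"]


if __name__ == "__main__":
    for candidate in ("doc_table_float_smoke", "tmp_doc_table"):
        print(candidate, table_name_problems(candidate))
```

An empty list means that the name passes both checks. Running such a check up front surfaces a bad name before the task is submitted instead of after it fails.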
Fields
Note: Input tables must contain the following fixed fields, and the field names must be exactly the same.
Fixed field | Description | Data type |
pk | The primary key value field used in queries. | The default data type is STRING. The values can be numbers or strings, such as strings in the 1.nid,2.nid,3.nid,... format or INT64 values in the 123,456,789,... format. If all values are INT64 values, you can specify the BIGINT data type for the column. If you also set the -pk_type startup parameter to INT64, the performance is improved. |
vector | The vector field. | The default data type is STRING. |
category | The category used in multi-category scenarios. This field is required only for multi-category search. | The default data type is BIGINT. |
pt | The partition field. | The default data type is STRING. |
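Whether BIGINT is an option for the pk column depends on every value being an INT64 value. A minimal Python sketch of that check follows. The helper names are hypothetical, and the sketch assumes that pk values are available as plain decimal strings.

```python
import re

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1
_DECIMAL_INTEGER = re.compile(r"-?[0-9]+")


def is_int64(value):
    """True if the string is a plain decimal integer within the INT64 range."""
    if _DECIMAL_INTEGER.fullmatch(value) is None:
        return False
    return INT64_MIN <= int(value) <= INT64_MAX


def suggest_pk_type(pk_values):
    """Suggest BIGINT only when every pk value is an INT64 value, else STRING."""
    pk_values = list(pk_values)
    if pk_values and all(is_int64(v) for v in pk_values):
        return "BIGINT"
    return "STRING"


if __name__ == "__main__":
    print(suggest_pk_type(["123", "456", "789"]))
    print(suggest_pk_type(["1.nid", "2.nid", "3.nid"]))
```

A single non-numeric or out-of-range value is enough to fall back to STRING, which matches the requirement that all values be INT64 values.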
Examples of input tables
doc table
pk | vector | pt |
id1 | 0~1~1~5 | 20190322 |
id2 | 0~1~1~2 | 20190322 |
id3 | 3~2~1~1 | 20190322 |
... | ... | ... |
query table
pk | vector | pt |
id8 | 0~1~1~5 | 20190322 |
id9 | 0~1~1~2 | 20190322 |
id10 | 3~2~1~1 | 20190322 |
... | ... | ... |
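In the examples above, each vector is stored as a single STRING value whose elements are separated by tildes (~). The following Python sketch converts between a list of numbers and that string form. It assumes that ~ is the separator only because the examples show it. The helper names are hypothetical, and how numbers other than the integers shown in the examples should be formatted is an assumption.

```python
VECTOR_SEPARATOR = "~"  # inferred from the example rows in this topic


def encode_vector(values):
    """Join vector elements into the string form used in the vector column."""
    return VECTOR_SEPARATOR.join(str(v) for v in values)


def decode_vector(text):
    """Split a vector string back into a list of floats."""
    return [float(part) for part in text.split(VECTOR_SEPARATOR)]


def make_row(pk, values, pt):
    """Build a (pk, vector, pt) tuple in the column order of the input tables."""
    return (pk, encode_vector(values), pt)


if __name__ == "__main__":
    print(make_row("id1", [0, 1, 1, 5], "20190322"))
    print(decode_vector("3~2~1~1"))
```

Rows built this way still have to be loaded into the doc or query table by whatever upload method your project uses. That step is outside the scope of this sketch.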
Next step: use the vector search feature
Scenario | Key feature | References |
Basic vector search | Supports search for the top K results from millions of data records. | |
Multi-category search | Supports multi-category scenarios, including scenarios where the query table and doc table belong to different categories and scenarios where a single query table belongs to multiple categories. | |
Cluster sharding | Allows you to create indexes based on cluster sharding. This method can help reduce the amount of data to be computed and accelerate index-based queries. | |
Inner product and cosine distance | Supports inner-product search. | |
Converters | Allows you to use converters. In most cases, converters help improve performance and reduce the index size. The retrieval loss depends on the actual scenario. | |
After you use the vector search feature, an output table is automatically generated and stored in MaxCompute. You do not need to create the output table. You only need to specify the table name after the -output_table parameter in the Proxima CE code. For more information about the format of the output table, see Format of the output table in this topic.
Format of the output table
After you run a vector search task, an output table is automatically generated and stored in MaxCompute. The following table describes the format of the output table.
Table name: the name of the output table that you specified in the code of Proxima CE.
The name of the output table cannot contain the period (.) character because it is a special character in MaxCompute and causes MaxCompute to fail to parse the table name.
The name of the output table cannot contain the tmp_ string. Otherwise, the task fails to run.
The names of output tables and the names of partitions in output tables must be 1 to 64 characters in length. Otherwise, the task fails to run.
Fields
Fixed field | Description | Data type |
pk | The primary key value of each query in the query table. | The default data type is STRING. The values in the pk column can be numbers or strings, such as strings in the 1.nid,2.nid,3.nid,... format or INT64 values in the 123,456,789,... format. If all values in the pk column are INT64 values, you can specify the BIGINT data type for the column. If you also set the -pk_type startup parameter to INT64, the performance is improved. |
knn_result | The primary key value of the doc table that is retrieved for queries. | The default data type is STRING. |
score | The similarity score of the retrieved documents. | The default data type is STRING. In Proxima CE, the results are sorted in descending order of similarity scores. |
category | The category used in multi-category scenarios. This field is required only for multi-category search. | The default data type is BIGINT. |
pt | The partition field. | The default data type is STRING. |
Note: For the inner_product and mips_squared_euclidean distance algorithms in the Proxima2 kernel, a larger distance indicates a higher similarity. For other distance algorithms, a smaller distance indicates a higher similarity. Proxima CE provides unified processing and sorts the results from most similar to least similar:
For the inner_product and mips_squared_euclidean distance algorithms, the results are sorted in descending order of score values.
For other distance algorithms, the results are sorted in ascending order of score values, which is consistent with the Proxima2 kernel.
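The ordering rule for score values can be sketched in Python as follows. This is an illustration of the rule as described in this topic, not Proxima CE code. The function name is hypothetical, other_distance stands in for any distance algorithm other than the two named ones, and scores are converted from strings because STRING is the default type of the score column.

```python
# Distance algorithms for which a larger score means a higher similarity.
LARGER_IS_MORE_SIMILAR = {"inner_product", "mips_squared_euclidean"}


def rank_results(rows, distance_method):
    """Order (knn_result, score) pairs from most similar to least similar."""
    descending = distance_method in LARGER_IS_MORE_SIMILAR
    return sorted(rows, key=lambda row: float(row[1]), reverse=descending)


if __name__ == "__main__":
    rows = [("id1", "0.1"), ("id2", "0.2"), ("id3", "0.3")]
    print(rank_results(rows, "inner_product"))   # largest score first
    print(rank_results(rows, "other_distance"))  # smallest score first
```

In both cases the first row is the most similar document. Only the direction of the numeric sort differs.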
Example of the output table
pk | knn_result | score | pt |
id8 | id1 | 0.1 | 20190322 |
id8 | id2 | 0.2 | 20190322 |
id9 | id1 | 0.1 | 20190322 |
id9 | id3 | 0.3 | 20190322 |
... | ... | ... | ... |
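Each row of the output table pairs one query pk with one retrieved document, so a query with several results spans several rows. The following Python sketch collects such rows into one result list per query. It assumes that the rows have already been read from the output table as (pk, knn_result, score, pt) tuples. How the rows are read is not covered here, and the function name is hypothetical.

```python
def group_by_query(rows):
    """Map each query pk to its (knn_result, score) pairs, keeping row order."""
    grouped = {}
    for pk, knn_result, score, _pt in rows:
        grouped.setdefault(pk, []).append((knn_result, float(score)))
    return grouped


if __name__ == "__main__":
    example_rows = [
        ("id8", "id1", "0.1", "20190322"),
        ("id8", "id2", "0.2", "20190322"),
        ("id9", "id1", "0.1", "20190322"),
        ("id9", "id3", "0.3", "20190322"),
    ]
    for query_pk, results in group_by_query(example_rows).items():
        print(query_pk, results)
```

Because the sketch keeps the order in which rows arrive, any ordering applied when the rows are read is carried over to each list.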