MaxCompute: Install the Proxima CE package

Last Updated: Apr 07, 2025

Before you use Proxima CE, you must install the Proxima CE package. This topic describes how to prepare the environment, obtain and upload the installation package, and prepare input data for Proxima CE.

Prerequisites

You must complete Preparations.

Get the Proxima CE installation package

Click Proxima CE package to download the installation package.

The Proxima CE installation package contains the executable JAR file of Proxima CE. After you upload the JAR file to the MaxCompute project as a resource, you can call the JAR file to run Proxima CE tasks.

Upload the installation package as a MaxCompute resource

You can use the MaxCompute client (odpscmd) or DataWorks to upload the JAR file to a MaxCompute project as a resource. In this example, DataWorks is used to upload and publish resources. For more information about how to upload resources by using odpscmd, see Add resources.

  1. On the Data Development page of DataWorks, upload the installation package as a JAR resource by using the visual interface.

    Note

    Take note of the following items:

    • If you create or upload a resource that has never been uploaded to MaxCompute, you must select Upload to MaxCompute. If the resource has already been uploaded to MaxCompute, clear Upload to MaxCompute. Otherwise, an error is reported when you upload the resource.

    • If you select Upload to MaxCompute when you create or upload a resource, the resource is stored in both DataWorks and MaxCompute. If you later run a command to delete the resource from MaxCompute, the copy stored in DataWorks still exists and is still displayed.

    • The resource name can be different from the name of the uploaded file.

  2. Commit and deploy the resource.

    After you create a resource, you can click the Commit icon in the top toolbar on the configuration tab of the resource to commit the resource to the development environment.

    Note

    If nodes in the production environment need to use the resource, you must also deploy the resource to the production environment. For more information, see Deploy nodes.

Prepare input tables

Before you run a Proxima CE task, you must prepare the following input tables:

  • doc table: the base table.

  • query table: the query table.

Table creation commands

-- Create a doc table. For multi-category search, also add a category BIGINT column after vector.
CREATE TABLE doc_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);

-- Create a query table. For multi-category search, also add a category BIGINT column after vector.
CREATE TABLE query_table_float_smoke(pk STRING, vector STRING) PARTITIONED BY (pt STRING);

Format requirements for input tables

  • Table name

    • The name of an input table cannot contain the tmp_ string. Otherwise, the task fails to run.

    • The names of input tables and partition values in input tables must be 1 to 64 characters in length. Otherwise, the task fails to run.

  • Fields

    Note

    Input tables must contain the following fixed fields, and the field names must be exactly the same.

    | Fixed field | Description | Data type |
    | --- | --- | --- |
    | pk | The primary key value field used in queries. | The default data type is STRING. The values can be numbers or strings, such as strings in the 1.nid,2.nid,3.nid,... format or INT64 values in the 123,456,789,... format. If all values are INT64 values, you can specify the BIGINT data type for the column. If you also specify the -pk_type startup parameter as INT64, the performance is improved. |
    | vector | The vector field. | The default data type is STRING. |
    | category | The category used in multi-category scenarios. This field is required only for multi-category search. | The default data type is BIGINT. |
    | pt | The partition field. | The default data type is STRING. |
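
The naming rules for input tables can be checked on the client side before a task is submitted. The following Python sketch is illustrative only: check_table_name and check_partition_value are hypothetical helper names, not part of Proxima CE, and the sketch assumes that the 64-character limit counts characters the same way Python's len does.

```python
from typing import List


def check_table_name(name: str) -> List[str]:
    """Return the rule violations for an input table name (empty list = OK)."""
    problems = []
    # Rule from this topic: the name must be 1 to 64 characters in length.
    if not 1 <= len(name) <= 64:
        problems.append("table name must be 1 to 64 characters long")
    # Rule from this topic: the name cannot contain the tmp_ string.
    if "tmp_" in name:
        problems.append("table name must not contain 'tmp_'")
    return problems


def check_partition_value(value: str) -> List[str]:
    """Return the rule violations for a partition value (empty list = OK)."""
    problems = []
    # Rule from this topic: partition values must be 1 to 64 characters long.
    if not 1 <= len(value) <= 64:
        problems.append("partition value must be 1 to 64 characters long")
    return problems
```

An empty list means that the name passes both checks. For example, check_table_name("doc_table_float_smoke") returns an empty list, whereas a name such as "tmp_doc" yields one violation.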

Examples of input tables

  • doc table

    | pk | vector | pt |
    | --- | --- | --- |
    | id1 | 0~1~1~5 | 20190322 |
    | id2 | 0~1~1~2 | 20190322 |
    | id3 | 3~2~1~1 | 20190322 |
    | ... | ... | ... |

  • query table

    | pk | vector | pt |
    | --- | --- | --- |
    | id8 | 0~1~1~5 | 20190322 |
    | id9 | 0~1~1~2 | 20190322 |
    | id10 | 3~2~1~1 | 20190322 |
    | ... | ... | ... |
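
In the sample rows, each vector is stored as a single STRING value whose components are joined by a tilde (~). The following Python sketch shows one way to produce and parse such values when you prepare input data. It is an assumption-based illustration: the separator is inferred from the sample rows only, and encode_vector and decode_vector are hypothetical helper names, not Proxima CE functions.

```python
from typing import List, Sequence

# Assumption: the separator is inferred from sample values such as "0~1~1~5".
SEPARATOR = "~"


def encode_vector(values: Sequence[float]) -> str:
    """Join the vector components into one string, as in the sample rows."""
    return SEPARATOR.join(str(v) for v in values)


def decode_vector(text: str) -> List[float]:
    """Split a stored vector string back into a list of floats."""
    return [float(part) for part in text.split(SEPARATOR)]
```

With these helpers, encode_vector([0, 1, 1, 5]) produces the string shown for id1 in the doc table, and decode_vector reverses the conversion.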

Next step: use the vector search feature

| Scenario | Key feature | References |
| --- | --- | --- |
| Basic vector search | Supports search for the top K results from millions of data records. | Basic vector search |
| Multi-category search | Supports multi-category scenarios, including scenarios where the query table and doc table belong to different categories and scenarios where a single query table belongs to multiple categories. | Multi-category search |
| Cluster sharding | Allows you to create indexes based on cluster sharding. This method can help reduce the amount of data to be computed and accelerate index-based queries. | Cluster sharding |
| Inner product and cosine distance | Supports inner-product search. | Inner product and cosine distance |
| Converters | Allows you to use converters. In most cases, converters can help improve performance and decrease the index size. The retrieval loss varies with actual situations. | Converters |

After you use the vector search feature, an output table is automatically generated and stored in MaxCompute. You do not need to create the output table. You only need to specify the table name after the -output_table parameter in the Proxima CE code. For more information about the format of the output table, see Format of the output table in this topic.

Format of the output table

After you run a vector search task, an output table is automatically generated and stored in MaxCompute. The following table describes the format of the output table.

  • Table name: the name of the output table that you specified in the code of Proxima CE.

    • The name of the output table cannot contain the period (.) character because it is a special character in MaxCompute and causes MaxCompute to fail to parse the table.

    • The name of the output table cannot contain the tmp_ string. Otherwise, the task fails to run.

    • The names of output tables and names of partitions in output tables must be 1 to 64 characters in length. Otherwise, the task fails to run.

  • Fields

    | Fixed field | Description | Data type |
    | --- | --- | --- |
    | pk | The primary key value of each query in the query table. | The default data type is STRING. The values in the pk column can be numbers or strings, such as strings in the 1.nid,2.nid,3.nid,... format or INT64 values in the 123,456,789,... format. If all values in the pk column are INT64 values, you can specify the BIGINT data type for the column. If you also specify the -pk_type startup parameter as INT64, the performance is improved. |
    | knn_result | The primary key value of the doc table that is retrieved for queries. | The default data type is STRING. |
    | score | The similarity score of the retrieved documents. | The default data type is STRING. In Proxima CE, the results are sorted in descending order of similarity scores. See the note below this table. |
    | category | The category used in multi-category scenarios. This field is required only for multi-category search. | The default data type is BIGINT. |
    | pt | The partition field. | The default data type is STRING. |

    Note

    For the inner_product and mips_squared_euclidean distance algorithms in the Proxima2 kernel, a larger distance indicates a higher similarity. For other distance algorithms, a smaller distance indicates a higher similarity. Proxima CE handles both cases uniformly and sorts the results in descending order of similarity:

    • For the inner_product and mips_squared_euclidean distance algorithms, the results are sorted in descending order of score values.

    • For other distance algorithms, the results are sorted in ascending order of score values, which is consistent with the Proxima2 kernel.
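
The ordering rule for the score column can be sketched as follows. This Python example only illustrates the rule described in this topic and is not the Proxima CE implementation: sort_by_similarity is a hypothetical helper, the scores are assumed to have been converted from strings to floats already, and any algorithm name outside the two listed ones is treated as "smaller is more similar".

```python
from typing import List, Tuple

# Algorithms for which a larger score means a higher similarity (from this topic).
LARGER_IS_MORE_SIMILAR = {"inner_product", "mips_squared_euclidean"}


def sort_by_similarity(
    rows: List[Tuple[str, float]], distance_method: str
) -> List[Tuple[str, float]]:
    """Order (knn_result, score) pairs from most similar to least similar."""
    # Descending score order for the two listed algorithms, ascending otherwise.
    descending = distance_method in LARGER_IS_MORE_SIMILAR
    return sorted(rows, key=lambda row: row[1], reverse=descending)
```

In both cases the first element of the returned list is the most similar document, which matches the unified "descending similarity" order described above.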

Example of the output table

| pk | knn_result | score | pt |
| --- | --- | --- | --- |
| id8 | id1 | 0.1 | 20190322 |
| id8 | id2 | 0.2 | 20190322 |
| id9 | id1 | 0.1 | 20190322 |
| id9 | id3 | 0.3 | 20190322 |
| ... | ... | ... | ... |
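
Each row of the output table holds one retrieved document for one query, so a query with several results spans several rows. The following Python sketch groups such rows by query pk after they have been read from the table. It is a hypothetical post-processing helper, not part of Proxima CE; the sample rows are copied from the example, with scores written as floats for readability.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Rows in (pk, knn_result, score, pt) order, copied from the example output table.
SAMPLE_ROWS = [
    ("id8", "id1", 0.1, "20190322"),
    ("id8", "id2", 0.2, "20190322"),
    ("id9", "id1", 0.1, "20190322"),
    ("id9", "id3", 0.3, "20190322"),
]


def group_by_query(
    rows: List[Tuple[str, str, float, str]]
) -> Dict[str, List[Tuple[str, float]]]:
    """Collect the (knn_result, score) pairs of each query pk, keeping row order."""
    grouped = defaultdict(list)
    for pk, knn_result, score, _pt in rows:
        grouped[pk].append((knn_result, score))
    return dict(grouped)
```

Calling group_by_query(SAMPLE_ROWS) returns one entry for id8 and one for id9, each listing the documents retrieved for that query.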