MaxCompute: Install the Proxima CE package

Last Updated: Jul 11, 2024

Before you use Proxima CE, you must install the Proxima CE package. This topic describes how to prepare the environment, obtain and upload the installation package, and prepare input data for Proxima CE.

Prepare the environment

  • Create a MaxCompute project. For more information, see Create a MaxCompute project.

  • Create a DataWorks workspace and associate the created MaxCompute project with the workspace. For more information about how to create a DataWorks workspace, see Create a workspace.

  • Apply to use the external volume feature. After your application is approved, create an external volume.

    • After you create an index in Proxima CE, you must write the index to an external volume in the MaxCompute project. For more information about how to apply for the external volume feature, see Apply for trial use of new features.

    • For more information about how to create an external volume, see External volume operations.

      Note

      Subsequent Proxima CE operations may use an external volume, so we recommend that you create one in advance. If you do not create an external volume, you must provide either the AccessKey pair of your Alibaba Cloud account or the Alibaba Cloud Resource Name (ARN) of a RAM role as a mandatory startup parameter. Both alternatives pose security risks, and we recommend that you avoid them.

Obtain the Proxima CE installation package

Click Proxima CE to download the installation package.

The Proxima CE installation package contains the executable JAR file of Proxima CE. After you upload the JAR file to the MaxCompute project as a resource, you can call the JAR file to run Proxima CE tasks.

Upload the JAR file as a resource

You can use the MaxCompute client (odpscmd) or DataWorks to upload the JAR file to a MaxCompute project as a resource. In this example, DataWorks is used to upload and publish resources. For more information about how to upload a JAR file as a MaxCompute resource by using odpscmd, see Resource operations.

  1. On the DataStudio page in the DataWorks console, upload the JAR file as a resource.

  2. Submit the resource.

  3. Publish the resource.

For more information about how to upload resources by using DataWorks, see "Step 1: Create a resource or upload an existing resource" in the Create and use MaxCompute resources topic.

Prepare input tables

Before you run a Proxima CE task, you must prepare the following input tables:

  • Doc table: the base table that contains the vectors to be searched.

  • Query table: the table that contains the query vectors.

Table creation statements

-- Create a doc table.
create table doc_table_float_smoke(pk string,vector string) partitioned by (pt string);

-- Create a query table.
create table query_table_float_smoke(pk string,vector string) partitioned by (pt string);

Input table format

  • Table names

    • The names of input tables for a task cannot contain the tmp_ string. Otherwise, the task fails to run.

    • The names of input tables and names of partitions in input tables must be 1 to 64 characters in length. Otherwise, the task fails to run.

  • Fields

    | Fixed field | Description | Data type |
    | --- | --- | --- |
    | pk | The primary key value used for queries. | The default data type is STRING. Values can be strings such as 1.nid, 2.nid, 3.nid, ... or numeric values of the INT64 type such as 123, 456, 789, .... If all values are of the INT64 type, you can specify the data type of the column as BIGINT. If you also set -pk_type to INT64, query performance is improved. |
    | vector | The vector. | The default data type is STRING. |
    | category | The category used in multi-category scenarios. This field is required only for multi-category search. | The default data type is BIGINT. |
    | pt | The partition. | The default data type is STRING. |
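The naming rules and the primary key guidance above can be checked before a task is submitted. The following is a minimal sketch, not part of Proxima CE: the function names `name_problems` and `all_pks_fit_int64` are hypothetical, and the checks cover only the rules stated in this section (length of 1 to 64 characters, no `tmp_` substring in table names, and whether every pk value is an integer in the INT64 range).

```python
import re

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1
_INT_LITERAL = re.compile(r"-?[0-9]+")


def name_problems(name, is_table=True):
    """Return the documented naming rules that `name` violates (empty list if none)."""
    problems = []
    if not 1 <= len(name) <= 64:
        problems.append("length must be 1 to 64 characters")
    # The tmp_ restriction is documented for table names only.
    if is_table and "tmp_" in name:
        problems.append("table name must not contain 'tmp_'")
    return problems


def all_pks_fit_int64(pks):
    """Return True if every primary key is an integer literal in the INT64 range."""
    for pk in pks:
        text = str(pk)
        if not _INT_LITERAL.fullmatch(text):
            return False
        if not INT64_MIN <= int(text) <= INT64_MAX:
            return False
    return True
```

If `all_pks_fit_int64` returns True for a doc table, declaring pk as BIGINT and setting -pk_type to INT64 is the option described above. Otherwise, keep the default STRING type.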

Sample input tables

  • Doc table

    | pk | vector | pt |
    | --- | --- | --- |
    | id1 | 0~1~1~5 | 20190322 |
    | id2 | 0~1~1~2 | 20190322 |
    | id3 | 3~2~1~1 | 20190322 |
    | ... | ... | ... |

  • Query table

    | pk | vector | pt |
    | --- | --- | --- |
    | id8 | 0~1~1~5 | 20190322 |
    | id9 | 0~1~1~2 | 20190322 |
    | id10 | 3~2~1~1 | 20190322 |
    | ... | ... | ... |
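In the sample tables, each vector is stored as a STRING whose components are joined with a tilde, for example 0~1~1~5. The following sketch shows one way to convert between a list of numbers and that string form when you prepare rows. The helper names are hypothetical. Treating `~` as the separator and parsing components as floats are assumptions inferred from the samples and the table names, not documented Proxima CE behavior.

```python
def encode_vector(values, sep="~"):
    """Join vector components into a delimited string, e.g. [0, 1, 1, 5] -> '0~1~1~5'."""
    return sep.join(str(v) for v in values)


def decode_vector(text, sep="~", dimension=None):
    """Split a delimited vector string into floats, optionally checking its dimension."""
    values = [float(part) for part in text.split(sep)]
    if dimension is not None and len(values) != dimension:
        raise ValueError(
            "expected %d components, got %d" % (dimension, len(values))
        )
    return values
```

Passing `dimension` catches rows whose vectors are shorter or longer than the rest of the table before the data is loaded.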

Use the vector search features

| Scenario | Key feature | References |
| --- | --- | --- |
| Basic vector search | Supports search for the top K results from millions of data records. | Basic vector search |
| Multi-category search | Supports multi-category scenarios, including scenarios where the query table and doc table belong to different categories and scenarios where a single query table belongs to multiple categories. | Multi-category search |
| Cluster sharding | Allows you to create indexes based on cluster sharding. This method can help reduce the amount of data to be computed and accelerate index-based queries. | Cluster sharding |
| Inner product and cosine distance | Supports inner-product search. | Inner product and cosine distance |
| Converters | Allows you to use converters. In most cases, converters can help improve performance and decrease the index size. The retrieval loss varies with actual situations. | Converters |

After a vector search completes, the system automatically generates an output table in MaxCompute, so you do not need to create one yourself. You only need to set the -output_table parameter in the Proxima CE code to the name of the table. For more information about the format of the generated output table, see the Output table format section in this topic.

Output table format

After the vector search is performed, the system automatically generates an output table in MaxCompute. This section describes the format of the generated output table.

  • Table name: the name of the output table that you specified in the code of Proxima CE.

    • The name of the output table cannot contain periods (.), which are considered special characters in MaxCompute. If the name contains a period, MaxCompute cannot parse the table name.

    • The name of the output table generated for a task cannot contain the tmp_ string. Otherwise, the task fails to run.

    • The names of output tables and names of partitions in output tables must be 1 to 64 characters in length. Otherwise, the task fails to run.

  • Fields

    | Fixed field | Description | Data type |
    | --- | --- | --- |
    | pk | The primary key value that corresponds to each query in a query table. | The default data type is STRING. Values can be strings such as 1.nid, 2.nid, 3.nid, ... or numeric values of the INT64 type such as 123, 456, 789, .... If all values are of the INT64 type, you can specify the data type of the column as BIGINT. If you also set -pk_type to INT64, query performance is improved. |
    | knn_result | The primary key value of the doc table that is retrieved for queries. | The default data type is STRING. |
    | score | The similarity score of the retrieved documents. | The default data type is STRING. Results are listed from most similar to least similar. For details, see the note that follows this table. |
    | category | The category used in multi-category scenarios. This field is required only for multi-category search. | The default data type is BIGINT. |
    | pt | The partition. | The default data type is STRING. |

    Note

    Proxima CE always lists the output results from most similar to least similar. How this maps to the order of the score values depends on the distance algorithm:

    • For the distances calculated by using inner products or the MIPS squared Euclidean algorithm in the Proxima 2 kernel, a longer distance indicates higher similarity. Proxima CE lists the output results based on the scores in descending order.

    • For the distances calculated by using other algorithms, a shorter distance indicates higher similarity. Proxima CE lists the output results based on the scores in ascending order, which is the same method that the Proxima 2 kernel uses.
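The ordering rule in the note above can be reproduced on the client side, for example when you merge rows from several partitions. The following is a minimal sketch under stated assumptions: the metric labels in `LARGER_IS_MORE_SIMILAR` are illustrative names chosen for this example, not Proxima CE parameter values, and each row is assumed to be a dict with a `score` field that is stored as a string, as in the default output schema.

```python
# Distance measures for which a larger score means higher similarity,
# according to the note above. The labels are illustrative only.
LARGER_IS_MORE_SIMILAR = {"inner_product", "mips_squared_euclidean"}


def sort_by_similarity(rows, metric):
    """Return rows ordered from most similar to least similar for the given metric."""
    descending = metric in LARGER_IS_MORE_SIMILAR
    # score is STRING by default, so convert it before comparing.
    return sorted(rows, key=lambda row: float(row["score"]), reverse=descending)
```

Converting score to a float matters here: a plain string comparison would place "10.0" before "9.5".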

Sample output table

| pk | knn_result | score | pt |
| --- | --- | --- | --- |
| id8 | id1 | 0.1 | 20190322 |
| id8 | id2 | 0.2 | 20190322 |
| id9 | id1 | 0.1 | 20190322 |
| id9 | id3 | 0.3 | 20190322 |
| ... | ... | ... | ... |
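The output table stores one row per (query, retrieved document) pair, so downstream code usually regroups the rows by query. The following sketch does this for the sample rows above. The function name `group_by_query` and the tuple layout `(pk, knn_result, score, pt)` are assumptions made for this example. How you read the rows from MaxCompute is outside its scope.

```python
from collections import defaultdict


def group_by_query(rows):
    """Map each query pk to its (knn_result, score) pairs, preserving row order."""
    grouped = defaultdict(list)
    for pk, knn_result, score, _pt in rows:
        # score is STRING by default, so convert it to a float for later use.
        grouped[pk].append((knn_result, float(score)))
    return dict(grouped)


# Rows copied from the sample output table.
sample_rows = [
    ("id8", "id1", "0.1", "20190322"),
    ("id8", "id2", "0.2", "20190322"),
    ("id9", "id1", "0.1", "20190322"),
    ("id9", "id3", "0.3", "20190322"),
]

results = group_by_query(sample_rows)
```

Because row order is preserved, each list keeps the similarity order in which Proxima CE wrote the results.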