Proxima CE is an engine based on the Proxima 2.x kernel that processes large numbers of offline vector search tasks concurrently. The tasks include basic vector search, multi-category vector search, and search for the top K results from millions of data records. This topic describes how to use Proxima CE in MaxCompute.
Background information
Proxima CE is high-performance software developed by Alibaba DAMO Academy to implement vector search (nearest neighbor search). Compared with similar open source products such as Faiss, Proxima CE delivers better stability and higher performance. Proxima CE is easy to use and provides built-in executable JAR files to run in MaxCompute. When you run Proxima CE, you can use MaxCompute tables as the base and input data for vector data queries. You can use Proxima CE to create indexes and perform multiple query tasks at the same time by running MaxCompute MapReduce jobs or Graph jobs. The output results of batch query operations are generated as MaxCompute tables.
Description
Supported data types and search methods
Multiple data types, including INT8, FLOAT, and BINARY, are supported.
Note: You can convert data of the BINARY type into the INT32 type. For more information, see the description of the binary_to_int parameter in Optional parameters.
Multiple search methods, including Hierarchical Navigable Small World (HNSW) graph, Satellite System Graph (SSG), Hierarchical Clustering (HC), Graph Clustering (GC), Quantized Clustering (QC), and linear search, are supported. By default, an HNSW graph is used for search.
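Of the search methods listed above, linear search is the simplest to illustrate: it compares the query with every document vector and keeps the K nearest. The following is a minimal sketch in plain Python, not Proxima CE code. The function names and the sample data are hypothetical, and the squared Euclidean distance is chosen only for illustration.

```python
import heapq


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def linear_search_top_k(query, docs, k):
    """Brute-force top-K search (hypothetical helper, not a Proxima CE API).

    docs maps a document ID to its vector. Returns up to k
    (doc_id, distance) pairs, nearest first.
    """
    scored = ((squared_euclidean(query, vec), doc_id) for doc_id, vec in docs.items())
    nearest = heapq.nsmallest(k, scored)
    return [(doc_id, dist) for dist, doc_id in nearest]


# Hypothetical sample data: four 2-dimensional document vectors.
docs = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [3.0, 4.0], "d": [0.5, 0.0]}
result = linear_search_top_k([0.0, 0.0], docs, k=2)
print(result)
```

Linear search is exact but scans every vector for each query, whereas graph-based and clustering-based methods such as HNSW use an index to avoid the full scan.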
Advanced computing capabilities based on optional parameters
Multiple distance calculation methods, including the squared Euclidean distance, inner product, and Hamming distance, are supported. For more information, see the description of the distance_method parameter in Optional parameters.
You can configure a similarity threshold. If the score of a vector exceeds the specified threshold, the system filters out the vector. For more information, see the description of the threshold_score parameter in Optional parameters.
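The three distance calculations and the threshold filter can be sketched in plain Python as follows. This is an illustration only, not the Proxima CE implementation. The function names are hypothetical, and the filter applies the rule exactly as stated above (a score that exceeds the threshold is dropped). The authoritative semantics are in the descriptions of the distance_method and threshold_score parameters.

```python
def squared_euclidean(a, b):
    """Sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def filter_by_threshold(scores, threshold):
    """Keep only entries whose score does not exceed the threshold.

    Assumption: mirrors the rule in the text literally; the actual
    behavior of threshold_score is defined by Proxima CE.
    """
    return {doc_id: s for doc_id, s in scores.items() if s <= threshold}


# Hypothetical sample data.
query = [1.0, 0.0]
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [3.0, 0.0]}
scores = {doc_id: squared_euclidean(query, vec) for doc_id, vec in docs.items()}
kept = filter_by_threshold(scores, threshold=2.0)
print(scores, kept)
```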
Preparations
Before you use Proxima CE, make sure that you have completed the following preparations:
A MaxCompute project is created. For more information, see Create a MaxCompute project.
A DataWorks workspace is created, and the created MaxCompute project is added to the workspace as a data source.
If you selected Participate in Public Preview of Data Studio when you created the DataWorks workspace, bind computing resources by following the instructions in Associate a computing resource with a workspace (Participate in Public Preview of Data Studio turned on).
If you did not select Participate in Public Preview of Data Studio when you created the DataWorks workspace, bind a data source by following the instructions in Add a data source or register a cluster to a workspace.
The Volume feature is activated, and an external volume is created.
After Proxima CE creates an index, the index must be written to the Volume storage of MaxCompute. For more information about how to activate the Volume feature, see Apply for trial use of new features. You will receive a text message after the Volume feature is activated. If the Volume feature is not activated, an error message similar to the following is returned when you run a task:
FAILED: ODPS-0420095: Access Denied - Volumes is not allowed in project config.
For more information about how to create an external volume, see External volume operations.
Note: Proxima CE uses the external volume in subsequent operations. We recommend that you create the external volume in advance. Otherwise, you must provide role_arn as a required startup parameter, which may pose security risks.
Precaution
The external volume used by Proxima CE must be configured with an internal endpoint of Object Storage Service (OSS), such as oss-cn-beijing-internal.aliyuncs.com. For more information about OSS internal endpoints, see Regions and endpoints.
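Before you create the external volume, a quick check can catch a public endpoint that was pasted by mistake. The following sketch assumes that internal endpoints follow the pattern oss-<region>-internal.aliyuncs.com, as in the example above. The helper name is hypothetical, and the check is not part of Proxima CE or MaxCompute. The authoritative list is in Regions and endpoints.

```python
def is_internal_oss_endpoint(endpoint: str) -> bool:
    """Return True if the endpoint looks like an OSS internal endpoint.

    Assumption: internal endpoints match oss-<region>-internal.aliyuncs.com.
    An optional scheme (for example https://) and path are ignored.
    """
    host = endpoint.strip().lower().split("://")[-1].split("/")[0]
    return host.startswith("oss-") and host.endswith("-internal.aliyuncs.com")


print(is_internal_oss_endpoint("oss-cn-beijing-internal.aliyuncs.com"))
print(is_internal_oss_endpoint("oss-cn-beijing.aliyuncs.com"))
```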
Supported tools
You can use the MaxCompute client odpscmd or DataWorks to upload resources and run tasks.
odpscmd: You can use this tool only in Linux because JAR files of Proxima CE are compiled in Linux. Windows and macOS are not supported.
DataWorks: You can create ODPS MR nodes in DataWorks and run these nodes by using ODPS SQL scripts.
Instructions
Install the Proxima CE package.
Before you use Proxima CE, you must install the Proxima CE package to prepare the environment and configure settings for Proxima CE.
Use the vector search features.
The following table lists the supported vector search scenarios and the key feature of each scenario.
Scenario | Key feature
Basic vector search | Supports search for the top K results from millions of data records.
Multi-category search | Supports multi-category scenarios, including scenarios where the query table and doc table belong to different categories and scenarios where a single query table belongs to multiple categories.
Cluster sharding | Allows you to create indexes based on cluster sharding. This method can help reduce the amount of data to be computed and accelerate index-based queries.
Inner product and cosine distance | Supports inner-product and cosine distance search.
Converters | Allows you to use converters. In most cases, converters can help improve performance and decrease the index size. The retrieval loss varies with actual situations.
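For the inner product and cosine distance scenario, the two measures are closely related: the cosine similarity of two vectors equals the inner product of their unit-length (L2-normalized) versions. The following plain-Python sketch shows the relationship. It is an illustration with hypothetical function names and does not describe how Proxima CE computes these values internally.

```python
import math


def inner_product(a, b):
    """Sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(inner_product(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity computed as the inner product of normalized vectors."""
    return inner_product(normalize(a), normalize(b))


# Parallel vectors give a value near 1; orthogonal vectors give a value near 0.
print(cosine_similarity([2.0, 0.0], [5.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```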
Obtain reference information. The following table provides links to relevant reference information.
Item
References
Parameters and kernel modules
Test reports
Feature testing
Performance testing
FAQ
Error codes