Batch processing for offline vector search tasks: Proxima CE

Proxima CE is an offline vector search engine built on the Proxima 2.x kernel, developed by Alibaba DAMO Academy. It runs as MapReduce or Graph jobs inside MaxCompute, reading vector data from MaxCompute tables and writing search results back to MaxCompute tables. Use Proxima CE when you need large-scale batch vector search — including top K retrieval from millions of records, multi-category search, and cluster-sharded index queries — without managing a separate search infrastructure.

What Proxima CE supports

Data types

Data type	Notes
`INT8`	—
`FLOAT`	—
`BINARY`	Can be converted to `INT32` using the `binary_to_int` parameter. See Optional parameters.

Search methods

Method	Full name	Default
HNSW	Hierarchical Navigable Small World	Yes
SSG	Satellite System Graph	—
HC	Hierarchical Clustering	—
GC	Graph Clustering	—
QC	Quantized Clustering	—
Linear search	—	—

Distance calculation

Three distance methods are available via the distance_method parameter:

Squared Euclidean distance
Inner product
Hamming distance

For details, see Optional parameters.
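The three distance measures listed above can be sketched in plain Python. This is only an illustration of the textbook definitions, not Proxima CE's implementation. The function names are hypothetical, and the Hamming example assumes binary vectors are packed into non-negative integers.

```python
from typing import Sequence


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared per-dimension differences (no square root is taken)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of per-dimension products; larger means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of differing bits between two vectors of packed,
    non-negative integers (an assumption made for this sketch)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

Note that squared Euclidean and Hamming are distances (smaller is closer), while inner product is a similarity (larger is closer), so result ordering differs between them.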

Similarity threshold

Set a similarity threshold using the threshold_score parameter. If the score computed for a vector exceeds the specified threshold, the system filters that vector out of the results. For details, see Optional parameters.
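The filtering rule can be pictured as a post-processing step over scored results. The sketch below is a hedged illustration only: the function name and the (doc_id, score) result shape are hypothetical, and treating a score equal to the threshold as kept is an assumption, since the text only says that values exceeding the threshold are filtered out.

```python
from typing import List, Tuple


def filter_by_threshold(
    results: List[Tuple[str, float]], threshold_score: float
) -> List[Tuple[str, float]]:
    """Drop every (doc_id, score) pair whose score exceeds the threshold.

    Assumption: a score exactly equal to the threshold is kept.
    """
    return [(doc_id, score) for doc_id, score in results if score <= threshold_score]


hits = [("a", 0.2), ("b", 0.9), ("c", 0.5)]
kept = filter_by_threshold(hits, threshold_score=0.5)
```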

How it works

MaxCompute table (source data)
        │
        ▼
Proxima CE — creates index, runs batch queries
(via MapReduce or Graph jobs)
        │
        ▼
MaxCompute table (search results)

Proxima CE provides built-in executable JAR files to run in MaxCompute. Index files are stored in MaxCompute Volume storage (backed by an OSS external volume) and are reused across query tasks.
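To make the query step concrete, here is a minimal sketch of top K retrieval by linear search, one of the search methods listed earlier. It runs in memory on a small dictionary and is not how Proxima CE executes inside MaxCompute. The function name, the data layout, and the choice of squared Euclidean distance are assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def top_k_linear(
    query: Sequence[float], docs: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every doc against the query and return the k closest,
    ordered by ascending squared Euclidean distance."""
    scored = (
        (doc_id, sum((q - d) ** 2 for q, d in zip(query, vec)))
        for doc_id, vec in docs.items()
    )
    # nsmallest keeps only k candidates in memory while scanning all docs.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


docs = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [3.0, 3.0]}
nearest = top_k_linear([0.9, 0.9], docs, k=2)
```

Linear search compares the query with every record, so its cost grows with the table size. The graph and clustering methods listed above trade some recall for speed by visiting only part of the index.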

Prerequisites

Before you begin, make sure you have:

Required

A MaxCompute project. See Create a MaxCompute project.
A DataWorks workspace with the MaxCompute project added as a data source.
- If you selected Participate in Public Preview of Data Studio when creating the workspace, bind compute resources by following Associate a compute resource with a workspace (Participate in Public Preview of Data Studio turned on).
- Otherwise, follow Add a data source or register a cluster to a workspace.
The Volume feature activated and an external volume created. Proxima CE writes its index to Volume storage.
- To activate the Volume feature, see Apply for trial use of new features. You receive a text message after activation. If Volume is not activated, jobs fail with: FAILED: ODPS-0420095: Access Denied - Volumes is not allowed in project config.
- To create an external volume, see External volume operations.

Recommended

Create the external volume before you start. If you skip this step, you must provide role_arn as a required startup parameter, which introduces security risks.

Usage notes

The external volume must be configured with an OSS internal endpoint, for example, oss-cn-beijing-internal.aliyuncs.com. For OSS internal endpoints by region, see Regions and endpoints.
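A quick sanity check on the endpoint string can catch a public endpoint before a job is submitted. The helper below is a hypothetical sketch: its pattern is inferred only from the single example above (oss-cn-beijing-internal.aliyuncs.com), so endpoints in some regions may not match it. Treat Regions and endpoints as the authoritative list.

```python
def is_oss_internal_endpoint(endpoint: str) -> bool:
    """Return True if the endpoint looks like an OSS internal endpoint.

    Assumption: internal endpoints follow the pattern
    oss-<region>-internal.aliyuncs.com, as in the documented example.
    """
    host = endpoint.strip().lower()
    # Tolerate an optional scheme and trailing path.
    for scheme in ("https://", "http://"):
        if host.startswith(scheme):
            host = host[len(scheme):]
    host = host.split("/", 1)[0]
    return host.startswith("oss-") and host.endswith("-internal.aliyuncs.com")
```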

Supported tools

Tool	Supported platforms	Notes
odpscmd	Linux only	JAR files are compiled for Linux. Windows and macOS are not supported.
DataWorks	All platforms	Create ODPS MR nodes and run them with ODPS SQL scripts.

Get started

Install the Proxima CE package — Set up the environment and configure Proxima CE. See Install the Proxima CE package.
Run a vector search — Choose a search scenario from the table below.

Scenario	Key capability	Reference
Basic vector search	Top K retrieval from millions of records	Basic vector search
Multi-category search	Supports query and doc tables that belong to different categories, and a single query searched across multiple categories	Multi-category search
Cluster sharding	Index by cluster shard to reduce compute and accelerate queries	Cluster sharding
Inner product and cosine distance	Inner-product and cosine distance search	Inner product and cosine distance
Converters	Improve performance and reduce index size (retrieval loss varies)	Converters
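For the inner product and cosine scenario, it helps to recall how the two measures relate: cosine similarity equals the inner product of the two vectors after each is scaled to unit length. The sketch below shows that general identity in plain Python. It does not describe how Proxima CE handles cosine internally, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed as the inner product of normalized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

Because of this identity, vectors that are normalized once up front can be searched with an inner-product metric to obtain cosine ranking.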

References

Parameters and kernel modules

Test reports

Feature testing:

Performance testing:

FAQ and troubleshooting