Proxima CE tasks accept the following required and optional parameters.
Required parameters
| Parameter | Description |
|---|---|
| doc_table | The MaxCompute table used as the candidate dataset for search. Table names cannot contain periods (.), which are special characters in MaxCompute that cause parse failures. To reference a table from another project, use the project_name.table_name format. |
| doc_table_partition | The partition name in the doc table. |
| query_table | The MaxCompute table used as the query dataset. Table names cannot contain periods (.). To reference a table from another project, use the project_name.table_name format. |
| query_table_partition | The partition name in the query table. |
| output_table | The table that stores search results. Proxima CE creates this table automatically. Specify the name only; do not create the table in advance. |
| output_table_partition | The partition name in the output table. |
| data_type | The data type of the doc and query tables. Valid values: FLOAT, INT8, BINARY. |
| dimension | The dimension of the feature vectors. If data_type is set to BINARY, the value must be an integer multiple of 32. |
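The constraints on the required parameters can be checked before a task is submitted. The following is a minimal sketch, not part of Proxima CE: the function name check_required and the dictionary layout are hypothetical, and the table names and partition values in the usage lines are placeholders.

```python
REQUIRED = (
    "doc_table", "doc_table_partition",
    "query_table", "query_table_partition",
    "output_table", "output_table_partition",
    "data_type", "dimension",
)
VALID_DATA_TYPES = {"FLOAT", "INT8", "BINARY"}


def check_required(params):
    """Return a list of problems found in a dict of Proxima CE parameters.

    Hypothetical helper: it only encodes the rules stated in the table above.
    """
    problems = [f"missing required parameter: {name}"
                for name in REQUIRED if name not in params]

    data_type = params.get("data_type")
    if data_type is not None and data_type not in VALID_DATA_TYPES:
        problems.append(f"data_type must be one of {sorted(VALID_DATA_TYPES)}")

    dimension = params.get("dimension")
    if dimension is not None:
        if not isinstance(dimension, int) or dimension <= 0:
            problems.append("dimension must be a positive integer")
        elif data_type == "BINARY" and dimension % 32 != 0:
            problems.append("dimension must be a multiple of 32 for BINARY data")
    return problems


# Placeholder values for illustration only.
example = {
    "doc_table": "my_doc_table", "doc_table_partition": "20240101",
    "query_table": "my_query_table", "query_table_partition": "20240101",
    "output_table": "my_output_table", "output_table_partition": "20240101",
    "data_type": "BINARY", "dimension": 100,
}
print(check_required(example))
```

With the placeholder values above, the check reports that a BINARY dimension of 100 is not a multiple of 32.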
Optional parameters
Optional parameters are grouped by functional area. All parameters that apply only when sharding_mode is set to cluster are listed in the K-means clustering parameters section.
Search parameters
| Parameter | Description | Default |
|---|---|---|
| topk | The number of top-ranked results to return. Accepts a comma-delimited list such as 10,20,30. The actual number of results stored in the output table equals the maximum value in the list. | 200 |
| threshold_score | The score threshold for filtering results. For most distance methods, a lower score indicates higher similarity, so results with scores above the threshold are filtered out. For inner_product and mips_squared_euclidean, a higher score indicates higher similarity, so results with scores below the threshold are filtered out. | No default |
| query_multi_label | Set to true to enable multi-category search. When enabled, the doc table must contain a category column. See Multi-category search. | false |
| memory_load | Set to true to load the index into memory during the seek process, which improves search speed. Set to false if cluster memory is insufficient. | true |
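The direction of the threshold_score filter and the effect of a topk list can be sketched as follows. This is an illustration of the rules described above, not Proxima CE code: the function names are hypothetical, and treating a score exactly equal to the threshold as kept is an assumption, since the description does not cover that case.

```python
# Distance methods for which a higher score means higher similarity.
HIGHER_IS_BETTER = {"inner_product", "mips_squared_euclidean"}


def stored_result_count(topk):
    """Return the number of results stored for a topk value such as '10,20,30'."""
    return max(int(part) for part in topk.split(","))


def passes_threshold(score, threshold, distance_method):
    """Return True if a result with this score is kept by threshold_score.

    Assumption: a score equal to the threshold is kept.
    """
    if distance_method in HIGHER_IS_BETTER:
        return score >= threshold  # scores below the threshold are filtered out
    return score <= threshold      # scores above the threshold are filtered out


print(stored_result_count("10,20,30"))
print(passes_threshold(0.5, 1.0, "squared_euclidean"))
print(passes_threshold(0.5, 1.0, "inner_product"))
```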
Index algorithm parameters
| Parameter | Description | Default |
|---|---|---|
| algo_model | The index algorithm used in the Proxima 2.x kernel. This parameter determines both the builder for index creation and the searcher for queries. Valid values: hnsw (Hierarchical Navigable Small World), ssg (Satellite System Graph), hc (Hierarchical Clustering), gc (Graph Clustering), qc (Quantized Clustering), linear (brute-force search). | hnsw |
| builder_params | Parameters for the IndexBuilder module. The index type must match the algorithm set in algo_model. Specify as a single-line JSON string. Double quotation marks do not need to be escaped. Spaces are not allowed. Example for HNSW: {"proxima.hnsw.builder.efconstruction":400,"proxima.hnsw.builder.max_neighbor_count":100} sets the ef construction value and the maximum number of nearest neighbors per node. See IndexBuilder. | No default |
| searcher_params | Parameters for the IndexSearcher module. The index type must match the algorithm set in algo_model. Specify as a single-line JSON string. Double quotation marks do not need to be escaped. Spaces are not allowed. Example for HNSW: {"proxima.hnsw.searcher.ef":400} sets the ef value for queries. See IndexSearcher. | No default |
| distance_method | The distance formula for computing feature vector similarity. Valid values: squared_euclidean, euclidean, mips_squared_euclidean, inner_product, hamming (BINARY data only), manhattan (L1 distance), chebyshev, canberra, geo_distance, rogers_tanimoto (BINARY data only), russell_rao (BINARY data only), matching (BINARY data only). | squared_euclidean |
| measure_params | Parameters for distance_method. Specify as a single-line JSON string. Double quotation marks do not need to be escaped. Spaces are not allowed. Example for mips_squared_euclidean: {"proxima.mips_euclidean.measure.injection_type":0}. See IndexMeasure. | No default |
| converter | The name of the IndexConverter module, which converts feature vectors in Proxima 2.x (for example, half-float conversion or INT8 quantization). IndexConverter can be used standalone or combined with other modules in the search process. See IndexConverter. | No default |
| converter_params | Parameters for the IndexConverter module. Specify as a single-line JSON string. Double quotation marks do not need to be escaped. Spaces are not allowed. Example for MIPS converter: {"proxima.mips.converter.m_value":4,"proxima.mips.converter.u_value":0.38196601,"proxima.mips.converter.forced_half_float":false,"proxima.mips.converter.spherical_injection":false}. See IndexConverter. | No default |
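Because builder_params, searcher_params, measure_params, and converter_params must be single-line JSON strings without spaces, it can help to generate them rather than type them by hand. A minimal sketch using the Python standard library follows; the helper name to_param_string is hypothetical, and the keys are the ones quoted in the examples above.

```python
import json


def to_param_string(params):
    """Serialize a dict as single-line JSON with no spaces between tokens."""
    # separators=(",", ":") removes the spaces json.dumps adds by default.
    return json.dumps(params, separators=(",", ":"))


builder_params = to_param_string({
    "proxima.hnsw.builder.efconstruction": 400,
    "proxima.hnsw.builder.max_neighbor_count": 100,
})
searcher_params = to_param_string({"proxima.hnsw.searcher.ef": 400})

print(builder_params)
print(searcher_params)
```

Python booleans are written as JSON true and false, which matches the form used in the converter_params example.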
Index sharding parameters
| Parameter | Description | Default |
|---|---|---|
| sharding_mode | The index sharding method. hash shards indexes using modulo hashing. cluster shards indexes using k-means clustering, which reduces the amount of data to compute in the seek process. | hash |
| column_num | The number of columns for the IndexBuilder module. When set to 0, the system calculates this automatically based on the data volume in doc_table and the data_type: data under 50 GB uses 2 GB per column; data over 50 GB uses 2.5 GB per column. Takes effect only when both column_num and row_num are set to positive values. | 0 |
| row_num | The number of rows for the IndexSearcher module. When set to 0, the system calculates this automatically: fewer than 100 million queries use 2 million queries per row; more than 100 million queries use 10 million queries per row. Takes effect only when both column_num and row_num are set to positive values. | 0 |
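The automatic sizing rules for column_num and row_num can be approximated to get a feel for how many shards a task will use. The sketch below is an estimate under stated assumptions, not the calculation Proxima CE performs: rounding up, the behavior at exactly 50 GB or 100 million queries, and the function names are all assumptions, and the data volume is taken as an input because the description does not say how it is derived from data_type.

```python
import math


def estimate_column_num(doc_size_gb):
    """Estimate the automatic column count from the doc data volume in GB."""
    per_column_gb = 2.0 if doc_size_gb < 50 else 2.5
    return max(1, math.ceil(doc_size_gb / per_column_gb))  # rounding up is assumed


def estimate_row_num(query_count):
    """Estimate the automatic row count from the number of queries."""
    per_row = 2_000_000 if query_count < 100_000_000 else 10_000_000
    return max(1, math.ceil(query_count / per_row))  # rounding up is assumed


def resolve_sharding(column_num, row_num, doc_size_gb, query_count):
    """Use manual values only when both are positive, else fall back to estimates."""
    if column_num > 0 and row_num > 0:
        return column_num, row_num
    return estimate_column_num(doc_size_gb), estimate_row_num(query_count)


print(resolve_sharding(0, 0, doc_size_gb=100, query_count=10_000_000))
print(resolve_sharding(8, 4, doc_size_gb=100, query_count=10_000_000))
```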
Multi-category search parameters
These parameters apply to multi-category search scenarios. See query_multi_label to enable multi-category search.
| Parameter | Description | Default |
|---|---|---|
| category_threshold | The document count threshold that determines whether a category uses large-category or small-category search. Categories with more documents than this threshold use large-category search; categories with fewer documents use small-category search (linear search, with multiple small categories merged). | 1000000 |
| category_col_num | The number of index columns for small categories (fewer than 1 million documents). See column_num for configuration details. | 0 |
| category_row_num | The number of index rows for small categories (fewer than 1 million documents). See row_num for configuration details. | 0 |
| category_thread_num | The thread pool size for large-category search (more than 1 million documents). A larger value increases search concurrency. | 10 |
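The effect of category_threshold can be illustrated by splitting categories according to their document counts. This is a sketch only: the function name and the category names are hypothetical, and placing a category with exactly category_threshold documents in the small group is an assumption.

```python
def split_categories(doc_counts, category_threshold=1_000_000):
    """Split categories into (large, small) lists by document count.

    Assumption: a count equal to the threshold is treated as small.
    """
    large = sorted(name for name, count in doc_counts.items()
                   if count > category_threshold)
    small = sorted(name for name, count in doc_counts.items()
                   if count <= category_threshold)
    return large, small


# Hypothetical category names and document counts.
counts = {"shoes": 5_000_000, "hats": 20_000, "belts": 300_000}
large, small = split_categories(counts)
print(large)  # categories searched with large-category search
print(small)  # categories merged and searched linearly
```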
K-means clustering parameters
These parameters apply only when sharding_mode is set to cluster.
| Parameter | Description | Default |
|---|---|---|
| kmeans_resource_name | The centroid name for k-means clustering. Proxima CE starts a Graph computing task in MaxCompute to perform k-means clustering on the source data. | kmeans_resource_name |
| kmeans_sample_ratio | The sample rate for k-means centroid initialization. Valid values: 0–1. | 0.05 |
| kmeans_seek_ratio | The centroid filtering rate during the seek process. Valid values: 0–1. | 0.1 |
| kmeans_iter_num | The number of k-means clustering iterations. | 30 |
| kmeans_cluster_num | The number of k-means centroids. | 1000 |
| kmeans_init_center_method | The centroid initialization method. | "" |
| kmeans_worker_num | The number of workers for k-means clustering. | 0 |
Task and resource parameters
| Parameter | Description | Default |
|---|---|---|
| job_mode | The stages to run. Valid values: train:build:seek (default), build:seek, seek, train:build:seek:recall, build:seek:recall, seek:recall. | train:build:seek |
| clean_build_volume | Set to true to delete the index after the build and seek processes complete. The index is written to an external volume in MaxCompute during the build process, loaded during the seek process, and then deleted automatically. Note: If set to | true |
| pk_type | The data type of the pk column in input tables. Valid values: int64, string. If the pk column is not INT64 (for example, STRING), Proxima CE creates a temporary table, maps the pk column to a tmp_pk column of type INT64, and then performs a JOIN to produce the final result. This mapping adds approximately 30 minutes to a search over 100 million documents. | string |
| vector_separator | The separator character for feature vectors. Tildes (~) are not supported. To use spaces as separators, set this parameter to blank. Do not enclose the separator in single or double quotation marks, because the entire quoted string is treated as the separator. For example, setting this to ',' uses the three-character string ',' as the separator, not a comma. | ~ |
| binary_to_int | Set to true to convert BINARY type data to INT32 before indexing. Valid only for data_type=BINARY. The dimension parameter continues to refer to the binary vector dimension. This conversion reduces index sizes by packing N binary values (0 or 1) into N/32 32-bit integers. When false, input data looks like 1,1,1,1,.... When true, input data looks like 12345,13423,.... | false |
| column_num | See Index sharding parameters. | 0 |
| row_num | See Index sharding parameters. | 0 |
| mapper_split_size | The data volume processed per internal mapper, in MB. This exposes the mapper.split.size option. | 256 |
| odps_task_priority | The priority of the Proxima CE task, applied across all MaxCompute job types (SQL, MapReduce, Graph). Valid values: 0–9. Lower values indicate higher priority. Set to -1 to use the MaxCompute baseline priority. | -1 |
| h (-help) | Displays help information. | No default |
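The packing that binary_to_int refers to can be pictured as folding each group of 32 binary values into one 32-bit integer. The sketch below shows the idea only. The bit order (first value in the most significant bit) and the unsigned result are assumptions; the layout Proxima CE actually expects is not stated here and should be confirmed before preparing input data this way. The function name is hypothetical.

```python
def pack_binary_vector(bits):
    """Pack a list of 0/1 values into one integer per group of 32 values.

    Assumptions: the first value of each group becomes the most significant
    bit, and the result is kept as an unsigned 32-bit value.
    """
    if len(bits) % 32 != 0:
        raise ValueError("binary dimension must be a multiple of 32")
    packed = []
    for start in range(0, len(bits), 32):
        word = 0
        for bit in bits[start:start + 32]:
            if bit not in (0, 1):
                raise ValueError("binary vectors may contain only 0 and 1")
            word = (word << 1) | bit
        packed.append(word)
    return packed


vector = [0] * 31 + [1] + [1] + [0] * 31  # dimension 64
print(pack_binary_vector(vector))          # two integers instead of 64 values
```

A 64-dimensional binary vector becomes two integers, which is where the reduction in index size comes from.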
OSS storage parameters
These parameters configure access to Object Storage Service (OSS), where the index is stored during the build process.
| Parameter | Description | Default |
|---|---|---|
| oss_access_id | The AccessKey ID of your Alibaba Cloud account or a Resource Access Management (RAM) user. Get your AccessKey ID from the AccessKey Pair page. | No default |
| oss_access_key | The AccessKey secret corresponding to the AccessKey ID. Get your AccessKey secret from the AccessKey Pair page. | No default |
| oss_endpoint | The endpoint of the MaxCompute project. The endpoint varies by region and network connection type. For a complete list, see Endpoints. | No default |
| oss_bucket | The name of the OSS bucket. To view your bucket names, see List buckets. | No default |
| tunnel_endpoint | The MaxCompute Tunnel endpoint. Specify this parameter to prevent download session failures when accessing tables across networks. For details, see MaxCompute Tunnel endpoint issue. | No default |