MaxCompute: Reference: Proxima CE parameters

Last Updated: Jun 19, 2023

This topic describes the required and optional parameters that are used when you run a task in Proxima CE.

Required parameters

Parameter

Description

doc_table

The name of the doc table, which is a MaxCompute table. You must prepare the doc table and use it as a candidate dataset for search.

Important

The table name itself cannot contain periods (.), which are special characters in MaxCompute. A table name that contains periods fails to be parsed. To reference a table from another project, specify the name in the project_name.table_name format.

doc_table_partition

The name of the partition in the doc table.

query_table

The name of the query table, which is a MaxCompute table. You must prepare the query table and use it as a dataset for search.

Important

The table name itself cannot contain periods (.), which are special characters in MaxCompute. A table name that contains periods fails to be parsed. To reference a table from another project, specify the name in the project_name.table_name format.

query_table_partition

The name of the partition in the query table.

output_table

The name of the output table that stores the search results. The output table is created automatically, so you only need to specify its name.

output_table_partition

The name of the partition in the output table.

data_type

The data type of the doc and query tables. The FLOAT, INT8, and BINARY data types are supported.

dimension

The dimension of characteristic vectors. If the data_type parameter is set to BINARY, the dimension must be an integer multiple of 32.
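
The constraints in the table above can be checked before a task is submitted. The following is a minimal sketch in Python; the function name check_required_params and the dictionary-based input are hypothetical and are not part of Proxima CE. It also assumes that data_type values can be compared case-insensitively, which this topic does not state.

```python
# Hypothetical pre-flight check for the required parameters listed above.
# Nothing here is a Proxima CE API; it only mirrors the documented rules.
REQUIRED_PARAMS = [
    "doc_table", "doc_table_partition",
    "query_table", "query_table_partition",
    "output_table", "output_table_partition",
    "data_type", "dimension",
]
SUPPORTED_DATA_TYPES = {"FLOAT", "INT8", "BINARY"}


def check_required_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name in REQUIRED_PARAMS:
        if params.get(name) in ("", None):
            problems.append(f"missing required parameter: {name}")

    # Assumption: data_type is compared case-insensitively.
    data_type = str(params.get("data_type") or "").upper()
    if data_type and data_type not in SUPPORTED_DATA_TYPES:
        problems.append(f"unsupported data_type: {data_type}")

    dimension = params.get("dimension")
    if dimension is not None:
        if not isinstance(dimension, int) or dimension <= 0:
            problems.append("dimension must be a positive integer")
        elif data_type == "BINARY" and dimension % 32 != 0:
            problems.append("dimension must be a multiple of 32 for BINARY")
    return problems


example = {
    "doc_table": "doc_table_demo", "doc_table_partition": "20230619",
    "query_table": "query_table_demo", "query_table_partition": "20230619",
    "output_table": "output_table_demo", "output_table_partition": "20230619",
    "data_type": "FLOAT", "dimension": 128,
}
print(check_required_params(example))
```

The table and partition names in the example are placeholders.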

Optional parameters

Parameter

Description

Default value

h (-help)

The help information.

No default value

topk

The number of most similar results to retrieve for each query. You can specify a comma-delimited list of values such as 10,20,30. The number of results that are written to the output table is determined by the maximum value in the list.

200
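
As an illustration of how a comma-delimited topk value resolves, the following Python sketch parses the list and takes its maximum. The helper parse_topk is hypothetical and only mirrors the rule described above.

```python
def parse_topk(value: str):
    """Split a comma-delimited topk string and return (all values, maximum).

    Per the description above, the maximum determines how many results
    are written to the output table.
    """
    ks = [int(part) for part in value.split(",")]
    return ks, max(ks)


ks, output_k = parse_topk("10,20,30")
print(ks, output_k)
```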

pk_type

The data type of the pk column in an input table. Valid values: int64 and string. Default value: string. If values in the pk column are of a data type other than INT64, such as the STRING type, Proxima CE creates a temporary input table for the task, maps the pk column to the tmp_pk column of the INT64 type, and then performs the JOIN operation to obtain the final result. In this case, a search for vector data from 100 million documents requires an additional period of about 30 minutes.

string

vector_separator

The vector separator. The default separator is a tilde (~). You can specify another special character as the separator. Spaces are supported: to use spaces as separators, set this parameter to blank. Do not enclose the separator character in single quotation marks (') or double quotation marks ("). Otherwise, the entire character string, including the quotation marks, is used as the separator. For example, if you set this parameter to ',', the three-character string ',' rather than a comma (,) is used as the separator.

~
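
The quotation-mark pitfall can be reproduced with ordinary string splitting. The following Python sketch uses a hypothetical helper, parse_vector, that splits a vector string on a literal separator; it only illustrates the behavior described above and is not how Proxima CE is implemented.

```python
def parse_vector(text: str, separator: str = "~"):
    """Split a vector string on the literal separator and convert to floats."""
    return [float(item) for item in text.split(separator)]


print(parse_vector("0.1~0.2~0.3"))   # default separator
print(parse_vector("1,2,3", ","))    # separator given without quotation marks

# With the quotation marks included, the separator is the three-character
# string ',' which never occurs in the data, so the vector is not split.
try:
    parse_vector("1,2,3", "','")
except ValueError as error:
    print("not parsed:", error)
```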

binary_to_int

Specifies whether the input data of the BINARY type is provided as INT32 values. This parameter is valid only for data of the BINARY type. If you specify this parameter, the dimension parameter still specifies the dimension of the binary characteristic vectors. For example, assume that commas (,) are used as separators. If the binary_to_int parameter is set to false, input data can be 1,1,1,1,1,1,..... If the binary_to_int parameter is set to true, input data can be 12345,13423,13325,..... This parameter allows you to represent N binary values made up of 0 and 1 as N/32 32-bit integers, which helps reduce index sizes.

false
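
The following Python sketch shows one way to pack binary values into 32-bit integers, assuming that each group of 32 binary values maps to one integer, which is consistent with the requirement that the dimension be a multiple of 32. The function name pack_bits_to_int32, the most-significant-bit-first order, and the unsigned representation are assumptions; this topic does not specify the layout that Proxima CE expects.

```python
def pack_bits_to_int32(bits):
    """Pack a list of 0/1 values into integers, 32 bits per integer.

    Assumptions (not confirmed by this topic): the first bit of each group
    is the most significant bit, and each word is kept as an unsigned value.
    """
    if len(bits) % 32 != 0:
        raise ValueError("the number of bits must be a multiple of 32")
    packed = []
    for start in range(0, len(bits), 32):
        word = 0
        for bit in bits[start:start + 32]:
            if bit not in (0, 1):
                raise ValueError("bits must be 0 or 1")
            word = (word << 1) | bit
        packed.append(word)
    return packed


vector = [0] * 31 + [1] + [1] + [0] * 31   # 64 binary dimensions
print(pack_bits_to_int32(vector))          # two integers instead of 64 values
```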

job_mode

The job mode. Valid values:

  • train:build:seek (default)

  • build:seek

  • seek

  • train:build:seek:recall

  • build:seek:recall

  • seek:recall

train:build:seek

clean_build_volume

Specifies whether to delete an index after the build and seek processes are complete. After an index is created in the build process, the index is written to an external volume in MaxCompute, and then loaded in the seek process. After the seek process is complete, the index is automatically deleted.

Note

If you set this parameter to true, the index is also deleted when the task fails.

true

algo_model

The index creation algorithm. The following algorithms are supported in the Proxima 2.x kernel: Hierarchical Navigable Small World (HNSW) graph, Satellite System Graph (SSG), Hierarchical Clustering (HC), Graph Clustering (GC), Quantized Clustering (QC), and linear search. This parameter determines the builder for index creation and searcher for data queries. Mappings between parameter values and the builders and searchers:

  • hnsw: HNSW builder and HNSW searcher

  • ssg: SSG builder and SSG searcher

  • hc: clustering builder and clustering searcher

  • gc: GC builder and GC searcher

  • qc: QC builder and QC searcher

  • linear: linear builder and linear searcher (brute-force search)

hnsw

builder_params

The parameters that you specify for the IndexBuilder module. By default, this parameter is left empty. The index type of the parameters that you specify must correspond to the algorithm specified by algo_model. You must specify a single-line JSON string for this parameter. Double quotation marks (") do not need to be escaped. Spaces are not allowed. For example, if you set this parameter to {"proxima.hnsw.builder.efconstruction":400,"proxima.hnsw.builder.max_neighbor_count":100}, the ef value for the construction method used by the HNSW graph algorithm and the maximum number of nearest neighbors of the node are specified. For more information about this parameter, see IndexBuilder.

No default value

searcher_params

The parameters that you specify for the IndexSearcher module. By default, this parameter is left empty. The index type of the parameters that you specify must correspond to the algorithm specified by algo_model. You must specify a single-line JSON string for this parameter. Double quotation marks (") do not need to be escaped. Spaces are not allowed. For example, if you set this parameter to {"proxima.hnsw.searcher.ef":400}, the ef value for HNSW-based queries is specified. For more information about this parameter, see IndexSearcher.

No default value
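
Parameters such as builder_params and searcher_params require a single-line JSON string without spaces. One way to produce such a string is Python's standard json module with compact separators, as sketched below. The helper name to_param_json is hypothetical; the parameter keys are the ones used in the examples above.

```python
import json


def to_param_json(params: dict) -> str:
    """Serialize to single-line JSON with no spaces after ',' or ':'."""
    return json.dumps(params, separators=(",", ":"))


builder_params = to_param_json({
    "proxima.hnsw.builder.efconstruction": 400,
    "proxima.hnsw.builder.max_neighbor_count": 100,
})
searcher_params = to_param_json({"proxima.hnsw.searcher.ef": 400})
print(builder_params)
print(searcher_params)
```

Compact separators remove only the whitespace that the serializer would add. A string value that itself contains spaces is emitted unchanged.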

converter

The name of the IndexConverter module. IndexConverter is a module that is used by Proxima 2.x to convert characteristic vectors. For example, you can perform half-float conversion and INT8 quantization on characteristic vectors. The IndexConverter module can be separately used or used with other modules in the search process. For more information, see IndexConverter.

No default value

converter_params

The parameters that you specify for the IndexConverter module. You must specify a single-line JSON string for this parameter. Double quotation marks (") do not need to be escaped. Spaces are not allowed. For example, you can specify MIPS converter parameters. Sample configuration: {"proxima.mips.converter.m_value":4,"proxima.mips.converter.u_value":0.38196601,"proxima.mips.converter.forced_half_float":false,"proxima.mips.converter.spherical_injection":false}. For more information, see IndexConverter.

No default value

distance_method

The formula for calculating the characteristic vector distance. Valid values:

  • squared_euclidean

  • euclidean

  • mips_squared_euclidean

  • inner_product

  • hamming (used for data of the BINARY type)

  • manhattan (L1 distance)

  • chebyshev

  • canberra

  • geo_distance

  • rogers_tanimoto (used for data of the BINARY type)

  • russell_rao (used for data of the BINARY type)

  • matching (used for data of the BINARY type)

squared_euclidean

measure_params

The parameters that you specify for distance_method. You must specify a single-line JSON string for measure_params. Double quotation marks (") do not need to be escaped. Spaces are not allowed. For example, if you set distance_method to mips_squared_euclidean, you can specify {"proxima.mips_euclidean.measure.injection_type":0}. For more information, see IndexMeasure.

No default value

column_num

The number of columns for the IndexBuilder module. Default value: 0.

  • Automatic configuration: The system calculates the number of columns based on the amount of data in the table specified by doc_table and the data type specified by data_type. If the total amount of data is less than 50 GB, 2 GB of data is allocated to each column. If the total amount of data is greater than 50 GB, 2.5 GB of data is allocated to each column.

  • Manual configuration: The automatic configuration method is used in most cases. You can also modify the configuration based on cluster resources.

This parameter is valid only if you set both column_num and row_num to positive values.

0

row_num

The number of rows for the IndexSearcher module. Default value: 0.

  • Automatic configuration: The system calculates the number of rows based on the amount of data in the table specified by doc_table and the data type specified by data_type. If the total number of queries is less than 100 million, 2 million queries are allocated to each row. If the total number of queries is greater than 100 million, 10 million queries are allocated to each row.

  • Manual configuration: The automatic configuration method is used in most cases. You can also modify the configuration based on cluster resources.

This parameter is valid only if you set both column_num and row_num to positive values.

0
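
The automatic configuration rules for column_num and row_num can be approximated as follows. This Python sketch is an estimate only: the function names are hypothetical, and the rounding up, the treatment of the exact 50 GB and 100 million boundaries, and the use of 1 GB = 2^30 bytes are assumptions that this topic does not specify.

```python
import math

GB = 1024 ** 3  # assumption: binary gigabytes


def estimate_column_num(doc_bytes: int) -> int:
    """Estimate the automatic column count from the doc table data size."""
    per_column = 2.0 * GB if doc_bytes < 50 * GB else 2.5 * GB
    return max(1, math.ceil(doc_bytes / per_column))


def estimate_row_num(query_count: int) -> int:
    """Estimate the automatic row count from the number of queries."""
    per_row = 2_000_000 if query_count < 100_000_000 else 10_000_000
    return max(1, math.ceil(query_count / per_row))


print(estimate_column_num(10 * GB), estimate_row_num(10_000_000))
print(estimate_column_num(100 * GB), estimate_row_num(200_000_000))
```

An estimate like this can serve as a starting point when you set both column_num and row_num manually based on cluster resources.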

category_threshold

The threshold for triggering large-category searches in multi-category search scenarios. If the number of documents in a category exceeds the specified threshold, the system performs large-category search for this category. Otherwise, the system performs small-category search for this category. By default, the linear search method is used for small-category search and data in multiple small categories is merged for search.

1000000

category_col_num

The number of columns for which indexes are created for small categories when you query data by category. A small category has fewer than 1 million documents. For more information, see column_num.

0

category_row_num

The number of rows for which indexes are created for small categories when you query data by category. A small category has fewer than 1 million documents. For more information, see row_num.

0

category_thread_num

The concurrency of tasks that are used to perform large-category search when you query data by category. A large category has more than 1 million documents. The task concurrency indicates the size of a thread pool.

10

query_multi_label

Specifies whether a query involves multiple categories. If you set this parameter to true, the doc table must contain the category column. For more information, see Multi-category search.

false

threshold_score

The score threshold for filtering search results. The direction of the filter depends on the distance method. For scores that are calculated by using the inner product or MIPS squared Euclidean distance, a high score indicates high similarity, and results whose scores are lower than the threshold are filtered out. For all other distance methods, a low score indicates high similarity, and results whose scores exceed the threshold are filtered out.

No default value
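
The direction of the threshold_score filter can be expressed as a small predicate. The following Python sketch uses a hypothetical function, passes_threshold, and keeps results whose scores equal the threshold, which follows from the wording above ("exceeds" and "is lower than") but is not stated explicitly.

```python
# Distance methods for which a higher score means higher similarity,
# according to the threshold_score description.
HIGHER_IS_BETTER = {"inner_product", "mips_squared_euclidean"}


def passes_threshold(score: float, threshold: float,
                     distance_method: str = "squared_euclidean") -> bool:
    """Return True if a result with this score is kept."""
    if distance_method in HIGHER_IS_BETTER:
        return score >= threshold   # scores lower than the threshold are dropped
    return score <= threshold       # scores that exceed the threshold are dropped


print(passes_threshold(0.2, 0.5))                    # kept: small distance
print(passes_threshold(0.2, 0.5, "inner_product"))   # dropped: low similarity
```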

tunnel_endpoint

The MaxCompute Tunnel endpoint. By default, this parameter is left empty. You can specify this parameter to configure a valid MaxCompute Tunnel endpoint. This prevents a failure to establish a download session when you access a table across networks. For more information, see MaxCompute Tunnel Endpoint issue.

No default value

memory_load

Specifies whether an index is loaded into the memory in the seek process. By default, this parameter is set to true, indicating that the index is loaded into the memory. If memory resources of the cluster are insufficient, you can set this parameter to false.

true

sharding_mode

Specifies how to perform index sharding. Valid values: hash and cluster. If you set this parameter to hash, index sharding is performed by using the modulo hash feature. If you set this parameter to cluster, index sharding is performed by using k-means clustering. This sharding method can help reduce the amount of data to be computed in the seek process.

hash

kmeans_resource_name

The name of a centroid for k-means clustering. This parameter is valid if sharding_mode is set to cluster. The cluster performs k-means clustering for the original data by starting a graph computing task in MaxCompute.

kmeans_resource_name

kmeans_sample_ratio

The sample rate of a centroid for k-means clustering. This parameter is valid if sharding_mode is set to cluster. Valid values: 0-1.

0.05

kmeans_seek_ratio

The filtering rate of a centroid for k-means clustering. This parameter is valid if sharding_mode is set to cluster. Valid values: 0-1.

0.1

kmeans_iter_num

The number of iterations for k-means clustering. This parameter is valid if sharding_mode is set to cluster.

30

kmeans_cluster_num

The number of centroids for k-means clustering. This parameter is valid if sharding_mode is set to cluster.

1000

kmeans_init_center_method

Specifies how to initialize a centroid for k-means clustering. This parameter is valid if sharding_mode is set to cluster.

""

kmeans_worker_num

The number of workers for k-means clustering. This parameter is valid if sharding_mode is set to cluster.

0

mapper_split_size

The amount of data that an internal mapper can process. This parameter exposes the mapper.split.size option. Unit: MB. If you do not specify this parameter, the default size (256 MB) of MaxCompute MapReduce is used.

256

odps_task_priority

The priority of a Proxima CE task. You can configure priorities for Proxima CE tasks in all MaxCompute jobs, such as SQL jobs, MapReduce jobs, and Graph jobs. Valid values: [0-9]. A small value indicates a high priority. If you set this parameter to -1, the baseline priority of MaxCompute is used as the task priority.

-1

oss_access_id

The AccessKey ID of your Alibaba Cloud account or a RAM user of your Alibaba Cloud account. You can obtain the AccessKey ID from the AccessKey Pair page.

No default value

oss_access_key

The AccessKey secret that corresponds to the AccessKey ID.

You can obtain the AccessKey secret from the AccessKey Pair page.

No default value

oss_endpoint

The endpoint of the MaxCompute project.

The parameter value varies based on the region and network connection method that you selected when you created the MaxCompute project. For more information about the endpoints that correspond to different regions and network connection methods, see Endpoints.

No default value

oss_bucket

The name of the Object Storage Service (OSS) bucket. For more information about how to view the names of OSS buckets, see List buckets.

No default value