This topic describes the required and optional parameters that are used when you run a task in Proxima CE.
Required parameters
Parameter | Description |
doc_table | The name of the doc table, which is a MaxCompute table. You must prepare the doc table and use it as a candidate dataset for search. Important: The table name cannot contain periods (.). |
doc_table_partition | The name of the partition in the doc table. |
query_table | The name of the query table, which is a MaxCompute table. You must prepare the query table and use it as a dataset for search. Important: The table name cannot contain periods (.). |
query_table_partition | The name of the partition in the query table. |
output_table | The name of the output table that is used to store search results. The output table is automatically generated, so you do not need to create it. You only need to specify the table name. |
output_table_partition | The name of the partition in the output table. |
data_type | The data type of the doc and query tables. The |
dimension | The dimension of characteristic vectors. If the |
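The required parameters above can be checked before a task is submitted. The following is a minimal sketch in Python, assuming the parameters are collected in a plain dictionary. The helper name `check_required` is hypothetical and is not part of Proxima CE. It verifies only what the table states: all eight parameters are present, and the doc and query table names contain no periods.

```python
# Names of the required parameters, taken from the table above.
REQUIRED = (
    "doc_table", "doc_table_partition",
    "query_table", "query_table_partition",
    "output_table", "output_table_partition",
    "data_type", "dimension",
)


def check_required(params):
    """Return a list of problems found in a dict of task parameters.

    Hypothetical helper for illustration only; Proxima CE performs its
    own validation when the task runs.
    """
    problems = []
    # Every required parameter must be present and non-empty.
    for name in REQUIRED:
        if name not in params or params[name] in ("", None):
            problems.append(f"missing required parameter: {name}")
    # The doc and query table names cannot contain periods.
    for name in ("doc_table", "query_table"):
        value = params.get(name)
        if isinstance(value, str) and "." in value:
            problems.append(f"{name} must not contain periods: {value!r}")
    return problems
```

An empty list means that no problem was found. The table names and partition values that you pass in are your own. The sketch does not check that the tables exist in MaxCompute.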
Optional parameters
Parameter | Description | Default value |
h (-help) | The help information. | No default value |
topk | The number of similar results to retrieve. You can specify a comma-delimited list of values such as | 200 |
pk_type | The data type of the | string |
vector_separator | The vector separator. You can specify special characters other than tildes (~) as separators. Spaces are supported. If you want to use spaces as separators, set this parameter to | ~ |
binary_to_int | Specifies whether to convert data of the BINARY type into the INT32 type. This parameter is valid only for data of the BINARY type. If you specify this parameter, the | false |
job_mode | The job mode. Valid values:
| train:build:seek |
clean_build_volume | Specifies whether to delete an index after the build and seek processes are complete. After an index is created in the build process, the index is written to an external volume in MaxCompute and then loaded in the seek process. After the seek process is complete, the index is automatically deleted. Note: If you set this parameter to true, the index is also deleted when the task fails. | true |
algo_model | The index creation algorithm. The following algorithms are supported in the Proxima 2.x kernel:
| hnsw |
builder_params | The parameters that you specify for the IndexBuilder module. By default, this parameter is left empty. The index type of the parameters that you specify must correspond to the algorithm specified by | No default value |
searcher_params | The parameters that you specify for the IndexSearcher module. By default, this parameter is left empty. The index type of the parameters that you specify must correspond to the algorithm specified by | No default value |
converter | The name of the IndexConverter module. IndexConverter is a module that is used by Proxima 2.x to convert characteristic vectors. For example, you can perform half-float conversion and INT8 quantization on characteristic vectors. The IndexConverter module can be separately used or used with other modules in the search process. For more information, see IndexConverter. | No default value |
converter_params | The parameters that you specify for the IndexConverter module. You must specify a single-line JSON string for this parameter. Double quotation marks (") do not need to be escaped. Spaces are not allowed. For example, you can specify | No default value |
distance_method | The formula for calculating the characteristic vector distance. Valid values:
| squared_euclidean |
measure_params | The parameters that you specify for distance_method. You must specify a single-line JSON string for measure_params. Double quotation marks (") do not need to be escaped. Spaces are not allowed. For example, if you set distance_method to | No default value |
column_num | The number of columns for the IndexBuilder module. Default value: 0.
This parameter is valid only if you set both | 0 |
row_num | The number of rows for the IndexSearcher module. Default value: 0.
This parameter is valid only if you set both | 0 |
category_threshold | The threshold for triggering large-category searches in multi-category search scenarios. If the number of documents in a category exceeds the specified threshold, the system performs large-category search for this category. Otherwise, the system performs small-category search for this category. By default, the linear search method is used for small-category search and data in multiple small categories is merged for search. | 1000000 |
category_col_num | The number of columns for which indexes are created for small categories when you query data by category. A small category has fewer than 1 million documents. For more information, see | 0 |
category_row_num | The number of rows for which indexes are created for small categories when you query data by category. A small category has fewer than 1 million documents. For more information, see | 0 |
category_thread_num | The concurrency of tasks that are used to perform large-category search when you query data by category. A large category has more than 1 million documents. The task concurrency indicates the size of a thread pool. | 10 |
query_multi_label | Specifies whether a query involves multiple categories. If you set this parameter to | false |
threshold_score | The score threshold for filtering out search results. For the similarity | No default value |
tunnel_endpoint | The MaxCompute Tunnel endpoint. By default, this parameter is left empty. You can specify this parameter to configure a valid MaxCompute Tunnel endpoint. This prevents a failure to establish a download session when you access a table across networks. For more information, see MaxCompute Tunnel Endpoint issue. | No default value |
memory_load | Specifies whether an index is loaded into the memory in the seek process. By default, this parameter is set to true, indicating that the index is loaded into the memory. If memory resources of the cluster are insufficient, you can set this parameter to false. | true |
sharding_mode | Specifies how to perform index sharding. Valid values: | hash |
kmeans_resource_name | The name of a centroid for k-means clustering. This parameter is valid if sharding_mode is set to | kmeans_resource_name |
kmeans_sample_ratio | The sample rate of a centroid for k-means clustering. This parameter is valid if sharding_mode is set to | 0.05 |
kmeans_seek_ratio | The filtering rate of a centroid for k-means clustering. This parameter is valid if sharding_mode is set to | 0.1 |
kmeans_iter_num | The number of iterations for k-means clustering. This parameter is valid if sharding_mode is set to | 30 |
kmeans_cluster_num | The number of centroids for k-means clustering. This parameter is valid if sharding_mode is set to | 1000 |
kmeans_init_center_method | Specifies how to initialize a centroid for k-means clustering. This parameter is valid if sharding_mode is set to | "" |
kmeans_worker_num | The number of workers for k-means clustering. This parameter is valid if sharding_mode is set to | 0 |
mapper_split_size | The amount of data that an internal mapper can process. This parameter is used to expose | 256 |
odps_task_priority | The priority of a Proxima CE task. You can configure priorities for Proxima CE tasks in all MaxCompute jobs, such as SQL jobs, MapReduce jobs, and Graph jobs. Valid values: 0 to 9. A smaller value indicates a higher priority. If you set this parameter to -1, the baseline priority of MaxCompute is used as the task priority. | -1 |
oss_access_id | The AccessKey ID of your Alibaba Cloud account or of a RAM user that belongs to the account. You can obtain the AccessKey ID from the AccessKey Pair page. | No default value |
oss_access_key | The AccessKey secret that corresponds to the AccessKey ID. You can obtain the AccessKey secret from the AccessKey Pair page. | No default value |
oss_endpoint | The endpoint of the MaxCompute project. The parameter value varies based on the region and network connection method that you selected when you create the MaxCompute project. For more information about the endpoints that correspond to different regions and network connection methods, see Endpoints. | No default value |
oss_bucket | The name of the Object Storage Service (OSS) bucket. For more information about how to view the names of OSS buckets, see List buckets. | No default value |
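The descriptions of converter_params and measure_params require a single-line JSON string that contains no spaces, and odps_task_priority accepts -1 or a value from 0 to 9. The following is a minimal sketch in Python of how such values could be prepared and checked. The helper names `compact_json` and `check_priority` and the keys in the sample dictionary are hypothetical and are not part of Proxima CE.

```python
import json


def compact_json(value):
    """Serialize a value as single-line JSON without spaces.

    json.dumps puts a space after ',' and ':' by default; passing
    explicit separators removes them. Spaces inside string values
    would remain, so they are rejected here.
    """
    text = json.dumps(value, separators=(",", ":"))
    if " " in text:
        raise ValueError("serialized value still contains a space")
    return text


def check_priority(priority):
    """Return True if priority is -1 (baseline) or an integer from 0 to 9."""
    return priority == -1 or 0 <= priority <= 9


# The keys below are placeholders, not real Proxima CE parameter names.
sample = compact_json({"example_key": 1, "other_key": [1, 2]})
```

The double quotation marks in the resulting string are left unescaped, which matches the requirement stated for both parameters. Whether additional quoting is needed depends on how you pass the value to the task, and this sketch does not cover that.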