This topic provides answers to some frequently asked questions about Proxima CE.
What resources are used by Proxima CE?
The resources in the MaxCompute project to which your account belongs are used by Proxima CE.
Can vectors in an input table be of the BINARY type supported in MaxCompute?
No, vectors in an input table cannot be of the BINARY type. Proxima CE creates indexes from the vector columns in the doc table, and by default these columns support only the STRING type; the BINARY type is not supported. However, Proxima CE provides the -binary_to_int command-line parameter, which specifies whether to convert data of the BINARY type into the INT type. For example, if commas (,) are used as delimiters of the input data, the parameter takes effect in the following ways:
If -binary_to_int is set to false, the input data can be 1,1,1,1,1,1,…
If -binary_to_int is set to true, the input data can be 12345,13423,13325,…
The data type conversion feature packs every 32 binary values (0 or 1) into one 32-bit integer. This way, the sizes of the created indexes are reduced.
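The packing idea can be illustrated with a short Python sketch. This is only an illustration of the concept; the function name is hypothetical, and the actual conversion is performed inside Proxima CE by the -binary_to_int parameter.

```python
def pack_binary_vector(bits):
    """Pack a list of 0/1 values into 32-bit integers, 32 bits per integer.

    This illustrates the idea behind -binary_to_int: every 32 binary
    values collapse into one integer, so the vector becomes 1/32 as long.
    """
    if len(bits) % 32 != 0:
        raise ValueError("vector length must be a multiple of 32")
    packed = []
    for i in range(0, len(bits), 32):
        value = 0
        for bit in bits[i:i + 32]:
            value = (value << 1) | bit
        packed.append(value)
    return packed

# A 64-dimensional binary vector becomes 2 integers.
vector = [1, 0] * 32
print(pack_binary_vector(vector))
```

A 512-dimensional binary vector, for example, would shrink to 16 integers under this scheme.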
How do I specify the -column_num and -row_num parameters?
Proxima CE is a distributed engine that works with MaxCompute MapReduce to process large amounts of vector data in offline mode. In the build process, the system divides the doc table into columns and creates an index for each column. In the seek process, the system divides the query table into rows and searches each row. Together, these two processes allow vector search over large amounts of data.
For users:
Columns affect the build process. If more columns are created, the index of each column is smaller and a single column is searched faster. This accelerates the build process but consumes more cluster resources.
Rows affect the seek process. If more rows are created, each row contains fewer queries and a single row is searched faster. This accelerates the seek process but consumes more cluster resources.
Take note of the following limits on row and column division:
Default limits on cluster resource usage. Contact the owner of the MaxCompute project to which your account belongs to learn about these limits.
Limits on MapReduce instances. In MaxCompute MapReduce, a maximum of 99,999 instances can be used to run reduce tasks. In the build process, the number of instances is specified by the -column_num parameter. In the seek process, the number of instances is calculated as column_num × row_num. You must make sure that this product is less than 99,999.
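As a sanity check before you submit a task, the instance limit can be verified with a few lines of Python. The helper below is hypothetical, not part of Proxima CE; it simply encodes the constraint described above.

```python
MAX_REDUCE_INSTANCES = 99999  # MaxCompute MapReduce limit on reduce instances

def check_instance_limit(column_num, row_num):
    """Return True if the seek process stays within the instance limit.

    The build process uses column_num instances; the seek process uses
    column_num * row_num instances, which must be less than 99,999.
    """
    return column_num * row_num < MAX_REDUCE_INSTANCES

print(check_instance_limit(column_num=200, row_num=100))  # 20,000 instances
print(check_instance_limit(column_num=500, row_num=300))  # 150,000 instances
```

The second configuration exceeds the limit and would need fewer rows or columns.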
Given this architecture and these limits, we recommend that you use the row and column configurations that Proxima CE automatically calculates based on the input parameters. For more information, see Multi-category search.
The number of rows and columns calculated by Proxima CE is sufficient for normal operation. If you need to accelerate queries, you can increase the numbers. If cluster resources are insufficient, you can decrease them. Make this decision based on the actual situation of your MaxCompute project and the principle described in How do I accelerate the execution of a task?.
How do I accelerate the execution of a task?
In multi-category scenarios, a task involves two types of categories: small categories, which each contain less than 1 million documents (the default threshold, which is configurable), and large categories, which each contain more than 1 million documents. For all small categories, the system uses the linear search method. To accelerate queries from small categories, configure the -category_row_num and -category_col_num parameters. To accelerate queries from large categories, configure the -row_num and -column_num parameters. In both cases, the general principle for acceleration is to increase the number of rows and columns: more columns mean smaller per-column indexes and faster single-column search, and more rows mean fewer queries per row and faster search of each batch. However, more resources are consumed. Decide whether to increase these numbers based on an analysis of your business and resources. For more information, see Multi-category search.
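The small/large split described above can be sketched as follows. The helper and category names are hypothetical; 1,000,000 is the default, configurable threshold, and this sketch treats categories at exactly the threshold as large.

```python
DEFAULT_THRESHOLD = 1_000_000  # default document-count threshold (configurable)

def split_categories(doc_counts, threshold=DEFAULT_THRESHOLD):
    """Split categories into small and large by document count.

    Small categories (below the threshold) are searched linearly and
    tuned with -category_row_num / -category_col_num; large categories
    are tuned with -row_num / -column_num.
    """
    small = {c: n for c, n in doc_counts.items() if n < threshold}
    large = {c: n for c, n in doc_counts.items() if n >= threshold}
    return small, large

counts = {"books": 120_000, "electronics": 3_500_000, "toys": 880_000}
small, large = split_categories(counts)
print(sorted(small))  # ['books', 'toys']
print(sorted(large))  # ['electronics']
```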
In non-multi-category scenarios, you can configure the -row_num and -column_num parameters to improve the overall task concurrency.
Why is the number of query results less than the specified number?
If you have confirmed that the input data is valid and the system runs as expected, the issue may lie in the data of the doc table. By default, Proxima CE uses the Hierarchical Navigable Small World (HNSW) graph algorithm to create indexes. In this case, vertices in the graph may not be fully connected. As a result, the number of retrieved results is less than the specified number. Solutions:
Decrease the recall rate. This method cannot completely resolve the issue. If the graph connectivity issue is caused by the underlying algorithm, the number of retrieved results cannot reach the specified number (200 in this example) regardless of the recall rate. Decreasing the recall rate may also affect the retrieval of other vectors. Evaluate the impact before you use this method.
Change the index creation algorithm. You can configure the -algo_model command-line parameter to specify the index creation algorithm. For example, you can use Hierarchical Clustering (HC) instead of HNSW.
Complement top K results. For versions later than Proxima 2.4, if you use the HNSW graph algorithm, you can add the configuration {"proxima.hnsw.searcher.force_padding_result_enable" : True} to pad the search results up to the specified top K number. This method can negatively affect similarity search in extreme cases. Evaluate the impact based on your actual business before you use this method.
Can I configure the cosine distance for Proxima CE?
Proxima CE supports the cosine distance and optimizes the inner-product search. For more information, see Inner product and cosine distance.
Why is the execution of the submitted Proxima CE task slow?
Proxima CE tasks run as MaxCompute MapReduce jobs. If a task compiles and runs normally, the issue may lie in MaxCompute scheduling and resources. You can join the MaxCompute developer community DingTalk group (ID: 11782920) to contact MaxCompute technical support engineers for assistance.
Why does a temporary table fail to be created?
If a temporary table fails to be created, the error message invalid table name: xxx.yyy appears in most cases. This is because the name of the output table is invalid.
The names of the input and output tables of Proxima CE tasks must comply with the MaxCompute naming conventions. The names cannot contain periods (.), which are considered special characters in MaxCompute. If a name contains periods, subsequent processes fail. In most cases, this issue occurs because the output table is named in the xxx.output_table_name format.
Can multiple tasks be executed at the same time?
In Proxima CE, you cannot execute tasks that involve the same doc table at the same time. If you do, indexes may be overwritten. For example, if Task A and Task B run at the same time, the index created for Task A overwrites the index of Task B. As a result, various issues occur. Common issues:
Errors related to the underlying OSS volume file system are reported.
The build process fails, and errors related to Java Native Interface (JNI)-based index writes are reported.
The seek process fails, and errors related to JNI-based index loading are reported.
Why do my offline tasks affect online tasks?
The common cause is that offline and online tasks run in the same cluster. In most cases, different types of MaxCompute jobs run in a cluster in hybrid mode, and the underlying layer of the cluster is shared by offline and online tasks. Offline tasks may therefore occupy large amounts of cluster resources, leaving online tasks without sufficient resources, so the online tasks run slowly or even fail. Solutions:
Restrict the resource usage of offline tasks. You can limit the concurrency of offline tasks by specifying the number of rows and columns of Proxima CE offline tasks. You can also impose limits on your MaxCompute project in the MaxCompute console. These operations slow down the execution of offline tasks.
Run online and offline tasks during different time periods, or apply for additional resources.
What are run logs and Logview? What are the differences between them?
Run logs
Technical support engineers usually use run logs that contain the information generated after a DataWorks node is run. You can copy the run log information and send it to technical support engineers for troubleshooting. If you use the MaxCompute client odpscmd to run scripts, run logs are also generated on the MaxCompute client. You can copy all logs on the client and send them to technical support engineers. You can also redirect the logs to a log file and send the file to technical support engineers.

Logview
Logview is a tool that you can use to view and debug MaxCompute jobs after you submit the jobs. If you use Proxima CE, you can use Logview to visualize log status information when related jobs such as SQL jobs and MapReduce jobs are run. This way, you can view the status of jobs such as SQL jobs, MapReduce jobs, and Graph jobs that are run on MaxCompute and identify the issues that occur during debugging. For more information, see Use LogView to view job information.
What do I do if "ERROR: KILLED" is logged for a task?
The following or similar information is logged.
A task may be killed due to the following reasons:
The task runs for a long period of time. If an SQL task runs in MaxCompute for more than 24 hours, it is automatically killed. You can set odps.sql.job.max.time.hours to 72 to allow the task to run for up to 72 hours:
set odps.sql.job.max.time.hours=72;
The cluster is overloaded, and the resources of the task are preempted for a long time. As a result, the task is killed.
The task is manually killed. You can check whether the task is killed by the project owner or other users assigned with the project administrator role.
In most cases, if a task is killed, you can rerun it. However, if the cluster is overloaded, we recommend that you use one of the following methods:
Prioritize the task by configuring the -odps_task_priority parameter. For more information, see Optional parameters.
Important: This method is risky. If you use this method, the task may preempt resources that are allocated to other high-priority jobs. Contact the project owner or the relevant personnel to confirm that no high-priority online or offline tasks run in the cluster of the project.
Wait until the resource usage of high-priority tasks decreases. Then, rerun the Proxima CE task.
Why does the task priority specified by -odps_task_priority not take effect?
If you configure a baseline priority for your project, the priority specified by the -odps_task_priority parameter becomes ineffective when it is higher than the baseline priority. For more information about baseline management, see Manage baselines.