This topic provides answers to some frequently asked questions about issues that occur when a task fails to run.
What do I do if a task fails to run due to duplicate primary keys in the seek phase?
Error message

Solution
Open the Logview of the MaxCompute task and view the value in the StdOut column. For more information about how to use Logview, see Use LogView to view job information.

If the number of input records of the instance shown in the preceding figure is much greater than that of the other instances, check whether the data contains vectors that have the same primary key but different values. If it does, the data in the database is not deduplicated. In the seek phase, when data flows from Reducer1 to Reducer2 in the Mapper-Reducer1-Reducer2 process, records that have the same primary key but different vector values are processed by the same reducer. As a result, data skew occurs and the reduce task may be unexpectedly terminated.
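You can run a query similar to the following sketch to check for primary keys that map to multiple vector values. The table name doc_table and the column names pk and vector are assumptions for illustration; replace them with the actual names in your table.
-- Find primary keys that are associated with more than one distinct vector value.
-- doc_table, pk, and vector are placeholder names.
SELECT pk, COUNT(DISTINCT vector) AS value_cnt
FROM doc_table
GROUP BY pk
HAVING COUNT(DISTINCT vector) > 1;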
What do I do if the result of GetSmallCategoryDocNum is empty in the search preparation phase for small categories?
Error message

Solution
The main cause of this issue is that the value of a specific field, such as category or pk, in the table is empty. We recommend that you use SQL statements to delete the records that contain the empty fields.
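For example, the following sketch overwrites a partition with only the rows whose pk and category fields are not empty. The table name doc_table, the partition value, and the filter conditions are assumptions for illustration; adjust them to your schema.
-- Keep only the rows whose pk and category fields are not empty (placeholder names and values).
-- With dynamic partitioning, the trailing pt column of SELECT * maps to the partition.
INSERT OVERWRITE TABLE doc_table PARTITION (pt)
SELECT *
FROM doc_table
WHERE pt = '20210707'
AND pk IS NOT NULL AND pk <> ''
AND category IS NOT NULL;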
What do I do if schema validation fails in the search phase for small categories?
Error message
2020-07-28 16:58:15.221 [main] INFO p.a.p.p.ProximaCEPipelineExecutor - [] - execute SmallCategorySeek worker start .......... [400] com.aliyun.odps.OdpsException: ODPS-0420031: Invalid xml in HTTP request body - The request body is malformed or the server version doesn't match this sdk/client. XML Schema validation failed: Element 'Value': [facet 'maxLength'] The value has a length of '4952955'; this exceeds the allowed maximum length of '2097152'. Element 'Value': '{"MaxCompute.pipeline.0.output.key.schema":"instId:BIGINT,type:BIGINT,pk:STRING,category:BIGINT","MaxCompute.pipeline.0.output.value.schema":"v
Solution
The doc table contains a large number of categories. We recommend that you split the doc table into multiple tables based on the categories.
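For example, you can use statements similar to the following sketch to split the doc table into smaller tables by category range. The table names, the partition value, and the category boundary are assumptions for illustration.
-- Split the doc table by category range into two smaller tables (placeholder names and boundary).
CREATE TABLE doc_table_part1 LIKE doc_table;
CREATE TABLE doc_table_part2 LIKE doc_table;
INSERT OVERWRITE TABLE doc_table_part1 PARTITION (pt)
SELECT * FROM doc_table WHERE pt = '20210707' AND category <= 5000;
INSERT OVERWRITE TABLE doc_table_part2 PARTITION (pt)
SELECT * FROM doc_table WHERE pt = '20210707' AND category > 5000;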
What do I do if the error message "-nan" appears during the parsing of table data?
Error message

Solution
In most cases, this issue occurs because the format of an input value in the original doc or query table is invalid. The table may contain a large value or a value that is close to 0. For example, if the value of a vector in a row is 1.23~4.56~7.89~nan~4.21 or 1.1~2.2~127197893781729178311928739179222121.23128767846816278193456789087654~0.000000000000000000000000000000000000000001~5.5, an overflow or a division by zero error occurs during calculation. To resolve this issue, use SQL user-defined functions (UDFs) of MaxCompute to find invalid values in the doc and query tables.
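Instead of writing a UDF, you can also start with a plain SQL check that uses built-in string functions, as in the following sketch. It assumes that the vector is stored as a delimiter-separated string in a column named vector of a table named doc_table; adjust the names, the separator, and the patterns to your data.
-- Flag rows whose vector string contains nan or inf, or an abnormally long number that may overflow.
-- doc_table, pk, and vector are placeholder names.
SELECT pk, vector
FROM doc_table
WHERE TOLOWER(vector) RLIKE 'nan|inf'
OR vector RLIKE '[0-9]{20,}';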
What do I do if a JNI call exception occurs when the number of documents in a category is 0 in the doc table but is not 0 in the query table in a multi-category search scenario?
Error message


Solution
The doc table is an input table, and an error caused by an input table is considered a user input issue. When the system detects a user input issue, it reports the error and terminates the task instead of ignoring the issue. This mechanism helps prevent the scenario in which no recall result of a desired category is returned. If the error were ignored, the underlying problem would remain hidden. A recall operation on a category that contains no documents is meaningless. To resolve this issue, remove the category from the doc or query table.
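The following sketch lists the categories that appear in the query table but have no documents in the doc table. The table names query_table and doc_table and the column name category are assumptions for illustration.
-- List categories that exist in the query table but have no documents in the doc table.
-- query_table, doc_table, and category are placeholder names.
SELECT DISTINCT q.category
FROM query_table q
LEFT OUTER JOIN doc_table d
ON q.category = d.category
WHERE d.category IS NULL;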
What do I do if access to a Tunnel endpoint of MaxCompute fails?
Error message
Failed to create DownloadSession ErrorCode=Local Error, ErrorMessage=Failed to create download session with tunnel endpoint
Solutions
In most cases, this issue occurs because Proxima CE fails to call an interface of MaxCompute Tunnel due to one of the following reasons:
Network issue: In this case, recreate a download session.
Cross-network access issue: The number of MaxCompute tables fails to be obtained because the configured Tunnel endpoint is invalid. For more information, see When I use the Tunnel Upload command to upload data, I configure the Tunnel endpoint of the classic network for my MaxCompute project but the project is connected to the public Tunnel endpoint. Why?. To resolve this issue, add the -tunnel_endpoint startup parameter to the code to specify a valid Tunnel endpoint and rerun the code.
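For example, you can append the parameter to the existing startup command, as in the following abbreviated sketch. The value xxx is a placeholder for the valid Tunnel endpoint of your project's region and network type, and the other parameters remain unchanged.
jar -resources proxima-ce-aliyun-1.0.0.jar -classpath xxx com.alibaba.proxima2.ce.ProximaCERunner -doc_table xxx -query_table xxx -output_table xxx ... -tunnel_endpoint xxx;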
What do I do if the error message "java.lang.NullPointerException" appears when a MapReduce task is running?
Error message
Error example 1: The error occurs due to setKey().
Error example 2: The error occurs due to VectorConvert.convert().
Error example 3: The error occurs due to getInputVolumeFileSystem().
Solutions
Error example 1 and error example 2: In most cases, these errors occur because a MapReduce task fails to read data from a column in the input table in the build or seek phase. Possible causes:
The column does not exist.
The value of a row in the column is null.
The value of a row in the column is invalid. As a result, a parsing error occurs.
If the task fails in the build phase, check whether the preceding issues exist in the doc table. If the task fails in the seek phase, check whether the preceding issues exist in the query table. You can determine whether the task fails in the build or seek phase based on the error message.
The MapReduce tasks that are generated in the Mapper-Reducer process run in the build phase. The output that is shown in Error example 2 is recorded in run logs.
MRR tasks that are generated in the Mapper-Reducer1-Reducer2 process run in the seek phase. The output that is shown in Error example 1 is recorded in run logs.
Error example 3: In most cases, this error occurs when a mapper task fails to obtain information about a volume and its volume partition of MaxCompute in the seek phase. Possible causes:
Multiple tasks that use the same doc table run at the same time. The output that is shown in Error example 3 is recorded in run logs. For Proxima CE of the current version, the volume partition that stores indexes is named based on the name of the doc table and its partition names. If multiple tasks that use the same doc table run at the same time, the index files in the shared volume may be overwritten or deleted. In this case, the volume of MaxCompute fails to be read and indexes may fail to be loaded. To ensure that each task runs as expected, use doc tables with different names for the tasks.
The row and column parameters are incorrectly configured. MaxCompute supports a maximum of 99,999 reducer instances. The number of instances of the MRR tasks in the seek phase is the product of column_num multiplied by row_num. If the -column_num and -row_num parameters are set to excessively large values, this number exceeds the upper limit. In this case, an uncontrollable error may occur and specific reducer instances in the seek phase cannot find the correct volume partition that stores indexes. To prevent this issue, specify valid values for the row and column parameters, as shown in the sample after this list. For more information, see Cluster sharding.
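As a quick check, multiply the two values before you submit the task. The following sample settings are for illustration only: the first combination exceeds the 99,999-instance limit, and the second stays within it.
-row_num 400 -column_num 300    (400 x 300 = 120,000 seek reducer instances, exceeds the limit)
-row_num 100 -column_num 100    (100 x 100 = 10,000 seek reducer instances, within the limit)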
What do I do if the partition key column pt cannot be found by calling the getPartitionColumn operation?
Error message

Solution
In most cases, this error occurs because the pt column is not found. The schemas of the doc and query tables are strictly enforced: the partition key column must be the pt column of the STRING data type. For more information, see Import input table data. Make sure that the pt column is specified as a partition key column. If the pt column is not specified as a partition key column, this issue occurs.
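For example, the doc table can be created with a DDL statement similar to the following sketch, in which pt is declared as a STRING partition key column. The table name and the non-partition columns are assumptions for illustration.
-- Sample DDL: pt must be declared as a STRING partition key column (other columns are placeholders).
CREATE TABLE IF NOT EXISTS doc_table (
    pk STRING,
    category BIGINT,
    vector STRING
)
PARTITIONED BY (pt STRING);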
What do I do if the error message "ShuffleServiceMode: Dump checkpoint failed" appears?
Error message
0010000:System internal error - fuxi job failed, caused by: ShuffleServiceMode: Dump checkpoint failed
Solution
In most cases, this issue occurs because the output size of a single mapper or reducer instance in the MapReduce process exceeds the 400 GB upper limit. This happens when a single instance processes an excessively large amount of data, or when the internal logic of the mapper or reducer task causes excessive data bloat. In this case, configure the -mapper_split_size parameter (unit: MB) to reduce the amount of data that a single mapper processes.
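For example, you can add the parameter to the startup command as shown in the following sketch. The value 256 is a sample value in MB, not a recommendation; tune it based on your data volume.
-mapper_split_size 256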
What do I do if the error message "FAILED: MaxCompute-0430071: Rejected by policy - rejected by system throttling rule" appears?
Error message
What do I do if the error message "FAILED: MaxCompute-0430071: Rejected by policy - rejected by system throttling rule" appears?Solution
This issue occurs due to the throttling of MaxCompute. A task execution request is rejected because the cluster to which the project belongs is running other tasks, such as stress testing tasks, or users are not allowed to run tasks for a specific period of time due to security reasons. To resolve this issue, rerun the task after a specific period of time. You can contact the project owner or the person in charge of the project cluster to determine the time to rerun the task.
What do I do if the error message "java.lang.ArrayIndexOutOfBoundException" appears due to a table reading failure?
Error message

Solutions
The mapper task GeneratePkMapper fails to obtain a valid array length. In most cases, an SQL task that runs before the mapper task reads table data to obtain the array length. This issue occurs due to a table reading failure. Possible causes:
Timeout: The network of the cluster to which the project belongs is unstable. In this case, retry the task several times or rerun the task after a specific period of time.

Tunnel exception: When Proxima CE reads a table, it obtains the table records by using Tunnel commands. Therefore, a Tunnel exception also causes a table reading failure. For more information about how to handle Tunnel errors, see What do I do if access to a Tunnel endpoint of MaxCompute fails?.
What do I do if a timeout error occurs during cluster sharding?
Error message
FAILED: ODPS-0010000:System internal error - Timeout when graph master wait all workers start java.io.IOException: Job failed! at com.aliyun.odps.graph.GraphJob.run(GraphJob.java:429) at com.alibaba.proxima2.ce.pipeline.odps.worker.KmeansGraphJobWorker.apply(KmeansGraphJobWorker.java:131) at com.alibaba.proxima2.ce.pipeline.odps.runner.OdpsPipelineRunner.run(OdpsPipelineRunner.java:31) at com.alibaba.proxima2.ce.pipeline.PipelineExecutor.execute(PipelineExecutor.java:27) at com.alibaba.proxima2.ce.ProximaCERunner.runWithNormal(ProximaCERunner.java:28) at com.alibaba.proxima2.ce.ProximaCERunner.main(ProximaCERunner.java:149)
Solution
When Proxima CE builds an index during cluster sharding, Proxima CE clusters data based on the Graph engine. By default, the GraphJob applies for a specific number of worker nodes when it runs. If the cluster resources are insufficient, this issue occurs. You can use the following setting to limit the number of worker nodes that can execute the GraphJob, which limits the resource request. Add the following parameter configuration before the startup command of Proxima CE.
set odps.graph.worker.num=400; -- In this example, 400 is a sample value. You can specify the value based on the cluster resource configuration.
Complete commands:
set odps.graph.worker.num=400; jar -resources kmeans_center_resource_cl,proxima-ce-aliyun-1.0.0.jar -classpath http://schedule@{env}inside.cheetah.alibaba-inc.com/scheduler/res?id=251678818 com.alibaba.proxima2.ce com.alibaba.proxima2.ce.ProximaCERunner -doc_table doc_table_pailitao -doc_table_partition 20210707 -query_table query_table_pailitao -query_table_partition 20210707 -output_table output_table_pailitao_cluster_2000w -output_table_partition 20210707 -data_type float -dimension 512 -oss_access_id xxx -oss_access_key xxx -oss_endpoint xxx -oss_bucket xxx -owner_id 123456 -vector_separator blank -pk_type int64 -row_num 10 -column_num 10 -job_mode build:seek:recall -topk 1,50,100,200 -sharding_mode cluster -kmeans_resource_name kmeans_center_resource_cl -kmeans_cluster_num 1000;