
MaxCompute: Run failures

Last Updated: Mar 26, 2026

This topic covers common task failures in Proxima CE on MaxCompute, with causes and solutions for each error.

Duplicate primary keys in the seek phase

Error message

Most common cause: The doc table contains vectors that share the same primary key but have different values. During the Mapper-Reducer1-Reducer2 (MRR) process in the seek phase, all records with the same primary key are routed to the same reducer. If the data is not deduplicated, one reducer processes far more records than others, causing data skew that terminates the reduce task.

To diagnose the issue:

  1. Open LogView for the failing MaxCompute task. For instructions, see Use LogView to view job information.

  2. Check the StdOut column.

  3. If the input record count for one instance is much higher than other instances, the data in your doc table has not been deduplicated.

Deduplicate the data to ensure each primary key maps to a single vector value.
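
A deduplication pass can be sketched in MaxCompute SQL. The table, partition, and column names below (doc_table, pt='20210707', pk, category, vector) are illustrative assumptions; adjust them to your schema:

```sql
-- Step 1: check whether any primary key occurs more than once.
SELECT pk, COUNT(*) AS cnt
FROM doc_table
WHERE pt = '20210707'
GROUP BY pk
HAVING COUNT(*) > 1;

-- Step 2: keep exactly one row per pk using ROW_NUMBER().
INSERT OVERWRITE TABLE doc_table PARTITION (pt = '20210707')
SELECT pk, category, vector
FROM (
  SELECT pk, category, vector,
         ROW_NUMBER() OVER (PARTITION BY pk ORDER BY pk) AS rn
  FROM doc_table
  WHERE pt = '20210707'
) t
WHERE rn = 1;
```

If duplicate keys carry genuinely different vectors, decide which row to keep by changing the ORDER BY inside ROW_NUMBER() rather than keeping an arbitrary one.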

GetSmallCategoryDocNum returns empty in the search preparation phase

Error message

A field — typically category or pk — contains an empty value in the table. Use SQL statements to delete the records where the field is empty.
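
Because MaxCompute tables are typically rewritten rather than deleted from row by row, one way to drop the offending records is an INSERT OVERWRITE that filters them out. A sketch, with illustrative table and column names (doc_table, pt, pk, category, vector):

```sql
-- Rewrite the partition, keeping only rows with non-empty pk and category.
INSERT OVERWRITE TABLE doc_table PARTITION (pt = '20210707')
SELECT pk, category, vector
FROM doc_table
WHERE pt = '20210707'
  AND pk IS NOT NULL AND pk <> ''
  AND category IS NOT NULL;
```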

Schema validation fails in the search phase for small categories

Error message

2020-07-28 16:58:15.221 [main] INFO  p.a.p.p.ProximaCEPipelineExecutor - [] - execute SmallCategorySeek worker start ..........
[400] com.aliyun.odps.OdpsException: ODPS-0420031: Invalid xml in HTTP request body - The request body is malformed or the server version doesn't match this sdk/client. XML Schema validation failed: Element 'Value': [facet 'maxLength'] The value has a length of '4952955'; this exceeds the allowed maximum length of '2097152'.
Element 'Value': '{"MaxCompute.pipeline.0.output.key.schema":"instId:BIGINT,type:BIGINT,pk:STRING,category:BIGINT","MaxCompute.pipeline.0.output.value.schema":"v

The schema payload exceeds the 2,097,152-character limit because the doc table contains too many categories. Split the doc table into multiple tables, one per category group, and run separate tasks for each.
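
Splitting by category can be sketched as one INSERT OVERWRITE per category group. The table names and category range below are illustrative:

```sql
-- Create a per-group table with the same schema as the original.
CREATE TABLE IF NOT EXISTS doc_table_group1 LIKE doc_table;

-- Copy one category range into it; repeat with other ranges
-- (and separate target tables) for the remaining groups.
INSERT OVERWRITE TABLE doc_table_group1 PARTITION (pt = '20210707')
SELECT pk, category, vector
FROM doc_table
WHERE pt = '20210707'
  AND category BETWEEN 0 AND 999;
```

Each resulting table then gets its own Proxima CE task, keeping the per-task schema payload under the limit.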

"-nan" appears during table data parsing

Error message

The doc or query table contains an invalid vector value — either a number too large to represent (overflow) or a value so close to zero that division fails. Examples of invalid values:

  • 1.23~4.56~7.89~nan~4.21

  • 1.1~2.2~127197893781729178311928739179222121.23128767846816278193456789087654~0.000000000000000000000000000000000000000001~5.5

Use MaxCompute SQL user-defined functions (UDFs) to scan both the doc and query tables for invalid values, then remove or correct them before rerunning.
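
As a first pass before writing a full UDF, a plain SQL filter can already surface rows whose vector string contains nan. The names below are illustrative; checking each ~-separated component for numeric overflow or underflow generally still requires a UDF, as described above:

```sql
-- Flag rows whose vector string contains 'nan' (case-insensitive).
SELECT pk, vector
FROM doc_table
WHERE pt = '20210707'
  AND INSTR(LOWER(vector), 'nan') > 0;
```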

JNI call exception in a multi-category search (doc table has 0 documents for a category)

Error message

The query table references a category that has zero documents in the doc table. Proxima CE treats this as a user input error and terminates the task rather than silently skipping the category — continuing would return no recall results for that category, creating a hidden bug.

Remove the empty category from the doc table or query table, then rerun.

Access to a MaxCompute Tunnel endpoint fails

Error message

Failed to create DownloadSession ErrorCode=Local Error, ErrorMessage=Failed to create download session with tunnel endpoint

Proxima CE calls the MaxCompute Tunnel interface to read table data. This error typically occurs when the configured Tunnel endpoint is incorrect for the project's region and network environment, or when the endpoint is temporarily unreachable because of network issues. Verify the Tunnel endpoint configuration and network connectivity, then rerun the task.

"java.lang.NullPointerException" in a MapReduce task

There are three distinct error patterns, each with a different cause.

Error example 1 (due to setKey())

Error example 2 (due to VectorConvert.convert())

Error example 3 (due to getInputVolumeFileSystem())

Error examples 1 and 2

Both errors occur when a MapReduce task cannot read a column in the input table. Possible causes:

  • The column does not exist in the table.

  • A row in the column has a null value.

  • A row in the column has an invalid value that causes a parsing error.

The error log tells you which phase failed:

  • Build phase (Mapper-Reducer process): check the doc table for the issues above.

  • Seek phase (Mapper-Reducer1-Reducer2 process): check the query table.

Error example 2 is recorded during the Mapper-Reducer (build) phase; error example 1 is recorded during the Mapper-Reducer1-Reducer2 (seek) phase.

Error example 3

This error occurs when a mapper task cannot read volume or volume partition information in the seek phase. Possible causes:

  • Concurrent tasks sharing the same doc table: Volume partitions are named based on the doc table name and partition names. If multiple tasks use the same doc table simultaneously, their index files may overwrite or delete each other, causing volume reads to fail and indexes to fail to load. Give each task a unique doc table name to prevent conflicts.

  • -column_num or -row_num set too high: The number of reducer instances in MRR tasks equals column_num x row_num. MaxCompute supports a maximum of 99,999 reducer instances. Exceeding this limit causes unpredictable errors where specific reducer instances cannot locate the correct volume partition. Set valid values for both parameters — see Cluster sharding for recommended values.

Partition key column pt not found by getPartitionColumn

Error message

The doc and query table schemas enforce a strict requirement: the partition key column must be named pt with the STRING data type. If pt is defined as a regular column instead of a partition key column, this error occurs.

Make sure pt is specified as a partition key column. For the required schema format, see Import input table data.
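
A conforming table definition can be sketched as follows; every column except pt is illustrative. The key point is that pt appears in the PARTITIONED BY clause, not in the regular column list:

```sql
-- pt must be a STRING partition key column.
CREATE TABLE IF NOT EXISTS doc_table (
  pk       STRING,
  category BIGINT,
  vector   STRING
)
PARTITIONED BY (pt STRING);
```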

"ShuffleServiceMode: Dump checkpoint failed"

Error message

0010000:System internal error - fuxi job failed, caused by: ShuffleServiceMode: Dump checkpoint failed

The output of a single mapper or reducer instance exceeded 400 GB. This happens when a single instance processes too much data, or when the internal logic of a mapper or reducer task causes excessive data bloat.

Set the -mapper_split_size parameter (in MB) to reduce the amount of data each mapper processes.
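
For example, the flag is appended to the existing Proxima CE startup command like any other parameter; 256 MB per mapper is an illustrative value to tune against your data volume:

```
-mapper_split_size 256
```

Smaller split sizes spread the input across more mapper instances, keeping each instance's output well under the 400 GB limit.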

"MaxCompute-0430071: Rejected by policy - rejected by system throttling rule"

Error message

FAILED: MaxCompute-0430071: Rejected by policy - rejected by system throttling rule

MaxCompute throttled the task execution request. This typically happens when the cluster is running other workloads (such as stress tests) or when task execution is restricted for a period due to security policy.

Wait and rerun the task. Contact the project owner or the cluster administrator to find out when the restriction lifts.

"java.lang.ArrayIndexOutOfBoundException" due to a table reading failure

Error message

The GeneratePkMapper task failed to read the table and could not get a valid array length. Possible causes:

  • Unstable cluster network (timeout): The read timed out. Retry the task a few times, or wait until the network stabilizes before rerunning.

  • Tunnel exception: Proxima CE reads tables via Tunnel commands, so a Tunnel error also causes this failure. See Access to a MaxCompute Tunnel endpoint fails for troubleshooting steps.

Timeout during cluster sharding

Error message

FAILED: ODPS-0010000:System internal error - Timeout when graph master wait all workers start
java.io.IOException: Job failed!
    at com.aliyun.odps.graph.GraphJob.run(GraphJob.java:429)
    at com.alibaba.proxima2.ce.pipeline.odps.worker.KmeansGraphJobWorker.apply(KmeansGraphJobWorker.java:131)
    at com.alibaba.proxima2.ce.pipeline.odps.runner.OdpsPipelineRunner.run(OdpsPipelineRunner.java:31)
    at com.alibaba.proxima2.ce.pipeline.PipelineExecutor.execute(PipelineExecutor.java:27)
    at com.alibaba.proxima2.ce.ProximaCERunner.runWithNormal(ProximaCERunner.java:28)
    at com.alibaba.proxima2.ce.ProximaCERunner.main(ProximaCERunner.java:149)

During cluster sharding, Proxima CE uses the Graph engine to cluster data. GraphJob requests a set number of worker nodes by default. If the cluster does not have enough available resources, the job times out waiting for workers to start.

Limit the number of worker nodes GraphJob can request by adding the following parameter before the Proxima CE startup command:

set odps.graph.worker.num=400; -- 400 is an example. Set this based on your cluster's available resources.

Complete command example:

set odps.graph.worker.num=400;
jar -resources kmeans_center_resource_cl,proxima-ce-aliyun-1.0.0.jar
-classpath http://schedule@{env}inside.cheetah.alibaba-inc.com/scheduler/res?id=251678818 com.alibaba.proxima2.ce.ProximaCERunner
-doc_table doc_table_pailitao
-doc_table_partition 20210707
-query_table query_table_pailitao
-query_table_partition 20210707
-output_table output_table_pailitao_cluster_2000w
-output_table_partition 20210707
-data_type float
-dimension 512
-oss_access_id xxx
-oss_access_key xxx
-oss_endpoint xxx
-oss_bucket xxx
-owner_id 123456
-vector_separator blank
-pk_type int64
-row_num 10
-column_num 10
-job_mode build:seek:recall
-topk 1,50,100,200
-sharding_mode cluster
-kmeans_resource_name kmeans_center_resource_cl
-kmeans_cluster_num 1000;