
MaxCompute:FAQ about MaxCompute MapReduce

Last Updated: Mar 26, 2026

This page answers frequently asked questions about MaxCompute MapReduce, organized into three categories.

Category | Questions
Features | Input sources, write mode, API behavior, sorting, backups
Program development | Resources, logging, local run, node configuration
Common errors | OOM, BufferOverflowException, Class Not Found, and more

Before you troubleshoot

When a job fails, check the error message and Logview first:

  • stdout in Logview — output from System.out.println in your mapper or reducer code

  • stderr in Logview — output from standard logging frameworks

  • MaxCompute client output — for many job failures, the client prints error details directly, so logs are not needed for diagnosis

Features

Can I use views as input sources?

No. MapReduce jobs only accept tables as input sources, not views.

In which mode are results written to a table or partition?

Results are written in overwrite mode.

Can I run a MapReduce job by calling shell files?

No. Java sandbox restrictions prevent MapReduce jobs from calling shell files. For details on what the sandbox restricts, see Java sandbox.

Can the reducer's setup() method read from input tables?

No. The setup() method of a reducer cannot access input tables directly, but it can read cached tables.

Does a mapper support multiple partitions as input?

Yes. A mapper can take data from multiple partitions of the same table. Each partition is treated as a separate table.
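
As a hedged sketch (MaxCompute SDK style; the table and partition names are hypothetical), each partition is registered as its own input:

```java
// Hypothetical names; JobConf, InputUtils, and TableInfo are MaxCompute SDK classes.
JobConf job = new JobConf();
InputUtils.addTable(TableInfo.builder()
        .tableName("sales").partSpec("ds=20240101").build(), job);
InputUtils.addTable(TableInfo.builder()
        .tableName("sales").partSpec("ds=20240102").build(), job);
```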

Can a mapper read partition fields from data records?

No. Partition fields are not included in data records. Use context.getInputTableInfo().getPartitionSpec() instead:

PartitionSpec ps = context.getInputTableInfo().getPartitionSpec();
String area = ps.get("area");

What is the relationship between labels and partitions?

Labels identify the partitions that output data is written to.

Does MaxCompute MapReduce support map-only jobs?

Yes. Set the number of reducers to 0:

job.setNumReduceTasks(0);

Can a mapper read records by column name?

Yes. Records can be accessed either by index (record.get(i)) or by column name (record.get("size")).

What are the differences between write(Record key, Record value) and write(Record record)?

write(Record key, Record value) generates intermediate results that a mapper passes to a reducer over the network. Because no output table schema is available at this stage, you must declare the field data types explicitly:

job.setMapOutputKeySchema(SchemaUtils.fromString("id:string"));
job.setMapOutputValueSchema(SchemaUtils.fromString("size:bigint"));

write(Record record) writes final results directly to an output table. The table schema is available, so data types are inferred automatically — no explicit declaration needed.

Why do I need both -libjars and -classpath when submitting a MapReduce job?

Two executors are involved: a local executor (your MaxCompute client) and a remote executor (the MaxCompute cluster).

  • -classpath specifies the JAR for the local executor — for example, -classpath lib/mapreduce-examples.jar

  • -libjars specifies the JAR for the remote executor — for example, -libjars mapreduce-examples.jar

Both are needed because the client performs local operations (such as job configuration) while the cluster performs the actual computation.
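
Putting both options together, a full submission might look like this (the class name and input/output arguments are hypothetical; the JAR names follow the examples above):

```
jar -classpath lib/mapreduce-examples.jar -libjars mapreduce-examples.jar com.example.WordCount wc_in wc_out
```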

Can I use Hadoop MapReduce source code directly in MaxCompute?

No. Although the overall programming style is similar, the MaxCompute MapReduce API is different from the Hadoop MapReduce API. To run Hadoop MapReduce code in MaxCompute, rewrite it against the MaxCompute MapReduce SDK.

How do I sort data with MaxCompute MapReduce?

Specify the sort columns and order in your job configuration:

// Sort by i1 (ascending) and i2 (descending)
job.setOutputKeySortColumns(new String[] { "i1", "i2" });
job.setOutputKeySortOrder(new SortOrder[] { SortOrder.ASC, SortOrder.DESC });

setOutputKeySortOrder accepts an array of SortOrder values. Valid values are ASC (ascending) and DESC (descending), one per sort column.
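
To make the ordering concrete, this pure-Java comparator (no MaxCompute SDK involved; names are hypothetical) reproduces the same i1-ascending, i2-descending order on two-column keys:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortOrderDemo {
    // Sort two-column keys {i1, i2}: i1 ascending, then i2 descending --
    // the same ordering the job configuration above requests.
    static List<long[]> sortKeys(List<long[]> keys) {
        return keys.stream()
                .sorted(Comparator.<long[]>comparingLong(k -> k[0])
                        .thenComparing(Comparator.<long[]>comparingLong(k -> k[1]).reversed()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (long[] k : sortKeys(List.of(
                new long[]{2, 10}, new long[]{1, 5}, new long[]{1, 9}))) {
            System.out.println(Arrays.toString(k));
        }
        // {1, 9} comes before {1, 5} because i2 is descending.
    }
}
```

With input {2, 10}, {1, 5}, {1, 9}, the keys with i1 = 1 come first, and among them {1, 9} precedes {1, 5}.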

What are backups for MapReduce jobs?

MaxCompute monitors job progress. When a job instance runs for a long time without finishing, MaxCompute automatically launches a backup instance that processes the same data in parallel. The result of whichever instance finishes first is used, and the other is discarded. If the data volume is extremely large, both the original and the backup instance may exceed the time limit; in that case, backups do not improve performance.

Program development

How do I pass multiple resources when submitting a MapReduce job?

Separate resource names with commas:

jar -resources resource1,resource2,resource3

How do I check whether a table is empty in the main method?

Odps odps = SessionState.get().getOdps();
Table table = odps.tables().get("tableName");
RecordReader recordReader = table.read(1);
if (recordReader.read() == null) {
    // Table is empty
}

How do I generate logs for a MapReduce job?

Use System.out.println in your code — the output goes to stdout in Logview. If you use a standard logging framework, logs go to stderr in Logview.

If a job fails, the MaxCompute client also prints the error details directly, so logs are often not needed for error diagnosis.

Does a result table accumulate duplicate data across two MapReduce jobs?

Yes. If two separate MapReduce jobs both write to the same result table, the table will contain duplicate records — one from each job. Queries on that table will return both copies.

Do I need to configure nodes for distributed processing in MaxCompute?

No. MaxCompute handles shard allocation automatically. Unlike Hadoop MapReduce, there is no node configuration — the MaxCompute runtime determines which shards each job uses based on its internal algorithm.

After adding a combiner, reducers receive no input. Why?

This issue usually means the records generated by the combiner cannot be mapped to the key-value pairs generated by the mapper. The combiner runs between the map and reduce stages, so its output schema must match the mapper's output schema; verify that the two match.
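
A hedged configuration sketch (class names hypothetical) showing where the shared schemas are declared:

```java
// The combiner is set like a reducer and must emit records
// that match the mapper output schemas declared below.
JobConf job = new JobConf();
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyCombiner.class);
job.setReducerClass(MyReducer.class);
job.setMapOutputKeySchema(SchemaUtils.fromString("word:string"));
job.setMapOutputValueSchema(SchemaUtils.fromString("count:bigint"));
```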

Why can't I specify the output table schema when running a map-only job?

The schema is fixed at table creation time — defined in the CREATE TABLE statement. When a map-only job writes to the output table, it uses that existing schema directly. There is no mechanism to specify or override the schema at job runtime.

How do I run a MapReduce job using the local MaxCompute server?

Use the jar command on the MaxCompute client. For command syntax, see Submit a MapReduce job.

To run a job programmatically against the server, configure and run the job as follows:

// Configure the MaxCompute connection
Account account = new AliyunAccount(accessid, accesskey);
Odps odps = new Odps(account);
odps.setEndpoint(endpoint);
odps.setDefaultProject(project);

// Initialize the session
SessionState ss = SessionState.get();
ss.setOdps(odps);
ss.setLocalRun(false);  // Set to true for local debugging

// Configure the job
Job job = new Job();
job.setResources("mr.jar");  // Equivalent to jar -resources mr.jar
job.setMapperClass(XXXMapper.class);
job.setReducerClass(XXXReducer.class);

Before running the job:

  1. Add the dependency JARs from the lib folder of your MaxCompute client to your classpath. Import all JARs from the latest client version.

  2. Package your MapReduce program and upload it as a resource. For upload instructions, see Resource operations.

Common errors

Memory and OOM errors

BufferOverflowException

A single field value exceeds the allowed size limit. Check that your data conforms to these limits per field:

Data type | Limit
String | 8 MB
Bigint | -9,223,372,036,854,775,807 to 9,223,372,036,854,775,807
Double | -1.0 x 10^308 to 1.0 x 10^308
Boolean | True / False
Date | 0001-01-01 00:00:00 to 9999-12-31 23:59:59

The full error looks like:

FAILED: ODPS-0123131:User defined function exception - Traceback:
     java.nio.BufferOverflowException
     at java.nio.DirectByteBuffer.put(Unknown Source)
     at com.aliyun.odps.udf.impl.batch.TextBinary.put(TextBinary.java:35)
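
A defensive check before writing can prevent this; a minimal pure-Java sketch (the helper name is hypothetical; the 8 MB figure is the String limit from the table above):

```java
import java.nio.charset.StandardCharsets;

public class FieldLimitCheck {
    // MaxCompute limits String fields to 8 MB (see the table above).
    static final int MAX_STRING_BYTES = 8 * 1024 * 1024;

    // Returns true if the value fits within the String field limit.
    static boolean fitsStringField(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_STRING_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsStringField("hello"));                           // fits
        System.out.println(fitsStringField("x".repeat(MAX_STRING_BYTES + 1)));  // too large
    }
}
```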

OOM error before the Reduce stage

Too much data is being held in memory during the Map stage. Either avoid using a combiner, or keep the combiner and set odps.mapred.map.min.split.size to 512.

OOM error during a job run

Increase the Java Virtual Machine (JVM) heap memory for mappers or reducers. For example, to set mapper memory to 2 GB:

odps.stage.mapper.jvm.mem=2048

java.lang.OutOfMemoryError: Java heap space with many reducers

If you're loading a configuration file across 600 or more reducers and seeing Java heap space errors, you've hit a MaxCompute MapReduce memory limit. Adjust the memory-related parameters — see Overview for available options. For the applicable limits, see Limits.

Resource and class loading errors

"Resource not found"

The -resources parameter was not set when submitting the job. Add the required resources:

jar -resources resource1,resource2 ...

Separate multiple resources with commas.

"Class Not Found"

Check two things:

  1. -classpath value: verify it contains the full path to your JAR file, not just a directory.

  2. JAR contents: make sure the JAR includes compiled .class files, not just source files from the src folder.

Too many open files — resource reference limit exceeded

A single MapReduce job can reference a maximum of 256 resources. Each table and each archive file counts as one resource. The full error looks like:

Caused by: com.aliyun.odps.OdpsException: java.io.FileNotFoundException:
  temp/mr_XXXXXX/resource/meta.user.group.config (Too many open files)

Reduce the number of resources referenced by the job. For the full list of limits, see Limits.

"Exceed maximum read times per resource" — ODPS-0730001

Resources are being read too many times. The full error:

ODPS-0730001: Exceed maximum read times per resource

Read each resource once in the setup() method and cache the result; don't re-read it inside the map() or reduce() methods.
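
The read-once-and-cache pattern can be sketched in plain Java; loadResource below is a stand-in for the SDK's resource-reading call, not a real API:

```java
import java.util.HashMap;
import java.util.Map;

public class ResourceCache {
    // Cache populated once, analogous to reading a resource in setup().
    private final Map<String, String> cache = new HashMap<>();
    private int readCount = 0;

    // Stand-in for the SDK call that reads a resource (hypothetical).
    private String loadResource(String name) {
        readCount++;
        return "contents-of-" + name;
    }

    // setup(): read each resource exactly once.
    public void setup(String resourceName) {
        cache.put(resourceName, loadResource(resourceName));
    }

    // map()/reduce(): use the cached copy instead of re-reading.
    public String lookup(String resourceName) {
        return cache.get(resourceName);
    }

    public int reads() {
        return readCount;
    }
}
```

However many times lookup() is called afterwards, the underlying resource is read only once, which keeps the job under the per-resource read limit.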

Third-party classes not found when using Maven Assembly

The MapReduce distributed runtime is subject to Java sandbox restrictions (the main program is not), so some third-party classes cannot be used even when Maven Assembly packages them correctly. Prefer libraries that work within the sandbox: for JSON processing, use Gson, which does not need to be bundled in the JAR; for date string conversion, use SimpleDateFormat from the Java standard library.

Permissions and sandbox errors

java.security.AccessControlException

Your code is attempting an operation that the Java sandbox doesn't allow. MaxCompute does not permit access to external resources from within the sandbox. The full error looks like:

FAILED: ODPS-0123131:User defined function exception - Traceback:
java.lang.ExceptionInInitializerError
  ...
Caused by: java.security.AccessControlException: access denied
  ("java.lang.RuntimePermission" "getProtectionDomain")

To work around this, store the external data or configuration logic inside MaxCompute and read it as a resource. See Resource usage example and Java sandbox.

ODPS-0420095: access denied — task not in release range

The MaxCompute developer edition supports only PyODPS jobs and MaxCompute SQL jobs with user-defined functions (UDFs). Other job types, such as MapReduce jobs and Spark jobs, are not supported on that edition. The full error:

Exception in thread "main" java.io.IOException:
  com.aliyun.odps.OdpsException: ODPS-0420095: Access Denied -
  The task is not in release range: LOT

To run MapReduce jobs, upgrade your project edition. See Switch billing methods.

Input, output, and timeout errors

ODPS-0010000: system internal error — get input pangu dir meta fail

The partition referenced by the job doesn't exist or has no data. Create the partition and insert data before submitting the job.

"Table not found"

The project name or table name in the job configuration is wrong, or the output table doesn't exist:

Exception in thread "main" com.aliyun.odps.OdpsException:
  Table not found: project_name.table_name.

Use TableInfo.Builder and set both ProjectName and TableName to match the actual project and table.
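
A hedged sketch of that configuration (project and table names hypothetical; OutputUtils and TableInfo are MaxCompute SDK classes):

```java
// Set both the project and the table explicitly so the output
// resolves to the intended table.
TableInfo output = TableInfo.builder()
        .projectName("my_project")
        .tableName("my_table")
        .build();
OutputUtils.addTable(output, job);
```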

ODPS-0123144: Fuxi job failed — WorkerRestart

A secondary cluster node timed out after 10 minutes (this timeout is fixed and cannot be changed), so the primary node treated it as failed. This is typically caused by a large loop in the Reduce stage, for example when processing heavily skewed data or generating Cartesian products. Rewrite the Reduce logic to avoid large loops.

java.io.IOException — too many local-run maps

The default limit for local-run maps is 100. If you're hitting this limit:

ODPS-0740001: Too many local-run maps: 101, must be <= 100
  (specified by local-run parameter 'odps.mapred.local.map.max.tasks')

Add -Dodps.mapred.local.map.max.tasks=200 to raise the limit.

Subscript out of bounds when running Hadoop MapReduce on MaxCompute

Rewrite the job using the MaxCompute MapReduce API. If the workload requires it, consider using Spark on MaxCompute instead.