Local mode lets you debug a MapReduce job on your machine before submitting it to a distributed cluster. Instead of redeploying to the cluster for every code change, run the job locally with the -local flag and inspect the results immediately.
How it works
Add the `-local` flag (abbreviated as `-l` in the example below) to the `jar` command to run the job locally. The MaxCompute client:

- Downloads the metadata and data of the input table, the metadata of the output table, and any required resources from MaxCompute into a local directory named `warehouse`.
- Runs the map tasks and reduce tasks in sequence (not in parallel).
- Writes the results to a file in the `warehouse` directory.
On subsequent runs, if the input table and resources are already in warehouse, the client reads them directly — no re-download.
Data can be downloaded from MaxCompute only for MapReduce jobs that run in local mode.
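The caching rule described above can be sketched in plain Java. This is a hypothetical illustration, not MaxCompute client code; the `warehouse/<project>/__tables__/<table>` layout follows the directory structure shown later in this topic, and `LocalWarehouse`/`isCached` are invented names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the local-mode caching rule:
// a table is downloaded only when its directory is missing from warehouse.
public class LocalWarehouse {
    // Returns true when the client would skip the download for this table.
    public static boolean isCached(Path warehouse, String project, String table) {
        return Files.isDirectory(
                warehouse.resolve(project).resolve("__tables__").resolve(table));
    }

    public static void main(String[] args) throws IOException {
        Path warehouse = Files.createTempDirectory("warehouse");
        System.out.println(isCached(warehouse, "my_project", "wc_in")); // false: first run downloads
        Files.createDirectories(
                warehouse.resolve("my_project").resolve("__tables__").resolve("wc_in"));
        System.out.println(isCached(warehouse, "my_project", "wc_in")); // true: later runs read locally
    }
}
```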
Local mode vs. distributed mode
| Dimension | Local mode | Distributed mode |
|---|---|---|
| Input rows | Maximum 100 rows | No limit |
| Resource size | No limit | Limited (see MapReduce limits) |
| Security | No restrictions | Java sandbox applies to MapReduce and user-defined functions (UDFs) |
Run a job in local mode
The following example runs the WordCount program against the wc_in table and writes output to wc_out.
Step 1: Run the jar command
odps:my_project> jar -l com.aliyun.odps.mapred.example.WordCount wc_in wc_out
For the full WordCount sample code, see WordCount.
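Independently of the ODPS SDK, the word-count logic that the sample implements can be sketched in plain Java. The class and method names below are invented for illustration; this is not the SDK's mapper/reducer API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the word-count logic (not the ODPS SDK API):
// each input row is split into tokens (map), then the occurrences of
// each token are summed (reduce).
public class WordCountLogic {
    public static Map<String, Long> count(String[] rows) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String row : rows) {
            for (String token : row.split("\\s+")) {   // map: emit (token, 1)
                counts.merge(token, 1L, Long::sum);    // reduce: sum per token
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The sample wc_in table holds one row whose values are "0" and "2",
        // matching the counters in Step 2: 1 input record, 2 output records.
        System.out.println(count(new String[] {"0 2"})); // {0=1, 2=1}
    }
}
```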
Step 2: Check the output
A successful run prints a summary with counters:
Summary:
counters: 10
map-reduce framework
combine_input_groups=2
combine_output_records=2
map_input_bytes=4
map_input_records=1
map_output_records=2
map_output_[wc_out]_bytes=0
map_output_[wc_out]_records=0
reduce_input_groups=2
reduce_output_[wc_out]_bytes=8
reduce_output_[wc_out]_records=2
OK
Step 3: Verify the results in the warehouse directory
The first run creates a warehouse directory in the current path with the following structure:
<warehouse>
|____my_project (project directory)
     |____<__tables__>
     |    |____wc_in (table data directory)
     |    |    |____data (file)
     |    |    |____<__schema__> (file)
     |    |____wc_out (table data directory)
     |         |____data (file)
     |         |____<__schema__> (file)
     |____<__resources__>
          |____table_resource_name (table resource)
          |    |____<__ref__>
          |____file_resource_name (file resource)
- Directories at the same level as `my_project` represent projects.
- Directories at the same level as `wc_in` and `wc_out` represent data tables. Table data read or written by the `jar` command is stored at this level.
Verify the wc_in input data
The <__schema__> file stores the table metadata. For wc_in:
my_project.wc_in,key:STRING,value:STRING
The data file contains the downloaded rows:
0,2
The client downloaded the metadata and a portion of the table data from MaxCompute and saved them to these files. The next run reads directly from wc_in — no download needed.
Verify the wc_out output data
The schema for wc_out:
my_project.wc_out,key:STRING,cnt:BIGINT
After the job completes, the data file contains the results:
0,1
2,1
Schema file format
The <__schema__> file defines the table structure:
project=local_project_name
table=local_table_name
columns=col1_name:col1_type,col2_name:col2_type
partitions=p1:STRING,p2:BIGINT -- optional
Formatting rules:
- Separate a column name and its type with a colon (`:`).
- Separate columns with a comma (`,`).
- Declare `project_name.table_name` at the beginning of the file, separated from the column definitions by a comma. Example: `project_name.table_name,col1_name:col1_type,col2_name:col2_type`
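Under these rules, the single-line form of a `<__schema__>` file can be split apart with a short sketch. `SchemaLine` is a hypothetical helper written for this topic, not part of the MaxCompute client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the single-line <__schema__> form:
//   project_name.table_name,col1:TYPE1,col2:TYPE2
public class SchemaLine {
    public final String project;
    public final String table;
    public final Map<String, String> columns = new LinkedHashMap<>();

    public SchemaLine(String line) {
        String[] parts = line.split(",");          // fields are comma-separated
        int dot = parts[0].indexOf('.');
        project = parts[0].substring(0, dot);      // "project_name"
        table = parts[0].substring(dot + 1);       // "table_name"
        for (int i = 1; i < parts.length; i++) {   // "name:TYPE" pairs
            String[] col = parts[i].split(":", 2);
            columns.put(col[0], col[1]);
        }
    }

    public static void main(String[] args) {
        SchemaLine s = new SchemaLine("my_project.wc_in,key:STRING,value:STRING");
        System.out.println(s.project + " " + s.table + " " + s.columns);
        // prints: my_project wc_in {key=STRING, value=STRING}
    }
}
```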
Usage notes
- Edit the `<__schema__>` and `data` files directly to supply custom test data without downloading from MaxCompute.
- If the client detects that a table directory already exists in `warehouse`, it skips the download for that table. This means a local table directory can reference a table that does not exist in MaxCompute.
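Seeding a local table by hand might look like the following sketch. It assumes the on-disk names are literally `__tables__` and `__schema__` (the angle brackets in the tree above are taken to be notation), and `SeedLocalTable` is an invented helper, not client code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: create a local table directory with a <__schema__> line and data
// rows, following the warehouse layout shown earlier. The client is assumed
// to skip the download once this directory exists.
public class SeedLocalTable {
    public static Path seed(Path warehouse, String project, String table,
                            String schemaLine, List<String> rows) throws IOException {
        Path dir = warehouse.resolve(project).resolve("__tables__").resolve(table);
        Files.createDirectories(dir);
        Files.write(dir.resolve("__schema__"), List.of(schemaLine)); // table metadata
        Files.write(dir.resolve("data"), rows);                      // CSV rows
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Path warehouse = Files.createTempDirectory("warehouse");
        Path dir = seed(warehouse, "my_project", "wc_in",
                "my_project.wc_in,key:STRING,value:STRING", List.of("0,2"));
        System.out.println(Files.readAllLines(dir.resolve("data"))); // [0,2]
    }
}
```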
What's next
- WordCount: full sample code for the WordCount example used in this topic
- MapReduce limits: resource size and other limits that apply in distributed mode