Local mode lets you debug a MapReduce job on your machine before submitting it to a distributed cluster. Instead of redeploying to the cluster for every code change, run the job locally with the -local flag and inspect the results immediately.
How it works
Add the `-local` flag (abbreviated as `-l` in the example below) to the `jar` command to run the job locally. The MaxCompute client:

- Downloads the metadata and data of the input table, the metadata of the output table, and any required resources from MaxCompute into a local directory named `warehouse`.
- Runs the map tasks and reduce tasks in sequence (not in parallel).
- Writes the results to a file in the `warehouse` directory.
On subsequent runs, if the input table and resources are already in warehouse, the client reads them directly — no re-download.
Data can be downloaded from MaxCompute only for MapReduce jobs that run in local mode.
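The caching rule described above can be sketched in plain Java. This is a hypothetical illustration, not MaxCompute client code; the `warehouse/<project>/__tables__/<table>` layout follows the directory structure shown later in this topic, and `LocalWarehouse`/`isCached` are invented names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the local-mode caching rule:
// a table is downloaded only when its directory is missing from warehouse.
public class LocalWarehouse {
    // Returns true when the client would skip the download for this table.
    public static boolean isCached(Path warehouse, String project, String table) {
        return Files.isDirectory(
                warehouse.resolve(project).resolve("__tables__").resolve(table));
    }

    public static void main(String[] args) throws IOException {
        Path warehouse = Files.createTempDirectory("warehouse");
        System.out.println(isCached(warehouse, "my_project", "wc_in")); // false: first run downloads
        Files.createDirectories(
                warehouse.resolve("my_project").resolve("__tables__").resolve("wc_in"));
        System.out.println(isCached(warehouse, "my_project", "wc_in")); // true: later runs read locally
    }
}
```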
Local mode vs. distributed mode
| Dimension | Local mode | Distributed mode |
|---|---|---|
| Input rows | Maximum 100 rows | No limit |
| Resource size | No limit | Limited (see MapReduce limits) |
| Security | No restrictions | Java sandbox applies to MapReduce and user-defined functions (UDFs) |
Run a job in local mode
The following example runs the WordCount program against the wc_in table and writes output to wc_out.
Step 1: Run the jar command
odps:my_project> jar -l com.aliyun.odps.mapred.example.WordCount wc_in wc_out
For the full WordCount sample code, see WordCount.
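Independently of the ODPS SDK, the word-count logic that the sample implements can be sketched in plain Java. The class and method names below are invented for illustration; this is not the SDK's mapper/reducer API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the word-count logic (not the ODPS SDK API):
// each input row is split into tokens (map), then the occurrences of
// each token are summed (reduce).
public class WordCountLogic {
    public static Map<String, Long> count(String[] rows) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String row : rows) {
            for (String token : row.split("\\s+")) {   // map: emit (token, 1)
                counts.merge(token, 1L, Long::sum);    // reduce: sum per token
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The sample wc_in table holds one row whose values are "0" and "2",
        // matching the counters in Step 2: 1 input record, 2 output records.
        System.out.println(count(new String[] {"0 2"})); // {0=1, 2=1}
    }
}
```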
Step 2: Check the output
A successful run prints a summary with counters:
Summary:
counters: 10
map-reduce framework
combine_input_groups=2
combine_output_records=2
map_input_bytes=4
map_input_records=1
map_output_records=2
map_output_[wc_out]_bytes=0
map_output_[wc_out]_records=0
reduce_input_groups=2
reduce_output_[wc_out]_bytes=8
reduce_output_[wc_out]_records=2
OK
Step 3: Verify the results in the warehouse directory
The first run creates a warehouse directory in the current path with the following structure:
<warehouse>
|____my_project (project directory)
     |____<__tables__>
     |    |____wc_in (table data directory)
     |    |    |____data (file)
     |    |    |____<__schema__> (file)
     |    |____wc_out (table data directory)
     |         |____data (file)
     |         |____<__schema__> (file)
     |____<__resources__>
          |____table_resource_name (table resource)
          |    |____<__ref__>
          |____file_resource_name (file resource)
- Directories at the same level as `my_project` represent projects.
- Directories at the same level as `wc_in` and `wc_out` represent data tables. Table data read or written by the `jar` command is stored at this level.
Verify the wc_in input data
The <__schema__> file stores the table metadata. For wc_in:
my_project.wc_in,key:STRING,value:STRING
The data file contains the downloaded rows:
0,2
The client downloaded the metadata and a portion of the table data from MaxCompute and saved them to these files. The next run reads directly from wc_in — no download needed.
Verify the wc_out output data
The schema for wc_out:
my_project.wc_out,key:STRING,cnt:BIGINT
After the job completes, the data file contains the results:
0,1
2,1
Schema file format
The <__schema__> file defines the table structure:
project=local_project_name
table=local_table_name
columns=col1_name:col1_type,col2_name:col2_type
partitions=p1:STRING,p2:BIGINT -- optional
Formatting rules:
- Separate a column name and its type with a colon (`:`).
- Separate columns with a comma (`,`).
- Declare `project_name.table_name` at the beginning of the file, separated from the column definitions by a comma. Example: `project_name.table_name,col1_name:col1_type,col2_name:col2_type`
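Under these rules, the single-line form of a `<__schema__>` file can be split apart with a short sketch. `SchemaLine` is a hypothetical helper written for this topic, not part of the MaxCompute client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the single-line <__schema__> form:
//   project_name.table_name,col1:TYPE1,col2:TYPE2
public class SchemaLine {
    public final String project;
    public final String table;
    public final Map<String, String> columns = new LinkedHashMap<>();

    public SchemaLine(String line) {
        String[] parts = line.split(",");          // fields are comma-separated
        int dot = parts[0].indexOf('.');
        project = parts[0].substring(0, dot);      // "project_name"
        table = parts[0].substring(dot + 1);       // "table_name"
        for (int i = 1; i < parts.length; i++) {   // "name:TYPE" pairs
            String[] col = parts[i].split(":", 2);
            columns.put(col[0], col[1]);
        }
    }

    public static void main(String[] args) {
        SchemaLine s = new SchemaLine("my_project.wc_in,key:STRING,value:STRING");
        System.out.println(s.project + " " + s.table + " " + s.columns);
        // prints: my_project wc_in {key=STRING, value=STRING}
    }
}
```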
Usage notes
- Edit the `<__schema__>` and `data` files directly to supply custom test data without downloading from MaxCompute.
- If the client detects that a table directory already exists in `warehouse`, it skips the download for that table. This means a local table directory can reference a table that does not exist in MaxCompute.
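Seeding a local table by hand might look like the following sketch. It assumes the on-disk names are literally `__tables__` and `__schema__` (the angle brackets in the tree above are taken to be notation), and `SeedLocalTable` is an invented helper, not client code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: create a local table directory with a <__schema__> line and data
// rows, following the warehouse layout shown earlier. The client is assumed
// to skip the download once this directory exists.
public class SeedLocalTable {
    public static Path seed(Path warehouse, String project, String table,
                            String schemaLine, List<String> rows) throws IOException {
        Path dir = warehouse.resolve(project).resolve("__tables__").resolve(table);
        Files.createDirectories(dir);
        Files.write(dir.resolve("__schema__"), List.of(schemaLine)); // table metadata
        Files.write(dir.resolve("data"), rows);                      // CSV rows
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Path warehouse = Files.createTempDirectory("warehouse");
        Path dir = seed(warehouse, "my_project", "wc_in",
                "my_project.wc_in,key:STRING,value:STRING", List.of("0,2"));
        System.out.println(Files.readAllLines(dir.resolve("data"))); // [0,2]
    }
}
```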
What's next
- WordCount: full sample code for the WordCount example used in this topic
- MapReduce limits: resource size and other limits that apply in distributed mode