Introduction to the basic stages

Local run prerequisite: Setting the -local parameter in the jar command simulates the MapReduce running process on the local machine, so that you can debug locally.

During a local run: The client downloads the meta information of the input tables, the referenced resources, and the meta information of the output tables from MaxCompute, and saves them in a local directory named ‘warehouse’.

After the program finishes: The calculation result is written to a file in ‘warehouse’. If the input table and the referenced resources are already in the local warehouse directory, the next run uses the data and files in ‘warehouse’ directly and does not download them again.
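The reuse rule above can be sketched as a simple existence check. This is only an illustration, not the client's actual implementation: the function name needs_download is hypothetical, and the sketch assumes the layout warehouse/project/__tables__/table described later in this document, with the special directory literally named __tables__.

```python
from pathlib import Path


def needs_download(warehouse: Path, project: str, table: str) -> bool:
    """Return True when the table directory is missing from the local warehouse.

    Assumption: a table lives at <warehouse>/<project>/__tables__/<table>.
    """
    table_dir = warehouse / project / "__tables__" / table
    return not table_dir.is_dir()
```

Under these assumptions, a second run finds the directory created by the first run and skips the download step.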

Differences between running locally and running in a distributed environment

During a local run, multiple Map and Reduce workers are still started to process data. However, these workers do not run concurrently; they run serially, one after another.
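To make the serial behavior concrete, here is a minimal word-count sketch in which each map worker and each reduce worker runs to completion before the next one starts. It is not MaxCompute code: the functions map_worker, reduce_worker, and run_local, and the CRC32-based partitioning, are assumptions chosen only for illustration.

```python
import zlib
from collections import defaultdict


def map_worker(lines):
    """One map worker: emit (word, 1) for every word in its input split."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_worker(grouped):
    """One reduce worker: sum the counts for each key in its partition."""
    return {word: sum(counts) for word, counts in grouped.items()}


def run_local(splits, num_reducers=2):
    """Run all workers one after another, as a local simulation would."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    # Map workers are invoked in a plain loop, never in parallel.
    for split in splits:
        for word, count in map_worker(split):
            index = zlib.crc32(word.encode("utf-8")) % num_reducers
            partitions[index][word].append(count)
    # Reduce workers are also invoked one at a time.
    result = {}
    for partition in partitions:
        result.update(reduce_worker(partition))
    return result
```

The result is the same as in a concurrent run; only the scheduling differs, which is why local runs are suitable for debugging logic but not for judging performance.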

In addition, this simulation process and real distributed operation have the following differences:

  • Input table row limit: currently, at most 100 rows of data can be downloaded.
  • Resource usage: in a distributed environment, MaxCompute limits the size of referenced resources. For more information, see Application Restriction. In a local run, the resource size is not limited.
  • Security restrictions: MaxCompute MapReduce and UDF programs running in a distributed environment are restricted by the Java sandbox. In a local run, this restriction does not apply.


A local operation example is as follows:
    odps:my_project> jar -l com.aliyun.odps.mapred.example.WordCount wc_in wc_out
    counters: 10
        map-reduce framework

For a detailed WordCount example, see WordCount Code Example.

When you run the local debugging command for the first time, a directory named ‘warehouse’ appears in the current directory after the command executes successfully. The directory structure of warehouse is as follows:
   |____my_project(project directory)
          |____ <__tables__>
          |       |__wc_in(table directory)
          |       |      |____ data(file)
          |       |      |
          |       |      |____ <__schema__> (file)
          |       |__wc_out(table data directory)
          |               |____ data(file)
          |               |
          |               |____ <__schema__> (file)
          |____ <__resources__>
                  |___table_resource_name (table resource)
                  |       |____<__ref__>
                  |___ file_resource_name (file resource)
  • Directories at the same level as my_project represent projects. ‘wc_in’ and ‘wc_out’ represent tables. The table files that the jar command reads are downloaded into these table directories.
  • The content of <__schema__> is the table meta information. The format is defined as follows:

    A column name and its type are separated by a colon ‘:’, and adjacent columns are separated by a comma ‘,’. The <__schema__> file must begin with the project name and table name, in the form project_name.table_name, followed by a comma and then the column definitions:

    project_name.table_name,col1_name:col1_type,col2_name:col2_type,……

  • The file ‘data’ holds the table data. The number of columns and the corresponding data must match the definition in __schema__; extra columns and missing columns are not allowed.
    The content of __schema__ in wc_in is as follows:
    The content of ‘data’ is as follows:
    The client downloads the meta information of the table and part of its data from MaxCompute, and saves them into the two preceding files. If you run this example again, the data in the directory ‘wc_in’ is used directly and is not downloaded again.
    Note that downloading data from MaxCompute is supported only in MapReduce local operation mode. If local debugging is run in the Eclipse development plug-in, MaxCompute data cannot be downloaded to the local machine.
    The content of __schema__ in wc_out is as follows:
    The content of ‘data’ is as follows:
    The client downloads the meta information of wc_out from MaxCompute and saves it to the file __schema__. The file ‘data’ is the result data file generated by the local run.
    • You can also edit the __schema__ file and the ‘data’ file yourself, and then place the two files into the corresponding table directory.
    • During a local run, if the client detects that the table directory already exists, it does not download the information of this table from MaxCompute. A local table directory can therefore represent a table that does not exist in MaxCompute.
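As a rough illustration of the schema format and of preparing a table directory by hand, the sketch below parses a schema line, checks that every row has the declared number of columns, and writes the two files. Several things here are assumptions rather than documented behavior: the helper names parse_schema and write_local_table, the literal directory and file names __tables__ and __schema__, the comma-separated layout of the ‘data’ file, and the example column names and types.

```python
from pathlib import Path


def parse_schema(line):
    """Split 'project.table,col:type,...' into (project, table, columns)."""
    head, *col_defs = [part.strip() for part in line.strip().split(",")]
    project, table = head.split(".", 1)
    columns = [tuple(col.split(":", 1)) for col in col_defs]
    return project, table, columns


def write_local_table(warehouse, schema_line, rows):
    """Create <warehouse>/<project>/__tables__/<table> with both files.

    Assumption: each row in 'data' is one comma-separated line.
    """
    project, table, columns = parse_schema(schema_line)
    for row in rows:
        # Extra or missing columns are not allowed.
        if len(row) != len(columns):
            raise ValueError(
                f"row {row!r} has {len(row)} columns, expected {len(columns)}"
            )
    table_dir = Path(warehouse) / project / "__tables__" / table
    table_dir.mkdir(parents=True, exist_ok=True)
    (table_dir / "__schema__").write_text(schema_line + "\n", encoding="utf-8")
    data = "".join(",".join(row) + "\n" for row in rows)
    (table_dir / "data").write_text(data, encoding="utf-8")
    return table_dir
```

For example, write_local_table("warehouse", "my_project.local_only,word:STRING,cnt:BIGINT", [["hello", "1"]]) would prepare a hypothetical table named local_only that exists only on the local machine.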