This topic describes the differences between the local mode and distributed mode in which MapReduce jobs run. It also provides examples of MapReduce jobs in local mode.
Introduction to the local mode
Before you run a job in local mode, you can specify the -local option in the JAR command to simulate the running of the job. This way, you can perform local debugging on the job.
During job running, the client downloads the metadata and data of the input table, metadata of the output table, and resources that are required for local debugging from MaxCompute. The downloaded data is saved to a local directory named
After the job is completed, the computing results are saved to a file in the
warehouse directory. If the input table and required resources are downloaded to the
warehouse directory, MapReduce directly references the data and files in the directory next time, instead of downloading the data again.
Differences between the local mode and distributed mode
A MapReduce job that runs in local mode starts multiple map and reduce tasks to process data. These tasks run in sequence.
- Rows in the input table: A maximum of 100 rows of data can be downloaded in local mode.
- Resource usage: In distributed mode, MaxCompute limits the size of resources that can be referenced. For more information, see MapReduce limits. However, no limits are imposed on the size of resources in local mode.
- Security: MaxCompute MapReduce and user-defined functions (UDFs) are limited by a Java sandbox in distributed mode. However, no limits are imposed in local mode.
odps:my_project> jar -l com.aliyun.odps.mapred.example.WordCount wc_in wc_out Summary: counters: 10 map-reduce framework combine_input_groups=2 combine_output_records=2 map_input_bytes=4 map_input_records=1 map_output_records=2 map_output_[wc_out]_bytes=0 map_output_[wc_out]_records=0 reduce_input_groups=2 reduce_output_[wc_out]_bytes=8 reduce_output_[wc_out]_records=2 OK
For more information about the sample code of WordCount, see WordCount.
<warehouse> |____my_project (project directory) |____ <__tables__> | |__wc_in (table data directory) | | |____ data (file) | | | | | |____ <__schema__> (file) | |__wc_out (table data directory) | |____ data (file) | | | |____ <__schema__> (file) | |____ <__resources__> | |___table_resource_name (table resource) | |____<__ref__> | |___ file_resource_name (file resource)
- Directories at the same level as
my_projectindicate projects. Directories at the same level as
wc_outindicate data tables. The table data that you read or write by using the JAR command is downloaded to directories at this level.
- The <__schema__> file stores the metadata of a table. The following code defines the file format:
project=local_project_name table=local_table_name columns=col1_name:col1_type,col2_name:col2_type partitions=p1:STRING,p2:BIGINT -- In this example, you do not need to specify this field.
Separate the name and data type of a column with a colon (:). Separate columns with commas (,). The project and table names,
project_name.table_name, must be declared at the beginning of the <__schema__> file. Separate the declaration and column definition with a comma (,). Example:
- The data file in the
tablesdirectory stores the table data. The number of columns and column data must match the definition in the _schema_ file. Separate columns with commas (,).The _schema_ file in the
wc_indirectory contains the following data:
my_project.wc_in,key:STRING,value:STRINGThe data file contains the following data:
0,2The client downloads the metadata and part of the data of a table from MaxCompute, and saves the data to the preceding files. The next time you run this example program, the client directly uses the data in the
wc_indirectory, instead of downloading it again.Note Data can be downloaded from MaxCompute only for MapReduce jobs that run in local mode.The _schema_ file in the
wc_outdirectory contains the following data:
my_project.wc_out,key:STRING,cnt:BIGINTThe data file contains the following data:
0,1 2,1The client downloads the metadata of the
wc_outtable from MaxCompute, and saves the data to the _schema_ file. After a job is completed, the results are saved to the data file.Note
- You can also edit the _schema_ and data files, and save the files in table directories.
- If you run a job in local mode and the client detects that the table directory exists, the client does not download the information of this table from MaxCompute. The local table directory can include a table that does not exist in MaxCompute.