Basic stages introduction
Local run prerequisite: By setting the -local parameter in the jar command, you can simulate the MapReduce running process on your local machine for local debugging.
During the local run: The client downloads from MaxCompute the metadata and data of the input tables, the referenced resources, and the metadata of the output tables that local debugging requires, and saves them in a local directory named ‘warehouse’.
After the program runs: The calculation result is written to a file in the ‘warehouse’ directory. If the input tables and referenced resources have already been downloaded to the local ‘warehouse’ directory, the next run uses those data and files directly and does not download them again.
Differences between a local run and a distributed run
During a local run, multiple Map and Reduce workers are still started to process data. However, these workers run serially, not concurrently.
The simulation differs from a real distributed run in the following ways:
- Input table row limit: Currently, at most 100 rows of data are downloaded from an input table.
- Usage of resources: In a distributed environment, MaxCompute limits the size of referenced resources. For more information, see Application Restriction. In a local run, the resource size is not limited.
- Security limits: MaxCompute MapReduce and UDF programs that run in a distributed environment are restricted by the Java sandbox. These restrictions do not apply to a local run.
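The serial execution described above can be pictured with a small simulation. This is only an illustrative sketch in Python, not the client's actual implementation: the function name `run_serially`, the worker counts, and the assumption that the WordCount mapper emits every column value with a count of 1 are all hypothetical.

```python
from collections import defaultdict


def run_serially(rows, num_map_workers=2, num_reduce_workers=2):
    """Simulate a word-count job whose workers run one after another."""
    # Split the input rows among the map workers (round-robin).
    splits = [rows[i::num_map_workers] for i in range(num_map_workers)]

    # Map phase: each worker is one loop iteration, so workers never overlap.
    intermediate = []
    for split in splits:
        for row in split:
            for word in row:
                intermediate.append((word, 1))

    # Shuffle: route each key to one reduce worker and group its values.
    partitions = [defaultdict(list) for _ in range(num_reduce_workers)]
    for word, count in intermediate:
        partitions[hash(word) % num_reduce_workers][word].append(count)

    # Reduce phase: again one worker at a time.
    result = {}
    for partition in partitions:
        for word, counts in partition.items():
            result[word] = sum(counts)
    return sorted(result.items())


# One input row with two columns, like the sample wc_in table in this topic.
print(run_serially([("0", "2")]))  # prints [('0', 1), ('2', 1)]
```

Because every worker is an ordinary loop iteration, a breakpoint or log statement inside the map or reduce step is hit in a predictable order, which is what makes this mode convenient for debugging.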
Example:
odps:my_project> jar -l com.aliyun.odps.mapred.example.WordCount wc_in wc_out
Summary:
counters: 10
map-reduce framework
combine_input_groups=2
combine_output_records=2
map_input_bytes=4
map_input_records=1
map_output_records=2
map_output_[wc_out]_bytes=0
map_output_[wc_out]_records=0
reduce_input_groups=2
reduce_output_[wc_out]_bytes=8
reduce_output_[wc_out]_records=2
OK
For a detailed WordCount example, see WordCount Code example.
The ‘warehouse’ directory has the following structure:
<warehouse>
   |____my_project (project directory)
          |____ <__tables__>
          |       |__wc_in (table directory)
          |       |      |____ data (file)
          |       |      |
          |       |      |____ <__schema__> (file)
          |       |__wc_out (table data directory)
          |               |____ data (file)
          |               |
          |               |____ <__schema__> (file)
          |
          |____ <__resources__>
                  |
                  |___table_resource_name (table resource)
                  |         |____<__ref__>
                  |
                  |___ file_resource_name (file resource)
- Directories at the same level as my_project represent projects. ‘wc_in’ and ‘wc_out’ represent tables. The table files that the jar command reads are downloaded into directories at this level.
- The content of <__schema__> is the table metadata. The format is defined as follows:
project=local_project_name
table=local_table_name
columns=col1_name:col1_type,col2_name:col2_type
partitions=p1:STRING,p2:BIGINT
A column name and its type are separated by a colon (:), and columns are separated by commas (,). The <__schema__> file must begin with the project name and table name in the form project_name.table_name, followed by a comma (,) and then the column definitions:
project_name.table_name,col1_name:col1_type,col2_name:col2_type,……
- The ‘data’ file contains the table data. The number of columns and the values in each row must match the definition in <__schema__>. Extra or missing columns are not allowed.
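The rules above can be checked before a local run with a short script. The following is a minimal sketch in Python that parses the comma-separated form of the schema line and compares each data row against it. The function names are hypothetical, and the data layout (one row per line, fields separated by commas, no escaping) is an assumption inferred from the sample files in this topic.

```python
def parse_schema_line(line):
    """Parse 'project.table,col:type,...' into (project, table, columns)."""
    head, *column_defs = line.strip().split(",")
    project, table = head.split(".", 1)
    columns = []
    for column_def in column_defs:
        name, col_type = column_def.split(":", 1)
        columns.append((name, col_type))
    return project, table, columns


def check_data_rows(data_text, columns):
    """Return the 1-based numbers of rows whose field count does not match."""
    bad_rows = []
    for number, row in enumerate(data_text.splitlines(), start=1):
        # Assumed layout: comma-separated fields, one row per line.
        if len(row.split(",")) != len(columns):
            bad_rows.append(number)
    return bad_rows


project, table, columns = parse_schema_line(
    "my_project.wc_in,key:STRING,value:STRING"
)
print(project, table, columns)
print(check_data_rows("0,2\n", columns))     # no mismatching rows
print(check_data_rows("0,2\n1\n", columns))  # second row is missing a column
```

The check only counts fields. It does not verify that a value is valid for its declared type.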
The content of <__schema__> in wc_in is as follows:
my_project.wc_in,key:STRING,value:STRING
The content of ‘data’ is as follows:
0,2
The client downloads the table metadata and part of the table data from MaxCompute and saves them to the two preceding files. If you run this example again, the data in the ‘wc_in’ directory is used directly and is not downloaded again.
Note: Downloading data from MaxCompute is supported only in MapReduce local run mode. If local debugging is run in the Eclipse development plug-in, MaxCompute data cannot be downloaded to the local machine.
The content of <__schema__> in wc_out is as follows:
my_project.wc_out,key:STRING,cnt:BIGINT
The content of ‘data’ is as follows:
0,1
2,1
The client downloads the metadata of wc_out from MaxCompute and saves it to the <__schema__> file. The ‘data’ file is the result file generated by the local run.
Note:
- You can also edit the <__schema__> and ‘data’ files yourself and place them in the corresponding table directory.
- During a local run, if the client detects that a table directory already exists, it does not download that table's information from MaxCompute. A local table directory can therefore represent a table that does not exist in MaxCompute.
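Because an existing table directory is used as is, a test table can be prepared by hand or by script before the run. The sketch below is one possible way to do this in Python. It assumes that the on-disk names are `__tables__` and `__schema__` (the names in the directory tree without the angle brackets), that the schema uses the comma-separated form, and that each data row is one comma-separated line. The function `create_local_table` and the table name `wc_test` are hypothetical.

```python
import tempfile
from pathlib import Path


def create_local_table(warehouse, project, table, columns, rows):
    """Write schema and data files for a table that exists only locally."""
    # Assumed layout: <warehouse>/<project>/__tables__/<table>/
    table_dir = Path(warehouse) / project / "__tables__" / table
    table_dir.mkdir(parents=True, exist_ok=True)

    # Schema line: project.table followed by name:type pairs.
    column_defs = ",".join(f"{name}:{col_type}" for name, col_type in columns)
    schema_line = f"{project}.{table},{column_defs}\n"
    (table_dir / "__schema__").write_text(schema_line, encoding="utf-8")

    # Data file: one comma-separated line per row.
    data_text = "".join(
        ",".join(str(value) for value in row) + "\n" for row in rows
    )
    (table_dir / "data").write_text(data_text, encoding="utf-8")
    return table_dir


# Usage: build a throwaway warehouse with one local-only input table.
with tempfile.TemporaryDirectory() as warehouse:
    table_dir = create_local_table(
        warehouse,
        "my_project",
        "wc_test",
        [("key", "STRING"), ("value", "STRING")],
        [("0", "2")],
    )
    print((table_dir / "__schema__").read_text(encoding="utf-8"), end="")
    print((table_dir / "data").read_text(encoding="utf-8"), end="")
```

For a real run, the first argument would be the path of the ‘warehouse’ directory that the client uses instead of a temporary directory.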