ODPS provides a MapReduce programming interface. Users can use the Java API to write MapReduce programs that process data stored in ODPS. The MapReduce function is not yet generally available; to use it, you can apply for an invitation code on the Alibaba Cloud website. This section introduces only how to use the MapReduce SDK. For a detailed description of the MapReduce SDK, please refer to the official Java Doc.
Note: MapReduce is still in beta. If you want to use the function, you can apply for it through the ticket system. Please specify the name of your project, and the request will be handled within 7 working days.
MapReduce is a distributed data processing model first proposed by Google. It has attracted wide attention and has been applied to many kinds of business scenarios.
The MapReduce data processing flow is divided into two stages: the Map stage and the Reduce stage, with the Map stage executed before the Reduce stage. The processing logic of both Map and Reduce is defined by the user, but it must comply with the MapReduce framework's protocol.
- Before the Map stage executes, the input data must be sliced. 'Slicing' means dividing the input data into blocks of equal size. Each block is processed as the input of a single Map Worker, so that multiple Map Workers can work simultaneously.
- After slicing, each Map Worker reads the data of its slice, performs its computation, and outputs the result to the Reduce stage. When a Map Worker outputs a record, it must specify a key for that record. The key value determines which Reduce Worker the record is sent to; the relationship between key values and Reduce Workers is many-to-one. Records with the same key are sent to the same Reduce Worker, and a single Reduce Worker may receive records with multiple key values.
- Before the Reduce stage, the MapReduce framework sorts the records by key, ensuring that records with the same key are grouped together. If the user specifies a 'Combiner', the framework calls it to aggregate records with the same key; the Combiner logic is also defined by the user. Unlike the classical MapReduce framework, in ODPS the input and output parameters of the Combiner must be consistent with those of the Reduce. This processing step is usually called 'Shuffle'.
- The next stage is Reduce. Records with the same key are shuffled to the same Reduce Worker, and a single Reduce Worker may receive records from multiple Map Workers. Each Reduce Worker executes the Reduce operation on all records sharing a key, so that through Reduce processing the multiple records of one key become a single value.
- This is only a brief introduction to the MapReduce framework; consult other references for more details.
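The stages above can be sketched in plain Java. This is an illustrative simulation using standard collections, not the ODPS SDK; the method names (`map`, `shuffle`, `reduce`) are made up for this example:

```java
import java.util.*;

public class MapReduceSketch {
    // Map stage: turn one input record into <key, value> pairs (here <word, 1>).
    static List<Map.Entry<String, Integer>> map(String record) {
        return List.of(Map.entry(record, 1));
    }

    // Shuffle: group the pairs by key; a TreeMap also keeps the keys sorted,
    // mimicking the framework's sort-by-key step.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce stage: all values of one key become a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of("1", "2", "1", "3", "2", "1");
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String record : input) mapped.addAll(map(record));
        TreeMap<String, List<Integer>> grouped = shuffle(mapped);
        grouped.forEach((key, values) -> System.out.println(key + "\t" + reduce(values)));
    }
}
```

In a real MapReduce run, the Map, Shuffle, and Reduce steps execute on different workers and machines; this sketch only shows the data flow between them.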
The following takes WordCount as an example to explain the stages of ODPS MapReduce. Suppose there is a text file 'a.txt' in which each line contains a number, and we need to count how many times each number appears. A number in the text is called a 'Word', and its number of appearances is called its 'Count'. To implement this with ODPS MapReduce, we go through the steps described in the following figure:
- First, slice the text and take the data in each slice as the input of a single Map Worker.
- The Map stage processes the input: for each number read, set its Count to 1 and output a <Word, Count> pair, taking 'Word' as the key of the output record.
- In the first phase of the Shuffle stage, the output of each Map Worker is first sorted by key (the Word value). After sorting, the Combine operation is executed: the Counts of records with the same key (Word value) are accumulated to form new <Word, Count> pairs. This process is called 'combiner sorting'.
- In the later phase of Shuffle, the data is transmitted to the Reduce Workers. After receiving the data, each Reduce Worker sorts it by key again.
- When processing the data, each Reduce Worker uses the same logic as the Combiner: it accumulates the Counts of records with the same key (Word value) to obtain the output result.
- Because data in ODPS is stored in tables, the input and output of ODPS MapReduce can only be tables. User-defined output is not allowed, and no file-system-like interface is provided.
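The WordCount walkthrough above can be simulated in plain Java. The sketch below (again using standard collections, not the ODPS SDK) models two slices of 'a.txt' processed by two Map Workers, each applying the Combiner locally, followed by a Reduce whose logic is the same as the Combiner's:

```java
import java.util.*;

public class WordCountSim {
    // One Map Worker: emit <word, 1> for each number in the slice, then
    // combine locally by accumulating Counts of the same Word (sorted by key).
    static SortedMap<String, Integer> mapAndCombine(List<String> slice) {
        SortedMap<String, Integer> local = new TreeMap<>();
        for (String word : slice) local.merge(word, 1, Integer::sum);
        return local;
    }

    // Reduce: the same accumulation logic as the Combiner, applied across
    // the combined output of all Map Workers.
    static SortedMap<String, Integer> reduce(List<SortedMap<String, Integer>> combined) {
        SortedMap<String, Integer> result = new TreeMap<>();
        for (SortedMap<String, Integer> partial : combined) {
            partial.forEach((word, count) -> result.merge(word, count, Integer::sum));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two slices of a.txt, each handled by its own Map Worker.
        List<String> slice1 = List.of("1", "2", "1");
        List<String> slice2 = List.of("2", "3", "1");
        SortedMap<String, Integer> result =
            reduce(List.of(mapAndCombine(slice1), mapAndCombine(slice2)));
        System.out.println(result); // {1=3, 2=2, 3=1}
    }
}
```

Note how the Combiner shrinks each Map Worker's output before it is shuffled, which is why ODPS requires the Combiner's input and output to be consistent with the Reduce's.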
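To show how table input and output are wired up, here is a sketch of a job driver following the ODPS MapReduce SDK's `JobConf`/`InputUtils`/`OutputUtils` pattern. Treat the exact signatures as an assumption and check the official Java Doc; `WordCountMapper` and `WordCountReducer` are hypothetical classes assumed to be defined elsewhere, and the table names are placeholders:

```java
import com.aliyun.odps.data.TableInfo;
import com.aliyun.odps.mapred.JobClient;
import com.aliyun.odps.mapred.conf.JobConf;
import com.aliyun.odps.mapred.utils.InputUtils;
import com.aliyun.odps.mapred.utils.OutputUtils;
import com.aliyun.odps.mapred.utils.SchemaUtils;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf();
        // Hypothetical Mapper/Reducer classes, assumed to be defined elsewhere.
        job.setMapperClass(WordCountMapper.class);
        // In ODPS, the Combiner's input/output must match the Reduce's,
        // so the Reducer class can be reused as the Combiner.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        // Schemas of the intermediate <Word, Count> records.
        job.setMapOutputKeySchema(SchemaUtils.fromString("word:string"));
        job.setMapOutputValueSchema(SchemaUtils.fromString("count:bigint"));
        // Input and output can only be ODPS tables (placeholder names).
        InputUtils.addTable(TableInfo.builder().tableName("wc_in").build(), job);
        OutputUtils.addTable(TableInfo.builder().tableName("wc_out").build(), job);
        JobClient.runJob(job);
    }
}
```

Running such a job reads every record of the input table and writes the aggregated <Word, Count> results into the output table.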