MaxCompute provides a MapReduce programming interface. Users can use the Java API to write MapReduce programs that process data in MaxCompute. You must activate the MapReduce feature on the Alibaba Cloud website before you can use it. This section introduces how to use the MapReduce SDK. For a more detailed description of the MapReduce SDK, see the official Java documentation.
MapReduce is a distributed data processing model. Data processing is generally divided into two sequential stages: the Map stage and the Reduce stage. The processing logic of Map and Reduce is defined by the user, but it must comply with the MapReduce framework protocol.
Before the Map stage executes, the input data must be sliced, that is, divided into blocks of equal size. Each block is processed as the input of a single Map Worker, so that multiple Map Workers can work in parallel.
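Slicing can be pictured as cutting the input records into fixed-size chunks. The following is a minimal plain-Java sketch of this idea, not MaxCompute SDK code; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class InputSplitter {
    // Divide the input records into fixed-size slices; each slice
    // becomes the input of one Map Worker.
    public static List<List<String>> slice(List<String> records, int sliceSize) {
        List<List<String>> slices = new ArrayList<>();
        for (int i = 0; i < records.size(); i += sliceSize) {
            slices.add(records.subList(i, Math.min(i + sliceSize, records.size())));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> data = List.of("1", "2", "3", "4", "5");
        System.out.println(slice(data, 2)); // [[1, 2], [3, 4], [5]]
    }
}
```

In a real cluster the last slice may be smaller than the others, as shown here, and the framework decides the slice size for you.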
Each Map Worker performs its computation after reading the data and outputs the results to Reduce. Because a Map Worker outputs data, it must specify a key for each output record. The value of this key determines which Reduce Worker the record is sent to. The relationship between keys and Reduce Workers is many-to-one: records with the same key are sent to the same Reduce Worker, and a single Reduce Worker may receive records with multiple keys.
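The many-to-one assignment of keys to Reduce Workers is typically done by hashing the key. The sketch below illustrates the idea in plain Java; it is not the MaxCompute partitioner, and the name `partitionFor` is hypothetical:

```java
public class KeyPartitioner {
    // Hash-partition a key across numReducers Reduce Workers.
    // Equal keys always map to the same worker (many-to-one),
    // while one worker can receive many distinct keys.
    public static int partitionFor(String key, int numReducers) {
        // Mask off the sign bit so the result is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int a = partitionFor("42", 4);
        int b = partitionFor("42", 4);
        System.out.println("same key -> same partition: " + (a == b));
    }
}
```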
Before the Reduce stage, the MapReduce framework sorts the data by key so that records with the same key are grouped together. If the user specifies a Combiner, the framework calls the Combiner to aggregate records that share a key. The user must define the Combiner's logic. Unlike the classical MapReduce framework, in MaxCompute the input and output parameters of the Combiner must be consistent with those of Reduce. This processing is generally called Shuffle.
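Because the Combiner must share Reduce's input and output shape, a word-count Combiner can simply sum the counts it sees on one Map Worker. A minimal plain-Java sketch of this local aggregation follows; it is not SDK code, and the names are illustrative:

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalCombiner {
    // Aggregate one Map Worker's <Word, Count> output locally before
    // shuffle. Input and output are both <Word, Count> pairs, matching
    // the shape that Reduce expects.
    public static Map<String, Integer> combine(String[] words) {
        Map<String, Integer> out = new TreeMap<>(); // sorted by key, like shuffle
        for (String w : words) {
            out.merge(w, 1, Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(combine(new String[] {"7", "7", "3"})); // {3=1, 7=2}
    }
}
```

Running the Combiner shrinks the data each Map Worker ships across the network, which is its main purpose.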
At the Reduce stage, data with the same key is shuffled to the same Reduce Worker. A Reduce Worker may receive data from multiple Map Workers. Each Reduce Worker executes the Reduce operation on all records that share a key, so multiple records with the same key become a single value after Reduce processing.
The following example uses WordCount to explain the stages of MaxCompute MapReduce. Assume there is a text file named 'a.txt' in which each row contains a single number, and the frequency of each number needs to be counted. Each number in the text is called a 'Word', and its number of occurrences is called the 'Count'. The following steps show how to complete this task with MaxCompute MapReduce:
First, the text is sliced, and the data in each slice is input to a single Map Worker.
Map processes the input. For each number it reads, Map sets the Count to 1 and outputs a <Word, Count> pair, taking 'Word' as the key of the output record.
In the early phase of Shuffle, the output of each Map Worker is sorted by key (the Word value). The Combine operation is then executed on the sorted output, accumulating the Counts of records with the same key (Word value) to form a new <Word, Count> list. This process is known as combiner sorting.
In the later phase of Shuffle, the data is transmitted to Reduce. After receiving the data, each Reduce Worker sorts it by key again.
When processing the data, each Reduce Worker uses the same logic as the Combiner: it accumulates the Counts of records with the same key (Word value) to obtain the final output result.
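The steps above can be simulated end to end in plain Java. The sketch below models slicing, Map, Shuffle, and Reduce for WordCount in a single process; it is an illustration of the data flow, not a MaxCompute program, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountPipeline {
    // Map: emit a <Word, 1> pair for every number in the slice.
    static List<Map.Entry<String, Integer>> map(List<String> slice) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : slice) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle + Reduce: group pairs by key (a TreeMap keeps keys sorted,
    // as the shuffle does) and accumulate the Counts per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static Map<String, Integer> run(List<List<String>> slices) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<String> slice : slices) {
            shuffled.addAll(map(slice)); // each slice feeds one Map Worker
        }
        return reduce(shuffled);
    }

    public static void main(String[] args) {
        // Two slices of a.txt, one number per row.
        List<List<String>> slices = List.of(List.of("1", "2", "1"), List.of("2", "2"));
        System.out.println(run(slices)); // {1=2, 2=3}
    }
}
```

In a real job the slices are processed by separate workers and the shuffle moves data across the network; here everything runs in one JVM purely to show the data flow.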
Because data in MaxCompute is stored in tables, the input and output of MaxCompute MapReduce can only be tables. User-defined output formats are not allowed, and no file system interface is provided.