MaxCompute MapReduce: the native MaxCompute interface, which is faster than the other interfaces and more convenient for program development because it does not expose the file system.
MR2 (Extended MapReduce): an extension of MaxCompute MapReduce that supports more complex job scheduling logic. Map and Reduce are implemented in the same manner as in the native interface.
|You cannot yet read data from or write data to external tables through MapReduce.|
- Search: web crawling, inverted indexes, PageRank.
- Web access log analytics:
- Analyze and mine web access and shopping behavior to deliver personalized recommendations.
- Analyze users' access behavior.
- Text statistics and analysis:
- WordCount and TF-IDF analysis of Mo Yan's novels.
- Citation analysis and statistics for academic papers and patent documents.
- Wikipedia data analysis, and so on.
- Massive data mining: mining of unstructured, spatiotemporal, and image data.
- Machine learning: supervised and unsupervised learning, and classification algorithms such as decision trees and SVMs.
- Natural language processing:
- Training and prediction on big data.
- Building a word co-occurrence matrix from a corpus, frequent itemset mining, duplicate document detection, and so on.
- Advertisement recommendation: prediction of user click-through rate (CTR) and conversion rate (CVR).
- Before Map is executed, the input data must be sliced, that is, divided into blocks of equal size. Each block is processed as the input of a single Map Worker, so that multiple Map Workers can work simultaneously.
- After slicing, multiple Map Workers can work in parallel. Each Map Worker reads its data, performs its computation, and outputs the results to Reduce. When a Map Worker outputs data, it must specify a key for each output record. The value of this key determines which Reduce Worker the record is sent to. The relationship between key values and Reduce Workers is many-to-one: records with the same key are sent to the same Reduce Worker, and a single Reduce Worker may receive records with multiple key values.
- Before the Reduce stage, the MapReduce framework sorts the records by key so that records with the same key are grouped together. If the user specifies a Combiner, the framework calls it to aggregate records with the same key; the user must define the Combiner logic. Unlike in the classical MapReduce framework, in MaxCompute the input and output parameters of the Combiner must be consistent with those of Reduce. This processing is generally called Shuffle.
- In the Reduce stage, records with the same key are shuffled to the same Reduce Worker, and a single Reduce Worker may receive data from multiple Map Workers. Each Reduce Worker performs the Reduce operation on all records that share a key, so that multiple records with the same key are reduced to a single value.
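The stages above can be sketched in plain Python. This is a conceptual model only, not the MaxCompute API: the function names, the single-process execution, and the hash-based routing of keys to Reduce Workers are all illustrative assumptions.

```python
from collections import defaultdict

def run_mapreduce(slices, map_fn, reduce_fn, num_reducers=2, combine_fn=None):
    """Conceptual MapReduce flow: Map -> (Combine) -> Shuffle -> Reduce."""
    # One bucket of keyed records per (conceptual) Reduce Worker.
    partitions = [defaultdict(list) for _ in range(num_reducers)]

    # Map stage: each slice is processed by one Map Worker.
    for data in slices:
        worker_output = defaultdict(list)
        for record in data:
            for key, value in map_fn(record):  # Map emits <key, value> pairs
                worker_output[key].append(value)
        # Shuffle: optionally combine on the map side, then route each key
        # to a Reduce Worker (many-to-one, modeled here by hashing the key).
        for key, values in worker_output.items():
            if combine_fn is not None:
                values = [combine_fn(key, values)]
            partitions[hash(key) % num_reducers][key].extend(values)

    # Reduce stage: each Reduce Worker sees all values for its keys,
    # processed in key order, and reduces them to a single value per key.
    results = {}
    for partition in partitions:
        for key in sorted(partition):
            results[key] = reduce_fn(key, partition[key])
    return results
```

Note that the Combiner and Reduce take inputs and produce outputs of the same shape, matching the MaxCompute requirement that their parameters be consistent.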
|A brief introduction to the MapReduce framework is provided above. For more details, see the relevant documentation.|
The following example uses WordCount to explain the stages of MaxCompute MapReduce.
- First, the text is sliced, and the data in each slice is fed to a single Map Worker.
- Map processes the input. For each word it reads, Map sets the Count to 1 and outputs a <Word, Count> pair, taking Word as the key of the output record.
- In the initial phase of the Shuffle stage, the output of each Map Worker is sorted by key (the Word value). After sorting, the Combine operation accumulates the Counts of records with the same key (Word value) to form new <Word, Count> pairs. This process is called combiner sorting.
- In the later phase of Shuffle, the data is transmitted to Reduce. After receiving the data, each Reduce Worker sorts it by key again.
- When processing the data, each Reduce Worker applies the same logic as the Combiner: it accumulates the Counts of records with the same key (Word value) to produce the final output.
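The WordCount walkthrough above corresponds to the following sketch. This is illustrative Python, not MaxCompute code; the function names are assumptions, and the Combiner and Reduce deliberately share one accumulation helper, as the text describes.

```python
from collections import defaultdict
from itertools import groupby

def word_count(slices):
    """WordCount across slices: Map, Combine per slice, then Shuffle + Reduce."""
    # Map: emit <Word, 1> for every word, with Word as the key.
    mapped = [[(word, 1) for line in s for word in line.split()] for s in slices]

    def accumulate(pairs):
        # Shared logic of Combiner and Reduce: sort by key (Word),
        # then sum the Counts of records with the same Word.
        pairs = sorted(pairs)
        return [(word, sum(count for _, count in group))
                for word, group in groupby(pairs, key=lambda p: p[0])]

    # Combine: aggregate within each Map Worker's own output.
    combined = [accumulate(pairs) for pairs in mapped]
    # Shuffle + Reduce: merge the partial counts from all Map Workers
    # and accumulate once more to get the final <Word, Count> result.
    return dict(accumulate([pair for pairs in combined for pair in pairs]))
```

Because the Combiner's output has the same <Word, Count> shape as its input, the Reduce step can reuse the identical accumulation logic on the merged partial counts.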
|Because data in MaxCompute is stored in tables, the input and output of MaxCompute MapReduce can only be tables. User-defined output is not allowed, and no corresponding file system interface is provided.|