Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.
In the first blog, I mentioned how parallel computing could solve the problem of slow calculation. This blog will help you understand how to divide a problem into smaller tasks for execution and merge their outcomes to obtain the final result.
In theory, the division of calculation has no necessary connection to the division of storage. In practice, however, it is built on the divide-and-conquer of storage; otherwise, each task would have to process a large amount of data, demanding high performance and a lot of resources.
In the previous two blogs, I discussed distributed storage as background knowledge. Today, I will talk about what the simplest distributed computing system looks like.
The division of calculation can be understood from two major perspectives:
● The divide and conquer of computational logic
● The division of the calculation results
The divide and conquer of computational logic is the parallel execution of code. To ensure that the results are correct, the code run in parallel across multiple machines or processes must be identical; only the data being processed differs. This is why the division of calculation is premised on the division of storage. In other words, dividing computational logic means running different instances of the same code in parallel, independently of each other.
While discussing distributed storage engines, I covered how to split data. Now I turn to the distributed computing framework: how do we split the computing logic?
As I said, the only difference between the parallel instances of the computing logic is the data being processed, so splitting the calculation logic is equivalent to splitting the data. Of course, the purpose of splitting data here differs from that in distributed storage, where data is split into "block" units in order to store massive amounts of it. Here, the purpose is to improve the parallelism of calculations. Since the calculation logic lives at the application layer, we split data with the file as the basic unit.
For instance, if files are too small, several of them can be processed together by one task. If a file is too large, it can be split into several parts for parallel processing (see the discussion of file compression and splitting in the previous blog).
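The grouping and splitting described above could be sketched roughly as follows. This is only an illustration, not how any particular framework does it: the function `plan_splits`, the `target_size` parameter, and the file names are all hypothetical, and every file is assumed to be splittable at arbitrary byte offsets.

```python
def plan_splits(file_sizes, target_size):
    """Group small files and cut large files into byte ranges.

    file_sizes: dict of (hypothetical) file name -> size in bytes.
    Returns a list of tasks; each task is a list of (name, start, end) ranges.
    """
    tasks = []
    current, current_size = [], 0
    for name, size in file_sizes.items():
        if size > target_size:
            # Large file: cut it into ranges of at most target_size bytes,
            # one task per range
            for start in range(0, size, target_size):
                tasks.append([(name, start, min(start + target_size, size))])
        else:
            # Small file: pack it with other small files until the task is full
            if current and current_size + size > target_size:
                tasks.append(current)
                current, current_size = [], 0
            current.append((name, 0, size))
            current_size += size
    if current:
        tasks.append(current)
    return tasks


files = {"a.log": 10, "b.log": 20, "c.log": 250, "d.log": 30}
tasks = plan_splits(files, target_size=100)
for task in tasks:
    print(task)
```

With these sample sizes, the large file `c.log` becomes three tasks and the three small files share a single task, so four tasks can run in parallel.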
The division of calculation results is a different matter. We process data to get results, and distributed computing improves performance by running the same code in parallel, with each code instance processing a portion of the data. However, there can only be one final result, so merging the partial results from each instance becomes crucial. In effect, logic that originally ran sequentially on a single machine in a single thread is forced to split into two parts. Let's take a look at a simple example.
Suppose we have an array of 100 trillion integers and want to multiply each element by 2 and then sum the results. For convenience, take Python as an example and use 9 numbers instead of 100 trillion:
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
When the amount of data is small, we can process it on a single machine with a single thread, like this:
`result = 0`
`for i in a:`
`    result += i * 2`
However, in a distributed scenario, you have to split it into two steps (the following is just for demonstration, the code here is still running on a single machine).
First step: multiply by 2:
`B = list(map(lambda x: x * 2, a))` # output: [2, 4, 6, 8, 10, 12, 14, 16, 18]
Second step: sum the results:
`from functools import reduce`
`reduce(lambda x, y: x + y, B)` # output: 90
The entire process uses two functions: map and reduce. As the names suggest, map transforms each element and reduce merges the results. Map and reduce are typical concepts of functional programming. The distributed computing framework MapReduce, a component of the mainstream distributed framework “Hadoop”, is a multi-machine version of the idea of map and then reduce.
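To make the "map and then reduce" idea a little more concrete, here is a minimal sketch in which the array is cut into chunks and each chunk is mapped by a separate worker. It is an assumption-laden toy: threads stand in for machines, and the helper names `split` and `map_task` are hypothetical, not part of Hadoop or any other framework.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_task(chunk):
    # Map side: every worker runs the same code on its own slice of the data
    return [x * 2 for x in chunk]


def split(data, n_chunks):
    # Cut the input into contiguous slices of near-equal size
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
chunks = split(a, 3)

# Threads stand in for separate machines here
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_task, chunks))

# Reduce side: merge all map outputs into the single final result
total = reduce(lambda x, y: x + y, (v for part in mapped for v in part))
print(total)  # 90
```

The map workers never need to talk to each other; only the reduce step sees all of the partial outputs.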
The question is, at what granularity can MapReduce split the task?
Map and reduce are handled differently. As mentioned above, for map, the division of computing logic depends on the division of input data at the file level. Specifically, the number of input splits, determined by the file format and block size, decides how many map tasks are started automatically.
For instance, if a file is splittable, it can be divided into roughly file size / block size splits, each handled by its own map task. If a file is not splittable, even at 100 times the block size it can only be handled by a single map task (a situation that should be avoided).
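The rule of thumb above can be written down as a small calculation. This is a simplified sketch under stated assumptions: `num_map_tasks` is a hypothetical helper, the 128 MB block size is only an example value, and real input formats may apply further rules (such as minimum and maximum split sizes) that are ignored here.

```python
def num_map_tasks(file_size, block_size, splittable):
    """Estimate how many map tasks one file produces (simplified)."""
    if file_size == 0:
        return 0
    if not splittable:
        # An unsplittable file goes to a single map task, however large it is
        return 1
    # Splittable: about one task per block, rounding up for the last partial block
    return -(-file_size // block_size)


MB = 1024 * 1024
BLOCK = 128 * MB  # example block size, an assumption for this sketch

print(num_map_tasks(1024 * MB, BLOCK, splittable=True))    # 8
print(num_map_tasks(300 * MB, BLOCK, splittable=True))     # 3
print(num_map_tasks(100 * BLOCK, BLOCK, splittable=False)) # 1
```

The last line is the case to avoid: a file 100 blocks long that still gets no parallelism in the map stage.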
Reduce, on the other hand, works differently.
The output of the map is the input of the reduce. We often cannot control the input of the map, but its output is completely under our control. The input of reduce is therefore not as constrained as the input of map, and the parallelism of the reduce stage, that is, the number of reducers, can be set directly.
The number of reducers is usually also the number of final result files. When the number of reducers is set to 0, the reduce stage is skipped and the program ends after the map stage.
Considering the running speed and resource consumption, the specific number of reducers can be set and then flexibly adjusted according to the actual situation.
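One way to picture how the number of reducers shapes the output is the toy sketch below. It assumes map output in the form of (key, value) pairs and assigns each key to a reducer by hashing, which is a common approach but is stated here as an assumption; `partition` and `run_job` are hypothetical names, and each returned element stands in for one result file.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Stable hash, so a given key always lands on the same reducer
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(map_output, num_reducers):
    """Return one result 'file' per reducer, or the raw map output if there are none."""
    if num_reducers == 0:
        # Map-only job: the reduce stage is skipped entirely
        return [map_output]
    buckets = [defaultdict(int) for _ in range(num_reducers)]
    for key, value in map_output:
        # Each reducer sums the values of the keys assigned to it
        buckets[partition(key, num_reducers)][key] += value
    return [dict(b) for b in buckets]


map_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1)]
print(len(run_job(map_output, 2)))  # 2 result files
print(run_job(map_output, 0))       # map output passed through unchanged
```

Raising `num_reducers` spreads the keys over more workers and produces more result files; lowering it does the opposite.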
Similar to the distributed storage engine, the simplest distributed computing system is still based on the idea of division. To save resources, the division of calculation is premised on the storage division. The division of calculations can be understood from two perspectives: the division of computational logic and the division of computational results.
The division of computing logic means executing the same code on multiple machines in parallel. The division of calculation results means the calculation process is split into two stages, map and reduce; it follows naturally from dividing the calculation process. The parallelism of the map stage depends on how the input data is split, which is determined by the file format, compression method, block size, and more. The number of reducers can be set freely, based on running speed and resource consumption.
Additionally, increasing the calculation speed isn't enough. You also need to calculate well to get accurate results, and massive computing resources, if poorly managed, result in huge waste. Next, let's take a look at the management and scheduling of massive computing resources.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a grasp of the basics and core ideas of distributed systems in a storytelling way. Stay tuned for the next one!