A detailed explanation of HDFS, the core storage architecture of Hadoop
1. Background and Definition
As data volumes grow ever larger, a single operating system can no longer store all the data, so the data is spread across many disks on many machines. Managing and maintaining files across machines by hand is extremely inconvenient, so a system is needed to manage the files on multiple machines: a distributed file system. HDFS is one such system.
HDFS is suited to write-once, read-many scenarios. It does not support modifying a file in place; it only supports appending to the end of a file.
HDFS adopts a streaming data access model: like flowing water, the data does not arrive all at once but "flows" in bit by bit and is processed bit by bit. If processing only started after all of the data had arrived, the latency would be very large and a great deal of memory would be consumed.
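The streaming idea above can be sketched with a short, illustrative Python snippet (plain local file objects, not the real HDFS client API): data is consumed in fixed-size chunks, so memory use stays bounded no matter how large the file is.

```python
import io

def stream_chunks(f, chunk_size=4):
    """Yield fixed-size chunks from a file-like object without
    ever loading the whole file into memory."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return
        yield chunk

# Process the "file" piece by piece; peak memory is bounded by chunk_size.
data = io.BytesIO(b"abcdefghij")          # stands in for a large file
sizes = [len(c) for c in stream_chunks(data)]
print(sizes)  # [4, 4, 2]
```

Each chunk can be processed and discarded as soon as it arrives, which is why streaming access keeps both latency and memory usage low.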
2. Why choose HDFS to store data
High fault tolerance: data is automatically saved in multiple replicas, and lost replicas are automatically recovered.
Suitable for batch processing: the computation is moved to the data rather than the data to the computation, and data locations are exposed to the computing framework.
Suitable for big data processing: data at the GB, TB, and even PB scale; file counts beyond a million; clusters of 10K+ nodes.
Streaming file access: write once, read many times; data consistency is guaranteed.
Can be built on cheap commodity machines: reliability is improved through multiple replicas, and fault-tolerance and recovery mechanisms are provided.
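The "multiple replicas" point above has a direct storage cost, which a tiny arithmetic sketch makes concrete. The replication factor of 3 is HDFS's default (`dfs.replication`); the function name here is invented for illustration.

```python
def raw_storage_needed(logical_bytes, replication=3):
    """Total raw disk space consumed when every block is stored
    `replication` times (3 is the HDFS default)."""
    return logical_bytes * replication

TB = 1024 ** 4
print(raw_storage_needed(10 * TB) / TB)  # 10 TB of data -> 30.0 TB raw
```

This trade is deliberate: cheap disks fail often, and paying 3x in raw capacity is what buys automatic recovery when a replica is lost.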
3. Disadvantages of HDFS
Inability to store small files efficiently: because the NameNode keeps the file system's metadata in memory, the number of files the system can hold is limited by the size of the NameNode's memory. Roughly speaking, each file, directory, and block occupies about 150 bytes, so 1 million files, each occupying one block, need at least 300 MB of memory. Millions of files are still feasible today, but at the billion scale this cannot be achieved with current hardware. Another problem is that the number of map tasks is determined by the number of input splits, so when MapReduce is used to process a large number of small files, too many map tasks are generated and the task-management overhead lengthens the job. For example, to process 10,000 MB of data: if each split is 1 MB, there will be 10,000 map tasks and a great deal of scheduling overhead; if each split is 100 MB, there will be only 100 map tasks, each doing more work, and the management overhead is much reduced.
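The two calculations in this paragraph can be checked with a short sketch. The ~150 bytes per metadata object is the rule of thumb quoted above, and both function names are invented for illustration.

```python
BYTES_PER_OBJECT = 150  # rough NameNode memory per file/directory/block

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Rough NameNode heap needed: one metadata object per file
    plus one per block, ~150 bytes each."""
    objects = num_files + num_files * blocks_per_file
    return objects * BYTES_PER_OBJECT

def num_map_tasks(total_mb, split_mb):
    """Number of map tasks when the input is divided into fixed-size
    splits (ceiling division: a partial split still gets a task)."""
    return -(-total_mb // split_mb)

print(namenode_memory_bytes(1_000_000) / 1e6)  # 300.0 MB for 1M one-block files
print(num_map_tasks(10_000, 1))                # 10000 tasks with 1 MB splits
print(num_map_tasks(10_000, 100))              # 100 tasks with 100 MB splits
```

The asymmetry is the key point: metadata cost grows with the number of objects, not with total data size, so a million tiny files cost the NameNode as much memory as a million huge ones.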
Does not support multiple writers or arbitrary file modification: a file can have only one writer at a time, multiple clients cannot write to it concurrently, and writes can only occur at the end of the file; only appending is supported, not in-place modification.
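The append-only restriction can be illustrated with local files as an analogy; this is not the HDFS client API (real appends go through the Java `FileSystem.append` call), just Python append mode standing in for the same semantics.

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "hdfs_append_demo.txt")

with open(path, "w") as f:   # create: the file's single initial writer
    f.write("record1\n")
with open(path, "a") as f:   # later writes may only append at the end
    f.write("record2\n")     # earlier bytes can never be overwritten

with open(path) as f:
    content = f.read()
print(content)               # records appear in write order
os.remove(path)
```

Restricting writes to a single appender at the end of the file is what lets HDFS keep replicas consistent cheaply: replicas never have to reconcile conflicting in-place edits, only agree on the file's length.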