Learning about Distributed Systems - Part 4: Smart Ways to Store Data

Last time we talked about WHERE to store massive data, and this time, HOW. Massive data brings massive costs.

Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.

In the last blog, I talked about how distributed storage systems deal with data storage using HDFS.

But storing data is not enough; it is more important to store it well.

Massive data brings massive costs. If you don't store it well, you are wasting money.

Since I have already dedicated an earlier post to this very topic, I will just sort out the main points from blog 3 of the series and add another important point below.

1. Store Data Wisely

Delete Data

Deleting data sounds crude, but it is actually very useful. In many cases, a lot of data can be deleted, such as:

· Outdated data from 5 years ago that is no longer time-sensitive and may never be used again.

· Temporary data, intermediate results, and the like that someone forgot to delete after use.

· Copies of the same data stored by different business units or different people who do not know they are duplicating each other.

Such data does nothing but incur unnecessary costs.

Of course, to be able to delete data, and to dare to, we need to know very well how the data is used and how datasets relate to one another, and the corresponding processes and supporting data need to keep up.
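To make the three cases above concrete, here is a minimal sketch of how deletion candidates could be flagged from a data catalog. Everything in it is an assumption for illustration: the `Dataset` record, its fields, the sample paths, and the 5-year threshold are hypothetical, and a real system would take access times and checksums from its own metadata store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Dataset:
    """Hypothetical catalog entry; the field names are illustrative only."""
    path: str
    last_access: datetime
    checksum: str            # content fingerprint recorded by the catalog
    is_temporary: bool = False


def deletion_candidates(datasets, now, max_age=timedelta(days=5 * 365)):
    """Return {path: reason} for datasets matching one of the three cases."""
    reasons = {}
    first_seen = {}  # checksum -> path of the copy we keep
    for ds in sorted(datasets, key=lambda d: d.path):
        if now - ds.last_access > max_age:
            reasons[ds.path] = "stale"
        elif ds.is_temporary:
            reasons[ds.path] = "temporary"
        elif ds.checksum in first_seen:
            reasons[ds.path] = f"duplicate of {first_seen[ds.checksum]}"
        else:
            first_seen[ds.checksum] = ds.path
    return reasons


now = datetime(2024, 1, 1)
catalog = [
    Dataset("/warehouse/orders", now - timedelta(days=10), "aaa"),
    Dataset("/warehouse/orders_copy", now - timedelta(days=3), "aaa"),
    Dataset("/tmp/job_123", now - timedelta(days=30), "bbb", is_temporary=True),
    Dataset("/warehouse/logs_2017", datetime(2017, 6, 1), "ccc"),
]
candidates = deletion_candidates(catalog, now)
for path, reason in candidates.items():
    print(path, "->", reason)
```

The sketch only reports candidates. Whether a flagged dataset can really be removed still depends on knowing who uses it, which is exactly the knowledge the paragraph above asks for.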

Reduce Copies

Cutting copies also sounds like a no-no to those hearing it for the first time. Normally, we keep more copies for the sake of high availability. Wouldn't reducing copies reduce data availability?

That is true, so this method should NOT be used on a large scale. For example, in my company, we only cut the copies of data in the temp library. The temp library stores temporary data and has few restrictions. Lacking management and limits, it had become a "garbage dump" for data and kept growing over time.

So we set the number of copies for that library to 2. This simple act saved us millions of dollars.

We dared to do this because, on the one hand, even if the data is lost, the impact is small; on the other hand, the cluster and the network bandwidth are large enough that when individual machines go down, the missing copies can be rebuilt from other machines in a very short time.

Of course, a small risk is still a risk, so this needs to be applied with caution.

In addition, there is HDFS RAID, which in essence also reduces copies. Large companies like Facebook and Tencent have implemented HDFS RAID and run it in production. What's more, Hadoop 3.0 supports erasure coding natively.
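The savings from fewer copies are simple arithmetic. The sketch below compares the raw disk space needed for 3 replicas, 2 replicas, and a Reed-Solomon layout with 6 data units and 3 parity units. The 1,000 TB dataset size is an assumed figure, and the function names are made up for this example.

```python
def raw_storage_replicated(logical_tb: float, replicas: int) -> float:
    """Raw disk space when every block is stored `replicas` times."""
    return logical_tb * replicas


def raw_storage_erasure_coded(logical_tb: float, data_units: int,
                              parity_units: int) -> float:
    """Raw disk space when each group of data units gets extra parity units."""
    return logical_tb * (data_units + parity_units) / data_units


logical_tb = 1000.0  # assumed size of the dataset before redundancy

three_copies = raw_storage_replicated(logical_tb, 3)
two_copies = raw_storage_replicated(logical_tb, 2)
rs_6_3 = raw_storage_erasure_coded(logical_tb, data_units=6, parity_units=3)

print(f"3 replicas: {three_copies:.0f} TB raw")
print(f"2 replicas: {two_copies:.0f} TB raw, "
      f"{1 - two_copies / three_copies:.0%} less than 3 replicas")
print(f"RS(6,3):    {rs_6_3:.0f} TB raw, "
      f"{1 - rs_6_3 / three_copies:.0%} less than 3 replicas")
```

Going from 3 copies to 2 removes a third of the raw storage, while the erasure-coded layout halves it and can still lose any 3 units of a group. The price is extra computation when data has to be reconstructed. On HDFS, the replication factor of existing paths can be changed with `hdfs dfs -setrep`.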


Compression

As we all know, compression is the most important and conventional way to save space.

In the case of massive data, cost savings brought by compression will be more significant. Of course, several aspects need to be considered:

· Balance between compression/decompression speed and compression ratio. This is a typical trade-off between effectiveness and efficiency, which is why Snappy is a well-balanced choice in many scenarios.

· Splittability. In a computing model such as MapReduce, a large file that cannot be split can only be processed by a single task instead of being divided among multiple tasks running in parallel, which naturally slows down overall performance. Not all compression formats are splittable, which is another factor in the choice.

· Format selection. The most common format, TextFile, is not efficient, while formats like Avro, ORC, and Parquet are usually more efficient. Columnar formats in particular compress better: values in the same column tend to be similar, so storing them together improves the compression ratio.
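Two of the points above can be illustrated with a toy sketch. The first function estimates how many parallel tasks a single file can feed, assuming one task per 128 MB split. The second part uses plain run-length encoding as a stand-in for the encodings a columnar format applies, to show why similar values stored together compress well. The names `parallel_tasks` and `run_length_encode` and the sample data are made up for this example; real formats such as ORC and Parquet use more sophisticated encodings.

```python
import math
from itertools import groupby

SPLIT_SIZE_MB = 128  # assumed split size, matching the HDFS default block size


def parallel_tasks(file_size_mb: float, splittable: bool) -> int:
    """Number of input splits, and thus parallel tasks, one file can feed."""
    if not splittable:
        return 1  # the whole file has to go through a single task
    return max(1, math.ceil(file_size_mb / SPLIT_SIZE_MB))


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Part 1: a 10 GB file, splittable versus not splittable.
print(parallel_tasks(10 * 1024, splittable=True))
print(parallel_tasks(10 * 1024, splittable=False))

# Part 2: the same 1,000 records laid out by row and by column.
rows = [(user_id, "CN" if user_id < 500 else "US") for user_id in range(1000)]
row_layout = [field for row in rows for field in row]  # id, country, id, ...
country_column = [country for _, country in rows]      # country values only

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(country_column)
print(len(row_runs), "runs in the row layout")
print(column_runs)
```

In the row layout every value is followed by a different one, so nothing collapses. Stored as a column, the 1,000 country values shrink to two runs. As for splittability, a plain gzip file is a familiar example of a format that cannot be split, while bzip2 can.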


Tiered Storage

Everyone wants to access data faster, but speed is expensive, so we have to choose different storage media according to our needs, such as the familiar memory and hard drives. In a broader sense, using several kinds of media together can be called heterogeneous storage.

The idea of tiered storage is to select storage media according to how hot or cold the data is. The hottest data is placed in memory, the next hottest on SSD, warm data on SATA disks, and so on. HDFS supports tiered storage natively, and it can also be combined with systems like Alluxio.

Data that is accessed frequently and has high performance requirements can be placed on hot storage; it is usually relatively small in scale. Data that is accessed rarely and has low performance requirements can be placed on cold storage; it is usually large in scale.

The more storage tiers there are, the more choices and finer controls you will have when balancing performance and cost.

As with deleting data, you need sufficient knowledge of data access patterns to apply this method. Only by knowing the data inside out can you separate hot data from cold and store each on the right medium, ensuring performance while also saving costs.
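As a rough sketch of how access statistics could drive tiering, the snippet below maps a path's recent usage to one of the HDFS storage policy names (`ALL_SSD`, `HOT`, `WARM`, `COLD`) and builds the matching `hdfs storagepolicies` command without running it. The `AccessStats` record, the thresholds, and the sample paths are assumptions for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class AccessStats:
    """Hypothetical per-path access summary; the fields are illustrative."""
    path: str
    reads_last_30_days: int
    days_since_last_access: int


def choose_storage_policy(stats: AccessStats, hot_reads: int = 1000,
                          cold_after_days: int = 180) -> str:
    """Pick an HDFS storage policy name from assumed heat thresholds."""
    if stats.reads_last_30_days >= hot_reads:
        return "ALL_SSD"   # hottest data: all replicas on SSD
    if stats.days_since_last_access >= cold_after_days:
        return "COLD"      # untouched for a long time: archive storage
    if stats.reads_last_30_days == 0:
        return "WARM"      # idle recently: some replicas move to archive
    return "HOT"           # regularly used: ordinary disks (HDFS default)


def set_policy_command(path: str, policy: str) -> list:
    """Build the CLI call that would apply the policy; it is not executed."""
    return ["hdfs", "storagepolicies", "-setStoragePolicy",
            "-path", path, "-policy", policy]


samples = [
    AccessStats("/warehouse/dashboard", 5000, 0),
    AccessStats("/warehouse/orders", 20, 3),
    AccessStats("/warehouse/orders_2023", 0, 45),
    AccessStats("/warehouse/logs_2019", 0, 400),
]
for s in samples:
    print(" ".join(set_policy_command(s.path, choose_storage_policy(s))))
```

Setting a policy only affects where new blocks are placed; existing blocks have to be migrated separately, for example with the HDFS mover tool.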

2. Separate Storage and Computing

You must have heard of data locality and know how important it is for performance. The data processing frameworks we usually use, such as MapReduce and Spark, are all optimized for data locality.

Data locality, in other words the aggregation of storage and computing on the same machines, is easy to understand: it improves performance by reducing network IO as much as possible.

Achieving this aggregation is also simple in principle: moving code to the data is much easier than moving data to the code.

However, network equipment has improved so much that network IO is no longer the performance bottleneck. Nowadays a 10-gigabit network card is just the baseline; coupled with features such as NIC bonding, server network performance has improved enormously. Switch performance has likewise improved by an order of magnitude.

Over the same period, disk IO performance has only improved by a factor of 1-2, and CPU performance gains have gradually fallen behind the pace of Moore's Law.

So the bottleneck in data processing is no longer network IO but disk IO or CPU, which also makes the aggregation of storage and computing unnecessary.
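A back-of-the-envelope calculation shows the gap. The sketch compares the time to move 100 GB over a 10-gigabit link at its theoretical peak with the time to read it from one spinning disk. The 150 MB/s disk figure is an assumed ballpark, not a measurement, and a server with many disks reading in parallel would narrow the gap.

```python
def seconds_to_move(data_gb: float, throughput_mb_per_s: float) -> float:
    """Time to transfer `data_gb` gigabytes at a steady throughput."""
    return data_gb * 1000 / throughput_mb_per_s


NETWORK_MB_PER_S = 10_000 / 8  # 10 Gbit/s link, theoretical peak in MB/s
DISK_MB_PER_S = 150            # assumed sequential read speed of one HDD

data_gb = 100
network_seconds = seconds_to_move(data_gb, NETWORK_MB_PER_S)
disk_seconds = seconds_to_move(data_gb, DISK_MB_PER_S)

print(f"network: {network_seconds:.0f} s")
print(f"disk:    {disk_seconds:.0f} s")
```

Under these assumptions the single disk, not the network, is the slower part of the path, so reading the same data from a remote machine costs little extra.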

On the other hand, aggregated storage and computing have always had unavoidable drawbacks:

· Complex resource management and coordination. CPU and memory are relatively easy to manage, whereas network IO is far more complex and variable. You either ignore it or face a very complicated problem.

· Storage and computing resources drag each other down. If storage is insufficient, expanding it also adds computing resources that go to waste; if computing nodes need to be removed, the result is unwanted data migration or even data loss.

· Poor fit with cloud services. In today's mainstream general-purpose cloud services, the default disks are remote. So-called data locality then becomes meaningless, nothing but a limitation.

As a result, disaggregating storage and computing has become a natural choice.

It is also a great choice for cost savings.

When storage resources are insufficient, machines with large hard drives and a small CPU will do, avoiding the hardware cost of extra computing resources.
When computing resources are insufficient, machines with small hard drives and a large CPU will do, avoiding the hardware cost of extra storage resources.
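The cost argument can be sketched with made-up numbers. Suppose a cluster needs 960 TB more storage but no more computing power. The node specifications and prices below are purely hypothetical and only serve to show the shape of the calculation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """Hypothetical machine specification; all figures are invented."""
    name: str
    storage_tb: float
    cpu_cores: int
    price: float


def nodes_needed(extra_storage_tb: float, node: NodeType) -> int:
    """How many nodes of this type cover the extra storage demand."""
    return math.ceil(extra_storage_tb / node.storage_tb)


coupled = NodeType("coupled", storage_tb=48, cpu_cores=64, price=12_000)
storage_heavy = NodeType("storage-heavy", storage_tb=96, cpu_cores=8, price=7_000)

extra_storage_tb = 960  # assumed demand: more storage, no more compute

plans = {}
for node in (coupled, storage_heavy):
    count = nodes_needed(extra_storage_tb, node)
    plans[node.name] = {
        "nodes": count,
        "cost": count * node.price,
        "extra_cores": count * node.cpu_cores,
    }
    print(node.name, plans[node.name])
```

With these invented figures, the coupled cluster buys 1,280 CPU cores it does not need, while the storage-heavy machines cover the same demand at under a third of the cost. The same reasoning applies in reverse when only computing power is short.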


Summary

Massive data brings massive costs. The way data is stored should be adjusted wherever possible to save costs.

· Delete very old data and temporary data directly.

· For unimportant data that can't be deleted, reduce the number of copies.

· Compression is the most common way to save space. Balance compression ratio against speed, consider whether files can be split, and make good use of columnar storage.

· Use tiered storage to separate hot and cold data, which can effectively save costs.

· Data locality is less important than it used to be, and disaggregating storage and computing provides flexibility and significant cost savings.

· Data access records and data lineage are important; they are the direct basis for adjusting storage strategies.

This is a carefully conceived series of 20-30 articles. I hope to give everyone a grasp of the basics and core ideas of distributed systems in a storytelling way. Stay tuned for the next one!


Alibaba Cloud_Academy
