
Improve Scalability with Partitioning - Part 7 of About Distributed Systems

Learn how partitioning lets you store, read, and use massive data, whether files, key-value records, or JSON documents in systems like ElasticSearch.

*Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统. All rights reserved to the original author.*

Divide and Conquer

In the previous blogs, I repeatedly mentioned a concept - "divide and conquer".

Divide and conquer exists to solve the problem of data being too large to store and too slow to process on a single machine. It is also the core problem that basic distributed systems set out to solve.

Put another way, from the perspective of programs and services, it solves the (horizontal) scalability problem.

  • The distributed storage frameworks represented by HDFS solve the data storage scalability problem by dividing data into fixed-size blocks and keeping metadata that records each block's location. As long as there are enough hard drives, storage can grow indefinitely.
  • The distributed computing frameworks represented by MapReduce solve the data computing scalability problem by dividing the computing logic into Mappers and Reducers. As long as there are enough CPUs and memory, computation can keep scaling.

Therefore, the most basic problem of distributed systems, scalability, is solved by the general method of partitioning.

In different systems, partitions may go by other names, such as block, region, bucket, shard, stage, or task. The names differ, but the story is the same.

As we mentioned in the fifth part of the series, the segmentation of computing logic is equivalent to the segmentation of data. Therefore, we focus on the scalability of data to understand common partitioning schemes.

Typically, we can divide data into three types to study:

  • File data: files of any format on a file system, such as text files on HDFS.
  • Key-value data: data with a primary key, such as rows in MySQL.
  • Document data: JSON-like data that, unlike key-value data, has no primary key in the business sense, such as documents in ElasticSearch.

Specifically, for scalability, we focus on the following points:

  • Partitioning, that is, how the data is logically divided
  • Localization, that is, how the data is physically distributed
  • Rebalance, that is, how the data is physically redistributed after the number of nodes changes

File Data


File data is the lowest-level and most flexible kind of data. Other types of data must ultimately exist as files at the bottom layer (leaving aside pure in-memory data, which we will not deal with here).

It is precisely because of this feature that the partition of file data can only be done at a lower level and has nothing to do with the application.

Therefore, whether in the various file systems supported by Linux or in HDFS in the distributed world, data is segmented into blocks of fixed (or near-fixed) size.

For distributed systems, an extra layer of metadata is needed to record the correspondence between files and blocks, as well as between blocks and machines.

Localization & Rebalance

Since blocks have no application-layer meaning, the distribution of file data (localization) is close to random and is driven mainly by storage-space considerations.

This is easy to understand, and there are two main considerations:

  • Sufficient free space must be guaranteed, otherwise the service or even the machine may fail.
  • Data distribution must be kept balanced, because computation follows the data, and a balanced distribution makes more efficient use of computing and IO resources.

Rebalance after adding or removing nodes is also very simple to implement.

In essence, you only need to copy the data to the target machine and then update the mapping in the metadata.

Metadata modification is lightweight, but moving the data generates a lot of IO. You can therefore schedule the work away from business peak hours, or limit the movement speed, to reduce resource contention with business programs.

HDFS, for example, provides trigger-threshold settings and an automatic rebalance function, which makes scaling out and scaling in much easier.
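To make the mechanism concrete, here is a hypothetical sketch (not HDFS's actual code) of the two mappings the metadata keeps and why rebalance is cheap on the metadata side. The file name, sizes, and node names are all made up for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS's default block size, 128 MB

def split_into_blocks(file_name, file_size, block_size=BLOCK_SIZE):
    """Partitioning: logically divide a file into fixed-size blocks."""
    num_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    return [f"{file_name}#blk{i}" for i in range(num_blocks)]

def rebalance(block_to_node, blocks, dst):
    """Rebalance: copy block bytes to dst (heavy IO), then update metadata."""
    for blk in blocks:
        # ... copy the block's data to dst, ideally rate-limited ...
        block_to_node[blk] = dst  # the metadata change itself is cheap

# A 300 MB file becomes 3 blocks; localization is just this mapping.
blocks = split_into_blocks("/logs/app.log", 300 * 1024 * 1024)
block_to_node = {blk: "node1" for blk in blocks}
rebalance(block_to_node, blocks[:1], "node2")
```

The key point the sketch shows: data movement is the expensive part, while the mapping update is a dictionary write.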

Key-Value data


Key-value data is a type of data we deal with frequently. Its biggest difference from file data is that the key-value structure carries application-layer meaning, freeing us from the constraints of the bottom layer and letting us do a great deal at the application layer.

For partitioning specifically, we no longer need to work at the block level. Key-value data is organized around keys, so we partition in units of keys.

The first method that is easy to think of is to split by key range.

For example, data keyed by mobile phone number can easily be segmented like this:

  • ........
  • 13100000000 - 13199999999
  • 13200000000 - 13299999999
  • 13300000000 - 13399999999
  • ........

HBase adopts this method for partitioning.
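A minimal sketch of range partitioning, in the spirit of HBase's region boundaries (the split points below are hypothetical, not HBase API calls). Each partition covers the keys from one boundary up to, but not including, the next.

```python
import bisect

# Hypothetical region boundaries; partition 0 holds everything below the
# first split point, partition 1 holds [13200000000, 13300000000), etc.
split_points = ["13200000000", "13300000000", "13400000000"]

def range_partition(key, splits=split_points):
    """Return the index of the partition whose range contains the key."""
    return bisect.bisect_right(splits, key)
```

Because keys inside a partition stay sorted, range scans over a number segment only touch one or a few adjacent partitions.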

But this easily leads to uneven data distribution. For example, a number segment such as 135 will hold a lot of data, while a segment such as 101 may hold no data at all.

The root cause is that the key of partitioning has business meaning, and the business itself may be unbalanced.

That is easy to handle: make the partition key irrelevant to the business.

In some scenarios, a simple transformation achieves this effect.

For example, with the mobile phone numbers above, reversing the number and using the result as the partitioning key solves the data skew (imbalance) very well.

In more general scenarios, if you want to scatter the data, two methods are commonly used:

  • Adding random numbers. This scatters data most thoroughly, but makes it impossible to locate a key precisely afterwards; only range scans remain.
  • Hashing. A hash digest scatters data to a reasonable degree, and its deterministic output ensures that later accesses can still be located precisely.

Therefore, the usual choice is to hash the key, after which segmentation by range can still be applied to the hashed values.

In a broad sense, reversing a mobile phone number can also be regarded as a hash function. More commonly used standard hash algorithms include MD5, SHA, and so on.
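A tiny sketch of hash partitioning. MD5 is heavier than needed here, but it has the property that matters: determinism, so the same key always maps to the same partition and point lookups still work (unlike the random-number approach). The key and partition count are made-up examples.

```python
import hashlib

def hash_partition(key, num_partitions):
    """Map a key to a partition via a deterministic hash digest."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions
```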

Hash does solve the problem of uneven distribution, but it also loses one of the benefits of range: querying by range.

After hashing, a range query has to be spread across multiple partitions; query performance suffers greatly, and the ordering between rows is lost.

So there is a compromise: the so-called compound primary key. With a key like key1_key2, only key1 is hashed for partitioning, while key2 serves range queries.

For example, consider a forum scenario. To query all posts of a user on a certain day, you can design a primary key such as (user_id, timestamp); then a range query like scan(user_id, start_timestamp, end_timestamp) easily gets the results.

Cassandra uses this approach for partitioning.
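The forum example can be sketched as follows. This is a toy in-memory model, not Cassandra's API: user_id (key1) is hashed to pick the partition, and timestamp (key2) orders rows within it, so the range scan stays local to one partition. All names and data are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4
partitions = defaultdict(list)  # partition index -> (user_id, ts, post) rows

def partition_of(user_id):
    """Only key1 (user_id) determines the partition."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

def write(user_id, timestamp, post):
    partitions[partition_of(user_id)].append((user_id, timestamp, post))

def scan(user_id, start_ts, end_ts):
    """key2 (timestamp) supports range queries within a single partition."""
    rows = sorted(partitions[partition_of(user_id)])
    return [post for uid, ts, post in rows
            if uid == user_id and start_ts <= ts <= end_ts]

write("alice", 100, "hello")
write("alice", 200, "world")
write("bob", 150, "hi")
```

A real store would keep rows sorted on disk (e.g. in SSTables) rather than sorting at read time; the sort here just keeps the sketch short.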

Localization & Rebalance

Since the partitioning of key-value data carries business meaning, we can no longer treat it like file data and consider storage space alone.

Of course, our localization strategy must not break the rule that data in the same range lives in the same partition.

Typically, there are several options.

The first is hash mod N to determine which partition the data should be placed in, where N is the number of nodes.

The benefit of this approach is that there is no metadata management cost, because the mapping is no longer stored as data but computed by logic.

The disadvantage is that it is very inflexible. Once the number of nodes changes, rebalance may have to move a large amount or even all of the data.

The root cause is that the partitioning function carries the variable N, so when the number of nodes changes, rebalance disrupts localization.
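A quick, hypothetical illustration of how badly "hash mod N" rebalances: growing a cluster from 4 to 5 nodes remaps most keys. Only keys whose hash gives the same remainder mod 4 and mod 5 stay put, which works out to about 20% of them.

```python
import hashlib

def node_of(key, n):
    """hash mod N localization: the node is computed, not stored."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"key-{i}" for i in range(1000)]
# Count keys whose node changes when N goes from 4 to 5.
moved = sum(1 for k in keys if node_of(k, 4) != node_of(k, 5))
# Roughly 80% of keys move, even though only one node was added.
```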

The fix is simple: remove the variable. This gives the second way, a fixed number of partitions. No matter how much the number of nodes increases or decreases, the number of partitions stays the same.

In this way, when nodes are added or removed, it is enough to move a small number of partitions. The downside is, of course, the overhead of metadata management.

But the metadata is usually not large, so things like ElasticSearch, Couchbase, etc. have adopted this scheme.

There is another problem with fixing the number of partitions: the number has to be chosen well, or the data will eventually have to be redistributed anyway. But what number is appropriate?

Set it to 100,000 at the beginning, and excessive metadata management and synchronization costs drag the system down from day one.

Set it to 100, which works at first; but after a year the data volume has grown tenfold, and 100 is no longer enough.

Therefore, there is no standard answer, depending on the business scenario, and the same business scenario may also change over time.

What if we can't find a suitable value?

Then don't! If the enemy does not move, I will not move, and if the enemy moves, I will also move.

So there is a third way: a dynamic number of partitions.

  • After adding nodes, split old partitions and move some of them to the new nodes.
  • After removing nodes, merge old partitions and move them to the remaining available nodes.

The scope of data movement is only the affected part of the partition.

In this way, you will no longer be afraid of rebalance.

But this is not without problems:

  • The operating cost and performance impact of split and merge cannot be avoided.
  • The trigger conditions for split and merge should be carefully designed, and the data volume, number of files, file size, etc. of the partition may need to be considered.
  • Although partitioning is dynamic, there must still be an initial value, otherwise problems such as write hot spots are enough to cause headaches. Exposing a pre-partitioning function to developers is a good choice.

HBase takes this approach.
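The split-and-merge mechanics can be sketched as a toy model (the threshold and keys are hypothetical; real systems like HBase weigh data volume, file counts, and file sizes when deciding to split).

```python
SPLIT_THRESHOLD = 4  # hypothetical: split when a partition holds > 4 keys

def maybe_split(partition):
    """partition: a sorted list of keys. Split at the midpoint key when it
    exceeds the threshold, yielding two smaller adjacent partitions."""
    if len(partition) <= SPLIT_THRESHOLD:
        return [partition]
    mid = len(partition) // 2
    return [partition[:mid], partition[mid:]]

def merge(left, right):
    """Shrinking the cluster merges adjacent partitions back together."""
    return left + right

parts = maybe_split(["a", "b", "c", "d", "e", "f"])
```

Only the partition being split or merged moves, which is why rebalance stays cheap under this scheme.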

Document Data


Although document data does not have a primary key in the business sense, it usually has a unique internal doc_id, just as we often have an auto-incrementing id as the primary key in a relational database.

In this way, document data becomes a key-value structure like {doc_id: doc}, and the partitioning methods from key-value data naturally carry over.

The most common is the mod N approach, as in ElasticSearch.

Localization & Rebalance

Another point similar to relational databases is that document data is usually queried through a secondary index, similar to a search.

In this way, data with the same secondary-index value may appear on different partitions, so queries can only be served with a map + reduce style scatter-gather.

Obviously, this is unfriendly to read-heavy scenarios, since every query must be broadcast to all partitions.

Therefore, there is a way to use the secondary index as the so-called routing key to read and write data. Of course, this optimization of localization will inevitably affect partitioning.

In this way, data sharing the same secondary-index value is written to a fixed partition, which solves the read-amplification problem.

Of course, in the scenario of multiple secondary indexes, with different routing keys, you may have to write multiple copies of data.
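A hypothetical sketch of routing-key writes (a toy model, not the ElasticSearch API): documents sharing the same secondary-index value, here "category", are routed to one partition, so a query on that value no longer broadcasts to every partition.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4
shards = defaultdict(list)  # partition index -> documents

def route(routing_key):
    """The routing key, not doc_id, decides which partition a doc lands in."""
    return int(hashlib.md5(routing_key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def index_doc(doc, routing_field="category"):
    shards[route(doc[routing_field])].append(doc)

def query(category):
    """Only the single shard owning this routing key is searched."""
    return [d for d in shards[route(category)] if d["category"] == category]

index_doc({"doc_id": 1, "category": "db"})
index_doc({"doc_id": 2, "category": "db"})
index_doc({"doc_id": 3, "category": "ml"})
```

As the text notes, with multiple secondary indexes each needing its own routing key, data may have to be written more than once.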

As for the access of the secondary index, there are two implementations of the so-called document local index and term global index, which will not be expanded here.


  • Divide and conquer is the core idea of distributed systems.
  • Its implementation is partitioning, which solves the most basic problem of distributed systems: scalability.
  • Scalability involves three important issues: partitioning, localization, and rebalance.
  • File data is usually partitioned approximately randomly; localization is provided by metadata, and the scope of rebalance is controllable.
  • Key-value data carries more business meaning, so partitioning usually starts from the business meaning of the key; the pros and cons of the different methods are clear.
  • Document data, thanks to doc_id, can borrow partitioning from key-value data, but must also consider the impact of secondary indexes.


This blog, including the previous ones, are essentially solving one of the core problems of distributed systems - the problem of scalability. The solution is partitioning.

Next blog, we will take a look at another core issue of distributed systems - availability.

Till now, we've walked through the advantages of distributed systems. Now it is finally time to face the many problems they bring.

Thanks for reading, leave your thoughts in the comment section below and I will see you next week!

This is a carefully conceived series of 20-30 articles. I hope to let everyone have a basic and core grasp of the distributed system in a story-telling way. Stay Tuned for the next one!
