Learning about Distributed Systems – Part 7: Improve Scalability with Partitioning

Part 7 of this series discusses one of the core problems of distributed systems: scalability.

By Qinxia

Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统. All rights reserved to the original author.

Divide and Conquer

I have repeatedly mentioned the phrase divide and conquer in the previous articles.

The purpose of divide and conquer is to handle data that is too large to store or too slow to process on a single machine. This is the core problem a basic distributed system has to solve.

We need to solve the problem of (horizontal) scalability from the perspectives of both storage and computation.

  • The distributed storage framework represented by HDFS solves the problem of data storage scalability by dividing data into fixed-size blocks, coupled with metadata that records the location of each block. As long as there are enough hard disks, storage can keep scaling out.
  • The distributed computing framework represented by MapReduce solves the problem of data computing scalability by dividing the computing logic into mappers and reducers. As long as there are enough CPUs and memory, computation can keep scaling out.

Thus scalability, the most basic problem of distributed systems, is solved with the general method of partitioning.

Partitions go by other names in different systems, such as block, region, bucket, shard, stage, and task, but the underlying idea is the same.

As we mentioned in the fifth article, dividing the computing logic is equivalent to dividing the data. Therefore, we focus on data scalability to understand the common partitioning scheme.

Typically, we can divide the data into three types to study:

  • File Data: Files in any format on the file system (such as text files on HDFS)
  • Key-Value Data: Data with primary keys (such as data in MySQL)
  • Document Data: JSON-like data. It differs from key-value data in that there is no primary key in the business sense (such as data in Elasticsearch).

Specifically, we focus on the following points for scalability:

  • Partitioning: How is the data logically divided?
  • Localization: How is the data physically distributed?
  • Rebalance: How is the data physically redistributed after the number of nodes changes?

File Data


File data is the lowest-level and most flexible type of data. Other types of data must ultimately exist in the form of files at the bottom layer (setting aside purely in-memory data that never lands on disk).

Because of this, file data can only be partitioned at the bottom layer, with no connection to application semantics.

Therefore, whether it is the various file systems supported by Linux or HDFS in the distributed field, they all divide data into fixed-size blocks.

For a distributed system, you must add metadata to record the correspondence between files and blocks and the correspondence between blocks and machines.
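As a rough sketch of what this looks like (the names and structures here are illustrative, not HDFS's actual API), a file is cut into fixed-size blocks, and metadata maps each file to its blocks and each block to the machines holding it:

```python
# Sketch: splitting a file into fixed-size blocks plus metadata,
# in the spirit of HDFS. Names here are illustrative.
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS defaults to 128 MB blocks

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (block_id, offset, length) tuples covering the whole file."""
    blocks = []
    offset, block_id = 0, 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((block_id, offset, length))
        offset += length
        block_id += 1
    return blocks

# Metadata kept centrally: file -> blocks, and block -> machines.
file_to_blocks = {"/data/logs.txt": split_into_blocks(300 * 1024 * 1024)}
block_to_machines = {0: ["node1", "node2"], 1: ["node2", "node3"], 2: ["node1", "node3"]}
```

A 300 MB file thus becomes two full 128 MB blocks plus a 44 MB tail block, and the system only ever reasons about blocks and the two mapping tables.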

Localization & Rebalance

Blocks carry no meaning at the application layer, so the localization of file data is close to random; the main consideration is the amount of available storage space.

This is easy to understand, and there are two main considerations:

  • Each node must keep enough free space. Otherwise, writes may fail or the machine itself may run into trouble.
  • Data should be distributed evenly. Since computation follows the data, a balanced distribution helps us use computing and IO resources more efficiently.

Rebalance after adding or removing nodes is also easy to implement.

Essentially, you only need to copy the data to the target machine and update the mapping in the metadata.

The metadata update is very lightweight, but moving the data generates heavy IO. You can schedule the movement outside business peak hours, or throttle its speed to reduce resource contention with business programs.

For example, HDFS provides configurable trigger thresholds and an automatic rebalance function, which makes scaling much easier.

Key-Value Data


We deal with this type of data often. The biggest difference from file data is that the key-value structure carries application-level meaning, so we can escape the constraints of the underlying layer and do many things at the application layer.

We no longer have to partition at the block level. Key-value data centers on the key, so we partition in units of keys.

The first way is to divide by key range.

For example, data with a mobile phone number as the key can be divided easily this way.

  • ........
  • 13100000000 - 13199999999
  • 13200000000 - 13299999999
  • 13300000000 - 13399999999
  • ........

HBase uses this partitioning method.
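A minimal sketch of range partitioning, assuming each partition is described by its inclusive lower bound (the bounds and keys here are illustrative):

```python
import bisect

# Partitions described by their inclusive lower bounds, in sorted order.
lower_bounds = ["13100000000", "13200000000", "13300000000"]

def partition_for(key: str) -> int:
    """Find the rightmost partition whose lower bound is <= key."""
    return bisect.bisect_right(lower_bounds, key) - 1

print(partition_for("13123456789"))  # -> 0
print(partition_for("13255556666"))  # -> 1
```

Because the bounds are sorted, locating the owning partition is a binary search, and a range query maps to a contiguous run of partitions.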

However, this can easily lead to uneven data distribution. For example, number segments like 135 will have a lot of data, while number segments like 101 may have no data at all.

The root cause is that the partition key has business meaning, and the business itself may be unbalanced.

The fix, then, is to make the partition key irrelevant to the business.

In some scenarios, a simple transformation is enough.

For example, reversing the mobile phone number and using the result as the partition key solves the imbalance, because the trailing digits are far more evenly distributed than the prefix.

If you want to scatter the data in more general scenarios, the following two methods are more common:

  1. Add random numbers: This is the most thorough way to scatter data, but you can no longer locate a key precisely; you can only scan a range.
  2. Hashing: A hash digest scatters keys to a certain degree, and its deterministic output ensures that later accesses can still be located precisely.

Therefore, a common practice is to hash the key first and then partition by ranges of the hash value.

Broadly speaking, reversing the mobile phone number can be regarded as a hash function. More commonly used standard hash algorithms include MD5 and SHA.
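A sketch of deterministic hash partitioning using MD5 (the function name and partition count are illustrative):

```python
import hashlib

def hash_partition(key: str, num_partitions: int) -> int:
    """Deterministically map a key to a partition via an MD5 digest."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Same key -> same partition (point lookups still work), while adjacent
# keys scatter across partitions (the distribution evens out).
p1 = hash_partition("13512340000", 8)
p2 = hash_partition("13512340001", 8)
```

The determinism is what distinguishes this from adding random numbers: any reader can recompute the partition from the key alone.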

Hash solves the problem of uneven distribution but also loses one of the benefits of range: querying by range.

After the hash is completed, range queries have to be spread to multiple partitions, which significantly affects the query performance and loses the order of data.

As a compromise, there is the compound primary key. With a key like key1_key2, only key1 is used for hash partitioning, while key2 supports range queries within the partition.

For example, consider a forum scenario. You can design a primary key such as (user_id, timestamp) to query all posts of a user within a period. A range query like scan(user_id, start_timestamp, end_timestamp) then easily obtains the results.

Cassandra uses this partitioning method.
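The compound-key idea can be sketched with a toy in-memory store (all names here are hypothetical, not Cassandra's API): user_id alone chooses the partition by hash, and rows inside a partition stay sorted by timestamp, so the range scan touches exactly one partition:

```python
import bisect
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4
# partition index -> rows kept sorted as (user_id, timestamp, post)
partitions = defaultdict(list)

def _partition(user_id: str) -> int:
    # key1 (user_id) alone decides the partition
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

def put(user_id: str, timestamp: int, post: str) -> None:
    bisect.insort(partitions[_partition(user_id)], (user_id, timestamp, post))

def scan(user_id: str, start_ts: int, end_ts: int) -> list:
    """key2 (timestamp) supports a range query inside one partition."""
    rows = partitions[_partition(user_id)]
    lo = bisect.bisect_left(rows, (user_id, start_ts))
    hi = bisect.bisect_right(rows, (user_id, end_ts, chr(0x10FFFF)))
    return [post for _, _, post in rows[lo:hi]]
```

Hashing key1 keeps users evenly spread, while sorting by key2 preserves cheap range scans per user.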

Localization & Rebalance

Since the partitioning of key-value data has business implications, we can no longer consider only storage space, as we did for file data.

The localization strategy must not break the rule that data in the same range stays in the same partition.

Typically, there are several options:

First, hash mod N: the hash of the key modulo N determines which node the data is placed on, where N is the number of nodes.

The benefit of this approach is that metadata management costs nothing: the mapping is no longer stored data but a simple computation.

The disadvantage is that it is very inflexible. Once the number of nodes changes, we may have to move a large amount or even all of the data to rebalance.

The root cause is that the partitioning formula contains the variable N, so any change in the number of nodes forces a rebalance that affects localization.
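The cost is easy to demonstrate with a quick experiment (a toy calculation, using the integer key itself as its hash): when N grows from 4 to 5, most keys land on a different node and must move:

```python
# Toy calculation: treat the key itself as its hash and count how many
# keys change nodes when N goes from old_n to new_n.
def moved_fraction(num_keys: int, old_n: int, new_n: int) -> float:
    moved = sum(1 for k in range(num_keys) if k % old_n != k % new_n)
    return moved / num_keys

print(moved_fraction(10_000, 4, 5))  # -> 0.8: 80% of the data must relocate
```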

Then don't put a variable in the formula. This leads to the second method: a fixed number of partitions. The number of partitions remains the same regardless of whether the number of nodes increases or decreases.

As such, when nodes increase or decrease, only a small number of partitions need to be moved. The downside is the overhead associated with metadata management.

However, metadata is usually not large, so Elasticsearch and Couchbase have adopted this scheme.
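The scheme can be sketched as follows (names and the rebalance policy are illustrative, not any system's actual algorithm): keys always hash into one of 64 partitions, and only the partition-to-node table, which is the metadata, changes when a node joins:

```python
from collections import defaultdict

# Sketch: a fixed partition count. Keys map to one of 64 partitions by
# hash(key) % 64 forever; only the partition -> node table changes.
NUM_PARTITIONS = 64

def rebalance(assignment: dict, nodes: list) -> dict:
    """Move only surplus partitions so every node ends near the average load."""
    target = NUM_PARTITIONS // len(nodes)
    by_node = defaultdict(list)
    for p, n in sorted(assignment.items()):
        by_node[n].append(p)
    new_assignment = dict(assignment)
    # Surplus partitions on overloaded nodes become candidates to move...
    spare = [p for n in nodes for p in by_node[n][target:]]
    # ...and underloaded nodes (e.g. a brand-new one) take them over.
    for n in nodes:
        need = target - len(by_node[n][:target])
        for _ in range(need):
            if spare:
                new_assignment[spare.pop()] = n
    return new_assignment

old = {p: f"n{p % 4 + 1}" for p in range(NUM_PARTITIONS)}  # 4 nodes, 16 each
new = rebalance(old, ["n1", "n2", "n3", "n4", "n5"])       # add a 5th node
print(sum(1 for p in old if old[p] != new[p]))             # -> 12 of 64 move
```

Contrast this with hash mod N above: adding one node moves only 12 of 64 partitions here, instead of roughly 80% of all keys.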

Fixing the number of partitions has another problem: the number must be chosen appropriately, or you end up redistributing the data anyway.

How much is appropriate?

Set it to 100,000 at the beginning, and excessive metadata management and synchronization costs drag you down from day one.

Set it to 100 at the beginning, and a year later, when the data volume has grown tenfold, it is no longer enough.

There is no standard answer; it depends on the business scenario, and even the same scenario may need a different value as time goes on.

If you can't find the right value, what should you do?

The third method is the dynamic number of partitions.

When nodes are added, old partitions are split, and some of them are moved to the new nodes. When nodes are removed, partitions are merged and moved to the remaining nodes.

Only the affected partitions move, so rebalance is no longer something to fear.

However, there are some problems:

  • The operating costs and performance impact of split and merge are unavoidable.
  • The trigger conditions for split and merge must be carefully designed. The data volume, number of files, and file size of partitions may need to be considered.
  • Although it is dynamic, there must be an initial value. Otherwise, problems such as write hot spots are hard to avoid. Exposing a pre-partitioning (pre-split) function to developers is a good choice.

HBase uses this method.
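The split half of the mechanism can be sketched as follows (a toy threshold and structure, loosely in the spirit of HBase regions splitting at their midpoint):

```python
# Sketch: a partition that splits at its median key once it holds too
# many rows. Threshold and class names are illustrative.
MAX_ROWS = 4

class Partition:
    def __init__(self, rows=None):
        self.rows = sorted(rows or [])  # (key, value) pairs, kept sorted

    def maybe_split(self):
        """Return (left, right) halves if over the threshold, else None."""
        if len(self.rows) <= MAX_ROWS:
            return None
        mid = len(self.rows) // 2
        return Partition(self.rows[:mid]), Partition(self.rows[mid:])

p = Partition([(k, None) for k in ["a", "c", "e", "g", "i"]])
halves = p.maybe_split()
```

Real systems weigh more than row count when deciding to split, such as partition byte size and file count, as the bullets above note.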

Document Data


Although document data does not have a primary key in the business sense, it usually has a unique internal doc_id, just as we often have an auto-increment id as the primary key in a relational database.

As such, the document data becomes a key-value structure, such as {doc_id: doc}. You can use the partitioning method mentioned in the key-value data.

It is more common to use mod-N methods here; Elasticsearch is an example.

Localization & Rebalance

Similar to relational databases, document data is also queried through secondary indexes (as in search).

Data with the same secondary index value may then appear in different partitions, so a query can only use a method similar to map and reduce: scatter to every partition and gather the results.

This is unfriendly to read-heavy scenarios because every query has to be broadcast to all partitions.

Therefore, there is a method that uses a secondary index as a routing key for reads and writes. This optimization of localization, in turn, affects partitioning.

As such, data with the same secondary index value is written to a fixed partition, which solves the read amplification problem.

In scenarios with multiple secondary indexes, you may have to write multiple copies of the data, one per routing key.
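Routing by a secondary field can be sketched like this (names are hypothetical, though Elasticsearch's routing parameter works similarly in spirit): documents sharing a routing value land in one shard, so a query scoped by that value reads a single shard instead of broadcasting:

```python
import hashlib

NUM_SHARDS = 8
shards = [dict() for _ in range(NUM_SHARDS)]  # each shard: {doc_id: doc}

def _shard_for(routing: str) -> int:
    return int(hashlib.md5(routing.encode()).hexdigest(), 16) % NUM_SHARDS

def index(doc_id: str, doc: dict, routing: str) -> None:
    """Place the doc by a secondary field (e.g. user_id), not by doc_id."""
    shards[_shard_for(routing)][doc_id] = doc

def query_by_routing(routing: str, field: str, value) -> list:
    """All docs with this routing value live in one shard: no broadcast."""
    return [d for d in shards[_shard_for(routing)].values() if d.get(field) == value]
```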

As for how the secondary index itself is accessed, there are two implementations, the document-partitioned local index and the term-partitioned global index, which we will not cover here.


  • Divide and conquer is the core idea of distributed systems.
  • The implementation of divide and conquer is partition, which solves the most basic problem of distributed systems - scalability.
  • There are three important issues to consider for scalability: partitioning, localization, and rebalance.
  • File data is partitioned almost randomly; with metadata, localization and rebalance are easy to control.
  • Key-value data is usually partitioned starting from the business meaning of the key; the advantages and disadvantages of the different methods are clear.
  • For document data, you can use key-value data for partitioning due to the existence of doc_id. However, you must consider the impact of secondary indexes.


This article, which summarizes the previous articles, is essentially solving one of the core problems of distributed systems - the scalability problem. The solution is partitioning.

The next article will discuss another core issue of distributed systems: availability.

I have been talking about the advantages of a distributed system. Now, it is time to discuss its problems.

This is a carefully conceived series of 20-30 articles. I hope to give everyone a core grasp of the distributed system in a storytelling way. Stay tuned for the next one!
