The Data Distribution of HBase

Abstract: Rowkey design has always been a difficulty and pain point of HBase. An inappropriate rowkey design leads to problems such as poor read/write performance and low throughput. Starting from the data distribution problem, this article introduces HBase's range-based distribution strategy and the region scheduling problem, and discusses the comparison rules of rowkeys and their applications in detail, hoping to deepen users' understanding of HBase's data distribution mechanism and of rowkeys, so that they can produce more suitable designs and use HBase accurately and efficiently.



A brief description of the data distribution problem
Distribution is rooted in "scale", where scale can be understood as the demand for computing and storage. When a single machine can no longer bear the growing demand, the system must be expanded. There are two ways to expand: increasing single-machine capacity (scale up) and adding machines (scale out, horizontal expansion). Limited by hardware technology, single-machine capability has an upper bound at any given stage, while horizontal expansion is theoretically unlimited, cheaper, and easier to implement. Horizontal expansion copes with rapid business growth by quickly and simply "adding machines", and is almost a required capability of modern distributed systems. For an explosively growing business, it is practically the only option.

For a storage system, data that used to reside on one machine must now be stored on many. Two problems must be solved: sharding and replication.

Data sharding, also called partitioning, divides the data set "reasonably" into multiple shards, with each machine responsible for several of them. This breaks the single-machine capacity limit and improves overall access capability; it also reduces the impact of a single shard's failure.
Data replication, i.e. keeping "copies" (replicas) of the data. Sharding alone cannot prevent data loss when a machine fails, so redundancy must be used to achieve high availability. Replication is also an important means of improving system throughput and dealing with hotspots.
Sharding and replication are orthogonal: either can be used alone or both together, but usually both are needed, since sharding addresses scale and scalability while replication addresses reliability and availability. A production-ready system must have both.

From a consumer/client perspective, sharding and replicas boil down to the same problem: request routing, i.e. which machine should the request be sent to for processing.

When reading data, there is a mechanism to ensure that there is a suitable shard/replica to serve
When writing data, the same mechanism can be used to ensure that it is written to an appropriate place and to ensure the consistency of the replica
Whether the client's request goes directly to the server (as in HBase/Cassandra) or through a proxy (such as the gateway-based access used on public clouds), request routing is a problem every distributed system must solve.

Whether it is sharding or replica, it is essentially the embodiment of data distribution. Let's take a look at the data distribution model of HBase.

HBase's data distribution model
HBase shards data per table, at row granularity, splitting on rowkey ranges. Each shard is called a region. A cluster holds multiple tables, each table is divided into multiple regions, and each server serves many regions, so an HBase server is called a RegionServer, RS for short. RSs and tables are orthogonal: the regions of one table are distributed across multiple RSs, and one RS schedules regions from multiple tables. As shown below:

"Row granularity" means the row is the smallest unit of region division: a row of data belongs entirely to either region A or region B and is never split across two regions. (Splitting a row apart by columns is "vertical splitting", which can usually only be done at the business level; HBase splits horizontally.)

HBase's replication is implemented by the underlying HDFS, so replication and sharding are decoupled, and storage is separated from computation. This lets regions move flexibly between RSs without data migration, giving HBase seconds-level scaling and great flexibility.

For a single table, a "good" data distribution means that each region holds a similar amount of data, receives a similar amount of requests (throughput), and each machine schedules roughly the same number of regions. The data and access of the table are then evenly spread across the cluster, achieving the best resource utilization and service quality, i.e. load balancing. When the cluster scales out or in, we want this balance to be maintained automatically. If the distribution fails to balance the load, a heavily loaded machine easily becomes the bottleneck of the whole system: its slow responses may cause most client threads to block waiting on it, dragging down overall throughput. Load balancing is therefore an important goal of region division and scheduling.

This involves three levels of load balancing:

Logical distribution of data: that is, region division/distribution, which is a mapping problem from rowkey to region
Physical distribution of data: the scheduling problem of region on RS
Access distribution: that is, the distribution of system throughput (requests) on each RS, involving the relationship between data volume and access volume, access hot spots, etc.
So locating a row of data (finding the RS that serves it) involves two levels of routing: from rowkey to region, and from region to RS. This is the key to HBase's flexible scheduling and seconds-level scaling, and we will return to it in detail. This article discusses only the first two issues; the third is covered in a follow-up article.

Region division based on rowkey range
First, let's look at the logical distribution of data, that is, how a table is divided into multiple regions.

The granularity of region division is the row: a region is a set of consecutive rows of the table. Since a row is uniquely identified by its rowkey, a region can be understood as a set of contiguously distributed rowkeys, hence the name range-based division on rowkeys.

The rowkey range for a region is a left-closed and right-open interval, so the start key of the next region is the end key of the previous region. Note that the first region has no start key, and the last region has no end key. In this way, all regions of this table can be added together to cover any rowkey value field. As shown below:

In the above figure, region1 is the first region with no startKey, and region3 is the last region with no endKey. The region distribution in the figure is relatively uniform, that is, the number of rows in each region is equivalent, so how is this distribution obtained? In other words, how is the boundary of the region determined?
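Since regions are left-closed, right-open rowkey intervals, locating a rowkey's region reduces to a binary search over the sorted region start keys. The following is a minimal Python sketch, not HBase code; the region boundaries are made up for illustration, with the empty byte string standing in for the first region's missing start key:

```python
import bisect

# Hypothetical table with three regions:
# region1 = (-inf, "b"), region2 = ["b", "d"), region3 = ["d", +inf)
# Represented by the sorted list of start keys; b"" is the first region's missing start key.
start_keys = [b"", b"b", b"d"]

def find_region(rowkey: bytes) -> int:
    """Return the index of the region whose [startKey, endKey) interval contains rowkey."""
    # bisect_right finds the first start key strictly greater than rowkey;
    # the region just before that position is the one containing the rowkey.
    return bisect.bisect_right(start_keys, rowkey) - 1

print(find_region(b"apple"))   # -> 0 (region1, which has no start key)
print(find_region(b"b"))       # -> 1 (left-closed: "b" belongs to region2)
print(find_region(b"cat"))     # -> 1
print(find_region(b"zebra"))   # -> 2 (region3, which has no end key)
```

The left-closed convention shows up in the lookup: a rowkey equal to a region's start key belongs to that region, not the previous one.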

Generally speaking, there are three ways to generate regions:

Pre-partitioning at table creation: pre-divide the regions based on an estimate of the rowkeys
Region splitting: manual splitting, or automatic splitting when certain conditions are met (for example, the region size exceeds a threshold)
Region merging: manual merging
If no region distribution is specified explicitly at table creation, HBase creates a single region, and a region can only be scheduled on one machine (the case of one region scheduled on multiple RSs is discussed later). That region's throughput ceiling is the ceiling of a single machine. If reasonable pre-partitioning divides the table into 8 regions spread over 8 RSs, the table's throughput ceiling becomes that of 8 machines.

Therefore, to give a table good throughput and performance from the start, tables in production environments are usually created with pre-partitioning. There are exceptions, for example when the rowkey range cannot be estimated in advance, or cannot easily be split evenly; in such cases you can create a table with a single region and let the system split it, gradually arriving at a "uniform" region distribution.

For example, consider a table storing the employees of many companies, whose rowkey is orgId + userId, where orgId is the company's id. Since each company's headcount is unknown and may vary greatly, it is hard to decide how many orgIds a region should contain. In this case you can create a single-region table and import the initial data; regions split automatically as data arrives, and an ideal region distribution usually emerges. If company personnel change substantially later, regions can be split and merged at any time to regain the best distribution.

Lexicographic order and rowkey comparison
In the previous section, we mentioned that the rowkey range of a region is a left-closed and right-open range, and all rowkeys that fall within this range belong to this region. In order to make this judgment, it must be compared with the start and end rowkeys of this region. In addition to the judgment of region ownership, within the region, it is also necessary to rely on the rowkey comparison rules to sort the rowkeys.

Many people assume rowkey comparison is too simple to be worth discussing, but it is precisely this simplicity that makes its use flexible and varied, giving HBase endless possibilities. The rowkey comparison rule is the core of HBase's data model: it directly shapes the request routing system, the read and write paths, rowkey design, and scan usage, and runs through all of HBase. For users, a deep understanding of this rule and its applications helps produce good table designs and accurate, efficient scans.

An HBase rowkey is a piece of binary data, a byte[] in Java, and is the unique identifier of a row. A business primary key may be of any data type, so two problems must be solved:

Converts various actual data types to and from byte[]
Order preservation: the sorting result of the rowkey in the form of byte[] is consistent with the sorting result of the original data
Rowkey comparison is byte[] comparison in lexicographic (binary) order; simply put, it is the memcmp function of the C language. The following examples illustrate this comparison rule and the data type conversion problems by looking at sorting results.

(1) Comparison of numbers encoded as ASCII

1234 -> 0x31 32 33 34

5 -> 0x35

Interpreted as numbers in ASCII, 1234 > 5; but lexicographically, "1234" < "5" (0x31 < 0x35)

(2) Comparison of byte strings with a common prefix

1234 -> 0x31 32 33 34

1234\x00 -> 0x31 32 33 34 00

In C, a string is terminated by '\0', so a 0x00 byte cannot appear inside it; an HBase rowkey, however, is arbitrary binary, and 0x00 is a legal byte. The two byte strings here share a prefix, but the second has an extra 0x00 byte at the end, so the second is "larger": a missing byte compares less than any present byte.

(3) Comparison of positive and negative numbers

100 of type int -> 0x00 00 00 64

-100 of type int -> 0xFF FF FF 9C

Numerically, 100 > -100; but under byte-wise unsigned comparison, 0x00 00 00 64 < 0xFF FF FF 9C, i.e. 100 sorts before -100

We can summarize this comparison rule as follows:

Compare byte by byte from left to right; the first differing byte determines the result of comparing the two byte[]s
Bytes are compared as unsigned numbers
"Does not exist" is less than "exists": if one byte[] is a prefix of the other, the shorter one is smaller
Common rowkey encoding problems:

Signed numbers: in two's-complement representation the first bit of a negative number is 1, so under lexicographic order negative numbers sort after positive ones. When the rowkey's value range contains both positive and negative numbers, the sign bit must be flipped to ensure that positive numbers sort after negative numbers.
Reverse order: a Long is often used to represent time, which is usually stored in descending order. If the original value is v, its reverse-order encoding is Long#MAX_VALUE - v.
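Both encoding fixes can be sketched in a few lines of Python (a minimal illustration, not HBase code; the helper names `encode_int` and `encode_ts_desc` are made up):

```python
import struct

def encode_int(v: int) -> bytes:
    """Encode a signed 32-bit int so that unsigned byte-wise comparison
    preserves numeric order: flip the sign bit."""
    return struct.pack(">I", (v & 0xFFFFFFFF) ^ 0x80000000)

LONG_MAX = 2**63 - 1  # Java's Long#MAX_VALUE

def encode_ts_desc(ts: int) -> bytes:
    """Encode a non-negative timestamp in reverse order (Long#MAX_VALUE - v),
    so that newer values sort first."""
    return struct.pack(">Q", LONG_MAX - ts)

# Sign-bit flip restores numeric order under byte-wise comparison:
assert encode_int(-100) < encode_int(100) < encode_int(200)
# Reverse-order encoding: a later timestamp yields a smaller rowkey:
assert encode_ts_desc(20) < encode_ts_desc(10)
```

Flipping the sign bit is equivalent to adding a bias of 2^31, which maps the signed range -2^31..2^31-1 monotonically onto the unsigned range 0..2^32-1.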
Let's take a look at the application of this comparison rule through an example of prefix scanning.

Example: Prefix Scan

HBase's rowkey can be understood as a single primary-key column. If a business scenario needs several columns to form a joint primary key (also called a multi-column, composite, or compound primary key), those columns must be spliced into one; in general their binary representations are simply concatenated. For example:

rowkey composition: userId + ts

For simplicity, assume that userId and ts are both fixed-length, 1 byte each.

Now we want to find all the data for userId = 2, a typical prefix-scan scenario. We need to construct a Scan with the correct range [startRow, stopRow); like a region boundary, a scan range is a left-closed, right-open interval.

A straightforward idea is to take the smallest and largest possible ts, concatenate each with userId = 2, and use them as the query range, i.e. [0x02 00, 0x02 FF). But since scan is a left-closed, right-open range, the row 0x02 FF itself would not be returned. Therefore, this solution is not correct.

The correct scan range must satisfy:

startRow: must be less than or equal to any rowkey with userId = 2, and greater than any rowkey with userId = 1
stopRow: must be greater than any rowkey with userId = 2, and less than or equal to any rowkey with userId = 3
How to use the collation of rowkey to "find" such a scan range?

The correct scan range is [0x02, 0x03).

0x02 is smaller than any row with userId = 2, because the ts bytes are absent ("does not exist" is less than "exists"). Similarly, 0x03 is larger than any row with userId = 2 and smaller than any row with userId = 3. So for a prefix scan, startRow and stopRow can be derived from the prefix value alone, without knowing the subsequent columns or their meanings.
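The same trick generalizes to any binary prefix: startRow is the prefix itself, and stopRow is the prefix with its last byte incremented, carrying when that byte is 0xFF. A hedged Python sketch follows; `prefix_scan_range` is a made-up helper, not an HBase API, though HBase's client offers a similar convenience in `Scan#setRowPrefixFilter`:

```python
def prefix_scan_range(prefix: bytes):
    """Compute the [startRow, stopRow) range covering exactly the rowkeys
    that begin with `prefix`.  startRow is the prefix itself ("does not
    exist" sorts before "exists"); stopRow is the prefix with its last
    byte incremented, dropping trailing 0xFF bytes that cannot be bumped."""
    start = prefix
    stop = bytearray(prefix)
    while stop and stop[-1] == 0xFF:
        stop.pop()          # 0xFF cannot be incremented; carry into the previous byte
    if stop:
        stop[-1] += 1
    else:
        stop = b""          # prefix was all 0xFF: scan to the end of the table
    return start, bytes(stop)

print(prefix_scan_range(b"\x02"))        # -> (b'\x02', b'\x03'), the range from the example
print(prefix_scan_range(b"\x02\xff"))    # -> (b'\x02\xff', b'\x03')
```

An empty stopRow here mirrors HBase's convention that an empty stop row means "scan to the end of the table".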

Please read this example carefully, and then think about how to construct startRow and stopRow in the following scenarios (see the end of the article for the answer).

where userid = 2 and ts >= 5 and ts < 20
where userid = 2 and ts > 5 and ts < 20
where userid = 2 and ts > 5 and ts <= 20
where userid > 2 and userid < 5
There are also the following combined scenarios:

where userid in (3, 5, 7, 9)
where userid = 2 and ts in (10, 20, 30)
By now you can feel the difficulty and pain of using scan. The example above has only two fixed-length columns; in real business, columns may be variable-length, of various data types, with rich and varied query patterns, making it hard to construct correct and efficient scans. Why do these problems exist, and is there a systematic solution?

Formally, this is the problem of "translating business query logic into HBase query logic"; essentially, it is a mapping problem from the relational table model to the KV model. HBase exposes only a KV-level API, so users must implement the mapping between the two models themselves, hence all the thorny problems above. Not only HBase: every KV storage system faces the same dilemma when confronted with complex business models.

The answer to this problem is SQL on NoSQL. There are many such solutions in the industry (Hive, Presto, etc.); the one on top of HBase is Phoenix. These solutions address NoSQL's ease-of-use problem by introducing SQL. Traditional relational databases, despite strong SQL and transaction support, are limited in scalability and performance: to address performance, MySQL provides a Memcached-based KV access path; to address scalability, there are various NewSQL products such as Spanner/F1, TiDB, and CockroachDB. NoSQL systems are adding SQL while SQL systems are adding KV access; one can only imagine what future storage and database systems will look like. The topic is too large to expand here.

Region metadata management and routing
Earlier we saw that dividing a table's rows into reasonable regions yields regions with roughly similar data volumes, and that reasonable operational measures (region splitting and merging) can keep regions balanced as the system runs, so the logical splitting of data stays even. In this section we look at how regions are distributed over RSs and how a client locates a region.

Because a region's rowkey range is itself uncertain or subjective (splits can be manual), there is no mathematical formula that computes which region a rowkey belongs to (in contrast to consistent-hash sharding). Range-based sharding therefore needs a metadata table recording which regions a table is divided into and what each region's start and end rowkeys are. This metadata table is the meta table; in HBase 1.x it is named "hbase:meta" (in 0.94 and earlier versions there were two metadata tables, -ROOT- and .META.).

Let's briefly walk through the region location process using a Put operation:

Find, via ZK, the RS hosting the meta table (cached)
Query the meta table for the region containing the rowkey, and the RS hosting that region (cached)
Send the Put request to that RS, which performs the write against the named region
If the RS finds that the region is not served there, it throws an exception and the client re-routes
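The steps above can be sketched as a small simulation (purely illustrative Python; the class and method names are made up, and `NotServingRegion` stands in for HBase's NotServingRegionException):

```python
class NotServingRegion(Exception):
    """Stand-in for HBase's NotServingRegionException."""

class RegionServer:
    def __init__(self):
        self.regions = {}                  # region name -> {rowkey: value}
    def put(self, region, rowkey, value):
        if region not in self.regions:     # step 4: the region is not served here
            raise NotServingRegion(region)
        self.regions[region][rowkey] = value

class Client:
    def __init__(self, meta, servers):
        self.meta = meta                   # authoritative rowkey -> (region, rs), like hbase:meta
        self.servers = servers             # rs name -> RegionServer
        self.cache = {}                    # client-side location cache
    def put(self, rowkey, value):
        loc = self.cache.get(rowkey)
        if loc is None:                    # steps 1-2: ask meta only on a cache miss
            loc = self.cache[rowkey] = self.meta[rowkey]
        region, rs = loc
        try:
            self.servers[rs].put(region, rowkey, value)    # step 3
        except NotServingRegion:
            # step 4: the cache was stale -> invalidate, re-route via meta, retry
            region, rs = self.cache[rowkey] = self.meta[rowkey]
            self.servers[rs].put(region, rowkey, value)

rs1, rs2 = RegionServer(), RegionServer()
rs1.regions["r1"] = {}
meta = {b"row": ("r1", "rs1")}
client = Client(meta, {"rs1": rs1, "rs2": rs2})
client.put(b"row", b"v1")                  # routed via meta, then cached

# Move region r1 from rs1 to rs2 and update meta; the client cache is now stale.
rs2.regions["r1"] = rs1.regions.pop("r1")
meta[b"row"] = ("r1", "rs2")
client.put(b"row", b"v2")                  # stale cache -> exception -> re-route -> success
```

The point of the sketch is the retry path: the client trusts its cache until a server rejects the request, and only then goes back to the meta table.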
The logic for locating the region is the same whether it is read or written. In order to reduce the client's access to the meta table, the client will cache the region location information. If and only if the cache is incorrect, the meta table needs to be accessed to obtain the latest information. Therefore, HBase's request routing is a routing table-based solution. Correspondingly, the sharding method based on consistent Hash obtains the distribution information through calculation.

This routing-table-based approach has:

Advantages: The home RS of a region can be changed arbitrarily, in other words, the scheduling of a region on the RS is flexible and can be manually intervened.
Disadvantage: The meta table is a single point, its limited throughput limits the size of the cluster and the number of clients
The flexible scheduling of regions, combined with the architecture that separates storage and computing, endows HBase with extremely powerful capabilities.

Seconds-level scaling: a newly added RS can serve traffic immediately after regions are moved onto it, with no dependency on data migration (which can happen later)
Manual isolation: For problematic regions (such as hot spots with abnormal requests), they can be manually moved to a separate RS to quickly isolate the fault domain.
Both capabilities are beyond the reach of most consistent-hashing-based sharding schemes. Of course, the price HBase pays for this flexibility is a complex meta-table management mechanism, and a critical issue there is that the meta table is a single point: a large number of clients querying the meta table for region locations drives up its load, which caps the overall throughput of location lookups and in turn limits the scale of the cluster and of the client fleet.

For a cluster with hundreds of machines and hundreds of thousands of regions, this mechanism can work well. However, when the scale of the cluster is further expanded and the access limit of the meta table is reached, the service will be affected due to the blocking of access to the meta table. Of course, most business scenarios cannot reach this critical scale.

There are many ways to address the meta-table bottleneck, the simplest being replication. For example, in TiDB's PD service, a location request can be sent to any PD server.

Region scheduling
Next we discuss the region scheduling problem:

Load balancing of regions between RSs
The same region is scheduled on multiple RSs
For the first problem, HBase's default balancing strategy is: per table, schedule as close to the same number of regions on each RS as possible.

This strategy assumes that data volume and requests are distributed fairly evenly across regions, in which case it is very effective, and it is also the most widely used. HBase also provides load-based scheduling (StochasticLoadBalancer), which weighs multiple factors when making scheduling decisions, but production cases and data for it are still scarce.

For the second problem, a region is scheduled on only one RS at a time, which lets HBase provide strongly consistent semantics: once a write succeeds, the data can be read immediately. The cost is single-point scheduling of the region: if the server hosting a region jitters for any reason, the service quality of that region suffers. Events affecting a region's service fall into two categories:

Unexpected: crash recovery, GC, network problems, disk thrashing, hardware problems, etc.
Predictable (or artificial): region movement caused by expansion/reduction, region split/merge, etc.
When these events occur, they affect the region's service to varying degrees. The downtime scenario is the worst: from ZK detecting the node failure, through region re-assignment, log splitting, and log replay, it generally takes more than a minute to complete the whole sequence, during which the regions on the failed node cannot serve requests.

The remedy is again replication: schedule the region on multiple RSs and let the client choose one to access. This feature is called "region replica". Introducing replicas inevitably brings extra cost and consistency issues. At present the implementation still has issues around MTTR reduction, memory watermark control, and dirty reads, so the feature has not seen large-scale production use.

The problem of data distribution and region scheduling in HBase, generalized to distributed systems at large, is the problem of task splitting and scheduling, a topic big enough to fill several books. This article has discussed data sharding only from HBase's perspective, hoping to deepen the reader's thinking about and understanding of HBase rowkeys and regions. Both database users and developers can benefit from this discussion.

The scan range corresponding to some query scenarios in the text:

where userid = 2 and ts >= 5 and ts < 20: [0x02 05, 0x02 14)
where userid = 2 and ts > 5 and ts < 20: [0x02 06, 0x02 14)
where userid = 2 and ts > 5 and ts <= 20: [0x02 06, 0x02 15)
where userid > 2 and userid < 5: [0x03, 0x05)
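Under the article's assumption of 1-byte userId and ts fields, these ranges can be sanity-checked by simulating a scan over a small in-memory table (illustrative Python, not HBase code):

```python
def row(userid: int, ts: int) -> bytes:
    """Build a composite rowkey from 1-byte userId and 1-byte ts."""
    return bytes([userid, ts])

def scan(rows, start: bytes, stop: bytes):
    """Simulate an HBase scan: left-closed, right-open range over sorted rowkeys."""
    return [r for r in sorted(rows) if start <= r < stop]

# A toy table: userIds 1..5, each with ts values 0, 5, 10, ..., 30
rows = [row(u, t) for u in range(1, 6) for t in range(0, 31, 5)]

# where userid = 2 and ts >= 5 and ts < 20  ->  [0x02 05, 0x02 14)
assert scan(rows, b"\x02\x05", b"\x02\x14") == [row(2, 5), row(2, 10), row(2, 15)]

# where userid > 2 and userid < 5  ->  [0x03, 0x05)
hits = scan(rows, b"\x03", b"\x05")
assert {r[0] for r in hits} == {3, 4} and len(hits) == 14
```

Note that 20 decimal is 0x14, which is why the exclusive stop row 0x02 14 correctly excludes ts = 20.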
Author: Yang Han
