Using HBase in a Production Environment

Date: Oct 27, 2022


Abstract: This article covers best practices to keep in mind when developing with and operating HBase in production.

Seven Principles of Schema Design

1) Keep the size of each region between 10 GB and 50 GB;

2) A table should ideally have between 50 and 100 regions;

3) A single cell should not exceed 10 MB. If it does, consider splitting the data at the business level. If splitting is impossible, use HBase's MOB (medium object) feature;

4) Unlike in a traditional relational database, an HBase table should have no more than 3 column families. Columns within a column family can be added dynamically, so do not design too many column families;

5) Keep column family names as short as possible, because every KeyValue stored on disk contains the column family name;

6) If a table has more than one column family, make sure the row counts of the column families do not differ too much. For example, if column family A has 100,000 rows and column family B has 100 million rows, the table has 100 million row keys. Because regions are split by row key, column family A's data ends up scattered across many regions in very small pieces, so scanning column family A causes more I/O and is inefficient.

7) A TTL (time to live) can be set on a column family, and HBase automatically deletes data once it is older than the TTL.

There are two ways to set it:

# Set it when creating the table. The TTL unit is seconds; in this example, data in column family 'f1' is retained for 1 day (86400 seconds)

hbase(main):002:0> create 'table', {NAME => 'f1', TTL => 86400}

# Set it by altering an existing table

hbase(main):002:0> alter 'table', {NAME => 'f1', TTL => 86400}

Note that once the TTL has elapsed, the data can no longer be read. However, expired data is only physically deleted during major compaction.


Three Strategies for RowKey Design

As a distributed storage database, HBase is very easy to scale out, but the "hot spot" problem remains a headache.

The "hot spot" problem (hotspotting) occurs when requests (reads or writes) concentrate on one or a few regions within a short time. The load on the machine hosting those regions rises sharply and exceeds what a single instance can handle, so performance degrades or the service becomes unavailable.

To solve this problem, it is necessary to design the RowKey so that the data is written to as many regions as possible.

For example:

If a table is split into 26 regions, one per letter of the alphabet, then records whose rowkeys all start with m and are written at the same time all land in the same region.

For example, m001, m002, m003, m004, m005.
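The effect can be illustrated with a small sketch. It assumes a hypothetical layout in which the split points are the letters b to z, and it models region lookup as a binary search over those split points; the function name and layout are illustrative only and are not part of HBase.

```python
import bisect
import string

# Hypothetical layout: 26 regions with split points "b", "c", ..., "z".
# Region 0 holds keys before "b", region 1 holds ["b", "c"), and so on.
SPLIT_POINTS = list(string.ascii_lowercase[1:])


def region_index(rowkey: str) -> int:
    """Return the index of the region whose key range contains rowkey."""
    return bisect.bisect_right(SPLIT_POINTS, rowkey)


keys = ["m001", "m002", "m003", "m004", "m005"]
regions = {region_index(k) for k in keys}
print(regions)  # a single region index: all five writes hit one region
```

Because rowkeys are compared lexicographically, every key that shares the prefix m falls between the same two split points.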

Therefore, the design of the RowKey is very critical. There are several common design strategies.

1) Salting

The salting strategy places a randomly generated value at the beginning of the row key as a prefix, so that each row key sorts to a random position in lexicographical order.

Applying salting to the case above, a random letter is generated for each rowkey before insertion, which gives

am001, zm002, nm003, qm004, lm005

In this way, the five writes go to 5 different regions at the same time, and the hot spot is broken up.

Side effect: Since the prefix generation is random, if you want to query the rows lexicographically, you need to do more work. From this point of view, salting increases the throughput of write operations, but also increases the overhead of read operations.
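A minimal sketch of this trade-off, assuming a single random lowercase letter as the salt (the helper names `salt_key` and `candidate_keys` are hypothetical): writing is one cheap step, but a reader who only knows the original key has to try every possible prefix.

```python
import random
import string


def salt_key(rowkey: str, rng=random) -> str:
    """Prefix the rowkey with one random lowercase letter (the salt)."""
    return rng.choice(string.ascii_lowercase) + rowkey


def candidate_keys(rowkey: str) -> list:
    """All salted forms a reader must try, since the salt is not recoverable."""
    return [prefix + rowkey for prefix in string.ascii_lowercase]


salted = salt_key("m001")
print(salted)                       # one of 26 possible salted keys
print(len(candidate_keys("m001")))  # number of lookups needed to find it again
```

In practice the number of distinct salts is usually kept small so that the read fan-out stays bounded.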

2) Hashing

The hashing strategy is a special kind of salting that uses a one-way hash instead of a randomly assigned prefix.

This gives a given rowkey the same prefix every time it is "salted", which spreads the load across RegionServers while still letting read operations predict the prefix. Because the hash is deterministic, the client can reconstruct the complete row key and then use the Get method to fetch that exact row as usual.
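As a sketch of the idea, the prefix below is the first two hexadecimal characters of the key's MD5 digest. The choice of MD5, the prefix length, and the function name `hashed_key` are assumptions made for illustration; any stable hash would serve.

```python
import hashlib


def hashed_key(rowkey: str, prefix_len: int = 2) -> str:
    """Prefix the rowkey with the leading hex characters of its MD5 digest."""
    digest = hashlib.md5(rowkey.encode("utf-8")).hexdigest()
    return digest[:prefix_len] + rowkey


# Writer and reader compute the same stored key independently,
# so a point lookup needs only one Get.
stored = hashed_key("m001")
lookup = hashed_key("m001")
print(stored == lookup)
```

The prefix depends only on the key itself, so no lookup table or fan-out is needed to find a row again.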

3) Reverse key

The third way to prevent hotspotting is to reverse a fixed-width or numeric key, so that the part that changes most often (the least significant digit) comes first in the rowkey.

Side effect: It has no effect on the Get operation, but it is not conducive to the range query of the Scan operation, because the order of the data on the original RowKey has been disrupted.
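A small sketch of both the benefit and the side effect, using hypothetical sequential IDs: after reversal the leading character varies, but the stored order no longer matches the original order.

```python
def reverse_key(key: str) -> str:
    """Reverse the key so its fastest-changing characters come first."""
    return key[::-1]


# Hypothetical sequential IDs that would otherwise share a long common prefix.
ids = ["1000018", "1000019", "1000020", "1000021"]
stored = [reverse_key(i) for i in ids]
print(stored)

# A Get still works: reverse the ID you are looking for.
print(reverse_key("1000020") in stored)

# A range Scan does not: sorted stored keys no longer follow the ID order.
print([reverse_key(k) for k in sorted(stored)])
```

Reversing twice returns the original key, so the transformation loses no information; only the sort order changes.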

Pre-partitioning was already mentioned in "HBase core feature - region split".

It matters because a newly created table is allocated only one region. At the beginning, therefore, all read and write requests fall on the region server hosting that region, no matter how many region servers the cluster has, and the distributed nature of the cluster cannot be fully utilized.

Pre-partitioning is therefore mainly a way to solve the "hot spot" problem.

A typical table creation statement is:

create 'tb',{NAME => 'f1',COMPRESSION => 'snappy' }, { NUMREGIONS => 50, SPLITALGO => 'HexStringSplit' }

NUMREGIONS is the number of regions. As a rule of thumb, it is calculated by allowing about 8-10 GB of data per region. If the cluster is very large, the number of regions can be increased accordingly.
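The rule of thumb amounts to a simple division. The sketch below assumes a hypothetical expected data volume of 500 GB; the function name and the default of 10 GB per region are illustrative choices, not HBase settings.

```python
import math


def estimate_num_regions(total_data_gb: float, region_size_gb: float = 10.0) -> int:
    """Estimate NUMREGIONS by dividing expected data volume by a target region size."""
    return max(1, math.ceil(total_data_gb / region_size_gb))


print(estimate_num_regions(500))       # 10 GB per region
print(estimate_num_regions(500, 8.0))  # 8 GB per region
```

With 500 GB at 10 GB per region this gives 50, the value used for NUMREGIONS in the statement above.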

SPLITALGO is the algorithm used to compute the rowkey split points. HBase comes with three pre-split algorithms: HexStringSplit, DecimalStringSplit and UniformSplit.

Scenarios each split algorithm suits:

HexStringSplit: the rowkey is prefixed with a hexadecimal string

DecimalStringSplit: the rowkey is prefixed with a decimal number string

UniformSplit: the rowkey prefix is completely random
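To give a feel for what a hexadecimal pre-split produces, the following is a simplified sketch that divides an 8-character hex key space (00000000 to FFFFFFFF, assumed here) into equal parts. It is an approximation for illustration, not HBase's own implementation, and `hex_split_points` is a hypothetical name.

```python
def hex_split_points(num_regions: int,
                     first: int = 0x00000000,
                     last: int = 0xFFFFFFFF) -> list:
    """Return num_regions - 1 evenly spaced split keys as 8-char hex strings."""
    if num_regions < 2:
        return []  # a single region needs no split point
    span = last - first + 1          # the last key is inclusive
    step = span // num_regions       # any remainder ends up in the last region
    return [format(first + i * step, "08x") for i in range(1, num_regions)]


print(hex_split_points(4))
```

Rowkeys that start with a uniformly distributed hex prefix, such as a hash, then spread roughly evenly across the resulting regions.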

Read Performance Optimization

The preceding sections mainly covered design-time optimizations.

If queries turn out to be slow when using HBase, analyze the cause of the slowness for the specific situation and apply a corresponding strategy.
