Rowkey design is an important factor that affects the performance of ApsaraDB for HBase. The following text provides a collection of common problems and their solutions:

Rows are sorted in lexicographical order by rowkey. This design optimizes scanning and allows you to store related rows, or rows that will be read together, next to each other. However, poor rowkey design is a common cause of hotspotting. Hotspotting occurs when a large amount of traffic is concentrated on one or a small number of nodes in a cluster. Traffic refers to operations such as reads and writes. If the traffic overloads the server that hosts a region, the performance of the region degrades and the region may even become unavailable. This may also have adverse effects on other regions on the same region server because the server cannot handle the requested load. Therefore, it is important to design data access patterns that distribute the load evenly across the cluster.

To prevent hotspotting during write operations, the rowkey must be designed so that data can be written to as many regions as possible at the same time. Try to avoid writing data to only one region, unless it is necessary for the data to be in one region.

The following sections describe common methods to prevent hotspotting. Each method has its own pros and cons.

Salting

Salting in HBase refers to placing a random value at the beginning of a rowkey. This operation randomly assigns a prefix to each rowkey so that it sorts differently than it otherwise would. The number of possible prefixes corresponds to the number of regions across which you want to distribute data. Salting is helpful if a few hot rowkey patterns appear repeatedly among other, more evenly distributed rows. In the following example, salting distributes the load across multiple region servers. The example also illustrates the negative impact salting has on read operations.

Assume that a table is split into regions based on the first letter of the rowkey. For example, rowkeys that have the prefix 'a' are distributed to Region A, and rowkeys that have the prefix 'b' are distributed to Region B. The following rowkeys all start with 'f'. Therefore, these rows are distributed to a single region.

foo0001
foo0002
foo0003
foo0004

To distribute the rows evenly across different regions, you need four salts: a, b, c, and d. Each letter prefix corresponds to a different region, which distributes the rowkeys across four different regions. The following rowkeys are each prefixed with a different letter and can be written to four different regions at the same time. In theory, if the four regions are hosted on different servers, the write throughput is four times that of writing all the data to one region.

a-foo0003
b-foo0001
c-foo0004
d-foo0002

When you insert a new row, a random prefix from one of the four possible salt values is assigned to the row.

a-foo0003
b-foo0001
c-foo0003
c-foo0004
d-foo0002

When salting is performed, prefixes are assigned at random, which improves the throughput of write operations. However, the original order of the rows is lost, which increases the workload of read operations.
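
The write and read sides of salting can be sketched in plain Java as follows. This is a minimal illustration rather than ApsaraDB for HBase code: the class name SaltedKeys, the method names, and the "-" separator are assumptions made for this example. The point is that a write picks one prefix at random, while a read has to try every possible prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of the HBase API.
final class SaltedKeys {
    // One salt per target region, as in the four-region example above.
    static final String[] SALTS = {"a", "b", "c", "d"};

    // Write path: prepend a randomly chosen salt to the original rowkey.
    static String salt(String rowkey) {
        String prefix = SALTS[ThreadLocalRandom.current().nextInt(SALTS.length)];
        return prefix + "-" + rowkey;
    }

    // Read path: the salt of a stored row is unknown, so a lookup must
    // try every possible salted form of the rowkey.
    static List<String> candidates(String rowkey) {
        List<String> keys = new ArrayList<>();
        for (String prefix : SALTS) {
            keys.add(prefix + "-" + rowkey);
        }
        return keys;
    }
}
```

A single-row lookup therefore turns into as many lookups as there are salts, which is the read-side cost described above.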

Hashing

Instead of a randomly generated prefix, hashing uses a one-way hash so that a given row always receives the same prefix. This distributes the load across region servers while keeping read operations predictable. Because the hash is deterministic, the client can reconstruct the complete rowkey and retrieve the row by using a GET operation.

Take the example described in the salting section. You can apply a one-way hash to the rowkey foo0003 so that it always, and predictably, receives the prefix a. To retrieve the row, you compute the prefix again and combine it with the rowkey. This method can be further optimized. For example, you can make sure that specific pairs of rowkeys always end up in the same region.
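
The following plain-Java sketch shows one way to derive a deterministic prefix. The class name HashedKeys, the choice of MD5, and the mapping from the first digest byte to one of four letters are assumptions made for this example, so the letter that a particular rowkey receives may differ from the prefix used in the text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the HBase API.
final class HashedKeys {
    static final String[] PREFIXES = {"a", "b", "c", "d"};

    // Derives the prefix from a hash of the rowkey, so the same rowkey
    // always maps to the same prefix on every client.
    static String withHashPrefix(String rowkey) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(rowkey.getBytes(StandardCharsets.UTF_8));
            // Use the first digest byte (as an unsigned value) to pick a bucket.
            int bucket = (digest[0] & 0xff) % PREFIXES.length;
            return PREFIXES[bucket] + "-" + rowkey;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }
}
```

Because the prefix can be recomputed from the rowkey alone, a reader needs a single GET instead of one lookup per prefix.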

Reversing the Key

Another common method used to prevent hotspotting is to reverse a fixed-length or numeric rowkey so that the most frequently changed part (the least significant digit) is in the front. This randomizes the rowkey at the cost of row order.
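
As a small sketch of this idea, the following plain-Java helper reverses a fixed-length numeric rowkey. The class name ReversedKeys and the sample keys are made up for this example.

```java
// Hypothetical helper for illustration; not part of the HBase API.
final class ReversedKeys {
    // Reverses the characters so that the fastest-changing digit comes first.
    static String reverse(String rowkey) {
        return new StringBuilder(rowkey).reverse().toString();
    }

    public static void main(String[] args) {
        // Sequential keys share the prefix "00001" before reversing,
        // but start with different digits afterwards.
        System.out.println(reverse("0000101")); // prints 1010000
        System.out.println(reverse("0000102")); // prints 2010000
        System.out.println(reverse("0000103")); // prints 3010000
    }
}
```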

Monotonically increasing rowkeys or time series data

When all clients write rows with monotonically increasing rowkeys, the import process moves in lock-step: every client writes to the same region (and therefore the same node) at the same time, and then all clients move on to the next region together. This problem frequently occurs when timestamps or other monotonically increasing values are used as rowkeys. Sequential rowkeys also force otherwise unrelated data into sequential order, which concentrates the write load on one region and causes hotspotting. Therefore, try to avoid using timestamps or sequences (for example, 1, 2, 3) as rowkeys.

If you need to import data ordered by time (such as logs) to ApsaraDB for HBase, we recommend that you refer to the OpenTSDB documentation, which includes a page that describes the schema that OpenTSDB uses in HBase. The format of the rowkey in OpenTSDB is [metric_type] [event_timestamp]. At first glance, this seems to contradict the advice not to use timestamps as rowkeys. However, OpenTSDB puts metric_type before event_timestamp. With hundreds of distinct metric_type values, the load is spread across the regions. Therefore, the PUT operations can still be distributed across various regions of the table, despite a continuous data input stream.
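
The effect of putting the metric type first can be sketched in plain Java. This is not the actual OpenTSDB encoding: the class name MetricKeys, the use of the metric name as UTF-8 text, and the 8-byte big-endian timestamp are assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the OpenTSDB key format.
final class MetricKeys {
    // Builds a rowkey of the form [metric_type][event_timestamp].
    static byte[] rowkey(String metricType, long eventTimestamp) {
        byte[] metric = metricType.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(metric.length + Long.BYTES)
                .put(metric)               // leading part: decides the key range
                .putLong(eventTimestamp)   // trailing part: orders rows within one metric
                .array();
    }
}
```

Rows written at the same moment for different metrics start with different bytes, so they fall into different parts of the keyspace, while rows of one metric remain ordered by time.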

Minimize row and column sizes

In ApsaraDB for HBase, values are stored as cells in the system. To find a cell, you need to know the row, column name, and timestamp, and these coordinates are stored together with every value. If the row and column names are large, especially if they are larger than the value itself, you may run into problems. The storefiles of ApsaraDB for HBase contain an index that is used to facilitate random access to values. If the coordinates required to access a cell are too large, the index may consume a large share of the available memory. To mitigate this problem, you can set a larger block size so that index entries occur at larger intervals, or use smaller row and column names. Note that compression also results in larger indexes.

In most cases, minor inefficiencies do not have a great impact on performance. However, in big data scenarios they cannot be ignored, because column families, properties, and rowkeys may be repeated hundreds of millions of times in the data.

Column family

Make sure that the name of the column family is as short as possible. We recommend that you use only one character, for example, f.

Properties

Although descriptive property names (such as myVeryImportantAttribute) are easier to read, we recommend that you use short property names (for example, via) in ApsaraDB for HBase.

Rowkey length

Keep the rowkey as short as is reasonable while making sure that it is still useful for the required data access (for example, Get vs. Scan). A short rowkey that is useless for data access is not better than a longer rowkey that serves get and scan operations well. Expect tradeoffs when you design rowkeys.

Byte pattern

A long is 8 bytes. You can store an unsigned integer of up to 18,446,744,073,709,551,615 within 8 bytes. If you store the preceding number as a string, assuming that each character takes up one byte, the number of bytes required to store the number is nearly three times as many.

You can use the following sample code to test this:

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

// long
long l = 1234567890L;
byte[] lb = Bytes.toBytes(l);
System.out.println("long bytes length: " + lb.length);   // prints 8

String s = String.valueOf(l);
byte[] sb = Bytes.toBytes(s);
System.out.println("long as string length: " + sb.length);    // prints 10

// hash
MessageDigest md = MessageDigest.getInstance("MD5");   // declares NoSuchAlgorithmException
byte[] digest = md.digest(Bytes.toBytes(s));
System.out.println("md5 digest bytes length: " + digest.length);    // prints 16

// Decoding the raw digest as text inflates it. The exact length depends on
// the digest bytes and on the default charset of the platform.
String sDigest = new String(digest);
byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length);    // for example, 26

However, binary representation makes the data hard to read outside the code. The following example shows what the shell displays when you increment a value:

hbase(main):001:0> incr 't', 'r', 'f:q', 1
COUNTER VALUE = 1

hbase(main):002:0> get 't', 'r'
COLUMN                                        CELL
 f:q                                          timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
1 row(s) in 0.0310 seconds

The shell tries to print a string, but in this case it can only print the hexadecimal representation. The same thing happens to rowkeys that appear in region names. This is fine if you know what is stored, but the result may be hard to understand if arbitrary data can be put in the same cell. This is the main tradeoff.

Reverse timestamp

A common problem in databases is finding the latest version of a row. In this case, a reversed timestamp can be used as part of the rowkey to facilitate sorting. This technique involves appending (Long.MAX_VALUE - timestamp) to the end of a key. For example, [key] [reverse_timestamp].

To find the latest value for [key] in the table, scan for [key] and obtain the first record. The rowkeys of ApsaraDB for HBase are stored in sorted order, so the rowkey with the newest timestamp has the smallest reverse_timestamp and sorts before all older rowkeys for the same [key].

Use this technique instead of the number-of-versions setting when you want to keep all versions permanently (or for an extended period of time). In addition, you can quickly obtain other versions by using the same scan technique.
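
The ordering argument can be checked with a short plain-Java sketch. The class name ReverseTimestampKeys, the UTF-8 key prefix, and the 8-byte big-endian encoding are assumptions made for this example, and timestamps are assumed to be non-negative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the HBase API.
final class ReverseTimestampKeys {
    // Builds a rowkey of the form [key][reverse_timestamp].
    static byte[] rowkey(String key, long timestampMillis) {
        byte[] prefix = key.getBytes(StandardCharsets.UTF_8);
        long reversed = Long.MAX_VALUE - timestampMillis; // newer means smaller
        return ByteBuffer.allocate(prefix.length + Long.BYTES)
                .put(prefix)
                .putLong(reversed)
                .array();
    }

    // Rowkeys are compared byte by byte as unsigned values;
    // Arrays.compareUnsigned (Java 9 or later) applies the same ordering.
    static boolean sortsBefore(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b) < 0;
    }
}
```

For two versions of the same key, the rowkey built from the later timestamp sorts first, so a scan that starts at [key] returns the newest version as its first record.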

Rowkey and column family

Rowkeys are scoped to column families. Therefore, the same rowkey can exist in each column family of the same table without any conflicts.

Rowkeys cannot be modified

Rowkeys cannot be modified. The only way to change the rowkey of a row is to delete the row and then insert it again with the new rowkey. We recommend that you get the rowkey design right from the beginning (or at least before you insert a large amount of data).

Relationship between the rowkey and the region split

If the table has been pre-split, the next step is to understand how rowkeys are distributed across the region boundaries. To see why this is important, consider using displayable hexadecimal characters as the leading part of the rowkey, for example, 0000000000000000 to ffffffffffffffff. You can obtain 10 regions by using Bytes.split to divide this key range. This is the split strategy that is used when you create regions with Admin.createTable(byte[] startKey, byte[] endKey, numRegions). The resulting split points are shown below as byte values, with the character of the first byte as a comment.

48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48                                // 0
54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10                 // 6
61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68                 // =
68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126  // D
75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72                                // K
82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14                                // R
88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44                 // X
95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102                // _
102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102                // f

The problem is that all the data is stored in the first two regions and the last region, which causes heavy workloads on these regions due to uneven data distribution. The ASCII table explains why: '0' is byte 48 and 'f' is byte 102, but only the values [0-9] and [a-f] are meaningful. Byte values 58 to 96 never appear in this keyspace, so the regions in the middle of the range are never used. To pre-split the keyspace in this example, you need to define custom split points.
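
The gap can be reproduced without an HBase cluster. The following plain-Java sketch does not call Bytes.split. It approximates evenly spaced split points with BigInteger arithmetic over the ASCII bytes of the keys and reports which ranges actually receive hex keys. The class name HexSplitGap and its methods are hypothetical, and the sketch assumes eight ranges, which is the number delimited by the nine boundary keys listed above.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class; it does not use the HBase API.
final class HexSplitGap {
    static final String START = "0000000000000000";
    static final String END = "ffffffffffffffff";

    // Interprets the ASCII bytes of a key as one unsigned big-endian number.
    static BigInteger asNumber(String key) {
        return new BigInteger(1, key.getBytes(StandardCharsets.US_ASCII));
    }

    // Index of the evenly sized byte range that a key falls into.
    static int rangeOf(String key, int ranges) {
        BigInteger lo = asNumber(START);
        BigInteger step = asNumber(END).subtract(lo).divide(BigInteger.valueOf(ranges));
        int index = asNumber(key).subtract(lo).divide(step).intValue();
        return Math.min(index, ranges - 1); // the end key belongs to the last range
    }

    // Ranges hit by keys that start with each possible hex digit.
    static Set<Integer> usedRanges(int ranges) {
        Set<Integer> used = new TreeSet<>();
        for (char c : "0123456789abcdef".toCharArray()) {
            used.add(rangeOf(c + "000000000000000", ranges));
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.println(usedRanges(8)); // prints [0, 1, 7]
    }
}
```

Only the first two ranges and the last range receive keys. The five ranges in between cover byte values that hex keys never contain.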

Lesson 1: Pre-splitting tables is a good practice. However, when you pre-split tables, make sure that every region is reachable in the keyspace. Although the preceding example uses hex keys, the same problem can occur with any keyspace.

Lesson 2: Although hex keys (and, more generally, displayable data) are usually not preferred, they can still be used with pre-split tables if every region is reachable in the keyspace.

The following example shows how to generate suitable pre-splits for hex keys.

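import java.io.IOException;
import java.math.BigInteger;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableExistsException;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Bytes;

// Creates the table with the given split points.
// Returns false if the table already exists.
// Note: logger is assumed to be a logger field of the enclosing class.
public static boolean createTable(Admin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
  try {
    admin.createTable(table, splits);
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;
  }
}

// Returns numRegions - 1 split points that are evenly spaced over the
// hexadecimal key range [startKey, endKey].
public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions - 1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  // The first split point is one increment above the start key.
  lowestKey = lowestKey.add(regionIncrement);
  for (int i = 0; i < numRegions - 1; i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    // Format each split point as a zero-padded, 16-character hex string.
    splits[i] = Bytes.toBytes(String.format("%016x", key));
  }
  return splits;
}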