A rowkey is the unique identifier for each row in an HBase table. It controls how data is stored, partitioned, and accessed. Design rowkeys carefully before writing data at scale.
This topic covers five design considerations, with tradeoffs and examples for log data and transaction data.
How rowkeys work
Query methods
HBase supports two query methods, each with different constraints on rowkey design:
| Method | Description | Constraint |
|---|---|---|
| GET | Looks up a single row by its complete rowkey | All fields that make up the rowkey must be known |
| Scan | Reads a range of rows between a start key and an end key | Only prefix-based ranges are supported |
Prefix constraint for scans: A scan matches rows that start with a given prefix, but cannot query by suffix or match values in the middle of a rowkey. For example, if rowkeys are dictionary words, a scan can find all words starting with pre, but cannot find words ending with ing.
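The prefix constraint can be illustrated with a small simulation. The sketch below is not HBase client code: it assumes a `TreeSet<String>` stands in for a table sorted by rowkey, and the class and method names (`PrefixScanSketch`, `stopKey`, `prefixScan`) are hypothetical. It shows why a prefix scan is simply a range scan from the prefix to the prefix with its last character incremented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class PrefixScanSketch {
    // Exclusive stop key for a prefix scan: the prefix with its last character incremented.
    // Assumes a non-empty ASCII prefix whose last character is not the maximum value.
    static String stopKey(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Simulates a scan over sorted rowkeys: returns the keys in [prefix, stopKey(prefix)).
    static List<String> prefixScan(TreeSet<String> sortedKeys, String prefix) {
        return new ArrayList<>(sortedKeys.subSet(prefix, true, stopKey(prefix), false));
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(
                List.of("prefix", "prepare", "present", "print", "running", "sing"));
        System.out.println(prefixScan(keys, "pre")); // prints [prefix, prepare, present]
    }
}
```

Words ending in ing ("running", "sing") are scattered across the sorted order, so no single start and stop key can select them.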
For queries that cannot be expressed as a prefix scan, use one of the following approaches:
Create an index table with an inverted key structure
Apply a server-side filter to discard unwanted rows
Use secondary indexes
Rowkey uniqueness and versions
Rows written with the same rowkey are treated as a single record: each write to the same column is stored as a new version of that cell. By default, a GET returns the latest version. Rowkeys must be unique unless you intentionally rely on HBase's multi-version storage.
Use a rowkey like a database primary key. It can be a single field or a composite of multiple fields:
[user_id]: one record per user
[user_id][order_id]: multiple records per user
Design considerations
Data distribution: avoid hot spots
HBase distributes rows across Region servers by rowkey range (lexicographic order). If many writes share a common prefix — for example, a timestamp-first key like 2024-01-01T00:00:01 — all writes land on the same Region server. This creates a hot spot that degrades write throughput and leaves other servers idle.
Use one of the following techniques to spread writes across Region servers:
Salting with a hash prefix
Prepend the first few characters of an MD5 hash to the rowkey. Because the hash is deterministic, the same input always maps to the same prefix, so reads remain efficient.
[md5(user_id).substring(0, 4)][user_id][order_id]
Tradeoff: all rows for one user share the same hash prefix and stay contiguous, but users are no longer stored in user_id order. A range scan across several user IDs is not possible; each user must be read with its own prefix scan or GETs.
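A minimal sketch of the hash-prefix pattern, using only the Java standard library. The class name `SaltedRowkey` and the choice to concatenate fields as plain strings are assumptions made for readability; a real table would more likely build the key as bytes.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SaltedRowkey {
    // Lowercase hex MD5 digest of the input (always 32 characters).
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Builds [md5(user_id).substring(0, 4)][user_id][order_id].
    static String rowkey(String userId, String orderId) {
        return md5Hex(userId).substring(0, 4) + userId + orderId;
    }
}
```

Because the prefix is derived from the user ID alone, a reader who knows the user ID can recompute the prefix and issue a GET or a prefix scan directly.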
Reversing the key
Reverse the high-cardinality prefix field. For example, reversing a user ID that increments over time moves the fast-changing low-order digits to the front, so consecutive IDs no longer share a leading prefix.
[reverse(user_id)][order_id]
Tradeoff: natural ordering is lost, so range scans on the reversed field are not meaningful.
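The reversal itself is a one-liner. The sketch below assumes user IDs are fixed-width digit strings; `ReversedRowkey` is a hypothetical name.

```java
class ReversedRowkey {
    // Builds [reverse(user_id)][order_id] by reversing the characters of the user ID.
    static String rowkey(String userId, String orderId) {
        return new StringBuilder(userId).reverse().toString() + orderId;
    }

    public static void main(String[] args) {
        // Consecutive IDs share the prefix "1000012" before reversal
        // and differ in the first character afterwards.
        System.out.println(rowkey("10000123", "A")); // prints 32100001A
        System.out.println(rowkey("10000124", "A")); // prints 42100001A
    }
}
```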
Bucketing with modulo
Assign each row to a bucket using a modulo operation, then prepend the bucket number. This is effective for time series data where timestamps are monotonically increasing.
long bucket = timestamp % numBuckets;
[bucket][timestamp][hostname][log_event]
Tradeoff: to retrieve all data for a time range, you must scan all numBuckets ranges and merge the results.
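The following sketch shows both sides of the bucket pattern: building the key on write, and producing one scan range per bucket on read. The bucket count of 8, the fixed-width decimal encoding, and the names `BucketedRowkey` and `scanRanges` are illustrative assumptions, and timestamps are assumed to be non-negative epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.List;

class BucketedRowkey {
    static final int NUM_BUCKETS = 8; // hypothetical value; tune to the number of Regions

    // Builds [bucket][timestamp][hostname][log_event] with zero-padded numeric fields
    // so that string order matches numeric order.
    static String rowkey(long timestamp, String hostname, String logEvent) {
        long bucket = timestamp % NUM_BUCKETS;
        return String.format("%02d%013d%s%s", bucket, timestamp, hostname, logEvent);
    }

    // Returns one {startKey, stopKey} pair per bucket covering [startTs, endTs).
    // The caller runs one scan per pair and merges the results.
    static List<String[]> scanRanges(long startTs, long endTs) {
        List<String[]> ranges = new ArrayList<>();
        for (int b = 0; b < NUM_BUCKETS; b++) {
            ranges.add(new String[] {
                String.format("%02d%013d", b, startTs),
                String.format("%02d%013d", b, endTs)
            });
        }
        return ranges;
    }
}
```

Consecutive timestamps cycle through the buckets, so a burst of writes is spread over `NUM_BUCKETS` key ranges instead of one.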
Adding a random suffix
Append a random number to distribute writes across multiple rows.
[user_id][order_id][random(100)]
Tradeoff: reading a specific record requires knowing the random suffix. Point lookups are impractical without an index.
How to choose: If you need to scan across the distributed rows (not just look up individual records), use hashing rather than random suffixes — hashing is deterministic, so reads can be routed efficiently.
Rowkey length: keep it short
Rowkeys are stored with every column value in HBase. A long rowkey multiplies storage overhead across every column in every row. Keep rowkeys as short as possible:
Replace strings with numeric types. A long takes 8 bytes; the string "2015122410" takes 10 bytes, and an MD5 string takes 32 bytes. Use Long(2015122410) instead of "2015122410".
Use codes instead of full names. For example, use tb instead of "Taobao".
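The size difference is easy to check. This sketch uses `java.nio.ByteBuffer` to produce the 8-byte big-endian form of a long; the class name `RowkeyLength` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class RowkeyLength {
    // 8-byte big-endian encoding; byte order matches numeric order for non-negative values.
    static byte[] asLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // UTF-8 encoding of the decimal string: one byte per digit.
    static byte[] asString(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(asLong(2015122410L).length);    // prints 8
        System.out.println(asString("2015122410").length); // prints 10
        System.out.println(asString("tb").length);         // prints 2
        System.out.println(asString("Taobao").length);     // prints 6
    }
}
```

The saving per key looks small, but it is repeated for every cell stored under that rowkey.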
Field boundary clarity: prevent partial matches
When a rowkey combines multiple fields without delimiters, a scan range may return extra rows. For example, if the rowkey is [column1][column2][column3] and you scan from host1 to host2, the row host12... also falls in that range.
Two approaches prevent this:
Fixed-length padding: Pad each field to a fixed width so boundaries are unambiguous.
[rpad(column1, 'x', 20)][column2]
Delimiter: Separate fields with a delimiter character.
[column1][_][column2]
Fixed-length padding is more efficient for scans. Delimiters are easier to read.
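The partial-match problem and the padding fix can be reproduced with a sorted set standing in for the table. This is a sketch, not HBase code: `FieldBoundary`, `rpad`, and `scan` are hypothetical names, the field width of 8 is arbitrary, and field values are assumed never to end with the filler character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class FieldBoundary {
    // Right-pads a value with a filler character up to a fixed width.
    static String rpad(String value, char filler, int width) {
        if (value.length() > width) {
            throw new IllegalArgumentException("value longer than field width");
        }
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < width) {
            sb.append(filler);
        }
        return sb.toString();
    }

    // Simulates a scan over sorted rowkeys: returns the keys in [start, stop).
    static List<String> scan(TreeSet<String> keys, String start, String stop) {
        return new ArrayList<>(keys.subSet(start, true, stop, false));
    }

    public static void main(String[] args) {
        TreeSet<String> raw = new TreeSet<>(List.of("host1cpu", "host12cpu", "host2cpu"));
        System.out.println(scan(raw, "host1", "host2")); // prints [host12cpu, host1cpu]

        TreeSet<String> padded = new TreeSet<>();
        for (String host : List.of("host1", "host12", "host2")) {
            padded.add(rpad(host, 'x', 8) + "cpu");
        }
        System.out.println(scan(padded, "host1xxx", "host1xxy")); // prints [host1xxxcpu]
    }
}
```

With padding, the scan for host1 uses the full fixed-width field as its prefix, so host12 no longer falls inside the range.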
Descending order: use reverse timestamps
By default, HBase scans return rows in ascending key order. If you need the most recent entries first, two options are available:
Option 1: Reverse scan API (scan.setReversed(true))
Simpler to implement, but reverse scans perform worse than forward scans. Use this when descending order is an occasional requirement.
Option 2: Reverse timestamp in the rowkey
Store Long.MAX_VALUE - timestamp instead of the raw timestamp. This inverts the natural sort order so that newer entries appear first in a forward scan.
timestamp = Long.MAX_VALUE - timestamp;
[hostname][log_event][timestamp]
Use this when descending order is the primary access pattern and scan performance is critical.
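A short sketch of the reverse-timestamp encoding, assuming non-negative timestamps and an 8-byte big-endian layout; `ReverseTimestamp` and its methods are hypothetical names. It uses `Arrays.compareUnsigned` (Java 9+) to mimic byte-wise lexicographic key comparison.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class ReverseTimestamp {
    // Inverts the sort order: larger (newer) timestamps map to smaller values.
    static long reverse(long timestamp) {
        return Long.MAX_VALUE - timestamp;
    }

    // 8-byte big-endian encoding of the reversed timestamp for use inside a rowkey.
    static byte[] encode(long timestamp) {
        return ByteBuffer.allocate(Long.BYTES).putLong(reverse(timestamp)).array();
    }

    // True if the key for `newer` sorts before the key for `older` in byte order.
    static boolean sortsFirst(long newer, long older) {
        return Arrays.compareUnsigned(encode(newer), encode(older)) < 0;
    }
}
```

To recover the original value when reading, apply the same subtraction again: `Long.MAX_VALUE - stored`.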
Design examples
The right rowkey design depends on your primary access patterns. The same dataset can require a different design depending on how it is queried. The examples below show how access patterns drive design decisions.
Log data and time series data
The data elements are: hostname, log_event, timestamp.
| Access pattern | Rowkey design | Notes |
|---|---|---|
| Query a metric for a host over a time range | [hostname][log_event][timestamp] | Efficient for range scans per host. May create hot spots if a single host dominates writes |
| Query the most recent records for a host | [hostname][log_event][Long.MAX_VALUE - timestamp] | Reverse timestamp puts the latest entries first in a forward scan |
| Distribute writes evenly across time (large data volumes or no dominant host) | [bucket][timestamp][hostname][log_event] where bucket = timestamp % numBuckets | Requires scanning all bucket ranges to aggregate results for a time range |
How to choose: Start with [hostname][log_event][timestamp] if per-host range queries are the primary use case. Switch to the bucket pattern if write hot spots appear or if time-range queries span many hosts.
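As a rough illustration of the first design, the sketch below builds [hostname][log_event][timestamp] keys and the start and stop keys for a per-host time-range scan. The field widths, space padding, zero-padded decimal timestamp, and the names `LogRowkey` and `timeRange` are all assumptions; values longer than the field width are not handled.

```java
class LogRowkey {
    static final int HOST_WIDTH = 12; // hypothetical fixed width for hostname
    static final int EVENT_WIDTH = 8; // hypothetical fixed width for log_event

    // Left-justifies the value and pads with spaces up to the given width.
    static String pad(String value, int width) {
        return String.format("%-" + width + "s", value);
    }

    // Builds [hostname][log_event][timestamp] with fixed-width fields.
    static String rowkey(String hostname, String logEvent, long timestamp) {
        return pad(hostname, HOST_WIDTH) + pad(logEvent, EVENT_WIDTH)
                + String.format("%013d", timestamp);
    }

    // Returns {startKey, stopKey} for one host and event over [startTs, endTs).
    static String[] timeRange(String hostname, String logEvent, long startTs, long endTs) {
        return new String[] {
            rowkey(hostname, logEvent, startTs),
            rowkey(hostname, logEvent, endTs)
        };
    }
}
```

Because the hostname and event fields are fixed-width, every key for the same host and event sorts by timestamp, and a single scan between the two returned keys covers the requested range.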
Transaction data
A transaction record has three key elements: a buyer, a seller, and an order number. Different access patterns require different rowkey designs, and often multiple tables.
| Access pattern | Table | Rowkey design |
|---|---|---|
| Query a seller's orders in a time range | Seller table | [seller_id][timestamp][order_number] |
| Query a buyer's orders in a time range | Buyer table | [buyer_id][timestamp][order_number] |
| Look up an order by order number | Index table | [order_number] |
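Design all three tables to cover all three access patterns. To look up an order by number, read the index table row for that order_number to obtain the buyer_id or seller_id and the timestamp (stored, for example, as column values in the index row), then build the full rowkey and query the buyer or seller table.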
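The three-table layout can be sketched with in-memory sorted maps standing in for HBase tables. Everything here is illustrative: the class `TransactionTables`, the `|` delimiter (IDs are assumed not to contain it), the zero-padded timestamp, and the choice to store the full buyer-table rowkey in the index row are assumptions, not part of any HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class TransactionTables {
    final TreeMap<String, String> sellerTable = new TreeMap<>();
    final TreeMap<String, String> buyerTable = new TreeMap<>();
    final TreeMap<String, String> indexTable = new TreeMap<>();

    // Builds [id][timestamp][order_number] with a delimiter between fields.
    static String key(String id, long ts, String orderNumber) {
        return id + "|" + String.format("%013d", ts) + "|" + orderNumber;
    }

    // Writes one order to all three tables.
    void put(String buyerId, String sellerId, long ts, String orderNumber, String details) {
        sellerTable.put(key(sellerId, ts, orderNumber), details);
        String buyerKey = key(buyerId, ts, orderNumber);
        buyerTable.put(buyerKey, details);
        indexTable.put(orderNumber, buyerKey); // index row points at the buyer-table rowkey
    }

    // Range scan on the seller table for orders in [startTs, endTs).
    List<String> sellerOrders(String sellerId, long startTs, long endTs) {
        String start = sellerId + "|" + String.format("%013d", startTs);
        String stop = sellerId + "|" + String.format("%013d", endTs);
        return new ArrayList<>(sellerTable.subMap(start, true, stop, false).values());
    }

    // Two-step lookup: index table first, then a point read on the buyer table.
    String byOrderNumber(String orderNumber) {
        String buyerKey = indexTable.get(orderNumber);
        return buyerKey == null ? null : buyerTable.get(buyerKey);
    }
}
```

Each write touches three tables, which is the cost of serving all three access patterns with efficient scans or point reads.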
What's next
HBase data model overview
Secondary indexes in HBase
Performance tuning for HBase tables