How do I design primary keys for Lindorm wide tables - Lindorm

LindormTable is a distributed data engine in which data is distributed based on primary keys. If the primary key of a table contains multiple columns, LindormTable uses the columns in a left-to-right order to query data. A primary key that is improperly designed cannot be effectively used in queries. In this case, a large number of queries may be performed on a small amount of hot spot data, which degrades the query performance. Therefore, the design of primary keys plays an important role in data partitioning and data queries in Lindorm. This topic describes the considerations for primary key design and provides examples.

Considerations

Are primary key values unique?

Different versions of a row use the same primary key value. By default, the latest version is returned when a query is performed. In most cases, primary keys must be unique unless multi-versioning is implemented.

Best practice: A primary key is a column or a set of columns. Each primary key value corresponds to a record.

[userid]: Only one column is specified as the primary key. Only one record is generated for each user.
[userid][orderid]: Two columns are specified as the primary key. Multiple records are generated for each user.

How do I design the primary key in different scenarios?

The primary key design limits the query methods that can be used. Lindorm supports SELECT statements that use the following methods:

GET method: The primary key is used to query data. Example: SELECT * FROM table WHERE userid='abc' AND orderid=123.
Note
To use the GET method, you must specify all primary key columns. The values in all primary key columns must be explicit.
SCAN method: Data is queried based on the primary key range. Example: SELECT * FROM table WHERE userid='abc' AND 123<orderid<456.
Note
To use the SCAN method, you must specify the range that you want to scan in the first primary key column. By default, if you do not specify the range, Lindorm rejects queries that must scan all data in the table because full table scans are inefficient. For more information, see SELECT.

Best practice: How do I submit complex queries by using the preceding query methods?

Create an index table.
Specify the columns that you want to scan other than the primary key columns in the query conditions. Lindorm automatically filters out the irrelevant data.
Use secondary indexes.
Use ORDER BY to sort the data in descending order. This way, new records are sorted to the top rows of the table. Example: SELECT * FROM table WHERE userid='abc' AND 123<orderid<456 ORDER BY orderid DESC.
Note
In scenarios in which most queries are submitted to retrieve the up-to-date data, you can design the primary key as [userid][orderid DESC] to sort the data in descending order.

What factors should I consider when I design a primary key?

When you design a primary key, you must consider the length of values in the primary key columns and the number of primary key columns.

The length of values in primary key columns: Values in the primary key columns should be short in length. We recommend that you select columns that stores fixed-length values, such as long integers, as the primary key columns. If the length of values in a primary key column is not fixed, we recommend that limit the value length in the column within 2 KB to reduce storage costs and improve write performance.
The number of primary key columns: Less primary key columns can improve write performance and reduce storage costs. We recommend that you limit the number of primary key columns in a range from 1 to 3.

What should I avoid when designing primary keys?

Data stored in Lindorm is distributed based on primary keys. If the primary key of a table contains multiple columns, data is distributed based the columns in a left-to-right order. To avoid a large number of write operations from being performed on a small amount of hot spot data, note the following items when you define the primary key of a table:

The values in the first primary key column must be dispersed.
Do not specify a column that contains auto-incremental data or a column in which values have the same prefix, such as the timestamp column, as the first primary key column or the index column.
Do not specify a column that contains enumerated data, such as order types, or a column in which values have obvious prefixes as the first primary key column.

If you have to specify a column of the preceding type as the first primary key column, use the hash method to distribute data in the column.

For example, if you have to specify the column pk that contains auto-incremental strings as the first primary key column, you can create a new column named pk1 based on the pk column by using the following algorithm: pk1 = hash(pk).substring(0,4)+pk. The pk1 column is concatenated by the pk column and a prefix that is the first four digits of the result returned by the hash method based on the pk column.

Do stacked hot spots occur for fully distributed data?

The hash method is used to distribute data to different partitions. This prevents a server from being terminated by hot spots and the other servers from being idle. This way, the distributed architecture and concurrent processing are utilized in an efficient manner.

Best practice:

Design a MD5 hash algorithm. The primary key is [md5(userid).subStr(0,4)][userId][orderid].
Design a reverse index. The primary key is [reverse(userid)][orderid].
Design the modulo operation. The primary key is [bucket][timestamp][hostname][log-event]; long bucket = timestamp % numBuckets.
Add random numbers. The primary key is [userId][orderid][random(100)].

Can a primary key be simplified?

You can reduce the number of primary key columns to decrease the amount of data that is scanned and improve the efficiency of queries and insert operations.

Best practice:

Replace the STRING data type with the LONG or INT data type. Example: '2015122410' => Long(2015122410).
Replace names with codes. Example: 'taobao'=> 'tb'.

Commonly used designs

Primary key designs for log data and time series data.
- Query the data of a metric that is generated over a period of time. The primary key is designed as [hostname][log-event][timestamp].
- Query the most recent records of a metric. The primary key is designed as [hostname][log-event][timestamp DESC].
- Query data that contains only the time dimension or query data whose volume is large in a specific dimension. Design the primary key as long bucket = timestamp % numBuckets; [bucket][timestamp][hostname][log-event].
Primary key designs for transaction data.
- Query the transaction records of a seller within a specific period of time. Design the primary key as [seller_id][timestmap][order_number].
- Query the transaction records of a buyer within a specific period of time. Design the primary key as [buyer_id][timestmap][order_number].
- Query data based on order IDs. Design the primary key as [order_number].
- Join the three tables to perform queries. The primary key of the table that stores buyer data is designed as [buyer_id][timestmap][order_number]. The primary key of the table that stores seller data is designed as [seller id][timestmap][order number]. The primary key of the table that stores order IDs is designed as [order_number].