All About BTrDB: Berkeley's Tree Database

By Jiao Xian

This article studies and introduces the internal implementation details of BTrDB, an open-source time series database for Internet of Things (IoT) applications.

1. Scenarios

The BTrDB paper introduces a practical project that clearly illustrates the original design intent and applicable scenarios of BTrDB:

A large number of sensors of a certain type are deployed in a power grid. Each sensor generates 12 timelines at a frequency of 120 Hz (that is, 120 points per second per timeline) with a time accuracy of 100 ns. For various reasons, problems such as delays and out-of-order timestamps often occur when these sensors transmit data. In this scenario, BTrDB can support 1,000 such sensors on a single machine, with a write rate of about 1.44 million points/s.

This project has the following features:

(1) The timelines are invariant over a long period of time, and their life cycle is consistent with that of the (IoT) devices.
(2) The data frequency is very high (120Hz) and fixed.
(3) The time resolution of the data is very high (100 ns level). By comparison, time series databases such as Druid and TimescaleDB generally offer millisecond-level (or at best microsecond-level) precision.
(4) Data often arrives out of order.
(5) The number of timelines is limited.

To adapt to the preceding scenarios, BTrDB proposes the "time-partitioning version-annotated copy-on-write tree" data structure, in which a tree (similar to a B+ tree) is built for each timeline. Data in the tree is sorted by timestamp, and each leaf node stores the time series data of a certain time period in order.

The life cycle of this tree is directly linked to the life cycle of the device, so as time goes by, even a tree that contains only one timeline occupies considerable storage space (about 10 million points/day). In addition, based on the tree structure and the version-annotated concept, BTrDB handles out-of-order data and time series data with arbitrary precision well.
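The two volume figures quoted in this section follow directly from the sensor parameters. A quick check, using only the numbers given above (the variable names are just for illustration):

```python
SENSORS = 1000            # sensors per machine (from the text)
STREAMS_PER_SENSOR = 12   # timelines per sensor (from the text)
RATE_HZ = 120             # points per second per timeline (from the text)
SECONDS_PER_DAY = 24 * 60 * 60

points_per_second = SENSORS * STREAMS_PER_SENSOR * RATE_HZ
points_per_day_per_timeline = RATE_HZ * SECONDS_PER_DAY

print(points_per_second)             # 1440000
print(points_per_day_per_timeline)   # 10368000
```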

The data structure is different from the previous variants of time series databases based on LSM-Tree, so BTrDB also provides a new time series data query interface, which is convenient for building various algorithms and applications on the upper layer of BTrDB.

2. Data Structure

2.1. Version-Annotated & CoW Features

When writing data, BTrDB first modifies data blocks in memory (creating new blocks or modifying existing ones through the CoW mechanism), and writes the data back to the underlying block storage once it reaches a certain threshold. Because of the CoW mechanism, and because the underlying block storage (Ceph by default) cannot overwrite data in place, an update can only create a new version of a data block.
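The copy-on-write idea can be illustrated with a tiny immutable tree. This is a minimal sketch and not BTrDB's actual code: the `Node` class, the two-level layout, and `cow_insert` are all hypothetical names chosen for the example. An insert copies only the touched leaf and the root, so older versions stay readable and untouched subtrees are shared.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Node:
    """Immutable node; 'version' records which write created it."""
    version: int
    points: Tuple[Tuple[int, float], ...] = ()     # used by leaves
    children: Tuple[Optional["Node"], ...] = ()    # used by the root

def cow_insert(root: Node, bucket: int, point, new_version: int) -> Node:
    """Return a new root; only the touched leaf and the root are copied."""
    old_leaf = root.children[bucket]
    old_points = old_leaf.points if old_leaf is not None else ()
    new_leaf = Node(new_version, points=tuple(sorted(old_points + (point,))))
    children = list(root.children)
    children[bucket] = new_leaf
    return Node(new_version, children=tuple(children))

v1 = Node(version=1, children=(None, None, None, None))
v2 = cow_insert(v1, bucket=2, point=(100, 1.5), new_version=2)
v3 = cow_insert(v2, bucket=2, point=(90, 0.7), new_version=3)   # out-of-order point
v4 = cow_insert(v3, bucket=0, point=(10, 0.1), new_version=4)
```

After these writes, `v2` still sees only the point at time 100, while `v4` shares the bucket-2 leaf object with `v3` instead of copying it.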

2.2. Leaf Nodes

Because of the fixed frequency of the data sent from IoT devices, the space occupied by these leaf nodes is basically the same.

Before a leaf node is persisted to the underlying storage, its timestamps and double-precision floating point values are kept in memory as two separate arrays. When the node is serialized to the underlying storage, delta-delta compression is applied to both the timestamps and the values. Before serialization, each double-precision floating point value is split into its mantissa and exponent, and delta-delta compression is applied to each part separately.
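To make the delta-delta idea concrete, here is a minimal sketch on integer timestamps. It is not BTrDB's on-disk format (which also applies variable-length encoding); the function names are hypothetical, and `math.frexp` is used only as a stand-in for splitting a double into mantissa and exponent.

```python
import math

def delta_of_delta_encode(timestamps):
    """Keep the first value, the first delta, then differences between deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for a, b in zip(timestamps[1:], timestamps[2:]):
        delta = b - a
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

def delta_of_delta_decode(encoded):
    """Inverse of delta_of_delta_encode."""
    if len(encoded) < 2:
        return list(encoded)
    out = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        out.append(out[-1] + delta)
    return out

# Timestamps of a 120 Hz stream, in nanoseconds (rounded to integers).
ts = [0, 8333333, 16666667, 25000000, 33333333]
encoded = delta_of_delta_encode(ts)      # [0, 8333333, 1, -1, 0]

# Splitting a double into mantissa and exponent (stand-in for the bit-level split).
mantissa, exponent = math.frexp(3.5)     # 3.5 == 0.875 * 2**2
```

With a fixed sampling frequency, almost all encoded values are close to zero, which is what makes the subsequent compression effective.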

2.3. Intermediate Node

An intermediate node is divided into multiple buckets. Each bucket contains the link (with version number) pointing to the child node, and the statistical data of the child node:

The time range of the child node
Aggregated data, such as sum, max, min, and count
The link address and version number of the child node

When processing a query, if the time resolution of an intermediate node already meets the query requirement, the query does not read the lower-level child nodes, so precision reduction (downsampling) comes naturally from the structure. This implementation can effectively handle out-of-order and duplicate data as well as delete operations, and it ensures data consistency better than other existing implementations.
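The early-stop behaviour described above can be sketched as follows. This is an illustration under simplifying assumptions, not BTrDB's implementation: `Bucket`, `Inner`, `summarize`, and `statistical_range` are hypothetical names, leaves are plain non-empty lists of `(time, value)` pairs, and a bucket that points at a leaf is always answered from its aggregates rather than re-windowed.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Bucket:
    start: int      # start of the bucket's time range
    width: int      # length of the bucket's time range
    count: int
    total: float
    lo: float
    hi: float
    child: Union["Inner", List[Tuple[int, float]]]   # lower node or raw leaf points

@dataclass
class Inner:
    buckets: List[Bucket]

def summarize(start, width, child) -> Bucket:
    """Build a bucket's aggregates from its child (bottom-up)."""
    if isinstance(child, Inner):
        parts = [(b.count, b.total, b.lo, b.hi) for b in child.buckets]
    else:
        parts = [(1, v, v, v) for _, v in child]
    return Bucket(start, width,
                  sum(p[0] for p in parts), sum(p[1] for p in parts),
                  min(p[2] for p in parts), max(p[3] for p in parts), child)

def statistical_range(node: Inner, resolution: int, out=None):
    """Collect (start, min, mean, max, count) rows, descending only when needed."""
    out = [] if out is None else out
    for b in node.buckets:
        if b.width <= resolution or not isinstance(b.child, Inner):
            # The stored aggregates are precise enough: do not read the child.
            out.append((b.start, b.lo, b.total / b.count, b.hi, b.count))
        else:
            statistical_range(b.child, resolution, out)
    return out

leaf_a = [(0, 1.0), (5, 3.0)]
leaf_b = [(10, 2.0)]
mid = Inner([summarize(0, 10, leaf_a), summarize(10, 10, leaf_b)])
root = Inner([summarize(0, 20, mid)])

coarse = statistical_range(root, resolution=20)   # answered from the root alone
fine = statistical_range(root, resolution=10)     # descends one level
```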

2.4. Tree Split During Insertion

A new tree (corresponding to a new timeline) has only one root node. In the BTrDB implementation, the time span of the root node is about 146.23 years, so the time span of each of its 64 buckets is 146.23 / 64 ≈ 2.28 years. With the default configuration, the year 1970 falls in the 16th bucket of the root node.

The root node therefore fixes the time range of the data when it is created, and subsequent data insertion splits nodes layer by layer from top to bottom. When a timeline is incomplete because of data loss or other reasons, some leaf nodes may end up at different depths, so the tree is not strictly balanced. The data insertion process is as follows:
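These figures can be reproduced with a few lines of arithmetic. The sketch below assumes, as an inference from the quoted numbers rather than something stated in this article, that the root covers 2**62 ns and starts 16 buckets before the Unix epoch; a 365-day year is used for the conversion.

```python
NS_PER_YEAR = 365 * 24 * 60 * 60 * 10**9   # 365-day year, in nanoseconds
BUCKETS = 64                               # buckets per node (from the text)

ROOT_SPAN_NS = 1 << 62                     # assumption: root covers 2**62 ns
BUCKET_SPAN_NS = ROOT_SPAN_NS // BUCKETS   # 2**56 ns per root bucket
ROOT_START_NS = -16 * BUCKET_SPAN_NS       # assumption: 16 buckets before 1970

root_span_years = ROOT_SPAN_NS / NS_PER_YEAR
bucket_span_years = BUCKET_SPAN_NS / NS_PER_YEAR
epoch_bucket = (0 - ROOT_START_NS) // BUCKET_SPAN_NS   # bucket index of 1970-01-01

print(f"root span: {root_span_years:.2f} years")
print(f"bucket span: {bucket_span_years:.2f} years")
print(f"bucket index of the epoch: {epoch_bucket}")
```

Under these assumptions the root span comes out at roughly 146.2 years and each bucket at roughly 2.28 years, in line with the values quoted above.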

Data insertion starts from the root node;
If the current node is an intermediate node, then each data point is traversed to find the corresponding bucket;
If the corresponding bucket does not exist, a new bucket and its associated child node are created:
- If the number of points to be inserted in the current bucket exceeds the maximum number of leaf nodes (1024 by default), an intermediate child node is directly created;
- Otherwise, a leaf node is directly created;
Insert data to the child node associated with the bucket to which it belongs;
If the current node is a leaf node, and the sum of existing data points and data points to be inserted in the node exceeds 1024, the current node is split to create a new intermediate node and the data is inserted into the new intermediate node. Otherwise, the data to be inserted is merged with the existing data in the current node and sorted by the timestamp;
After the data is successfully inserted into the current node, the statistical data of its parent nodes is updated from bottom to top.

As can be seen from the above process, a node may be split in two places during insertion. One is a top-down split from the root node. The other is an upward split from the leaf node.
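The insertion steps above can be sketched in a few dozen lines. This is a simplified illustration, not BTrDB's code: it mutates nodes in place instead of using copy-on-write, keeps only a point count as the aggregate, and the names `Leaf`, `Inner`, `make_child`, and `insert` are hypothetical. Bucket widths are assumed to be powers of 64 so that every split divides evenly.

```python
from dataclasses import dataclass, field

K = 64            # buckets per intermediate node (from the text)
LEAF_MAX = 1024   # maximum points per leaf node (from the text)

@dataclass
class Leaf:
    points: list = field(default_factory=list)    # sorted (time, value) pairs

@dataclass
class Inner:
    start: int                                    # first timestamp covered
    width: int                                    # time span of one bucket
    children: dict = field(default_factory=dict)  # bucket index -> child node
    count: int = 0                                # aggregate, updated bottom-up

def count(node):
    return len(node.points) if isinstance(node, Leaf) else node.count

def make_child(start, width, points):
    """Create a leaf if the points fit, otherwise an intermediate node."""
    if len(points) <= LEAF_MAX or width < K:      # too narrow to split further
        return Leaf(sorted(points))
    node = Inner(start, width // K)
    insert(node, points)
    return node

def insert(node, points):
    """Route each point to its bucket, creating or splitting children as needed."""
    by_bucket = {}
    for t, v in points:
        idx = (t - node.start) // node.width
        assert 0 <= idx < K, "point outside the node's time range"
        by_bucket.setdefault(idx, []).append((t, v))
    for idx, pts in by_bucket.items():
        child_start = node.start + idx * node.width
        child = node.children.get(idx)
        if child is None:                         # bucket does not exist yet
            node.children[idx] = make_child(child_start, node.width, pts)
        elif isinstance(child, Leaf):             # merge, splitting if too large
            node.children[idx] = make_child(child_start, node.width,
                                            child.points + pts)
        else:
            insert(child, pts)
    node.count = sum(count(c) for c in node.children.values())

root = Inner(start=0, width=K ** 3)
insert(root, [(t, float(t)) for t in range(1024)])   # fits in a single leaf
first_child_is_leaf = isinstance(root.children[0], Leaf)
insert(root, [(1024, 1024.0)])                        # 1025 points: leaf is split
```

The last call pushes the first bucket past 1,024 points, so its leaf is replaced by an intermediate node whose own buckets are created top-down.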

Although this tree is not a balanced tree, for IoT projects, the timeline life cycle and the frequency of data collection of the device are very stable. In most scenarios, the data in the nodes are evenly distributed.

2.5. Memory Space Occupied by Nodes

By default, a leaf node can store up to 1024 data points, and an intermediate node can store up to 64 child node pointers. Therefore:

For leaf nodes that have not been persisted, the memory space occupied is 1024 * 2 * 8 bytes = 16 KB (see 2.2)
For intermediate nodes that have not been persisted, the memory space occupied is 64 * 6 * 8 + 64 * 2 * 8 bytes = 4 KB (see 2.3)

3. Data Storage

3.1. Write to WAL

When inserting data, the data is written to WAL (Write Ahead Log) first;
Each time data is written to WAL, a check point is returned, which indicates the write location of the data in WAL;
After the data is successfully written into WAL, the original data and check points will be written into the buffer of the timeline;
The buffer of the timeline corresponds to the timeline one by one, with a maximum capacity of 32768 data points;
When the buffer is full, data is inserted into the tree structure, and the check points corresponding to the buffer are marked as deleted in WAL;
In the replay process of WAL, raw data is filtered based on the deleted check points.
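The interaction between the WAL, check points, and the per-timeline buffer can be sketched as below. This is a minimal in-memory model and not BTrDB's implementation: the `WAL` and `StreamBuffer` classes are hypothetical, the check point is simply the record's index, and a plain list stands in for the tree.

```python
class WAL:
    def __init__(self):
        self.records = []     # a record's index doubles as its check point
        self.deleted = set()  # check points whose data already reached the tree

    def append(self, stream_id, point):
        self.records.append((stream_id, point))
        return len(self.records) - 1          # check point = write location

    def mark_deleted(self, checkpoints):
        self.deleted.update(checkpoints)

    def replay(self):
        """Yield only records that were never flushed into a tree."""
        for cp, record in enumerate(self.records):
            if cp not in self.deleted:
                yield record

class StreamBuffer:
    def __init__(self, wal, stream_id, tree, capacity=32768):
        self.wal, self.stream_id, self.tree = wal, stream_id, tree
        self.capacity = capacity
        self.points, self.checkpoints = [], []

    def write(self, point):
        cp = self.wal.append(self.stream_id, point)   # WAL first
        self.points.append(point)                     # then the buffer
        self.checkpoints.append(cp)
        if len(self.points) >= self.capacity:
            self.flush()

    def flush(self):
        self.tree.extend(self.points)                 # stand-in for tree insertion
        self.wal.mark_deleted(self.checkpoints)
        self.points, self.checkpoints = [], []

wal, tree = WAL(), []
buf = StreamBuffer(wal, "stream-a", tree, capacity=2)  # tiny capacity for the demo
for p in [(1, 1.0), (2, 2.0), (3, 3.0)]:
    buf.write(p)
pending = list(wal.replay())
```

After the three writes, the first two points have been flushed to the tree and marked as deleted in the WAL, so a replay returns only the third point.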

The following schematic diagram shows the correlation between check points in WAL and the time series buffer. After the buffer is emptied, BTrDB writes the deleted check points into the metadata (block attributes) of the block file corresponding to the current WAL.

When all check points in a WAL block file are marked as deleted, the file can be deleted from Ceph. When the size of the current WAL file exceeds 16 MB, a new block file is created. Ideally, a block file can be deleted promptly. However, if some timelines are abnormal and their buffers never fill up, those buffers are not recycled until 8 hours later, so the WAL files that record these timelines can also be recycled only after 8 hours.

Each retained WAL file is only 16 MB in size, and the number of such files grows linearly with the number of abnormal IoT devices. More operational statistics from IoT devices are therefore needed to measure the actual impact.

3.2. Write to the Block

After being persisted, the BTrDB tree structure generates two types of data files. One is called the superblock, which records the basic information of the current tree, such as the latest version, the update time, and the root node location. The other is called the segment, which uniformly contains the data of leaf nodes and intermediate nodes of the tree.

The superblock is versioned, and each version of the superblock occupies only 16 bytes. The format is as follows:

{Root: root node location, 8 Bytes, timestamp: modification time, 8 Bytes}

The addressing method of the superblock in Ceph is as follows:

Block storage id = uuid.toString() + (version >> 20)
Offset in block storage = (version & 0xFFFFF) * 16

When the BTrDB tree structure is persisted, leaf nodes and intermediate nodes are serialized into the segment file together. The addressing method for each node is as follows:

Block storage id = uuid.toString() + (address >> 24)
Node offset in block storage = (address & 0xFFFFFF)
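The two addressing schemes above amount to simple bit arithmetic. A minimal sketch, assuming the object name is the UUID string with the decimal block number appended (the exact string format is not specified in this article, and the function names are hypothetical):

```python
def superblock_location(uuid: str, version: int):
    """Locate the 16-byte superblock record of a given tree version."""
    object_id = f"{uuid}{version >> 20}"     # high bits select the block file
    offset = (version & 0xFFFFF) * 16        # low 20 bits index 16-byte records
    return object_id, offset

def node_location(uuid: str, address: int):
    """Locate a serialized tree node inside a segment block file."""
    object_id = f"{uuid}{address >> 24}"     # high bits select the block file
    offset = address & 0xFFFFFF              # low 24 bits are the byte offset
    return object_id, offset

MIB = 1024 * 1024
superblock_file_size = (1 << 20) * 16        # 2**20 records of 16 bytes each
segment_file_size = 1 << 24                  # 24-bit byte offsets
```

Both sizes work out to 16 MiB, which matches the 16 MB block file size used throughout the storage layer.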

The WAL file, the superblock block file, and the segment block file are all 16 MB in size. In addition, BTrDB performs no compaction and does not clean up the data of expired versions. Only the WAL processing described above is performed, so write amplification is significant.

New Query Semantics

Here, I list and briefly introduce the new query semantics provided by BTrDB. These query semantics are closely related to the BTrDB data structure, either to take advantage of some tree structure features or to avoid some tree structure defects:

GetRange(UUID, StartTime, EndTime, Version) -> (Version, [(Time, Value)]) to query the detailed (original) data of a timeline within a certain time range;
GetLatestVersion(UUID) -> Version to query the latest version of the timeline;
GetStatisticalRange(UUID, StartTime, EndTime, Version, Resolution) -> (Version, [(Time, Min, Mean, Max, Count)]) to obtain the aggregated data of a timeline that falls within a given time range, satisfies a given time precision, and has a version greater than or equal to the given version;
GetNearestValue(UUID, Time, Version, Direction) -> (Version, (Time, Value)) to obtain the nearest point before or after the given time, depending on Direction;
ComputeDiff(UUID, FromVersion, ToVersion, Resolution) -> [(StartTime, EndTime)] to obtain the start and end time of all update nodes within the range of the two version numbers under the condition of satisfying the given time precision. It is suitable for incremental computation.

The Resolution parameter in the interfaces above has a significant impact on their performance. As mentioned above, the time resolution of the root node is about 2.28 years. From the root node down to the bottom of the tree, the time resolution of the data in the nodes becomes higher and higher. When querying, low-resolution data has a high degree of aggregation, so few data blocks need to be scanned. High-resolution data has a low degree of aggregation, so many data blocks need to be scanned.
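How quickly the resolution sharpens with depth follows from the fan-out of 64. The sketch below uses only the root span and fan-out quoted earlier (the function name is hypothetical, and a 365-day year is assumed for the unit conversion):

```python
ROOT_SPAN_YEARS = 146.23                 # time span of the root node (from the text)
FANOUT = 64                              # buckets per intermediate node (from the text)
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def bucket_resolution_seconds(level: int) -> float:
    """Time span of one bucket at the given tree level (0 = root)."""
    return ROOT_SPAN_YEARS * SECONDS_PER_YEAR / FANOUT ** (level + 1)

for level in range(6):
    print(level, f"{bucket_resolution_seconds(level):.6g} s")
```

Each level down divides the bucket width by 64, so a query that asks for a finer resolution has to descend further and touch correspondingly more nodes.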

Summary

In BTrDB, the data structure is built around a single timeline: relying on the stability of IoT device data, one tree is built per timeline to store its time series data. The tree structure solves several problems that traditional TSDBs face, such as out-of-order insertion, deletion, and precomputed precision reduction (downsampling).
