By Jiao Xian
This article studies and introduces the internal implementation details of BTrDB, an open-source time series database for Internet of Things (IoT) applications.
A practical project is introduced in the BTrDB paper, which can clearly explain the original design intention and applicable scenarios of BTrDB:
A large number of sensors of certain types are deployed in a power grid. Each sensor generates 12 timelines with 120 Hz frequency (that is, 120 points per second) and 100 NS time accuracy. For various reasons, problems, such as delays, and disorder (time), often occur in these sensors' data transmission. In this scenario, BTrDB can support 1000 similar sensors in a single unit, with a write rate of about 1.44 MB points/s.
This project has the following features:
(1) The timelines are invariant over a long period of time, and their life cycle are consistent with those of (IoT) devices.
(2) The data frequency is very high (120Hz) and fixed.
(3) The time resolution of the data is very high (100 NS level). In general, the time accuracy of time series databases, such as Druid and TimescaleDB, can reach MS level at most.
(4) Data transmission is often disordered.
(5) The number of timelines is limited.
To adapt to the preceding scenarios, BTrDB designs and proposes the "time-partitioning version-annotated copy-on-write tree" data structure, in which a tree is built for each timeline (refer to B+Tree). Data is sorted by the time stamp in the tree, and the leaf nodes store the time series data for a certain time period in an orderly way.
It is conceivable that the life cycle of this tree is directly linked to the life cycle of the device, so as time goes by, even if the tree contains only one timeline, it also occupies a considerable storage space (about 10 MB points/day). In addition, based on the tree structure and the version-annotated concept, BTrDB can support disordered data and time series data with arbitrary precision well.
The data structure is different from the previous variants of time series databases based on LSM-Tree, so BTrDB also provides a new time series data query interface, which is convenient for building various algorithms and applications on the upper layer of BTrDB.
When writing data, BTrDB first modifies the data blocks in the memory (create new blocks or modify existing blocks using CoW mechanism), and writes the data back to the underlying block storage when the data reaches a certain threshold. Because of the CoW mechanism, and because the underlying block storage (Ceph is used by default) cannot overwrite the update, only a new version of the data block can be created.
Because of the fixed frequency of the data sent from IoT devices, the space occupied by these leaf nodes is basically the same.
When the leaf node has not been persisted to the underlying storage, the timestamp and double-precision floating point values are stored in memory as an array, respectively. When serialized to the underlying storage, delta-delta compaction is performed on the timestamp and values. Before the double-precision floating point values are serialized, the number of floating point data is split into the mantissa and the exponent, and delta-delta compaction is performed respectively.
An intermediate node is divided into multiple buckets. Each bucket contains the link (with version number) pointing to the child node, and the statistical data of the child node:
When processing a query, if the time precision of the intermediate node meets the query requirement, the query operation no longer reads the lower-level child nodes, so that the precision function is naturally realized. This implementation method can effectively process disordered and duplicate data, as well as delete operations, and can ensure data consistency better than other existing implementations.
A new tree (corresponding to a new timeline) has only one root node. In the BTrDB implementation, the time span of the root node is about 146.23 years, thus the time span of each bucket is 146.23/64 ~ = 2.28
(years). According to the default configuration, the year 1970 is on the 16th bucket in the root node.
It can be seen that the root node has limited the time range of data when it is created, and subsequent data insertion is split layer by layer from top to bottom. When the timeline is incomplete due to data loss or other reasons, the depth of some nodes may be different, so it is not a strictly Balanced tree. The data insertion process is as follows:
If the corresponding bucket does not exist, a new bucket and its associated child node are created:
As can be seen from the above process, a node may be split in two places during insertion. One is a top-down split from the root node. The other is an upward split from the leaf node.
Although this tree is not a balanced tree, for IoT projects, the timeline life cycle and the frequency of data collection of the device are very stable. In most scenarios, the data in the nodes are evenly distributed.
By default, a leaf node can store up to 1024 data points, and a intermediate node can store up to 64 child node pointers. Therefore:
1024 * 2 * 8 = 16 KB (see 2.2.1)
64 * 6 * 8 + 64 * 2 * 8 = 4 KB(see 2.2.2)
The following schematic diagram shows the correlation between check points in WAL and the time series buffer. After the buffer is emptied, BTrDB writes the deleted check points into the metadata (block attributes) of the block file corresponding to the current WAL:
When all check points in a WAL block file are marked as deleted, the file can be deleted from Ceph. When the size of the current WAL file exceeds 16 MB, a new block file is created. Ideally, the block file can be promptly deleted. However, if some timelines are abnormal, and as mentioned earlier, the buffer cannot be recycled until 8 hours later, then the WAL files responsible for recording these timelines can only be recycled 8 hours later.
These retained WAL files are only 16 MB in size, and the number of these files is linearly related to the number of abnormal IoT devices. Therefore, more IoT device operation statistical data is required to measure the impact.
After being persisted, the BTrDB tree structure generates two types of data files. One is called the superblock, which records the basic information of the current tree, such as the latest version, the update time, and the root node location. The other is called the segment, which uniformly contains the data of leaf nodes and intermediate nodes of the tree.
A superblock has a version. Each version of superblock only occupies 16 Bytes. The format is as follows:
{Root: root node location, 8 Bytes, timestamp: modification time, 8 Bytes}
The addressing method of the superblock in Ceph is as follows:
Block storage id = uuid.toString() + (version >> 20)
Offset in block storage = (version & 0 xFFFFF) * 16
When the BTrDB tree structure is persisted, leaf nodes and intermediate nodes are serialized into the segment file together. The addressing method for each node is as follows:
Block storage id = uuid.toString() + (address >> 24)
Node offset in block storage = (address & 0xFFFFFF)
The WAL file, the superblock block file and the segment block file are all 16 MB in size. In addition, in BTrDB, the compaction operation is not performed, and the data of the expired version is not cleaned. Only the WAL processing described above is performed, so the write amplification is obvious.
Here, I list and briefly introduce the new query semantics provided by BTrDB. These query semantics is closely related to the BTrDB data structure, either to take advantage of some tree structure features or to avoid some tree structure defects:
The Resolution parameter in the interface above has a significant impact on the performance of the interface As mentioned above, the time resolution of the root node is 2.2 years. From the root node to the underlying node of the tree, the time resolution of data in the node is getting higher and higher. When querying, low-resolution data has a high degree of aggregation, and few data blocks need to be scanned. High-resolution data has a low degree of aggregation, but many data blocks need to be scanned.
In BTrDB, the data structure is built for a single timeline, and a tree is built to store time series data based on the stability of IoT device data. The tree structure solves some problems that traditional TSDBs face with aspects such as disordered insertion, deletion, and pre-precision reduction.
2,599 posts | 758 followers
FollowAlibaba Clouder - July 31, 2019
Aliware - March 19, 2021
Alibaba Cloud Community - November 25, 2021
Alibaba Cloud Serverless - February 17, 2023
Alibaba Cloud Community - December 21, 2021
ApsaraDB - June 7, 2022
2,599 posts | 758 followers
FollowProvides secure and reliable communication between devices and the IoT Platform which allows you to manage a large number of devices on a single IoT Platform.
Learn MoreAlibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.
Learn MoreAlibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.
Learn MoreUnified billing for Internet data transfers and cross-region data transfers
Learn MoreMore Posts by Alibaba Clouder