By Jiao Xian
TimescaleDB is a time series database developed by Timescale Inc. Founded in 2015, it claims to be fully SQL-compatible and is essentially an Extension based on PostgreSQL (hereinafter referred to as PG). Its main selling points are as follows:
The following describes the product TimescaleDB in detail. Unless otherwise stated, the TimescaleDB mentioned here refers to the standalone TimescleDB v1.1 that is open-source on Github.
TimescaleDB is based on PostgreSQL, so its data model is quite different from that of NoSQL time series databases (such as our TSDB, and InfluxDB).
In a NoSQL time series database, the data model is usually as follows, that is, a piece of data includes both the timestamp and the collected data, as well as the device metadata (usually represented by Tagset). The data model is as follows:
However, in TimescaleDB, the data model must be presented in the form of a two-dimensional table. This requires users to design and define a two-dimensional table by themselves in conjunction with their own business scenarios that use time series data.
In the official TimescaleDB documents, two paradigms are provided for how to design time series data tables:
A narrow table records metrics separately, and each row of records contains only one metricValue - timestamp. For example:
A wide table records multiple metrics of the same device in the same row with the timestamp as the axis. As for some properties (metadata) of the device, they are only used as auxiliary data of the record, and can even be directly recorded in other tables (when these data are needed later, you can query through the JOIN statement).
Basically, it can be considered that a narrow table corresponds to a single-valued model, while a wide table corresponds to a multi-valued model.
The relational table model of the traditional database is used, so the metric value of TimescaleDB must be strongly-typed, and can be the numeric type, the string type, and other types in PostgreSQL.
Based on PostgreSQL, TimescaleDB has made a series of extensions, mainly covering the following aspects:
The above-mentioned points 3 and 4 are micro-innovations for time series data scenarios based on the existing mechanisms of PostgreSQL. Therefore, points 1 and 2 will be mainly explained in the following.
In time series data application scenario, the number of records is often huge, which can easily reach several billion. For PG, a large number of B+tree indexes are used, therefore, when the data volume reaches a certain level, the write performance will obviously decrease (this is usually because the indexes themselves become very large and complex). This decline in performance is intolerable for application scenarios of time series data, and this is the problem that needs to be solved for the core automatic partitioning feature of TimescaleDB. This feature is intended to achieve the following goals:
The implementation for automatic partitioning of TimescaleDB is mainly based on the table inheritance mechanism of PG. The following figure shows an overview of the automatic partitioning mechanism of TimescaleDB:
Under this mechanism, after a user creates a common time series table and registers it as a hypertable through the interface of TimescaleDB, the subsequent data writing and query operations are actually taken over by TimescaleDB. In the figure above, the original table created by the user is generally called the "main table", while the hidden child table created by TimescaleDB is generally called the "chunk".
It should be noted that the chunk is automatically created along with data writing. Each time a new chunk is created, the timestamp range expected to be covered by this chunk is computed (the default is one week). In addition, considering that the writing speed and density of time series data are different under different application scenarios, the timestamp range of the new chunk is computed using an adaptive algorithm when a new chunk is created, to gradually compute the most suitable timestamp range in a specific application scenario. And PG 10.0
The detailed implementation of the adaptive algorithm lies in the
chunk_adaptive.c of TimescaleDB. The basic idea is to reasonably compute the timestamp range of the next chunk based on the timestamp filling factor and the file size filling factor of the historical chunk .
According to the official test results, with the transparent and automatic partitioning feature, the writing performance of TimescaleDB, compared with the traditional single-table writing of PG under the same data magnitude, can also maintain a relatively stable state even as the magnitude increases.
Note: The above benchmark test result is taken from the official Timescale website.
The external interface of TimescaleDB is SQL, which inherits 100% of all the SQL features supported by PG. In addition, for time series database usage scenarios, it also customizes some interfaces for users to use in applications, and these interfaces are all presented through SQL functions (the standard name is User-Defined Functions). The following are some examples of such interfaces:
This function is used when performing a downsampling query. By specifying a time interval through this function, time series data are downsampled at the specified interval and the required aggregate function is used, to implement the downsampling query. An example statement is as follows:
SELECT time_bucket('5 minutes', time) AS five_min, avg(cpu) FROM metrics GROUP BY five_min ORDER BY five_min DESC LIMIT 10;
The data points are downsampled in units of 5 minutes to find the mean value.
To analyze and query time series data in a diversified way, TimescaleDB provides the following new aggregate functions.
first()to find the first value in the aggregated set of data.
last()to find the last value in the aggregated set of data.
histogram()to find the histogram of the value distribution in the aggregated set of data.
Note: The newly added aggregate functions can also be used in non-time series scenarios.
To delete the data chunks before or after the specified time point. For example, delete all chunks more than three months old. It is similar to the Retention Policies feature of InfluxDB, but currently TimescaleDB has not implemented automatic chunk deletion. If the full Retention Policies feature is required, system-level timing tasks (such as crontab) and
drop_chunks() statements are required to implement it.
The example statement for
drop_chunks() is as follows. It means to delete all data chunks in the
conditions table that are between three and four months old:
SELECT drop_chunks(older_than => interval '3 months', newer_than => interval '4 months', table_name => 'conditions');
In addition, some interfaces customized by TimescaleDB are basically interfaces to facilitate the database administrator to manage metadata, and I will not go into details here. For definitions and examples including the above interfaces, see the official API documentation.
TimescaleDB has not made any changes to the PG storage engine, so index data and table data is stored in the same way as in PG. And, the default B+tree index is used when TimescaleDB indexes a chunk. Therefore, the data storage mechanism of each chunk can be shown in the figure below:
It is not necessary to explain this storage mechanism, because TimescaleDB has not changed it. However, considering the time series database usage scenarios, it can be found that it is more appropriate for the chunk of TimescaleDB to adopt this mechanism:
As for the problem that the size of tables and indexes increases after massive data is inserted, this is avoided by the above-mentioned automatic partitioning feature.
In addition, TimescaleDB is completely based on the storage engine of PG, so no changes have been made to WAL. Therefore, the high availability cluster solution of TimescaleDB can also be built based on the Stream Replication Technology of PG. TimescaleDB also has officially introduced some open-source components based HA solutions.
To sum up, TimescaleDB is built entirely on PostgreSQL, so it has several inherent advantages:
But its shortcomings are also obvious.
In any case, TimescaleDB is also an attempt launched from another perspective for the time series database. At present, the time series database is still in an emerging stage, and its future development direction is also worthy of our attention and reference.
digoal - May 16, 2019
digoal - May 28, 2021
digoal - May 17, 2021
digoal - May 16, 2019
Alibaba Cloud Storage - March 1, 2021
ApsaraDB - July 28, 2021
TSDB is a stable, reliable, and cost-effective online high-performance time series database service.Learn More
Alibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.Learn More
Alibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.Learn More
Supports data migration and data synchronization between data engines, such as relational database, NoSQL and OLAPLearn More
More Posts by Alibaba Clouder