SQL and TimescaleDB

By Jiao Xian

1. Overview

TimescaleDB is a time series database developed by Timescale Inc. Founded in 2015, it claims to be fully SQL-compatible and is essentially an Extension based on PostgreSQL (hereinafter referred to as PG). Its main selling points are as follows:

Full SQL compatibility
High reliability backed by PostgreSQL
High write performance of time series data

The following describes the product TimescaleDB in detail. Unless otherwise stated, the TimescaleDB mentioned here refers to the standalone TimescleDB v1.1 that is open-source on Github.

2. Data Model

TimescaleDB is based on PostgreSQL, so its data model is quite different from that of NoSQL time series databases (such as our TSDB, and InfluxDB).

In a NoSQL time series database, the data model is usually as follows, that is, a piece of data includes both the timestamp and the collected data, as well as the device metadata (usually represented by Tagset). The data model is as follows:

However, in TimescaleDB, the data model must be presented in the form of a two-dimensional table. This requires users to design and define a two-dimensional table by themselves in conjunction with their own business scenarios that use time series data.

In the official TimescaleDB documents, two paradigms are provided for how to design time series data tables:

Narrow table
Wide table

A narrow table records metrics separately, and each row of records contains only one metricValue - timestamp. For example:

A wide table records multiple metrics of the same device in the same row with the timestamp as the axis. As for some properties (metadata) of the device, they are only used as auxiliary data of the record, and can even be directly recorded in other tables (when these data are needed later, you can query through the JOIN statement).

Basically, it can be considered that a narrow table corresponds to a single-valued model, while a wide table corresponds to a multi-valued model.

The relational table model of the traditional database is used, so the metric value of TimescaleDB must be strongly-typed, and can be the numeric type, the string type, and other types in PostgreSQL.

3. TimescaleDB Features

Based on PostgreSQL, TimescaleDB has made a series of extensions, mainly covering the following aspects:

Transparent and automatic partitioning of time series data tables
Several special SQL interfaces for time series data application scenarios
Extension of PostgreSQL Planner for writing and querying time series data
Customized parallel query for time series data tables

The above-mentioned points 3 and 4 are micro-innovations for time series data scenarios based on the existing mechanisms of PostgreSQL. Therefore, points 1 and 2 will be mainly explained in the following.

Transparent and Automatic Partitioning

In time series data application scenario, the number of records is often huge, which can easily reach several billion. For PG, a large number of B+tree indexes are used, therefore, when the data volume reaches a certain level, the write performance will obviously decrease (this is usually because the indexes themselves become very large and complex). This decline in performance is intolerable for application scenarios of time series data, and this is the problem that needs to be solved for the core automatic partitioning feature of TimescaleDB. This feature is intended to achieve the following goals:

With the continuous increase of data writing, the data chunks in the time series data table are stored to ensure that the index of each chunk is maintained at a small scale, thus maintaining the writing performance.
In query scenarios based on time series data, the timestamp of time series data is used as the partition key during automatic partitioning to ensure that the required data chunk can be quickly located during query, ensuring query performance.
The partitioning process is transparent to users to achieve auto-scalability.

The implementation for automatic partitioning of TimescaleDB is mainly based on the table inheritance mechanism of PG. The following figure shows an overview of the automatic partitioning mechanism of TimescaleDB:

Under this mechanism, after a user creates a common time series table and registers it as a hypertable through the interface of TimescaleDB, the subsequent data writing and query operations are actually taken over by TimescaleDB. In the figure above, the original table created by the user is generally called the "main table", while the hidden child table created by TimescaleDB is generally called the "chunk".

It should be noted that the chunk is automatically created along with data writing. Each time a new chunk is created, the timestamp range expected to be covered by this chunk is computed (the default is one week). In addition, considering that the writing speed and density of time series data are different under different application scenarios, the timestamp range of the new chunk is computed using an adaptive algorithm when a new chunk is created, to gradually compute the most suitable timestamp range in a specific application scenario. And PG 10.0

The detailed implementation of the adaptive algorithm lies in the ts_calculate_chunk_interval() of chunk_adaptive.c of TimescaleDB. The basic idea is to reasonably compute the timestamp range of the next chunk based on the timestamp filling factor and the file size filling factor of the historical chunk .

According to the official test results, with the transparent and automatic partitioning feature, the writing performance of TimescaleDB, compared with the traditional single-table writing of PG under the same data magnitude, can also maintain a relatively stable state even as the magnitude increases.

Note: The above benchmark test result is taken from the official Timescale website.

Time Series Scenarios Oriented Customization

The external interface of TimescaleDB is SQL, which inherits 100% of all the SQL features supported by PG. In addition, for time series database usage scenarios, it also customizes some interfaces for users to use in applications, and these interfaces are all presented through SQL functions (the standard name is User-Defined Functions). The following are some examples of such interfaces:

time_bucket() function

This function is used when performing a downsampling query. By specifying a time interval through this function, time series data are downsampled at the specified interval and the required aggregate function is used, to implement the downsampling query. An example statement is as follows:

SELECT time_bucket('5 minutes', time)
  AS five_min, avg(cpu)
  FROM metrics
  GROUP BY five_min
  ORDER BY five_min DESC LIMIT 10;

The data points are downsampled in units of 5 minutes to find the mean value.

New aggregate functions

To analyze and query time series data in a diversified way, TimescaleDB provides the following new aggregate functions.

first() to find the first value in the aggregated set of data.
last() to find the last value in the aggregated set of data.
histogram() to find the histogram of the value distribution in the aggregated set of data.

Note: The newly added aggregate functions can also be used in non-time series scenarios.

drop_chunks()

To delete the data chunks before or after the specified time point. For example, delete all chunks more than three months old. It is similar to the Retention Policies feature of InfluxDB, but currently TimescaleDB has not implemented automatic chunk deletion. If the full Retention Policies feature is required, system-level timing tasks (such as crontab) and drop_chunks() statements are required to implement it.

The example statement for drop_chunks() is as follows. It means to delete all data chunks in the conditions table that are between three and four months old:

SELECT drop_chunks(older_than => interval '3 months', newer_than => interval '4 months', table_name => 'conditions');

In addition, some interfaces customized by TimescaleDB are basically interfaces to facilitate the database administrator to manage metadata, and I will not go into details here. For definitions and examples including the above interfaces, see the official API documentation.

4. Storage Mechanism of TimescaleDB

TimescaleDB has not made any changes to the PG storage engine, so index data and table data is stored in the same way as in PG. And, the default B+tree index is used when TimescaleDB indexes a chunk. Therefore, the data storage mechanism of each chunk can be shown in the figure below:

It is not necessary to explain this storage mechanism, because TimescaleDB has not changed it. However, considering the time series database usage scenarios, it can be found that it is more appropriate for the chunk of TimescaleDB to adopt this mechanism:

PG storage is characterized by only adding but not modifying, that is, whether the data is inserted or modified, it is embodied in Heap Tuple as an Append operation of Tuple. Therefore, when this storage engine is used for common data tables in OLTP scenarios, the problem of table expansion exists. However, in application scenarios of time series data, the time series data is not updated or deleted under normal conditions, so the problem with table expansion can be avoided (the time series data itself is written in a large amount, so it can also be considered that the single table actually still expands in the case where massive data is written, but this is not the problem discussed here).
In the native PG, the AUTOVACUUM mechanism exists in PG to solve the table expansion problem. With the mechanism, the dead tuples generated by the update or delete operation in the table are automatically cleaned up. However, this introduces a new problem, that is, when AUTOVACUUM executes, it adds a shared lock to the table, thus affecting the write performance. However, in application scenarios of time series data, the update or delete operation does not exist, so dead tuple does not exist either. Such Chunk tables will not become objects of AUTOVACUUM, so INSERT performance will not be affected from this aspect.

As for the problem that the size of tables and indexes increases after massive data is inserted, this is avoided by the above-mentioned automatic partitioning feature.

In addition, TimescaleDB is completely based on the storage engine of PG, so no changes have been made to WAL. Therefore, the high availability cluster solution of TimescaleDB can also be built based on the Stream Replication Technology of PG. TimescaleDB also has officially introduced some open-source components based HA solutions.

5. Summary

To sum up, TimescaleDB is built entirely on PostgreSQL, so it has several inherent advantages:

It inherits the PostgreSQL ecosystem 100%. And with full support for SQL, it is even more appealing to beginners who have never worked with time series data.
PostgreSQL is reliable in quality, so TimescaleDB has a brand advantage in quality and stability.
It has strong support for ACID.

But its shortcomings are also obvious.

It is only an Extension of PostgreSQL, so it cannot be optimized for time series database usage scenarios at the kernel or storage level.
From the current architecture, it is still a stand-alone library and cannot give full play to the advantages of distributed technology. Moreover, although data is partitioned automatically, I/O hotspots are easily formed because the timestamp determines the partitioning.
At the functional level, the features for time series database oriented scenarios are still limited. At present, it is more like a traditional OLTP database with partial time series features.

In any case, TimescaleDB is also an attempt launched from another perspective for the time series database. At present, the time series database is still in an emerging stage, and its future development direction is also worthy of our attention and reference.

Community

SQL and TimescaleDB

1. Overview

2. Data Model

3. TimescaleDB Features

Transparent and Automatic Partitioning

Time Series Scenarios Oriented Customization

4. Storage Mechanism of TimescaleDB

5. Summary

Read previous post:

Read next post:

Alibaba Clouder

You may also like

Comments

Alibaba Clouder

Related Products

Time Series Database (TSDB)

Big Data Consulting for Data Technology Solution

Big Data Consulting Services for Retail Solution

Data Transmission Service