By digoal
TimescaleDB is a time-series extension for PostgreSQL. As of version 1.5, it supports native data compression with an automatic compression policy. This functionality does not require any specific file system or external software, and it is simple to set up and configure.
For example, a sensor monitoring table has several fields, such as the time, the device ID, and the monitored metrics (CPU, disk I/O, energy consumption).
| time | device_id | CPU | disk_io | energy_consumption |
|---|---|---|---|---|
| 8/22/2019 0:00 | 1 | 88.2 | 20 | 0.8 |
| 8/22/2019 0:05 | 2 | 300.5 | 30 | 0.9 |
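Before configuring compression, the table needs to be a TimescaleDB hypertable. The following is a minimal sketch, assuming the column names from the example above; create_hypertable() is TimescaleDB's standard function for this, while the exact column types are only illustrative.

-- Illustrative schema for the example table (types assumed).
CREATE TABLE measurements (
  time                timestamptz NOT NULL,
  device_id           int         NOT NULL,
  cpu                 float8,
  disk_io             float8,
  energy_consumption  float8
);

-- Partition the table by the time column (chunk interval left at its default).
SELECT create_hypertable('measurements', 'time');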
You can set the table's compression segment field, sorting field, and the age threshold for automatic compression through an SQL interface, for example:
ALTER TABLE measurements SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'device_id'
);
ALTER TABLE measurements
SET (timescaledb.compress,
timescaledb.compress_orderby = 'device_id, time DESC');
SELECT add_compress_chunks_policy('measurements', INTERVAL '7 days');
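Besides the automatic policy above, chunks can also be compressed manually. The sketch below assumes TimescaleDB 1.5, where compress_chunk() and show_chunks() are available; it compresses every chunk whose data is older than 7 days, which has the same effect as the policy.

-- Manually compress all chunks older than 7 days (assumed TimescaleDB 1.5 functions).
SELECT compress_chunk(c)
FROM show_chunks('measurements', older_than => INTERVAL '7 days') AS c;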
Time-series workloads write recent data with high concurrency and high throughput, and recent data may also be updated and queried in detail. As time progresses, historical data is mainly needed for large-scale analysis, and its sheer volume makes storage space a problem that compression helps to solve.
Compression therefore needs to save storage space without affecting the efficiency of writing, updating, or detail queries on recent data. In addition, because historical data is usually queried at a large scale, compression should not hurt large-scale query efficiency either. In short, it must:
1) Keep writing, updating, and detail queries on recent data efficient
2) Be lossless
3) Allow efficient queries on certain fields of the compressed data
TimescaleDB's compression works by selecting a segment field, a sorting field, and a time threshold that determines how old data must be before it is compressed.
For example, compress data whose time field is more than 7 days old, segment the compressed data by device_id, and sort it within each segment in reverse chronological order.
The compressed result is shown below: arrays are used to store multiple values, so many records collapse into one row. Similar techniques have been covered in earlier articles, such as aggregation to improve the efficiency of range retrieval and the vops plug-in to improve analysis efficiency.
| time | device_id | CPU | disk_io | energy_consumption |
|---|---|---|---|---|
| [12:00:02, 12:00:01] | 1 | [88.2, 88.6] | [20, 25] | [0.8, 0.85] -- a chunk |
| [12:00:02, 12:00:01] | 2 | [300.5, 299.1] | [30, 40] | [0.9, 0.95] -- another chunk |
Multiple records are compressed into one record, usually with one chunk for every 1,000 records.
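The following plain-SQL sketch only illustrates the idea of packing many rows into one array-valued row per device; it is not TimescaleDB's actual internal format, and the measurements_packed table name is made up for the example.

-- Illustration only: pack each device's rows into arrays, one row per device.
CREATE TABLE measurements_packed AS
SELECT device_id,
       array_agg(time ORDER BY time DESC)               AS time,
       array_agg(cpu ORDER BY time DESC)                AS cpu,
       array_agg(disk_io ORDER BY time DESC)            AS disk_io,
       array_agg(energy_consumption ORDER BY time DESC) AS energy_consumption
FROM measurements
GROUP BY device_id;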
Because device_id is the segment field, queries that filter by device_id remain efficient on the compressed data.
TimescaleDB 1.5 does not support updating data that has already been compressed (this may be supported in the future).
Why compress only older data?
1) To keep writing, updating, and detail queries on recent data efficient: data can no longer be written or updated once it is compressed.
2) The arrival order of recent data may not be fully consistent with the time field. For example, a record whose timestamp is 20:00 may not be written until 21:00 (due to application delays or other problems). Therefore, only data that is older than a certain threshold, i.e., data that no longer changes, should be compressed.
For both reasons, you should compress older data rather than recent data.
How do you select the segment field?
Each compressed chunk must contain a reasonable number of records for the compression to pay off. For example, packing about 1,000 records into one chunk gives a high compression ratio (the packed values are then compressed, much like pglz or gzip applied to a column).
Therefore, the segment field cannot be the primary key: each primary key value identifies only one record, so there would be nothing to pack together.
Typically, you should select a field that appears in query predicates (where ...) and has many duplicate values, such as device_id.
How do you select the sorting field?
You typically select the second condition in the where clause, such as time. Suppose queries fetch the data of a specified device within a specified time period, for example, a certain 5-minute window. Then you can select the time field (ts) as the sorting field:
ALTER TABLE measurements
SET (timescaledb.compress,
timescaledb.compress_orderby = 'device_id, time DESC');
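For example, a query of the following shape (the device ID and time range values are made up) can use the segment field to locate the right compressed rows and the sort order to read the requested window:

-- Fetch one device's data for a specific 5-minute window.
SELECT time, cpu, disk_io, energy_consumption
FROM measurements
WHERE device_id = 1
  AND time >= '2019-08-22 12:00:00'
  AND time <  '2019-08-22 12:05:00'
ORDER BY time DESC;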
At the storage level, multiple records are converted into array-like values, which PostgreSQL's built-in column compression (TOAST with pglz) can then compress.
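As a rough illustration of this built-in compression (using a throwaway test table, not TimescaleDB internals): when a stored value such as a large array exceeds the TOAST threshold, PostgreSQL compresses it with pglz, which you can observe with pg_column_size().

-- 10,000 float8 values are 80,000 bytes uncompressed; the stored size is far smaller.
CREATE TEMP TABLE toast_demo (vals float8[]);
INSERT INTO toast_demo
SELECT array_agg(0.0::float8) FROM generate_series(1, 10000);
SELECT pg_column_size(vals) AS stored_bytes FROM toast_demo;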
One of the current limitations of TimescaleDB is that once chunks are converted into compressed columnar form, further modifications of the data (e.g., inserts, updates, deletes) or of the schema are not allowed without manually decompressing them first.
In other words, chunks are immutable in compressed form. Attempts to modify the chunks' data will either result in an error or fail silently (as preferred by users). Future releases plan to remove this limitation, so that compressed tables support inserts, updates, and deletes.