
What Are Time Series Databases?

In this article, we will introduce the history and development of time series databases (TSDBs) and discuss the applications of TSDBs.

By Jiao Xian

Time has always been an intriguing concept throughout human history. We often see interesting applications in science and technology that involve the use of time. In the data age, time has been combined with databases, giving rise to the popular time series database.

A time series database is essentially a vertical (domain-specific) database built around a timestamp property. Since 2014, DB-Engines, a database popularity ranking website, has tracked time series databases as a separate category, and in recent years this category has grown faster than any other database category (see figure below).

[Figure 1: DB-Engines popularity trend by database category]

Time Series Databases

A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).

The above is the definition of the time series database on Wikipedia. It can be divided into three aspects: time series features, data features, and database features.

Time series features:

  • Timestamp: In general business scenarios, timestamps are precise to the second or millisecond. In some high-frequency collection scenarios, such as remote sensing, timestamps can reach nanosecond precision. Timestamps take two forms, UNIX system timestamps and calendar timestamps, and both support automatic time zone adaptation.
  • Sampling frequency: Sampling generally takes one of two forms. One is periodic sampling, such as periodically summarized server performance metrics. The other is discrete (event-driven) sampling, such as website visits.
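As a minimal sketch of the two timestamp forms, the following Python snippet converts a UNIX timestamp with nanosecond precision into calendar time at two UTC offsets. The helper name `to_calendar` and the sample value are assumptions made for illustration; they do not come from any particular TSDB.

```python
from datetime import datetime, timedelta, timezone

def to_calendar(epoch_ns: int, utc_offset_hours: int = 0) -> datetime:
    """Convert a UNIX timestamp in nanoseconds to a timezone-aware datetime.

    Python datetimes hold microseconds, so the last three digits are dropped.
    """
    seconds, nanos = divmod(epoch_ns, 1_000_000_000)
    utc = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000
    )
    return utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))

# Hypothetical sensor reading taken at 2019-01-01T00:00:00.123456789Z.
sample_ns = 1_546_300_800_123_456_789
print(to_calendar(sample_ns).isoformat())     # calendar time in UTC
print(to_calendar(sample_ns, 8).isoformat())  # same instant at UTC+8
```

Storing the integer and converting only at display time keeps the stored data independent of any time zone.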

Data features:

  • Data is appended in sequence.
  • Data can be correlated across multiple dimensions.
  • Hot data is typically accessed at high frequencies.
  • Cold data needs to be reduced in dimension and archived.
  • Data mainly covers values, states and events.

Database features (CRUD):

  • Writes arrive at a steady rate and far outnumber reads.
  • Data is accessed by time window.
  • Updates are rare, although some data may be overwritten within a time window.
  • Data is deleted in batches.
  • A time series database needs the high availability, reliability, and scalability of general-purpose databases.
  • Generally, no transaction capability is required.
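The access pattern described above (append-only writes, reads by time window, batch deletes) can be sketched with a toy in-memory series. The `Series` class and its method names are hypothetical and only illustrate the pattern; a real TSDB adds persistence, compression, and indexing.

```python
from bisect import bisect_left

class Series:
    """Toy append-only series; timestamps must arrive in non-decreasing order."""

    def __init__(self):
        self.timestamps = []
        self.values = []

    def append(self, ts, value):
        if self.timestamps and ts < self.timestamps[-1]:
            raise ValueError("out-of-order write")
        self.timestamps.append(ts)
        self.values.append(value)

    def window(self, start, end):
        """Return points with start <= ts < end, located by binary search."""
        lo = bisect_left(self.timestamps, start)
        hi = bisect_left(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))

    def delete_before(self, cutoff):
        """Batch-delete every point older than cutoff; return the count removed."""
        n = bisect_left(self.timestamps, cutoff)
        del self.timestamps[:n]
        del self.values[:n]
        return n

series = Series()
for ts in range(0, 100, 10):
    series.append(ts, ts * 0.5)

print(series.window(20, 50))     # points in the window [20, 50)
print(series.delete_before(30))  # number of expired points removed
```

Because data arrives in time order, both the window read and the batch delete reduce to a binary search plus a contiguous slice, which is why these operations are cheap compared with row-by-row updates.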

A Brief Development History of Time Series Databases

[Figure 2: Development history of time series databases]

The First-Generation Time Series Data Storage System

Although general-purpose relational databases can store time series data, they cannot process it efficiently because they lack time-oriented optimizations, such as storing and retrieving data by time interval.

First-generation time series data typically came from the monitoring field. Simple storage tools based on flat files were the preferred storage for this type of data.

Systems like RRDTool and Whisper usually handle simple data models and are limited to the capacity of a single machine. These systems are usually embedded in monitoring and alerting systems.

Time Series Database Based on General-Purpose Storage

With the development of big data and Hadoop, the volume of time series data began to grow rapidly, and system services placed more demands on time series processing, such as higher scalability.

Dedicated time series databases built on general-purpose storage began to appear. These databases, which include OpenTSDB and KairosDB, can efficiently store and process time series data by time interval.

These time series databases inherit the advantages of general-purpose storage and exploit the characteristics of time series data to work around its disadvantages. In addition, they introduce many time-series-specific innovations in data models and aggregate analysis.

For example, OpenTSDB inherits the wide-table model of HBase, uses an offset-based storage model designed for time series, and salts row keys to alleviate the hot spot problem.
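The idea of a salted row key with per-row time offsets can be sketched as follows. This is a simplified illustration, not OpenTSDB's exact byte layout: the UID widths, the CRC32 hash, the bucket count, and the function name `row_key_and_offset` are all assumptions.

```python
import struct
import zlib

SALT_BUCKETS = 20  # assumed number of salt buckets
ROW_SPAN = 3600    # assumed: each row covers one hour of data

def row_key_and_offset(metric_uid: bytes, tag_uids: bytes, ts: int):
    """Return (row_key, offset) for a data point at UNIX time ts (seconds)."""
    base = ts - ts % ROW_SPAN  # row start, aligned to the hour
    offset = ts - base         # kept in the column, not in the row key
    # The salt depends only on the series identity, so one series stays in one
    # bucket while different series spread across buckets.
    salt = zlib.crc32(metric_uid + tag_uids) % SALT_BUCKETS
    row_key = bytes([salt]) + metric_uid + struct.pack(">I", base) + tag_uids
    return row_key, offset

metric = b"\x00\x00\x01"            # hypothetical 3-byte metric UID
tags = b"\x00\x00\x01\x00\x00\x02"  # hypothetical tag key UID + tag value UID
key_a, off_a = row_key_and_offset(metric, tags, 1546304461)
key_b, off_b = row_key_and_offset(metric, tags, 1546304500)
print(key_a.hex(), off_a)
print(key_a == key_b, off_b)  # same hour, same row, different offset
```

All points of one series within the same hour share a row key and differ only in the offset, which is what makes the wide-table layout compact. The leading salt byte keeps consecutive writes for different series from landing on the same storage node.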

However, OpenTSDB also has many shortcomings, such as an inefficient global UID mechanism, uncontrollable loading of aggregated data, and the inability to process high-cardinality tag queries.

Birth of Vertical Time Series Databases

With the development of Docker, Kubernetes, microservices, and other technologies, expectations for the growth of IoT keep rising.

As data accumulates continuously over time, time series data has become one of the fastest-growing data types.

High-performance, low-cost vertical time series databases were developed in response. Data storage engines designed around time series features (InfluxDB is a typical example) are emerging and growing more important in the market.

These time series databases usually have more advanced data processing capabilities, more efficient compression algorithms, and storage engines better suited to the characteristics of time series data.

For example, InfluxDB features time-based TSMT (Time-Structured Merge Tree) storage, Gorilla-style compression, and time series functions such as p99, rate, and automatic rollup.
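To make functions such as p99, rate, and rollup concrete, here is a minimal sketch in plain Python. It is not InfluxDB's implementation or query syntax; the function names, the nearest-rank percentile definition, and the sample data are assumptions chosen for illustration.

```python
def p99(values):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]

def rate(points):
    """Per-second rate of change between the first and last (ts, value) points."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)

def rollup(points, window, agg):
    """Apply agg to the values that fall in each fixed window of `window` seconds."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % window, []).append(value)
    return {start: agg(vals) for start, vals in sorted(buckets.items())}

latencies_ms = list(range(1, 101))               # hypothetical latency samples
request_counter = [(0, 100), (10, 150)]          # hypothetical monotonic counter
cpu_points = [(0, 1.0), (30, 3.0), (60, 5.0)]    # hypothetical (ts, value) points

print(p99(latencies_ms))             # 99th percentile latency
print(rate(request_counter))         # requests per second
print(rollup(cpu_points, 60, max))   # per-minute maximum
```

An automatic rollup is essentially the `rollup` step run continuously by the database, so that queries over long ranges read the pre-aggregated windows instead of raw points.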

At the same time, because indexing is separated from data storage in their architecture, these databases still face many challenges in scenarios such as rapidly growing numbers of timelines and out-of-order writes.

The Development Status Quo of Time Series Databases

Currently, DB-Engines collects and ranks time series databases separately. The following figure shows the rankings of time series databases by popularity in 2018 and the changing trends over the past five years.

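[Figure 3: DB-Engines ranking of time series databases in 2018 and popularity trend over the past five years]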

Public cloud

  • AWS Timestream
    Amazon announced Timestream (preview) at AWS re:Invent in November 2018. Timestream targets scenarios such as IoT and application operations. It provides an adaptive query processing engine for fast analysis and automatically rolls up, retains, tiers, and compresses data. Timestream is a serverless service, which simplifies management, and users are billed separately for writes, data stored, and data scanned by queries.
  • Azure Time Series Insights
    In April 2017, Microsoft released Time Series Insights (preview), which provides a fully managed, end-to-end storage and query solution for highly contextualized, time-series-optimized IoT-scale data. Its visualization tools enable interactive ad hoc data analysis and asset-based data insights. In addition, the service supports warm data analysis and raw data analysis by data type. Users are billed separately for storage and queries.

Open-source projects

  • OpenTSDB
    OpenTSDB is a distributed, scalable time series database. It introduces concepts such as metrics and tags and defines a data model for time series scenarios. It uses HBase as the underlying storage and adopts a special rowkey design based on the characteristics of time series scenarios to improve the aggregation and querying of time series data.
  • Prometheus
    Prometheus stores all collected sample data in an in-memory database, organized by time series, and periodically saves the data to disk. Remote storage is required to ensure reliability and scalability.
  • InfluxDB
    InfluxDB is a standalone open-source time series database written in Go. InfluxDB is easy to use and has no special environment dependencies. It uses its TSMT structure to implement high-performance reads and writes. Distributed support is available in the commercial version.
  • TimescaleDB
    TimescaleDB is a time series SQL database that uses an explicit schema and manages table chunks by time. TimescaleDB is built on PostgreSQL.

Academic databases

  • BTrDB
    BTrDB is designed to store high-precision time series data and uses a "time-partitioning version-annotated copy-on-write tree" data structure that creates a tree for each timeline. BTrDB uses versioning to handle out-of-order data.
  • Confluo
    Confluo features a new data structure, the Atomic MultiLog, which uses the atomic instruction sets supported by modern CPU hardware. It simultaneously supports high-throughput concurrent writes of millions of data points, online queries at millisecond timescales, and CPU-efficient ad hoc queries.
  • ChronixDB
    ChronixDB provides time series storage based on Solr and implements its own lossless compression algorithms. ChronixDB can be integrated with Spark to enable rich time series analysis capabilities.

Commercial and industrial databases

  • PI
    PI is a large-scale real-time database developed by OSIsoft. It is widely used in power, chemical, and other industries. PI adopts the patented SDT (Swinging Door Trending) algorithm and a secondary filter technology to compress data efficiently and significantly save disk space.
  • kdb
    kdb is a time series database developed by Kx Systems that is mainly used to process trading-related data. kdb supports stream and in-memory computing, real-time analysis of billions of records, and quick access to TB-level historical data.
  • Gorilla
    Gorilla is an in-memory time series database from Facebook that adopts a new time series compression algorithm. This algorithm compresses each 16-byte data point to an average of 1.37 bytes, a 12x reduction in size. Gorilla also has in-memory data structures designed around the compression algorithm. They allow quick and efficient scans of all data while still supporting lookups within a single time series by time range.
    By writing time series data to hosts in different regions, Gorilla tolerates single-node faults, network cuts, and even the failure of an entire data center.
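The two ideas behind Gorilla's compression, delta-of-delta encoding for timestamps and XOR encoding for values, can be sketched as follows. This snippet only computes the intermediate quantities; the variable-length bit encoding that produces the final size is omitted, and the function names and sample data are assumptions for illustration.

```python
import struct

def delta_of_delta(timestamps):
    """Return the first timestamp, the first delta, and the delta-of-delta list."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def xor_with_previous(values):
    """XOR the IEEE 754 bit pattern of each float with that of its predecessor."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]

# Hypothetical samples collected roughly every 60 seconds.
ts = [1546300800, 1546300860, 1546300920, 1546300981, 1546301040]
first_ts, first_delta, dods = delta_of_delta(ts)
print(first_ts, first_delta, dods)  # small integers near zero

xors = xor_with_previous([12.0, 12.0, 24.0])
print([f"{x:016x}" for x in xors])  # mostly zero bits
```

Regularly spaced timestamps yield delta-of-delta values at or near zero, and slowly changing values yield XOR results with long runs of zero bits. Both can be stored in far fewer bits than the original 8-byte timestamp and 8-byte value, which is where the reduction from 16 bytes to an average of 1.37 bytes comes from.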

Investment market

In 2018, time series database startups closed two notable funding rounds:

  • Timescale received $12.4M in Series A financing from Benchmark Capital.
  • InfluxDB received $35M in Series C financing from Sapphire Ventures.

Analysis of Typical Time Series Databases

Time series databases have developed rapidly over the past two years. Major global cloud vendors have begun to focus on different aspects of the time series ecosystem, build their own solutions, and gain a first-mover advantage.

Excellent time series databases like Facebook Gorilla go beyond meeting the business needs of their creators. In academia, many advanced techniques have emerged in the time series database field, pushing time series data technologies to a higher level.

The Alibaba TSDB team has gradually applied its time series database to internal services such as DBPaaS and Sunfire since implementing its first TSDB version in 2016. After a public beta in mid-2017, Alibaba TSDB was commercialized at the end of March 2018.

During the entire process, Alibaba TSDB has continuously absorbed various strengths of other time series databases, opening the door to self-developed time series databases in China.

This series of articles aims to describe the technical progress of current time series databases.
