By Jiao Xian
Time has always been an intriguing concept throughout human history. We often see interesting applications in science and technology that involve the use of time. In the data age, time has been combined with databases, giving rise to the popular time series database.
A time series database is essentially a vertical database with a timestamp property. Since 2014, DB-Engines, a database popularity ranking website, has classified and counted time series databases as a separate category, and in recent years time series databases have had the highest growth rate of all database categories (see figure below).
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).
The above is the definition of the time series database on Wikipedia. It can be divided into three aspects: time series features, data features, and database features.
Timestamps take two forms, the UNIX system timestamp and the calendar timestamp, and both support automatic adaptation to time zones.
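To make the two timestamp forms concrete, here is a minimal Python sketch that converts between a UNIX timestamp and a calendar timestamp at a given UTC offset. The function names and the fixed-offset handling are assumptions made for illustration; a real time series database would typically rely on a full time zone database.

```python
from datetime import datetime, timedelta, timezone


def unix_to_calendar(ts_seconds: int, utc_offset_hours: int) -> str:
    """Render a UNIX timestamp as an ISO 8601 calendar timestamp at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ts_seconds, tz=tz).isoformat()


def calendar_to_unix(iso_text: str) -> int:
    """Parse an ISO 8601 calendar timestamp that carries a UTC offset into UNIX seconds."""
    return int(datetime.fromisoformat(iso_text).timestamp())


# The same instant, shown as calendar time in two zones.
print(unix_to_calendar(1546300800, 0))
print(unix_to_calendar(1546300800, 8))
```

Both calendar strings describe the same instant, so converting either one back with `calendar_to_unix` yields the same UNIX timestamp.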
Database features (CRUD)
Although general-purpose relational databases can store time series data, they cannot process it very efficiently because they lack time-specific optimizations, such as storing and retrieving data by time intervals.
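As a rough illustration of retrieval by time interval, the following Python sketch keeps one series sorted by timestamp and answers range queries with binary search. The `SeriesStore` class is hypothetical and exists only for this example; it does not describe the storage layout of any particular database.

```python
from bisect import bisect_left, bisect_right


class SeriesStore:
    """Toy in-memory store for a single time series, kept sorted by timestamp."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def append(self, ts, value):
        # Insert at the position that keeps timestamps sorted,
        # so late-arriving points are still placed correctly.
        idx = bisect_right(self._timestamps, ts)
        self._timestamps.insert(idx, ts)
        self._values.insert(idx, value)

    def range(self, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        lo = bisect_left(self._timestamps, start)
        hi = bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


store = SeriesStore()
for ts, value in [(10, 0.5), (30, 0.7), (20, 0.6), (40, 0.8)]:
    store.append(ts, value)
print(store.range(15, 40))
```

Because the data is ordered by time, a range query touches only the points inside the interval instead of scanning the whole series.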
The first-generation time series data is typically derived from the monitoring field. Simple storage tools based on flat files are the preferred storage for this type of data.
Systems like RRDTool and Whisper usually process simple data models and have limited standalone capacity. These systems are usually embedded in monitoring and alerting scenarios.
With the development of big data and Hadoop, time series data volumes began to grow rapidly, and system services placed more requirements on time series data processing, for example, higher scalability.
Dedicated time series databases based on general-purpose storage began to appear. Time series databases can efficiently store and process time series data by time intervals. These databases include OpenTSDB and KairosDB.
These time series databases inherit the advantages of the general-purpose storage they are built on and use the characteristics of time series data to avoid its disadvantages. In addition, these databases introduce many innovations targeting time series in data models and aggregate analysis.
For example, OpenTSDB inherits the wide table model of HBase, uses an offset-based storage model designed for time series, and salts row keys to alleviate the hot spot problem.
However, it also has many shortcomings, such as the inefficient global UID mechanism, uncontrollable loading of aggregated data, and the inability to process high-cardinality tag queries.
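The salting and offset ideas mentioned above can be sketched roughly as follows. This is a simplified illustration, not OpenTSDB's actual byte layout: the string-based row key, the bucket count, and the function name are all assumptions, and OpenTSDB itself encodes metrics and tags as compact UIDs rather than strings.

```python
import zlib

SALT_BUCKETS = 4         # hypothetical number of salt buckets
ROW_SPAN_SECONDS = 3600  # one wide row per series per hour


def make_cell_address(metric: str, tags: dict, ts: int):
    """Return (row_key, column_offset) for one data point."""
    tag_text = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # The salt is derived from the series identity, so different series
    # spread across buckets instead of piling onto one storage region.
    salt = zlib.crc32(f"{metric}|{tag_text}".encode()) % SALT_BUCKETS
    # All points in the same hour share a row; the column stores only
    # the small offset from the start of that hour.
    base_ts = ts - ts % ROW_SPAN_SECONDS
    offset = ts - base_ts
    row_key = f"{salt}|{metric}|{base_ts}|{tag_text}"
    return row_key, offset


print(make_cell_address("cpu.load", {"host": "web01"}, 1546300890))
```

Two points from the same series within one hour land in the same row at different column offsets, which is what makes the table "wide".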
With the development of Docker, Kubernetes, microservices, and other technologies, expectations for the growth of IoT are getting stronger and stronger.
As data continuously grows over time, time series data is one of the fastest growing data types.
High-performance, low-cost vertical time series databases were developed. Data storage engines with time series features (InfluxDB is a typical example) are emerging and growing more important in the market.
These time series databases usually have more advanced data processing capabilities, more efficient compression algorithms, and storage engines that are better suited to the characteristics of time series data.
For example, InfluxDB features time-based TSMT storage, Gorilla compression, and window functions such as p99, rate, and automatic rollup.
At the same time, because indexing is separated out in the architecture, these databases still face many challenges in scenarios such as timeline expansion and out-of-order data.
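To give a feel for window functions such as p99 and rollup, here is a small Python sketch that groups points into fixed time windows and computes a few aggregates per window. It is a generic illustration and not InfluxDB's implementation; the function names and the nearest-rank percentile method are choices made for this example.

```python
from collections import defaultdict


def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def rollup(points, window_seconds: int):
    """Group (timestamp, value) points into fixed windows and aggregate each one."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_seconds].append(value)
    return {
        start: {
            "count": len(vals),
            "mean": sum(vals) / len(vals),
            "p99": percentile(vals, 99),
        }
        for start, vals in sorted(buckets.items())
    }


points = [(0, 1.0), (10, 3.0), (59, 2.0), (60, 10.0)]
print(rollup(points, 60))
```

Rolling raw points up into per-window aggregates trades resolution for much smaller storage and faster queries over long time ranges.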
Currently, DB-Engines collects and ranks time series databases separately. The following figure shows the rankings of time series databases by popularity in 2018 and the changing trends over the past five years.
Azure Time Series Insights
Commercial and industrial databases
Gorilla's compression algorithm can compress each 16-byte data point to an average of 1.37 bytes, which is a 12x reduction in size. Gorilla also has in-memory data structures built around the compression algorithm. Gorilla allows quick and efficient scanning of all data while providing the ability to search for data in a single time series by periods of time.
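The compression relies on two transforms described in the Gorilla paper: delta-of-delta encoding for timestamps and XOR against the previous value for floating-point numbers. The Python sketch below shows only these transforms under simplifying assumptions; it omits the variable-length bit packing that produces the actual size reduction, and the function names are illustrative.

```python
import struct


def delta_of_delta(timestamps):
    """Return the differences between consecutive deltas of the timestamps."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [d2 - d1 for d1, d2 in zip(deltas, deltas[1:])]


def xor_with_previous(values):
    """XOR the 64-bit pattern of each float with that of the previous float."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


# Points arriving roughly every 60 seconds give delta-of-deltas near zero.
print(delta_of_delta([1000, 1060, 1120, 1181, 1240]))
# An unchanged value XORs to zero, which can be stored in very few bits.
print(xor_with_previous([12.0, 12.0, 12.5]))
```

Because monitoring data tends to arrive at regular intervals and change slowly, most of these transformed values are zero or small, which is why so few bits per point are needed. As a check on the figures above, 16 / 1.37 is roughly 11.7, which rounds to the stated 12x.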
By writing time series data to hosts in different regions, Gorilla tolerates single-node faults, network cuts, and even faults in an entire data center.
Timescale received \$12.4M in Series A financing from Benchmark Capital, and InfluxDB received \$35M in Series C financing from Sapphire Ventures.
Time series databases have developed rapidly over the past two years. Major global cloud vendors have already begun to focus on different aspects of the time series ecosystem, forming unique solutions and starting to gain a first-mover advantage.
Excellent time series databases like Facebook Gorilla do more than meet their creators' own business needs. On the academic side, many advanced technologies have emerged in the time series database field, pushing time series data technologies to a higher level.
The Alibaba TSDB team has gradually applied its time series database in internal services such as DBPaaS and Sunfire since implementing its first TSDB version in 2016. After a public beta test in mid-2017, Alibaba TSDB was commercialized at the end of March 2018.
During the entire process, Alibaba TSDB has continuously absorbed various strengths of other time series databases, opening the door to self-developed time series databases in China.
This series of articles aims to describe the technical progress of current time series databases.