By Zhaofeng Zhou (Muluo)
Time series data is a set of data points indexed by time. Simply put, it describes the measurements taken of a subject at each time point within a time range.
Modeling of time series data includes three important parts: subject, time point, and measurements. Applying this model, you will find that you are in constant contact with this type of data in your daily work and life.
The world is made up of data, and every object in the world is producing data all the time. The exploitation and use of these types of data is silently changing people's lifestyles in this era. For example, the core of the personal health management feature of wearable devices is to continuously collect your personal health data. Such data includes heart rate and body temperature. After collecting such data, the device uses a model to calculate and evaluate your health status.
If you open up your imagination, you will find that all data in your daily life can be exploited and used. Objects that generate data include your mobile phone, car, air conditioner, refrigerator, and so on. The core idea of the Internet of Things (IoT), a technology currently attracting much attention, is to build a network that collects the data generated by all objects and exploits the value of that data. The data collected by this network is typical time series data.
Time series data describes how the state of an object changes along the historical time dimension. Analyzing time series data is the process of trying to understand and master the rules behind those changes. The volume of time series data has grown explosively with the development of IoT, big data, and artificial intelligence (AI) technologies. To better support the storage and analysis of such data, a variety of database products have come onto the market. These products were created to address the shortcomings of conventional relational databases in storing and analyzing time series data, and they are collectively classified as time series databases (TSDBs).
As can be seen from the ranking of the most popular database management systems on DB-Engines, the popularity of TSDBs has maintained a high growth rate over the last two years.
Later on, I will write a few articles to analyze:
The characteristics of time series data will be analyzed from three dimensions: data writing, query, and storage. By analyzing these characteristics, we will figure out the basic requirements for a time series database.
Based on the above analysis of the characteristics of time series data in terms of data writing, query, and storage, we can summarize the following basic requirements for TSDBs:
According to the analysis of the characteristics of time series data and the basic requirements for TSDBs, NoSQL databases that use LSM-tree-based storage engines (such as HBase, Cassandra, and Alibaba Cloud TableStore) have significant advantages over relational database management systems (RDBMSs) that use B+ tree-based storage engines. The basic theory of the LSM tree is not described here. In brief, the LSM tree is designed to optimize write performance. The write performance of LSM-tree-based TSDBs can be about ten times higher than that of B+ tree-based TSDBs. However, their read performance is far poorer than that of B+ tree-based TSDBs. Therefore, LSM-tree-based TSDBs are particularly suitable for scenarios with more writes than reads. Currently, among the well-known open source TSDBs, OpenTSDB uses HBase as the underlying storage engine, BlueFlood and KairosDB use Cassandra, InfluxDB uses its self-developed TSM storage engine, which is similar to LSM, and Prometheus directly uses a LevelDB-based storage engine. We can see that these mainstream TSDBs all use an LSM-tree-based distributed architecture for underlying storage. The difference is that some products directly use existing mature databases, while others use self-developed or LevelDB-based storage engines.
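Although the theory is out of scope here, a toy sketch may help show why this design favors writes over reads. The class below is a hypothetical illustration, not the storage engine of any product named above: writes land in an in-memory buffer and are flushed as sorted, immutable runs, while a read may have to check the buffer and then every run from newest to oldest. Compaction, write-ahead logging, and on-disk formats are deliberately left out.

```python
import bisect


class TinyLSM:
    """Toy LSM-style store: buffered writes, sorted immutable runs, newest-first reads."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.memtable = {}  # in-memory buffer that absorbs writes
        self.runs = []      # flushed runs, oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the buffer into one sorted run (a sequential write on real storage)."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # the newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


# Hypothetical keys of the form (metric, timestamp).
store = TinyLSM(memtable_limit=2)
store.put(("cpu", 0), 0.30)
store.put(("cpu", 10), 0.35)  # the buffer is full, so it is flushed as a sorted run
store.put(("cpu", 0), 0.31)   # a newer value for the same key stays in the buffer
```

A write touches only the in-memory buffer, whereas a read for a key that is not in the buffer has to search each run in turn, which is the read-side cost the paragraph above refers to.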
The LSM-tree-based distributed architecture can easily meet the writing requirements of time series data, but it is rather weak at data query. These databases can handle multi-dimensional aggregate queries over a small amount of data. However, for multi-dimensional aggregate queries over a large amount of data without indexes, their performance is rather poor. Therefore, in the open source world, some other products focus on solving such query and analysis problems. For example, Druid mainly focuses on OLAP requirements for time series data and allows fast query and analysis of massive amounts of data without pre-aggregation. It also supports drilling down on any dimension. The community also provides an Elasticsearch-based solution for analysis-oriented scenarios.
In short, the various TSDBs each come with their own benefits and drawbacks. There is no best solution that works for all scenarios. You can only choose the one that best fits your service needs.
A data model of time series data mainly consists of the following parts:
Currently, mainstream TSDBs use two modeling methods: modeling by data source and modeling by metrics. I will use two examples to illustrate the difference between these two methods.
The above is an example of modeling by data source. Measurements of all metrics of the same data source at a certain time point are stored in the same row. This model is used by Druid and InfluxDB.
The above is an example of modeling by metrics, where each row of data represents a measurement of a certain metric of a data source at a certain time point. This model is used by OpenTSDB and KairosDB.
There is no clear winner between these two models. If the underlying architecture adopts columnar storage and there is an index on each column, modeling by data source may be better. If the underlying architecture is similar to HBase or Cassandra, storing multiple metric values on the same row may affect the query or filter efficiency on one of the metrics. Therefore, in that case we typically choose to model by metrics.
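As a minimal sketch of the two layouts, the Python snippet below represents the same readings both ways and converts one into the other. The device name, tags, metric names, and values are made up for illustration and do not reflect the schema of any particular TSDB.

```python
# Modeling by data source: one row per (source, timestamp), one column per metric.
# All names and values here are hypothetical sample data.
rows_by_source = [
    {"device": "sensor-1", "region": "east", "ts": 0, "temperature": 21.5, "humidity": 40.0},
    {"device": "sensor-1", "region": "east", "ts": 10, "temperature": 21.7, "humidity": 41.0},
]

TAG_KEYS = ("device", "region")  # columns that identify the data source


def to_metric_rows(rows, tag_keys=TAG_KEYS, ts_key="ts"):
    """Convert source-modeled rows into metric-modeled rows (one row per metric value)."""
    out = []
    for row in rows:
        tags = {k: row[k] for k in tag_keys}
        for name, value in row.items():
            if name in tag_keys or name == ts_key:
                continue  # tags and the timestamp are not measurements
            out.append({"metric": name, **tags, "ts": row[ts_key], "value": value})
    return out


# Modeling by metrics: one row per (metric, source, timestamp).
rows_by_metric = to_metric_rows(rows_by_source)
```

Two source-modeled rows with two metrics each become four metric-modeled rows, which shows the trade-off: the second layout repeats the tags and timestamp for every metric but lets each metric be stored and scanned on its own.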
This section mainly describes the processing of time series data. In addition to the basic data writing and storage, query and analysis are the most important features of a TSDB. The processing of time series data mainly includes filter, aggregation, GroupBy, and downsampling. To better support GroupBy queries, some TSDBs will pre-aggregate the data. Downsampling is done through rollups. To support faster and more real-time rollups, TSDBs usually support auto-rollups.
The above is a simple filter process. Simply put, it queries for all data that meets the given conditions of different dimensions. In the scenario of time series data analysis, filter usually starts from a high dimension, and then performs more detailed query and processing of data based on more-refined dimensional conditions.
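The filter step can be sketched in a few lines of Python. This is only an illustration over an in-memory list. The tag names (`region`, `host`), the sample values, and the helper `filter_points` are assumptions, not the query API of any TSDB.

```python
# Hypothetical sample points: tags describe the data source, "value" is the measurement.
points = [
    {"metric": "cpu", "region": "east", "host": "h1", "ts": 0, "value": 0.30},
    {"metric": "cpu", "region": "east", "host": "h2", "ts": 0, "value": 0.55},
    {"metric": "cpu", "region": "west", "host": "h3", "ts": 0, "value": 0.80},
    {"metric": "cpu", "region": "east", "host": "h1", "ts": 10, "value": 0.35},
]


def filter_points(points, start, end, **tags):
    """Return points with start <= ts < end whose tags match every given condition."""
    return [
        p for p in points
        if start <= p["ts"] < end and all(p.get(k) == v for k, v in tags.items())
    ]


# Start from a coarse dimension (region), then refine with a finer one (host).
east = filter_points(points, 0, 20, region="east")
east_h1 = filter_points(points, 0, 20, region="east", host="h1")
```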
Aggregation is the most basic function of time series data query and analysis. Time series data records the original state change information. However, when doing time series data query and analysis, we usually do not need the original information. Instead, we need the statistics based on the original information. Aggregation involves some basic computations for statistics. The most common computations are SUM, AVG, MAX, and TopN. For example, when analyzing the server traffic, you would care about the average amount of traffic, the total amount of traffic, or the peak traffic.
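For the server traffic case, these computations might look like the following sketch. The traffic numbers are invented sample data, and only the Python standard library is used.

```python
import heapq

# Hypothetical traffic samples for one server, in bytes per interval.
traffic = [120, 340, 90, 560, 410, 75]

total = sum(traffic)               # SUM: total amount of traffic
average = total / len(traffic)     # AVG: average amount of traffic
peak = max(traffic)                # MAX: peak traffic
top3 = heapq.nlargest(3, traffic)  # TopN with N = 3, largest first
```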
GroupBy is the process of converting low-dimensional time series data into high-dimensional statistics. The above is a simple example of GroupBy. GroupBy is usually performed during query. After the original data is queried, we obtain the result through real-time computation. This process may be very slow, depending on the size of the originally queried data. Mainstream TSDBs optimize this process through pre-aggregation. After the data is written in real time, it will be pre-aggregated to generate the results after GroupBy according to the given rules. This allows us to directly query the results without re-computation.
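The difference between query-time GroupBy and pre-aggregation can be sketched as follows. The point layout, the grouping key (region and timestamp), and the class `PreAggregator` are hypothetical and chosen only to make the contrast visible.

```python
from collections import defaultdict

# Hypothetical points: (region, host, ts, value).
points = [
    ("east", "h1", 0, 10.0),
    ("east", "h2", 0, 20.0),
    ("west", "h3", 0, 5.0),
    ("east", "h1", 10, 30.0),
]


def group_by_sum(points):
    """Query-time GroupBy: scan the raw points and sum values per (region, ts)."""
    result = defaultdict(float)
    for region, _host, ts, value in points:
        result[(region, ts)] += value
    return dict(result)


class PreAggregator:
    """Write-time pre-aggregation: update the per-(region, ts) sum on every write."""

    def __init__(self):
        self.sums = defaultdict(float)

    def write(self, region, host, ts, value):
        self.sums[(region, ts)] += value

    def query(self, region, ts):
        # A lookup of the stored result, with no scan over raw points.
        return self.sums.get((region, ts), 0.0)


agg = PreAggregator()
for p in points:
    agg.write(*p)
```

The query-time version costs time proportional to the number of raw points scanned, while the pre-aggregated version pays a small cost on each write and answers with a single lookup, but only for the grouping rules fixed in advance.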
Downsampling is the process of converting high-resolution time series data into low-resolution time series data. This process is called rollup. It resembles GroupBy, but the two differ. GroupBy aggregates data across different dimensions at the same time granularity: the time granularity of the converted data remains the same, but the dimension becomes higher. Downsampling aggregates data of the same dimension across different time levels: the time granularity of the converted data becomes coarser, but the dimension remains the same.
The above is a simple example of downsampling, which aggregates 10-second resolution data into 30-second resolution data by taking the average.
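A rollup from 10-second to 30-second resolution by average can be sketched like this. The readings are invented, and the function `downsample_avg` is an illustrative helper rather than a TSDB API. Each bucket is keyed by the timestamp at which it starts.

```python
from collections import defaultdict


def downsample_avg(points, interval):
    """Average (ts, value) points into buckets of `interval` seconds, keyed by bucket start."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % interval].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical 10-second readings for a single series.
raw = [(0, 1.0), (10, 2.0), (20, 3.0), (30, 4.0), (40, 6.0), (50, 8.0)]
rolled = downsample_avg(raw, 30)  # two 30-second buckets, starting at 0 and 30
```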
Downsampling is divided into storage downsampling and query downsampling. Storage downsampling reduces the storage cost of data, especially historical data. Query downsampling mainly serves queries over a larger time range by reducing the number of returned data points. Auto-rollup is required for both storage downsampling and query downsampling. Auto-rollup performs the data rollup automatically, ahead of time, rather than waiting until a query arrives. Similar to pre-aggregation, this process can effectively improve query efficiency. It is also a feature that mainstream TSDBs either already provide or plan to provide. Currently, Druid, InfluxDB, and KairosDB support auto-rollup. OpenTSDB does not support auto-rollup, but it provides an API for importing the results of rollups performed externally.
This article analyzes the characteristics, models, and basic query and processing operations of time series data, and reveals the basic requirements for TSDBs. In the next article, we will analyze the implementation of several popular open source TSDBs. You may find that although there are many TSDBs, they have similar basic functions. Each TSDB has its own characteristics and implementation methods, but all of them are designed around trade-offs among the writing, storage, query, and analysis of time series data. There is no one-size-fits-all TSDB that can solve all potential problems. It's important to choose the most suitable TSDB from a business perspective.
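One way to picture auto-rollup is a structure that keeps the coarser series up to date as raw points arrive, so that a query over a long time range reads only the rolled-up buckets. The class below is a hypothetical sketch under that assumption and does not describe how Druid, InfluxDB, or KairosDB implement the feature.

```python
class AutoRollup:
    """Maintain a coarser-resolution average as raw points are written."""

    def __init__(self, interval):
        self.interval = interval
        self.buckets = {}  # bucket start -> [running sum, count]

    def write(self, ts, value):
        start = ts - ts % self.interval
        acc = self.buckets.setdefault(start, [0.0, 0])
        acc[0] += value
        acc[1] += 1

    def query(self, start, end):
        """Return (bucket_start, average) for buckets with start <= bucket_start < end."""
        return [
            (b, s / n)
            for b, (s, n) in sorted(self.buckets.items())
            if start <= b < end
        ]


# Hypothetical 30-second readings rolled up to 60-second resolution on write.
rollup = AutoRollup(interval=60)
for ts, value in [(0, 2.0), (30, 4.0), (60, 10.0), (90, 20.0), (120, 7.0)]:
    rollup.write(ts, value)
```

Keeping a running sum and count per bucket means the average never has to be recomputed from raw points, and the raw points for old buckets could be discarded once storage downsampling is desired.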