This topic describes the index mechanism of time series tables and provides best practices for querying data in time series tables.
Background information
For information about the model of time series data, see Design a time series table. Data is organized by time series. In time series tables, all columns marked with TAG are defined as index columns. Each value of these columns specifies the time series to which a record belongs to identify the data source.
The following figure provides an example in which a time series table is used. The data records shown in the following figure belong to the same time series. For more information about time series tables, see Design a time series table.
If a value in a tag column changes, the data record that corresponds to the value falls in another time series. In most cases, a time series can identify a data source from which time series data is generated.
LindormTSDB creates indexes by using each tag column. Each of the indexes is an inverted index. The name and each value of a tag column are used as an index key to index all time series that contain the same key-value pair of the index key. The index key is used to identify the data range that corresponds to a tag from the time series dimension in a short period of time. When data is continuously written into the time series table, inverted indexes in the table change. The following figure shows the changed inverted indexes in the table.
By default, LindormTSDB creates a Block Range Index (BRIN) for a timestamp column. This ensures that the data range that you want to query can be located in a large amount of time series data in a short period of time. LindormTSDB does not create indexes for field columns.
Usage notes
The following suggestions are provided based on the index mechanism of LindormTSDB for querying data from a time series table stored in LindormTSDB:
We recommend that you add equivalent filter clauses for tag columns and filter clauses for your timestamp column in your query statement. This ensures that your query statement does not contain only filter clauses for field columns.
If a query statement contains filter clauses for multiple tag columns and an inclusion relationship between time series ranges that are mapped by the tag columns exists, we recommend that you retain the filter for the tag column that can identify a record in an efficient manner. For example, in the preceding figure, the id column can identify a unique record, but the city column cannot identify a unique record.
If the time range specified in your query condition is large, the system scans a large amount of data. As a result, the query speed is low. In this case, we recommend that you reduce the time range in which you want to query data to improve query performance.
Common scenarios and operations for time series queries
If you want to reproduce the results of the following query scenarios, download the SQL script and use sample data recorded in the SQL script. For more information about the SQL script, see SQL script.
Query values of a field reported within a specified time range
Execute the following statement to query all sulfur dioxide (SO2) values that the devices in the Yuhang district reported from 2019-04-18 10:00:00 to 2019-04-18 10:30:00:
SELECT id, so2 FROM aqm WHERE district='yuhang' AND time >= '2019-04-18 10:00:00' AND time < '2019-04-18 10:30:00';The following figure shows the query result.
Query all values in a specific tag column in a time series table
When data that is generated by the devices of the same type is written to one time series table, you may want to query all values in a specific tag column in the time series table. For example, you can execute the following SQL statement to query all values in the id column of the aqm table for all devices that are used to monitor air quality:
SELECT DISTINCT(id) FROM aqm;The following figure shows the query result.
If values in a tag column of a table are queried, you do not need to worry about performance issues due to a full table scan. This is because LindormStore queries the values based on the indexes that are created for all tag columns in the table by default. Take note that an index-based query can be time consuming when the data of excessive devices is stored in your time series table.
Query data based on downsampling
Execute the following statement to query average PM2.5 and SO2 values based on PM2.5 and SO2 values reported by the devices in the Yuhang district from 2019-04-18 10:00:00 to 2019-04-18 10:30:00. The PM2.5 and SO2 values are split based on a 5-minute interval and the average PM2.5 and SO2 values at each interval are calculated.
SELECT id, time, avg(pm2_5) AS avg_pm2_5, avg(so2) AS avg_so2 FROM aqm WHERE district='yuhang' AND time >= '2019-04-18 10:00:00' AND time < '2019-04-18 10:30:00' SAMPLE BY 5m;The following figure shows the query result.
Downsampling is used to aggregate data in a time series based on the time dimension. For information about how to perform a downsampling query, see Downsampling queries.
Use aggregation functions to query data across devices
Execute the following statement to query the maximum average PM2.5 and SO2 values based on PM2.5 and SO2 values reported by the devices in the Yuhang district from 2019-04-18 10:00:00 to 2019-04-18 10:30:00. The maximum average PM2.5 and SO2 values are calculated based on PM2.5 values and SO2 values that are reported in each 5-minute time window.
SELECT max(avg_pm2_5) AS max_avg_pm25, max(avg_so2) AS max_avg_so2 FROM (SELECT district, id, time, avg(pm2_5) AS avg_pm2_5, avg(so2) AS avg_so2 FROM aqm WHERE district='yuhang' AND time >= '2019-04-18 10:00:00' AND time < '2019-04-18 10:30:00' SAMPLE BY 5m) GROUP BY district;The following figure shows the query result.