How to Optimize Data Queries for Time Series Database

PostgreSQL offers several ways to optimize queries over time series data. This article summarizes the main options and when to use each one.

Many scenarios require data merging and data cleaning. A window query can handle this kind of operation, but how can we make it faster and retrieve data in batches quickly?

Here is a quick summary of the common methods for optimizing time series queries:

  1. Use recursion when there are few unique values and their range is unknown.
  2. Use a subquery when there are few unique values and you know their specific range.
  3. Use a window query when there are many unique values; it is more appropriate than the two methods above.
  4. Use stream computing in the same many-values scenario; it performs even better than a window query.
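The first three methods can be sketched as three query shapes that return the same result. This is a minimal illustration, not the original article's benchmark code: it assumes the goal is to keep the latest row per key, uses hypothetical `readings` and `sensors` tables, and runs against Python's built-in SQLite (3.25 or later, for window functions) so that it is self-contained. The SQL text is written to be portable to PostgreSQL, but query plans and timings there will differ.

```python
import sqlite3

# Hypothetical schema: one reading per row, keyed by sensor id (sid).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sid INTEGER, ts INTEGER, val INTEGER)")
conn.execute("CREATE INDEX readings_sid_ts ON readings (sid, ts)")
conn.execute("CREATE TABLE sensors (sid INTEGER PRIMARY KEY)")  # the known ID list
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(sid, ts, sid * 100 + ts) for sid in (1, 2, 3) for ts in range(5)],
)
conn.executemany("INSERT INTO sensors VALUES (?)", [(1,), (2,), (3,)])

# Window query: rank rows per sid, newest first, and keep rank 1.
WINDOW_SQL = """
SELECT sid, ts, val FROM (
    SELECT sid, ts, val,
           row_number() OVER (PARTITION BY sid ORDER BY ts DESC) AS rn
    FROM readings
) AS ranked
WHERE rn = 1
ORDER BY sid
"""

# Subquery: the unique values are already known (sensors table),
# so each one needs only a single index lookup for its newest row.
SUBQUERY_SQL = """
SELECT r.sid, r.ts, r.val
FROM sensors AS s
JOIN readings AS r ON r.sid = s.sid
WHERE r.ts = (SELECT max(x.ts) FROM readings AS x WHERE x.sid = s.sid)
ORDER BY r.sid
"""

# Recursion: the unique values are unknown, so walk the index from one
# distinct sid to the next, then look up the newest row for each.
RECURSIVE_SQL = """
WITH RECURSIVE ids(sid) AS (
    SELECT min(sid) FROM readings
    UNION ALL
    SELECT (SELECT r.sid FROM readings AS r
            WHERE r.sid > ids.sid ORDER BY r.sid LIMIT 1)
    FROM ids
    WHERE ids.sid IS NOT NULL
)
SELECT r.sid, r.ts, r.val
FROM ids
JOIN readings AS r ON r.sid = ids.sid
WHERE r.ts = (SELECT max(x.ts) FROM readings AS x WHERE x.sid = ids.sid)
ORDER BY r.sid
"""

results = {
    "window": conn.execute(WINDOW_SQL).fetchall(),
    "subquery": conn.execute(SUBQUERY_SQL).fetchall(),
    "recursive": conn.execute(RECURSIVE_SQL).fetchall(),
}
```

The window query has to read every row, whereas the subquery and recursive forms do roughly one index probe per unique value, which is consistent with their advantage when unique values are few.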

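Efficiency Comparison Table

| Data volume | Number of unique values | Window query (ms) | Subquery (ms) | Recursive query (ms) |
|-------------|-------------------------|-------------------|---------------|----------------------|
| 5 million   | 1 million               | 6,446             | 2,892         | 6,706                |
| 5 million   | 1,000                   | 6,176             | 7             | 9                    |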

PostgreSQL is a strong choice among open-source databases because it provides several solutions to the same problem, leaving you free to choose the one that best fits your needs:

  1. Use recursion when the number of unique values is relatively small and their range is unknown.
  2. Use a subquery when the number of unique values is relatively small and their range is known. For example, if there are 1 million possible IDs in total but only 500,000 of them appear in this batch, performance is optimal when you already have the IDs of those 500,000 entries; otherwise you must scan all 1 million. Similarly, a system may have 100 million users in total while a given interval includes only tens of thousands of active users.
  3. Use a window query when the number of unique values is relatively large.
  4. Use stream computing, which is better than method 3, when the number of unique values is relatively large.
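Method 4 can be illustrated with a small sketch of the idea behind stream computing: instead of scanning stored rows at query time, the latest value per key is maintained as each record arrives. The record layout `(sid, ts, val)` and the function name are assumptions made for illustration, and a real deployment would use a stream processing engine rather than an in-memory dictionary.

```python
from typing import Dict, Iterable, Tuple

Reading = Tuple[int, int, float]  # hypothetical layout: (sid, ts, val)


def latest_per_key(stream: Iterable[Reading]) -> Dict[int, Reading]:
    """Keep only the newest reading per sid while consuming the stream once."""
    latest: Dict[int, Reading] = {}
    for sid, ts, val in stream:
        current = latest.get(sid)
        # Replace the stored reading only if this one is at least as new,
        # so late-arriving older records do not overwrite newer state.
        if current is None or ts >= current[1]:
            latest[sid] = (sid, ts, val)
    return latest


events = [(1, 10, 0.5), (2, 10, 1.5), (1, 12, 0.7), (2, 9, 9.9)]
state = latest_per_key(events)
```

Because the merged state is updated incrementally, reading the result costs one lookup per key regardless of how many raw rows were ingested.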

For a detailed comparison of recursion, subquery, and window queries, see Optimizing Time Series Querying on Alibaba Cloud RDS for PostgreSQL.

Related Blog Posts

Accelerating PostgreSQL Ad Hoc Query and Dictionary with RUM Index

This article discusses how to accelerate PostgreSQL ad hoc queries and dictionary lookups (random field combinations) with RUM indexes.

Solution 1 - Global Dictionary + Array Type + RUM Index

A global dictionary means that the value ranges of all fields together constitute one large value range, and each "field name + field value" pair is unique within that range.

After making a dictionary, you can choose either INT4 or INT8 as the element type of the dictionary.

There are many advantages to using arrays. For example, adding fields is easy because you don't need to change the table structure. Instead, you only need to fill the contents of the newly added fields into the array.

The original AND queries are replaced with ARRAY CONTAINS queries, and the original OR queries are replaced with ARRAY INTERSECT queries.

RUM indexes already support arrays, including CONTAINS and INTERSECT queries.
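The encoding idea can be sketched in a few lines. This is an in-memory illustration only, with a hypothetical `GlobalDictionary` class and made-up sample rows; it shows how AND and OR conditions map onto contains and intersect checks, not how RUM evaluates them. In PostgreSQL the two checks correspond to the array operators `@>` (contains) and `&&` (overlaps).

```python
from typing import Dict, List, Set, Tuple


class GlobalDictionary:
    """Assign one integer ID to every distinct (field name, field value) pair."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, str], int] = {}

    def encode(self, field: str, value: str) -> int:
        key = (field, value)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1  # next free ID
        return self._ids[key]

    def encode_row(self, row: Dict[str, str]) -> Set[int]:
        # A row becomes an "array" of IDs; a new field just adds one more ID.
        return {self.encode(field, value) for field, value in row.items()}


dictionary = GlobalDictionary()
rows = [
    {"city": "Hangzhou", "tier": "gold"},
    {"city": "Hangzhou", "tier": "silver"},
    {"city": "Beijing", "tier": "gold", "channel": "app"},
]
encoded = [dictionary.encode_row(row) for row in rows]


def match_all(conditions: Set[int]) -> List[int]:
    """AND query: the row's array must contain every condition ID."""
    return [i for i, arr in enumerate(encoded) if conditions <= arr]


def match_any(conditions: Set[int]) -> List[int]:
    """OR query: the row's array must intersect the condition IDs."""
    return [i for i, arr in enumerate(encoded) if conditions & arr]


and_hits = match_all({dictionary.encode("city", "Hangzhou"),
                      dictionary.encode("tier", "gold")})
or_hits = match_any({dictionary.encode("city", "Beijing"),
                     dictionary.encode("tier", "silver")})
```

Note that the third row carries an extra `channel` field without any change to the other rows, which is the flexibility the array layout is meant to provide.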

PostgreSQL 11 HASH Partition Table Combined with RUM

PostgreSQL 11 supports HASH partition tables, and performance might improve further with smart parallel aggregation (AGG). The test proceeds as follows:

  1. Create RUM indexes
  2. Create a partition table
  3. Create random functions
  4. Create a dictionary-generation function
  5. Write test data
  6. Ad hoc query performance is consistent with PostgreSQL 10
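The first two steps can be sketched as a small helper that generates the DDL. This is an assumption-laden illustration rather than the referenced article's script: the table layout (`id int8, arr int4[]`), the naming scheme, and the helper itself are hypothetical, and the `rum` extension with its `rum_anyarray_ops` operator class must be installed on the server for the statements to succeed.

```python
from typing import List


def hash_partition_ddl(table: str, partitions: int) -> List[str]:
    """Build DDL for a HASH-partitioned table with one RUM index per partition."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    statements = [
        "CREATE EXTENSION IF NOT EXISTS rum;",
        # Hypothetical layout: a row ID plus the dictionary-encoded array.
        f"CREATE TABLE {table} (id int8, arr int4[]) PARTITION BY HASH (id);",
    ]
    for i in range(partitions):
        part = f"{table}_p{i}"
        statements.append(
            f"CREATE TABLE {part} PARTITION OF {table} "
            f"FOR VALUES WITH (MODULUS {partitions}, REMAINDER {i});"
        )
        statements.append(
            f"CREATE INDEX {part}_rum ON {part} USING rum (arr rum_anyarray_ops);"
        )
    return statements


ddl = hash_partition_ddl("tbl_test", 4)
```

Each statement in `ddl` can then be executed with any PostgreSQL client. Creating the index per partition keeps each index small and lets partitions be scanned in parallel.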

Supporting 200 Billion Data Records in a Single RDS PostgreSQL Instance

Alibaba Cloud ApsaraDB for PostgreSQL is capable of handling a large number of records at a time. How many? Let's look at an example to find out. Assume we have 2 billion users, each with 1,000 tags, and we need to perform user tagging and pivoting based on any combination of tags (the business requirement is to compute a combination of up to 100 tags at a time).

With 2 billion users and up to 100 tags per computation, this is equivalent to handling 200 billion records at a time, each requiring a real-time response.

You may think that this would need at least a hundred machines. In fact, this amount of data requires only one ApsaraDB for RDS PostgreSQL instance. This article discusses the cutting-edge RDS PG technology that helps solve this sort of business requirement while consuming minimal resources.

Optimization Solution to Improve Response Speed

  1. Use bitmap segmentation.
  2. Use parallel computing to determine the user count that meets the tag conditions (using dblink asynchronous calls).
  3. Return streaming cursors when determining user IDs.
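The three steps above can be sketched in memory as follows. This is a simplified model under stated assumptions: Python integers stand in for bitmaps, a thread pool stands in for dblink asynchronous calls, a generator stands in for a streaming cursor, and `SEGMENT_SIZE`, the tag names, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List

SEGMENT_SIZE = 8  # hypothetical number of user IDs covered by one bitmap segment
Segment = Dict[str, int]  # tag -> bitmap of the users in this segment


def build_segments(tag_users: Dict[str, List[int]], n_segments: int) -> List[Segment]:
    """Step 1: split each tag's user bitmap into fixed-size segments."""
    segments: List[Segment] = [{} for _ in range(n_segments)]
    for tag, users in tag_users.items():
        for uid in users:
            seg_no, bit = divmod(uid, SEGMENT_SIZE)
            segments[seg_no][tag] = segments[seg_no].get(tag, 0) | (1 << bit)
    return segments


def matching_bits(segment: Segment, tags: List[str]) -> int:
    """AND together the bitmaps of all requested tags within one segment."""
    bits = (1 << SEGMENT_SIZE) - 1
    for tag in tags:
        bits &= segment.get(tag, 0)
    return bits


def count_users(segments: List[Segment], tags: List[str]) -> int:
    """Step 2: count matches per segment in parallel, then sum the partial counts."""
    with ThreadPoolExecutor() as pool:
        partial = pool.map(lambda seg: bin(matching_bits(seg, tags)).count("1"), segments)
        return sum(partial)


def iter_user_ids(segments: List[Segment], tags: List[str]) -> Iterator[int]:
    """Step 3: yield matching user IDs lazily, one segment at a time."""
    for seg_no, segment in enumerate(segments):
        bits = matching_bits(segment, tags)
        for bit in range(SEGMENT_SIZE):
            if (bits >> bit) & 1:
                yield seg_no * SEGMENT_SIZE + bit


tag_users = {"a": [0, 3, 9, 17], "b": [3, 9, 10, 20], "c": [3, 9, 17]}
segments = build_segments(tag_users, n_segments=3)
count_ab = count_users(segments, ["a", "b"])
ids_ac = list(iter_user_ids(segments, ["a", "c"]))
```

Segmenting keeps each bitmap small enough to process independently, which is what makes the per-segment work easy to run in parallel and to stream back incrementally.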

Related Documentation

Locate the SQL statements with the greatest resource consumption - ApsaraDB RDS for PostgreSQL

A database is a relatively large application. Busy databases consume many resources such as memory, CPU, I/O, and network resources. SQL optimization is a means of optimizing your databases. To achieve the best result of optimization, you must first locate the SQL statements with the greatest resource consumption, such as the SQL statements that consume the most I/O resources.

Database resources include CPU, memory, and I/O. To locate the SQL statements that consume the most database resources in each dimension, you can use the pg_stat_statements plugin to collect statistics on the resource overhead of the database and analyze the top SQL statements listed by resource consumption.

This article provides examples to illustrate how to create the pg_stat_statements plugin, analyze the top SQL statements, and reset statistics.
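A minimal sketch of those three operations is shown below. The helper function and the metric names are hypothetical; the SQL relies only on `pg_stat_statements` columns that are assumed to be present across versions (`calls`, `rows`, `shared_blks_read`, `shared_blks_dirtied`, `temp_blks_written`). Timing columns are deliberately left out because their names differ between PostgreSQL versions. The extension must also be listed in `shared_preload_libraries` before it collects statistics.

```python
# Statement to create the extension in the current database.
SETUP_SQL = "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"

# Statement to discard the statistics collected so far.
RESET_SQL = "SELECT pg_stat_statements_reset();"

# Hypothetical metric names mapped to expressions over view columns.
ALLOWED_METRICS = {
    "calls": "calls",
    "rows": "rows",
    "shared_io": "shared_blks_read + shared_blks_dirtied",
    "temp_io": "temp_blks_written",
}


def top_sql_query(metric: str, limit: int = 5) -> str:
    """Build a query listing the statements that rank highest on one metric."""
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    expr = ALLOWED_METRICS[metric]
    return (
        f"SELECT userid, dbid, query, {expr} AS metric "
        f"FROM pg_stat_statements "
        f"ORDER BY {expr} DESC LIMIT {int(limit)};"
    )


io_query = top_sql_query("shared_io")
```

Restricting the metric to a fixed set of expressions avoids building SQL from arbitrary input, and resetting statistics before a test window makes the top list reflect only that window.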

Create a PostgreSQL schema

In DLA, the statements for creating ApsaraDB RDS for SQL Server (SQL Server) and ApsaraDB RDS for PostgreSQL (PostgreSQL) schemas and tables are the same as those for ApsaraDB RDS for MySQL (MySQL). In MySQL, a database is equivalent to a schema, so MySQL has only one of the two levels. SQL Server and PostgreSQL, however, contain both databases and schemas. Therefore, the statement for creating schemas is slightly different from that for MySQL.

Related Products

Data Lake Analytics

Data Lake Analytics does not require any ETL tools. This service allows you to use standard SQL syntax and business intelligence (BI) tools to efficiently analyze your data stored in the cloud at extremely low cost.

Cloud native federation analytics across multiple data sources: OSS, PostgreSQL, MySQL (RDS), NoSQL (Table Store), etc.

Alibaba Cloud Databases

Alibaba Cloud offers fully managed database services. We monitor, back up, and recover your database automatically so that you can fully focus on your business development. To provide more stable and scalable database services, Alibaba Cloud has optimized the source code of the open-source database engines. Our database services, such as ApsaraDB RDS for MySQL and ApsaraDB RDS for PPAS, carry lower risk compared to other services that use the community edition.

Alibaba Clouder