Data merging and data cleaning are required in many scenarios. Window queries can handle this kind of operation, but how can we make them faster and retrieve data in batches quickly?
Here is a quick summary of common methods for optimizing time series data queries:
Efficiency Comparison Table
Data volume | Number of Unique Values | Window Query (ms) | Subquery (ms) | Recursive Query (ms) |
---|---|---|---|---|
5 million | 1 million | 6,446 | 2,892 | 6,706 |
5 million | 1,000 | 6,176 | 7 | 9 |
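The gap in the second row is easier to see with a small model. The following is a Python simulation, not the SQL used in the benchmark: a sorted list of (key, timestamp) pairs stands in for a B-tree index, `latest_by_full_scan` mimics a query that touches every row, and `latest_by_skip_scan` mimics the recursive approach of jumping from one key to the next. All function names and the sample data are hypothetical.

```python
from bisect import bisect_right


def latest_by_full_scan(index):
    """Visit every (key, ts) entry, as a query over all rows would."""
    latest = {}
    visited = 0
    for key, ts in index:
        visited += 1
        # The index is sorted, so the last ts seen for a key is its maximum.
        latest[key] = ts
    return latest, visited


def latest_by_skip_scan(index):
    """Jump from key to key with binary search, one probe per distinct key."""
    latest = {}
    probes = 0
    pos = 0
    while pos < len(index):
        key = index[pos][0]
        # Position just past the last entry that belongs to this key.
        nxt = bisect_right(index, (key, float("inf")))
        probes += 1
        latest[key] = index[nxt - 1][1]
        pos = nxt
    return latest, probes


# Hypothetical data: 3 distinct keys with 1,000 rows each.
index = sorted((k, ts) for k in range(3) for ts in range(1000))
full, visited = latest_by_full_scan(index)
skip, probes = latest_by_skip_scan(index)
```

In this model the full scan does work proportional to the number of rows, while the skip scan does work proportional to the number of distinct keys. That would be consistent with the table: with only 1,000 unique values the recursive query is very fast, and with 1 million unique values its advantage disappears.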
PostgreSQL is a strong choice among open-source databases because it provides several solutions to the same problem, leaving you free to choose the one that best fits your needs.
For a detailed comparison of recursive, subquery, and window queries, please see Optimizing Time Series Querying on Alibaba Cloud RDS for PostgreSQL.
This article discusses how you can accelerate ad hoc queries on arbitrary field combinations in PostgreSQL by using a global dictionary and RUM indexes.
Solution 1 - Global Dictionary + Array Type + RUM Index
A global dictionary means that the value ranges of all fields together constitute one large value range, within which each "field name + field value" pair is unique.
After making a dictionary, you can choose either INT4 or INT8 as the element type of the dictionary.
There are many advantages to using arrays. For example, adding fields is a breeze, because you don't need to change the table structure. Instead, you only need to fill the contents of the newly added fields into the array.
The original AND queries are replaced with array CONTAINS queries (the @> operator), and the original OR queries are replaced with array INTERSECT queries (the && operator).
RUM indexes already support arrays, including CONTAINS and INTERSECT queries.
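The mapping from field conditions to array operations can be sketched outside the database. The Python below is only an illustration under stated assumptions: `GlobalDictionary`, `contains`, and `intersects` are hypothetical names, sets stand in for PostgreSQL integer arrays, and no RUM index is involved.

```python
from itertools import count


class GlobalDictionary:
    """Maps each (field name, field value) pair to one integer ID."""

    def __init__(self):
        self._ids = {}
        self._next_id = count(1)

    def encode(self, field, value):
        key = (field, value)
        if key not in self._ids:
            self._ids[key] = next(self._next_id)
        return self._ids[key]

    def encode_row(self, row):
        # A row becomes a set of IDs; a new field just adds one more ID.
        return frozenset(self.encode(f, v) for f, v in row.items())


def contains(row_ids, wanted):
    """AND semantics: the row must hold every wanted ID."""
    return set(wanted) <= row_ids


def intersects(row_ids, wanted):
    """OR semantics: the row must hold at least one wanted ID."""
    return not row_ids.isdisjoint(wanted)


d = GlobalDictionary()
rows = [
    d.encode_row({"city": "Hangzhou", "gender": "F"}),
    d.encode_row({"city": "Hangzhou", "gender": "M"}),
    d.encode_row({"city": "Beijing", "gender": "F", "vip": True}),
]
wanted = {d.encode("city", "Hangzhou"), d.encode("gender", "F")}
and_hits = [i for i, r in enumerate(rows) if contains(r, wanted)]
or_hits = [i for i, r in enumerate(rows) if intersects(r, wanted)]
```

Note that the third row carries an extra `vip` field without any change to the other rows, which is the flexibility the array representation is meant to provide.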
PostgreSQL 11 HASH Partition Table Combined with RUM
PostgreSQL 11 supports hash-partitioned tables. Performance may improve further when they are combined with smart parallel aggregation (AGG).
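As a rough picture of why partitioning helps aggregation, the sketch below routes rows to partitions and aggregates each partition separately before combining the partial results. It is a Python model with hypothetical names; in particular, `route` uses a plain modulo on the user ID, which is an assumption for illustration and not PostgreSQL's actual hash function.

```python
def route(user_id, modulus):
    """Pick a partition for a row (simplified stand-in for a hash function)."""
    return user_id % modulus


def partition_rows(rows, modulus):
    parts = [[] for _ in range(modulus)]
    for user_id, tags in rows:
        parts[route(user_id, modulus)].append((user_id, tags))
    return parts


def count_matching(part, wanted):
    """Partial aggregate: users in one partition holding all wanted tags."""
    return sum(1 for _, tags in part if wanted <= tags)


# Hypothetical data: 100 users; even IDs carry tags {0, 10}, odd IDs {1, 10}.
rows = [(uid, {uid % 2, 10}) for uid in range(100)]
parts = partition_rows(rows, 4)

# Each partial count is independent, so separate workers could compute them.
partials = [count_matching(p, {0, 10}) for p in parts]
total = sum(partials)
```

Because the partial counts do not depend on each other, they are the kind of work that can be spread across parallel workers and then summed.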
Alibaba Cloud ApsaraDB for PostgreSQL is capable of handling large amounts of records at a time. How much? Well, let's look at an example to find out. Let's assume we have 2 billion users, each with 1,000 tags, and we need to perform user tagging and pivoting based on any combination of tags (the business requirement is to compute a combination of up to 100 tags at a time).
With up to 100 tags per query across 2 billion users, this is equivalent to handling 200 billion records at a time, with a real-time response required for each query.
You may think that this would need at least a hundred machines. But in fact, this amount of data only requires one ApsaraDB for RDS PostgreSQL instance. This article discusses the cutting-edge RDS PG technology that helps us meet this sort of business requirement while consuming minimal resources.
Optimization Solution to Improve Response Speed
A database is a relatively large application. Busy databases consume many resources such as memory, CPU, I/O, and network resources. SQL optimization is one means of optimizing your databases. To get the most out of it, you must first locate the SQL statements with the greatest resource consumption, such as the statements that consume the most I/O resources.
Database resources include CPU, memory, and I/O. To locate the SQL statements that consume the most database resources in each dimension, you can use the pg_stat_statements extension to collect statistics on the resource overhead of the database and analyze the top SQL statements ranked by resource consumption.
This article provides examples to illustrate how to create the pg_stat_statements extension, analyze the top SQL statements, and reset statistics.
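To make "top SQL by resource consumption" concrete, here is a minimal Python sketch that ranks statement statistics by I/O time. The rows are made-up sample data, and `top_sql_by_io` is a hypothetical helper. The keys `query`, `calls`, `blk_read_time`, and `blk_write_time` mirror columns of the pg_stat_statements view, but column names vary across PostgreSQL versions, so check the view definition on your instance.

```python
def top_sql_by_io(rows, limit=5):
    """Return the statements with the highest total block I/O time."""

    def io_time(row):
        return row["blk_read_time"] + row["blk_write_time"]

    return sorted(rows, key=io_time, reverse=True)[:limit]


# Made-up statistics for illustration only (times in milliseconds).
sample = [
    {"query": "statement A", "calls": 120, "blk_read_time": 850.0, "blk_write_time": 10.0},
    {"query": "statement B", "calls": 9000, "blk_read_time": 15.0, "blk_write_time": 5.0},
    {"query": "statement C", "calls": 40, "blk_read_time": 300.0, "blk_write_time": 900.0},
]

top = top_sql_by_io(sample, limit=2)
top_queries = [row["query"] for row in top]
```

The same ranking can be repeated for other dimensions, such as total execution time or call count, by swapping the sort key.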
In Data Lake Analytics (DLA), the statements for creating ApsaraDB RDS for SQL Server (SQL Server) and ApsaraDB RDS for PostgreSQL (PostgreSQL) schemas and tables are largely the same as those for ApsaraDB RDS for MySQL (MySQL). In MySQL, a database is equivalent to a schema, so MySQL has only one of the two levels. SQL Server and PostgreSQL, however, contain both databases and schemas. Therefore, the statement for creating schemas is slightly different from that for MySQL.
Data Lake Analytics does not require any ETL tools. This service allows you to use standard SQL syntax and business intelligence (BI) tools to efficiently analyze your data stored in the cloud at extremely low cost.
Cloud-native federated analytics across multiple data sources: OSS, PostgreSQL, MySQL (RDS), NoSQL (Table Store), etc.
Alibaba Cloud offers fully managed database services. We monitor, back up, and recover your database automatically so that you can fully focus on your business development. To provide more stable and scalable database services, Alibaba Cloud has optimized the source code of the open-source database engines. Our database services, such as ApsaraDB RDS for MySQL and ApsaraDB RDS for PPAS, carry lower risk compared with services that use the community edition.