Streaming Data Warehouse Storage
1. Computation in the data warehouse
In computing, a data warehouse (DW or DWH), a system for reporting and data analysis, is considered a core component of business intelligence. It stores current and historical data in one place to create analytical reports for workers across the enterprise. [1]
A typical extract, transform, load (ETL) based data warehouse uses an ODS (operational data store) layer, a DWD (data warehouse detail) layer, and a DWS (data warehouse summary) layer to house its key functions. Data analysts can flexibly query each layer of the warehouse to obtain valuable business information.
There are three key indicators in the data warehouse [2]:
* Data freshness: the length of time from when a piece of data is generated to when it is available for user queries after a series of processing steps in the warehouse. ETL is the series of processes that prepares the data, and it is usually accomplished by scheduling and running a series of stream computing or batch computing jobs.
* Data query latency: after the data is ready, the user queries tables in the warehouse. The length of time from when the user issues the query to when the result is received is the query latency, and it directly determines the end user's experience.
* Cost: the amount of resources required to complete a given amount of data analysis (including all kinds of computation, such as ETL and queries). Cost is also a key indicator of a data warehouse.
What is the relationship between these three indicators?
Enterprises need to achieve better query latency and freshness while controlling costs, and different data may have different cost requirements.
Freshness and query latency are also a trade-off in some cases: for example, spending more time preparing, cleaning, and preprocessing the data makes queries run faster.
So the three form a trade-off triangle in the data warehouse [2]:
Given this triangle, what trade-offs do the industry's current mainstream architectures make?
2. Mainstream architecture of the industry
Typical offline data warehouse:
The offline data warehouse uses batch ETL to overwrite data at partition granularity (INSERT OVERWRITE). It handles very-large-scale data while keeping costs well under control.
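A minimal sketch of this pattern, assuming a Flink batch job with the filesystem connector (table names, paths, and schemas are hypothetical; the same INSERT OVERWRITE statement also runs on Hive and similar engines):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DailyBatchEtl {
    public static void main(String[] args) {
        // Batch mode: the job reads a full day's input and rewrites one partition.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // Hypothetical ODS and DWD tables; in practice these would already be
        // registered in a Hive or lake-storage catalog.
        tEnv.executeSql(
                "CREATE TABLE ods_orders (order_id STRING, amount DOUBLE, dt STRING) "
                        + "PARTITIONED BY (dt) WITH ('connector' = 'filesystem', "
                        + "'path' = '/warehouse/ods_orders', 'format' = 'parquet')");
        tEnv.executeSql(
                "CREATE TABLE dwd_orders (order_id STRING, amount DOUBLE, dt STRING) "
                        + "PARTITIONED BY (dt) WITH ('connector' = 'filesystem', "
                        + "'path' = '/warehouse/dwd_orders', 'format' = 'parquet')");

        // Rewrite yesterday's partition wholesale: the classic offline pattern.
        tEnv.executeSql(
                "INSERT OVERWRITE dwd_orders PARTITION (dt = '2022-01-01') "
                        + "SELECT order_id, amount FROM ods_orders WHERE dt = '2022-01-01'");
    }
}
```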
But it has two serious problems:
Poor freshness: the data delay is generally T+1, that is, data generated by the business today can only be queried the next day.
It is not good at processing update streams (changelogs). Offline data warehouses store append-only data; to ingest update streams such as database change logs, the full data and the incremental data must be merged over and over again, and the cost rises sharply.
To solve these problems, real-time data warehouses have gradually emerged. A typical implementation uses the Flink + Kafka solution to build the middle layers and finally writes to an online database or analysis system, achieving second-level end-to-end latency and very good data freshness.
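A minimal sketch of such a Kafka-backed middle layer, assuming the Flink Kafka connector is on the classpath (topic names, brokers, and schemas are hypothetical):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RealtimeMidLayer {
    public static void main(String[] args) {
        // Streaming mode: the ETL runs continuously instead of per partition.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical ODS topic carrying raw order events.
        tEnv.executeSql(
                "CREATE TABLE ods_orders (order_id STRING, amount DOUBLE, ts TIMESTAMP(3)) "
                        + "WITH ('connector' = 'kafka', 'topic' = 'ods_orders', "
                        + "'properties.bootstrap.servers' = 'broker:9092', "
                        + "'format' = 'json', 'scan.startup.mode' = 'earliest-offset')");

        // The DWD middle layer is just another Kafka topic: great freshness,
        // but it cannot serve ad hoc OLAP queries, which is the problem
        // discussed below.
        tEnv.executeSql(
                "CREATE TABLE dwd_orders (order_id STRING, amount DOUBLE, ts TIMESTAMP(3)) "
                        + "WITH ('connector' = 'kafka', 'topic' = 'dwd_orders', "
                        + "'properties.bootstrap.servers' = 'broker:9092', 'format' = 'json')");

        // Continuous cleaning job from ODS to DWD.
        tEnv.executeSql(
                "INSERT INTO dwd_orders SELECT order_id, amount, ts FROM ods_orders "
                        + "WHERE amount > 0");
    }
}
```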
However, it also gradually exposed some problems.
Problem 1: the middle layer cannot be queried
The ability to query data stored in Kafka is limited: OLAP queries cannot be performed flexibly, and long-term historical data is usually not retained. This is very different from the widely used data warehouse. In a mature warehouse system, every data set should be a queryable table abstraction, and Kafka cannot meet all of users' needs for such a Table abstraction. For example:
Query capabilities are limited. The real-time data warehouse architecture requires all queryable data sets to be pre-computed and written to a queryable analysis system, but in real business not all computations can be defined in advance, and a large part of data analysts' demand is ad hoc queries. If the intermediate queues cannot be queried, the business's data analysis capability is severely limited.
Troubleshooting is difficult. In a real-time data warehouse, when something is wrong with the data, the user needs to inspect the data pipeline; but because the queues that store the intermediate results cannot be queried, troubleshooting is very difficult.
To sum up, we want a unified architecture that yields a real-time data warehouse queryable at every layer, rather than a warehouse whose intermediate results exist only inside pipelines.
Problem 2: the cost of the real-time pipeline is high
There is no free lunch: building a real-time pipeline is relatively expensive.
Storage cost: whether it is Kafka or the ADS layer behind it, these are online services; they offer very low latency, but their storage cost is high.
Migration and maintenance cost: the real-time pipeline is a new system independent of the offline system, incompatible with the offline tool chain, so migration and maintenance costs are very high.
Therefore, we want a low-cost real-time data warehouse: one with low operating costs that is compatible with the offline tool chain while accelerating the existing offline data warehouse.
Because the two current architectures target different trade-offs and scenarios, businesses usually have to maintain both, sometimes with separate technical teams. This not only consumes a lot of resources but also incurs high development, operation, and maintenance costs.
So can we provide a data warehouse that strikes a reasonable balance among freshness, query latency, query capability, and cost? To answer this question, we need to analyze the technical principles behind freshness and query latency, the different architectures produced by different trade-offs, and the technical differences behind them.
3. ETL freshness
The first thing to consider is data freshness: freshness measures the length of time from when a piece of data is generated to when it is available for user queries after a series of processing steps in the warehouse. Data is ingested into the data warehouse and, after a series of ETL processing steps, reaches a usable state.
Traditional batch computation performs ETL per scheduling granularity, so its freshness is: scheduling granularity + ETL delay. The typical granularity is one day, so the freshness of a traditional offline data warehouse is at least one day. When computing at this granularity, the input and output of a computation are the full data set. For freshness finer than the granularity, the input and output of the computation must be partial, that is, incremental. Typical incremental computing is stream computing, such as Flink Streaming.
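As a hypothetical illustration of this arithmetic: with a one-day granularity, a record generated at 00:05 on day D lands in day D's partition; if the nightly ETL job starts at 01:00 on day D+1 and runs for an hour, that record only becomes queryable around 02:00 on day D+1, roughly 26 hours after it was generated.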
Incremental computing is not completely equivalent to stream computing; for example, incremental computing can also be done in small batches. Nor is full computing exactly equivalent to batch computing; for example, stream computing can compute over windows and emit full results (that is, stream computing can also run at high latency, which reduces cost).
4. Query Latency
Query latency directly affects the efficiency and experience of data analysis. Queries return results to people, not machines, and what a person looks at is filtered or aggregated data. In a traditional offline data warehouse, querying a large table may take more than ten minutes.
The most intuitive way to speed up queries is pre-computation; in essence, the ETL of a data warehouse is pre-computation. When a data analyst's query takes too long, the analyst asks the warehouse team to build the corresponding ETL pipeline; once the data is ready, the analyst queries the final result table directly. Viewed this way, pre-computation trades freshness for lower query latency.
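A minimal sketch of such pre-computation, again assuming a Flink batch job with the filesystem connector (names and schemas are hypothetical): the expensive scan-and-aggregate runs once ahead of time, and analysts then query the small summary table.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PrecomputeDws {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());
        // Hypothetical detail (DWD) table and summary (DWS) table.
        tEnv.executeSql(
                "CREATE TABLE dwd_orders (order_id STRING, amount DOUBLE, dt STRING) "
                        + "WITH ('connector' = 'filesystem', "
                        + "'path' = '/warehouse/dwd_orders', 'format' = 'parquet')");
        tEnv.executeSql(
                "CREATE TABLE dws_daily_revenue (dt STRING, revenue DOUBLE) "
                        + "WITH ('connector' = 'filesystem', "
                        + "'path' = '/warehouse/dws_daily_revenue', 'format' = 'parquet')");
        // The full scan-and-aggregate happens here, once, before any analyst asks.
        tEnv.executeSql(
                "INSERT INTO dws_daily_revenue "
                        + "SELECT dt, SUM(amount) FROM dwd_orders GROUP BY dt");
    }
}
```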
However, traditional offline data warehouses also serve a large number of ad hoc queries, in which users flexibly choose query conditions according to their needs, and queries against large tables may still take 10+ minutes. To return results as quickly as possible, the major storage systems apply a variety of optimizations.
For example, placing storage closer to computation; the closer the data is to the computation, the faster it can be read:
Some message queues and OLAP systems provide only local disk storage, which guarantees read performance but sacrifices flexibility: scaling out and migration are relatively expensive, and so is the storage itself.
The other direction is a storage-compute separated architecture: all data is remote, and the high cost of remote access to DFS/object storage is reduced through local caching (a toy sketch follows).
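A toy sketch of such a read-through cache in plain Java (the block-key scheme and capacity are hypothetical); real systems cache file blocks or pages with far more care around eviction and concurrency:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class BlockCache {
    private final int capacity;
    private final LinkedHashMap<String, byte[]> cache;

    BlockCache(int capacity) {
        this.capacity = capacity;
        // An access-ordered LinkedHashMap gives us LRU eviction for free.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > BlockCache.this.capacity;
            }
        };
    }

    // Serve from the local cache if present; otherwise pay the remote
    // round trip once and keep the block for subsequent reads.
    byte[] read(String blockKey, Function<String, byte[]> remoteFetch) {
        return cache.computeIfAbsent(blockKey, remoteFetch);
    }

    public static void main(String[] args) {
        BlockCache cache = new BlockCache(2);
        Function<String, byte[]> slowRemoteRead = key -> {
            System.out.println("remote fetch: " + key); // simulated DFS/object-store call
            return new byte[4096];
        };
        cache.read("part-0.orc#block-0", slowRemoteRead); // miss: goes remote
        cache.read("part-0.orc#block-0", slowRemoteRead); // hit: served locally
    }
}
```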
Another example is data skipping: using query conditions and fields to skip irrelevant data and speed up data retrieval:
Hive: queries specific partitions through partition pruning and skips irrelevant fields through columnar storage.
Lake storage: on top of columnar storage, introduces file-level statistics and uses them to skip as much irrelevant file reading as possible (see the sketch after this list).
OLAP systems: on top of columnar storage, some (such as ClickHouse) use an LSM structure to keep data as ordered as possible by primary key; an ordered layout is one of the structures most favorable to queries.
KV systems: organize data in an LSM structure to speed up queries.
Message queues: a queue achieves fast data location through a restricted reading interface; it only offers offset- or timestamp-based positioning for reading data incrementally.
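A toy sketch of file-statistics pruning in plain Java (the manifest layout and column are hypothetical); lake formats keep per-file min/max statistics in their manifests and consult them before reading any data file:

```java
import java.util.List;
import java.util.stream.Collectors;

public class FileSkipping {
    // Hypothetical per-file statistics, as lake formats keep in their manifests.
    record FileStats(String path, long minId, long maxId) {}

    // Keep only files whose [min, max] range can contain rows with id == target.
    static List<FileStats> prune(List<FileStats> files, long target) {
        return files.stream()
                .filter(f -> f.minId() <= target && target <= f.maxId())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FileStats> manifest = List.of(
                new FileStats("part-0.orc", 1, 1_000),
                new FileStats("part-1.orc", 1_001, 2_000),
                new FileStats("part-2.orc", 2_001, 3_000));
        // A query like `WHERE id = 1500` only needs to read part-1.orc.
        System.out.println(prune(manifest, 1_500));
    }
}
```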
There are many more optimization methods that are not listed here. Storage cooperates with computation in various ways to accelerate queries, so that data can be located and read quickly.
Through the above analysis, we can see that the underlying technologies of different systems are basically the same:
Stream computing and batch computing are different modes of computation, and either can perform full or incremental computation.
The means by which storage speeds up queries all come down to locating and reading data quickly, and the underlying principles are the same.
In theory, we should be able to build an architecture, through some selection and combination of these underlying technologies, that achieves the trade-off we want. Depending on the trade-off, this unified architecture would need to cover the following scenarios:
Real-time data warehouse: very good freshness.
Near-real-time data warehouse: accelerates the offline data warehouse, improving freshness without incurring too high a cost.
Offline data warehouse: better cost control.
Offline OLAP: accelerates the query performance of one part of the data warehouse, such as the ADS tables.
The Streaming Data Warehouse aims to be such a unified architecture:
An ideal data warehouse should let users freely adjust the trade-off among cost, freshness, and query latency, which requires the warehouse to fully cover the capabilities of offline data warehouses, real-time data warehouses, and OLAP. The Streaming Data Warehouse takes a step beyond the real-time data warehouse and greatly reduces its cost.
While providing real-time computing capabilities, the Streaming Data Warehouse lets users cover offline warehouse scenarios under the same architecture, making the appropriate trade-offs according to business needs to solve problems in different scenarios.
5. Streaming Data Warehouse
Before looking at how the storage architecture of the Streaming Data Warehouse is designed, let's revisit the two problems of mainstream real-time data warehouses discussed earlier. Once these two problems are solved, the architectural design of the Streaming Data Warehouse follows naturally.
5.1 Intermediate data cannot be queried
Since the intermediate Kafka storage cannot be queried, the usual approach to real-time/offline integration is to run the real-time and offline pipelines side by side, wrap them at the business layer as much as possible, and try to present users with a single set of table abstractions.
Many users use Flink plus Kafka to process real-time data streams and write the analysis results to the online service layer for display or further analysis, while also running large-scale batch/full computations daily or periodically correcting historical data. [3]
But there are several problems with this architecture:
Different table abstractions: with different technology stacks, the real-time and offline pipelines each have their own table abstraction, which increases development cost and reduces development efficiency. However much the business layer encapsulates, rough edges and misalignments always remain.
It is difficult to keep the data semantics of the real-time warehouse and the offline warehouse naturally consistent.
In the Streaming Data Warehouse, we want the warehouse to expose a unified Table abstraction for queries, so that all data in motion can be analyzed without blind spots. This requires the unified Table abstraction to support two capabilities at the same time:
Message Queue
OLAP query
That is, on the same Table, users can subscribe to its changelog as a message queue, or run OLAP queries against it directly.
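A hedged sketch of this dual capability in Flink SQL, assuming a lake-storage catalog such as Apache Paimon (which grew out of Flink Table Store); the catalog options follow Paimon's documented form but should be checked against the version in use, and all table names are hypothetical:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class UnifiedTable {
    public static void main(String[] args) {
        // Batch mode: an OLAP-style snapshot query against the table.
        TableEnvironment batchEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());
        batchEnv.executeSql("CREATE CATALOG lake WITH ('type' = 'paimon', "
                + "'warehouse' = 'file:///tmp/warehouse')");
        batchEnv.executeSql("USE CATALOG lake");
        batchEnv.executeSql("CREATE TABLE IF NOT EXISTS dwd_orders "
                + "(order_id STRING, amount DOUBLE, PRIMARY KEY (order_id) NOT ENFORCED)");
        batchEnv.executeSql("SELECT SUM(amount) FROM dwd_orders").print();

        // Streaming mode on the very same table: subscribe to its changelog
        // like a message queue. This query keeps running until cancelled.
        TableEnvironment streamEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        streamEnv.executeSql("CREATE CATALOG lake WITH ('type' = 'paimon', "
                + "'warehouse' = 'file:///tmp/warehouse')");
        streamEnv.executeSql("USE CATALOG lake");
        streamEnv.executeSql("SELECT * FROM dwd_orders").print();
    }
}
```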
Next, let's look at the second problem of the classic real-time data warehouse.
5.2 High cost of the real-time pipeline
Although the unified Table abstraction provided by the Streaming Data Warehouse solves the freshness and query-latency problems well, it costs more than an offline data warehouse, and not all business scenarios have high requirements on freshness and query latency. So it is still necessary to provide low-cost table storage.
Lake storage is a good option here:
Lower storage cost: lake storage is built on DFS/object storage, requires no always-on service, and has lower resource and operation costs.
Flexible local updates: what if a historical partition has a problem and needs to be corrected? Lake storage's computation cost is lower here: lake storage plus offline ETL, using INSERT OVERWRITE to correct historical partitions, costs far less than real-time updates (see the sketch after this list).
Openness: lake storage is open to all kinds of batch computing engines.
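A minimal sketch of such a correction, under the same hypothetical Paimon-style catalog as above: recomputing one bad day as an offline batch job and overwriting only that partition.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BackfillPartition {
    public static void main(String[] args) {
        // Correcting history is an offline batch job, separate from the
        // always-on streaming pipeline.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());
        tEnv.executeSql("CREATE CATALOG lake WITH ('type' = 'paimon', "
                + "'warehouse' = 'file:///tmp/warehouse')");
        tEnv.executeSql("USE CATALOG lake");
        // Hypothetical detail and summary tables, partitioned by day.
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS dwd_orders "
                + "(order_id STRING, amount DOUBLE, dt STRING) PARTITIONED BY (dt)");
        tEnv.executeSql("CREATE TABLE IF NOT EXISTS dws_daily_revenue "
                + "(dt STRING, revenue DOUBLE) PARTITIONED BY (dt)");
        // Recompute one bad day and overwrite only that partition; other
        // partitions and their readers are untouched.
        tEnv.executeSql("INSERT OVERWRITE dws_daily_revenue PARTITION (dt = '2022-01-01') "
                + "SELECT SUM(amount) FROM dwd_orders WHERE dt = '2022-01-01'");
    }
}
```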
Therefore, the Streaming Data Warehouse needs to provide low-cost offline storage while keeping the full pipeline of data flowing in real time, and its architecture must not hurt the real-time pipeline. Since the SLA requirements of real-time pipelines are generally higher than those of offline pipelines, the design and implementation of Streaming Data Warehouse storage should treat queue writing and consumption as high priority, and the storage of historical data must never impair its ability to act as a queue.