
Streaming Data Warehouse Storage: Requirements and Architecture

This article discusses the requirements and architecture of streaming data warehouse storage.

By Jingsong Lee

1. Computing in Data Warehouses

In the field of computing, a data warehouse (DW or DWH) is a system for data reporting and analysis and is considered a core component of business intelligence. It stores current and historical data and produces analytical reports for workers across the enterprise. [1]

A typical Extract-Transform-Load (ETL) based data warehouse relies on the ODS (operational data store), DWD (data warehouse detail), and DWS (data warehouse summary) layers to implement its key functions. Data analysts can flexibly query each layer of the warehouse for valuable business information.

There are three key factors [2] in the data warehouse:

  1. The Freshness of Data: The time from when data is generated to when it is available for users to query after a series of processing in the warehouse. ETL generally refers to this series of data preparation steps, and it is mostly implemented with stream computing or batch computing.
  2. Latency of Data Query: After the data is prepared, the user queries a table. The latency is the time from when the user sends the query to when the result is received, and it directly determines the end user's experience.
  3. Cost: The amount of resources required to perform a given amount of data analysis (including both ETL and queries). Cost is also a key factor in data warehouses.

What is the relationship among these three factors?

  • Enterprises need to achieve lower query latency and better data freshness while controlling costs, and different data may have different cost requirements.
  • In some cases, freshness and query latency trade off against each other. For example, if you spend longer on data preparation, cleaning, and preprocessing, queries will be faster.

These three constitute a Tradeoff Triangle [2] in the data warehouse:

[Figure 1: The Tradeoff Triangle of data warehouses]

(Note: In the Tradeoff Triangle, closer to the vertex means better, and farther from the vertex means worse.)

Given this Triangle, what are the Tradeoffs made by the field's current mainstream architectures?

2. The Mainstream Architecture in the Field

Typical Offline Data Warehouses:

[Figure 2: A typical offline data warehouse]

Offline data warehouses use batch ETL to overwrite data (INSERT OVERWRITE) at partition granularity, which gives them good cost control even when processing extremely large volumes of data.

However, it has two serious problems:

  1. Poor Freshness: The data latency is generally T+1, which means data generated today can only be queried the next day.
  2. Poor Handling of Update Streams (Change Logs): Offline data warehouses store append-only data. When they receive update streams (such as database change logs), they have to repeatedly merge the full data with the incremental data, which makes costs soar (see the sketch after this list).
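
To make the cost concrete, here is a minimal sketch (in Java, with illustrative names, not the logic of any specific warehouse) of the full-plus-incremental merge an offline warehouse has to run to apply a change log: even a small change log forces the whole snapshot to be rewritten.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal sketch of the full + incremental merge an offline warehouse runs
// to apply a day's change log onto yesterday's full snapshot. In practice this
// is a distributed batch job; the names here are illustrative.
public class SnapshotMerge {

    // A change-log entry: an upsert or a delete for a primary key.
    record Change(String key, String value, boolean isDelete) {}

    static Map<String, String> merge(Map<String, String> fullSnapshot, List<Change> changelog) {
        // Copy the entire full snapshot, then replay every change on top of it.
        Map<String, String> next = new HashMap<>(fullSnapshot);
        for (Change c : changelog) {
            if (c.isDelete()) {
                next.remove(c.key());
            } else {
                next.put(c.key(), c.value());
            }
        }
        return next; // becomes the next "full" partition, rewritten entirely
    }

    public static void main(String[] args) {
        Map<String, String> full = new HashMap<>(Map.of("u1", "Beijing", "u2", "Hangzhou"));
        List<Change> changes = List.of(
                new Change("u2", "Shanghai", false), // update
                new Change("u3", "Shenzhen", false), // insert
                new Change("u1", null, true));       // delete
        // Contains u2=Shanghai and u3=Shenzhen; u1 has been deleted.
        System.out.println(merge(full, changes));
    }
}
```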

[Figure 3: A typical real-time data warehouse built with Flink + Kafka]

The real-time data warehouse has gradually emerged to solve the problems above. A typical real-time data warehouse uses Flink + Kafka to build the intermediate layers and writes results into an online database or analysis system, achieving an end-to-end procedure with second-level latency and good data freshness.

However, it also has some problems.

The first problem is that the intermediate layer cannot be queried.

Data queries in Kafka are limited: OLAP queries cannot be performed flexibly, and long-term historical data is usually not stored. This is very different from widely used data warehouses. In a mature warehouse system, every data set should be a queryable table abstraction, and Kafka cannot meet all user requirements for such a table abstraction. For example:

  1. Data queries are limited. The real-time data warehouse architecture requires every queryable data set to be precomputed and written into an analysis system that can be queried. However, not all computing can be predefined in real business, and data analysts mostly need ad-hoc queries. If the intermediate data queues cannot be queried, data analysis for the business is severely limited.
  2. Troubleshooting is difficult. If there is a problem with the data in a real-time data warehouse, users need to troubleshoot the data pipelines. However, the queues that store intermediate results cannot be queried, which makes troubleshooting difficult.

In summary, we hope to have a unified architecture that yields a queryable real-time data warehouse, rather than one whose intermediate data is stuck in all kinds of complex pipelines.

The second problem is the high cost of the real-time procedure.

It is expensive to build a real-time procedure.

  1. Storage Costs: Both Kafka and the downstream ADS systems serve users online with low latency, but their storage costs are high.
  2. Migration and Maintenance Costs: The real-time procedure is a new set of systems independent of the offline procedure, and its toolchain is incompatible with the offline toolchain. Both migration and maintenance costs are high.

Therefore, we hope to have a low-cost real-time data warehouse that keeps operating costs low and stays compatible with offline toolchains while accelerating the original offline data warehouse.

Summary:

                                                   Offline Data Warehouse   Real-Time Data Warehouse
Cost                                               Low                      High
Freshness                                          Low                      High
Query latency of staging tables in the warehouse   High                     N/A
Query latency of result tables in the warehouse    Low                      Low

Since the two current architectures target different Tradeoffs and scenarios, users have to maintain both architectures, often even with different technical teams. This brings high resource costs, high development costs, and high O&M costs.

Is it possible to develop a data warehouse that is better balanced in terms of freshness, query latency, query capability, and cost? To answer this question, we need to analyze the technical principles behind data freshness and query latency, the architectures that result from different Tradeoffs, and the technical differences behind them.

3. ETL Freshness

The first thing to think about is data freshness: the time from when data is generated to when it is available for users to query after a series of processing in the warehouse. Data enters the warehouse, and after a series of ETL steps it reaches a queryable state.

Traditional batch computing performs ETL at day granularity, so its freshness is one day plus the ETL latency; the data freshness of a traditional offline data warehouse is therefore at least one day. With day-granularity processing, the input and output are handled by full computing. If the required data freshness is finer than a day, the input and output are handled by incremental computing. Typical incremental computing is stream computing, such as Flink Streaming.

However, incremental computing is not the same as stream computing; for example, incremental computing can also be performed in small batches. Nor is full computing completely equivalent to batch computing; for example, windows can be used in stream computing to output full results. (The latency of stream computing can also be set very high, which reduces the cost.)

[Figure 4: Freshness of full computing and incremental computing]
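
The minimal sketch below (illustrative names only, not Flink code) contrasts the two modes on the same aggregation: full computing rescans everything that exists so far, while incremental computing keeps state and folds in only the new records.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch contrasting full computing with incremental computing for the same
// aggregation: the count of orders per user. Names are illustrative.
public class FreshnessModes {

    record Order(String userId) {}

    // Full computing: re-scan all existing data (e.g. a daily batch job).
    static Map<String, Long> fullCount(List<Order> allOrders) {
        Map<String, Long> counts = new HashMap<>();
        for (Order o : allOrders) {
            counts.merge(o.userId(), 1L, Long::sum);
        }
        return counts;
    }

    // Incremental computing: keep state and fold in only the new records,
    // which is what a streaming job does continuously.
    static void incrementalCount(Map<String, Long> state, List<Order> newOrders) {
        for (Order o : newOrders) {
            state.merge(o.userId(), 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        List<Order> day1 = List.of(new Order("u1"), new Order("u2"), new Order("u1"));
        List<Order> late = List.of(new Order("u2"));

        // Batch: freshness is bounded by how often the full job reruns.
        System.out.println(fullCount(day1));

        // Streaming: state is updated as records arrive, so freshness is near real time.
        Map<String, Long> state = new HashMap<>(fullCount(day1));
        incrementalCount(state, late);
        System.out.println(state);
    }
}
```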

4. Query Latency

Query latency directly affects the efficiency of data analysis and the user experience. Query results are returned to a person, and the data he or she sees has been filtered or aggregated. In traditional offline data warehouses, querying a large table may take more than ten minutes.

The most effective way to speed up queries is precomputation. In essence, the ETL processing in data warehouses is precomputation. When a query initiated by data analysts takes too long to compute, data warehouse engineers build the corresponding ETL pipelines; after the data is prepared, the analysts can query the final result table directly. This trades freshness for lower query latency.

However, traditional offline data warehouses also receive a large number of ad-hoc queries in which users flexibly choose query conditions according to their needs, and queries over large tables may still take more than ten minutes. Mainstream storage systems therefore apply various optimizations to return query results as quickly as possible.

One example is bringing the storage closer to the computation: the closer the storage is to the computation, the faster the data can be read:

  • Some message queues and OLAP systems only provide storage on local disks, which guarantees read performance but sacrifices flexibility; capacity expansion and data migration are relatively expensive.
  • Another approach is a compute-storage separation architecture: all data is remote, but a local data cache is used to reduce the high cost of remote access to the DFS/object store (see the sketch after this list).
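
Below is a minimal sketch of the local-cache idea behind compute-storage separation, with a stand-in for the remote read; it is not any particular system's cache implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// A sketch of a local read cache: data lives in remote DFS/object storage, but
// hot files are copied to local disk once and reread from there. Illustrative only.
public class LocalReadCache {

    private final Path cacheDir;

    LocalReadCache(Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    byte[] read(String remoteKey) throws IOException {
        Path local = cacheDir.resolve(Integer.toHexString(remoteKey.hashCode()));
        if (Files.exists(local)) {
            return Files.readAllBytes(local);      // cache hit: no remote round trip
        }
        byte[] bytes = fetchFromRemote(remoteKey); // cache miss: pay the remote cost once
        Files.write(local, bytes);
        return bytes;
    }

    // Stand-in for an object-store or DFS client call.
    private byte[] fetchFromRemote(String remoteKey) {
        return ("contents of " + remoteKey).getBytes();
    }

    public static void main(String[] args) throws IOException {
        LocalReadCache cache = new LocalReadCache(Files.createTempDirectory("read-cache"));
        cache.read("warehouse/dwd/orders/part-0.orc"); // remote fetch + cache fill
        cache.read("warehouse/dwd/orders/part-0.orc"); // served from local disk
    }
}
```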

Another example is data skipping, which accelerates data lookup by skipping irrelevant data based on the query conditions and fields:

  • Hive: It queries data in specific partitions through partition pruning and skips irrelevant fields through column-oriented storage.
  • Lake Storage: On top of column-oriented storage, it introduces per-file statistics and uses them to minimize the reading of unnecessary files (see the sketch after this list).
  • OLAP Systems: On top of column-oriented storage, an OLAP system such as ClickHouse uses an LSM structure to keep data as ordered as possible by primary key; ordering is one of the most favorable properties for queries.
  • KV Systems: They use the LSM structure to accelerate queries based on the data layout.
  • Message Queues: A queue can only locate data quickly through a special read interface, providing Offset/Timestamp-based access to read incremental data.
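
The sketch below illustrates the file-statistics form of data skipping under the assumption of simple per-file min/max values; the names are illustrative and do not correspond to any specific format's metadata.

```java
import java.util.List;

// A minimal sketch of data skipping with per-file statistics: keep min/max of a
// column per file and skip files whose range cannot match the query predicate.
public class DataSkipping {

    record FileStats(String path, long minOrderId, long maxOrderId) {}

    // Return only the files that might contain rows with orderId == target.
    static List<FileStats> prune(List<FileStats> files, long target) {
        return files.stream()
                .filter(f -> target >= f.minOrderId() && target <= f.maxOrderId())
                .toList();
    }

    public static void main(String[] args) {
        List<FileStats> files = List.of(
                new FileStats("part-0.orc", 1, 1_000),
                new FileStats("part-1.orc", 1_001, 2_000),
                new FileStats("part-2.orc", 2_001, 3_000));

        // Only part-1.orc can contain orderId 1500; the other two files are never read.
        System.out.println(prune(files, 1_500));
    }
}
```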

There are many other optimizations that are not enumerated here. Storage systems use various methods to cooperate with computing systems and make queries faster.

Through the analysis above, we can see that the underlying technologies of different systems are the same:

  • Stream computing and batch computing are different modes of computing, and both can complete full or incremental computing.
  • All the methods by which storage optimizes query performance come down to finding data fast and reading data fast; the basic principles are the same.

Theoretically speaking, we can build a unified architecture through the combination of underlying technologies to obtain our desired Tradeoff. This unified architecture may need to meet the following requirements based on different Tradeoffs.

  • Real-Time Data Warehouse: Good data freshness.
  • Near Real-Time Data Warehouse: An accelerated version of the offline data warehouse that improves freshness without adding too much cost.
  • Offline Data Warehouse: Better cost control.
  • Offline OLAP: Accelerated query performance for a certain part of the data warehouse, such as the tables at the ADS layer.

The goal of Streaming Warehouse is to build a unified architecture:

[Figure 5: The Tradeoff Triangle of the unified Streaming Warehouse architecture]

(Note: In the Tradeoff Triangle, closer to the vertex means better, and farther from the vertex means worse.)

In an ideal data warehouse, users can adjust the Tradeoff among cost, freshness, and query latency at will. This requires the ideal warehouse to offer all the capabilities of offline data warehouses, real-time data warehouses, and OLAP systems. The Streaming Data Warehouse takes a step forward from the real-time data warehouse and significantly reduces its cost.

The Streaming Data Warehouse provides real-time computing and lets users use the capabilities of an offline data warehouse within the same architecture. Users can make the corresponding Tradeoff according to their business requirements and thus solve the problems of different scenarios.

5. Streaming Data Warehouse

Before talking about how the Streaming Data Warehouse storage architecture is designed, let's review the two problems of the mainstream real-time data warehouses mentioned earlier. After solving these two problems, the architecture of Streaming Data Warehouses will be clear.

5.1 Intermediate Data Cannot Be Queried

Since the intermediate-layer data in Kafka cannot be queried, the usual way to integrate real-time and offline processing is to run the real-time pipeline and the offline pipeline in parallel, with the business layer doing as much encapsulation as possible to present users with a single table abstraction.

[Figure 6: Real-time and offline pipelines running in parallel]

Many users use Flink + Kafka to process real-time data streams and write the analysis results to the online service layer for display or further analysis. At the same time, they asynchronously import the Kafka data of the real-time data warehouse into the offline data warehouse in the background to supplement the real-time data, and they regularly run large-scale batch/full processing or correct historical data every day. [3]

However, there are several problems with this architecture:

  • The table abstractions are different. Because they use different technology stacks, the real-time pipeline and the offline pipeline have two different table abstractions, which increases development cost and reduces development efficiency. The business layer encapsulates as much as possible, but it can still run into difficulties such as data misalignment.
  • It is difficult to keep the data definitions consistent between the real-time and offline data warehouses.

[Figure 7: A unified table abstraction in the Streaming Data Warehouse]

In the Streaming Data Warehouse, we hope the warehouse has a unified, query-oriented table abstraction so that all flowing data can be analyzed with nothing missing. This requires the unified table abstraction to provide both of the following:

  • Message Queue
  • OLAP Query

In other words, you can subscribe to the change log of a table as if it were a message queue and also run OLAP queries directly on the same table.
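
As a thought experiment, the hypothetical sketch below (not the Flink Table Store API) shows the two access patterns on one in-memory table: the change log backs queue-style subscription, and replaying it yields a snapshot for OLAP-style scans.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// A hypothetical sketch of the unified table abstraction described above: the
// same table can be consumed as a change-log stream and queried as a snapshot.
public class UnifiedTable {

    record Change(String op, String key, String value) {} // op: "+I", "+U", "-D"

    private final List<Change> changelog = new ArrayList<>();

    // Write path: every update is appended to the change log.
    void write(Change change) {
        changelog.add(change);
    }

    // Queue-style access: replay the change log from an offset to a consumer.
    void subscribeChangelog(int fromOffset, Consumer<Change> consumer) {
        changelog.subList(fromOffset, changelog.size()).forEach(consumer);
    }

    // OLAP-style access: materialize the current snapshot of the table.
    Map<String, String> scan() {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (Change c : changelog) {
            if (c.op().equals("-D")) snapshot.remove(c.key());
            else snapshot.put(c.key(), c.value());
        }
        return snapshot;
    }

    public static void main(String[] args) {
        UnifiedTable dwdOrders = new UnifiedTable();
        dwdOrders.write(new Change("+I", "o1", "pending"));
        dwdOrders.write(new Change("+U", "o1", "paid"));
        dwdOrders.write(new Change("+I", "o2", "pending"));

        // A downstream streaming job subscribes to the change log ...
        dwdOrders.subscribeChangelog(0, c -> System.out.println(c.op() + " " + c.key()));
        // ... while an analyst queries the same table directly.
        System.out.println(dwdOrders.scan()); // {o1=paid, o2=pending}
    }
}
```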

Let's talk about the second problem of the classic real-time data warehouse.

5.2 High Cost of Real-Time Procedures

Although the unified table abstraction provided by the Streaming Data Warehouse solves the problems of data freshness and query latency, its cost is higher than that of an offline data warehouse. In many cases, not every business scenario has high requirements for freshness and query latency, so it is still necessary to provide low-cost table storage.

Lake storage is a good choice.

  • Lower Storage Costs: Built on DFS/object storage, lake storage is serverless and has lower resource and maintenance costs.
  • Flexible Data Updates: What if there is a problem with a historical partition? What if historical partitions need to be corrected? The computing cost of lake storage is lower: you can revise historical partitions with offline ETL and INSERT OVERWRITE, which costs far less than real-time updates.
  • Openness: Lake storage is open to various batch computing engines.

Therefore, while keeping data flowing in real time end to end, the Streaming Data Warehouse also provides low-cost offline storage, and the architecture ensures that this storage does not affect the real-time procedure. Generally speaking, the SLA requirements of real-time procedures are higher than those of offline procedures, so in the design and implementation of Streaming Data Warehouse storage, writing to and consuming the data queues must be treated as the top priority, and storing historical data must not affect operations on the queues.

6. Flink Table Store

Flink Table Store [4] provides integrated streaming/batch storage designed for Streaming Warehouses.

Over the past few years, with the help of many developers and users, Apache Flink has become one of the best distributed computing engines, especially for large-scale stream processing with local state. Nevertheless, there are still challenges when users try to gain insights from their data, and a prominent one is the lack of a storage system that can serve all computing modes.

Until now, it has been common to deploy several storage systems to work with Flink for different purposes. A typical setup is a message queue for stream processing, a scannable file system or object store for batch processing and ad-hoc queries, and a KV store for point lookups. Due to its complexity and heterogeneity, such an architecture faces challenges in data quality and system maintenance, and it has become a major problem that undermines the end-to-end user experience of Flink's unified stream and batch processing.

Flink Table Store was developed to solve the problems above. It is an important step for Flink that extends Flink's services from computing to storage, so we can provide a better end-to-end experience for our users.

6.1 Architecture

[Figure 8: The architecture of Flink Table Store]

6.1.1 Service

The coordinator is the control node of the cluster. It is mainly responsible for managing the executors, and its main responsibilities include:

  • Lifecycle Management: The coordinator manages the lifecycle of executors, and clients use the coordinator to find the addresses of executors.
  • Data Manager:
    • It manages table versions, communicating with the Metastore and checkpointing versions to it.
    • It manages caches and indexes based on the patterns of the written data and queries.
  • Resource Manager:
    • It manages the distribution of table buckets among executors.
    • It dynamically allocates buckets to executors as required (see the sketch after this list).
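
The sketch below illustrates these two mappings under the assumption of hash-based bucketing and simple round-robin assignment; it is not the actual coordinator logic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch of bucket handling, assuming hash-based bucketing: records map to
// buckets by hashing the bucket key, and buckets are assigned to executors.
public class BucketAssignment {

    static int bucketOf(String bucketKey, int numBuckets) {
        // Stable mapping from key to bucket; the same key always lands in the same bucket.
        return Math.floorMod(bucketKey.hashCode(), numBuckets);
    }

    static Map<Integer, String> assignBucketsToExecutors(int numBuckets, List<String> executors) {
        Map<Integer, String> assignment = new HashMap<>();
        for (int bucket = 0; bucket < numBuckets; bucket++) {
            assignment.put(bucket, executors.get(bucket % executors.size())); // round robin
        }
        return assignment;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        Map<Integer, String> assignment =
                assignBucketsToExecutors(numBuckets, List.of("executor-1", "executor-2"));

        String key = "user-42";
        int bucket = bucketOf(key, numBuckets);
        System.out.println(key + " -> bucket " + bucket + " -> " + assignment.get(bucket));
    }
}
```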

The Metastore is an abstract node. It can connect to Hive Metastore, rely on a filesystem alone to minimize dependencies, or connect to your own metastore. It stores only the most basic table information, so you don't have to worry about its performance: the more detailed and complex table information is kept in the lake storage.

The executor is a separate compute node that serves as a storage cache and a local compute accelerator:

  • It receives updated data, writes it to the local cache and local disk, and flushes it to the DFS.
  • It also serves real-time OLAP queries and queue consumption and performs some accelerated local computing.

Each executor processes one or more buckets, and each bucket has a corresponding change log. These change logs are stored in the message queue and are mainly used as:

  • Write-Ahead Log: After an executor failover, the log can be replayed to restore the data (see the sketch after this list).
  • Queue Abstraction: They provide the change log stream of the table for downstream stream computing to consume.
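
The following minimal sketch shows the write-ahead-log role: after a failover, state is rebuilt from the last checkpoint plus the tail of the change log. Offsets and the state layout are illustrative, not the real recovery protocol.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch of failover recovery: restore the checkpointed state, then replay
// the bucket's change log from the last checkpointed offset.
public class WalRecovery {

    record LogEntry(long offset, String key, String value) {}

    static Map<String, String> recover(Map<String, String> checkpointedState,
                                       long checkpointedOffset,
                                       List<LogEntry> changelog) {
        Map<String, String> state = new HashMap<>(checkpointedState);
        for (LogEntry e : changelog) {
            if (e.offset() > checkpointedOffset) {
                state.put(e.key(), e.value()); // replay only what the checkpoint missed
            }
        }
        return state;
    }

    public static void main(String[] args) {
        Map<String, String> checkpoint = Map.of("u1", "v1");
        List<LogEntry> log = List.of(
                new LogEntry(1, "u1", "v1"),
                new LogEntry(2, "u2", "v2"),   // written after the checkpoint
                new LogEntry(3, "u1", "v1b")); // also after the checkpoint

        System.out.println(recover(checkpoint, 1, log)); // u1=v1b, u2=v2
    }
}
```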

6.1.2 Lake Storage

After a checkpoint of the data in the executor, the data lands in the Lake Storage, which is based on a columnar file format and a shared DFS. Lake Storage provides a complete table format abstraction. Its main purpose is to support updates and reads at a lower cost:

  • LSM Structure: It supports large volumes of data updates and high-performance data queries (see the sketch after this list).
  • Columnar File Format: Apache ORC is used to support efficient queries.
  • Lake Storage: Metadata and data are stored on DFS or object storage.
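
The sketch below shows, in a much simplified form, why an LSM structure suits heavy updates: writes only append new sorted runs, and reads merge the runs with newer data overriding older data. The real implementation merges sorted files streamingly; maps are used here only for brevity.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A sketch of LSM merge-on-read: each sorted run maps primary key -> row, and
// a read merges the runs so that the newest value per key wins.
public class LsmMergeRead {

    // Runs are ordered oldest to newest.
    static Map<String, String> mergeRuns(List<Map<String, String>> runsOldestFirst) {
        Map<String, String> merged = new TreeMap<>(); // keep the result ordered by key
        for (Map<String, String> run : runsOldestFirst) {
            merged.putAll(run); // later (newer) runs overwrite earlier values
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> level1 = Map.of("k1", "old", "k2", "old");
        Map<String, String> level0 = Map.of("k1", "new", "k3", "new"); // newest data

        // The read sees the latest value per key without rewriting level1 in place.
        System.out.println(mergeRuns(List.of(level1, level0))); // {k1=new, k2=old, k3=new}
    }
}
```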

6.1.3 Separation of Hot and Cold Data

The read and write paths are divided into two:

  • Streaming Pipeline & Online OLAP Query: The system obtains metadata from the coordinator and then reads and writes data through the executors.
  • Batch Pipeline & Offline Query: The system obtains metadata from the Metastore and then reads and writes data through Lake Storage.

The data on the Service side is up to date and is synchronized to Lake Storage after minute-level checkpoints, so the data a user reads from lake storage is not exactly the same as the real-time data. Essentially, though, the data on the Service side and the Lake Storage side is the same.
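
This read-path split can be pictured with the minimal sketch below; the query kinds and routing rules are illustrative simplifications of the two paths described above, not actual planner logic.

```java
// A sketch of routing reads between the hot Service path and the cold Lake Storage path.
public class ReadPathRouter {

    enum QueryKind { STREAMING_PIPELINE, ONLINE_OLAP, BATCH_PIPELINE, OFFLINE_QUERY }
    enum ReadPath { SERVICE, LAKE_STORAGE }

    static ReadPath route(QueryKind kind) {
        return switch (kind) {
            // Latest hot data with low-latency requirements.
            case STREAMING_PIPELINE, ONLINE_OLAP -> ReadPath.SERVICE;
            // Large scans that must not disturb online serving; checkpoint-level freshness is enough.
            case BATCH_PIPELINE, OFFLINE_QUERY -> ReadPath.LAKE_STORAGE;
        };
    }

    public static void main(String[] args) {
        System.out.println(route(QueryKind.ONLINE_OLAP));   // SERVICE
        System.out.println(route(QueryKind.OFFLINE_QUERY)); // LAKE_STORAGE
    }
}
```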

Service and Lake Storage differ in the following respects:

  • The Service is suited to the latest hot data; it provides fast data updates and high-performance, low-latency queries.
  • The Service is not suited to offline queries, both because they would affect the stability of online queries and because the cost would be higher.
  • The Service does not support INSERT OVERWRITE for the batch pipeline.

Therefore, the storage relies on lake storage to handle these tasks. How, then, do we determine which data should be processed by the Service and which by lake storage?

The INSERT OVERWRITE operation of the batch pipeline can only be performed on archived partitions (see the sketch after the following list):

  • You can specify, when creating a table, the time after which partitions are automatically archived.
  • You can also archive a partition with a DDL statement.
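
Assuming a day-partitioned table with an archive delay configured at creation time, the rule can be sketched as follows (illustrative only; the real check is enforced by the storage itself):

```java
import java.time.Duration;
import java.time.LocalDate;

// A sketch of the archiving rule: INSERT OVERWRITE is only accepted for
// partitions old enough to have been archived out of the Service layer.
public class ArchivedPartitionCheck {

    static boolean isArchived(LocalDate partitionDate, Duration archiveDelay, LocalDate today) {
        // A partition counts as archived once it is older than the configured delay.
        return partitionDate.isBefore(today.minusDays(archiveDelay.toDays()));
    }

    static void insertOverwrite(LocalDate partitionDate, Duration archiveDelay, LocalDate today) {
        if (!isArchived(partitionDate, archiveDelay, today)) {
            throw new IllegalStateException(
                    "Partition " + partitionDate + " is still served by the Service layer");
        }
        System.out.println("Overwriting archived partition " + partitionDate + " in lake storage");
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2022, 5, 10);
        Duration archiveDelay = Duration.ofDays(1); // hypothetically configured at table creation

        insertOverwrite(LocalDate.of(2022, 5, 8), archiveDelay, today); // allowed
        // insertOverwrite(LocalDate.of(2022, 5, 10), archiveDelay, today); // would be rejected
    }
}
```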

6.2 Short-Term Goals

6.2.1 Short-Term Architecture

The overall change brought by the Streaming Data Warehouse is huge. There are outstanding players in the fields of OLAP, queues, lake storage, stream computing, and batch computing, so it is impossible to produce a perfect solution in the short term.

However, we are moving forward. In Apache Flink Table Store, we have developed LSM-based lake storage and natively integrated Kafka as the log system.

[Figure 9: The short-term architecture of Flink Table Store]

Compared with the complete architecture in the preceding section, the short-term architecture does not include the Coordinator and Executors, which means:

  • It cannot provide real-time OLAP; file-based OLAP has a certain degree of latency.
  • There is no service-oriented data management capability.

We will start from the bottom and lay a solid foundation: first provide a complete unified abstraction, and then accelerate storage and provide real-time OLAP.

The current architecture provides two core values.

6.2.2 Value 1: Real-Time Intermediate Layer Can Be Queried

Table Store adds queryable storage to the layered Kafka storage of the original real-time data warehouse, so intermediate data can be queried.

Table Store still supports streaming real-time pipelines. It natively integrates a log system, supports Kafka integration, and weakens the distinction between streaming and batch: users only see the abstraction of a table.

However, it is important to note that data writing into storage should not affect the stability of Kafka.

6.2.3 Value 2: Acceleration of Offline Data Warehouses

Table Store accelerates data processing of offline data warehouses and provides incremental data updates while being compatible with Hive offline data warehouses.

Table Store provides a complete lake storage table format and quasi-real-time OLAP queries. The LSM structure is conducive to update performance and enables better data skipping, thus accelerating OLAP queries.

6.3 The Follow-Up Plan

The community is currently working to strengthen core functions, stabilize storage formats, and optimize the remaining parts to prepare Flink Table Store for production.

In the upcoming version 0.2.0, we hope to provide a Table Format with integrated streaming/batch processing capabilities and gradually improve the integrated streaming/batch lake storage. You can expect the following additional functions:

  • The Flink Table Store Reader will support the Apache Hive engine.
  • The number of buckets can be adjusted.
  • It will support Append Only data. Table Store will not be limited to scenarios about updated data.
  • A complete Schema Evolution and better metadata management.
  • Improvements based on feedback on the 0.1.0 preview.

In the medium term, you can also expect:

  • The Flink Table Store Reader will support Presto, Trino, and Apache Spark.
  • Flink Table Store Service will accelerate updates, improve query performance, and support millisecond-level Streaming Pipeline and strong OLAP capabilities.

Please try the 0.1.0 preview version and share your feedback on the Flink mailing list to help improve the project.

6.4 Project Information

The Apache Flink Table Store project [4] is under development, and the first version has been released. You are welcome to try it and give feedback.

References

[1] Data Warehouse (Wikipedia): https://en.wikipedia.org/wiki/Data_warehouse

[2] Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google: http://vldb.org/pvldb/vol14/p2986-sankaranarayanan.pdf

[3] Flink Next: Beyond Stream Processing: https://mp.weixin.qq.com/s/CxHzYGf2dg8amivPJzLTPQ

[4] Apache Flink Table Store: https://github.com/apache/flink-table-store
