How to implement a real-time data warehouse
Real-time data warehouses have been adopted by many companies, and many members of the AboutYun community are preparing to build one.
1. What are the characteristics of a real-time data warehouse?
• Real-time data arrival: data reaches the warehouse faster – streams of millions of events per second arrive continuously
• Ad-hoc queries: data can be queried as soon as it arrives, with no prior processing, aggregation, or compression required
• Faster queries: small, selective queries return in tens or hundreds of milliseconds; large scan-heavy or computation-heavy queries run at very high bandwidth
• Efficient data changes: when data needs to be corrected or updated for some reason, it can be changed in place without extensive rewriting
2. What are the benefits of building a real-time data warehouse for the company?
Users of a real-time data warehouse, such as operations staff, managers, or executives, can see monitoring data in real time, which is far more convenient:
Take food delivery as an example:
(1) After running a marketing campaign, what is its current effect? If it is poor, can it be remedied in time?
(2) When a new business line launches, real-time monitoring of user feedback shows whether users like it, so it can be adjusted at any time.
(3) If there is an anomaly in orders, merchants, or delivery, it can be detected and handled in real time.
(4) For users placing orders, real-time recommendations can be made according to their preferences.
Through the above, we can verify business ideas, detect business anomalies, and make recommendations based on user preferences in real time, rather than only recognizing problems or anomalies the next day. A real-time data warehouse lets a business run more efficiently. If the offline data warehouse supports the company's strategic decisions, then the real-time warehouse supports its tactical, day-to-day decisions.
3. How to build a real-time data warehouse:
In fact, if we don't know much about data warehouses, or have only built offline ones, a question may arise: are offline and real-time warehouses independent, or related? From an efficiency point of view, companies will not keep them completely separate; real-time data eventually flows into the offline data warehouse.
If this is unclear, some further explanation helps. Most real-time data warehouse architectures use Flink for stream processing and Kafka for storage. Kafka is generally used as a buffer, and its data has a retention period, so at a certain stage the real-time data warehouse can be designed so that data flows into the offline data warehouse.
If we really want to build a real-time data warehouse, the following questions may arise:
1. As a data warehouse, does Kafka need to be layered? How should it be layered?
Kafka layers are divided by topic, with each table corresponding to a topic, for example in the following form:
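As a minimal sketch (the layer prefixes and table names here are illustrative assumptions, not a fixed standard), the layer-to-topic mapping can be expressed as a simple naming convention:

```python
# Hypothetical naming convention: map a warehouse layer + table to a Kafka topic.
# Layer prefixes (ods/dwd/dws) are illustrative; adapt them to your own standard.

LAYERS = {"ods", "dwd", "dws"}

def topic_for(layer: str, table: str) -> str:
    """Build a topic name like 'ods_order_info' for layer 'ods', table 'order_info'."""
    layer = layer.lower()
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}_{table}"

print(topic_for("ods", "order_info"))   # ods_order_info
print(topic_for("dwd", "order_detail")) # dwd_order_detail
```

With a convention like this, every "table" in the real-time warehouse is simply a topic whose name encodes its layer.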
That is to say, through the mapping above, we use Kafka as the storage of a real-time data warehouse.
2. How to operate on topics
We know that a topic is really just a stream of messages. If we want to combine the messages inside, how do we do it? This is where Flink SQL comes in: Flink SQL reads a topic and then performs various operations on the data, such as joins.
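The actual Flink SQL depends on your table definitions, but the underlying idea of joining two topics on a key can be sketched in plain Python (all records and field names below are hypothetical):

```python
# Minimal Python sketch of the keyed join that Flink SQL performs when
# combining two topics, e.g. orders joined with payments on order_id.
# The data is invented for illustration.

orders = [
    {"order_id": 1, "user": "alice"},
    {"order_id": 2, "user": "bob"},
]
payments = [
    {"order_id": 1, "amount": 30.0},
    {"order_id": 2, "amount": 12.5},
]

def join_on_order_id(orders, payments):
    """Inner join: keep only orders that have a matching payment."""
    by_id = {p["order_id"]: p for p in payments}
    joined = []
    for o in orders:
        p = by_id.get(o["order_id"])
        if p is not None:
            joined.append({**o, "amount": p["amount"]})
    return joined

result = join_on_order_id(orders, payments)
print(result)
```

In a real pipeline this join is declared in Flink SQL over the Kafka-backed tables rather than computed by hand, but the semantics are the same.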
Above, we have addressed how a real-time data warehouse stores data and how to operate on it. So how do we build a data warehouse around our business?
In fact, once we understand the real-time data warehouse, there are various implementation methods and ideas. Generally speaking, the overall framework is divided into layers according to the flow of data. The access layer uses various ingestion tools to collect data from the business systems, such as tracked user-behavior data or business backend data, and puts it into a Kafka message queue. The data in the message queue serves both as the raw data for the offline data warehouse and as the raw data for real-time computation, which keeps the real-time and offline raw data unified.
With the source data in place, the computing layer does some processing with a real-time computing engine such as Flink, and the results land in different storage media in the storage layer, chosen according to the application scenario. The framework also involves interaction between Flink and Kafka: the data is given a layered design, and the computing engine retrieves data from Kafka, processes it, and puts it back into Kafka, where the data written back may belong to a different layer.
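A minimal in-memory sketch of this Kafka-to-Kafka loop, assuming illustrative topic names and a trivial cleaning step (a real pipeline would use Flink's Kafka source and sink connectors instead):

```python
# Hypothetical sketch of the Flink <-> Kafka loop described above: the engine
# consumes one layer's topic, processes records, and writes the result back to
# another layer's topic. Topics are simulated here as in-memory lists.

topics = {
    "ods_orders": [{"order_id": 101, "amount": "30.0 "}],  # raw, untrimmed value
    "dwd_orders": [],
}

def clean(record):
    # cleaning step: parse the raw amount string into a float
    return {"order_id": record["order_id"], "amount": float(record["amount"].strip())}

# consume from the ODS topic, process, and write back to the DWD topic
for rec in topics["ods_orders"]:
    topics["dwd_orders"].append(clean(rec))

print(topics["dwd_orders"][0])  # {'order_id': 101, 'amount': 30.0}
```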
Data processed in the storage layer then passes through two services in the service layer (DWS or DM): unified query and metric management. Unified query is the service through which the business side calls data interfaces, while metric management covers the definition and management of data metrics. The service layer feeds different data applications, which may be our own products or business systems directly.
If the layering and data warehouse concepts above are unclear, you can refer to the following:
Data Warehouse Explained: Concepts, Architecture, and Design
The Big Data Project series of articles, e.g. E-commerce Data Warehouse (User Behavior Data Collection) (1).
4. Analysis of technical architecture:
If the above is still vague, let's elaborate further. The difference between a real-time and an offline data warehouse is really one of time: the real-time warehouse is more timely in both collection and computation. So how is this achieved? For collection, real-time frameworks such as canal, maxwell, and EPX can be used; by comparison, offline collection may use plugins such as Sqoop, whose underlying engine is MapReduce. As for the computing framework, Hive can be used offline, while real-time pipelines currently mostly use Flink.
When we actually build a data warehouse, we may face the following problems:
1. The process is not clear
2. The technology selection is not clear
Let's look at several technical architectures, analyzed from a purely technical perspective, to help us choose the one most suitable for us.
Analysis of real-time architecture 1:
We can see that the User Log and Server Log enter Kafka through the log collection tool, and from there the data flows into Hive and Kafka respectively.
Both Hive and Kafka are layered; that is, Hive serves as the offline data warehouse and Kafka as the real-time data warehouse.
Hive layering: note that each layer has its own corresponding tables.
STG: It stores data integrated from heterogeneous source systems.
ODS: the layer closest to the source data. Data from the source is loaded into this layer after extraction, cleaning, and transfer, that is, after the proverbial ETL. Data at this layer is generally classified following the classification scheme of the source business system.
DW: the data warehouse layer. Here, data obtained from the ODS layer is organized into various data models by subject.
RPT: the report-facing layer, including summary tables (usable when the query dimensions are few) and the detail tables used in report queries.
DIM: the public dimension layer, which applies the concepts of dimensional modeling to establish consistent dimensions for the entire enterprise.
We can see that the DW and DIM layers are regularly synced to HBase/Redis.
Kafka real-time data warehouse:
We can see the layers ODS, DWD, and DWS:
DWD: Data Warehouse Detail
DWS: Data Warehouse Summary, the main content of the data platform. The data in these two layers is generated from ODS-layer data through ETL cleaning, transformation, and loading; most of it is widened into wide tables to facilitate analysis and statistics.
We can see that DWD and DWS data enters ClickHouse/Doris, which is what we refer to here as OLAP.
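As a rough illustration of the DWD-to-DWS step (the field names and records are invented for the example), the summary layer is essentially a keyed aggregation over detail rows, which then feeds an OLAP store such as ClickHouse or Doris:

```python
# Hypothetical sketch: rolling DWD detail rows up into a DWS summary per user.
# Records and field names are illustrative only.
from collections import defaultdict

dwd_orders = [  # cleaned detail rows
    {"user": "alice", "amount": 30.0},
    {"user": "alice", "amount": 20.0},
    {"user": "bob", "amount": 12.5},
]

def summarize(rows):
    """Aggregate detail rows into per-user order counts and totals."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for r in rows:
        t = totals[r["user"]]
        t["order_count"] += 1
        t["total_amount"] += r["amount"]
    return dict(totals)

dws = summarize(dwd_orders)
print(dws["alice"])  # {'order_count': 2, 'total_amount': 50.0}
```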
Analysis of real-time architecture 2:
This is OPPO's real-time data warehouse. It uses NiFi to collect logs, which then enter Kafka; the raw tables here should be the ODS layer. After Flink ETL cleaning and processing, the data flows back into Kafka as the DWD layer, where ad-hoc (OLAP) queries can be run. Summarizing Kafka's detail layer yields the DM (ADS) layer of the warehouse, which serves as the data source for report analysis, user profiling, and interface services.
For more about NiFi, you can refer to
Analysis of real-time architecture 3:
Looking at this technical architecture, it appears quite different from OPPO's diagram, but the content is basically the same, except that the DM layer is replaced by HBase. It is worth noting the DIM dimension layer here, which can be joined with data from the other layers.
Analysis of real-time architecture 4:
This architecture keeps the real-time and offline data warehouses separate: the real-time warehouse uses Flink, and the offline warehouse uses Hive/Spark.
Throughout the process, for log collection:
The real-time data warehouse uses Flume and Canal
The offline data warehouse uses Flume and Sqoop
Real-time data warehouse storage uses Kafka
Offline data warehouse storage uses Hive
Real-time data warehouse: ODS is joined with DIM to form a wide table, which enters Kafka as the DWD layer; DWD is then further processed into the DWS layer.
Offline data warehouse: data first lands in HDFS and then enters Hive. The process mirrors the real-time warehouse: tables are widened and then layered.
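The ODS-plus-DIM widening described above is essentially a left join of fact rows against a dimension table. A small Python sketch with hypothetical dimension and fact data:

```python
# Hypothetical sketch: widening ODS fact rows with DIM attributes (a left join)
# to form the DWD wide table. Dimension and fact data are invented.

dim_city = {1: {"city": "Beijing"}, 2: {"city": "Shanghai"}}

ods_orders = [
    {"order_id": 101, "city_id": 1, "amount": 30.0},
    {"order_id": 102, "city_id": 3, "amount": 12.5},  # no matching dim row
]

def widen(facts, dim):
    """Left join: every fact row is kept, enriched with dim attributes if present."""
    wide = []
    for f in facts:
        attrs = dim.get(f["city_id"], {"city": None})
        wide.append({**f, **attrs})
    return wide

dwd = widen(ods_orders, dim_city)
print(dwd[0]["city"])  # Beijing
print(dwd[1]["city"])  # None
```

A left join is used deliberately here: unlike an inner join, fact rows with missing dimension keys are not silently dropped from the wide table.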
Analysis of real-time architecture 5:
This is an integrated offline and real-time data warehouse. When building a real-time warehouse, many of our members may have a question: if the data lives in Kafka, what do we do when the data expires?
In fact, each company handles this according to its own situation; for example, Kafka's DWD-layer data can flow into the offline Hive warehouse for further processing.
In the overall process above, Flume and CDC collect the logs, which enter the real-time warehouse in Kafka. The ODS layer is then joined with the DIM layer into wide tables forming the DWD layer. The data processed by DWD flows both into the offline Hive warehouse and into Kafka's DWS layer; these two paths need no further elaboration. Finally, the DWS layer provides data for OLAP.
From the above we can see that whether the data warehouse is real-time or offline, the layers are similar. The number of layers depends on our actual situation: generally four to five, though JD.com uses nine. Layering serves many purposes; we believe its biggest value is that each layer can be reused, saving time and improving efficiency.
As for the technical architecture: if we understand all of the above, the choice of architecture is no longer a problem. Which one to pick depends on our scenario, the team's existing knowledge, and so on.
Knowledge Base Team