Data Warehouse Helps Internet Decision Making and Precision Marketing
Online business and refined operations promote real-time and interactive data warehouses
Let's take a look at some trends in the development of data analysis in the past few years, as shown in the figure above.
It can be seen that the basic trend of data analysis is to evolve from batch processing to interactive analysis and stream computing. Ten years ago, big data was more about dealing with scale issues, and processing massive amounts of data through parallel computing technology. At that time, we were more concerned with data cleaning and data model design, and the demand for analysis was not high. too much.
Today, our big data team has basically become a data analysis team. There will be more and more requirements for the precipitation of data models, the ability to support interactive analysis, the efficiency of support for query response delays, and QPS. Data analysis is not just about storing data and then analyzing it, but there are also many pre-calculation scenarios, where there is logic first and then calculation. For example, during Double 11, how many trading volumes are there in how many seconds? In such a typical scenario, it is not a matter of transaction data first and then calculation volume. It must be a process of real-time calculation of results along with transactions.
Therefore, interactive analysis and stream computing are almost a parallel process, and these two new scenarios have many different requirements for the underlying technology. Batch processing pays more attention to parallelism. In the field of interactive analysis, we have started to have a lot of technologies such as precomputing, memory computing, and indexing, so this also promotes the evolution of the entire big data technology.
To sum up, data analysis supports more and more online businesses. Online businesses include products recommended and advertisements displayed on the screen when we turn on the mobile phone at any time. These all need to return results within a few milliseconds, relying on data Smart recommendation, if not recommended, the click rate and conversion rate must be very low.
Therefore, our data business is supporting more and more online businesses. Online businesses have very high requirements on query delay, QPS, and precision, which also promotes the evolution of data warehouses to real-time and interactive directions.
Alibaba's typical real-time data warehouse scenario
There are many usage scenarios of data warehouses in Alibaba, such as the real-time GMV large screen of Double 11. GMV is just a conclusive figure. In fact, for data analysts, this work is just the beginning. We need to analyze downwards, what kind of products, what channels, what kind of people are targeted at, what kind of promotion methods are used to achieve such conversion effects, and what conversion effects have not been achieved, etc. A series of analysis. These analyzes are actually very fine-grained, which is the result of refined analysis.
After the analysis, we will do some tagging on all products and people. Through tagging, we can guide online applications to make recommendations, analysis, circle selection, etc., so there will be a lot of business in the middle of the data. .
There is also a type of business that is partial to monitoring. Orders suddenly fluctuate and increase, network quality fluctuates, some monitoring of live video, etc. These are also typical application scenarios of real-time data warehouses.
"Complexity" of big data real-time data warehouse system
In the past, when we built real-time data warehouses, we referred to many companies, and Alibaba also took a very complicated route.
The above is the architecture diagram I drew. I was very excited when I saw it for the first time. At that time, I was a big data architect. Being able to draw so many arrows is a very skillful thing. But when I actually built such a system, I found that the development efficiency and operation and maintenance efficiency of this system are very maddening.
This system evolved from the upper left corner. Messages were collected from message middleware, followed by an offline processing process. At that time, we did not have many real-time technologies, so we had to solve the problem of scale first. Through offline processing, we will convert the processed result set into a small result set and store it in MySQL and Redis. This is a typical dual processing service system.
After turning the data into small data, the upper-level applications can be provided externally, including report applications, recommended applications, etc. Afterwards, data analysis needs to be done in real time, and this kind of "T+1" offline processing alone cannot meet the demand. The more real-time the data, the more contextual it is, and the greater the value. At this time, many pre-computing technologies have been adopted, such as Flink. Consume events in Kafka directly through Flink, and do calculations in an event-driven manner. Since Flink is distributed, it has very good scalability. It can pre-calculate and event-driven to reduce end-to-end delays. Do it to the extreme.
Then we will also store the result set in a medium. The result set processed by Flink has a very streamlined structure, generally based on the kv6 structure, and placed in systems such as HBase and Cassandra. Such systems provide Large screen reports are the fastest. For example, the large screen of Double 11 must not wait for tens of millions or hundreds of millions of records before making statistics, otherwise the query performance will definitely not be satisfactory. Therefore, we will process the original data set at the granularity of each channel per second at the beginning. When the original data set is analyzed on a large screen, it can change from a large data set to a small data set. , the performance reaches the extreme.
Now we see that there is processing scale and processing speed, both of which seem to meet certain requirements on the surface, but in fact the cost is not small. If you want to calculate fast enough, you need to pre-calculate, so the flexibility of the data model will be reduced, because we have aggregated all the data into a result set through Flink.
For example, if some business logic is not defined at the beginning, such as the aggregation of a certain three dimensions at the beginning, if you want to analyze the aggregation of the fourth dimension later, it cannot be analyzed because it has not been calculated in advance, so sacrifice here flexibility.
At this time, some technologies that are more flexible than HBase and better in real-time than Hive will be adopted, such as ClickHouse and Druid. Data can be written in real time, and after writing, it also provides certain interactive analysis and self-service analysis capabilities.
We will find that data processing disperses the same data in three different media, including offline processing media, near real-time processing media and full real-time processing media. The three media often require three teams to maintain. The three teams change personnel over time, and the data processing logic must be more or less adjusted. As a result, we will find that the same indicator passes through three different channels. The resulting calculation results are inconsistent, and this happens to almost every company.
This is only a superficial problem, and it is more painful for analysts. He needs to use different data, access different systems, learn different interfaces, and even have different access control mechanisms, which is very uncomfortable for analysts. convenient. Therefore, many companies have to set up a so-called data middle platform to shield the differences in the underlying physical engines. In this case, federated computing technologies such as Presto and Drill will be used.
Federal computing technology has a development history of more than 20 years. From the earliest data information system integration to data virtualization, the technology has been developing. This technology is a technology that does not move data but moves computing. It sounds beautiful, but when it is actually used in production, you will find that the query experience of this system is unpredictable. We do not know how to query data through the system. Whether it is fast or slow, because it pushes the query to different engines. If it is pushed to Hive, it is not so fast. If it is pushed to ClickHouse, it may be faster. So for an analyst, he doesn't know what the expectations of the system are, which is called unpredictability. For example, it may have taken 5 seconds to open the report before, and it may have taken 5 minutes another time. This situation will make the analyst wonder whether it is normal or abnormal by the time of 5 minutes. If it runs on Hive, it is called normal, if it runs on Drill, it is called abnormal. The unpredictability also makes it difficult for this system to enter the production field, so at this time we have to do a collection of data and turn it into a small result set for analysis.
We have seen that the entire link is actually composed of multiple components nested layer by layer and layer by layer dependency. In addition to the inconsistency of data caliber caused by different team maintenance just mentioned, the cost of data maintenance will also become very high.
We often encounter situations where we see that a certain number on the report may be incorrect. For example, on a certain day, this indicator suddenly increases or declines a lot. At this time, no one can confirm whether there is a problem with the data quality or the processing logic. , or the data synchronization link is wrong, because the intermediate link is too long. If the data source modifies a field and fills in a new data, then every link must be re-run, so if this architecture is running normally, there is no problem, but once there is a problem with data quality or data scheduling, Huanhuan relies on , which makes the entire operation and maintenance cost very high.
At the same time, it is difficult and expensive to find talents who understand so many technologies. What often happens is that a company has worked hard to cultivate such talents, and then they are poached by other big factories. Such talents are very difficult from recruitment to training.
The above is the status quo of this structure.
The complexity of big data today is reminiscent of the 1960s
The above architecture reminds me of the 1960s. At that time, the database was basically not born, and the relational database was born in the late 1970s.
How did we manage the state of data in the 1960s?
In the 60s, every programmer had to write their own state maintenance. Some people use the file system, but the file system alone will be very discrete, and it is very difficult to maintain, because there is a relationship between data. At this time, there are still some network-like systems that can jump to another file through one file, but the management of the network structure is relatively complicated because of loops, nesting, and so on. In addition, there is a hierarchical structure, which is also a way of describing data types.
So it can be seen that the programmers in the 1960s managed the data state by themselves, but the cost was actually very high.
In the 1980s, we basically stopped doing this, because we knew that all states should be stored in the database as much as possible, and it was also because relational databases made this a lot easier. Although we have many kinds of relational databases, most of them are based on SQL interface, which makes our entire data storage, query, management and other costs drop sharply.
This matter has undergone many changes in the past 20 years. From the era of big data to NoSQL to NewSQL, various technologies have been born, such as Hadoop, Spark, Redis, etc., which make the data field flourish.
But I always feel that there are some problems hidden in this place, which may not necessarily be the case in the future. At present, the development of big data is vigorous but not uniform, so I wonder if in the future, big data technology can be relatively gathered, and a more descriptive way can be used to solve problems, instead of letting every programmer learn different components, learn dozens of different interfaces, and configure thousands of parameters. In order to improve development efficiency, it may be necessary for us to further simplify these technologies in the future.
The core requirement of real-time data warehouse: timeliness
Let's look back and see what business requirements real-time data warehouses have without these technical components, and what kind of data architecture can be required for these business requirements.
What is real time? Many people think that real-time analysis means that the time from data generation to analysis is short enough, but this is not entirely accurate. I think real-time data warehouses are divided into two types of real-time, one is short end-to-end delay, and the other real-time can also be called on-time, that is, when we actually analyze the data, we can get valid data and draw conclusions. In conclusion, this can be called punctuality.
The first kind of real-time is a more technical concept, but does our business necessarily need such a low delay?
Usually, business is not making decisions every second. When we open the report and look at the data, the data we care about at this moment may be the past day, the last hour, 5 minutes ago, or a day seconds ago. But experience tells us that if 99% of the data analysis can be delayed by 5 minutes, it can basically meet the data analysis needs of the business. This is the case in the analysis scenario. If it is an online service scenario, this is definitely not enough.
So most companies may not require such real-time, because the costs behind it are different.
Because once the end-to-end delay is required, a lot of pre-computing business logic must be required, and we cannot wait for the data to be stored before querying, because the delay is required to be very short, and we must aggregate, process, and process a lot of data in advance. Only by widening can the amount of calculation during data analysis be small enough, and the delay can be short enough.
So if you are pursuing end-to-end delay, you have to do a lot of pre-calculation work. As we mentioned just now, if all the calculations are pre-calculated, the flexibility will be lost, so this matter has a cost. Relatively speaking, if you are pursuing a punctual system, you can rearrange part of the calculation logic in exchange for a flexible analysis scenario.
For example, during Double 11, it was only for the pursuit of delay, so we would only save the last GMV. This is a scenario, and the matter is over. But this matter does not meet the company's requirements. The company must issue a detailed report and need to know when it is selling well and when it is not selling well. Therefore, in this case, it is definitely not suitable to pre-calculate all in advance. We need to save more detailed data, and we can do more interactive analysis, correlation analysis, and exploratory analysis. Therefore, the two The requirements behind each system are different.
Relatively speaking, I think what most companies want is a just-in-time system. It needs to have the ability to calculate post-processing, real-time writing, and the ability to analyze after writing, even if the analysis efficiency is not so high. Have the ability to analyze flexibly.
Real-time data warehouse core requirements: data quality
Data quality is a very important part in the construction of real-time data warehouses. As mentioned earlier, if you do not pursue data quality but only pursue timeliness, a result set is pre-processed through calculation at the beginning. This result set tells us that the GMV has reached 100 100 million, most bosses can’t believe it, because behind the 10 billion may be data quality problems, or the calculation logic may be wrongly written, so the system must be able to guarantee data quality.
Data quality is divided into two aspects, one is how often a quality problem is discovered, and the other is how often a quality problem is corrected. These two solutions are not the same.
If you want to find data quality problems, you need to make the state of the calculation process persistent, and you hope that the data warehouse engine can have detailed and summarized data that can be placed on the disk, and these data can be checked. In this way, when the boss asks why the indicator has increased so much and which customer brought the increase, you can check the reason through these detailed data. If you find a mistake when analyzing the data, can you correct it? This is also a key question. .
Some systems can only be viewed but not changed. This is a common problem in many big data systems. Big data systems are very good at dealing with large-scale problems, but they are particularly difficult to deal with small problems, such as correcting data. Because it requires a large data block every time it is corrected, or the entire file is replaced without a primary key, so it is really difficult to update.
Compared with finding problems and correcting them, I hope that a system can have the ability to correct data problems.
The problem of correction is that when the data quality is found, the status of the data can be easily updated, such as the update of single-row data, single-column data update, batch update, etc. There is a very simple way to refresh the data. The status of data refresh often happens, for example, there is a problem with the quality of upstream data, and the processing logic is wrongly written, etc., all of which require data refresh.
Secondly, I also hope that when data is corrected, only one system can be corrected as much as possible.
We saw just now that after the data source of a piece of data is wrong, it has to be circulated repeatedly in the back-end 4~5 links, which means that once the data is wrong, data correction must be done in 4~5 links Work, there is a lot of redundancy and inconsistency in the data. If it is to be corrected, every link must be corrected, which is also a particularly complicated matter. Therefore, we hope that the state of the data warehouse can be stored in only one place as much as possible, so that I can only modify one place.
Real-time data warehouse core requirements: cost optimization
The cost is divided into three parts, namely development cost, operation and maintenance cost and labor cost.
The development cost indicates how soon we want the business requirements to go live. Do the same thing require a team or a person, a week or a day, so the development cost is a matter of great concern to many companies. If it cannot be developed, it means that the needs of many businesses are suppressed or Inhibition.
We hope that IT students don't need to be exhausted to respond to the business team's request for data retrieval. What often happens is that after the data is actually processed, the business team reports that the data processing is incorrect. After the IT students finally correct it, the business team says that the activity is over and the data they want to see is meaningless.
Therefore, a good data warehouse system must be decoupled from technology and business. The technical team guarantees the stable and reliable operation of the data platform, while the data retrieval of the business team should be self-service as much as possible, and generate interesting reports by dragging and dropping. This is We think good system. Only in this way can the threshold of development be lowered, so that more people can access data by themselves, realize the reusability of data assets, and self-service business development.
At the same time, in order to speed up development efficiency, the link must be short enough.
Just like the architecture diagram we saw at the beginning, if there are four or five links in the middle, any link must be configured and scheduled, and any link must be monitored after an error occurs, then the development efficiency must be very heavy. If the development link is short enough, the transmission of data is reduced, and the development efficiency will be much higher. Therefore, we need to improve development efficiency, hoping to achieve decoupling of technology and business, and to achieve a sufficiently short data link.
Simple translation of operation and maintenance costs means that in the past, there were too many clusters and too much money was spent.
We need to open four or five sets of clusters and repeatedly schedule and monitor them. Therefore, if there is an opportunity to re-select a new data warehouse in the future, you should consider how to reduce costs, use fewer clusters, and use less operation and maintenance to provide more capabilities, including some elastic capabilities. At the end of the month or when there is a demand for the amount of calculation and analysis required for promotional activities, the company can do some flexible expansion and contraction to adapt to different calculation load changes. This is also a very important capability that can save the company's operation and maintenance costs.
The third cost is human cost, including recruitment cost and learning cost.
Big data is a relatively complex system. Those who have done Hadoop technology should know that thousands of parameters and more than a dozen components depend on each other. Any node down may cause various problems in other systems.
In fact, learning and operation and maintenance costs are relatively high. The example of the database just mentioned is a good example. To reduce the cost of development, you can use a descriptive language, that is, SQL, which can drastically reduce the entire development threshold. Most of the students have already studied database courses during their undergraduate years, so using SQL is more efficient than those who need to learn API and SDK. In this way, it is easier to find talents in the market, and it also allows the company to shift more energy from bottom-level platform operation and maintenance to high-level data value mining.
It is more convenient to interface with other systems through standard SQL, whether it is a development tool or a BI display tool.
The above are some technical requirements derived from business requirements.
The first generation of real-time data warehouse: the database stage
Next, let's take a look at whether some past real-time data warehouse development technologies can meet the above requirements.
The first generation of real-time data warehouse technology is called the database stage, which is a typical Lamda stage.
This architecture basically has a business requirement, and there is a set of data links, which are divided into real-time and offline parts. The data is collected by the message middleware, and part of it is processed in real time, and the result set is processed and stored in MySQL/HBase. The other part goes offline through Hive/Flink, processes the result set, and saves it in MySQL/HBase, which converts big data into small data and then provides services to the Internet.
Many people have already analyzed the problem of this architecture. The two architectures have data redundancy with each other, resulting in data inconsistency. This is relatively straightforward, but the more important issue here is the chimney-style development.
When the business side puts forward a new requirement, the programmer has to find out which data source the data comes from, which third-party data sources it should associate with, what kind of aggregation and processing to do, and then generate a result set, and finally We will find that this result set or hundreds of reports are generated, and 80% of them have great redundancy with each other. Some business departments look at these three indicators, and another department looks at two of them, and maybe only one field is changed in the middle. This is a common situation. The original data is the same, but you have one more statistical field and I have one less, but we have to re-develop end-to-end, which seriously reduces development efficiency. Operation and maintenance is also very difficult. Thousands of reports have been developed. We don't know whether these reports are being used by anyone, and we dare not go offline easily.
In addition, once the fields of any data source increase or decrease, adjustment, operation and maintenance, and modification are required, which is almost a disaster. If it is developed by one person, it is not a big problem. We have seen many development teams. Four or five students frequently write scripts every day. Then some of these personnel leave and some join. In the end, no one dares to delete the code of the old colleagues, and finally becomes Scheduling thousands of scripts, checking and correcting errors every day, it may be the fault of the script, or the fault of the data quality, which makes the operation and maintenance cost very high.
Therefore, this kind of chimney development is quick to get started, but it is unsustainable in actual operation and maintenance. The way we solve the chimney problem is to settle the shared part, which enters the second stage of the real-time data warehouse: the traditional data warehouse stage.
The second generation of real-time data warehouse: the stage of traditional data warehouse
Data warehouse is a very good concept, which is to precipitate those calculation indicators that can be reused. Therefore, there are three layers in the data warehouse, DWD, DWS, and ADS. Through layer-by-layer precipitation, the shared parts are lowered and the different parts are moved up to reduce the problem of repeated construction. This is also a set of basic methodology accumulated by the data warehouse after decades.
In this way, Kafka drives Flink, and Flink performs some dimension table associations during the calculation process. Dimension table association is basically associating Flink with a KeyValue system to widen the dimension table. After the widening, the result will be rewritten into another Kafka topic, and then secondary aggregation and summary will be performed to generate some DWS. Or ADS, and finally store the results in the OLAP/HBase system.
Why is part of the result storage in this place an OLAP system and part of it an HBase system?
The OLAP system is a very good table structure for data warehouse layering, but this type of system cannot support online applications. Online applications require QPS of tens of thousands of queries per second. The query mode is relatively simple, but the requirements for QPS are very high. It is difficult for most data warehouse systems to meet this requirement. Therefore, at this time, the system has to be stored in HBase to provide millisecond-level ability to respond.
This system solves the previous chimney problem, but we see that data redundancy still exists in the OLAP/HBase system. The same piece of data describes the same piece of logic, and there is still redundancy in the data warehouse and KeyValue system. The company's business team will choose according to the difference in SLA. If it is very sensitive to delay, put the result in HBase. If it is not sensitive to delay but requires query flexibility, it will put it in OLAP.
On the other hand, the development of the HBase system is still inconvenient, because it is a KeyValue system with relatively simple results. All queries need to be accessed based on the Key, and the Value part does not have a Scheme. Once a business unit finds that there is a problem with data quality, there is no way to simply check the value of a certain row or column, and it cannot coordinate and update it at any time. This is some limitation of the Schema Free system, and metadata management is relatively difficult. convenient.
The third generation of real-time data warehouse: analysis service integration stage
The third stage of the real-time data warehouse is the analysis service integration stage. This stage has been realized within Ali, and most external companies are also exploring the road.
There are two differences between this stage and the previous stage. On the one hand, the unification of the server, whether it is an OLAP system or an inventory system, can be unified through one system to reduce data fragmentation, reduce interface inconsistencies, and reduce a copy of data in different systems. Passing back and forth between them achieves the effect of unified storage, which makes it easier to check and correct our data status. The interfaces are unified into one system, the interfaces of security, access, control, and application development can be unified, and some optimizations have been made on the server side.
On the other hand, the data processing link has also been optimized to a certain extent, and there is no Kafka in it. In fact, whether there is Kafka is an option, because some systems have certain event-driven capabilities, such as Hologres, which has built-in Binlog event-driven capabilities, so Hologres can be used as a Kafka. Through Binlog and Flink, the real-time aggregation of DWD, DWS, and ADS is realized, which is also a very good capability.
With such an architecture, only Flink and Hologres are left as components, and there are no other external systems in between. They can still be driven through real-time links. The data is not split, and all data is stored in the database.
The key benefit is that the development link is shortened, which reduces the cost of error adjustment. Secondly, the detailed state of the data is stored, because it is difficult for HBase to store the detailed state. If the details are stored in the service system, the cost of data error checking and repairing becomes very low. In addition, there are fewer components, and the corresponding operation and maintenance costs are also reduced.
Let's take a look at some trends in the development of data analysis in the past few years, as shown in the figure above.
It can be seen that the basic trend of data analysis is to evolve from batch processing to interactive analysis and stream computing. Ten years ago, big data was more about dealing with scale issues, and processing massive amounts of data through parallel computing technology. At that time, we were more concerned with data cleaning and data model design, and the demand for analysis was not high. too much.
Today, our big data team has basically become a data analysis team. There will be more and more requirements for the precipitation of data models, the ability to support interactive analysis, the efficiency of support for query response delays, and QPS. Data analysis is not just about storing data and then analyzing it, but there are also many pre-calculation scenarios, where there is logic first and then calculation. For example, during Double 11, how many trading volumes are there in how many seconds? In such a typical scenario, it is not a matter of transaction data first and then calculation volume. It must be a process of real-time calculation of results along with transactions.
Therefore, interactive analysis and stream computing are almost a parallel process, and these two new scenarios have many different requirements for the underlying technology. Batch processing pays more attention to parallelism. In the field of interactive analysis, we have started to have a lot of technologies such as precomputing, memory computing, and indexing, so this also promotes the evolution of the entire big data technology.
To sum up, data analysis supports more and more online businesses. Online businesses include products recommended and advertisements displayed on the screen when we turn on the mobile phone at any time. These all need to return results within a few milliseconds, relying on data Smart recommendation, if not recommended, the click rate and conversion rate must be very low.
Therefore, our data business is supporting more and more online businesses. Online businesses have very high requirements on query delay, QPS, and precision, which also promotes the evolution of data warehouses to real-time and interactive directions.
Alibaba's typical real-time data warehouse scenario
There are many usage scenarios of data warehouses in Alibaba, such as the real-time GMV large screen of Double 11. GMV is just a conclusive figure. In fact, for data analysts, this work is just the beginning. We need to analyze downwards, what kind of products, what channels, what kind of people are targeted at, what kind of promotion methods are used to achieve such conversion effects, and what conversion effects have not been achieved, etc. A series of analysis. These analyzes are actually very fine-grained, which is the result of refined analysis.
After the analysis, we will do some tagging on all products and people. Through tagging, we can guide online applications to make recommendations, analysis, circle selection, etc., so there will be a lot of business in the middle of the data. .
There is also a type of business that is partial to monitoring. Orders suddenly fluctuate and increase, network quality fluctuates, some monitoring of live video, etc. These are also typical application scenarios of real-time data warehouses.
"Complexity" of big data real-time data warehouse system
In the past, when we built real-time data warehouses, we referred to many companies, and Alibaba also took a very complicated route.
The above is the architecture diagram I drew. I was very excited when I saw it for the first time. At that time, I was a big data architect. Being able to draw so many arrows is a very skillful thing. But when I actually built such a system, I found that the development efficiency and operation and maintenance efficiency of this system are very maddening.
This system evolved from the upper left corner. Messages were collected from message middleware, followed by an offline processing process. At that time, we did not have many real-time technologies, so we had to solve the problem of scale first. Through offline processing, we will convert the processed result set into a small result set and store it in MySQL and Redis. This is a typical dual processing service system.
After turning the data into small data, the upper-level applications can be provided externally, including report applications, recommended applications, etc. Afterwards, data analysis needs to be done in real time, and this kind of "T+1" offline processing alone cannot meet the demand. The more real-time the data, the more contextual it is, and the greater the value. At this time, many pre-computing technologies have been adopted, such as Flink. Consume events in Kafka directly through Flink, and do calculations in an event-driven manner. Since Flink is distributed, it has very good scalability. It can pre-calculate and event-driven to reduce end-to-end delays. Do it to the extreme.
Then we will also store the result set in a medium. The result set processed by Flink has a very streamlined structure, generally based on the kv6 structure, and placed in systems such as HBase and Cassandra. Such systems provide Large screen reports are the fastest. For example, the large screen of Double 11 must not wait for tens of millions or hundreds of millions of records before making statistics, otherwise the query performance will definitely not be satisfactory. Therefore, we will process the original data set at the granularity of each channel per second at the beginning. When the original data set is analyzed on a large screen, it can change from a large data set to a small data set. , the performance reaches the extreme.
Now we see that there is processing scale and processing speed, both of which seem to meet certain requirements on the surface, but in fact the cost is not small. If you want to calculate fast enough, you need to pre-calculate, so the flexibility of the data model will be reduced, because we have aggregated all the data into a result set through Flink.
For example, if some business logic is not defined at the beginning, such as the aggregation of a certain three dimensions at the beginning, if you want to analyze the aggregation of the fourth dimension later, it cannot be analyzed because it has not been calculated in advance, so sacrifice here flexibility.
At this time, some technologies that are more flexible than HBase and better in real-time than Hive will be adopted, such as ClickHouse and Druid. Data can be written in real time, and after writing, it also provides certain interactive analysis and self-service analysis capabilities.
We will find that data processing disperses the same data in three different media, including offline processing media, near real-time processing media and full real-time processing media. The three media often require three teams to maintain. The three teams change personnel over time, and the data processing logic must be more or less adjusted. As a result, we will find that the same indicator passes through three different channels. The resulting calculation results are inconsistent, and this happens to almost every company.
This is only a superficial problem, and it is more painful for analysts. He needs to use different data, access different systems, learn different interfaces, and even have different access control mechanisms, which is very uncomfortable for analysts. convenient. Therefore, many companies have to set up a so-called data middle platform to shield the differences in the underlying physical engines. In this case, federated computing technologies such as Presto and Drill will be used.
Federal computing technology has a development history of more than 20 years. From the earliest data information system integration to data virtualization, the technology has been developing. This technology is a technology that does not move data but moves computing. It sounds beautiful, but when it is actually used in production, you will find that the query experience of this system is unpredictable. We do not know how to query data through the system. Whether it is fast or slow, because it pushes the query to different engines. If it is pushed to Hive, it is not so fast. If it is pushed to ClickHouse, it may be faster. So for an analyst, he doesn't know what the expectations of the system are, which is called unpredictability. For example, it may have taken 5 seconds to open the report before, and it may have taken 5 minutes another time. This situation will make the analyst wonder whether it is normal or abnormal by the time of 5 minutes. If it runs on Hive, it is called normal, if it runs on Drill, it is called abnormal. The unpredictability also makes it difficult for this system to enter the production field, so at this time we have to do a collection of data and turn it into a small result set for analysis.
We have seen that the entire link is actually composed of multiple components nested layer by layer and layer by layer dependency. In addition to the inconsistency of data caliber caused by different team maintenance just mentioned, the cost of data maintenance will also become very high.
We often encounter situations where we see that a certain number on the report may be incorrect. For example, on a certain day, this indicator suddenly increases or declines a lot. At this time, no one can confirm whether there is a problem with the data quality or the processing logic. , or the data synchronization link is wrong, because the intermediate link is too long. If the data source modifies a field and fills in a new data, then every link must be re-run, so if this architecture is running normally, there is no problem, but once there is a problem with data quality or data scheduling, Huanhuan relies on , which makes the entire operation and maintenance cost very high.
At the same time, it is difficult and expensive to find talents who understand so many technologies. What often happens is that a company has worked hard to cultivate such talents, and then they are poached by other big factories. Such talents are very difficult from recruitment to training.
The above is the status quo of this structure.
The complexity of big data today is reminiscent of the 1960s
The above architecture reminds me of the 1960s. At that time, the database was basically not born, and the relational database was born in the late 1970s.
How did we manage the state of data in the 1960s?
In the 60s, every programmer had to write their own state maintenance. Some people use the file system, but the file system alone will be very discrete, and it is very difficult to maintain, because there is a relationship between data. At this time, there are still some network-like systems that can jump to another file through one file, but the management of the network structure is relatively complicated because of loops, nesting, and so on. In addition, there is a hierarchical structure, which is also a way of describing data types.
So it can be seen that the programmers in the 1960s managed the data state by themselves, but the cost was actually very high.
In the 1980s, we basically stopped doing this, because we knew that all states should be stored in the database as much as possible, and it was also because relational databases made this a lot easier. Although we have many kinds of relational databases, most of them are based on SQL interface, which makes our entire data storage, query, management and other costs drop sharply.
This matter has undergone many changes in the past 20 years. From the era of big data to NoSQL to NewSQL, various technologies have been born, such as Hadoop, Spark, Redis, etc., which make the data field flourish.
But I always feel that there are some problems hidden in this place, which may not necessarily be the case in the future. At present, the development of big data is vigorous but not uniform, so I wonder if in the future, big data technology can be relatively gathered, and a more descriptive way can be used to solve problems, instead of letting every programmer learn different components, learn dozens of different interfaces, and configure thousands of parameters. In order to improve development efficiency, it may be necessary for us to further simplify these technologies in the future.
The core requirement of real-time data warehouse: timeliness
Let's look back and see what business requirements real-time data warehouses have without these technical components, and what kind of data architecture can be required for these business requirements.
What is real time? Many people think that real-time analysis means that the time from data generation to analysis is short enough, but this is not entirely accurate. I think real-time data warehouses are divided into two types of real-time, one is short end-to-end delay, and the other real-time can also be called on-time, that is, when we actually analyze the data, we can get valid data and draw conclusions. In conclusion, this can be called punctuality.
The first kind of real-time is a more technical concept, but does our business necessarily need such a low delay?
Usually, business is not making decisions every second. When we open the report and look at the data, the data we care about at this moment may be the past day, the last hour, 5 minutes ago, or a day seconds ago. But experience tells us that if 99% of the data analysis can be delayed by 5 minutes, it can basically meet the data analysis needs of the business. This is the case in the analysis scenario. If it is an online service scenario, this is definitely not enough.
So most companies may not require such real-time, because the costs behind it are different.
Because once the end-to-end delay is required, a lot of pre-computing business logic must be required, and we cannot wait for the data to be stored before querying, because the delay is required to be very short, and we must aggregate, process, and process a lot of data in advance. Only by widening can the amount of calculation during data analysis be small enough, and the delay can be short enough.
So if you are pursuing end-to-end delay, you have to do a lot of pre-calculation work. As we mentioned just now, if all the calculations are pre-calculated, the flexibility will be lost, so this matter has a cost. Relatively speaking, if you are pursuing a punctual system, you can rearrange part of the calculation logic in exchange for a flexible analysis scenario.
For example, during Double 11, it was only for the pursuit of delay, so we would only save the last GMV. This is a scenario, and the matter is over. But this matter does not meet the company's requirements. The company must issue a detailed report and need to know when it is selling well and when it is not selling well. Therefore, in this case, it is definitely not suitable to pre-calculate all in advance. We need to save more detailed data, and we can do more interactive analysis, correlation analysis, and exploratory analysis. Therefore, the two The requirements behind each system are different.
Relatively speaking, I think what most companies want is a just-in-time system. It needs to have the ability to calculate post-processing, real-time writing, and the ability to analyze after writing, even if the analysis efficiency is not so high. Have the ability to analyze flexibly.
Real-time data warehouse core requirements: data quality
Data quality is a very important part in the construction of real-time data warehouses. As mentioned earlier, if you do not pursue data quality but only pursue timeliness, a result set is pre-processed through calculation at the beginning. This result set tells us that the GMV has reached 100 100 million, most bosses can’t believe it, because behind the 10 billion may be data quality problems, or the calculation logic may be wrongly written, so the system must be able to guarantee data quality.
Data quality is divided into two aspects, one is how often a quality problem is discovered, and the other is how often a quality problem is corrected. These two solutions are not the same.
If you want to find data quality problems, you need to make the state of the calculation process persistent, and you hope that the data warehouse engine can have detailed and summarized data that can be placed on the disk, and these data can be checked. In this way, when the boss asks why the indicator has increased so much and which customer brought the increase, you can check the reason through these detailed data. If you find a mistake when analyzing the data, can you correct it? This is also a key question. .
Some systems can only be viewed but not changed. This is a common problem in many big data systems. Big data systems are very good at dealing with large-scale problems, but they are particularly difficult to deal with small problems, such as correcting data. Because it requires a large data block every time it is corrected, or the entire file is replaced without a primary key, so it is really difficult to update.
Compared with finding problems and correcting them, I hope that a system can have the ability to correct data problems.
The problem of correction is that when the data quality is found, the status of the data can be easily updated, such as the update of single-row data, single-column data update, batch update, etc. There is a very simple way to refresh the data. The status of data refresh often happens, for example, there is a problem with the quality of upstream data, and the processing logic is wrongly written, etc., all of which require data refresh.
Secondly, I also hope that when data is corrected, only one system can be corrected as much as possible.
We saw just now that after the data source of a piece of data is wrong, it has to be circulated repeatedly in the back-end 4~5 links, which means that once the data is wrong, data correction must be done in 4~5 links Work, there is a lot of redundancy and inconsistency in the data. If it is to be corrected, every link must be corrected, which is also a particularly complicated matter. Therefore, we hope that the state of the data warehouse can be stored in only one place as much as possible, so that I can only modify one place.
Real-time data warehouse core requirements: cost optimization
The cost is divided into three parts, namely development cost, operation and maintenance cost and labor cost.
The development cost indicates how soon we want the business requirements to go live. Do the same thing require a team or a person, a week or a day, so the development cost is a matter of great concern to many companies. If it cannot be developed, it means that the needs of many businesses are suppressed or Inhibition.
We hope that IT students don't need to be exhausted to respond to the business team's request for data retrieval. What often happens is that after the data is actually processed, the business team reports that the data processing is incorrect. After the IT students finally correct it, the business team says that the activity is over and the data they want to see is meaningless.
Therefore, a good data warehouse system must be decoupled from technology and business. The technical team guarantees the stable and reliable operation of the data platform, while the data retrieval of the business team should be self-service as much as possible, and generate interesting reports by dragging and dropping. This is We think good system. Only in this way can the threshold of development be lowered, so that more people can access data by themselves, realize the reusability of data assets, and self-service business development.
At the same time, in order to speed up development efficiency, the link must be short enough.
Just like the architecture diagram we saw at the beginning, if there are four or five links in the middle, any link must be configured and scheduled, and any link must be monitored after an error occurs, then the development efficiency must be very heavy. If the development link is short enough, the transmission of data is reduced, and the development efficiency will be much higher. Therefore, we need to improve development efficiency, hoping to achieve decoupling of technology and business, and to achieve a sufficiently short data link.
Simple translation of operation and maintenance costs means that in the past, there were too many clusters and too much money was spent.
We need to open four or five sets of clusters and repeatedly schedule and monitor them. Therefore, if there is an opportunity to re-select a new data warehouse in the future, you should consider how to reduce costs, use fewer clusters, and use less operation and maintenance to provide more capabilities, including some elastic capabilities. At the end of the month or when there is a demand for the amount of calculation and analysis required for promotional activities, the company can do some flexible expansion and contraction to adapt to different calculation load changes. This is also a very important capability that can save the company's operation and maintenance costs.
The third cost is human cost, including recruitment cost and learning cost.
Big data is a relatively complex system. Those who have done Hadoop technology should know that thousands of parameters and more than a dozen components depend on each other. Any node down may cause various problems in other systems.
In fact, learning and operation and maintenance costs are relatively high. The example of the database just mentioned is a good example. To reduce the cost of development, you can use a descriptive language, that is, SQL, which can drastically reduce the entire development threshold. Most of the students have already studied database courses during their undergraduate years, so using SQL is more efficient than those who need to learn API and SDK. In this way, it is easier to find talents in the market, and it also allows the company to shift more energy from bottom-level platform operation and maintenance to high-level data value mining.
It is more convenient to interface with other systems through standard SQL, whether it is a development tool or a BI display tool.
The above are some technical requirements derived from business requirements.
The first generation of real-time data warehouse: the database stage
Next, let's take a look at whether some past real-time data warehouse development technologies can meet the above requirements.
The first generation of real-time data warehouse technology is called the database stage, which is a typical Lamda stage.
This architecture basically has a business requirement, and there is a set of data links, which are divided into real-time and offline parts. The data is collected by the message middleware, and part of it is processed in real time, and the result set is processed and stored in MySQL/HBase. The other part goes offline through Hive/Flink, processes the result set, and saves it in MySQL/HBase, which converts big data into small data and then provides services to the Internet.
Many people have already analyzed the problem of this architecture. The two architectures have data redundancy with each other, resulting in data inconsistency. This is relatively straightforward, but the more important issue here is the chimney-style development.
When the business side puts forward a new requirement, the programmer has to find out which data source the data comes from, which third-party data sources it should associate with, what kind of aggregation and processing to do, and then generate a result set, and finally We will find that this result set or hundreds of reports are generated, and 80% of them have great redundancy with each other. Some business departments look at these three indicators, and another department looks at two of them, and maybe only one field is changed in the middle. This is a common situation. The original data is the same, but you have one more statistical field and I have one less, but we have to re-develop end-to-end, which seriously reduces development efficiency. Operation and maintenance is also very difficult. Thousands of reports have been developed. We don't know whether these reports are being used by anyone, and we dare not go offline easily.
In addition, once the fields of any data source increase or decrease, adjustment, operation and maintenance, and modification are required, which is almost a disaster. If it is developed by one person, it is not a big problem. We have seen many development teams. Four or five students frequently write scripts every day. Then some of these personnel leave and some join. In the end, no one dares to delete the code of the old colleagues, and finally becomes Scheduling thousands of scripts, checking and correcting errors every day, it may be the fault of the script, or the fault of the data quality, which makes the operation and maintenance cost very high.
Therefore, this kind of chimney development is quick to get started, but it is unsustainable in actual operation and maintenance. The way we solve the chimney problem is to settle the shared part, which enters the second stage of the real-time data warehouse: the traditional data warehouse stage.
The second generation of real-time data warehouse: the stage of traditional data warehouse
Data warehouse is a very good concept, which is to precipitate those calculation indicators that can be reused. Therefore, there are three layers in the data warehouse, DWD, DWS, and ADS. Through layer-by-layer precipitation, the shared parts are lowered and the different parts are moved up to reduce the problem of repeated construction. This is also a set of basic methodology accumulated by the data warehouse after decades.
In this way, Kafka drives Flink, and Flink performs some dimension table associations during the calculation process. Dimension table association is basically associating Flink with a KeyValue system to widen the dimension table. After the widening, the result will be rewritten into another Kafka topic, and then secondary aggregation and summary will be performed to generate some DWS. Or ADS, and finally store the results in the OLAP/HBase system.
Why is part of the result storage in this place an OLAP system and part of it an HBase system?
The OLAP system is a very good table structure for data warehouse layering, but this type of system cannot support online applications. Online applications require QPS of tens of thousands of queries per second. The query mode is relatively simple, but the requirements for QPS are very high. It is difficult for most data warehouse systems to meet this requirement. Therefore, at this time, the system has to be stored in HBase to provide millisecond-level ability to respond.
This system solves the previous chimney problem, but we see that data redundancy still exists in the OLAP/HBase system. The same piece of data describes the same piece of logic, and there is still redundancy in the data warehouse and KeyValue system. The company's business team will choose according to the difference in SLA. If it is very sensitive to delay, put the result in HBase. If it is not sensitive to delay but requires query flexibility, it will put it in OLAP.
On the other hand, the development of the HBase system is still inconvenient, because it is a KeyValue system with relatively simple results. All queries need to be accessed based on the Key, and the Value part does not have a Scheme. Once a business unit finds that there is a problem with data quality, there is no way to simply check the value of a certain row or column, and it cannot coordinate and update it at any time. This is some limitation of the Schema Free system, and metadata management is relatively difficult. convenient.
The third generation of real-time data warehouse: analysis service integration stage
The third stage of the real-time data warehouse is the analysis service integration stage. This stage has been realized within Ali, and most external companies are also exploring the road.
There are two differences between this stage and the previous stage. On the one hand, the unification of the server, whether it is an OLAP system or an inventory system, can be unified through one system to reduce data fragmentation, reduce interface inconsistencies, and reduce a copy of data in different systems. Passing back and forth between them achieves the effect of unified storage, which makes it easier to check and correct our data status. The interfaces are unified into one system, the interfaces of security, access, control, and application development can be unified, and some optimizations have been made on the server side.
On the other hand, the data processing link has also been optimized to a certain extent, and there is no Kafka in it. In fact, whether there is Kafka is an option, because some systems have certain event-driven capabilities, such as Hologres, which has built-in Binlog event-driven capabilities, so Hologres can be used as a Kafka. Through Binlog and Flink, the real-time aggregation of DWD, DWS, and ADS is realized, which is also a very good capability.
With such an architecture, only Flink and Hologres are left as components, and there are no other external systems in between. They can still be driven through real-time links. The data is not split, and all data is stored in the database.
The key benefit is that the development link is shortened, which reduces the cost of error adjustment. Secondly, the detailed state of the data is stored, because it is difficult for HBase to store the detailed state. If the details are stored in the service system, the cost of data error checking and repairing becomes very low. In addition, there are fewer components, and the corresponding operation and maintenance costs are also reduced.
Related Articles
-
A detailed explanation of Hadoop core architecture HDFS
Knowledge Base Team
-
What Does IOT Mean
Knowledge Base Team
-
6 Optional Technologies for Data Storage
Knowledge Base Team
-
What Is Blockchain Technology
Knowledge Base Team
Explore More Special Offers
-
Short Message Service(SMS) & Mail Service
50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00