Understanding the Cloud-Native Integrated Data Warehouse in One Article
1. The release background of cloud-native integrated data warehouse
1.1 Market situation
IDC's 2021 report shows that the global big data software market was expected to reach 541.42 billion yuan in 2021, up 12.5% from 481.36 billion yuan in 2020. China's big data platform software market was expected to reach 12.58 billion yuan in 2021, up 36.5% from 2020, with a compound annual growth rate expected to exceed 30% over the next three years.
Alibaba Cloud ranked No. 1 in China's big data public cloud service market in the first half of 2021.
China's 14th Five-Year Plan also clearly states that, to accelerate the high-value transformation of data, the following must be achieved:
① Large-scale data aggregation: data collection across all links, construction of industry-wide big data, etc.
② Diversified data processing: handling multiple data types, modalities, and industries.
③ Timely data flow: dynamic data updates, establishment of data-sharing spaces, etc.
④ High-quality data governance: well-managed data assets across their full life cycle.
⑤ High-value data transformation: using data to advance government governance, social governance, risk control, industrial upgrading, and fintech upgrading.
Big data applications are increasingly mature across industries. The national plan also clearly states that we should cultivate professional, scenario-based big data solutions, build multi-level industrial Internet platforms, and build industry-specific big data platforms.
1.2 Challenges and Pain Points
At this stage, all kinds of industries are using big data capabilities for industrial upgrading, which places ever higher requirements on the platforms underpinning data analysis. Enterprises face many challenges when building a big data platform:
● There are strong demands for timeliness, accuracy and cost-effectiveness;
● More and more unstructured data is difficult to effectively support analysis and decision-making;
● How to conduct global data analysis across siloed, heterogeneous big data platforms.
In response to market demands, Alibaba Cloud has launched a cloud-native integrated data warehouse to solve the pain points of enterprises in various industries in building big data analysis platforms.
2. What is cloud native integrated data warehouse?
The cloud-native integrated data warehouse is a one-stop big data processing platform that integrates the capabilities of Alibaba Cloud's big data products MaxCompute, DataWorks, and Hologres. It addresses the needs enterprises face when building big data platforms: timeliness, accuracy, and cost-effectiveness; bringing unstructured data into analysis and decision-making; and global data analysis across heterogeneous big data platforms.
Through the deep integration of MaxCompute and Hologres, it provides rich and flexible offline-real-time integration capabilities. Through more open data lake support and lake-warehouse integration for unified management of diverse data, a single data foundation can continuously pursue the combination of real-time and online capabilities. DataWorks' top-down and bottom-up bidirectional modeling capabilities, together with new data governance features and an enterprise data maturity evaluation model, help enterprises perceive their own data maturity more intuitively. The open DataWorks plug-in system also allows customers and industry ISVs to build more scenario-based data analysis capabilities around their own data, truly helping them upgrade their businesses intelligently.
Its core is three integrations plus full-link data governance: offline-real-time integration, lake-warehouse integration, analysis-service integration, and full-link data governance.
A. Offline-real-time integration
● An N-to-1 minimalist architecture with MaxCompute and Hologres at its core provides an offline-real-time integrated massive cloud data warehouse service;
● 10x high-speed native mutual access and deep integration between MaxCompute and Hologres;
● MaxCompute released fast query capability for its EB-scale massive cloud data warehouse.
B. Lake and warehouse integration
● Continued improvement of an easy-to-use lake-warehouse development experience;
● Added lake warehouse management capabilities for unstructured data;
● Extensive support for open source ecosystem integration.
C. Analysis service integration
● The trend toward real-time, agile, online, integrated data warehouses is clear;
● On one platform, with one copy of data, realize both flexible exploratory analysis and high-concurrency online application queries, while achieving good resource isolation and availability;
● Reduce data fragmentation, reduce data movement, and unify data service endpoints.
D. Full Link Data Governance
● Top-down, business-driven standardized modeling of data warehouses;
● Problem-driven, sustainable data governance and evaluation of enterprise data governance effectiveness;
● A newly upgraded DataWorks open platform.
3. The technical concept of cloud-native integrated data warehouse
1 Offline-real-time integration
Offline and real-time computing
Big data technology originally emerged for massive-scale data processing. As Internet applications and technologies developed, however, demand from online business and refined operations grew ever stronger, such as real-time GMV dashboards, real-time business data analysis, and real-time user profiling and tagging systems. Big data technology therefore gradually evolved from offline computing toward real-time computing.
Offline and real-time data warehouses differ in scenarios, design concepts, and product capabilities. Offline data warehouses target data processing scenarios, while real-time data warehouses target data analysis scenarios: the processing system serves scheduled jobs, while the analysis system serves human-computer interaction and online applications. In data volume, the processing system takes big data in and puts big data out, producing processed result tables, while the analysis system takes big data in and puts small data out, producing reports and dashboard KPIs. In timeliness, the processing system follows the batch-processing concept and completes processing in T+1 fashion, while the analysis system expects data to be usable as soon as it is written and to support real-time updates. In usage, the processing system runs offline jobs that report progress and can retry intermediate steps, while the analysis system is an online system whose queries respond synchronously and have only two states, success and failure. Different demand scenarios determine different technical routes: for scalability, the offline system adopts asynchronous job scheduling, resource allocation at computation time, and fully decoupled compute and storage; for real-time performance, the real-time system adopts synchronous RPC calls, pre-allocated computing resources, and runtime binding of compute and storage.
In the evolution from offline to real-time, many excellent systems have emerged in the big data field to handle various analysis and query scenarios. For example, real-time data can be archived into an offline warehouse such as Hive for offline processing, with aggregated small-scale results exported to MySQL for report queries or data access; a stream computing engine can pre-process data in real time and write the results to a KV system such as HBase/Cassandra for high-concurrency point lookups; real-time data can be written directly into an MPP system such as ClickHouse/Druid for fast interactive queries; and Presto can federate queries across multiple data sources. In short, realizing the full real-time ingestion, processing, and analysis link requires building and operating multiple systems or services, leading to complex architecture, fragmented data storage, inconsistent data, high development costs, and many other problems.
From N to 1: an offline-real-time integrated massive cloud data warehouse
To solve this problem, Alibaba Cloud has launched an offline-real-time integrated massive cloud data warehouse architecture with MaxCompute and Hologres at its core, using one architecture to meet N kinds of analysis scenarios. Where you previously had to operate and maintain N components, develop N systems, integrate N interfaces, and manage N security policies, one system now solves it all, eliminating data fragmentation and development complexity and making the architecture very simple.
In the data ingestion part, MaxCompute provides not only traditional batch writing but also newly supports streaming writes, improving the write efficiency and channel stability of the offline data link, while Hologres provides write-and-see real-time write and update capabilities to ensure data can be written and updated in real time.
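As a rough illustration (not the MaxCompute or Hologres SDK; all class names here are invented), the difference between batch and streaming write visibility can be sketched in a few lines of Python:

```python
# Hypothetical sketch contrasting batch vs. streaming write semantics.

class Table:
    """Minimal model of a table that tracks committed (visible) rows."""
    def __init__(self):
        self.visible_rows = []

class BatchWriter:
    """Batch mode: rows become visible only after an explicit commit,
    typically once per scheduled job (the T+1 pattern)."""
    def __init__(self, table):
        self.table = table
        self.buffer = []
    def write(self, row):
        self.buffer.append(row)            # staged, not yet visible
    def commit(self):
        self.table.visible_rows.extend(self.buffer)
        self.buffer.clear()

class StreamingWriter:
    """Streaming mode: every write is committed immediately,
    so data is visible to queries as soon as it lands."""
    def __init__(self, table):
        self.table = table
    def write(self, row):
        self.table.visible_rows.append(row)  # visible on write

t1, t2 = Table(), Table()
b = BatchWriter(t1)
b.write({"id": 1}); b.write({"id": 2})
rows_before_commit = len(t1.visible_rows)    # nothing visible yet
b.commit()

s = StreamingWriter(t2)
s.write({"id": 1})
rows_streaming = len(t2.visible_rows)        # visible immediately
```

The streaming path trades per-write commit overhead for freshness, which is why it suits the real-time link while batch commits suit high-throughput offline jobs.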
In data computing, MaxCompute, as an EB-scale massive cloud data warehouse, provides low-cost storage and computing power at massive scale. Its throughput-oriented design lets very large computing tasks produce output stably, and complex UDFs let users express sophisticated data processing logic through flexible programming. Computing tasks in a massive data warehouse generally run for a long time, from minutes to hours or even days; MaxCompute's continuous performance optimization can now accelerate offline queries to seconds, giving it broad-spectrum applicability from seconds to days.
Hologres, through many innovations in OLAP data warehouse technology such as CPU vectorization, full-link asynchrony, and full exploitation of SSDs' write-friendly characteristics, provides a cloud data warehouse service with real-time writes, real-time updates, and real-time analysis, supporting extremely high concurrency at sub-second latency.
MaxCompute and Hologres complement each other in scenarios and technologies, each delivering the best experience in its own field. But they are, after all, two systems. To avoid data fragmentation, we have connected the metadata and storage of the two systems through deep integration, achieving mutual access without data movement, and ultimately providing external service and analysis capabilities that support scenarios such as online applications, large data screens, operational dashboards, and ad hoc queries.
1. Metadata Visibility Technology
Metadata visibility technology makes data visible across the two systems, enabling bidirectional read and write. MaxCompute tables can be imported into the Hologres metadata database in batches, and newly created MaxCompute tables can be synchronized to Hologres automatically. Conversely, Hologres tables can be defined as MaxCompute external tables. Through external-table metadata, data is not relocated, and bidirectional read, write, and perception are supported. Automatic metadata discovery fully automates external-table creation and updates, eliminating a great deal of manual operation and maintenance: users no longer need to synchronize table structures periodically or worry about mismatched data types.
2. External table acceleration technology
Ideally, data processed by the offline system could be used directly for interactive analysis in the real-time system. However, limited by its scheduling and resource allocation mechanisms, the offline system's own technical improvements can achieve only a modest acceleration effect. By instead using the computing power of the interactive analysis side, external table acceleration in the real-time system achieves higher-quality accelerated analysis of offline data. With external table acceleration, data need not be relocated: at query time, the real-time system's computing resources and more efficient RPC scheduling directly access the storage files in the offline system, and query acceleration is achieved through the real-time system's resident processes, caching, prefetching, expression pushdown, and other techniques. It is widely used in BI interactive query and similar scenarios.
More technical references: High-performance native acceleration MaxCompute core principles
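The caching idea behind external table acceleration can be sketched as follows. This is an illustrative model, not the actual implementation: a resident service keeps hot blocks of offline storage in a local LRU cache so that repeated interactive queries avoid remote reads.

```python
from collections import OrderedDict

class BlockCache:
    """Toy LRU cache over remote storage blocks (names are invented)."""
    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity
        self.fetch_remote = fetch_remote   # slow path into offline storage
        self.cache = OrderedDict()
        self.remote_reads = 0

    def read(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)   # refresh LRU position
            return self.cache[block_id]
        self.remote_reads += 1                 # cache miss: go remote
        data = self.fetch_remote(block_id)
        self.cache[block_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return data

cache = BlockCache(capacity=2, fetch_remote=lambda b: f"data-{b}")
for b in ["a", "b", "a", "a", "b"]:
    cache.read(b)    # hot blocks "a" and "b" trigger only 2 remote reads
```

In BI workloads, the same small set of partitions is often scanned repeatedly, which is why a modest cache combined with prefetching yields large speedups.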
3. High-speed direct reading and direct writing
There are two approaches to implementing external tables: one connects through each engine's query interface, the other directly accesses the other system's underlying storage. Connecting through query interfaces offers good isolation, standard-conforming interfaces, and a low integration threshold, but performance is not optimal because the call path is longer and more components are traversed. Directly accessing the underlying storage engine is highly intrusive and vulnerable to incompatibilities caused by system iteration. Most systems that support federated queries therefore use the first approach, the standard interface method, as Presto does. Alibaba Cloud's MaxCompute and Hologres adopted the second approach: because the two products come from the same core R&D team, they can resolve system incompatibilities. The two systems share the underlying Pangu storage engine while retaining their own storage innovations, such as index design. Relative to the interface approach, direct read and write improves performance by more than 10x and supports importing millions of rows per second from MaxCompute to Hologres, so data refreshes and write-backs take effect immediately.
Through these three technical innovations, the data link between the real-time and offline systems is connected while each system retains its own scenario strengths.
MaxCompute's fast query capability
Beyond the deeply integrated MaxCompute-Hologres architecture, MaxCompute, as a massive cloud data warehouse, is also continuously working on offline acceleration. How to accelerate the offline massive warehouse at low cost, balancing customers' trade-offs among performance, latency, and cost, is the problem we need to solve.
MaxCompute extends its original architecture with a built-in query acceleration engine that can accelerate offline queries to the second level. MaxCompute has always been a throughput-optimized offline data warehouse; even small query tasks often suffer long queue times and slow execution. The newly released built-in query acceleration engine optimizes latency for queries over small data volumes, mainly using techniques such as high-priority resource preemption, multi-level caches, in-memory/network shuffle, and vectorized execution, greatly reducing overhead on the end-to-end link of small queries.
The query acceleration engine supports multiple billing modes. In the pay-as-you-go mode, acceleration is identified and applied automatically with no user attention required: behind it is an automatic job feature recognition algorithm that chooses between offline mode and acceleration mode based on the scale and complexity of each query job, so simple queries run fast and complex queries compute quickly. The prepaid mode will also soon support exclusive resource groups for the query acceleration engine, delivering stable offline acceleration.
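The routing decision described above can be sketched as a simple policy. This is an invented illustration, not MaxCompute's actual recognition algorithm, and the threshold below is made up:

```python
# Hypothetical routing policy: small, simple scans go to the low-latency
# acceleration engine; large or complex jobs stay on the batch engine.

ACCEL_MAX_BYTES = 100 * 1024**3   # invented 100 GB cutoff for illustration

def route_query(scan_bytes, has_complex_udf):
    """Pick an execution mode from simple job features."""
    if scan_bytes <= ACCEL_MAX_BYTES and not has_complex_udf:
        return "query-acceleration"   # preemptive resources, memory shuffle
    return "batch"                    # disk shuffle, throughput-optimized

mode_small = route_query(5 * 1024**3, has_complex_udf=False)   # 5 GB scan
mode_large = route_query(2 * 1024**4, has_complex_udf=False)   # 2 TB scan
```

A real recognizer would weigh many more features (plan shape, partition pruning, historical runtimes), but the core idea is the same: classify before execution so each query lands on the engine suited to it.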
The data channel newly supports a streaming write mode, which not only improves the write efficiency and stability of the offline data link but also, combined with the query acceleration engine, achieves near-real-time data visibility, effectively shortening time-to-insight for offline business.
The JDBC interface newly supports a variety of BI tools, such as Guanyuan BI, NetEase Youshu BI, and Superset.
2 Lake-warehouse integration
Twenty years of big data development have produced two forms: the data lake and the data warehouse.
The past 20 years have witnessed the rapid development of big data technology. Across computer science, data processing technology has gone through four main stages: the database stage, the big data exploration stage, the big data development stage, and the big data inclusion stage. The database stage ran roughly from the 1970s to the 1990s, the golden age of databases on single machines. Database systems were mainly operation-oriented, transaction-oriented data systems serving online business. Around the 1990s the concept of the data warehouse appeared, oriented toward analysis and exploration of full historical data; but because overall data volumes were still modest, extensions of database technology could support the data warehouse needs of the time.
Around 2000, with the explosion of Internet technology, we entered the big data era. Traditional database technology could no longer meet the needs of massive data processing. Google's three papers on distributed storage, scheduling, and computing laid the foundation for big data technology. In roughly the same period, the Hadoop system appeared in 2006, Alibaba developed the Feitian (Apsara) system in 2009, and leading companies such as Microsoft built their own excellent distributed systems. At this stage, big data technology was focused first on simply handling data at scale; everything else came later.
Around 2010 came a stage of vigorous development, when we hoped big data technology would go from merely usable to genuinely easy to use. A series of engines with SQL as the primary interface appeared, including Hive, Flink, and Presto in the Hadoop ecosystem. A system gradually formed with HDFS as unified storage and ORC and Parquet as open file formats, with many open engines on top. This system is essentially the data lake system we talk about today; at this stage, Hadoop was in essence a data lake. What is the essence of a data lake? Unified storage that can hold raw data and support multiple computing paradigms.
In the same period, Alibaba built MaxCompute on the Feitian system, Google released BigQuery, and AWS released Redshift. These systems can be called cloud data warehouses of the big data era. How do they differ from the Hadoop system above? A cloud data warehouse does not expose its file system to the outside world; it exposes descriptions of the data, in the form of tables and views. The storage engine and computing engine are shielded inside the system, so they can be deeply co-optimized, though users cannot perceive this directly. At this stage, big data technology began to specialize, and the shapes of the lake and the warehouse took form.
The stage we are in now, from around 2015, is the big data inclusion stage, in which we have observed two trends. First, beyond the pursuit of scale and performance, big data technology now emphasizes enterprise-grade capabilities such as data security, data governance, stability, and low cost. Alibaba, for example, has built a data middle platform with Alibaba characteristics on top of MaxCompute, while the open source world developed Atlas and Ranger, projects focused on lineage, governance, and security. Second, with the development of AI, IoT, and cloud-native technologies, demand for unstructured data processing keeps growing, and cloud object storage is increasingly used as unified storage: the Hadoop system has gradually shifted from HDFS to cloud storage such as S3 and OSS as the unified storage layer of the data lake. Many data lake construction products have appeared, such as AWS Lake Formation and Alibaba Cloud's DLF. In the data warehouse direction, to adapt to this trend, we are also working closely with the data lake on external tables, through which data in the lake can participate in federated computation.
Looking back over these 20 years, the evolution of big data technology has produced two systems: the warehouse and the lake.
Definition and difference between data lake and data warehouse
We can compare a data lake and a data warehouse along the following dimensions.
Overall, a data lake is a relatively loosely coupled, wide-in, wide-out system, while a data warehouse is a strictly-in, strictly-out, relatively tightly coupled system. In a lake, data arrives first and is used later, so modeling happens after the fact, and structured, semi-structured, and unstructured data can all be stored. The lake provides a set of standard open interfaces so that engines plug into the system; it is therefore open to all engines. But precisely because engines are pluggable, compute and storage are two independent systems that cannot deeply understand each other, so no deep co-optimization is possible and engine optimization remains moderate and limited. Data lakes are easy to start with, but as data grows, governance and management problems accumulate and later operation and maintenance become harder. Because the lake performs no strong, consistent schema checking, data quality is relatively low and the data is hard to manage and use. Because lake data is used before it is modeled, the lake better suits unknown problems: exploratory analysis, scientific computing, data mining, and similar workloads.
The data warehouse is roughly the opposite on every dimension. It is a strict system: modeling happens in advance, data is transformed and cleaned on the way into the warehouse, and storage types are structured or semi-structured. Because the warehouse is a relatively closed, self-contained system, it is open only to specific engines; but precisely for that reason, its compute engine, storage engine, and metadata can cooperate in very deep, vertical optimization and achieve excellent performance. Because modeling comes first and data must be shaped to enter, getting started is harder and start-up costs are higher. Once data is in the warehouse, however, its high quality makes it easy to manage, overall cost falls, and operations can approach maintenance-free. The warehouse enforces strong, consistent schema checking, so data quality is high and the data is easy to use. Its computing loads naturally suit offline computing, interactive computing, BI, and visualization.
The flexibility of the data lake and the growth of the data warehouse
Overall, the data lake is stronger in flexibility, while the data warehouse is stronger in enterprise-grade capabilities. What do these two characteristics ultimately mean for an enterprise? The following figure illustrates it.
The horizontal axis represents the enterprise's business scale, and the vertical axis the overall cost of building its big data platform. Early on, when business scale is small and data use is still innovative and exploratory from generation to consumption, the data lake architecture fits better: it is easy to start, new services for ad hoc data processing needs can be added or deployed quickly, and the open source community offers plenty of references. As the enterprise matures, data scale becomes very large, the number of people and departments involved keeps growing, and data governance, fine-grained permission control, and cost control become critical. Continuing with a data lake at this point makes data processing and management overhead rise sharply. The data warehouse architecture then fits better, since its data quality assurance and strong control suit the enterprise's growth. Given that lakes and warehouses each play a key role at different stages of enterprise development, is there a technology or architecture that captures the advantages of both at once? Through industry insight and Alibaba Cloud's own practice, we believe lake-warehouse convergence is happening, and the new lake-warehouse integrated data management architecture solves this problem well.
As a strict system, the data warehouse is better suited to transaction support, strong schema consistency checking and evolution, natural BI support, and easier real-time capability. The data lake's advantages lie in rich data types, support for multiple computing paradigms, open file systems and open file formats, and a storage-compute-separated architecture.
Therefore, for a data warehouse to evolve toward lake-warehouse integration, it must grow data lake characteristics from its own base, in practice by linking and integrating with systems such as HDFS and OSS, so the warehouse-side architecture tends toward a side-by-side (left-right) structure. For a data lake to evolve toward lake-warehouse integration, it must build strong warehouse characteristics on top of HDFS and OSS, so the lake-side architecture looks more like a layered (top-down) structure: Delta Lake and Hudi, for instance, insert a layer into that stack, a file format above the lake that supports strong warehouse semantics.
But whether the starting point is the data warehouse or the data lake, everyone's evolution converges on the same destination: lake-warehouse integration, whose defining characteristics remain the four warehouse-leaning traits and the four lake-leaning traits described above.
Alibaba Cloud Lake and Warehouse Integrated Architecture Introduction and Latest Release
Alibaba Cloud first proposed the lake-warehouse integrated architecture at the 2020 Yunqi Conference and continues to upgrade the architecture and optimize the technology. The left side of the figure above shows the overall architecture: from the bottom up, the network layer, the lake-warehouse engine layer, the DataWorks lake-warehouse data development layer, and the business application layer on top. Focusing on the engine layer: Alibaba Cloud's lake-warehouse integration is a side-by-side structure, with Alibaba Cloud's self-developed cloud data warehouse products represented by MaxCompute on the left, Alibaba Cloud EMR open source data lake products on the right, and, in the middle, unified metadata and open-format compatibility that let data and tasks flow freely between warehouse and lake. What was announced at the 2020 Yunqi Conference was support for Hadoop data lakes; recently we have added support for Alibaba Cloud DLF and OSS data lakes.
On the right side of the figure, we highlight the recently released features of Alibaba Cloud lake-warehouse integration.
The first is an easier-to-use lake-warehouse development experience. DataWorks has upgraded integrated lake-warehouse development and management, letting customers connect lake and warehouse by themselves in minutes while shielding many low-level configuration details, so customers reach business insights quickly.
The second is broader ecosystem connectivity. We can connect to the Alibaba Cloud DLF metadata service to support OSS data lake queries, and we support open source file formats such as Delta Lake and Hudi. We will also extend support to multiple external federated data sources through foreign servers; for example, in the next two months we will support federated mapping of entire RDS databases, which is more efficient than the previous single-table mapping.
The third is higher performance. MaxCompute's new support for intelligent cache and built-in query acceleration engine can improve data lake query performance by more than 10 times.
The fourth is richer data types. We are about to support lake-warehouse management of unstructured data, a new capability currently under development. The lake-warehouse integration described so far mainly targets structured data in the lake; the next release targets unstructured data in the lake. We give customers a very simple way to map unstructured data in the lake into a special object in the MaxCompute data warehouse, which they can then operate on like a table. The benefit is that the strong data warehouse management capabilities of MaxCompute + DataWorks are projected onto unstructured data, improving its management and even governance.
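The mapping idea can be sketched as follows. This is a hypothetical illustration of the concept, not the actual MaxCompute object model; all names are invented:

```python
# Toy model: project unstructured lake files into a table-like object so
# warehouse-style management (listing, metadata, auditing) can apply.

class ObjectTable:
    """Wraps a set of lake files and exposes table-style metadata rows."""
    def __init__(self, name, files):
        self.name = name
        self.files = files   # (path, size_bytes) pairs from the lake

    def rows(self):
        # Each file becomes one metadata row, queryable like a table row.
        return [{"path": p, "size": s} for p, s in self.files]

images = ObjectTable("raw_images", [("oss://bucket/a.jpg", 1024),
                                    ("oss://bucket/b.jpg", 2048)])
total_size = sum(r["size"] for r in images.rows())
file_count = len(images.rows())
```

Once files present themselves as rows, existing warehouse tooling (permissions, lineage, lifecycle policies) can govern them without knowing about their raw formats.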
Alibaba Cloud Lake and Warehouse Integration Key Technologies
Whether a lake-warehouse integration evolved from a layered structure or a side-by-side structure, the result should ultimately be a simple, easy-to-use system. Alibaba Cloud lake-warehouse integration has four key characteristics, all focused on making the data lake and data warehouse easier to use.
1. Quick access
This involves two main layers: the network layer and the lake-warehouse integration opening layer. MaxCompute supports connecting to Hadoop systems in any environment, on or off the cloud. Because MaxCompute is itself a multi-tenant system, connecting it to a specific user environment is a big technical challenge; we developed the PrivateAccess network connection technology to achieve this. Second, DataWorks has upgraded integrated lake-warehouse development and management, letting customers connect lake and warehouse by themselves in minutes while shielding many low-level configuration details, so customers achieve rapid business insights.
2. Unified data/metadata
The key technology is database-level metadata mapping: a database on the data lake can be mapped to a project in MaxCompute. The data on the lake does not need to be moved, and MaxCompute can consume it just as it accesses and operates on ordinary projects. Meanwhile, data and metadata are synchronized between the lake and the warehouse in real time: if the data or schema of a table in the lake changes, the change is promptly reflected on the MaxCompute side. MaxCompute has its own built-in storage file format, and we also continue to track the file formats of the open-source ecosystem, with broad support for the open data file formats Delta Lake and Hudi.
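The mapping idea can be sketched with a toy in-memory model. This is a minimal illustration of the concept, not MaxCompute's actual API; all class and field names below are invented. The point is that the warehouse-side project holds a reference to the lake database's metadata rather than a copy, so schema changes in the lake are visible immediately.

```python
# Toy model of database-level metadata mapping (illustrative names only,
# not real MaxCompute objects).

class LakeDatabase:
    """A database in the data lake, holding table schemas."""
    def __init__(self, name):
        self.name = name
        self.tables = {}  # table name -> schema (list of column names)

class WarehouseProject:
    """A warehouse project that mirrors a lake database without copying data."""
    def __init__(self, lake_db):
        self.lake_db = lake_db            # a reference, not a copy
        self.name = f"mapped_{lake_db.name}"

    def table_schema(self, table):
        # Metadata is read through to the lake, so schema changes there
        # are visible on the warehouse side at once.
        return self.lake_db.tables[table]

db = LakeDatabase("sales_lake")
db.tables["orders"] = ["order_id", "amount"]
proj = WarehouseProject(db)
print(proj.table_schema("orders"))   # ['order_id', 'amount']

# A schema change in the lake is reflected immediately in the warehouse view:
db.tables["orders"].append("region")
print(proj.table_schema("orders"))   # ['order_id', 'amount', 'region']
```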
3. Provide a unified development experience
The data lake and the data warehouse are two different data processing systems, each with its own database object schema design. Last year we did a great deal of work to unify the database object models on both sides, and MaxCompute's SQL and Spark languages are highly compatible with the open-source ecosystem, so scripts can run on either side; in some customer cases we can switch seamlessly. DataWorks can develop and schedule jobs across multiple engines, and on this basis we provide more unified development and management functions for the lakehouse. The upcoming unstructured data lake management capability will further unify the development and management experience for structured and unstructured data.
4. Automated data warehousing
This is an area we have been focusing on. MaxCompute's cache technology, combined with the offline query acceleration engine, can speed up data lake query scenarios by more than 10x. We can also adjust the caching policy dynamically according to business scenarios, implementing intelligent caching that realizes hot/cold tiering of data under the lakehouse architecture. Because the cache is deeply coupled with storage and compute, this caching layer lets the warehouse push performance further. In addition, we try to tag and identify the data in the lake: from a data modeling perspective, we can determine which data is better stored in the warehouse and which in the lake. For example, structured table data that is accessed repeatedly and relatively frequently is better placed in the data warehouse, while unstructured or semi-structured low-frequency data is better placed in the data lake. The ultimate goal is an optimal balance between performance, cost, and business effectiveness.
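The tiering rule described above can be sketched as a simple placement policy. The thresholds and metadata fields here are invented for illustration; a real system would learn or tune them per business scenario rather than hard-code them.

```python
# Illustrative hot/cold placement policy for lakehouse data
# (threshold and field names are assumptions, not product behavior).

def placement(table):
    """Decide whether a table belongs in the warehouse (hot) or the lake (cold)."""
    if table["structured"] and table["accesses_per_day"] >= 100:
        return "warehouse"   # frequently scanned structured data: keep it hot
    return "lake"            # low-frequency or semi-/unstructured data stays cold

tables = [
    {"name": "fact_orders", "structured": True,  "accesses_per_day": 500},
    {"name": "raw_logs",    "structured": False, "accesses_per_day": 3},
]
for t in tables:
    print(t["name"], "->", placement(t))
# fact_orders -> warehouse
# raw_logs -> lake
```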
3 Integrated Analysis and Serving
The integration of analysis and serving is an important capability innovation in Alibaba Cloud's integrated data warehouse. In English it is called Hybrid Serving and Analytical Processing (HSAP), an architectural trend first proposed by Alibaba Cloud. Analysis is the process of making decisions from data. Common forms include multi-dimensional analysis, exploratory analysis, interactive analysis, and ad hoc analysis, served by systems such as Presto, Greenplum, and ClickHouse. These are typically used for internal business reports, executive dashboards, and metric platforms, and excel at complex, flexible queries. Serving refers to data services, a term usually from the TP field: the high-performance, high-QPS data read/write requirements of online business. Individual requests touch little data but have strict SLA, availability, and latency requirements. The core difference from traditional TP is that the demand for transactions is weaker than the demand for throughput and performance, so more relaxed consistency protocols can be adopted, such as requiring only monotonic reads, which reduces the overhead of distributed locks. This is common in NoSQL systems such as HBase and Redis, which usually serve scenarios like to-C online recommendation, online marketing, and risk control.
The underlying data sources of the two scenarios are the same (business databases plus behavior logs), and they feed each other: data generated by online services needs secondary analysis, and analysis results in turn serve the online business. An integrated analysis-serving architecture simplifies data exchange between systems, improves development efficiency, provides a unified data service exit for upper-layer applications, and guarantees consistent data definitions.
Real-time data warehouse trends: agility, online, and integration
What makes an efficient, high-quality, and reliable real-time data warehouse? Based on our observations and technical practice over recent years, we have identified three trends in the real-time data warehouse field: agility, online operation, and integration.
● In processing, the methodology has become more agile: processing scripts are lighter-weight and closer to real time, data layering is weakened, layers are reduced, and scheduling is reduced, making the link from data production to consumption more compact and simple and thus shortening the wait before data is available.
● In serving, big data teams now directly serve the company's core online business, shifting from a cost center to a profit center: ensuring the stability and efficiency of online business, improving marketing efficiency through data intelligence, improving the accuracy of risk control, and so on. The transformation of big data technology from an internal analysis tool into an online production system requires higher reliability, stability, and production-grade operation and maintenance capabilities at the system design level.
● In architecture, an integrated analysis-serving architecture reduces data fragmentation and forms a unified data service layer, which improves development efficiency, reduces operation and maintenance costs, and guarantees the consistency and freshness of data definitions.
Data processing agility
Traditionally, building a reasonable real-time data warehouse for big data is a complicated task. The Lambda architecture is typically used, with a real-time processing layer, an offline processing layer, and sometimes a near-real-time processing layer. Storage is divided by access characteristics into offline storage and online storage, and the online part is further subdivided into OLAP systems and key/value systems, which provide flexible analysis and high-performance online point lookups respectively. On the application side, online systems are mostly accessed through APIs and analysis systems through SQL. Different systems connect to different storage engines, adopt different protocols, and use different access control policies.
This architecture works when business changes are few and data quality is high, but reality is much more complicated: business changes are ever more agile, data quality is uneven, data structures are adjusted frequently, and data corrections and re-backfills are high-frequency, time-consuming jobs. With data scattered across many different systems and repeatedly synchronized between storage engines, business agility ends up measured in days or even longer.
Data silos in this architecture therefore inevitably lead to difficult data synchronization, high resource consumption, high development costs, and greater difficulty in recruiting talent.
To simplify this complex architecture and make data processing agile, two things are key: first, simplify state storage and reduce data redundancy, so that data development and data correction operate on a single copy of the data; second, make the processing link lightweight.
For state storage, Hologres provides high-throughput, low-latency real-time write and update capabilities, and data is queryable as soon as it is written; both single flexible updates and batch backfills of hundreds of millions of rows are supported. Building a unified data state layer on Hologres significantly reduces data relocation.
For data warehouse processing, the layered data warehouse methodology is adopted to support the accumulation and reuse of indicators. Processing is divided into common-layer processing and application-layer processing. Common-layer processing uses Flink together with the Hologres binlog to make the full ODS -> DWD -> DWS link event-driven in real time, supporting data writing and processing. In application-layer processing, business logic is encapsulated in views to reduce the management of intermediate tables, while Hologres's distributed query capability gives the business layer good analytical flexibility, returning that flexibility from data engineers to business analysts and enabling self-service and exploratory analysis.
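The event-driven layering can be sketched with a toy in-memory model, where a simple list of change events stands in for the Hologres binlog (the transforms and field names are invented, not Flink or Hologres APIs). Each change event in the ODS layer is cleansed into a DWD record and immediately folded into the DWS aggregate, so the full link is driven by events rather than by batch schedules.

```python
# Toy event-driven ODS -> DWD -> DWS link (illustrative only; a real
# deployment would use Flink consuming the Hologres binlog).

def to_dwd(event):
    """Cleanse a raw ODS event into a detail-layer (DWD) record."""
    return {"user": event["user"].strip().lower(), "amount": float(event["amount"])}

def apply_to_dws(dws, record):
    """Incrementally aggregate a detail record into the summary layer (DWS)."""
    dws[record["user"]] = dws.get(record["user"], 0.0) + record["amount"]

dws = {}
ods_binlog = [
    {"user": " Alice ", "amount": "10"},
    {"user": "alice",  "amount": "5"},
]
for event in ods_binlog:          # each change event drives the full link
    apply_to_dws(dws, to_dwd(event))
print(dws)                        # {'alice': 15.0}
```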
Online data service
A core trend of real-time data warehouses is that data services go online. Data expands from internal to-B decision-making scenarios to to-C online business scenarios, supporting real-time user profiling, real-time personalized recommendation, real-time risk control, and more, improving business efficiency through data. This places higher requirements on execution efficiency and system stability: moving from a somewhat peripheral analysis system to a mission-critical online business system, the data platform needs high availability, high concurrency, low latency, and low jitter; cloud-native elasticity supporting hot service upgrades and hot scale-out; and complete observability and O&M capabilities.
In response, Hologres has innovated heavily in its storage engine, execution engine, and O&M capabilities. In storage, on top of the original row store and column store, it supports a row-column hybrid layout, allowing the same table to serve both OLAP and key/value scenarios well. It also introduces shard-level multi-replica capability, so QPS can grow linearly within a single instance by increasing the number of replicas. Combining row-column hybrid storage with shard replicas also enables a new non-primary-key point lookup capability, widely used in scenarios such as order retrieval.
Operation, maintenance, and upgrades are inevitable. Hologres has introduced hot upgrades: the service is not interrupted during an upgrade, reducing the impact of system O&M on online business. Through physical backup of metadata and lazy opening of data files, failure recovery has been optimized; actual business verification shows recovery more than 10x faster, with automatic recovery from faults within minutes, minimizing the impact of failures.
At the same time, for enterprise-grade security scenarios, Hologres provides encrypted data storage, data masking on access, self-service analysis of query logs, and other capabilities, supporting complete enterprise-level security.
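The value of a row-column hybrid table is that one table can answer both workload shapes. A minimal sketch of the routing idea, under the simplifying assumption that the access path is chosen purely by query kind (real engines decide per query plan):

```python
# Illustrative routing between the row and column layouts of a
# row-column hybrid table (a simplification of real planner behavior).

def choose_layout(query):
    """Point lookups by key favor the row store; scans/aggregations favor columns."""
    if query["kind"] == "point_lookup":
        return "row_store"      # low-latency key/value access, whole row at once
    return "column_store"       # OLAP scans read only the columns they need

print(choose_layout({"kind": "point_lookup", "key": "order#42"}))  # row_store
print(choose_layout({"kind": "aggregate", "column": "amount"}))    # column_store
```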
Integrated analysis-serving data architecture
The integration of analysis and serving is an important trend for simplifying the data platform and unifying the exit of data services; it is also a capability innovation in the storage and query engine. Within one architecture it supports two typical data scenarios: complex OLAP analysis, and the high-QPS, low-latency needs of online services. For the business, it creates a unified data service outlet, enables agile business response, supports independent data analysis, avoids data islands, and simplifies operation and maintenance. This is a significant challenge for the technical architecture. Hologres therefore designed row storage and column storage to support serving scenarios and OLAP scenarios respectively, and on the compute side, on the basis of data sharing, it must efficiently support fine-grained isolation.
Hologres offers a multi-instance high-availability deployment model based on shared storage. In this solution, users can create multiple instances representing different compute resources, but all instances share the same data. One instance serves as the primary and supports reads and writes; the others are read-only secondary instances. Memory state is synchronized between instances in real time with millisecond latency, while physical storage holds only one copy. Data and permission configuration are unified, but compute loads are separated by physical resources, achieving 100% isolation: read and write requests do not compete for resources, read/write isolation is supported, and fault isolation is also improved. A primary instance currently supports mounting up to 4 secondary instances. Deployed in the same region, storage is shared; deployed across regions, data is copied into multiple replicas for disaster recovery. In big-promotion scenarios, this solution has been repeatedly verified by multiple core businesses within Alibaba, with high reliability. We usually recommend one primary instance for data writing and processing, one secondary instance for internal OLAP analysis, and one secondary instance for external data services; different compute specifications can be allocated according to the computing power each scenario needs.
Through the event-driven processing and agile views of the data processing layer, the multiple storage layouts and multi-instance shared-storage architecture of the storage layer, and capabilities such as fine-grained resource isolation and read/write separation, the real-time data warehouse solution gives users a leaner architecture that meets both flexible data analysis and online serving scenarios, a more efficient development methodology, and basic components that are easier to manage and more cost-effective.
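The shared-storage arrangement can be sketched with a toy model: several instance objects hold references to one storage object, only the primary may write, and replicas read the same physical data with no copy. The classes below are invented for illustration and do not reflect Hologres internals.

```python
# Toy model of shared-storage read/write separation
# (illustrative classes, not Hologres internals).

class SharedStorage:
    """The single physical copy of the data."""
    def __init__(self):
        self.data = {}

class Instance:
    """A compute instance; all instances point at the same storage."""
    def __init__(self, storage, writable):
        self.storage = storage
        self.writable = writable

    def write(self, key, value):
        if not self.writable:
            raise PermissionError("read-only secondary instance")
        self.storage.data[key] = value

    def read(self, key):
        return self.storage.data[key]

storage = SharedStorage()
primary = Instance(storage, writable=True)    # data writing and processing
replica = Instance(storage, writable=False)   # e.g. serves external queries
primary.write("pv_count", 1024)
print(replica.read("pv_count"))               # 1024, visible with no data copy
```

Because the replica shares the primary's storage, a write on the primary is immediately readable on the replica, while the replica's compute load never competes with the primary's.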
4 Full Link Data Governance
In the early stage of enterprise development, or of enterprise data warehouse construction, the concern is to move in small, fast steps: quickly build the overall framework of the data warehouse, quickly meet business needs, and pursue lower cost and shorter delivery time. At this stage, most companies build data warehouses bottom-up from a development perspective, that is, basic ETL work. As enterprises and their data warehouses mature, as the digital transformation of traditional enterprises advances, and as data middle-platform construction enters deep water, the original "lean production" way of building data warehouses can no longer meet the requirements of standardized, sustainable development. Enterprise data warehouse construction has begun to shift to "agile manufacturing", with more emphasis on standardization, process, methodological guidance, and organizational management, using modern technologies and tools to maximize the value of people.
Against this background, Alibaba Cloud DataWorks has spent the past few years building a full-link data governance product system, aiming to give enterprises a one-stop platform that integrates data development and data governance; together with MaxCompute and Hologres it forms the cloud-native integrated data warehouse product solution. For data governance, on top of basic capabilities such as data quality, data security, and stability assurance, Alibaba Cloud DataWorks has recently built intelligent data modeling and Data Governance Center products, and has upgraded its open platform so that users and partners can implement custom data governance plug-ins and achieve personalized data governance.
Smart Data Modeling
Alibaba Cloud DataWorks has launched a new intelligent data modeling product based on Kimball's dimensional modeling theory and Alibaba's data middle-platform construction methodology. It forms a synergy with DataWorks' mature bottom-up data development (ETL) capabilities to help enterprises build standardized, sustainable data warehouses.
1. Data warehouse planning
Data warehouse planning is the foundation of data warehouse construction. The data warehouse planning tool in Alibaba Cloud DataWorks intelligent data modeling supports top-level design from business abstraction, including data warehouse layering and the definition of data domains, business processes, and data dimensions. This effectively solves the problems of a chaotic data warehouse structure and unclear rights and responsibilities.
2. Data Standards
Without data standards, data models lack a foundation. The data standards tool of Alibaba Cloud DataWorks provides definitions for data dictionaries, standard codes, units of measure, naming dictionaries, and more, and connects seamlessly with data quality rules so that compliance with the standards can be checked quickly.
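As a concrete illustration of how a naming dictionary can back a quality rule, here is a minimal sketch. The dictionary contents and the rule itself are invented; a real standard would be far richer.

```python
# Illustrative check tying a naming-dictionary data standard to a
# quality rule (dictionary contents and rule are assumptions).

NAMING_DICT = {"ods", "dwd", "dws", "order", "user", "amt"}

def check_table_name(name):
    """Pass only if every underscore-separated part is an approved word root."""
    return all(part in NAMING_DICT for part in name.split("_"))

print(check_table_name("dwd_order_amt"))   # True
print(check_table_name("dwd_ordr_amt"))    # False: 'ordr' is not in the dictionary
```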
3. Dimensional Modeling
Before professional modeling tools, most enterprises designed and recorded data models in documents. This works at first, but documents are hard to maintain and keep up to date; they become disconnected from the online system, and the online system's data model gradually gets out of control. Alibaba's early data warehouse construction faced the same problem, and practice proved that organizational discipline alone cannot guarantee strong consistency of the data warehouse model. To this end, Alibaba Cloud DataWorks built a dimensional modeling tool based on Kimball's dimensional modeling theory, providing visual forward modeling and reverse modeling. Reverse modeling turns the tables in an existing data warehouse back into a data model, on top of which the model can be iterated, helping enterprises solve the cold-start problem of data warehouse modeling. To improve efficiency, Alibaba Cloud DataWorks also provides a SQL-like data modeling language, so data engineers who prefer writing code can model quickly; it also greatly facilitates importing, exporting, backing up, and restoring data models.
4. Data indicators
Similarly, before professional data indicator management tools existed, indicators were typically created and managed with hand-written SQL, leading to inconsistent indicator definitions and indicators that were hard to reuse. The newly launched data indicator tool of Alibaba Cloud DataWorks provides definitions of atomic indicators and derived indicators, effectively guaranteeing consistent business indicators, enabling efficient production and reuse of indicators, and meeting enterprises' frequent needs to view and use data.
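The atomic/derived distinction can be sketched in a few lines: an atomic indicator is a base measure, and a derived indicator is that same measure restricted by dimensions such as time period or region. The data and function names below are invented for illustration.

```python
# Sketch of atomic vs derived indicators (data and names are illustrative).

orders = [
    {"day": 1, "region": "east", "amount": 10},
    {"day": 1, "region": "west", "amount": 20},
    {"day": 2, "region": "east", "amount": 30},
]

def atomic_gmv(rows):
    """Atomic indicator: total order amount, defined exactly once."""
    return sum(r["amount"] for r in rows)

def derived(rows, **filters):
    """Derived indicator: the atomic measure restricted by dimension filters."""
    kept = [r for r in rows if all(r[k] == v for k, v in filters.items())]
    return atomic_gmv(kept)

print(atomic_gmv(orders))               # 60
print(derived(orders, region="east"))   # 40
print(derived(orders, day=1))           # 30
```

Because every derived indicator reuses the single atomic definition, two teams asking for "east-region GMV" cannot accidentally compute it two different ways.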
Data Governance Center
Alibaba Cloud DataWorks has accumulated many data governance capabilities, including data quality management, data permission management, sensitive data protection, metadata management, data lineage, impact analysis, and baseline assurance. But using these tools well still depends on human experience. Many enterprises also find it hard to assess the effectiveness of data governance, and the performance of governance teams is hard to measure, so governance often becomes project-based, campaign-style, and unsustainable. To solve such problems, Alibaba Cloud DataWorks has launched the Data Governance Center product. Through a problem-driven approach, it helps enterprises proactively discover issues to be governed, guides users to optimize and solve them, and provides a scoring model for governance effectiveness, helping enterprises quantitatively assess the health of data governance and achieve an effective, sustainable governance process.
The Alibaba Cloud DataWorks Data Governance Center discovers issues to be governed along five dimensions: R&D standards, data quality, data security, compute resources, and storage resources. For these five dimensions, the product has a rich built-in set of governance-item scanning mechanisms that identify problems after the fact, for example discovering brute-force scan tasks or tables that have not been accessed for a long time; fixing these can greatly reduce compute and storage costs. The product also has a built-in check-item interception mechanism to detect and intercept problems beforehand and during the process. For example, in the task release phase, check items can intercept tasks that do not conform to predefined code standards, ensuring that the enterprise's R&D standards are enforced.
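A governance-item scan of the "table not accessed for a long time" kind is easy to sketch. The metadata fields and the 90-day threshold are invented for illustration, not the product's actual rule.

```python
# Minimal governance-item scan: flag long-unaccessed tables
# (fields and threshold are illustrative assumptions).

def find_stale_tables(tables, today, max_idle_days=90):
    """Return names of tables not accessed within the idle window."""
    return [t["name"] for t in tables
            if today - t["last_accessed_day"] > max_idle_days]

catalog = [
    {"name": "ods_orders",   "last_accessed_day": 360},  # touched 5 days ago
    {"name": "tmp_backfill", "last_accessed_day": 10},   # idle for ~a year
]
print(find_stale_tables(catalog, today=365))  # ['tmp_backfill']
```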
For these five dimensions, Alibaba Cloud DataWorks has designed a health-score evaluation model based on Alibaba's internal practice, which quantitatively measures the effectiveness of data governance. Enterprises can quickly identify their own weaknesses through the health score, govern them in a targeted way, and use the score for evaluation and assessment, making data governance sustainable and operable, targeted rather than impossible to start.
The Data Governance Center also provides management views from different role perspectives. The personal view lets data engineers quickly find problems with their own tasks and tables; the manager view lets project or team administrators see the issues of the whole project or team and plan and drive governance accordingly. Different team members each have their own position, unifying execution and management.
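One common shape for such a model is a weighted average over per-dimension scores. The weights and scores below are invented purely for illustration; the actual DataWorks scoring model is not public in this article.

```python
# Sketch of a weighted health-score model over the five governance
# dimensions (weights and scores are invented, not the real model).

WEIGHTS = {
    "rd_standards": 0.20, "data_quality": 0.25, "data_security": 0.25,
    "compute": 0.15, "storage": 0.15,
}

def health_score(dimension_scores):
    """Weighted average of per-dimension scores, each on a 0..100 scale."""
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())

scores = {"rd_standards": 80, "data_quality": 90, "data_security": 95,
          "compute": 70, "storage": 60}
print(round(health_score(scores), 2))
```

A single number like this makes it possible to compare teams, track trends quarter over quarter, and tie governance work to performance assessment.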
DataWorks Open Platform
No enterprise's data governance process is fully standardized, and the product capabilities of the Alibaba Cloud DataWorks Data Governance Center cannot possibly meet every data governance need. A complete data governance platform must therefore support a plug-in mechanism that lets enterprises customize data governance plug-ins. The governance items used for problem discovery and the check items used for problem interception in the Data Governance Center can both be regarded as data governance plug-ins, and DataWorks allows users to customize them.
To support custom data governance plug-ins, Alibaba Cloud DataWorks has upgraded its open platform, adding Open Event, Hook, and Extension capabilities on top of the original OpenAPI. You can subscribe to open event messages from the DataWorks platform through Kafka. For events in core processes, DataWorks provides an extension-point mechanism called a Hook: when such an event occurs, the system automatically pauses the process and waits while you receive the event message and handle it in your own way; you then call your processing result back to DataWorks through the OpenAPI, and DataWorks executes or blocks the subsequent process according to your result, giving you custom control over DataWorks processing. The program services that subscribe to events, process them, and call back the results are called extensions, or plug-ins. In this way you can implement all kinds of custom data governance plug-ins, such as task release checks or compute cost consumption checks.
Of course, the DataWorks open platform applies to far more than data governance plug-ins. Through the OpenAPI, open events, and the extension mechanism, you can quickly connect your own application systems to DataWorks, easily manage and control custom data processes, customize data governance and O&M operations, and respond promptly in your own application systems to business state changes in DataWorks. You are welcome to use your imagination to build industry- and scenario-specific data applications on the DataWorks open platform, to better serve yourself or your customers in building an enterprise data middle platform.
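The event -> extension -> callback flow can be sketched as a toy dispatcher. The message shapes, the check rule, and the dispatch function are all invented for illustration; a real extension would consume a Kafka topic and call back through the DataWorks OpenAPI rather than a local function call.

```python
# Toy model of the Hook/extension flow (shapes and rule are invented;
# not real DataWorks message formats or APIs).

def release_check_extension(event):
    """A custom check item: block releases whose code lacks a comment header."""
    ok = event["task_code"].lstrip().startswith("--")
    return {"event_id": event["event_id"], "decision": "pass" if ok else "block"}

def platform_dispatch(event, extension):
    """The platform pauses the process, waits for the extension's callback,
    then continues or blocks based on the returned decision."""
    result = extension(event)   # in reality: Kafka message out, OpenAPI callback in
    return result["decision"] == "pass"

event = {"event_id": "e1", "task_code": "SELECT 1"}
print(platform_dispatch(event, release_check_extension))  # False: release blocked

commented = {"event_id": "e2", "task_code": "-- owner: team-a\nSELECT 1"}
print(platform_dispatch(commented, release_check_extension))  # True: release proceeds
```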
Since 2009, Alibaba Cloud DataWorks has accompanied Alibaba's 12-year journey from data warehouse to data middle platform. The product has been tested and polished over a long period, distilling the best practices of Alibaba's big data construction. It has served external customers since 2015 and has so far supported the digital transformation of thousands of customers across ministries, local governments, central and state-owned enterprises, private enterprises, and other organizations. Through the newly released intelligent data modeling, Data Governance Center, open platform, and other data governance products, Alibaba Cloud DataWorks, together with MaxCompute and Hologres, forms a cloud-native integrated data warehouse solution. It further helps enterprises build modern data warehouses, ensures through effective data governance that the enterprise data warehouse develops in a standardized, secure, stable, and sustainable way, and effectively controls IT costs, so that enterprises can truly turn data into assets and let data create greater value.
Alibaba Cloud Big Data is a simple, easy-to-use, fully managed cloud-native big data service built for business agility, activating data productivity and generating business value through analysis. For details, visit: https://www.aliyun.com/product/bigdata/apsarabigdata
Copyright statement: The content of this article is contributed by Alibaba Cloud's real-name registered users. The copyright belongs to the original author. The Alibaba Cloud developer community does not own the copyright and does not assume the corresponding legal responsibility. For specific rules, please refer to the "Alibaba Cloud Developer Community User Service Agreement" and "Alibaba Cloud Developer Community Intellectual Property Protection Guidelines". If you find any content suspected of plagiarism in this community, fill out the infringement complaint form to report it. Once verified, this community will delete the allegedly infringing content immediately.
Knowledge Base Team