By Liu Tianyuan, Product Manager of DataWorks
This article is a part of the One-stop Big Data Development and Governance DataWorks Use Collection.
In the current business environment, an appropriate data governance policy is essential for extracting value from data. Statistics suggest that enterprise data volumes grow by roughly 50% every year, so governing and integrating enterprise data is becoming increasingly difficult.
DataWorks has always been committed to helping users develop and manage data more conveniently. The data modeling capability released this time complements the existing data governance capabilities, enabling governance before data development begins.
The following quote is from The Road to Big Data: Big Data Practice of Alibaba. "If we regard data as books in the library, we hope that they will be sorted on the bookshelf. If we regard data as the architecture of a city, we hope that the planning layout is reasonable. If we regard data as computer files and folders, we hope we have a good organization according to our preference, instead of a messy desktop, where we are constantly struggling to find a file."
These words make one point: standardized, reasonable organization is essential. Only such organization avoids the overlap and entanglement that ultimately create complex problems. Data models stored in a computing engine (data tables) are no exception.
In typical data middle platform and data governance projects, model design is an unavoidable step once the business and requirements have been sorted out. In our experience, middle platform and data warehouse construction projects involve 1,000-5,000 tables on average and hundreds of thousands (or even millions) of fields. Drawing a design or relationship diagram for thousands of tables is therefore critical. How do you draw it? What tools do you use?
Currently, most enterprises draw this blueprint with Excel; in many projects, models are recorded in spreadsheets. When there are few models, this is indeed a lightweight and easy choice. But when the enterprise's data business grows rapidly and reaches hundreds of models, this management method becomes very inefficient. An enterprise's data model may grow and change as fast as the business, and data personnel have no time to keep the spreadsheets up to date. Even if they spare the time and energy, the work is tedious and human error is likely. When managers need to look at the data system from a global perspective, the information they get is stale and error-prone.
In most cases, without a strong and operable standards tool, the phases of referencing predefined standards and implementing them, which ought to precede actual data output, are skipped entirely. Data standards are therefore difficult for enterprises to implement. Most enterprises build their data systems under a govern-after-development mindset, which leads to data that later becomes unusable.
If an enterprise has no clear data standards, developers will understand the business differently and build data models with different definitions and statistical calibers. This causes quality problems after data is produced, leaving many requirements unmet and forcing the data team to spend extra time fixing dirty data. Most enterprise data quality problems stem from missing standards or from standards that are only partially implemented.
Regarding data governance, especially pre-event governance, enterprises face two main problems:
By common definition, a data model is the mode in which data is organized and stored. It abstracts the characteristics of data and emphasizes reasonable access to data from the perspectives of business, data access, and use. Without a data model, it is difficult for stakeholders to see the structure of existing databases and understand key concepts.
Data models include conceptual, logical, and physical models.
A conceptual model is built by a company to sort out its workflows and describe the relationships among important entities. For example, an enterprise has multiple departments, each department has many employees, and each employee has a salary. A conceptual model can show the relationships among the three entities: department, employee, and salary.
A logical model is a further refinement of the conceptual model to determine the attributes and descriptions of each entity. For example, the employee model contains multiple attributes such as name, ID number, gender, and birthday.
A physical model is an engine-oriented model based on physical characteristics that takes concrete technical implementation factors into account. It is the form in which the model is actually stored in the engine.
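To make the logical-to-physical step concrete, here is an illustrative sketch (not DataWorks' actual API) that renders a logical entity, like the employee example above, into the physical DDL it might map to. All names and types are hypothetical.

```python
# Hypothetical logical entity based on the employee example above.
employee_entity = {
    "name": "employee",
    "attributes": [
        ("name", "STRING"),
        ("id_number", "STRING"),
        ("gender", "STRING"),
        ("birthday", "DATE"),
    ],
}

def to_ddl(entity: dict) -> str:
    """Render a logical entity as a simple CREATE TABLE statement."""
    cols = ",\n  ".join(f"{col} {typ}" for col, typ in entity["attributes"])
    return f"CREATE TABLE {entity['name']} (\n  {cols}\n);"

print(to_ddl(employee_entity))
```

The physical model would additionally carry engine-specific details (partitioning, storage format, lifecycle) that the logical model deliberately omits.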
DataWorks data modeling supports relational modeling (ER and 3NF) and dimensional modeling (star and snowflake schemas). No model type is inherently the best; each is simply more suitable for certain scenarios.
Relational modeling (ER or 3NF modeling) abstracts the enterprise's many entities from an enterprise-wide perspective rather than a business analysis perspective. It produces a model conforming to 3NF and describes the enterprise's business architecture as entities plus relationships. This method is standardized and virtually free of redundancy, which suits the strategic planning of large enterprises, but it is not friendly to BI integration or drill-down analysis. Moreover, the physical model design usually does not match the model the business actually needs, so implementation cycles are long, costs are high, and the demands on modelers are steep.
Dimensional modeling builds the model around decision-making and analysis requirements. It generally performs well on large, complex queries and can be directly oriented to the business. Typical representatives are the well-known star schema and, for some special scenarios, the snowflake schema. Dimensional modeling is relatively quick to learn and deliver, but querying across business units is difficult. It also produces many auxiliary and junk tables, so it is prone to table explosion and high maintenance costs.
Users can choose a modeling mode based on the enterprise's actual scenario. In our experience, most enterprises adopt both modes at once: the underlying layers use relational modeling, striving for minimal data, while dimensional modeling suits the upper layers, trading some redundancy for usability and ease of analysis.
DataWorks data modeling is an open and flexible tool that lets users freely choose a modeling theory to plan and design models. Linus Torvalds, the founder of Linux, described good programmers this way: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships." The same applies to data models. Once the enterprise's workflows are understood, big data can be used with high performance, low cost, high efficiency, and high quality only if data models are organized and stored in an orderly manner.
1. No matter what the project is, business research and requirements analysis should be carried out before the project starts. Business research aims to understand the enterprise's business better. For example, what are the similarities and differences among the various business areas and lines? What business modules can each business line be subdivided into? What are the specific processes of each module? This information is important and directly determines the success or failure of data warehouse construction.
Requirements research must start from real work scenarios. For example, what reports do analysts and operations staff read every day? What is the KPI of the recommendation system under construction? What data must that system be based on to achieve the KPI?
2. The Outline Design Phase: At this stage, the entity relationships in the enterprise's business processes must be sorted out and described at a high level. That means describing the dimension tables and fact tables graphically.
3. The Detailed Design Phase: Determine the attributes of each dimension and the measures of each fact table, and decide how those attributes and dimensions fill into the outline model from the preceding step.
4. The coding phase is the process of transforming the physical model into a DDL statement.
5. Send the generated DDL statements to the development environment to verify that they comply with the model design specifications. Once the test passes, the model can be published online to serve the data system of the entire enterprise.
6. The operation and maintenance phase lasts indefinitely. By this stage, the model has become a real table in the engine. To detect unexpected changes made to the engine through other channels in time, the differences between the model base and the engine must be checked and repaired regularly.
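The periodic drift check in the O&M phase can be sketched roughly as follows. This is a hypothetical illustration, not DataWorks' implementation: it compares the columns recorded in the model base against the columns actually present in the engine and reports the differences.

```python
def diff_model_vs_engine(model_cols: set, engine_cols: set) -> dict:
    """Report column drift between the model base and the engine."""
    return {
        "missing_in_engine": sorted(model_cols - engine_cols),
        "unexpected_in_engine": sorted(engine_cols - model_cols),
    }

# Hypothetical column sets: the engine has drifted from the model.
model_cols = {"order_id", "customer_id", "amount", "order_date"}
engine_cols = {"order_id", "customer_id", "amount", "pay_channel"}

print(diff_model_vs_engine(model_cols, engine_cols))
# order_date is missing from the engine; pay_channel was added outside the model.
```

In practice such a check would run on a schedule, and each reported difference would be either repaired in the engine or merged back into the model.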
If you want to manage the lifecycle of the data model better, the model management tools currently prevalent on the market provide the following capabilities:
For enterprises that already have offline models, you can import models into the client from model files (such as PD and ERwin files) or pull models from the big data engine directly into the client through reverse engineering.
You can plan and design the model through client or web client tools, and complete the model construction through standard reference in the design phase.
After the model design is completed in the development and testing phase, the model can be submitted to the development environment for testing. Historical versions can be managed as branches, so you can roll back to a historical version at any time.
Finally, after the model is tested and reviewed by relevant personnel, it can be published to the online production environment and start serving the enterprise data business.
Baihua Movies Online (Baihua) is a fictitious online movie website that currently maintains its sales data in the MaxCompute database.
Recently, it decided to implement a data middle platform strategy across the enterprise to improve business intelligence and support more reliable business decisions. The data team has experience building on-demand wide-table data marts, but IT teams and technical experts warned them about the expertise and funding needed to build and maintain an entire data middle platform, and stories of failed data middle platform projects circulate widely in the industry. Baihua wanted a reusable, high-quality data middle platform model built quickly and efficiently, so they used Datablau Data Modeler to design, develop, deploy, and maintain it. Let's walk through the process we followed to build their data model.
Currently, Baihua has no data warehouse: no layers, no subject domains, and no specifications. The data team is small. Faced with data demands, data analysts performed requirement analysis, metric decomposition, dictionary selection, wide-table creation, and report development themselves. This eventually led to a confusion of wide-table versions, unclear temporary and result tables, and muddled dimensions. Team members felt they had become number-crunching machines with nothing new to learn; staff gradually left, code became hard to maintain, and data quality problems occurred frequently.
After learning from past lessons and conducting research, we chose MaxCompute as the big data platform. Data standards, metrics, dimensions, data specifications, naming specifications, and data quality were all planned. The initial content is limited, but it will be enriched continuously as the project proceeds.
The planning solution defined a layered data model and used DDM Mapping to manage the mapping logic, avoiding the abuse of wide tables and improving the system's maintainability and quality.
Operational Data Store (ODS): The ODS layer is the data warehouse staging area that provides basic raw data for the DWD layer. Table and field names stay as consistent as possible with the business system, but an additional identifier is required to distinguish incremental from full tables: the suffix "_delta" indicates an incremental table.
Data Warehouse (DW): performs redundant processing of some tables.
Naming rules for inter-layer table names:
The customer operations department needs to count customer registrations and the age distribution across regions in previous years to run targeted advertising.
The analysis mainly requires annual order data to count the number of users by region, with secondary statistics using the customers' age distribution as a dimension.
This requirement is not complex. For the data source analysis, we mainly need data from the sales database Baihua_Sale plus profile data for regions and customer age groups.
The first step in building a data middle platform model with Datablau Data Modeler is to identify and model the source data. We need to create a data model project and use the Reverse Database icon on the data model toolbar to perform reverse engineering on Baihua_Sale's sales database.
This is what it looks like after reverse engineering of the data source of Baihua_Sale:
Note: The entity boxes in this model correspond to tables from the Baihua_Sale data source.
Examining the model, we can see that the entities' data dictionaries are missing. (This is also a symptom of absent enterprise data governance.) The relationships between the data are missing as well. How do we find the data we need to analyze among these scattered tables?
The approach is to model the source system data and build a data model of the business system, which lets us quickly understand the business and accurately analyze the business requirements.
The first step is to complete the data dictionaries; this is the first step to solving the problem. The physical naming in this library is fairly standard; the situation you encounter may well be worse. We collected the missing data dictionaries by visiting the source system development team, sorted them into the DDM Excel data dictionary format, and imported them. Fortunately, this covered 90% of the data. For the remaining 10%, we sent a help email to the original system development team; they were very kind and gave feedback quickly. At the same time, we supplemented the data dictionaries with the system's business module information.
In the second step, business theme modeling was carried out for this model, and three themes were built.
Product Data (Product): Identify master data and reference data related to products and business relationships between the data
Customer Data (Customer): Identify the customer-related master data and reference data and the business relationships between the data
Purchase Transaction (Business): Identify transaction data for online ordering and leasing by customers. This is the source of the project's transaction fact table.
(Logical Model of Subject Domain)
We have created, validated, and deployed the source data model of Baihua_Sale.
The next step is to design a dimensional model that will serve as the universal public model of the Baihua_Sale data middle platform. You can use the Entity tool in the data model toolbox to design a model from scratch.
According to the requirements, we merge the order table and the payment table into an order fact table. The customer and movie tables are moderately redundant, as are the address tables. The resulting design is this star-schema dimensional model.
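The order/payment merge can be illustrated with a tiny sketch. The field names and sample rows below are hypothetical; the point is simply that order and payment records are joined on the order key to produce one fact row per order.

```python
# Hypothetical order and payment records from the source system.
orders = [
    {"order_id": 1, "customer_id": 10, "film_id": 7, "order_date": "2021-03-01"},
    {"order_id": 2, "customer_id": 11, "film_id": 8, "order_date": "2021-03-02"},
]
payments = [
    {"order_id": 1, "amount": 9.9, "pay_date": "2021-03-01"},
    {"order_id": 2, "amount": 19.9, "pay_date": "2021-03-03"},
]

# Join payments to orders on order_id to form the order fact rows.
pay_by_order = {p["order_id"]: p for p in payments}
fact_order = [{**o, **pay_by_order[o["order_id"]]} for o in orders]

print(fact_order[0])
```

In the warehouse itself this merge would of course be an SQL join materialized into the fact table, not in-memory Python.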
You can change the type to fact or dimension by right-clicking the entity, hovering over the entity type in the context menu, and selecting the appropriate type from the given options.
According to the analysis requirements, the field year in the order table is derived from the order date, and the horoscope in the customer table is derived from the customer's ID number. This label is used to analyze the impact of the horoscope on orders.
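A hypothetical sketch of these two derived fields: the order year comes from the order date, and the customer's (western) horoscope sign from the birthdate embedded in a mainland-ID-style number, whose digits 7-14 hold YYYYMMDD. All sample values are made up.

```python
# SIGNS[m-1] is the sign in effect at the start of month m;
# CUTOFF[m-1] is the last day of month m still belonging to that sign.
SIGNS = ["Capricorn", "Aquarius", "Pisces", "Aries", "Taurus", "Gemini",
         "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius",
         "Capricorn"]
CUTOFF = [19, 18, 20, 19, 20, 21, 22, 22, 22, 23, 22, 21]

def order_year(order_date: str) -> int:
    """Derive the year field from an ISO-style order date."""
    return int(order_date[:4])

def horoscope_from_id(id_number: str) -> str:
    """Derive the horoscope sign from the ID number's birthdate digits."""
    month, day = int(id_number[10:12]), int(id_number[12:14])
    return SIGNS[month - 1] if day <= CUTOFF[month - 1] else SIGNS[month]

print(order_year("2021-06-15"))                 # 2021
print(horoscope_from_id("110101199003074321"))  # born March 7 -> Pisces
```

In the model, both would be computed in the mapping SQL rather than in application code; this only shows the derivation logic.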
Open the data mapping manager:
Step 1: Create a new data mapping:
Step 2: Select the destination table in the model library. The Film table is selected by default on entry:
Step 3: Source model. Select the model on the ODS side in the model library and select the association table related to Film:
Step 4: The tool follows the relationships between model entities to build join entities and associated fields. These can also be adjusted manually.
Step 5: Map the target fields. The tool maps fields automatically by name; remaining fields that cannot be auto-mapped must be mapped manually in the editor.
Step 6: Generate the SQL statements and a relation graph of the mappings.
SQL statements can be previewed and exported.
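The name-based auto-mapping in Step 5 can be sketched as follows. This is an assumption about how such matching typically works, not Datablau's actual algorithm, and the field names are hypothetical.

```python
def auto_map(target_fields, source_fields):
    """Match target fields to source fields by identical name;
    return the matches plus the fields left for manual mapping."""
    source = set(source_fields)
    mapped = {f: f for f in target_fields if f in source}
    unmapped = [f for f in target_fields if f not in source]
    return mapped, unmapped

target = ["film_id", "film_name", "category", "release_year"]
source = ["film_id", "film_name", "category_id", "release_year"]

mapped, unmapped = auto_map(target, source)
print(mapped)    # exact name matches
print(unmapped)  # ['category'] must be mapped by hand (to category_id here)
```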
Following the preceding steps, create mappings for the other four DWD tables.
The next step is to design an analytical dimensional model that will serve as the target model for customer preference analysis.
According to the requirements, we use the three attributes of city, horoscope, and year as dimension fields and the customer count and amount as measures, building a wide table for customer analysis in this area.
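A minimal sketch of that wide-table aggregation, with hypothetical fact rows: group order facts by city, horoscope, and year, then compute the distinct customer count and total amount per group.

```python
from collections import defaultdict

# Hypothetical order fact rows already joined with customer attributes.
facts = [
    {"city": "Hangzhou", "horoscope": "Pisces", "year": 2021, "customer_id": 10, "amount": 9.9},
    {"city": "Hangzhou", "horoscope": "Pisces", "year": 2021, "customer_id": 11, "amount": 19.9},
    {"city": "Beijing",  "horoscope": "Leo",    "year": 2021, "customer_id": 12, "amount": 5.0},
]

groups = defaultdict(lambda: {"customers": set(), "amount": 0.0})
for row in facts:
    key = (row["city"], row["horoscope"], row["year"])
    groups[key]["customers"].add(row["customer_id"])  # distinct customers
    groups[key]["amount"] += row["amount"]

wide = [
    {"city": c, "horoscope": h, "year": y,
     "customer_count": len(g["customers"]), "amount": round(g["amount"], 2)}
    for (c, h, y), g in groups.items()
]
print(wide)
```

In the warehouse this would be a GROUP BY over the three dimension fields; the sketch just makes the grouping and the two measures explicit.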
Establish the mapping relationship using the method from Section 4.1, as shown in the following figure:
The dimension column in the layout builder provides a complete list of options for dimension entities. It contains the following information:
Now that the dimensional model is ready, we will validate and deploy it for further use.
The ETL scheduling tool imports the new SQL statements to load and run the data in a distributed manner.
Finally, the process of refreshing this data is automatically performed by the scheduling tool's built-in Job Scheduler. In the Scheduler tab, you can create a new plan to automatically execute the scheduling process at a given scheduling frequency. In this case, we have arranged to refresh the sales data every day.
Use BI tools to analyze the sales data and gain valuable business insights.
The DataWorks intelligent data modeling module covers four aspects: data warehouse planning, data standards, data modeling, and data metrics. It can accelerate the data warehouse's design and dimensional modeling and improve the normalization and standardization of the data middle platform. Also, it can comprehensively complete the design and development of data warehouses.
The basic planning of data warehouse design includes data layering, business classification, data domain, and business process.
DataWorks provides a visual dimensional modeling tool that supports forward and reverse-assisted modeling for a variety of big data engines. It can send models designed in the tool directly to the engine, and extract models that already exist in the engine into the tool for re-editing and re-publishing. It provides overall management, modeling, and publishing capabilities, avoiding the tedious manual import and export of traditional tools. For standards enforcement and monitoring, it supports baseline checks on models already materialized in the engine, helping you find and correct inconsistencies between the table structure and the physical model structure in time. DataWorks provides both a professional local client and a lightweight web client to suit different work scenarios.
Data standard is the basic guarantee of the data model and metric standardization. It can provide a unified specification that ensures the consistency, accuracy, and completeness of data definitions and usage. The DataWorks Data Standard module allows you to customize data dictionaries, standard codes, and measurement units. It allows enterprise managers to define their data standards and allows data modelers to refer to these standards when building models to solve the problem of inconsistent caliber caused by modelers' different data understanding.
Data metrics include atomic metrics, modifiers, time periods, derived metrics, and aggregate tables. DataWorks supports the design and definition of atomic and derived metrics to keep business calibers uniform. Metric query code is generated automatically, and business aggregation and management are realized through the convergence of aggregate tables.
· Atomic Metric: A basic measure within a business process or activity that cannot be split further. It specifies the statistical caliber and computational logic of the business and measures the business status.
· Modifier: It indicates the business scope of statistics. It is a business filter for statistical indicators.
· Time Period: It indicates the time range of statistics, which is the time limit for statistical metrics.
· Derived Metric: Combines an atomic metric with a time period and modifiers. It reflects the business status of a business activity within a specified time period and target scope.
· Aggregate Table: It is used to organize multiple derived metrics of the same time period and the same dimension in a data domain. The service aggregation logic is formed to provide a basis for subsequent service queries, OLAP analysis, and data distribution.
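The composition above (derived metric = atomic metric + time period + modifier) can be sketched as follows. The metric names and records are hypothetical; this only illustrates the definitions, not DataWorks' generated code.

```python
# Hypothetical payment records.
orders = [
    {"pay_date": "2021-11-01", "channel": "app", "amount": 30.0},
    {"pay_date": "2021-11-02", "channel": "web", "amount": 50.0},
    {"pay_date": "2021-10-15", "channel": "app", "amount": 20.0},
]

def atomic_payment_amount(rows):
    """Atomic metric: total payment amount."""
    return sum(r["amount"] for r in rows)

def in_period(r):
    """Time period: November 2021."""
    return r["pay_date"].startswith("2021-11")

def matches_modifier(r):
    """Modifier: app channel only."""
    return r["channel"] == "app"

def derived_metric(rows):
    """Derived metric = atomic metric + time period + modifier."""
    return atomic_payment_amount(
        [r for r in rows if in_period(r) and matches_modifier(r)])

print(derived_metric(orders))  # 30.0: app-channel payments in November 2021
```

An aggregate table would then store many such derived metrics for the same period and dimension side by side.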
Before data modeling is introduced, data is synchronized to the big data engine in offline or real-time mode. Then data production, development, and scheduling begin, and data is produced regularly. Finally, the data flows back to databases or OLAP engines to serve queries, or is exposed directly to the business through Data Service APIs.
DataWorks' data governance functions run through all three of these major processes: checking source data writes when data is migrated to the cloud, checking data quality and output timeliness during production, and controlling permissions on data usage. These comprehensive capabilities are mostly in-event and post-event governance.
The newly published data modeling module adds the step of defining data forms up front, complementing users' pre-governance capability and providing an overall model management solution. In this step, users define enterprise business standards and specifications based on their understanding of enterprise workflows and demand research. Based on the data standards in modeling, users can predefine standards, implement them, and generate table structures, unifying model management and putting pre-event governance into practice.
Models already in the service-providing engine can also be loaded into the data modeling module by reverse modeling. Then, they are submitted back to the engine after unified correction and improvement. The entire process follows the development process specifications defined by DataWorks.
In the DataWorks standard mode space, developers develop and submit code and perform smoke testing in the development environment. After the test result is correct, the submitted code is reviewed by a third party or a third person familiar with the business, such as O&M and deployment personnel or administrators. If the submitted code does not affect the stability of the business system and meets business expectations, it can be published to the production environment and start recurring scheduling.
After the release of data modeling, DataWorks added the model designer role. People granted this role can log on to the DataWorks data modeling module and specialize in model development.
First, the data management team leader, who is also the space administrator, defines the data standards. Second, the data modeler designs the model, referencing the standards defined by the data manager to predefine and implement them while creating the table structure and every field, forming a physical model that can be sent to the engine.
Then, the developer converts the physical model into DDL statements and submits them to the engine of the development environment. Finally, the O&M, deployment, or administrator roles review the DDL statements sent to the engine and, if the statements are error-free, approve publishing to the production environment.
At this point, a basic, relatively professional model has been designed and implemented, and the previous development process can still be reused. The difference is that the creation and modification of data tables are now managed more rigorously.