DataWorks Annual Release: Full-Link Data Governance

Building on Alibaba Cloud's integrated big data and AI platform, this article introduces the new capabilities of DataWorks, the end-to-end full-link data development and governance platform, from the tooling perspective. It returns to the essence of tools serving people and aims to comprehensively improve the efficiency of front-line data developers and business personnel.

Welcome to this year's Yunqi Conference. In big data and AI, we often focus on machine performance: how many AI training resources and how many big data computing resources are consumed. Those efficiency gains are easy to perceive, but improving data efficiency across the board depends on human efficiency as much as machine efficiency. Building on Alibaba Cloud's integrated big data and AI platform, today I will introduce the new capabilities of DataWorks, the end-to-end full-link data development and governance platform, from the tooling perspective. The aim is to return to the essence of tools serving people and to comprehensively improve the efficiency of front-line data developers and business personnel.

Let us start with a set of figures. DataWorks has served more than 10,000 enterprise customers across industries such as industrial manufacturing, energy, automotive, finance, retail, government affairs, and the Internet. They range from large central enterprises, state-owned enterprises, and Fortune 500 companies to small and medium-sized businesses founded only one or two years ago. In terms of breadth, the platform can meet big data development and governance needs across industries and at every stage of enterprise growth. Meanwhile, as big data construction moves into deeper waters, data governance is drawing more and more attention. DataWorks has productized the data governance experience accumulated over years inside Alibaba Group and now offers it on Alibaba Cloud, where it has found more than 1 million data problems for customers; we return to this in detail later. In traditional data development, the number of tasks scheduled stably on the public cloud every day has exceeded 10 million, providing a strong guarantee for large-scale enterprise data production.

Behind these figures is the full-link data development and governance platform that DataWorks has built over more than ten years. We have been committed to building enterprise-level data warehouses and data lakes, supporting the lakehouse (integrated lake and warehouse) architecture, and speeding up the digital transformation of enterprises. On top of big data engines such as Alibaba's self-developed ODPS integrated big data intelligent computing platform (MaxCompute/Hologres) and the open source big data platforms EMR/CDP, DataWorks provides a unified full-link big data development and governance service for data warehouse, data lake, and lakehouse solutions. This year, DataWorks added support for the new Datalake cluster launched by EMR, covering full-link data lake development and governance from data ingestion into the lake through modeling, development, scheduling, governance, and analysis services. The resulting data lake solution earned full marks and ranked first in the evaluation by the China Academy of Information and Communications Technology (CAICT).

The following sections cover the new features of DataWorks in four areas: data development, data analysis and services, data governance, and platform openness.

Standardized, real-time and intelligent data development platform

In the standardization part, we focus on standardization throughout the data development process. When an enterprise builds a data warehouse or data middle platform, a great deal of business knowledge is involved, and much of it lives only in employees' heads. As people move on and teams change, that knowledge is gradually lost, or takes a long time to hand over. An enterprise-level data platform should therefore accumulate not only the data itself but also the business knowledge behind it. Last year DataWorks released its data modeling product. This year we upgraded forward modeling, reverse modeling, semantic modeling, and related capabilities to solve the cold-start problem of the data warehouse and lower the barrier to modeling, and we also distilled experience with data models and data indicators into industry data model templates. By productizing and systematizing the experience of different industries, we hope an enterprise's data knowledge and assets stay deposited in the data platform and support the sustainable development of its data business.

In the real-time part, advances in technology and computing power have made real-time processing a necessity. DataWorks has added real-time writes to Hologres from data sources such as MySQL, ClickHouse, OceanBase, and Kafka, as well as real-time ingestion into the data lake, for example from MySQL to OSS. Unlike traditional offline synchronization, DataWorks Data Integration supports integrated full and incremental synchronization and merges the data automatically, which improves the efficiency of data synchronization.
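
To make the idea of integrated full and incremental synchronization concrete, here is a minimal Python sketch of merging a full snapshot with an ordered stream of change events. It is an illustration only and does not describe how DataWorks Data Integration is implemented: the event shape (operation, primary key, row) and the function name merge_full_and_incremental are assumptions made for this example.

```python
from typing import Dict, Iterable, Optional, Tuple

Row = Dict[str, object]
# Hypothetical change-event shape: (operation, primary_key, row_or_None)
Change = Tuple[str, int, Optional[Row]]


def merge_full_and_incremental(
    snapshot: Dict[int, Row], changes: Iterable[Change]
) -> Dict[int, Row]:
    """Apply ordered change events on top of a full snapshot keyed by primary key."""
    merged = dict(snapshot)  # shallow copy so the input snapshot is not mutated
    for op, key, row in changes:
        if op in ("insert", "update"):
            merged[key] = row
        elif op == "delete":
            merged.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Full load first, then the incremental log captured after the snapshot.
full = {1: {"name": "a"}, 2: {"name": "b"}}
log = [
    ("update", 1, {"name": "a2"}),
    ("delete", 2, None),
    ("insert", 3, {"name": "c"}),
]
result = merge_full_and_incremental(full, log)
```

The sketch assumes that changes arrive in commit order; a real pipeline would also need to handle ordering, retries, and schema changes.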

In the intelligent part, the big data system as a whole is very complex, and data engineers spend most of their day on data development and on operation and maintenance (O&M). We care a great deal about using intelligent methods to make both more efficient. DataWorks provides intelligent SQL programming hints, such as associated field completion, code error prompts, and SQL logic visualization, which can improve SQL programming efficiency by more than 35%. O&M problems, meanwhile, often come with serious consequences such as data errors and business alarms. DataWorks provides DAG aggregation analysis that visualizes the waiting, running, and successful states of each periodic task and each scheduled periodic instance, so engineers can quickly check how the upstream and downstream of a problem task are running. It also offers full-link intelligent diagnosis of a task, covering dependencies, scheduled checks, scheduling resources, engine resources, data quality rules, and more, to help data engineers quickly locate and resolve O&M problems. Furthermore, DataWorks grades tasks through baselines. When resources are contended, it gives scheduling and computing resources first to highly guaranteed core tasks so that they produce output on time and stably. Based on historical task runs, DataWorks intelligently monitors task execution and raises alerts before problems occur, so data engineers gradually shift from passive handling to proactive prevention. With the capabilities accumulated in data development and O&M, a data engineer today can get more done on DataWorks than ever before and spend more time on delivering business requirements and value.
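
The baseline idea of giving resources to core tasks first can be sketched as a simple priority queue. This is a toy model, not the DataWorks scheduler: the integer priorities, the task names, and the function pick_tasks are all hypothetical, and a larger number is assumed to mean a more critical baseline.

```python
import heapq
from typing import List, Tuple


def pick_tasks(tasks: List[Tuple[str, int]], slots: int) -> List[str]:
    """Return the tasks that receive a slot, highest priority first.

    tasks: (name, priority) pairs; a larger priority means more critical.
    Ties are broken by submission order.
    """
    # heapq is a min-heap, so negate the priority to pop the largest first.
    heap = [(-priority, order, name) for order, (name, priority) in enumerate(tasks)]
    heapq.heapify(heap)
    chosen: List[str] = []
    while heap and len(chosen) < slots:
        _, _, name = heapq.heappop(heap)
        chosen.append(name)
    return chosen


waiting = [
    ("adhoc_report", 1),
    ("core_revenue", 8),
    ("log_cleanup", 1),
    ("core_orders", 8),
]
scheduled = pick_tasks(waiting, slots=2)
```

With only two slots available, the two core tasks are scheduled and the low-priority tasks wait, which is the behavior the baseline mechanism aims for under contention.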

Low code, visual data analysis and service

Once data has been processed and produced, it can be consumed, shared, and applied to maximize its value. DataWorks has released an SQL query and analysis tool with a new UI and interaction design, enabling data analysts and business personnel to retrieve and analyze data on their own and greatly reducing the burden on ETL engineers. It provides efficient data processing for different analysis scenarios. First, query results are automatically converted into visual charts, which helps business personnel quickly grasp the overall picture and trends and saves them from building charts by hand. If secondary processing is needed, the spreadsheet supports common operations such as sorting and filtering directly on the page, without downloading the data. If more complex processing is needed, data analysis also provides convenient data upload and download, with control over data permissions.

Data services are an important link between data and upper-layer applications. For developers and data analysts, we provide a full set of low-code tools for building data APIs. This year we launched a new query acceleration service. Built on the capabilities of Hologres, it accelerates queries on MaxCompute tables directly, without exporting data to other online databases, which simplifies the architecture and greatly reduces the extra storage and computing costs caused by data export.

Active and continuous full link data governance

With all of the above, the initial stage of big data construction is complete. But once better tools and platforms sit on top of a powerful computing engine, data accumulates very quickly and the cost of data on the whole platform soars. The next challenge for the data platform is how to govern data and reduce costs. The DataWorks Data Governance Center was officially released for commercial use this year, and it is built on two core concepts.

The first is active governance: we need to reduce the "pollute first, govern later" and "develop first, govern later" patterns. DataWorks integrates data governance into each specific step of data development and has many built-in check items, such as "Disable SELECT *": when a developer runs SQL that uses "SELECT *", the platform issues a reminder and blocks the operation. More complex rules also exist, such as the "table structure consistency check", which intercepts a submission if the table structures of the development and production environments differ, so that production tasks do not fail or cause data quality problems at run time. Check items prevent some data governance problems at the source. In addition, because many enterprises do not know where to start with data governance, we proactively help them find the governance problems they already have. These governance items come from data governance experience accumulated at Alibaba Group, such as missing data quality monitoring, missing lifecycle settings, data that has not been accessed for a long time, and tasks that wait for a long time, and they guide the enterprise to resolve problems step by step and item by item. As mentioned at the beginning, the DataWorks Data Governance Center has found more than 1 million data governance problems for cloud customers so far, of which more than 60% have been resolved.
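
The two check items named above can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not the built-in DataWorks rules: the regular expression is a heuristic that does not parse SQL (comments, string literals, or "*" later in the select list can fool it), and the function names and the column-to-type dictionaries are hypothetical.

```python
import re
from typing import Dict, List

# Heuristic pattern: SELECT, optional DISTINCT, then "*" or "alias.*".
SELECT_STAR = re.compile(r"\bselect\s+(?:distinct\s+)?(?:\w+\.)?\*", re.IGNORECASE)


def uses_select_star(sql: str) -> bool:
    """Return True if the statement appears to select '*' or 'alias.*'."""
    return SELECT_STAR.search(sql) is not None


def schema_diff(dev: Dict[str, str], prod: Dict[str, str]) -> List[str]:
    """List differences between two {column: type} table definitions."""
    problems: List[str] = []
    for col in sorted(dev.keys() | prod.keys()):
        if col not in prod:
            problems.append(f"{col}: missing in production")
        elif col not in dev:
            problems.append(f"{col}: missing in development")
        elif dev[col] != prod[col]:
            problems.append(f"{col}: type {dev[col]} vs {prod[col]}")
    return problems


blocked = uses_select_star("SELECT * FROM orders")
allowed = uses_select_star("SELECT COUNT(*) FROM orders")
diff = schema_diff(
    {"id": "bigint", "amount": "double"},
    {"id": "bigint", "amount": "decimal"},
)
```

A check like this would run before a task is executed or published: a True result from uses_select_star or a non-empty schema_diff would block the operation and show the reason to the developer.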

The second is continuous governance. Once active governance has surfaced the problems, the next question is how to run data governance over the long term, so that it does not become a phased, campaign-style effort. For an enterprise's big data team, data governance is not only a technical issue but also an organizational and management issue. The DataWorks Data Governance Center provides a complete data governance health score model. The model also comes from practice inside Alibaba Group. It covers five areas (research and development, storage, computing, security, and quality), has nearly 100 scoring dimensions, and evaluates enterprise data governance work in quantitative terms. Based on the health score, the enterprise's data governance committee (the data platform team, business teams, and collaborating teams such as risk control and finance) can set a common goal, for example raising the health score from 80 to 90. Business and production teams can then carry out governance optimization and raise data governance requirements with the data platform team, and the organization can use the health score to run long-term programs such as data governance campaigns, data governance competitions, and a data governance college. The organization gains a quantifiable measure, and departments and employees share common goals.
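
As a rough illustration of how a health score can turn governance into a measurable goal, the sketch below combines per-area scores into one number with a weighted average. The actual DataWorks model has nearly 100 scoring dimensions and its formula is not given here, so the equal weights, the sample scores, and the function health_score are assumptions for illustration only.

```python
from typing import Dict, Optional

# The five areas named in the text.
AREAS = ("development", "storage", "computing", "security", "quality")


def health_score(
    scores: Dict[str, float], weights: Optional[Dict[str, float]] = None
) -> float:
    """Weighted average of per-area scores, each on a 0-100 scale."""
    if weights is None:
        weights = {area: 1.0 for area in AREAS}  # assumed: equal weights
    missing = [area for area in AREAS if area not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    total_weight = sum(weights[area] for area in AREAS)
    return sum(scores[area] * weights[area] for area in AREAS) / total_weight


current = {
    "development": 80,
    "storage": 70,
    "computing": 90,
    "security": 85,
    "quality": 75,
}
score = health_score(current)
gap_to_target = 90 - score  # distance to the example goal of 90
```

With a single number per team or project, a governance committee can track progress toward a target and see which area (here, storage) is pulling the score down.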

By finding problems actively and operating governance continuously, DataWorks turns data governance from written rules and regulations into a working tool that is closely tied to day-to-day work, creating a positive cycle of enterprise data governance.

Open and scalable enterprise data platform

Finally, an enterprise-level data platform should stay open and extensible, both for internal business teams and for external partners. This year DataWorks upgraded its entire open platform and, building on OpenAPI, launched further open capabilities including OpenEvent, Extensions, and the Migration Assistant. DataWorks now provides more than 100 APIs, so users can call DataWorks platform capabilities in a customized way and integrate their internal applications with DataWorks. OpenEvent delivers the various status changes in DataWorks to users as messages, so they can subscribe and respond in their own way. For example, you can subscribe to table changes through OpenEvent to monitor core tables in real time, or subscribe to approval center events to connect the enterprise's internal approval process and implement custom approval flows. The core of Extensions is the ability to redefine behavior. Some capabilities that DataWorks provides today may not meet the specific requirements of every enterprise; the data governance requirements of an Internet company and of a traditional industry are bound to differ. In such cases, enterprises can use extensions to define customized capabilities, including data governance capabilities, that fit their own business. For example, some enterprises have strict code release processes and need to add code review. When a user clicks submit on a node, the node enters the custom code review process instead of being submitted directly to the development environment for verification, and it is submitted to the development environment only after the custom review passes. Finally, the Migration Assistant migrates all kinds of tasks. In addition to scheduling engines such as Oozie, Azkaban, and Airflow, we added DolphinScheduler migration this year, and we plan to open source the Migration Assistant. Enterprises can move in and out easily, from platform to platform and from cloud to cloud.
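
The subscribe-and-respond pattern behind OpenEvent and Extensions can be sketched as a small message dispatcher. Everything specific in this example is hypothetical: the event type names ("table.changed", "node.submit"), the JSON message layout, the payload fields, and the return values are invented for illustration and are not the real DataWorks message format or API.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], str]
HANDLERS: Dict[str, Handler] = {}

CORE_TABLES = {"dwd_orders"}  # hypothetical list of tables to watch


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Register a handler for one (hypothetical) event type."""
    def register(fn: Handler) -> Handler:
        HANDLERS[event_type] = fn
        return fn
    return register


@on("table.changed")
def handle_table_change(payload: dict) -> str:
    # Alert only when a watched core table changes.
    table = payload.get("table", "")
    if table in CORE_TABLES:
        return f"alert: core table {table} changed"
    return "ignored"


@on("node.submit")
def handle_submit(payload: dict) -> str:
    # Block the submission until the internal code review has approved it.
    return "pass" if payload.get("review_approved") else "block"


def dispatch(message: str) -> str:
    """Route one JSON message to its handler; unknown types are ignored."""
    event = json.loads(message)
    handler = HANDLERS.get(event.get("type"))
    if handler is None:
        return "ignored"
    return handler(event.get("payload", {}))


alert = dispatch(json.dumps({"type": "table.changed", "payload": {"table": "dwd_orders"}}))
verdict = dispatch(json.dumps({"type": "node.submit", "payload": {"review_approved": False}}))
```

The first handler mirrors the core-table monitoring example, and the second mirrors the custom code review example, where a submission is held back until the enterprise's own review has passed.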

There is more than one way to do data governance. DataWorks provides Alibaba's best practices in data governance, and through the DataWorks open platform it also aims to give customers and partners stronger customization capabilities, so that every industry can carry out data governance more efficiently with the tool platform.

At this year's Yunqi Conference, many customers also presented their best practices for digital transformation with DataWorks and various big data engines.

Building on Alibaba Cloud, AIA Life has built a financial data middle platform that has handled a 10-fold peak in business traffic and improved data processing efficiency by 20 times. The enterprise's overall computing cost has been reduced by millions.

Voice Connect, "the King of Africa", has provided strong support for the Group's Internet business. Data governance efficiency has improved by 2 to 3 times, supporting more than 95% of the Group's business growth and helping more Chinese corporate brands reach global emerging markets.

Nezha Auto is gradually improving its data governance and data lake capabilities. Relying on a stable, reliable big data platform with excellent performance and elastic scaling, it will support more than 600,000 vehicles and petabyte-scale data analysis in the future.

Sanqi Mutual Entertainment uses the DataOps concept to activate data value and build an automated, agile, and value-oriented data system. This addresses data consumption pain points such as difficult data acquisition, slow business response, and limited data scenarios, and uses data to drive refined operations.

Data governance is a huge topic with many aspects. But let us return to our theme: efficiency first, and the essence of tools serving people. This year we released new full-link data governance features designed from the perspective of people. We hope that, through the tool platform, enterprise developers can cut inefficient repetitive work, the efficiency of data personnel can keep spiraling upward, and enterprises can improve data efficiency across the board while reducing costs.
