DataWorks is a big data research and development platform that uses MaxCompute as its main computation engine. Together, these two Alibaba Cloud products cover data integration, modeling, growth, operations tracking, data processing, and security, among other features. Combined with the algorithm platform PAI, DataWorks provides a robust big data analytics solution that covers everything from big data creation to data mining and machine learning.
Independently developed by Alibaba, DataWorks is used every day by tens of thousands of data and algorithm development engineers to build and administer 99% of the data-driven and data-focused business operations of Alibaba Group.
Initially released in 2010, DataWorks has undergone many technological changes and architecture upgrades on the way to its current version, and has accumulated a great deal of historical baggage as a result. Technological innovation and business development often complement each other, but they can also restrict each other and cause various problems. The latter is the case with DataWorks. The product has some long-standing problems, including slow access, extensive code changes required to fix a single bug, and environmental complexity. Previous iterations did not fundamentally upgrade DataWorks or resolve these problems. Rather, they only improved performance, optimized the underlying engineering structures, and reduced repeated code.
This article will take a look at how we can resolve some of the problems that have plagued DataWorks by adopting the wildly popular microservice architecture and explore how we can transform the technical architecture of DataWorks in a practical manner while avoiding jumping through several complicated engineering hoops.
The DataWorks R&D platform provides a range of functions to assist with daily development work. Users can experience the design features of various functions when using the platform. This is something that is still lacking in platform R&D in general. The PD and the user experience designer (UED) collect requirements and try out the functions themselves. However, without a background in data development, the PD and UED cannot experience the subtle disappointment that is unique to data developers after long-term use. The usage of the DataWorks R&D platform also varies greatly across sectors, such as finance, banking, government, large state-owned enterprises, Internet companies, traditional enterprises, private enterprises, and education. Some customers may not know how to use DataWorks. Moreover, users have different needs, knowledge, and skills.
After frontline delivery teams or companies apply DataWorks in fields we have not considered, requirements are collected from these industries and sent to the PD for analysis. Frontline teams can package some DataWorks APIs and provide them as products to customers in specific industries to help solve their problems.
New products are being planned. The engine team uses DataWorks to improve the user-friendliness of designed products. It is difficult to scale up DataWorks to meet the requirements of product planning and improvement if only developers are working according to the schedule. Considering the frontend and backend architectures and countless instances of cooperation and competition, we need to achieve a technical revolution to break away from the SOA and introduce more user-side R&D capabilities. We hope this will allow us to make DataWorks more robust.
In this blog, you will work on a massive log data analysis task. By doing so, you will learn how to synchronize data from different data sources to MaxCompute, how to quickly trigger task runs, and how to view task logs.
This blog introduces how to process log data that has been collected into MaxCompute through DataWorks. In this section, you will learn how to run a data flow chart, how to create a new data table, and how to configure periodic scheduling properties.
This blog introduces how to perform data quality monitoring. This section mainly goes over how to monitor data quality while using the data workshop, set up quality monitoring rules, and monitor alerts and tables.
MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.
DataWorks is a Big Data platform product launched by Alibaba Cloud. It provides one-stop Big Data development, data permission management, offline job scheduling, and other features.
DataWorks works straight out of the box, so you do not need to worry about setting up complex underlying clusters or about operations and management.
DataWorks provides workspaces in basic mode and standard mode for you to develop data under different security control requirements. This topic describes the differences between and access accounts for workspaces in basic mode and standard mode.