A Deep Understanding of DataWorks

DataWorks supports over 99% of Alibaba's data development. Learn how it handles data acquisition, data processing, and data quality monitoring.

DataWorks is a big data research and development platform that uses MaxCompute as its main computation engine. Together, these two Alibaba Cloud products also cover data integration, modeling, growth, operations tracking, data processing, and security, among other features. Combined with the algorithm platform PAI, DataWorks provides a robust big data analytics approach that covers everything from big data creation to data mining and machine learning.

How Does DataWorks Support More Than 99% of Alibaba's Data Development?

Independently developed by Alibaba, DataWorks is used every day by tens of thousands of data and algorithm development engineers to build and administer 99% of Alibaba Group's data-driven and data-focused business operations.

Initially released in 2010, DataWorks has undergone many technological changes and architecture upgrades on the way to its current version, which has unfortunately left it with a great deal of historical baggage. Technological innovation and business development often complement each other, but they can also restrict each other and cause various problems. The latter is the case with DataWorks. The product has some long-standing problems, including slow access, extensive code changes required to fix a single bug, and environmental complexity. Previous iterations did not fundamentally upgrade DataWorks or resolve these problems. Rather, they only improved performance, optimized the underlying engineering structures, and reduced repeated code.

This article looks at how adopting the widely popular microservice architecture can resolve some of the problems that have plagued DataWorks, and explores how to transform the technical architecture of DataWorks in a practical manner without jumping through complicated engineering hoops.

Cooperation and Competition

The DataWorks R&D platform provides a range of functions to assist with daily development work. Users can experience the design features of various functions when using the platform. This is something that is still lacking in platform R&D in general. The PD and the user experience designer (UED) collect requirements and try out the functions themselves. However, without a background in data development, the PD and UED cannot experience the subtle disappointment that is unique to data developers after long-term use. The usage of the DataWorks R&D platform varies greatly in different sectors, like finance, banking, government, large state-owned enterprises, Internet companies, traditional enterprises, private enterprises, and education. Some customers may not know how to use DataWorks. Moreover, users' needs vary and they have different knowledge and skills.

After frontline delivery teams or companies apply DataWorks in fields we have not considered, requirements are collected from these industries and sent to the PD for analysis. Frontline teams can package some DataWorks APIs and provide them as products to customers in specific industries to help solve their problems.

New products are being planned. The engine team uses DataWorks to improve the user-friendliness of designed products. It is difficult to scale up DataWorks to meet the requirements of product planning and improvement if only developers are working according to the schedule. Considering the frontend and backend architectures and countless instances of cooperation and competition, we need a technical revolution that breaks away from the service-oriented architecture (SOA) and introduces more user-side R&D capabilities. We hope this will allow us to make DataWorks more robust.


Related Tutorials

Data Acquisition with DataWorks

In this blog, you will work on a massive log data analysis task. By doing so, you will learn how to synchronize data from different data sources to MaxCompute, how to quickly trigger task runs, and how to view task logs.

Data Processing with DataWorks

This blog introduces how to process log data that has been collected into MaxCompute through DataWorks. You will learn how to run a data flow chart, how to create a new data table, and how to configure periodic scheduling properties.

Data Quality Monitoring with DataWorks

This blog introduces how to perform data quality monitoring. It mainly covers how to monitor data quality while using the data workshop, how to set up quality monitoring rules, and how to monitor alerts and tables.
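
To make the idea of a quality monitoring rule more concrete, here is a minimal sketch in plain Python. It does not use the DataWorks API. The names `Rule`, `null_rate`, `row_count_drop`, and `evaluate`, the sample rows, and the thresholds are all hypothetical and chosen only for illustration. Each rule computes one metric over a table's rows and raises an alert when the metric exceeds a threshold.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# A row is a plain dict; None stands for a missing value (assumption).
Row = Dict[str, Optional[object]]


@dataclass
class Rule:
    """Hypothetical quality rule: alert when metric(rows) > max_value."""
    name: str
    metric: Callable[[List[Row]], float]
    max_value: float


def null_rate(column: str) -> Callable[[List[Row]], float]:
    """Build a metric that returns the fraction of rows missing `column`."""
    def metric(rows: List[Row]) -> float:
        if not rows:
            return 0.0
        missing = sum(1 for row in rows if row.get(column) is None)
        return missing / len(rows)
    return metric


def row_count_drop(expected: int) -> Callable[[List[Row]], float]:
    """Build a metric that returns the relative shortfall against `expected` rows."""
    def metric(rows: List[Row]) -> float:
        if expected <= 0:
            return 0.0
        return max(0.0, (expected - len(rows)) / expected)
    return metric


def evaluate(rules: List[Rule], rows: List[Row]) -> List[str]:
    """Run every rule and collect one alert message per violated rule."""
    alerts = []
    for rule in rules:
        value = rule.metric(rows)
        if value > rule.max_value:
            alerts.append(f"{rule.name}: {value:.2f} exceeds {rule.max_value:.2f}")
    return alerts


# Made-up log rows for illustration; one row has no IP address.
rows: List[Row] = [
    {"ip": "10.0.0.1", "status": 200},
    {"ip": None, "status": 200},
    {"ip": "10.0.0.3", "status": 404},
    {"ip": "10.0.0.4", "status": 200},
]

rules = [
    Rule("null_rate(ip)", null_rate("ip"), max_value=0.10),
    Rule("row_count_drop", row_count_drop(expected=5), max_value=0.50),
]

alerts = evaluate(rules, rows)
for alert in alerts:
    print(alert)
```

In this sketch the null-rate rule fires because one of four rows is missing `ip`, while the row-count rule stays quiet because the table is only one row short of the expected five. A real monitoring setup would attach such rules to tables and route the alerts to subscribers, which the tutorial covers in the product itself.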

Related Products


MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.


DataWorks is a Big Data platform product launched by Alibaba Cloud. It provides one-stop Big Data development, data permission management, offline job scheduling, and other features.

DataWorks works straight out of the box, without the need to worry about complex underlying cluster setup or operations and management.

Related Documentation

Basic mode and standard mode

DataWorks provides workspaces in basic mode and standard mode for you to develop data under different security control requirements. This topic describes the differences between and access accounts for workspaces in basic mode and standard mode.

Alibaba Clouder