What are the core capabilities of DataWorks?

DataWorks is an all-in-one big data development and governance platform that supports end-to-end data processing. Use DataWorks to integrate, develop, model, analyze, monitor, serve, and govern data across your organization, and build an enterprise-level data middle platform.

Module overview

Module	Description
Data Integration	Synchronize data across 50+ heterogeneous sources in offline, real-time, or integrated modes
Data Studio and Operation Center	Develop, orchestrate, deploy, and monitor data processing tasks across multiple compute engines
Data modeling	Plan data warehouse layers, define standards, build dimensional models, and manage metrics
Data Analysis	Run SQL queries, upload datasets, and visualize data without data engineering skills
Data Quality	Monitor data at the table and field levels and block problematic tasks to prevent dirty data propagation
Data Map	Search, categorize, and trace data lineage across your data assets
DataService Studio	Build, publish, and manage data APIs with serverless architecture
Open Platform	Integrate external systems through OpenAPI, OpenEvent, and Extensions
Migration Assistant	Migrate jobs from open-source scheduling engines or between DataWorks environments

Data Integration

Data Integration is a stable, efficient, and elastic data synchronization platform that connects heterogeneous data sources across network environments.

Synchronization modes and capabilities

Data Integration supports full and incremental data synchronization in offline, real-time, or integrated modes.

Batch synchronization: Configure scheduling cycles for synchronization tasks.
50+ data sources: Synchronize data between relational databases, data warehouses, non-relational databases, file storage, and message queues.
Network flexibility: Connect to data sources over the public internet, in Internet data centers (IDCs), or in virtual private clouds (VPCs).
Security: Monitor operations and enforce access controls during synchronization.
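
The difference between full and incremental synchronization can be sketched as follows. This is a minimal illustration that assumes each row carries a modification timestamp used as a watermark. The `Row` type, the `updated_at` field, and both functions are hypothetical and are not part of Data Integration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    updated_at: int  # hypothetical modification timestamp (epoch seconds)
    value: str


def full_sync(source: list[Row]) -> dict[int, Row]:
    """Copy every source row into a fresh target keyed by primary key."""
    return {row.id: row for row in source}


def incremental_sync(source: list[Row], target: dict[int, Row], watermark: int) -> int:
    """Upsert only rows modified after `watermark` and return the new watermark."""
    new_watermark = watermark
    for row in source:
        if row.updated_at > watermark:
            target[row.id] = row
            new_watermark = max(new_watermark, row.updated_at)
    return new_watermark


# Initial load: copy everything and remember the latest timestamp seen.
source = [Row(1, 100, "a"), Row(2, 110, "b")]
target = full_sync(source)
watermark = max(row.updated_at for row in source)

# Later, row 2 changes and row 3 is added in the source.
source[1] = Row(2, 120, "b2")
source.append(Row(3, 130, "c"))
watermark = incremental_sync(source, target, watermark)
print(sorted(target), watermark)
```

A scheduled batch task would run the incremental step once per cycle and persist the watermark between runs.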

Engine architecture

Data Integration uses a star-shaped engine architecture. Any connected data source can form synchronization links with any other supported source. For a list of supported data sources, see Supported data sources and synchronization solutions.

Star-shaped engine architecture of Data Integration showing interconnected data sources
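
One common way to obtain this any-to-any property is to give every source a reader and every destination a writer that exchange records in a shared intermediate format, so that N readers and M writers cover N x M links. The sketch below illustrates that idea only. It is an assumption about the general pattern, not a description of Data Integration internals, and all names (`STORES`, `READERS`, `WRITERS`, `sync`) are hypothetical.

```python
from typing import Callable, Iterable

Record = dict  # neutral intermediate format shared by all plugins

# Hypothetical in-memory "data sources" standing in for real systems.
STORES: dict[str, list] = {
    "mysql_like": [(1, "alice"), (2, "bob")],
    "csv_like": [],
    "queue_like": [],
}


def read_mysql_like() -> Iterable[Record]:
    """Reader: convert source-specific tuples into neutral records."""
    for user_id, name in STORES["mysql_like"]:
        yield {"id": user_id, "name": name}


def write_csv_like(records: Iterable[Record]) -> None:
    """Writer: convert neutral records into comma-separated lines."""
    for record in records:
        STORES["csv_like"].append(f"{record['id']},{record['name']}")


def write_queue_like(records: Iterable[Record]) -> None:
    """Writer: publish each neutral record as one message."""
    for record in records:
        STORES["queue_like"].append(dict(record))


READERS: dict[str, Callable[[], Iterable[Record]]] = {"mysql_like": read_mysql_like}
WRITERS: dict[str, Callable[[Iterable[Record]], None]] = {
    "csv_like": write_csv_like,
    "queue_like": write_queue_like,
}


def sync(source: str, target: str) -> None:
    """Link any registered reader to any registered writer."""
    WRITERS[target](READERS[source]())


sync("mysql_like", "csv_like")
sync("mysql_like", "queue_like")
print(STORES["csv_like"])
```

Adding a new source in this pattern means adding one reader, after which it can be paired with every existing writer.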

Before synchronizing data, establish network connectivity between your data source and a resource group. Data Integration tasks run on serverless resource groups (recommended) or exclusive resource groups for Data Integration (legacy). For network solutions, see Network connectivity solutions.

Resource groups and network connectivity between data sources and Data Integration

Typical use cases

Ingesting data into data lakes and data warehouses
Sharding databases and tables
Archiving real-time data
Moving data between clouds

Data Studio and Operation Center

Data Studio is a development platform for data processing. Operation Center is an intelligent operations and maintenance (O&M) platform. Together, they provide a standardized way to build and manage data development workflows.

Multi-engine development and environment isolation

Multi-engine support: Develop, test, deploy, and manage tasks across MaxCompute, E-MapReduce, CDH, Hologres, AnalyticDB, and ClickHouse from a unified platform.
Intelligent editor and visual orchestration: Build task workflows with an intelligent editor and drag-and-drop dependency orchestration. The scheduling system has been proven on Alibaba Group's internal workloads.
Environment isolation: Separate development and production environments in standard mode. Version control, code review, smoke testing, deployment control, and operational auditing standardize your development lifecycle.
Operational monitoring: Operation Center provides data timeliness assurance, task diagnostics, impact analysis, automated O&M, and mobile-based O&M.
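
Dependency orchestration amounts to describing a workflow as a directed acyclic graph and running each task only after its upstream tasks finish. The sketch below shows that ordering with the Python standard library. The task names are hypothetical, and the code does not reflect how the DataWorks scheduler is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "ods_orders": set(),
    "ods_users": set(),
    "dwd_order_detail": {"ods_orders", "ods_users"},
    "ads_daily_sales": {"dwd_order_detail"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid run order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two `ods_` tasks have no dependency on each other, so a scheduler is free to run them in parallel before `dwd_order_detail` starts.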

DataWorks provides workspaces in standard mode to isolate development and production environments. For more information, see Differences between workspace modes.

Development and operations workflow

Development workflow
Task monitoring, troubleshooting, and resolution

Data modeling

Data modeling in DataWorks incorporates over a decade of best practices from Alibaba's data warehouse modeling methodologies. Use structured modeling and reverse modeling to build enterprise data assets for data marts and data middle platforms.

Four core modules

Data modeling includes four modules: Data Warehouse Planning, Data Standard, Dimensional Modeling, and Data Metrics.

Module	Capabilities
Data Warehouse Planning	Plan data warehouse layers, data domains, and data marts. Configure model design spaces so that different departments share a common set of data standards and models.
Data Standard	Define field standards, standard codes, units of measurement, and naming dictionaries. Automatically generate data quality rules from standard codes to simplify compliance checks.
Dimensional Modeling	Use reverse modeling to address the cold-start problem for existing data warehouses. Import models from Excel files or build them with FML (Fast Modeling Language), an SQL-like domain-specific language. Visual dimensional modeling integrates with Data Studio to automatically generate ETL (extract, transform, load) code.
Data Metrics	Define atomic metrics and derived metrics. Batch-create derived metrics based on atomic metrics and various dimensions. Data Metrics integrates with Dimensional Modeling.
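
Batch creation of derived metrics can be thought of as combining each atomic metric with every combination of dimension values. The sketch below illustrates that idea. The metric names, dimension values, and the underscore naming scheme are assumptions made for the example, not DataWorks conventions.

```python
from itertools import product

# Hypothetical atomic metrics and dimension values.
atomic_metrics = ["order_amount", "order_count"]
dimensions = {
    "period": ["1d", "7d"],
    "region": ["east", "west"],
}


def derive_metrics(atomics: list[str], dims: dict[str, list[str]]) -> list[str]:
    """Create one derived metric name per atomic metric and dimension combination."""
    keys = list(dims)
    names = []
    for metric in atomics:
        for combo in product(*(dims[key] for key in keys)):
            names.append("_".join([metric, *combo]))
    return names


derived = derive_metrics(atomic_metrics, dimensions)
print(len(derived), derived[0])
```

With 2 atomic metrics, 2 periods, and 2 regions, the function produces 8 derived metrics, which shows why batch creation saves manual work as dimensions grow.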

Architecture

Data modeling architecture showing relationships between Data Warehouse Planning, Data Standard, Dimensional Modeling, and Data Metrics

Typical use cases

Structured data management: Organize and store large-scale enterprise data in a structured and consistent manner.
Cross-department data integration: Break data silos between departments and business domains to give decision-makers a complete view of business data.
Unified data standards: Establish consistent data definitions across systems without changing existing architectures. Enable upstream and downstream data interconnection.
Data value realization: Use various types of enterprise data to deliver more effective data services.

Data Analysis

Data Analysis provides tools for data analysts, product managers, and operations staff to retrieve and analyze data without data engineering skills, so that anyone can work as a data analyst.

Core capabilities

Upload personal datasets and access public datasets
Search and bookmark tables
Run online SQL queries
Share SQL files and download query results
Visualize data on large screens using spreadsheets

Typical use cases

Use case	Description
Scalable analysis	Leverage compute engine resources to analyze full-scale datasets.
Cross-system data flow	Analyze data from databases across different business systems. Export data to MaxCompute tables or share result sets with specified users and grant them permissions.
Secure operations	Integrate SQL queries and result downloads with security auditing.

Data Quality

Data Quality monitors data at the table and field levels using over 30 preset monitoring templates and custom templates. It detects source data changes, identifies dirty data during ETL (extract, transform, load) processing, and automatically blocks problematic tasks to prevent dirty data from propagating downstream.

Monitoring and verification

Data Quality monitors datasets across various engines, including MaxCompute. When offline data changes, Data Quality verifies the data and, if it detects a problem, blocks the production pipeline to prevent data pollution. It stores historical verification results for quality analysis and classification. For more information, see Data Quality.
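
The verify-then-block behavior can be sketched as a gate in front of a downstream task. The example below is a minimal illustration with one table-level rule (row count) and one field-level rule (null rate). The function names, rule parameters, and thresholds are hypothetical and do not correspond to Data Quality templates.

```python
class QualityCheckFailed(Exception):
    """Raised to stop downstream tasks when a blocking rule is violated."""


def null_rate(rows, field):
    """Field-level metric: share of rows where `field` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(field) is None) / len(rows)


def check_table(rows, min_rows, max_null_rate):
    """Return a list of human-readable violations (empty if the data is clean)."""
    violations = []
    if len(rows) < min_rows:  # table-level rule
        violations.append(f"row count {len(rows)} is below {min_rows}")
    for field, limit in max_null_rate.items():  # field-level rules
        rate = null_rate(rows, field)
        if rate > limit:
            violations.append(f"null rate of {field} is {rate:.0%}, above {limit:.0%}")
    return violations


def run_if_clean(rows, downstream_task, min_rows, max_null_rate):
    """Run `downstream_task` only when every rule passes; otherwise block it."""
    violations = check_table(rows, min_rows, max_null_rate)
    if violations:
        raise QualityCheckFailed("; ".join(violations))
    return downstream_task(rows)


clean = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
dirty = [{"id": 1, "amount": None}, {"id": 2, "amount": 20}]
rules = {"min_rows": 1, "max_null_rate": {"amount": 0.1}}

print(run_if_clean(clean, len, **rules))
try:
    run_if_clean(dirty, len, **rules)
except QualityCheckFailed as exc:
    print("blocked:", exc)
```

Raising an exception before the downstream task runs is what keeps dirty data from propagating. Keeping the violation list makes it possible to store results for later analysis.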

Data Quality addresses the following issues:

Frequent database changes
Frequent business changes
Data definition issues
Dirty data from business systems
Quality issues caused by system interactions
Issues caused by data correction
Quality issues originating from the data warehouse

Data Map

Data Map is built on data search capabilities. It provides tools for table usage instructions, data categories, data lineage, and field lineage. Data consumers and data owners use Data Map to manage data and collaborate on development.

Data Map interface showing data search, categorization, and lineage features
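
Lineage tracing is essentially a graph traversal: follow edges forward to find what a table feeds, or backward to find where it comes from. The sketch below assumes a simple table-level lineage map. The table names and the `lineage` structure are hypothetical and do not represent how Data Map stores metadata.

```python
from collections import deque

# Hypothetical table-level lineage: each table maps to the tables derived from it.
lineage = {
    "ods_orders": ["dwd_order_detail"],
    "ods_users": ["dwd_order_detail"],
    "dwd_order_detail": ["ads_daily_sales", "ads_user_profile"],
}


def downstream(graph: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search for every table that depends on `start`."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(graph: dict[str, list[str]], target: str) -> set[str]:
    """Reverse the edges, then reuse the same traversal to find all sources."""
    reverse: dict[str, list[str]] = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(reverse, target)


print(sorted(downstream(lineage, "ods_orders")))
print(sorted(upstream(lineage, "ads_daily_sales")))
```

The downstream set answers "what breaks if this table changes", and the upstream set answers "where does this data come from".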

DataService Studio

DataService Studio is a flexible, lightweight, secure, and stable platform for building and publishing data APIs. It provides publication approval, access control, usage metering, and resource isolation.

Unified API service bus

DataService Studio acts as a unified service bus between the data warehouse and applications. It unifies the creation and management of API services, closing the gap between the data warehouse, databases, and data applications.

DataService Studio architecture showing the bridge between data warehouse and applications

Generate data APIs from tables in various data sources using no-code or self-service SQL mode. Use Function Compute to process API request parameters and returned results.
Publish API services to an API gateway with a single click.
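
Conceptually, an SQL-mode data API is a parameterized query wrapped by optional functions that process the request parameters and shape the result. The sketch below imitates that flow locally with SQLite. The table, the query, and the `pre_filter`, `post_filter`, and `call_api` functions are hypothetical stand-ins, not DataService Studio or Function Compute interfaces.

```python
import sqlite3

# Hypothetical table standing in for a data source registered with the service.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 25.5), (3, "east", 4.5)],
)

# The API's query logic: a parameterized SQL statement.
API_SQL = "SELECT id, amount FROM orders WHERE region = :region ORDER BY id"


def pre_filter(params: dict) -> dict:
    """Normalize request parameters before they reach the query."""
    return {"region": params["region"].strip().lower()}


def post_filter(rows: list) -> dict:
    """Shape raw rows into the response body returned to the caller."""
    return {"count": len(rows), "data": [{"id": r[0], "amount": r[1]} for r in rows]}


def call_api(params: dict) -> dict:
    """Request parameters -> pre-filter -> SQL -> post-filter -> response."""
    rows = conn.execute(API_SQL, pre_filter(params)).fetchall()
    return post_filter(rows)


print(call_api({"region": " East "}))
```

Binding request values as named parameters instead of concatenating them into the SQL string keeps the query safe from injection.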

Serverless architecture

DataService Studio uses a serverless architecture. Focus on API query logic instead of managing infrastructure. DataService Studio automatically provisions computing resources with elastic scaling, so you do not have to operate or maintain the underlying servers.

DataService Studio serverless architecture with elastic scaling

Open Platform

Open Platform exposes DataWorks data and capabilities to external systems through OpenAPI, OpenEvent, and Extensions. Integrate applications with DataWorks to manage data workflows, govern data, and respond to business status changes.

Three integration capabilities

OpenAPI: Integrate your applications with DataWorks. Batch create, publish, and manage tasks to improve processing efficiency and reduce manual operations. For more information, see OpenAPI.
OpenEvent: Subscribe to system events for real-time notifications. For example, subscribe to table change events to monitor core tables, or subscribe to task change events to build a real-time task monitoring dashboard. For more information, see OpenEvent.
Extensions: Service-level plug-ins that combine OpenAPI and OpenEvent. Customize workflow controls in DataWorks. For example, create a deployment control plug-in to block tasks that do not comply with your standards. For more information, see Extensions.
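
The deployment control example can be sketched as a function that receives an event, checks the task against your standards, and returns an allow or block decision. The event fields (`taskName`, `owner`), the naming rule, and the response shape below are all hypothetical. Consult the OpenEvent and Extensions documentation for the actual message formats.

```python
import re

# Hypothetical standard: task names start with a warehouse layer prefix.
NAMING_RULE = re.compile(r"^(ods|dwd|dws|ads)_[a-z0-9_]+$")


def check_deploy_event(event: dict) -> dict:
    """Decide whether a task in a (hypothetical) deployment event may proceed."""
    problems = []
    if not NAMING_RULE.match(event.get("taskName", "")):
        problems.append("task name does not follow the naming standard")
    if not event.get("owner"):
        problems.append("task has no owner")
    return {"allow": not problems, "reasons": problems}


print(check_deploy_event({"taskName": "dwd_order_detail", "owner": "alice"}))
print(check_deploy_event({"taskName": "Temp Job"}))
```

In a real extension, the decision would be sent back to DataWorks through OpenAPI so that the platform can continue or block the deployment.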

Typical use cases

Open Platform supports deep system integration, automated operations, workflow definition, and business monitoring. Build industry-specific and scenario-based data applications and plug-ins on the DataWorks Open Platform.

Migration Assistant

Migration Assistant migrates jobs from open-source scheduling engines to DataWorks. It supports cross-cloud, cross-region, and cross-account job migration, allowing you to quickly clone and deploy DataWorks jobs. The DataWorks team, in collaboration with big data expert service teams, also offers cloud migration services to help you move your data and tasks to the cloud.

Migration capabilities

Capability	Description
Task migration to the cloud	Migrate jobs from open-source scheduling engines to DataWorks.
DataWorks migration	Migrate development assets within the DataWorks ecosystem.

Typical use cases

Use case	Description
Task migration to the cloud	Migrate jobs from open-source scheduling engines to DataWorks.
Task backup	Regularly back up task code to minimize losses from accidental project deletion.
Business replication	Abstract common business logic and use the export/import feature to replicate it across projects.
Test environment setup	Replicate business code and change the data input from production to test data.
Cross-cloud development	Import and export between DataWorks on the public cloud and DataWorks in a private cloud for collaborative development.