
DataWorks: Features

Last Updated: Mar 31, 2026

DataWorks is an all-in-one big data development and governance platform for data engineers, data analysts, architects, and operations teams. Use it to integrate, develop, model, analyze, monitor, serve, and govern data across your organization — and build an enterprise-level data middle platform without switching between separate tools.

How data flows through DataWorks

Data moves through five stages in a typical DataWorks workflow:

  1. Connect and collect — Data Integration syncs data from 50+ heterogeneous sources into your data lake or warehouse, without writing custom connectors or managing network infrastructure manually.

  2. Transform and enrich — Data Studio lets you develop, test, and schedule processing tasks across multiple compute engines. Operation Center keeps those pipelines running reliably in production.

  3. Model and standardize — Data modeling applies consistent definitions, dimensional models, and metrics across departments — eliminating data silos without changing existing architectures.

  4. Analyze and serve — Data Analysis runs ad hoc SQL queries without data engineering skills. DataService Studio publishes results as governed data APIs with zero O&M overhead.

  5. Govern and extend — Data Quality blocks dirty data before it propagates downstream. Data Map traces lineage across assets. Open Platform integrates DataWorks into your existing systems.
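The five stages above can be sketched as a simple pipeline. This is an illustrative sketch only; the function names and record shapes are hypothetical, not DataWorks APIs.

```python
# Illustrative sketch of the five-stage DataWorks flow.
# Function and record names are hypothetical, not DataWorks APIs.

def connect_and_collect(sources):
    """Stage 1: pull raw records from heterogeneous sources."""
    return [row for src in sources for row in src]

def transform_and_enrich(rows):
    """Stage 2: clean and enrich raw records."""
    return [{**r, "amount": float(r["amount"])} for r in rows]

def model_and_standardize(rows):
    """Stage 3: apply one consistent metric definition."""
    return {"gmv": sum(r["amount"] for r in rows)}

def analyze_and_serve(metrics):
    """Stage 4: expose the governed metric to consumers."""
    return {"api/gmv": metrics["gmv"]}

def govern(api):
    """Stage 5: block obviously bad results before serving."""
    assert all(v >= 0 for v in api.values()), "dirty data detected"
    return api

sources = [[{"amount": "10.5"}], [{"amount": "4.5"}]]
result = govern(analyze_and_serve(model_and_standardize(
    transform_and_enrich(connect_and_collect(sources)))))
print(result)  # {'api/gmv': 15.0}
```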

Module overview

Module | What it does
Data Integration | Synchronize data across 50+ heterogeneous sources in offline, real-time, or integrated modes
Data Studio and Operation Center | Develop, orchestrate, deploy, and monitor data processing tasks across multiple compute engines
Data modeling | Plan data warehouse layers, define standards, build dimensional models, and manage metrics
Data Analysis | Run SQL queries, upload datasets, and visualize data without data engineering skills
Data Quality | Monitor data at the table and field levels and block problematic tasks to prevent dirty data propagation
Data Map | Search, categorize, and trace data lineage across your data assets
DataService Studio | Build, publish, and manage data APIs with serverless architecture
Open Platform | Integrate external systems through OpenAPI, OpenEvent, and Extensions
Migration Assistant | Migrate jobs from open-source scheduling engines or between DataWorks environments

Data Integration

Data Integration syncs data across heterogeneous sources — no custom connectors or manual network configuration required. It supports full and incremental synchronization in offline, real-time, or integrated modes.

Capabilities

Feature | Description
Batch synchronization | Configure scheduling cycles for synchronization tasks
50+ data sources | Synchronize between relational databases, data warehouses, non-relational databases, file storage, and message queues
Network flexibility | Connect to data sources across the public internet, IDCs, or VPCs
Security | Monitor operations and enforce access controls during synchronization

Engine architecture

Data Integration uses a star-shaped engine architecture: any connected data source can form synchronization links with any other supported source. For a full list, see Supported data sources and synchronization solutions.
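The advantage of the star shape is easy to quantify: each source needs only one connector to the central engine, yet any pair of sources can still form a sync link through it. A small sketch of the arithmetic, with 50 as a stand-in for the number of supported source types:

```python
# Why a star-shaped engine scales: each source needs one connector to the
# hub, yet every ordered source pair can still sync through it.

def connectors_pairwise(n):
    # A dedicated connector per ordered source pair.
    return n * (n - 1)

def connectors_star(n):
    # One connector per source, to the central engine.
    return n

n = 50  # roughly the number of supported source types
print(connectors_pairwise(n), connectors_star(n))  # 2450 50
```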

Star-shaped engine architecture of Data Integration showing interconnected data sources

Before synchronizing data, establish network connectivity between your data source and a resource group. Data Integration tasks run on serverless resource groups (recommended) or exclusive resource groups for Data Integration (legacy). For network solutions, see Network connectivity solutions.

Resource groups and network connectivity between data sources and Data Integration

Use cases

Use case | Description
Data lake and warehouse ingestion | Ingest data from source systems into your data lake or data warehouse
Database sharding | Shard databases and tables across distributed storage
Real-time data archiving | Archive streaming data for long-term storage and analysis
Cross-cloud migration | Move data between cloud environments

Data Studio and Operation Center

Data Studio is a development platform for data processing. Operation Center is an intelligent operations and maintenance (O&M) platform. Together, they give you a standardized end-to-end workflow — from writing code to keeping pipelines healthy in production.

Capabilities

Feature | Description
Multi-engine support | Develop, test, deploy, and manage tasks across MaxCompute, E-MapReduce, CDH, Hologres, AnalyticDB, and ClickHouse from a single platform
Intelligent editor and visual orchestration | Write code in an intelligent editor and build task workflows with drag-and-drop dependency orchestration. The scheduling system is proven by Alibaba Group's internal workloads
Environment isolation | Standard mode separates development and production environments. Version control, code review, smoke testing, deployment control, and operational auditing standardize your development lifecycle
Operational monitoring | Operation Center provides data timeliness assurance, task diagnostics, impact analysis, automated O&M, and mobile-based O&M

DataWorks provides workspaces in standard mode to isolate development and production environments. For more information, see Differences between workspace modes.
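Dependency orchestration boils down to scheduling tasks in topological order: a task runs only after everything upstream of it has finished. A minimal sketch of that idea, using Python's standard library; the task names are illustrative, not real DataWorks tasks.

```python
# Minimal sketch of dependency-based task scheduling, the idea behind
# drag-and-drop orchestration in Data Studio. Task names are illustrative.
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
dependencies = {
    "sync_orders": set(),              # Data Integration sync
    "clean_orders": {"sync_orders"},   # SQL cleansing task
    "daily_report": {"clean_orders"},  # aggregation task
    "quality_check": {"clean_orders"}, # Data Quality rule
}

# static_order yields a valid execution order for the whole workflow.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```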

Development and operations workflow

Development workflow

Development workflow from code editing through testing to deployment

Task monitoring, troubleshooting, and resolution

Task monitoring, troubleshooting, and resolution workflow in Operation Center

Data modeling

Data modeling builds on over a decade of Alibaba's data warehouse modeling best practices. Use it to establish consistent data definitions across departments, build dimensional models, and manage metrics — without rebuilding your existing architecture.

Four core modules

Module | Capabilities
Data Warehouse Planning | Plan data warehouse layers, data domains, and data marts. Configure model design spaces so different departments share a common set of data standards and models
Data Standard | Define field standards, standard codes, units of measurement, and naming dictionaries. Automatically generate data quality rules from standard codes to simplify compliance checks
Dimensional Modeling | Reverse modeling addresses the cold-start problem for existing data warehouses. Import models from Excel files or build them with FML, an SQL-like domain-specific language. Visual dimensional modeling integrates with Data Studio to automatically generate ETL (extract, transform, load) code
Data Metrics | Define atomic metrics and derived metrics. Batch-create derived metrics based on atomic metrics and various dimensions. Integrates with dimensional modeling
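Batch-creating derived metrics is essentially a cross product of an atomic metric with time periods and dimensions. A sketch of that combinatorics; the metric and dimension names are illustrative, and this is not the actual Data Metrics naming scheme.

```python
# Sketch of batch-creating derived metrics from one atomic metric and a
# set of dimensions, mirroring the Data Metrics module. Names are
# illustrative, not the real Data Metrics naming convention.
from itertools import product

atomic_metric = "order_amount"
time_periods = ["1d", "7d", "30d"]
dimensions = ["by_region", "by_channel"]

derived = [f"{atomic_metric}_{p}_{d}"
           for p, d in product(time_periods, dimensions)]
print(len(derived))  # 6
print(derived[0])    # order_amount_1d_by_region
```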

Architecture

Data modeling architecture showing relationships between Data Warehouse Planning, Data Standard, Dimensional Modeling, and Data Metrics

Use cases

Use case | Description
Structured data management | Organize and store large-scale enterprise data in a structured, consistent manner
Cross-department data integration | Break data silos between departments to give decision-makers a complete view of business data
Unified data standards | Establish consistent data definitions across systems without changing existing architectures, enabling upstream and downstream data interconnection
Data value realization | Deliver more effective data services using various types of enterprise data

Data Analysis

Data Analysis gives data analysts, product managers, and operations staff the tools to query and visualize data directly — no data engineering skills required.

Capabilities

  • Upload personal datasets and access public datasets

  • Search and bookmark tables

  • Run online SQL queries

  • Share SQL files and download query results

  • Visualize data on large screens using spreadsheets

Use cases

Use case | Description
Scalable analysis | Use compute engine resources to analyze full-scale datasets
Cross-system data flow | Analyze data from databases across different business systems. Export data to MaxCompute tables or share result sets with specified users and grant them permissions
Secure operations | Integrate SQL queries and result downloads with security auditing

Data Quality

Data Quality monitors data at the table and field levels using over 30 preset monitoring templates and custom templates. It detects source data changes and dirty data during ETL processing, then automatically blocks problematic tasks — preventing dirty data from propagating downstream.

Monitoring and verification

Data Quality monitors datasets across various engines, including MaxCompute. When offline data changes, Data Quality verifies the data and blocks the production pipeline to prevent data pollution. Historical verification results are stored for quality analysis and classification. For more information, see Data Quality.
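The blocking behavior can be sketched as a threshold rule that raises before a downstream task is allowed to run. This is an illustrative sketch only; the rule, field, and threshold here are hypothetical, not a real Data Quality monitoring template.

```python
# Sketch of a threshold rule that blocks a pipeline when dirty data is
# detected, as Data Quality does. Rule and field names are illustrative.

def null_rate(rows, field):
    """Fraction of rows where the field is missing."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

def check_and_block(rows, field, max_null_rate=0.05):
    """Return True if the downstream task may run; raise to block it."""
    rate = null_rate(rows, field)
    if rate > max_null_rate:
        raise RuntimeError(
            f"blocked: null rate {rate:.0%} on '{field}' exceeds threshold")
    return True

rows = [{"user_id": 1}, {"user_id": None}, {"user_id": 3}, {"user_id": 4}]
try:
    check_and_block(rows, "user_id")
except RuntimeError as e:
    print(e)  # blocked: null rate 25% on 'user_id' exceeds threshold
```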

Data Quality addresses the following issues:

Issue type | Description
Database changes | Frequent schema or structural changes in source databases
Business changes | Evolving business logic that causes data inconsistencies
Data definition issues | Mismatched or undefined field standards
Dirty data from business systems | Invalid or malformed data originating in upstream systems
System interaction issues | Quality degradation caused by cross-system dependencies
Data correction issues | Errors introduced during manual data corrections
Data warehouse quality issues | Quality problems originating within the warehouse itself

Data Map

Data Map is built on data search capabilities. It provides table usage instructions, data categories, data lineage, and field lineage — giving data consumers and data owners a shared space to manage assets and collaborate on development.

Data Map interface showing data search, categorization, and lineage features
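Field lineage amounts to walking a dependency graph from a field back through everything it was derived from. A sketch of that traversal; the layer prefixes (ods, dwd, dws) and edges are illustrative, not real Data Map output.

```python
# Sketch of upstream field-lineage tracing like Data Map's lineage view.
# The lineage edges and layer names (ods/dwd/dws) are illustrative.

# Maps each field to the upstream fields it is derived from.
lineage = {
    "dws.daily_gmv": ["dwd.orders.amount"],
    "dwd.orders.amount": ["ods.orders.amount_raw"],
    "ods.orders.amount_raw": [],
}

def trace_upstream(field, graph):
    """Collect every upstream field, nearest first (breadth-first)."""
    seen, queue = [], list(graph.get(field, []))
    while queue:
        f = queue.pop(0)
        if f not in seen:
            seen.append(f)
            queue.extend(graph.get(f, []))
    return seen

print(trace_upstream("dws.daily_gmv", lineage))
# ['dwd.orders.amount', 'ods.orders.amount_raw']
```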

DataService Studio

DataService Studio is a flexible, lightweight, secure, and stable platform for building and publishing data APIs. It provides publication approval, access control, usage metering, and resource isolation — with zero O&M overhead.

Unified API service bus

DataService Studio acts as a unified service bus between the data warehouse and applications, closing the gap between the data warehouse, databases, and data applications.

DataService Studio architecture showing the bridge between data warehouse and applications

Feature | Description
No-code and SQL-mode API generation | Generate data APIs from tables in various data sources using no-code or self-service SQL mode. Use Function Compute to process API request parameters and returned results
Single-click publishing | Publish API services to an API gateway with a single click
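The core of SQL-mode API generation is binding a request parameter into a governed query and returning the rows as the API result. A self-contained sketch of that pattern, using sqlite3 as a stand-in engine; the table, query, and handler names are illustrative, not DataService Studio artifacts.

```python
# Sketch of the idea behind SQL-mode API generation: a request parameter
# is bound into a parameterized query and the result becomes the API
# response. sqlite3 stands in for the real engine; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 10.0), ("west", 5.0), ("east", 2.5)])

API_SQL = "SELECT region, SUM(amount) AS total FROM orders WHERE region = ?"

def get_order_total(region):
    """Handler a generated data API might wrap around the query above."""
    row = conn.execute(API_SQL, (region,)).fetchone()
    return {"region": row[0], "total": row[1]}

print(get_order_total("east"))  # {'region': 'east', 'total': 12.5}
```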

Serverless architecture

DataService Studio uses a serverless architecture, so you focus on API query logic instead of managing infrastructure. Computing resources are provisioned automatically with elastic scaling, resulting in zero O&M costs.

DataService Studio serverless architecture with elastic scaling

Open Platform

Open Platform exposes DataWorks data and capabilities to external systems through OpenAPI, OpenEvent, and Extensions. Use it to manage data workflows, govern data, and respond to business status changes from your own applications.

Integration capabilities

Capability | Description
OpenAPI | Integrate your applications with DataWorks. Batch-create, publish, and manage tasks to improve processing efficiency and reduce manual operations. For more information, see OpenAPI
OpenEvent | Subscribe to system events for real-time notifications. For example, subscribe to table change events to monitor core tables, or subscribe to task change events to build a real-time task monitoring dashboard. For more information, see OpenEvent
Extensions | Service-level plug-ins that combine OpenAPI and OpenEvent to customize workflow controls in DataWorks. For example, create a deployment control plug-in to block tasks that do not comply with your standards. For more information, see Extensions
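The table-change example above follows a subscribe-and-dispatch pattern. A minimal sketch of that pattern; the event payloads and handler names are illustrative and do not reflect the actual OpenEvent message schema.

```python
# Sketch of reacting to subscribed events, the pattern OpenEvent enables.
# Event payloads and handler names are illustrative, not the real
# OpenEvent message schema.

handlers = {}

def subscribe(event_type):
    """Register a handler function for one event type."""
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

@subscribe("table-changed")
def alert_on_core_table(event):
    # Monitor core tables, as in the OpenEvent example above.
    if event["table"].startswith("core_"):
        return f"ALERT: core table {event['table']} changed"
    return None

def dispatch(event):
    """Run every handler subscribed to the event's type."""
    results = [h(event) for h in handlers.get(event["type"], [])]
    return [r for r in results if r]

print(dispatch({"type": "table-changed", "table": "core_orders"}))
# ['ALERT: core table core_orders changed']
```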

Use cases

Open Platform supports deep system integration, automated operations, workflow definition, and business monitoring. Build industry-specific and scenario-based data applications and plug-ins on the DataWorks Open Platform.

Migration Assistant

Migration Assistant migrates jobs from open-source scheduling engines to DataWorks and supports cross-cloud, cross-region, and cross-account job migration — so you can clone and deploy DataWorks jobs without rebuilding from scratch. The DataWorks team, in collaboration with big data expert service teams, also offers cloud migration services to help you move your data and tasks to the cloud.

Migration capabilities

Capability | Description
Task migration to the cloud | Migrate jobs from open-source scheduling engines to DataWorks
DataWorks migration | Migrate development assets within the DataWorks ecosystem

Use cases

Use case | Description
Task migration to the cloud | Migrate jobs from open-source scheduling engines to DataWorks
Task backup | Regularly back up task code to minimize losses from accidental project deletion
Business replication | Abstract common business logic and use the export/import feature to replicate it across projects
Test environment setup | Replicate business code and switch the data input from production to test data
Cross-cloud development | Import and export between DataWorks on the public cloud and DataWorks in a private cloud for collaborative development