
DataWorks: Features

Last Updated: Mar 31, 2026

DataWorks is an all-in-one big data development and governance platform for data engineers, data analysts, architects, and operations teams. Use it to integrate, develop, model, analyze, monitor, serve, and govern data across your organization — and build an enterprise-level data middle platform without switching between separate tools.

How data flows through DataWorks

Data moves through five stages in a typical DataWorks workflow:

  1. Connect and collect — Data Integration syncs data from 50+ heterogeneous sources into your data lake or warehouse, without writing custom connectors or managing network infrastructure manually.

  2. Transform and enrich — Data Studio lets you develop, test, and schedule processing tasks across multiple compute engines. Operation Center keeps those pipelines running reliably in production.

  3. Model and standardize — Data modeling applies consistent definitions, dimensional models, and metrics across departments — eliminating data silos without changing existing architectures.

  4. Analyze and serve — Data Analysis runs ad hoc SQL queries without data engineering skills. DataService Studio publishes results as governed data APIs with zero O&M overhead.

  5. Govern and extend — Data Quality blocks dirty data before it propagates downstream. Data Map traces lineage across assets. Open Platform integrates DataWorks into your existing systems.
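The five stages above can be sketched as a simple pipeline. This is an illustrative sketch only; the function names and record shapes are hypothetical, not DataWorks APIs.

```python
# Illustrative sketch of the five-stage DataWorks flow.
# Function and record names are hypothetical, not DataWorks APIs.

def connect_and_collect(sources):
    """Stage 1: pull raw records from heterogeneous sources."""
    return [row for src in sources for row in src]

def transform_and_enrich(rows):
    """Stage 2: clean and enrich raw records."""
    return [{**r, "amount": float(r["amount"])} for r in rows]

def model_and_standardize(rows):
    """Stage 3: apply one consistent metric definition."""
    return {"gmv": sum(r["amount"] for r in rows)}

def analyze_and_serve(metrics):
    """Stage 4: expose the governed metric to consumers."""
    return {"api/gmv": metrics["gmv"]}

def govern(api):
    """Stage 5: block obviously bad results before serving."""
    assert all(v >= 0 for v in api.values()), "dirty data detected"
    return api

sources = [[{"amount": "10.5"}], [{"amount": "4.5"}]]
result = govern(analyze_and_serve(model_and_standardize(
    transform_and_enrich(connect_and_collect(sources)))))
print(result)  # {'api/gmv': 15.0}
```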

Module overview

Module | What it does
Data Integration | Synchronize data across 50+ heterogeneous sources in offline, real-time, or integrated modes
Data Studio and Operation Center | Develop, orchestrate, deploy, and monitor data processing tasks across multiple compute engines
Data modeling | Plan data warehouse layers, define standards, build dimensional models, and manage metrics
Data Analysis | Run SQL queries, upload datasets, and visualize data without data engineering skills
Data Quality | Monitor data at the table and field levels and block problematic tasks to prevent dirty data propagation
Data Map | Search, categorize, and trace data lineage across your data assets
DataService Studio | Build, publish, and manage data APIs with serverless architecture
Open Platform | Integrate external systems through OpenAPI, OpenEvent, and Extensions
Migration Assistant | Migrate jobs from open-source scheduling engines or between DataWorks environments

Data Integration

Data Integration syncs data across heterogeneous sources — no custom connectors or manual network configuration required. It supports full and incremental synchronization in offline, real-time, or integrated modes.

Capabilities

Feature | Description
Batch synchronization | Configure scheduling cycles for synchronization tasks
50+ data sources | Synchronize between relational databases, data warehouses, non-relational databases, file storage, and message queues
Network flexibility | Connect to data sources across the public internet, IDCs, or VPCs
Security | Monitor operations and enforce access controls during synchronization

Engine architecture

Data Integration uses a star-shaped engine architecture: any connected data source can form synchronization links with any other supported source. For a full list, see Supported data sources and synchronization solutions.
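The advantage of the star shape is easy to quantify: each source needs only one connector to the central engine, yet any pair of sources can still form a sync link through it. A small sketch of the arithmetic, with 50 as a stand-in for the number of supported source types:

```python
# Why a star-shaped engine scales: each source needs one connector to the
# hub, yet every ordered source pair can still sync through it.

def connectors_pairwise(n):
    # A dedicated connector per ordered source pair.
    return n * (n - 1)

def connectors_star(n):
    # One connector per source, to the central engine.
    return n

n = 50  # roughly the number of supported source types
print(connectors_pairwise(n), connectors_star(n))  # 2450 50
```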

Star-shaped engine architecture of Data Integration showing interconnected data sources

Before synchronizing data, establish network connectivity between your data source and a resource group. Data Integration tasks run on serverless resource groups (recommended) or exclusive resource groups for Data Integration (legacy). For network solutions, see Network connectivity solutions.

Resource groups and network connectivity between data sources and Data Integration

Use cases

Use case | Description
Data lake and warehouse ingestion | Ingest data from source systems into your data lake or data warehouse
Database sharding | Shard databases and tables across distributed storage
Real-time data archiving | Archive streaming data for long-term storage and analysis
Cross-cloud migration | Move data between cloud environments

Data Studio and Operation Center

Data Studio is a development platform for data processing. Operation Center is an intelligent operations and maintenance (O&M) platform. Together, they give you a standardized end-to-end workflow — from writing code to keeping pipelines healthy in production.

Capabilities

Feature | Description
Multi-engine support | Develop, test, deploy, and manage tasks across MaxCompute, E-MapReduce, CDH, Hologres, AnalyticDB, and ClickHouse from a single platform
Intelligent editor and visual orchestration | Write code in an intelligent editor and build task workflows with drag-and-drop dependency orchestration. The scheduling system is proven by Alibaba Group's internal workloads
Environment isolation | Standard mode separates development and production environments. Version control, code review, smoke testing, deployment control, and operational auditing standardize your development lifecycle
Operational monitoring | Operation Center provides data timeliness assurance, task diagnostics, impact analysis, automated O&M, and mobile-based O&M

DataWorks provides workspaces in standard mode to isolate development and production environments. For more information, see Differences between workspace modes.
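Dependency orchestration boils down to scheduling tasks in topological order: a task runs only after everything upstream of it has finished. A minimal sketch of that idea, using Python's standard library; the task names are illustrative, not real DataWorks tasks.

```python
# Minimal sketch of dependency-based task scheduling, the idea behind
# drag-and-drop orchestration in Data Studio. Task names are illustrative.
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
dependencies = {
    "sync_orders": set(),              # Data Integration sync
    "clean_orders": {"sync_orders"},   # SQL cleansing task
    "daily_report": {"clean_orders"},  # aggregation task
    "quality_check": {"clean_orders"}, # Data Quality rule
}

# static_order yields a valid execution order for the whole workflow.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```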

Development and operations workflow

Development workflow

Development workflow from code editing through testing to deployment

Task monitoring, troubleshooting, and resolution

Task monitoring, troubleshooting, and resolution workflow in Operation Center

Data modeling

Data modeling builds on over a decade of Alibaba's data warehouse modeling best practices. Use it to establish consistent data definitions across departments, build dimensional models, and manage metrics — without rebuilding your existing architecture.

Four core modules

Module | Capabilities
Data Warehouse Planning | Plan data warehouse layers, data domains, and data marts. Configure model design spaces so different departments share a common set of data standards and models
Data Standard | Define field standards, standard codes, units of measurement, and naming dictionaries. Automatically generate data quality rules from standard codes to simplify compliance checks
Dimensional Modeling | Reverse modeling addresses the cold-start problem for existing data warehouses. Import models from Excel files or build them with FML, an SQL-like domain-specific language. Visual dimensional modeling integrates with Data Studio to automatically generate ETL (extract, transform, load) code
Data Metrics | Define atomic metrics and derived metrics. Batch-create derived metrics based on atomic metrics and various dimensions. Integrates with dimensional modeling
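Batch-creating derived metrics is essentially a cross product of an atomic metric with time periods and dimensions. A sketch of that combinatorics; the metric and dimension names are illustrative, and this is not the actual Data Metrics naming scheme.

```python
# Sketch of batch-creating derived metrics from one atomic metric and a
# set of dimensions, mirroring the Data Metrics module. Names are
# illustrative, not the real Data Metrics naming convention.
from itertools import product

atomic_metric = "order_amount"
time_periods = ["1d", "7d", "30d"]
dimensions = ["by_region", "by_channel"]

derived = [f"{atomic_metric}_{p}_{d}"
           for p, d in product(time_periods, dimensions)]
print(len(derived))  # 6
print(derived[0])    # order_amount_1d_by_region
```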

Architecture

Data modeling architecture showing relationships between Data Warehouse Planning, Data Standard, Dimensional Modeling, and Data Metrics

Use cases

Use case | Description
Structured data management | Organize and store large-scale enterprise data in a structured, consistent manner
Cross-department data integration | Break data silos between departments to give decision-makers a complete view of business data
Unified data standards | Establish consistent data definitions across systems without changing existing architectures, enabling upstream and downstream data interconnection
Data value realization | Deliver more effective data services using various types of enterprise data

Data Analysis

Data Analysis gives data analysts, product managers, and operations staff the tools to query and visualize data directly — no data engineering skills required.

Capabilities

  • Upload personal datasets and access public datasets

  • Search and bookmark tables

  • Run online SQL queries

  • Share SQL files and download query results

  • Visualize data on large screens using spreadsheets

Use cases

Use case | Description
Scalable analysis | Use compute engine resources to analyze full-scale datasets
Cross-system data flow | Analyze data from databases across different business systems. Export data to MaxCompute tables or share result sets with specified users and grant them permissions
Secure operations | Integrate SQL queries and result downloads with security auditing

Data Quality

Data Quality monitors data at the table and field levels using over 30 preset monitoring templates and custom templates. It detects source data changes and dirty data during ETL processing, then automatically blocks problematic tasks — preventing dirty data from propagating downstream.

Monitoring and verification

Data Quality monitors datasets across various engines, including MaxCompute. When offline data changes, Data Quality verifies the data and blocks the production pipeline to prevent data pollution. Historical verification results are stored for quality analysis and classification. For more information, see Data Quality.
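The blocking behavior can be sketched as a threshold rule that raises before a downstream task is allowed to run. This is an illustrative sketch only; the rule, field, and threshold here are hypothetical, not a real Data Quality monitoring template.

```python
# Sketch of a threshold rule that blocks a pipeline when dirty data is
# detected, as Data Quality does. Rule and field names are illustrative.

def null_rate(rows, field):
    """Fraction of rows where the field is missing."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

def check_and_block(rows, field, max_null_rate=0.05):
    """Return True if the downstream task may run; raise to block it."""
    rate = null_rate(rows, field)
    if rate > max_null_rate:
        raise RuntimeError(
            f"blocked: null rate {rate:.0%} on '{field}' exceeds threshold")
    return True

rows = [{"user_id": 1}, {"user_id": None}, {"user_id": 3}, {"user_id": 4}]
try:
    check_and_block(rows, "user_id")
except RuntimeError as e:
    print(e)  # blocked: null rate 25% on 'user_id' exceeds threshold
```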

Data Quality addresses the following issues:

Issue type | Description
Database changes | Frequent schema or structural changes in source databases
Business changes | Evolving business logic that causes data inconsistencies
Data definition issues | Mismatched or undefined field standards
Dirty data from business systems | Invalid or malformed data originating in upstream systems
System interaction issues | Quality degradation caused by cross-system dependencies
Data correction issues | Errors introduced during manual data corrections
Data warehouse quality issues | Quality problems originating within the warehouse itself

Data Map

Data Map is built on data search capabilities. It provides table usage instructions, data categories, data lineage, and field lineage — giving data consumers and data owners a shared space to manage assets and collaborate on development.

Data Map interface showing data search, categorization, and lineage features
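Field lineage amounts to walking a dependency graph from a field back through everything it was derived from. A sketch of that traversal; the layer prefixes (ods, dwd, dws) and edges are illustrative, not real Data Map output.

```python
# Sketch of upstream field-lineage tracing like Data Map's lineage view.
# The lineage edges and layer names (ods/dwd/dws) are illustrative.

# Maps each field to the upstream fields it is derived from.
lineage = {
    "dws.daily_gmv": ["dwd.orders.amount"],
    "dwd.orders.amount": ["ods.orders.amount_raw"],
    "ods.orders.amount_raw": [],
}

def trace_upstream(field, graph):
    """Collect every upstream field, nearest first (breadth-first)."""
    seen, queue = [], list(graph.get(field, []))
    while queue:
        f = queue.pop(0)
        if f not in seen:
            seen.append(f)
            queue.extend(graph.get(f, []))
    return seen

print(trace_upstream("dws.daily_gmv", lineage))
# ['dwd.orders.amount', 'ods.orders.amount_raw']
```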

DataService Studio

DataService Studio is a flexible, lightweight, secure, and stable platform for building and publishing data APIs. It provides publication approval, access control, usage metering, and resource isolation — with zero O&M overhead.

Unified API service bus

DataService Studio acts as a unified service bus between the data warehouse and applications, closing the gap between the data warehouse, databases, and data applications.

DataService Studio architecture showing the bridge between data warehouse and applications

Feature | Description
No-code and SQL-mode API generation | Generate data APIs from tables in various data sources using no-code or self-service SQL mode. Use Function Compute to process API request parameters and returned results
Single-click publishing | Publish API services to an API gateway with a single click
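The core of SQL-mode API generation is binding a request parameter into a governed query and returning the rows as the API result. A self-contained sketch of that pattern, using sqlite3 as a stand-in engine; the table, query, and handler names are illustrative, not DataService Studio artifacts.

```python
# Sketch of the idea behind SQL-mode API generation: a request parameter
# is bound into a parameterized query and the result becomes the API
# response. sqlite3 stands in for the real engine; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 10.0), ("west", 5.0), ("east", 2.5)])

API_SQL = "SELECT region, SUM(amount) AS total FROM orders WHERE region = ?"

def get_order_total(region):
    """Handler a generated data API might wrap around the query above."""
    row = conn.execute(API_SQL, (region,)).fetchone()
    return {"region": row[0], "total": row[1]}

print(get_order_total("east"))  # {'region': 'east', 'total': 12.5}
```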

Serverless architecture

DataService Studio uses a serverless architecture, so you focus on API query logic instead of managing infrastructure. Computing resources are provisioned automatically with elastic scaling, resulting in zero O&M costs.

DataService Studio serverless architecture with elastic scaling

Open Platform

Open Platform exposes DataWorks data and capabilities to external systems through OpenAPI, OpenEvent, and Extensions. Use it to manage data workflows, govern data, and respond to business status changes from your own applications.

Integration capabilities

Capability | Description
OpenAPI | Integrate your applications with DataWorks. Batch-create, publish, and manage tasks to improve processing efficiency and reduce manual operations. For more information, see OpenAPI
OpenEvent | Subscribe to system events for real-time notifications. For example, subscribe to table change events to monitor core tables, or subscribe to task change events to build a real-time task monitoring dashboard. For more information, see OpenEvent
Extensions | Service-level plug-ins that combine OpenAPI and OpenEvent to customize workflow controls in DataWorks. For example, create a deployment control plug-in to block tasks that do not comply with your standards. For more information, see Extensions
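The table-change example above follows a subscribe-and-dispatch pattern. A minimal sketch of that pattern; the event payloads and handler names are illustrative and do not reflect the actual OpenEvent message schema.

```python
# Sketch of reacting to subscribed events, the pattern OpenEvent enables.
# Event payloads and handler names are illustrative, not the real
# OpenEvent message schema.

handlers = {}

def subscribe(event_type):
    """Register a handler function for one event type."""
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

@subscribe("table-changed")
def alert_on_core_table(event):
    # Monitor core tables, as in the OpenEvent example above.
    if event["table"].startswith("core_"):
        return f"ALERT: core table {event['table']} changed"
    return None

def dispatch(event):
    """Run every handler subscribed to the event's type."""
    results = [h(event) for h in handlers.get(event["type"], [])]
    return [r for r in results if r]

print(dispatch({"type": "table-changed", "table": "core_orders"}))
# ['ALERT: core table core_orders changed']
```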

Use cases

Open Platform supports deep system integration, automated operations, workflow definition, and business monitoring. Build industry-specific and scenario-based data applications and plug-ins on the DataWorks Open Platform.

Migration Assistant

Migration Assistant migrates jobs from open-source scheduling engines to DataWorks and supports cross-cloud, cross-region, and cross-account job migration — so you can clone and deploy DataWorks jobs without rebuilding from scratch. The DataWorks team, in collaboration with big data expert service teams, also offers cloud migration services to help you move your data and tasks to the cloud.

Migration capabilities

Capability | Description
Task migration to the cloud | Migrate jobs from open-source scheduling engines to DataWorks
DataWorks migration | Migrate development assets within the DataWorks ecosystem

Use cases

Use case | Description
Task migration to the cloud | Migrate jobs from open-source scheduling engines to DataWorks
Task backup | Regularly back up task code to minimize losses from accidental project deletion
Business replication | Abstract common business logic and use the export/import feature to replicate it across projects
Test environment setup | Replicate business code and switch the data input from production to test data
Cross-cloud development | Import and export between DataWorks on the public cloud and DataWorks in a private cloud for collaborative development