DataWorks is a one-stop platform for big data development and governance. It integrates with compute engines to run data processing tasks, and connects to data sources to move data in and out of those engines. This topic lists the compute engines and data sources that DataWorks supports.
Compute engine ecosystem
DataWorks does not execute computing tasks directly. Instead, it uses an engine binding mechanism: bind computing resources to register an engine with the platform, then create, orchestrate, and manage data processing tasks from a unified interface.
The following compute engines are supported:
| Engine | Typical use case |
|---|---|
| MaxCompute | Large-scale offline batch processing |
| Hologres | Real-time interactive queries on large datasets |
| Flink | Real-time stream processing |
| EMR on ECS | Open-source big data workloads (Hadoop, Spark, Hive) on ECS |
| EMR on ACK | Container-based open-source big data workloads |
| EMR Serverless StarRocks | Serverless real-time analytics with StarRocks |
| EMR Serverless Spark | Serverless Spark jobs without cluster management |
| CDH | On-premises Cloudera Hadoop clusters |
| AnalyticDB for MySQL | Cloud-native data warehousing compatible with MySQL |
| AnalyticDB for PostgreSQL | Massively parallel processing (MPP) analytics |
| AnalyticDB for Spark | Spark workloads integrated with AnalyticDB |
| OpenSearch | Full-text search and intelligent search |
| ClickHouse | High-performance OLAP and real-time reporting |
| Lindorm | Multi-model storage for IoT and time series data |
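The engine binding mechanism described above can be pictured with a small model. This is a minimal sketch, not the DataWorks API: `EngineRegistry`, `bind`, `submit`, and the names `warehouse` and `daily_orders_etl` are hypothetical and exist only to show that a task can target an engine only after that engine has been bound.

```python
from dataclasses import dataclass, field


@dataclass
class EngineRegistry:
    """Toy model of engine binding (hypothetical, not a DataWorks class)."""

    # Maps a binding name chosen by the user to an engine type.
    bound: dict = field(default_factory=dict)

    def bind(self, name: str, engine_type: str) -> None:
        # Binding computing resources registers the engine with the platform.
        self.bound[name] = engine_type

    def submit(self, task: str, engine: str) -> str:
        # The platform does not run the task itself; it routes it to a bound engine.
        if engine not in self.bound:
            raise LookupError(f"engine {engine!r} is not bound")
        return f"{task} -> {self.bound[engine]}"


registry = EngineRegistry()
registry.bind("warehouse", "MaxCompute")
result = registry.submit("daily_orders_etl", "warehouse")
print(result)
```

Submitting a task to a name that was never bound raises `LookupError` in this sketch, which mirrors the requirement to bind an engine before creating tasks for it.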
Data source ecosystem
A data source is the unified entry point in DataWorks for connecting to external systems. Configure the connection information and network settings once in Management Center, then reuse the connection across Data Integration, Data Studio, Data Map, DataAnalysis, and DataService Studio without repeating the configuration. In standard mode, you can also configure data source isolation to keep development and production environments physically separate.
The subsections below list the data sources supported by each DataWorks module.
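The "configure once, reuse everywhere" idea, together with development and production isolation, can be sketched as follows. This is an illustration under stated assumptions: the `DataSource` and `Connection` classes, the `resolve` method, the environment labels `dev` and `prod`, and the host names are all hypothetical and do not reflect how DataWorks stores connections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    host: str
    database: str


@dataclass(frozen=True)
class DataSource:
    """One named data source with separate connections per environment."""

    name: str
    dev: Connection
    prod: Connection

    def resolve(self, environment: str) -> Connection:
        # Each environment gets its own physical connection.
        if environment not in ("dev", "prod"):
            raise ValueError(f"unknown environment: {environment!r}")
        return self.dev if environment == "dev" else self.prod


orders = DataSource(
    name="orders_mysql",
    dev=Connection("dev-db.example.internal", "orders_dev"),
    prod=Connection("prod-db.example.internal", "orders"),
)

# Several modules reuse the same definition instead of re-entering settings.
for module in ("Data Integration", "Data Studio", "Data Map"):
    print(module, "->", orders.resolve("dev").host)
```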
Module support overview
Use this matrix to check which modules support a specific data source. See the relevant subsections below for links to setup guides.
| Data source | Data Integration | Data Studio | Data Map | DataAnalysis | DataService Studio |
|---|---|---|---|---|---|
| MaxCompute | ✓ | ✓ | ✓ | ✓ | |
| Hologres | ✓ | ✓ | ✓ | | |
| MySQL | ✓ | ✓ | ✓ | ✓ | |
| PostgreSQL | ✓ | ✓ | ✓ | ✓ | |
| Oracle | ✓ | ✓ | ✓ | ✓ | |
| SQL Server | ✓ | ✓ | ✓ | ✓ | |
| AnalyticDB for MySQL | ✓ | ✓ | ✓ | ✓ | |
| AnalyticDB for PostgreSQL | ✓ | ✓ | ✓ | ✓ | |
| StarRocks | ✓ | ✓ | ✓ | ✓ | ✓ |
| ClickHouse | ✓ | ✓ | ✓ | | |
| Doris | ✓ | ✓ | ✓ | ✓ | |
| PolarDB | ✓ | ✓ | ✓ | | |
| SelectDB | ✓ | ✓ | ✓ | | |
| OceanBase | ✓ | ✓ | ✓ | | |
| Tablestore | ✓ | ✓ | ✓ | | |
| Tablestore Stream | ✓ | ✓ | | | |
| Lindorm | ✓ | ✓ | | | |
| HBase | ✓ | ✓ | | | |
| Kafka | ✓ | | | | |
| Object Storage Service (OSS) | ✓ | | | | |
| Simple Log Service (SLS) / LogHub | ✓ | | | | |
| DataHub | ✓ | | | | |
| HDFS | ✓ | | | | |
| Amazon S3 | ✓ | | | | |
| Azure Blob Storage | ✓ | | | | |
| BigQuery | ✓ | | | | |
| Amazon Redshift | ✓ | ✓ | | | |
| Elasticsearch | ✓ | | | | |
| MongoDB | ✓ | ✓ | | | |
| Redis | ✓ | | | | |
| Maxgraph | ✓ | | | | |
| EMR (Hive, Spark SQL, Impala, Presto, Trino) | ✓ | | | | |
| CDH (Hive, Spark SQL) | ✓ | ✓ | | | |
| Data Lake Formation (DLF) | ✓ | ✓ | | | |
| SAP HANA | ✓ | ✓ | ✓ | | |
| DB2 | ✓ | ✓ | ✓ | | |
| DM | ✓ | ✓ | ✓ | | |
| DRDS (PolarDB-X 1.0) | ✓ | ✓ | | | |
| PolarDB-X 2.0 | ✓ | | | | |
| MariaDB | ✓ | ✓ | | | |
| KingbaseES | ✓ | ✓ | | | |
| Vertica | ✓ | ✓ | | | |
| GBase8a | ✓ | ✓ | | | |
| Milvus | ✓ | | | | |
| TiDB | ✓ | | | | |
| FTP | ✓ | | | | |
| HttpFile | ✓ | | | | |
| RestAPI (HTTP) | ✓ | | | | |
| Salesforce | ✓ | | | | |
| Sensors Data | ✓ | | | | |
| Memcache (OCS) | ✓ | | | | |
| MetaQ | ✓ | | | | |
| OSS-HDFS | ✓ | | | | |
| TOS | ✓ | | | | |
| TSDB | ✓ | | | | |
| Graph Database (GDB) | ✓ | | | | |
| AnalyticDB for Spark | ✓ | | | | |
| E-MapReduce HIVE | ✓ | | | | |
The table above covers the data sources listed in this topic. For the full list of supported data sources and synchronization methods, see Supported data sources and synchronization solutions.
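If you need to query the matrix programmatically, a dictionary keyed by data source is one simple representation. The sketch below copies only three rows from the table above (MySQL, StarRocks, and Kafka) as sample data; the `SUPPORT` mapping and the helper names `modules_for` and `sources_for` are hypothetical and not part of any DataWorks SDK.

```python
# Excerpt of the support matrix: data source -> modules that support it.
SUPPORT = {
    "MySQL": {"Data Integration", "Data Studio", "Data Map", "DataAnalysis"},
    "StarRocks": {
        "Data Integration",
        "Data Studio",
        "Data Map",
        "DataAnalysis",
        "DataService Studio",
    },
    "Kafka": {"Data Integration"},
}


def modules_for(source: str) -> list:
    """Return the modules that support a data source, sorted by name."""
    return sorted(SUPPORT.get(source, set()))


def sources_for(module: str) -> list:
    """Return the data sources that a module supports, sorted by name."""
    return sorted(name for name, modules in SUPPORT.items() if module in modules)


print(modules_for("Kafka"))
print(sources_for("Data Studio"))
```

Storing a set of module names per data source keeps both lookups short; extend `SUPPORT` with more rows as needed.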
Data Integration
Data Integration is the primary module for moving data between systems. Configure a data source once in Management Center, then use it to set up sync tasks. Each task has a scope (single-table or full-database) and a mode (offline or real-time). Supported sync patterns include full migration, incremental capture (CDC), and automatic full-and-incremental synchronization.
For setup instructions, see Data source management and Supported data sources and synchronization solutions.
Supported data sources fall into five categories: cloud storage; databases; Alibaba Cloud data stores; big data and open-source systems; and NoSQL, APIs, and SaaS.
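The scope and mode options for sync tasks can be modeled as two independent choices. The following is a minimal sketch with hypothetical names (`Scope`, `Mode`, `SyncTask`, `orders_mysql`, `warehouse`); it only enumerates the combinations and does not imply that every data source supports all of them.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Scope(Enum):
    SINGLE_TABLE = "single-table"
    FULL_DATABASE = "full-database"


class Mode(Enum):
    OFFLINE = "offline"
    REAL_TIME = "real-time"


@dataclass(frozen=True)
class SyncTask:
    """Illustrative sync task definition (not a DataWorks type)."""

    source: str
    target: str
    scope: Scope
    mode: Mode

    def describe(self) -> str:
        return f"{self.scope.value} {self.mode.value} sync: {self.source} -> {self.target}"


# Enumerate every scope/mode pairing for one source and target.
combinations = [
    SyncTask("orders_mysql", "warehouse", scope, mode)
    for scope, mode in product(Scope, Mode)
]
for task in combinations:
    print(task.describe())
```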
Data Studio
Data Studio supports hybrid orchestration and scheduling across compute engines and databases. In addition to engines such as MaxCompute, E-MapReduce (EMR), and AnalyticDB, you can connect databases directly as nodes in your development pipeline. Configure data source connections and scheduling policies once, then call them from the development and O&M modules.
For more information, see Database nodes.
| Supported data sources | | |
|---|---|---|
| MySQL data source | PolarDB MySQL data source | SAP HANA data source |
| SQL Server data source | PolarDB PostgreSQL data source | Vertica data source |
| Oracle data source | Doris data source | DM data source |
| PostgreSQL data source | MariaDB data source | KingbaseES data source |
| StarRocks data source | SelectDB data source | OceanBase data source |
| DRDS data source | Amazon Redshift data source | DB2 data source |
| GBase8a data source | | |
Data Map
Data Map uses pre-configured data source connections to collect metadata automatically. The built-in collector retrieves database table schemas, partition information, and cross-system data lineage. After collection, view table details and visualize the lineage graph in Data Map to trace the origin and flow of your data assets.
For more information, see Metadata acquisition.
| Supported data sources | | |
|---|---|---|
| AnalyticDB for PostgreSQL data source | MySQL data source | Hologres data source |
| AnalyticDB for MySQL data source | PostgreSQL data source | Lindorm data source |
| AnalyticDB for Spark data source | SQL Server data source | MaxCompute data source |
| CDH Hive data source | Oracle data source | StarRocks data source |
| Data Lake Formation (DLF) | Tablestore (OTS) data source | ClickHouse data source |
| E-MapReduce HIVE data source | | |
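Tracing the origin of a data asset through lineage amounts to walking a graph upstream. The sketch below is a generic breadth-first traversal over a hand-written lineage map, not the Data Map collector or its API; the table names and the `UPSTREAM` and `trace_origins` identifiers are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables it reads from.
UPSTREAM = {
    "ads.sales_report": ["dws.daily_sales"],
    "dws.daily_sales": ["ods.orders", "ods.customers"],
    "ods.orders": [],
    "ods.customers": [],
}


def trace_origins(table: str) -> list:
    """Return the root tables (those with no upstream) that feed `table`."""
    seen = {table}
    queue = deque([table])
    origins = []
    while queue:
        current = queue.popleft()
        parents = UPSTREAM.get(current, [])
        if not parents:
            # A table with no upstream dependencies is an origin.
            origins.append(current)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(origins)


origins = trace_origins("ads.sales_report")
print(origins)
```

The `seen` set prevents a table from being visited twice when several downstream tables share the same upstream.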
DataAnalysis
DataAnalysis lets you query, analyze, transform, and visualize data interactively using the engines and data sources registered in DataWorks.
For more information, see SQL query and analysis.
| Supported data sources | | |
|---|---|---|
| MaxCompute data source | Hologres data source | EMR Hive data source |
| EMR Spark SQL data source | EMR Impala data source | EMR Presto data source |
| EMR Trino data source | CDH Hive data source | CDH Spark SQL data source |
| StarRocks data source | ClickHouse data source | SelectDB data source |
| Doris data source | AnalyticDB for MySQL 3.0 data source | AnalyticDB for PostgreSQL data source |
| Tablestore (OTS) data source | MySQL data source | PostgreSQL data source |
| Oracle data source | SQL Server data source | |
DataService Studio
DataService Studio generates APIs from data sources, exposing data as standard service endpoints for sharing across teams and applications.
For more information, see Generate an API.
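Conceptually, generating an API from a data source means wrapping a parameterized query in a function that returns a structured response. The sketch below illustrates that idea with Python's built-in `sqlite3` module standing in for a real data source. It is not how DataService Studio is implemented; the `orders` table, the `get_orders` function, and the response shape are all assumptions made for illustration.

```python
import json
import sqlite3

# An in-memory database stands in for a registered data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)


def get_orders(customer: str) -> str:
    """Hypothetical endpoint handler: request parameter in, JSON out."""
    # Bind the request parameter instead of formatting it into the SQL text.
    cursor = conn.execute(
        "SELECT id, amount FROM orders WHERE customer = ? ORDER BY id",
        (customer,),
    )
    columns = [column[0] for column in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return json.dumps({"data": rows, "count": len(rows)})


response = get_orders("alice")
print(response)
```

Binding the request parameter through the `?` placeholder keeps caller input out of the SQL text.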