DataWorks lets you build and schedule big data pipelines on E-MapReduce (EMR) clusters without writing command-line scripts. Create Hive, MapReduce (MR), Presto, Spark SQL, and other node types in DataStudio, enable periodic scheduling, and monitor jobs in Operation Center — all from a single console.
How it works
Alibaba Cloud provides EMR on ECS, EMR on ACK, and EMR Serverless StarRocks to meet the business requirements of different users.
Development on DataWorks follows three stages:
Set up — Purchase the right DataWorks edition and resource group, create an EMR cluster, and register it with your workspace.
Develop — Build and schedule EMR nodes in DataStudio. Use Data Integration to move data in and out of EMR Hive.
Operate — Monitor scheduled tasks in Operation Center, govern metadata in Data Map, and enforce data quality with Data Quality rules.
Supported cluster types
Limitations
Permissions
Only the following identities can register an EMR cluster with DataWorks:
An Alibaba Cloud account (root account)
A RAM user or RAM role with the DataWorks Workspace Administrator role and the AliyunEMRFullAccess policy
A RAM user or RAM role with both the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies
For details, see Grant permissions to a RAM user.
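The three cases above reduce to a simple rule: either the identity is the Alibaba Cloud account, or it holds AliyunEMRFullAccess together with the Workspace Administrator role or AliyunDataWorksFullAccess. The following is a minimal sketch of that rule only. The `Identity` class and its fields are hypothetical and do not correspond to any DataWorks or RAM API object.

```python
from dataclasses import dataclass

# Policy and role names come from the list above; the data model is hypothetical.
EMR_FULL = "AliyunEMRFullAccess"
DATAWORKS_FULL = "AliyunDataWorksFullAccess"
WORKSPACE_ADMIN = "Workspace Administrator"


@dataclass(frozen=True)
class Identity:
    """Illustrative description of an identity, not a real API type."""
    is_root_account: bool = False
    policies: frozenset = frozenset()
    workspace_roles: frozenset = frozenset()


def can_register_emr_cluster(identity: Identity) -> bool:
    """Return True if the identity matches one of the three listed cases."""
    if identity.is_root_account:
        return True
    has_emr = EMR_FULL in identity.policies
    is_admin = WORKSPACE_ADMIN in identity.workspace_roles
    has_dataworks = DATAWORKS_FULL in identity.policies
    # AliyunEMRFullAccess is required in both non-root cases.
    return has_emr and (is_admin or has_dataworks)


developer = Identity(policies=frozenset({DATAWORKS_FULL}))
print(can_register_emr_cluster(developer))
```

In this sketch, an identity that holds only AliyunDataWorksFullAccess is rejected because AliyunEMRFullAccess is missing.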
Region availability
EMR Serverless Spark is available in: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), and US (Virginia).
Task type restrictions
DataWorks does not support EMR Flink tasks.
Run EMR tasks using serverless resource groups (recommended) or exclusive resource groups for scheduling (legacy).
Data lineage
Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes generate data lineage:
| Cluster version | Lineage support |
|---|---|
| 5.9.1, 3.43.1, or later (Hive and Spark SQL) | Table-level and field-level lineage |
| 5.8.0, 3.42.0, or later (Spark-type nodes) | Table-level and field-level lineage |
| Earlier than 5.8.0 or 3.42.0 (Spark-type nodes) | Table-level lineage (Spark 2.x only) |
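The table can be read as a lookup from node family and cluster version to lineage level. The sketch below encodes it that way. It assumes that the 3.x and 5.x thresholds apply to their own release lines, and the labels `hive_sparksql` and `spark` are hypothetical names chosen for this example. Combinations the table does not cover return `None` rather than a guess.

```python
def parse_version(text):
    """Turn a version string such as '5.9.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Thresholds copied from the table, keyed by major release line (assumption).
HIVE_SQL_MIN = {3: (3, 43, 1), 5: (5, 9, 1)}
SPARK_MIN = {3: (3, 42, 0), 5: (5, 8, 0)}


def lineage_support(node_family, cluster_version):
    """Return the lineage level from the table, or None if it is not covered."""
    version = parse_version(cluster_version)
    major = version[0]
    if node_family == "hive_sparksql":
        minimum = HIVE_SQL_MIN.get(major)
        if minimum is not None and version >= minimum:
            return "table+field"
        return None  # the table states nothing for earlier versions
    if node_family == "spark":
        minimum = SPARK_MIN.get(major)
        if minimum is None:
            return None
        if version >= minimum:
            return "table+field"
        return "table (Spark 2.x only)"
    raise ValueError(f"unknown node family: {node_family}")


print(lineage_support("spark", "5.8.0"))
print(lineage_support("spark", "3.41.0"))
```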
To use metadata management for DataLake or custom clusters, configure EMR-HOOK on the cluster first. Without EMR-HOOK, metadata is not displayed in real time, audit logs are not generated, and data lineage is unavailable in DataWorks. EMR-HOOK supports EMR Hive and EMR Spark SQL services only. See Configure EMR-HOOK for Hive and Configure EMR-HOOK for Spark SQL.
Kerberos clusters
For EMR clusters with Kerberos authentication enabled, add an inbound rule to the security group to allow UDP access from the vSwitch CIDR block associated with your resource group.
To find the security group and add the rule:
On the Basic Information tab of the EMR cluster, click the
In the Rule section, click Inbound, then select Add Rule.
Set Protocol Type to Custom UDP.
For Port Range, check /etc/krb5.conf in the EMR cluster for the KDC port.
Set Destination to the vSwitch CIDR block associated with your resource group.
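The KDC port lookup in the steps above can be scripted. The sketch below reads the text of a krb5.conf file, collects the ports on `kdc = host[:port]` lines, and falls back to 88, the standard Kerberos port, when an entry names no port. The sample realm, host names, CIDR block, and the dictionary keys are placeholders for illustration. IPv6 literals are out of scope.

```python
import ipaddress
import re

DEFAULT_KDC_PORT = 88  # standard Kerberos port, assumed when an entry names no port


def kdc_ports(krb5_conf_text):
    """Collect the ports that appear on 'kdc = host[:port]' lines."""
    ports = set()
    for raw_line in krb5_conf_text.splitlines():
        line = raw_line.strip()
        if line.startswith(("#", ";")):
            continue  # comment line
        match = re.match(r"kdc\s*=\s*(\S+)$", line)
        if match is None:
            continue
        _host, _, port = match.group(1).partition(":")
        ports.add(int(port) if port else DEFAULT_KDC_PORT)
    return sorted(ports)


def udp_inbound_rule(port, vswitch_cidr):
    """Describe the rule as a plain dict; the keys are illustrative only."""
    network = ipaddress.ip_network(vswitch_cidr)  # raises ValueError if malformed
    return {"protocol": "UDP", "port": port, "cidr": str(network)}


# Placeholder content; on a cluster, read /etc/krb5.conf instead.
SAMPLE = """
[realms]
  EMR.EXAMPLE.COM = {
    kdc = kdc-host.example.internal:88
    admin_server = kdc-host.example.internal:749
  }
"""

for port in kdc_ports(SAMPLE):
    print(udp_inbound_rule(port, "192.168.0.0/24"))
```

The output is only a summary of what to enter in the console. It does not call any security group API.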
Prerequisites
Before you begin, ensure that you have:
An Alibaba Cloud account with a RAM user or role that has the required permissions to register an EMR cluster (see Permissions above)
DataWorks activated and a workspace created — see Purchase and Manage workspaces
An EMR cluster created — see Create a cluster. Refer to the EMR cluster configuration suggestions section to select the right services for your cluster.
A DataWorks serverless resource group purchased — see Use serverless resource groups. Serverless resource groups support both data synchronization and task scheduling, making them suitable for most use cases. New users can purchase only serverless resource groups.
If you have a legacy exclusive resource group, you can still use it to run EMR tasks. Use an exclusive resource group for Data Integration for data synchronization tasks, and an exclusive resource group for scheduling for scheduling tasks. See Use legacy resource groups.
Billing
Charges come from two sources when you run EMR tasks in DataWorks.
Understand the cost structure before creating resources. EMR cluster charges vary by instance type, region, and runtime. To avoid unexpected charges, delete clusters and resource groups that you no longer need.
DataWorks charges
| Item | Description |
|---|---|
| DataWorks edition | Activating Standard, Professional, or Enterprise Edition incurs edition fees. |
| Scheduling resource group | Purchasing a serverless or exclusive resource group for scheduling incurs resource group fees. |
| Data synchronization resource group | Data synchronization tasks consume scheduling resources. A serverless resource group covers both task scheduling and data synchronization. |
For a full breakdown of DataWorks billable items, see Billing overview.
Other Alibaba Cloud service charges
The following charges appear on separate bills from the respective Alibaba Cloud services, not on your DataWorks bill.
| Item | When charges occur |
|---|---|
| Database fees | When data synchronization tasks read from or write to databases. |
| Computing and storage fees | When tasks run on an EMR cluster. For EMR billing, see Billing. |
| Network service fees | When network connections are established between DataWorks and other services (for example, Express Connect or Elastic IP Address). |
Environment preparation
Resource preparation
| Item | Description | References |
|---|---|---|
| DataWorks edition | Basic Edition covers core operations: data migration, task development, EMR scheduling, and data governance. Upgrade to Standard, Professional, or Enterprise Edition for advanced governance and security features. | Features by edition |
| Resource group | Use serverless resource groups (recommended) or legacy exclusive resource groups to run EMR tasks. | Use serverless resource groups / Use legacy resource groups |
Development environment preparation
Register an EMR cluster with your DataWorks workspace and add team members before starting collaborative development.
| Item | Description | References |
|---|---|---|
| Data synchronization environment | Before running data synchronization tasks, add the EMR service as a data source in DataWorks. | Supported data sources |
| Data development and analysis environment | Register the EMR cluster as a data source in DataWorks to enable data development, analysis, and periodic scheduling. | Register an EMR cluster |
| Collaborative development environment | To enable team members to develop data together in a workspace: (1) Add RAM users to the workspace and assign them the Development role. (2) Add workspace members to the EMR cluster. | Workspace-level permissions / Manage OpenLDAP users |
Permission management
DataWorks provides permission management at two levels: data access and service features.
Data access permissions
Configure mappings between RAM users and EMR cluster accounts so that workspace members have the correct cluster-level permissions. See Configure cluster identity mapping.
If Data Lake Formation (DLF) is your metadata storage service, manage data access permissions directly in DataWorks Security Center — including permission requests, approvals, and audits. See DLF data access control.
Service and feature permissions
Assign workspace-level roles to RAM users before they start developing data. For a guided setup, see Best practices: Grant permissions to RAM users.
To control access to DataWorks global modules (for example, restricting access to Data Map or allowing workspace deletion), see Global module permission control.
To control access to workspace-level modules (for example, granting DataStudio access or restricting Data Security Guard), see Workspace-level module permission control.
Getting started
Data integration
Data Integration lets you read data from and write data to EMR Hive. Add the Hive service as a data source first, then set up synchronization pipelines. Supported scenarios include batch, full, and incremental synchronization. See Data Integration overview.
Data modeling and development
Data Modeling
Data Modeling structures and manages large volumes of unordered data using Alibaba's data mid-end methodology. It covers data warehouse planning, data standards, dimensional modeling, and data metrics, helping teams share a common understanding of business data. See Data Modeling overview.
DataStudio
DataStudio is the development environment where you create, schedule, and debug EMR nodes without using command-line tools.
Available EMR node types:
| Node type | Create guide |
|---|---|
| EMR Hive node | Create an EMR Hive node |
| EMR MR node | Create an EMR MR node |
| EMR Spark SQL node | Create an EMR Spark SQL node |
| EMR Spark node | Create an EMR Spark node |
| EMR Shell node | Create an EMR Shell node |
| EMR Presto node | Create an EMR Presto node |
| EMR Spark Streaming node | Create an EMR Spark Streaming node |
| EMR Kyuubi node | Create an EMR Kyuubi node |
| EMR Trino node | Create an EMR Trino node |
| EMR table | Create an EMR table |
| EMR resource | Create an EMR resource |
| EMR function | Create an EMR function |
General-purpose node types are also available for complex workflow logic:
| Category | Node types |
|---|---|
| Workflow control | Zero load node |
| Event-driven triggers | HTTP Trigger node, OSS object inspection node, FTP Check node |
| Parameter passing | Assignment node, Parameter node |
| Loop and branching logic | Do-while node, For-each node, Branch node |
| Other | Common Shell node, MySQL database node |
See the reference guides: Zero load node, HTTP Trigger node, OSS object inspection node, FTP Check node, Assignment node, Parameter node, Do-while node logic, For-each node logic, and Configure a branch node.
After developing nodes, the typical next steps are:
Configure scheduling properties — Set dependencies and scheduling parameters to enable periodic execution. See Scheduling properties overview.
Debug nodes — Run and validate tasks before deploying to avoid wasting compute resources. See Task debugging process.
Deploy nodes — Deploy tasks to the production environment so they appear in Operation Center. See Publish tasks.
Manage nodes — Perform bulk operations such as deploying, undeploying, and updating scheduling properties across multiple nodes. See Batch operations.
Apply process controls — Enforce code review, smoke testing, and custom review logic to ensure development accuracy. See Development process control.
Operation Center
Operation Center is the production O&M platform where you monitor scheduled tasks and resolve issues. Key capabilities include intelligent diagnostics, task reruns, and baseline monitoring to ensure critical tasks complete on time. See Perform basic O&M operations on scheduled tasks.
Data Quality
Data Quality monitors your data throughout the development lifecycle by attaching quality rules to scheduling tasks. This surfaces data issues early and prevents them from affecting downstream systems. See Data Quality overview.
Data governance
After you register an EMR cluster, DataWorks automatically collects metadata from the cluster. The following modules help you manage and govern that data:
| Module | Description | Reference |
|---|---|---|
| Data Map | Enterprise metadata management platform for discovering, searching, and understanding data assets. | Data Map overview |
| Security Center / Data Security Guard / Approval Center | End-to-end data security platform covering data classification, sensitive data identification, authorization management, data masking, access auditing, and risk response. | Security Center overview / Data Security Guard overview / Approval Center overview |
| Data Governance Center | Automatically identifies governance issues across multiple dimensions based on best-practice rules, and provides both preventive and corrective governance actions. | Data Governance Center overview |
For an overview of detected issues and available actions, see Data Governance Center overview.
Data analysis and services
| Module | Description | Reference |
|---|---|---|
| DataAnalysis | Run SQL queries online, build chart cards from query results, and generate data reports for daily reporting. | DataAnalysis overview |
| DataService Studio | Manage and expose API services for internal and external systems in a centralized service hub. | DataService Studio overview |
Open Platform
Integrate your application systems with DataWorks programmatically using the following capabilities:
| Module | Description | Reference |
|---|---|---|
| OpenAPI | Call DataWorks API operations to automate big data workflows, reduce manual operations, and minimize data risks. | OpenAPI |
| OpenEvent | Subscribe to DataWorks change events to detect and respond to workspace changes in real time. | OpenEvent overview |
| Extensions | Register local programs as extensions to manage extension point events and customize DataWorks processes. | Extensions overview |
EMR cluster configuration suggestions
Tune the following EMR services before running production workloads in DataWorks.
Kyuubi
In the production environment, set:
kyuubi_java_opts: 10g or larger
kyuubi_beeline_opts: 2g or larger
Spark
The default memory allocation for Spark is small. Adjust the following parameters based on your cluster scale:
spark.driver.memory
spark.driver.memoryOverhead
spark.executor.memory
Pass memory settings via the spark-submit CLI or configure them in the cluster. For full configuration options, see Spark memory management.
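As an illustration of the spark-submit route, the sketch below assembles a command line that sets the three memory parameters. The `--driver-memory`, `--executor-memory`, and `--conf` options are standard spark-submit options. The application name and the sizes are placeholders, not recommended values.

```python
import shlex


def spark_submit_command(app, driver_memory, driver_overhead, executor_memory):
    """Build a spark-submit argument list with explicit memory settings."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,      # sets spark.driver.memory
        "--executor-memory", executor_memory,  # sets spark.executor.memory
        "--conf", f"spark.driver.memoryOverhead={driver_overhead}",
        app,
    ]


# Placeholder values; size them according to your cluster scale.
command = spark_submit_command("app.jar", "4g", "1g", "8g")
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.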
Only EMR Hive, EMR Spark, and EMR Spark SQL nodes generate data lineage. Whether a node produces table-level or field-level lineage depends on the cluster version, as listed in the Data lineage section.
HDFS
Adjust the following parameters based on your cluster scale:
hadoop_namenode_heapsize
hadoop_datanode_heapsize
hadoop_secondary_namenode_heapsize
hadoop_namenode_opts