DataWorks:EMR development guidelines

Last Updated:Mar 26, 2026

DataWorks lets you build and schedule big data pipelines on E-MapReduce (EMR) clusters without writing command-line scripts. Create Hive, MapReduce (MR), Presto, Spark SQL, and other node types in DataStudio, enable periodic scheduling, and monitor jobs in Operation Center — all from a single console.

How it works

Alibaba Cloud provides EMR on ECS, EMR on ACK, and EMR Serverless StarRocks to meet the business requirements of different users.

Development on DataWorks follows three stages:

  1. Set up — Purchase the right DataWorks edition and resource group, create an EMR cluster, and register it with your workspace.

  2. Develop — Build and schedule EMR nodes in DataStudio. Use Data Integration to move data in and out of EMR Hive.

  3. Operate — Monitor scheduled tasks in Operation Center, govern metadata in Data Map, and enforce data quality with Data Quality rules.

Supported cluster types

Register one of the following EMR cluster types with DataWorks before running tasks:

| Cluster type | Typical scenario | Learn more |
| --- | --- | --- |
| DataLake cluster (EMR on ECS) | New-generation data lake for Hive, Spark, and Presto workloads. Recommended for new projects. | What is EMR on ECS |
| Custom cluster (EMR on ECS) | Fully configurable cluster for advanced users with specific component requirements. | What is EMR on ECS |
| Hadoop cluster (EMR on ECS) | Legacy data lake. Migrate to DataLake clusters. | What is EMR on ECS |
| Spark cluster (EMR on ACK) | Kubernetes-based cluster for Spark workloads on container infrastructure. | What is EMR on ACK |
| EMR Serverless Spark | Fully managed, serverless Spark with no cluster management overhead. | What is EMR Serverless Spark |

Important

Hadoop clusters (old data lake) are no longer recommended. Migrate to DataLake clusters. For migration steps, see Migrate a Hadoop cluster to a DataLake cluster.

Supported Hadoop cluster versions in DataWorks: EMR-3.26.3, EMR-3.27.2, EMR-3.29.0, EMR-3.32.0, EMR-3.35.0, EMR-3.38.2, EMR-3.38.3, EMR-4.3.0, EMR-4.4.1, EMR-4.5.0, EMR-4.5.1, EMR-4.6.0, EMR-4.8.0, EMR-4.9.0, EMR-5.2.1, EMR-5.4.3, EMR-5.6.0.
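
As a quick pre-registration check, the version list above can be encoded as a lookup. This is a minimal sketch, not a DataWorks API: the helper name `is_supported_hadoop_version` is hypothetical, and accepting a version without the `EMR-` prefix is a convenience assumption.

```python
# Versions copied from the supported Hadoop cluster version list above.
SUPPORTED_HADOOP_VERSIONS = frozenset({
    "EMR-3.26.3", "EMR-3.27.2", "EMR-3.29.0", "EMR-3.32.0", "EMR-3.35.0",
    "EMR-3.38.2", "EMR-3.38.3", "EMR-4.3.0", "EMR-4.4.1", "EMR-4.5.0",
    "EMR-4.5.1", "EMR-4.6.0", "EMR-4.8.0", "EMR-4.9.0", "EMR-5.2.1",
    "EMR-5.4.3", "EMR-5.6.0",
})


def is_supported_hadoop_version(version: str) -> bool:
    """Return True if a Hadoop cluster version appears in the list above.

    Hypothetical helper: accepts "EMR-4.9.0" or "4.9.0" (the form without
    the prefix is an assumption made for convenience).
    """
    normalized = version.strip().upper()
    if not normalized.startswith("EMR-"):
        normalized = "EMR-" + normalized
    return normalized in SUPPORTED_HADOOP_VERSIONS
```

The check applies to Hadoop clusters only. The list above does not cover the other cluster types.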

If your cluster type cannot be registered in DataWorks, submit a ticket for technical support.

Limitations

Permissions

Only the following identities can register an EMR cluster with DataWorks:

  • An Alibaba Cloud account (root account)

  • A RAM user or RAM role with the DataWorks Workspace Administrator role and the AliyunEMRFullAccess policy

  • A RAM user or RAM role with both AliyunDataWorksFullAccess and AliyunEMRFullAccess policies

For details, see Grant permissions to a RAM user.
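
The three rules above reduce to a short boolean check. The sketch below only illustrates the logic. The `Identity` class and `can_register_emr_cluster` function are hypothetical names and do not correspond to any DataWorks or RAM API; only the policy names come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    """Hypothetical description of the caller; not a DataWorks API type."""
    is_root_account: bool = False       # Alibaba Cloud account (root account)
    is_workspace_admin: bool = False    # DataWorks Workspace Administrator role
    policies: frozenset = field(default_factory=frozenset)  # attached RAM policies


def can_register_emr_cluster(identity: Identity) -> bool:
    """Apply the three registration rules listed above."""
    if identity.is_root_account:
        return True
    has_emr = "AliyunEMRFullAccess" in identity.policies
    has_dataworks = "AliyunDataWorksFullAccess" in identity.policies
    # AliyunEMRFullAccess is required in both RAM-based options.
    return has_emr and (identity.is_workspace_admin or has_dataworks)
```

For example, a workspace administrator without AliyunEMRFullAccess does not pass the check, because both RAM-based options require that policy.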

Region availability

EMR Serverless Spark is available in: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), and US (Virginia).

Task type restrictions

  • DataWorks does not support EMR Flink tasks.

  • Run EMR tasks using serverless resource groups (recommended) or exclusive resource groups for scheduling (legacy).

Data lineage

Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes generate data lineage:

| Cluster version | Lineage support |
| --- | --- |
| 5.9.1, 3.43.1, or later (Hive and Spark SQL) | Table-level and field-level lineage |
| 5.8.0, 3.42.0, or later (Spark-type nodes) | Table-level and field-level lineage |
| Earlier than 5.8.0 or 3.42.0 (Spark-type nodes) | Table-level lineage (Spark 2.x only) |
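
The version thresholds in the table can be read as a lookup keyed by release line. The sketch below assumes that each threshold applies within its own major line (5.x versions are compared with the 5.x threshold, 3.x versions with the 3.x threshold) and returns `None` for cases the table does not describe. The function names and the `node_family` labels are hypothetical.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "EMR-5.9.1" or "5.9.1" into (5, 9, 1)."""
    text = version.strip()
    if text.upper().startswith("EMR-"):
        text = text[4:]
    return tuple(int(part) for part in text.split("."))


# Minimum versions from the table, keyed by major release line (assumption).
HIVE_SPARK_SQL_MIN = {5: (5, 9, 1), 3: (3, 43, 1)}
SPARK_MIN = {5: (5, 8, 0), 3: (3, 42, 0)}


def lineage_support(cluster_version: str, node_family: str) -> Optional[str]:
    """node_family is "hive_spark_sql" or "spark" (hypothetical labels)."""
    if node_family not in ("hive_spark_sql", "spark"):
        raise ValueError(f"unknown node family: {node_family}")
    version = parse_version(cluster_version)
    thresholds = HIVE_SPARK_SQL_MIN if node_family == "hive_spark_sql" else SPARK_MIN
    minimum = thresholds.get(version[0])
    if minimum is None:
        return None  # release line not covered by the table
    if version >= minimum:
        return "table-level and field-level"
    if node_family == "spark":
        return "table-level (Spark 2.x only)"
    return None  # the table does not describe older Hive or Spark SQL versions
```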

To use metadata management for DataLake or custom clusters, configure EMR-HOOK on the cluster first. Without EMR-HOOK, metadata is not displayed in real time, audit logs are not generated, and data lineage is unavailable in DataWorks. EMR-HOOK supports EMR Hive and EMR Spark SQL services only. See Configure EMR-HOOK for Hive and Configure EMR-HOOK for Spark SQL.

Kerberos clusters

For EMR clusters with Kerberos authentication enabled, add an inbound rule to the security group to allow UDP access from the vSwitch CIDR block associated with your resource group.

To find the security group and add the rule:

  1. On the Basic Information tab of the EMR cluster, click the

  2. In the Rule section, click Inbound, then select Add Rule.

  3. Set Protocol Type to Custom UDP.

  4. For Port Range, enter the KDC port. You can find it in /etc/krb5.conf on the EMR cluster.

  5. Set Source to the vSwitch CIDR block associated with your resource group, so that the rule allows inbound UDP traffic from that CIDR block.
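
To look up the KDC port for the Port Range field, you can read the `kdc` entries in `/etc/krb5.conf`. The following is a minimal sketch that assumes the usual `kdc = host:port` form and falls back to port 88, the standard Kerberos port, when no port is written. The function name `kdc_ports` and the host name in the sample are hypothetical.

```python
import re
from typing import List

DEFAULT_KDC_PORT = 88  # standard Kerberos port, used when krb5.conf omits one


def kdc_ports(krb5_conf_text: str) -> List[int]:
    """Collect the distinct KDC ports from the text of a krb5.conf file."""
    ports: List[int] = []
    for line in krb5_conf_text.splitlines():
        # Matches "kdc = host" or "kdc = host:port"; commented lines do not match.
        match = re.match(r"\s*kdc\s*=\s*(\S+)", line)
        if not match:
            continue
        _host, sep, port = match.group(1).rpartition(":")
        port_number = int(port) if sep and port.isdigit() else DEFAULT_KDC_PORT
        if port_number not in ports:
            ports.append(port_number)
    return ports


# Sample content with a placeholder host name (assumption, not a real cluster).
sample = """
[realms]
  EMR.EXAMPLE.COM = {
    kdc = master-1-1.example.internal:88
    admin_server = master-1-1.example.internal:749
  }
"""
print(kdc_ports(sample))
```

On a cluster node, pass the file content instead of the sample, for example `kdc_ports(open("/etc/krb5.conf").read())`.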

Prerequisites

Before you begin, ensure that you have:

  • An Alibaba Cloud account, or a RAM user or RAM role, with the permissions required to register an EMR cluster (see Permissions above)

  • DataWorks activated and a workspace created — see Purchase and Manage workspaces

  • An EMR cluster created — see Create a cluster. Refer to the EMR cluster configuration suggestions section to select the right services for your cluster.

  • A DataWorks serverless resource group purchased — see Use serverless resource groups. Serverless resource groups support both data synchronization and task scheduling, making them suitable for most use cases. New users can purchase only serverless resource groups.

If you have a legacy exclusive resource group, you can still use it to run EMR tasks. Use an exclusive resource group for Data Integration for data synchronization tasks, and an exclusive resource group for scheduling for scheduling tasks. See Use legacy resource groups.

Billing

When you run EMR tasks in DataWorks, charges come from two sources: DataWorks itself and the other Alibaba Cloud services that your tasks use.

Understand the cost structure before creating resources. EMR cluster charges vary by instance type, region, and runtime. To avoid unexpected charges, delete clusters and resource groups that you no longer need.

DataWorks charges

| Item | Description |
| --- | --- |
| DataWorks edition | Activating Standard, Professional, or Enterprise Edition incurs edition fees. |
| Scheduling resource group | Purchasing a serverless or exclusive resource group for scheduling incurs resource group fees. |
| Data synchronization resource group | Data synchronization tasks consume scheduling resources. A serverless resource group covers both task scheduling and data synchronization. |

For a full breakdown of DataWorks billable items, see Billing overview.

Other Alibaba Cloud service charges

Important

The following charges appear on separate bills from the respective Alibaba Cloud services, not on your DataWorks bill.

| Item | When charges occur |
| --- | --- |
| Database fees | When data synchronization tasks read from or write to databases. |
| Computing and storage fees | When tasks run on an EMR cluster. For EMR billing, see Billing. |
| Network service fees | When network connections are established between DataWorks and other services (for example, Express Connect or Elastic IP Address). |

Environment preparation

Resource preparation

| Item | Description | References |
| --- | --- | --- |
| DataWorks edition | Basic Edition covers core operations: data migration, task development, EMR scheduling, and data governance. Upgrade to Standard, Professional, or Enterprise Edition for advanced governance and security features. | Features by edition |
| Resource group | Use serverless resource groups (recommended) or legacy exclusive resource groups to run EMR tasks. | Use serverless resource groups / Use legacy resource groups |

Development environment preparation

Register an EMR cluster with your DataWorks workspace and add team members before starting collaborative development.

| Item | Description | References |
| --- | --- | --- |
| Data synchronization environment | Before running data synchronization tasks, add the EMR service as a data source in DataWorks. | Supported data sources |
| Data development and analysis environment | Register the EMR cluster as a data source in DataWorks to enable data development, analysis, and periodic scheduling. | Register an EMR cluster |
| Collaborative development environment | To enable team members to develop data together in a workspace: (1) Add RAM users to the workspace and assign them the Development role. (2) Add workspace members to the EMR cluster. | Workspace-level permissions / Manage OpenLDAP users |

Permission management

DataWorks provides permission management at two levels: data access and service features.

Data access permissions

Configure mappings between RAM users and EMR cluster accounts so that workspace members have the correct cluster-level permissions. See Configure cluster identity mapping.

If Data Lake Formation (DLF) is your metadata storage service, manage data access permissions directly in DataWorks Security Center — including permission requests, approvals, and audits. See DLF data access control.

Service and feature permissions

Assign workspace-level roles to RAM users before they start developing data. For a guided setup, see Best practices: Grant permissions to RAM users.

Getting started

Data integration

Data Integration lets you read data from and write data to EMR Hive. Add the Hive service as a data source first, then set up synchronization pipelines. Supported scenarios include batch, full, and incremental synchronization. See Data Integration overview.

Data modeling and development

Data Modeling

Data Modeling structures and manages large volumes of unordered data using Alibaba's data mid-end methodology. It covers data warehouse planning, data standards, dimensional modeling, and data metrics, helping teams share a common understanding of business data. See Data Modeling overview.

DataStudio

DataStudio is the development environment where you create, schedule, and debug EMR nodes without using command-line tools.

Available EMR node types:

| Node type | Create guide |
| --- | --- |
| EMR Hive node | Create an EMR Hive node |
| EMR MR node | Create an EMR MR node |
| EMR Spark SQL node | Create an EMR Spark SQL node |
| EMR Spark node | Create an EMR Spark node |
| EMR Shell node | Create an EMR Shell node |
| EMR Presto node | Create an EMR Presto node |
| EMR Spark Streaming node | Create an EMR Spark Streaming node |
| EMR Kyuubi node | Create an EMR Kyuubi node |
| EMR Trino node | Create an EMR Trino node |
| EMR table | Create an EMR table |
| EMR resource | Create an EMR resource |
| EMR function | Create an EMR function |

General-purpose node types are also available for complex workflow logic:

| Category | Node types |
| --- | --- |
| Workflow control | Zero load node |
| Event-driven triggers | HTTP Trigger node, OSS object inspection node, FTP Check node |
| Parameter passing | Assignment node, Parameter node |
| Loop and branching logic | Do-while node, For-each node, Branch node |
| Other | Common Shell node, MySQL database node |

See the reference guides: Zero load node, HTTP Trigger node, OSS object inspection node, FTP Check node, Assignment node, Parameter node, Do-while node logic, For-each node logic, and Configure a branch node.

After developing nodes, the typical next steps are:

  • Configure scheduling properties — Set dependencies and scheduling parameters to enable periodic execution. See Scheduling properties overview.

  • Debug nodes — Run and validate tasks before deploying to avoid wasting compute resources. See Task debugging process.

  • Deploy nodes — Deploy tasks to the production environment so they appear in Operation Center. See Publish tasks.

  • Manage nodes — Perform bulk operations such as deploying, undeploying, and updating scheduling properties across multiple nodes. See Batch operations.

  • Apply process controls — Enforce code review, smoke testing, and custom review logic to ensure development accuracy. See Development process control.

Operation Center

Operation Center is the production O&M platform where you monitor scheduled tasks and resolve issues. Key capabilities include intelligent diagnostics, task reruns, and baseline monitoring to ensure critical tasks complete on time. See Perform basic O&M operations on scheduled tasks.

Data Quality

Data Quality monitors your data throughout the development lifecycle by attaching quality rules to scheduling tasks. This surfaces data issues early and prevents them from affecting downstream systems. See Data Quality overview.

Data governance

After you register an EMR cluster, DataWorks automatically collects metadata from the cluster. The following modules help you manage and govern that data:

| Module | Description | Reference |
| --- | --- | --- |
| Data Map | Enterprise metadata management platform for discovering, searching, and understanding data assets. | Data Map overview |
| Security Center / Data Security Guard / Approval Center | End-to-end data security platform covering data classification, sensitive data identification, authorization management, data masking, access auditing, and risk response. | Security Center overview / Data Security Guard overview / Approval Center overview |
| Data Governance Center | Automatically identifies governance issues across multiple dimensions based on best-practice rules, and provides both preventive and corrective governance actions. | Data Governance Center overview |

For an overview of detected issues and available actions, see Data Governance Center overview.

Data analysis and services

| Module | Description | Reference |
| --- | --- | --- |
| DataAnalysis | Run SQL queries online, build chart cards from query results, and generate data reports for daily reporting. | DataAnalysis overview |
| DataService Studio | Manage and expose API services for internal and external systems in a centralized service hub. | DataService Studio overview |

Open Platform

Integrate your application systems with DataWorks programmatically using the following capabilities:

| Module | Description | Reference |
| --- | --- | --- |
| OpenAPI | Call DataWorks API operations to automate big data workflows, reduce manual operations, and minimize data risks. | OpenAPI |
| OpenEvent | Subscribe to DataWorks change events to detect and respond to workspace changes in real time. | OpenEvent overview |
| Extensions | Register local programs as extensions to manage extension point events and customize DataWorks processes. | Extensions overview |

EMR cluster configuration suggestions

Tune the following EMR services before running production workloads in DataWorks.

Kyuubi

In the production environment, set:

  • kyuubi_java_opts: 10g or larger

  • kyuubi_beeline_opts: 2g or larger
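
To verify these minimums before going to production, a small check such as the following may help. It is a sketch under one assumption: that each setting holds a JVM options string whose maximum heap is given by a `-Xmx` flag with a unit suffix (for example `-Xmx10g`). The helper names `max_heap_gib` and `undersized_settings` are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Conversion factors from a JVM size suffix to GiB.
_UNIT_TO_GIB = {"k": 1 / (1024 ** 2), "m": 1 / 1024, "g": 1.0, "t": 1024.0}

# Minimums from the list above, in GiB.
MINIMUM_GIB = {"kyuubi_java_opts": 10, "kyuubi_beeline_opts": 2}


def max_heap_gib(jvm_opts: str) -> Optional[float]:
    """Return the -Xmx value in GiB, or None if no -Xmx flag with a unit is found."""
    match = re.search(r"-Xmx(\d+)([kKmMgGtT])", jvm_opts)
    if not match:
        return None
    return int(match.group(1)) * _UNIT_TO_GIB[match.group(2).lower()]


def undersized_settings(config: Dict[str, str]) -> List[str]:
    """List the Kyuubi settings whose heap is missing or below the minimum."""
    problems = []
    for key, minimum in MINIMUM_GIB.items():
        heap = max_heap_gib(config.get(key, ""))
        if heap is None or heap < minimum:
            problems.append(key)
    return problems
```

If your cluster stores these values in a different form, adapt the parsing step and keep the comparison against the minimums.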

Spark

The default memory allocation for Spark is small. Adjust the following parameters based on your cluster scale:

  • spark.driver.memory

  • spark.driver.memoryOverhead

  • spark.executor.memory

Pass memory settings via the spark-submit CLI or configure them in the cluster. For full configuration options, see Spark memory management.
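
As an illustration of the spark-submit route, the sketch below assembles a command line that sets the three parameters listed above. The memory values and the application file name are placeholders, not recommendations, and the helper name `spark_submit_command` is hypothetical. `--driver-memory` and `--executor-memory` are the spark-submit options that correspond to `spark.driver.memory` and `spark.executor.memory`.

```python
import shlex
from typing import List


def spark_submit_command(app: str, driver_memory: str, driver_overhead: str,
                         executor_memory: str) -> List[str]:
    """Build a spark-submit argument list with explicit memory settings."""
    return [
        "spark-submit",
        "--driver-memory", driver_memory,        # spark.driver.memory
        "--executor-memory", executor_memory,    # spark.executor.memory
        "--conf", f"spark.driver.memoryOverhead={driver_overhead}",
        app,
    ]


# Placeholder values; size them to your own cluster.
command = spark_submit_command("app.py", "4g", "1g", "8g")
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems when it is later passed to a process launcher.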

Important

Only EMR Hive, EMR Spark, and EMR Spark SQL nodes generate data lineage. Whether a node produces table-level lineage only or both table-level and field-level lineage depends on the cluster version. See Data lineage in the Limitations section.

HDFS

Adjust the following parameters based on your cluster scale:

  • hadoop_namenode_heapsize

  • hadoop_datanode_heapsize

  • hadoop_secondary_namenode_heapsize

  • hadoop_namenode_opts