DataWorks lets you run Hive, MapReduce (MR), Presto, and Spark SQL tasks on E-MapReduce (EMR) clusters — including scheduling workflows and managing metadata. This topic describes how to register an EMR cluster with DataWorks, whether the cluster belongs to your current Alibaba Cloud account or a different one.
Supported cluster types
Limitations
- Permissions: Only the following identities can register an EMR cluster. For more information, see Grant permissions to a RAM user.
  - An Alibaba Cloud account
  - A RAM user or RAM role with the DataWorks Workspace Administrator role and the AliyunEMRFullAccess policy
  - A RAM user or RAM role with the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies
- Regions: EMR Serverless Spark is available only in China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), and US (Virginia).
- Task types: DataWorks does not support EMR Flink tasks.
- Task execution: Use serverless resource groups (recommended) or exclusive resource groups for scheduling (old version) to run EMR tasks.
- Data governance:
  - Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes support data lineage generation. For clusters running version 5.9.1, 3.43.1, or later, all these node types support table-level and field-level lineage.
    Note: For Spark nodes, table-level and field-level lineage requires cluster version 5.8.0, 3.42.0, or later. For earlier versions, only Spark 2.x supports table-level lineage.
  - To manage metadata for a DataLake or custom cluster in DataWorks, configure EMR-HOOK on the cluster. Without EMR-HOOK, metadata is not displayed in real time, audit logs are not generated, and data lineage is unavailable, which makes EMR-related governance tasks impossible. EMR-HOOK is currently supported only for the EMR Hive and EMR Spark SQL services. For more information, see Configure EMR-HOOK for Hive and Configure EMR-HOOK for Spark SQL.
- Kerberos authentication: For EMR clusters with Kerberos authentication enabled, add an inbound rule to the security group to allow UDP access from the vSwitch CIDR block associated with the resource group.
  Note: On the Basic Information tab of the EMR cluster, click the icon for Cluster Security Group to open the Security Group Details tab. Click Inbound in the Rule section, then select Add Rule. Set Protocol Type to Custom UDP. For Port Range, check /etc/krb5.conf in the EMR cluster for the KDC port. Set Destination to the vSwitch CIDR block associated with the resource group.
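To locate the KDC port for the Port Range field, a quick check of the cluster's krb5.conf can be sketched as follows. The file path comes from this topic; the `KRB5_CONF` override variable is only for illustration:

```shell
# Extract the KDC address (host:port) from the cluster's krb5.conf.
# The port after the colon (commonly 88) is the UDP port to allow
# in the security group inbound rule.
KRB5_CONF="${KRB5_CONF:-/etc/krb5.conf}"
kdc_addr=$(grep -E '^[[:space:]]*kdc[[:space:]]*=' "$KRB5_CONF" 2>/dev/null \
  | head -n1 | awk -F= '{gsub(/[[:space:]]/, "", $2); print $2}')
echo "KDC address: ${kdc_addr:-not found}"
```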
Usage notes
- Standard mode workspaces require two clusters: To isolate development and production environments, register two separate EMR clusters. Store their metadata using one of the following methods:
  - Method 1 (recommended for data lake solutions): Store metadata in two separate data catalogs in Data Lake Formation (DLF). For more information, see Switch the metastore type.
  - Method 2: Store metadata in two separate databases in Relational Database Service (RDS). For more information, see Configure a self-managed RDS database.
- Cross-workspace registration: One EMR cluster can be registered to multiple workspaces within the same Alibaba Cloud account, but not to workspaces across different Alibaba Cloud accounts.
- Network connectivity: If the DataWorks resource group cannot reach the EMR cluster, even when they share the same virtual private cloud (VPC) and vSwitch, check the cluster's security group rules. Add an inbound rule for the vSwitch CIDR block that covers the ports of common open source components. For more information, see Manage EMR cluster security groups.
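As a rough connectivity check, you can probe the cluster's component ports from a machine in the resource group's network. The hostname and port list below are placeholders (typical open source defaults for HDFS NameNode, YARN ResourceManager, and HiveServer2), not values taken from this topic:

```shell
# Probe common EMR component ports from the resource group's network.
# EMR_MASTER and the port list are placeholders; substitute your own.
EMR_MASTER="${EMR_MASTER:-emr-master-host}"
for port in 8020 8088 10000; do
  if timeout 3 bash -c "exec 3<>/dev/tcp/${EMR_MASTER}/${port}" 2>/dev/null; then
    echo "port ${port}: reachable"
  else
    echo "port ${port}: NOT reachable"
  fi
done
```

If a port is unreachable, review the security group inbound rules described above before suspecting the cluster itself.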
Prerequisites
Before you begin, ensure that you have:
- An EMR cluster already created. See EMR cluster configuration recommendations for guidance on selecting component settings.
- The required permissions: an Alibaba Cloud account, or a RAM user or RAM role with the AliyunEMRFullAccess policy plus either the Workspace Administrator role or the AliyunDataWorksFullAccess policy.
- A compatible resource group: a serverless resource group (recommended) or an exclusive resource group for scheduling (old version).
- (Standard mode workspaces only) Two separate EMR clusters: one for development, one for production.
Register an EMR cluster
Step 1: Open the cluster registration page
1. Go to the SettingCenter page: Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center. Select the target workspace from the drop-down list and click Go to Management Center.
2. In the left-side navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster and select E-MapReduce for Cluster Type To Register. The Register EMR Cluster page appears.
Step 2: Configure cluster information
On the Register EMR Cluster page, configure the cluster parameters.
For standard mode workspaces, configure cluster information separately for the development and production environments. For more information, see Differences between workspace modes.
Start by setting Display Name of Cluster — the name the cluster uses in DataWorks. The name must be unique.
Then, select an option for Alibaba Cloud Account To Which Cluster Belongs based on where your EMR cluster resides.
Current Alibaba Cloud account
Select this option if the EMR cluster and your DataWorks workspace belong to the same Alibaba Cloud account.
| Parameter | Description | Required |
|---|---|---|
| Cluster Type | The type of EMR cluster to register. For supported types, see Limitations. | Yes |
| Cluster | The EMR cluster to register. If you select EMR Serverless Spark, follow the on-screen instructions to select the E-MapReduce Workspace, default engine version, default resource queue, and other settings. | Yes |
| Default Access Identity | The identity used to access the EMR cluster in this workspace. Development environment: use the cluster account hadoop or the account mapped to the task executor. Production environment: use hadoop, or the account mapped to the task owner, Alibaba Cloud account, or RAM user. If the mapped account is not configured, DataWorks falls back as follows: if a RAM user runs the task, DataWorks uses an EMR cluster system account with the same name, and the task fails if LDAP or Kerberos authentication is enabled; if an Alibaba Cloud account runs the task, the task reports an error. For more information, see Configure cluster identity mappings. | Yes |
| Pass Proxy User Information | Whether to pass proxy user information when running tasks. Pass: permissions are verified based on the proxy user. In DataStudio and DataAnalysis, the task executor's account name is passed dynamically. In Operation Center, the default access identity's account name is passed. Do Not Pass: permissions are based on the authentication method configured during registration. For EMR Kyuubi tasks, proxy user information is passed using hive.server2.proxy.user. For EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks, it is passed using --proxy-user. | Yes |
| Configuration Files | Required if Cluster Type is HADOOP. Export configuration files from the EMR console (see Export and import service configurations), or retrieve them directly by logging on to the EMR cluster from these paths: /etc/ecm/hadoop-conf/core-site.xml, /etc/ecm/hadoop-conf/hdfs-site.xml, /etc/ecm/hadoop-conf/mapred-site.xml, /etc/ecm/hadoop-conf/yarn-site.xml, /etc/ecm/hive-conf/hive-site.xml, /etc/ecm/spark-conf/spark-defaults.conf, /etc/ecm/spark-conf/spark-env.sh. After exporting, rename the files as required by the upload UI. | Conditional |
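As a sketch of gathering the configuration files by hand, the `/etc/ecm/` paths listed above can be collected in one pass on the EMR master node (the output directory name here is arbitrary):

```shell
# Copy the service configuration files listed above into one directory
# for renaming and upload. Files that do not exist are reported.
outdir="emr-conf"
mkdir -p "$outdir"
for f in /etc/ecm/hadoop-conf/core-site.xml \
         /etc/ecm/hadoop-conf/hdfs-site.xml \
         /etc/ecm/hadoop-conf/mapred-site.xml \
         /etc/ecm/hadoop-conf/yarn-site.xml \
         /etc/ecm/hive-conf/hive-site.xml \
         /etc/ecm/spark-conf/spark-defaults.conf \
         /etc/ecm/spark-conf/spark-env.sh; do
  if [ -f "$f" ]; then cp "$f" "$outdir/"; else echo "missing: $f"; fi
done
```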
Another Alibaba Cloud account
Select this option if the EMR cluster belongs to a different Alibaba Cloud account. Cross-account registration supports only EMR on ECS: DataLake, EMR on ECS: Hadoop, and EMR on ECS: Custom cluster types. EMR Serverless Spark cannot be registered across accounts.
| Parameter | Description | Required |
|---|---|---|
| UID of Alibaba Cloud Account | The UID of the Alibaba Cloud account that owns the EMR cluster. | Yes |
| RAM Role | The RAM role used to access the EMR cluster. The role must be created in the other account and granted permissions to access the DataWorks service in your current account. For setup details, see Scenario: Register a cross-account EMR cluster. | Yes |
| EMR Cluster Type | The type of EMR cluster to register. Currently, only EMR on ECS: DataLake cluster, EMR on ECS: Hadoop cluster, and EMR on ECS: Custom cluster are supported for cross-account registration. | Yes |
| EMR Cluster | The EMR cluster from the other account to register to DataWorks. | Yes |
| Configuration Files | Configure the files as prompted on the UI. For details on obtaining configuration files, see Export and import service configurations. Alternatively, log on to the EMR cluster and retrieve the files from: /etc/ecm/hadoop-conf/core-site.xml, /etc/ecm/hadoop-conf/hdfs-site.xml, /etc/ecm/hadoop-conf/mapred-site.xml, /etc/ecm/hadoop-conf/yarn-site.xml, /etc/ecm/hive-conf/hive-site.xml, /etc/ecm/spark-conf/spark-defaults.conf, /etc/ecm/spark-conf/spark-env.sh. After exporting, rename the files as required by the upload UI. | Yes |
| Default Access Identity | The identity used to access the EMR cluster in this workspace. Development environment: use hadoop or the account mapped to the task owner. Production environment: use hadoop, or the account mapped to the task owner, Alibaba Cloud account, or RAM user. Fallback behavior when no mapping is configured: if a RAM user runs the task, DataWorks uses an EMR cluster system account with the same name, and the task fails if LDAP or Kerberos authentication is enabled; if an Alibaba Cloud account runs the task, the task reports an error. | Yes |
| Pass Proxy User Information | Whether to pass proxy user information when running tasks. Pass: permissions are verified based on the proxy user. In DataStudio and DataAnalysis, the task executor's account name is passed dynamically. In Operation Center, the default access identity's account name is passed. Do Not Pass: permissions are based on the authentication method configured during registration. For EMR Kyuubi tasks, proxy user information is passed using hive.server2.proxy.user. For EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks, it is passed using --proxy-user. | Yes |
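For a concrete picture of the two mechanisms named above: a Kyuubi (HiveServer2-compatible) JDBC URL carries the proxy user as a session property, while Spark-style submissions receive it as a launcher flag. Both lines are illustrative only; the host, port, and account names are placeholders, and DataWorks passes these values for you:

```
# Kyuubi / HiveServer2-compatible JDBC URL with a proxy user property
jdbc:hive2://emr-master-host:10009/default;hive.server2.proxy.user=task_executor

# Spark launcher flag equivalent
--proxy-user task_executor
```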
Step 3: Initialize a resource group
Initialize the resource group when you first register a cluster, change cluster service configurations (such as modifying core-site.xml), or upgrade a component version. This step ensures the resource group can connect to EMR and run tasks in the current environment.
1. On the Cluster Management page, find the tab of the registered EMR cluster and click Initialize Resource Group in the upper-right corner.
2. Locate the target resource group and click Initialize. Both serverless resource groups and exclusive resource groups for scheduling (old version) can be initialized.
3. Wait 1 to 2 minutes for initialization to complete, then click OK.

Note the following:
- If initialization fails, use the connectivity diagnosis tool to troubleshoot the cause.
- Initialization may cause running tasks to fail. Unless immediate reinitialization is necessary (for example, to prevent widespread task failures after a configuration change), initialize the resource group during off-peak hours.
What's next
After registering your EMR cluster, complete the following setup steps:
- Data development: Follow the Data development process guide to configure component environments.
- Configure cluster identity mappings: If the default access identity is not the hadoop account, configure identity mappings to control which resources RAM users can access in DataWorks.
- Set global YARN resource queues: Specify the YARN queues used by each module, and control whether workspace-level settings override module-level configurations.
- Set global Spark parameters: Customize global Spark parameters based on the official Spark documentation, and control whether workspace-level parameters override module-level settings for matching parameter names.
- Set Kyuubi connection information: If you want to use a custom account and password to log on to Kyuubi and run tasks, configure the Kyuubi connection details.
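As an illustration of the global Spark parameters step above, parameter names follow the standard spark-defaults.conf key/value form from the official Spark configuration documentation; the values here are arbitrary examples, not recommendations:

```
spark.executor.memory          4g
spark.executor.cores           2
spark.sql.shuffle.partitions   200
```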