
DataWorks:Data Studio (legacy version): Associate an EMR computing resource

Last Updated: Mar 27, 2026

DataWorks lets you run Hive, MapReduce (MR), Presto, and Spark SQL tasks on E-MapReduce (EMR) clusters — including scheduling workflows and managing metadata. This topic describes how to register an EMR cluster with DataWorks, whether the cluster belongs to your current Alibaba Cloud account or a different one.

Supported cluster types

DataWorks supports registering the following cluster types:

Important

The following EMR Hadoop cluster (old data lake) versions are supported in DataWorks: EMR-3.26.3, EMR-3.27.2, EMR-3.29.0, EMR-3.32.0, EMR-3.35.0, EMR-3.38.2, EMR-3.38.3, EMR-4.3.0, EMR-4.4.1, EMR-4.5.0, EMR-4.5.1, EMR-4.6.0, EMR-4.8.0, EMR-4.9.0, EMR-5.2.1, EMR-5.4.3, and EMR-5.6.0. Hadoop clusters (old data lake) are no longer recommended. Migrate to DataLake clusters as soon as possible. For more information, see Migrate a Hadoop cluster to a DataLake cluster.

Note

If the type of cluster you are using cannot be registered in DataWorks, submit a ticket to contact technical support.

Limitations

  • Permissions: Only the following identities can register an EMR cluster. For more information, see Grant permissions to a RAM user.

    • An Alibaba Cloud account

    • A RAM user or RAM role with the DataWorks Workspace Administrator role and the AliyunEMRFullAccess policy

    • A RAM user or RAM role with the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies

  • Regions: EMR Serverless Spark is available only in China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), and US (Virginia).

  • Task types: DataWorks does not support EMR Flink tasks.

  • Task execution: Use serverless resource groups (recommended) or exclusive resource groups for scheduling (old version) to run EMR tasks.

  • Data governance:

    • Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes support data lineage generation. For clusters running EMR-5.9.1, EMR-3.43.1, or a later version, all of these node types support table-level and field-level lineage.

      Note

      For Spark nodes, table-level and field-level lineage requires cluster version 5.8.0, 3.42.0, or later. For earlier versions, only Spark 2.x supports table-level lineage.

    • To manage metadata for a DataLake or custom cluster in DataWorks, configure EMR-HOOK on the cluster. Without EMR-HOOK, metadata is not displayed in real time, audit logs are not generated, and data lineage is unavailable — making EMR-related governance tasks impossible. EMR-HOOK is currently supported only for EMR Hive and EMR Spark SQL services. For more information, see Configure EMR-HOOK for Hive and Configure EMR-HOOK for Spark SQL.

  • Kerberos authentication: For EMR clusters with Kerberos authentication enabled, add an inbound rule to the security group to allow UDP access from the vSwitch CIDR block associated with the resource group.

    Note

    On the Basic Information tab of the EMR cluster, click the icon for Cluster Security Group to open the Security Group Details tab. Click Inbound in the Rule section, then select Add Rule. Set Protocol Type to Custom UDP. For Port Range, check /etc/krb5.conf in the EMR cluster for the KDC port. Set Destination to the vSwitch CIDR block associated with the resource group.
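As a sketch of the port lookup, the KDC port can be read out of krb5.conf with standard shell tools. The sample config below is illustrative; on the cluster, read the real /etc/krb5.conf instead.

```shell
# Illustrative only: a sample krb5.conf stands in for the real
# /etc/krb5.conf on the EMR cluster.
cat > /tmp/krb5.conf.sample <<'EOF'
[realms]
 EMR.EXAMPLE.COM = {
  kdc = emr-header-1.cluster:88
 }
EOF

# Extract the port after the colon in the kdc entry. Kerberos defaults
# to port 88 when no port is listed.
kdc_port=$(grep -E 'kdc *=' /tmp/krb5.conf.sample | sed -E 's/.*:([0-9]+).*/\1/')
echo "KDC port: ${kdc_port}"
```

Use the extracted port as the Port Range of the inbound UDP rule.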

Usage notes

  • Standard mode workspaces require two clusters: To isolate development and production environments, register two separate EMR clusters. Store their metadata using one of the following methods:

    • Method 1 (recommended for data lake solutions): Store metadata in two separate data catalogs in Data Lake Formation (DLF). For more information, see Switch the metastore type.

    • Method 2: Store metadata in two separate databases in Relational Database Service (RDS). For more information, see Configure a self-managed RDS database.

  • Cross-workspace registration: One EMR cluster can be registered to multiple workspaces within the same Alibaba Cloud account, but not to workspaces across different Alibaba Cloud accounts.

  • Network connectivity: If the DataWorks resource group cannot reach the EMR cluster — even when they share the same virtual private cloud (VPC) and vSwitch — check the cluster's security group rules. Add an inbound rule for the vSwitch CIDR block covering the ports of common open source components. For more information, see Manage EMR cluster security groups.
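A quick reachability check can be sketched as follows. The host name is a placeholder, and the ports are the usual open source defaults (HiveServer2 10000, YARN ResourceManager 8088, HDFS NameNode 8020); substitute the values your cluster actually uses.

```shell
# Sketch only: verify that common EMR component ports are reachable.
check_port() {
  # Succeeds if a TCP connection to host $1, port $2 opens within 3 seconds.
  # Uses bash's /dev/tcp so no extra tools are needed; "nc -vz host port"
  # is an equivalent check.
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

EMR_MASTER="master-1.example.cluster"   # placeholder: your cluster's master node
for port in 10000 8088 8020; do         # HiveServer2, YARN RM, HDFS NameNode defaults
  if check_port "$EMR_MASTER" "$port"; then
    echo "port $port reachable"
  else
    echo "port $port unreachable"
  fi
done
```

If a port is unreachable, add the corresponding inbound rule for the vSwitch CIDR block to the cluster's security group.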

Prerequisites

Before you begin, ensure that you have:

  • An EMR cluster already created. See EMR cluster configuration recommendations for guidance on selecting component settings.

  • The required permissions: an Alibaba Cloud account, or a RAM user/role with the AliyunEMRFullAccess policy plus either the Workspace Administrator role or the AliyunDataWorksFullAccess policy

  • A compatible resource group: a serverless resource group (recommended) or an exclusive resource group for scheduling (old version)

  • (Standard mode workspaces only) Two separate EMR clusters — one for development, one for production

Register an EMR cluster

Step 1: Open the cluster registration page

  1. Go to the Management Center page. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center. Select the target workspace from the drop-down list and click Go to Management Center.

  2. In the left navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster. Select E-MapReduce for Cluster Type To Register. The Register EMR Cluster page appears.

Step 2: Configure cluster information

On the Register EMR Cluster page, configure the cluster parameters.

Note

For standard mode workspaces, configure cluster information separately for the development and production environments. For more information, see Differences between workspace modes.

Start by setting Display Name of Cluster — the name the cluster uses in DataWorks. The name must be unique.

Then, select an option for Alibaba Cloud Account To Which Cluster Belongs based on where your EMR cluster resides.

Current Alibaba Cloud account

Select this option if the EMR cluster and your DataWorks workspace belong to the same Alibaba Cloud account.

  • Cluster Type (required): The type of EMR cluster to register. For supported types, see Limitations.

  • Cluster (required): The EMR cluster to register. If you select EMR Serverless Spark, follow the on-screen instructions to select the E-MapReduce Workspace, default engine version, default resource queue, and other settings.

  • Default Access Identity (required): The identity used to access the EMR cluster in this workspace. In the development environment, use the cluster account hadoop or the account mapped to the task executor. In the production environment, use hadoop, or the account mapped to the task owner, Alibaba Cloud account, or RAM user. If no mapped account is configured, DataWorks falls back as follows: if a RAM user runs the task, DataWorks uses the EMR cluster system account with the same name, and the task fails if LDAP or Kerberos is enabled; if an Alibaba Cloud account runs the task, the task reports an error. For more information, see Configure cluster identity mappings.

  • Pass Proxy User Information (required): Whether to pass proxy user information when running tasks. If you select Pass, permissions are verified based on the proxy user: in DataStudio and DataAnalysis, the task executor's account name is passed dynamically; in Operation Center, the default access identity's account name is passed. If you select Do Not Pass, permissions are based on the authentication method configured during registration. For EMR Kyuubi tasks, proxy user information is passed by using the hive.server2.proxy.user property. For EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks, it is passed by using the --proxy-user option.

  • Configuration Files (required if Cluster Type is HADOOP): Export configuration files from the EMR console (see Export and import service configurations), or retrieve them by logging on to the EMR cluster and copying them from these paths: /etc/ecm/hadoop-conf/core-site.xml, /etc/ecm/hadoop-conf/hdfs-site.xml, /etc/ecm/hadoop-conf/mapred-site.xml, /etc/ecm/hadoop-conf/yarn-site.xml, /etc/ecm/hive-conf/hive-site.xml, /etc/ecm/spark-conf/spark-defaults.conf, /etc/ecm/spark-conf/spark-env.sh. After exporting, rename the files as required by the upload UI.
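Gathering the configuration files above can be sketched as a small shell helper. The paths are the ones listed for Hadoop clusters; the helper and its name are illustrative, and you still rename the copies as the upload UI requires.

```shell
# Sketch: copy a cluster's service configuration files into one upload
# directory, keeping their base names. Missing files are skipped silently.
collect_conf() {
  dest="$1"
  shift
  mkdir -p "$dest"
  for f in "$@"; do
    if [ -f "$f" ]; then
      cp "$f" "$dest/$(basename "$f")"
    fi
  done
}

# Example (run on an EMR master node; paths are those listed above):
# collect_conf ./emr-conf \
#   /etc/ecm/hadoop-conf/core-site.xml \
#   /etc/ecm/hadoop-conf/hdfs-site.xml \
#   /etc/ecm/hive-conf/hive-site.xml
```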

Another Alibaba Cloud account

Select this option if the EMR cluster belongs to a different Alibaba Cloud account. Cross-account registration supports only EMR on ECS: DataLake, EMR on ECS: Hadoop, and EMR on ECS: Custom cluster types. EMR Serverless Spark cannot be registered across accounts.

  • UID of Alibaba Cloud Account (required): The UID of the Alibaba Cloud account that owns the EMR cluster.

  • RAM Role (required): The RAM role used to access the EMR cluster. The role must be created in the other account and granted permissions to access the DataWorks service in your current account. For setup details, see Scenario: Register a cross-account EMR cluster.

  • EMR Cluster Type (required): The type of EMR cluster to register. Currently, only EMR on ECS: DataLake cluster, EMR on ECS: Hadoop cluster, and EMR on ECS: Custom cluster are supported for cross-account registration.

  • EMR Cluster (required): The EMR cluster from the other account to register to DataWorks.

  • Configuration Files (required): Configure the files as prompted on the UI. For details on obtaining configuration files, see Export and import service configurations. Alternatively, log on to the EMR cluster and retrieve the files from: /etc/ecm/hadoop-conf/core-site.xml, /etc/ecm/hadoop-conf/hdfs-site.xml, /etc/ecm/hadoop-conf/mapred-site.xml, /etc/ecm/hadoop-conf/yarn-site.xml, /etc/ecm/hive-conf/hive-site.xml, /etc/ecm/spark-conf/spark-defaults.conf, /etc/ecm/spark-conf/spark-env.sh. After exporting, rename the files as required by the upload UI.

  • Default Access Identity (required): The identity used to access the EMR cluster in this workspace. In the development environment, use hadoop or the account mapped to the task owner. In the production environment, use hadoop, or the account mapped to the task owner, Alibaba Cloud account, or RAM user. If no mapping is configured, DataWorks falls back as follows: if a RAM user runs the task, DataWorks uses the EMR cluster system account with the same name, and the task fails if LDAP or Kerberos is enabled; if an Alibaba Cloud account runs the task, the task reports an error.

  • Pass Proxy User Information (required): Whether to pass proxy user information when running tasks. If you select Pass, permissions are verified based on the proxy user: in DataStudio and DataAnalysis, the task executor's account name is passed dynamically; in Operation Center, the default access identity's account name is passed. If you select Do Not Pass, permissions are based on the authentication method configured during registration. For EMR Kyuubi tasks, proxy user information is passed by using the hive.server2.proxy.user property. For EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks, it is passed by using the --proxy-user option.
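The shape of the proxy-user pass-through described above can be sketched as follows. The account name, host, and job details are placeholders, and DataWorks composes the actual command internally; this only illustrates where the value lands for each task type.

```shell
# Illustrative only: where proxy user information is attached per task type.
PROXY_USER="dw_executor"   # placeholder for the dynamically passed account name

# EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks: spark-submit's
# --proxy-user option.
spark_cmd="spark-submit --proxy-user ${PROXY_USER} --class com.example.Job job.jar"

# EMR Kyuubi tasks: the hive.server2.proxy.user property in the JDBC URL.
kyuubi_url="jdbc:hive2://kyuubi-host:10009/default;hive.server2.proxy.user=${PROXY_USER}"

echo "$spark_cmd"
echo "$kyuubi_url"
```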

Step 3: Initialize a resource group

Initialize the resource group when you first register a cluster, change cluster service configurations (such as modifying core-site.xml), or upgrade a component version. This step ensures the resource group can connect to EMR and run tasks in the current environment.

  1. On the Cluster Management page, find the registered EMR cluster tab and click Initialize Resource Group in the upper-right corner.

  2. Locate the target resource group and click Initialize.

    Both serverless resource groups and exclusive resource groups for scheduling (old version) can be initialized.
  3. Wait 1 to 2 minutes for initialization to complete, then click OK.

Important
  • If initialization fails, use the connectivity diagnosis tool to troubleshoot the cause.

  • Initialization may cause running tasks to fail. Unless immediate reinitialization is necessary (for example, to prevent widespread task failures after a configuration change), initialize the resource group during off-peak hours.

What's next

After registering your EMR cluster, complete the following setup steps: