
DataWorks: Data Studio: Associate an EMR computing resource

Last Updated: Mar 26, 2026

Before you can run E-MapReduce (EMR) tasks or synchronize data in DataWorks, you must associate your EMR cluster with a DataWorks workspace as a computing resource. Without this association, Data Studio cannot dispatch tasks to the cluster.

Prerequisites

Before you begin, ensure that you have:

- An E-MapReduce (EMR) cluster in a supported region.
- A DataWorks workspace.
- A resource group that has network connectivity to the EMR cluster.

Limitations

Supported regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia).

Supported Hadoop cluster versions (legacy data lake): EMR-3.26.3, EMR-3.27.2, EMR-3.29.0, EMR-3.32.0, EMR-3.35.0, EMR-3.38.2, EMR-3.38.3, EMR-4.3.0, EMR-4.4.1, EMR-4.5.0, EMR-4.5.1, EMR-4.6.0, EMR-4.8.0, EMR-4.9.0, EMR-5.2.1, EMR-5.4.3, EMR-5.6.0.

Kerberos authentication: If the EMR cluster has Kerberos authentication enabled, add an inbound rule to its security group before proceeding:

  1. In the EMR cluster's Basic Information section, click the icon next to Cluster Security Group to open the Security Group Details tab.

  2. On the Rules tab, click Inbound, then Add Rule.

  3. Set Protocol to Custom UDP, set Port Range to the KDC port specified in /etc/krb5.conf on the EMR cluster, and set Source to the vSwitch CIDR block of the associated resource group.
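To confirm which UDP port to open, read the KDC port from /etc/krb5.conf on the EMR cluster. The sketch below stages a hypothetical sample file so it runs anywhere; on a real cluster master node, point grep at /etc/krb5.conf itself:

```shell
# Hypothetical krb5.conf excerpt; on an EMR node, read /etc/krb5.conf instead.
cat > /tmp/krb5.conf.sample <<'EOF'
[realms]
 EMR.EXAMPLE.COM = {
  kdc = emr-header-1.cluster:88
  admin_server = emr-header-1.cluster:749
 }
EOF

# Extract the port: take the first "kdc =" line and keep the digits after the colon.
kdc_port=$(grep -m1 'kdc =' /tmp/krb5.conf.sample | sed 's/.*://')
echo "KDC port: ${kdc_port}"
```

Use the printed port as the Port Range value in the inbound security group rule.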

EMR-HOOK for DataLake and custom clusters: To display real-time metadata, audit logs, and data lineages in DataWorks—and to run EMR governance tasks—configure EMR-HOOK on the cluster. Only EMR Hive and EMR Spark SQL support this configuration.

| Service | How to configure | Reinitialize resource group? |
| --- | --- | --- |
| EMR Hive | Configure EMR-HOOK for Hive in the EMR console | No |
| EMR Spark SQL (via EMR console) | Configure EMR-HOOK for Spark SQL in the EMR console | Yes. Reinitialize the resource group after configuration |
| EMR Spark SQL (via SPARK parameters) | Set SPARK property parameters when configuring the computing resource | No |

Step 1: Open the Computing Resource page

  1. Log on to the DataWorks console. Switch to the destination region, then in the left navigation pane choose More > Management Center. Select your workspace and click Go To Management Center.

  2. In the left navigation pane, click Computing Resource.

Step 2: Associate the EMR computing resource

  1. Click Associate Computing Resource.

  2. On the Associate Computing Resource page, set the computing resource type to EMR. The Associate EMR Computing Resource configuration page opens.

  3. Configure the parameters:

    | Parameter | Description | Suggested value |
    | --- | --- | --- |
    | Alibaba Cloud account to which cluster belongs | Whether the EMR cluster belongs to the current Alibaba Cloud account or a different one. If you select Another Alibaba Cloud Account, follow Scenario: Register a cross-account EMR cluster to grant the required permissions first. | Current Alibaba Cloud Account for most setups |
    | Cluster type | The type of EMR cluster to associate. | Select the type that matches your existing cluster |
    | Cluster | The specific EMR cluster to associate. | |
    | Default access identity | The cluster account used to run tasks. See the table below for details. | hadoop for a quick start |
    | Pass Proxy User Information | Whether to pass the task executor's identity to the cluster for fine-grained data access control. See the table below for details. | Do Not Pass unless LDAP or Kerberos authentication is enabled |
    | Configuration file | Required when Cluster type is HADOOP. Export the file from the EMR console (see Export and import service configurations), or log on to the EMR cluster and copy the files from these paths: /etc/ecm/hadoop-conf/core-site.xml, /etc/ecm/hadoop-conf/hdfs-site.xml, /etc/ecm/hadoop-conf/mapred-site.xml, /etc/ecm/hadoop-conf/yarn-site.xml, /etc/ecm/hive-conf/hive-site.xml, /etc/ecm/spark-conf/spark-defaults.conf, /etc/ecm/spark-conf/spark-env.sh | |
    | Computing resource instance name | A custom name for the computing resource. Tasks use this name to select the resource they run on. | Use a descriptive name, for example, emr-datalake-prod |

    Default access identity options:

    | Environment | Available identities | Notes |
    | --- | --- | --- |
    | Development | hadoop cluster account, or the cluster account mapped to the task executor | |
    | Production | hadoop cluster account, or the cluster account mapped to the task owner, the Alibaba Cloud account, or a RAM user | When using a mapped identity, configure identity mapping between DataWorks tenant members and EMR cluster accounts. Without a mapping: a RAM user falls back to the EMR system account with the same name (the task fails if LDAP or Kerberos is enabled); an Alibaba Cloud account causes the task to fail immediately. |

    Proxy user information options:

    | Setting | Behavior |
    | --- | --- |
    | Pass | Data access is verified based on the proxy user's identity. In DataStudio and DataAnalysis, the task executor's Alibaba Cloud account name is passed as the proxy user. In Operation Center, the default access identity configured during registration is passed. For EMR Kyuubi tasks, the identity is passed via hive.server2.proxy.user. For EMR Spark tasks and non-JDBC EMR Spark SQL tasks, it is passed via --proxy-user. |
    | Do Not Pass | Data access is verified based on the identity method configured during cluster registration. |
  4. Click OK.
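If your Cluster type is HADOOP, the Configuration file parameter expects the seven service configuration files listed above. The sketch below bundles them into one archive; it stages placeholder files under /tmp so it is self-contained, while on a real cluster the same relative paths live under /etc/ecm:

```shell
# Staging directory stands in for /etc/ecm; the file list matches the paths above.
root=/tmp/emr-conf-demo
rm -rf "$root"
files="hadoop-conf/core-site.xml hadoop-conf/hdfs-site.xml hadoop-conf/mapred-site.xml hadoop-conf/yarn-site.xml hive-conf/hive-site.xml spark-conf/spark-defaults.conf spark-conf/spark-env.sh"

for f in $files; do
  mkdir -p "$root/$(dirname "$f")"
  printf '<configuration/>\n' > "$root/$f"   # placeholder content for the demo
done

# Bundle the files for upload on the Associate EMR Computing Resource page.
tar -czf /tmp/emr-conf.tar.gz -C "$root" $files
count=$(tar -tzf /tmp/emr-conf.tar.gz | wc -l)
echo "archived $count files"
```

On a real cluster, skip the staging loop and run the tar command with `-C /etc/ecm` and the same relative paths.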

Step 3: Initialize the resource group

Initialize the resource group when you register a cluster for the first time, change cluster service configurations, or upgrade component versions (for example, after modifying core-site.xml). This ensures the resource group can connect to the EMR cluster after you configure network connectivity.

  1. On the Computing Resource page, find the EMR computing resource and click Initialize Resource Group in the upper-right corner.

  2. Find the resource group and click Initialize. After initialization completes, click OK.

(Optional) Configure a YARN resource queue

On the Computing Resource page, find the associated EMR cluster. On the YARN Resource Queue tab, click Edit YARN Resource Queue to set a global YARN resource queue for tasks across different modules.

(Optional) Set Spark-related parameters

Set global Spark property parameters for tasks across different modules.

  1. On the Computing Resource page, find the associated EMR cluster.

  2. On the Spark-related Parameters tab, click Edit Spark-related Parameter.

  3. Click Add, enter a Spark Property Name and its corresponding Spark Property Value to set the global Spark parameters.
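For illustration, a global configuration might include properties like the following. The property names are standard open source Spark settings; the values are hypothetical and should be sized to your cluster:

```
spark.executor.memory          4g
spark.executor.cores           2
spark.sql.shuffle.partitions   200
```

These apply to EMR tasks across modules unless an individual task overrides them.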

What's next