Before you can run E-MapReduce (EMR) tasks or synchronize data in DataWorks, you must associate your EMR cluster with a DataWorks workspace as a computing resource. Without this association, Data Studio cannot dispatch tasks to the cluster.
Prerequisites
Before you begin, ensure that you have:
- A DataWorks workspace with the Use Data Studio (New Version) option enabled, and your RAM user added to the workspace with the Workspace Administrator role.
  Note: Workspaces that do not enable Use Data Studio (New Version) should use Cluster Management instead. See DataStudio (legacy version): Associate an EMR computing resource.
- An EMR cluster of one of the supported types.
  Warning: Hadoop clusters (legacy data lake) are deprecated. Migrate to DataLake clusters as soon as possible. See Migrate Hadoop clusters to DataLake clusters.
- A resource group associated with the workspace, with network connectivity established:

  | Resource group type | Connectivity requirement |
  |---|---|
  | Serverless resource group | The EMR cluster can reach the Serverless resource group. |
  | Legacy exclusive resource group | The EMR cluster can reach the exclusive resource group for scheduling. |
- The O&M and Workspace Administrator roles, or the `AliyunDataWorksFullAccess` permission, if you operate as a RAM user or RAM role. Alibaba Cloud accounts require no additional permissions. See Grant space administrator permissions to a user.
Limitations
Supported regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia).
Supported Hadoop cluster versions (legacy data lake): EMR-3.26.3, EMR-3.27.2, EMR-3.29.0, EMR-3.32.0, EMR-3.35.0, EMR-3.38.2, EMR-3.38.3, EMR-4.3.0, EMR-4.4.1, EMR-4.5.0, EMR-4.5.1, EMR-4.6.0, EMR-4.8.0, EMR-4.9.0, EMR-5.2.1, EMR-5.4.3, EMR-5.6.0.
Kerberos authentication: If the EMR cluster has Kerberos authentication enabled, add an inbound rule to its security group before proceeding:
1. In the EMR cluster's Basic Information section, click the icon next to Cluster Security Group to open the Security Group Details tab.
2. On the Rules tab, click Inbound, then Add Rule.
3. Set Protocol to Custom UDP, set Port Range to the KDC port specified in `/etc/krb5.conf` on the EMR cluster, and set Source to the vSwitch CIDR block of the associated resource group.
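As a sketch, the KDC port can be read out of the cluster's `krb5.conf`. The snippet below parses a hypothetical sample file so it is self-contained; on the EMR cluster you would read `/etc/krb5.conf` directly.

```shell
# Sketch: find the KDC port to use in the security group rule.
# A sample krb5.conf is created here for illustration; on the EMR
# cluster, point the grep at /etc/krb5.conf instead.
cat > /tmp/krb5.conf.sample <<'EOF'
[realms]
 EMR.EXAMPLE.COM = {
  kdc = emr-header-1:88
  admin_server = emr-header-1:749
 }
EOF
# Take the first "kdc = host:port" entry and keep the trailing port number.
kdc_port=$(grep -E '^[[:space:]]*kdc[[:space:]]*=' /tmp/krb5.conf.sample \
  | head -n 1 | sed -E 's/.*:([0-9]+).*/\1/')
echo "KDC port: $kdc_port"
```

If the `kdc` entry lists no explicit port, Kerberos defaults to UDP port 88.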
EMR-HOOK for DataLake and custom clusters: To display real-time metadata, audit logs, and data lineage in DataWorks, and to run EMR governance tasks, configure EMR-HOOK on the cluster. Only EMR Hive and EMR Spark SQL support this configuration.
| Service | How to configure | Reinitialize resource group? |
|---|---|---|
| EMR Hive | Configure EMR-HOOK for Hive in the EMR console | No |
| EMR Spark SQL (via EMR console) | Configure EMR-HOOK for Spark SQL in the EMR console | Yes; reinitialize after configuration |
| EMR Spark SQL (via SPARK parameters) | Set SPARK property parameters when configuring the computing resource | No |
Step 1: Open the Computing Resource page
1. Log on to the DataWorks console. Switch to the destination region, then in the left navigation pane choose More > Management Center. Select your workspace and click Go To Management Center.
2. In the left navigation pane, click Computing Resource.
Step 2: Associate the EMR computing resource
1. Click Associate Computing Resource.
2. On the Associate Computing Resource page, set the computing resource type to EMR. The Associate EMR Computing Resource configuration page opens.
3. Configure the parameters:

   | Parameter | Description | Suggested value |
   |---|---|---|
   | Alibaba Cloud account to which cluster belongs | Whether the EMR cluster belongs to the current Alibaba Cloud account or a different one. If you select Another Alibaba Cloud Account, follow Scenario: Register a cross-account EMR cluster to grant the required permissions first. | Current Alibaba Cloud Account for most setups |
   | Cluster type | The type of EMR cluster to associate. | Select the type that matches your existing cluster |
   | Cluster | The specific EMR cluster to associate. | None |
   | Default access identity | The cluster account used to run tasks. See the table below for details. | `hadoop` for a quick start |
   | Pass Proxy User Information | Whether to pass the task executor's identity to the cluster for fine-grained data access control. See the table below for details. | Do Not Pass unless LDAP or Kerberos authentication is enabled |
   | Configuration file | Required when Cluster type is HADOOP. Export the file from the EMR console (see Export and import service configurations), or log on to the EMR cluster and copy the files from these paths: `/etc/ecm/hadoop-conf/core-site.xml`, `/etc/ecm/hadoop-conf/hdfs-site.xml`, `/etc/ecm/hadoop-conf/mapred-site.xml`, `/etc/ecm/hadoop-conf/yarn-site.xml`, `/etc/ecm/hive-conf/hive-site.xml`, `/etc/ecm/spark-conf/spark-defaults.conf`, `/etc/ecm/spark-conf/spark-env.sh`. | None |
   | Computing resource instance name | A custom name for the computing resource. Tasks use this name to select the resource they run on. | Use a descriptive name, for example, `emr-datalake-prod` |

   Default access identity options:
   | Environment | Available identities | Notes |
   |---|---|---|
   | Development | The `hadoop` cluster account, or the cluster account mapped to the task executor | None |
   | Production | The `hadoop` cluster account, or the cluster account mapped to the task owner, the Alibaba Cloud account, or a RAM user | When using a mapped identity, configure identity mapping between DataWorks tenant members and EMR cluster accounts. Without a mapping: a RAM user falls back to the EMR system account with the same name (the task fails if LDAP or Kerberos is enabled); an Alibaba Cloud account causes the task to fail immediately. |

   Proxy user information options:
   | Setting | Behavior |
   |---|---|
   | Pass | Data access is verified based on the proxy user's identity. In DataStudio and DataAnalysis, the task executor's Alibaba Cloud account name is passed as the proxy user. In Operation Center, the default access identity configured during registration is passed. For EMR Kyuubi tasks, the identity is passed via `hive.server2.proxy.user`. For EMR Spark tasks and non-JDBC EMR Spark SQL tasks, it is passed via `--proxy-user`. |
   | Do Not Pass | Data access is verified based on the identity method configured during cluster registration. |

4. Click OK.
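Conceptually, the two proxy-user mechanisms in the table above look like the following. This is a sketch with hypothetical host and user names, not what DataWorks executes internally.

```shell
# Conceptual sketch only: how a proxy identity is attached when Pass is
# selected. "alice" and the Kyuubi host are hypothetical placeholders.
PROXY_USER="alice"

# EMR Spark tasks and non-JDBC EMR Spark SQL tasks: the spark-submit flag.
SPARK_CMD="spark-submit --proxy-user ${PROXY_USER} --class com.example.Job job.jar"

# EMR Kyuubi tasks: the HiveServer2-compatible JDBC connection property.
KYUUBI_URL="jdbc:hive2://kyuubi-host:10009/default;hive.server2.proxy.user=${PROXY_USER}"

echo "$SPARK_CMD"
echo "$KYUUBI_URL"
```

In both cases the cluster authenticates as the registered service identity and then authorizes data access as the proxied user, which is what enables the fine-grained access control described above.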
Step 3: Initialize the resource group
Initialize the resource group when you register a cluster for the first time, change cluster service configurations, or upgrade component versions (for example, after modifying core-site.xml). This ensures the resource group can connect to the EMR cluster after you configure network connectivity.
1. On the Computing Resource page, find the EMR computing resource and click Initialize Resource Group in the upper-right corner.
2. Find the resource group and click Initialize. After initialization completes, click OK.
(Optional) Configure a YARN resource queue
On the Computing Resource page, find the associated EMR cluster. On the YARN Resource Queue tab, click Edit YARN Resource Queue to set a global YARN resource queue for tasks across different modules.
(Optional) Set Spark-related parameters
Set global Spark property parameters for tasks across different modules.
1. On the Computing Resource page, find the associated EMR cluster.
2. On the Spark-related Parameters tab, click Edit Spark-related Parameter.
3. Click Add, then enter a Spark Property Name and its corresponding Spark Property Value to set the global Spark parameters.
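The property names follow the standard Spark configuration format; the values below are hypothetical examples, not recommendations:

```
spark.executor.memory           4g
spark.executor.cores            2
spark.sql.shuffle.partitions    200
```

Properties set here apply globally to tasks on this computing resource and can be overridden per task where the node type supports it.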
What's next
- Configure Kyuubi connection information: Set a custom account and password for Kyuubi sessions.
- Develop tasks in Data Studio using EMR nodes: Start building data pipelines with EMR-related node types.