Register an EMR cluster to DataWorks

DataWorks allows you to create various nodes, such as Hive, MapReduce, Presto, and Spark SQL nodes, based on an E-MapReduce (EMR) compute engine instance and implement features such as configuring a workflow for EMR tasks, periodically scheduling the workflow, and managing metadata of the workflow. The features ensure that data can be generated and managed in an efficient and stable manner. This topic describes how to register an EMR cluster to DataWorks in same-account mode or cross-account mode.

Background information

EMR is a big data processing solution that runs on the Alibaba Cloud platform.

EMR is based on Apache Hadoop and Apache Spark, which allows you to use peripheral systems in the Hadoop and Spark ecosystems to perform data analysis and processing with ease. EMR can also read data from or write data to other Alibaba Cloud storage systems and database systems, such as Object Storage Service (OSS) and ApsaraDB RDS. Alibaba Cloud provides EMR on ECS, EMR on ACK, and EMR Serverless StarRocks for EMR to meet the business requirements of different users.

You can select various EMR components for running EMR tasks in DataWorks. The optimal configurations of different EMR components for running an EMR task vary. When you configure an EMR cluster, you can refer to Instruction on configuring an EMR cluster to select components that meet your business requirements.

Supported EMR cluster types

Before you can run EMR tasks or perform other EMR-related operations in the DataWorks console, you must create the required EMR clusters and register them to DataWorks. You can register the following types of EMR clusters to DataWorks:

Note

If your cluster cannot be registered to DataWorks, submit a ticket to contact technical support.

Limits

Task type: You cannot run EMR Flink tasks in the DataWorks console.

Task running: You can use only an exclusive resource group for scheduling to run an EMR task.
Task governance:
- Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes can be used to generate data lineages. If your EMR cluster is of V3.43.1, V5.9.1, or a minor version later than V3.43.1 or V5.9.1, you can view the table-level lineages and field-level lineages of the preceding nodes that are created based on the cluster.
  Note
  For Spark-based EMR nodes, if the EMR cluster is of V5.8.0, V3.42.0, or a minor version later than V5.8.0 or V3.42.0, you can view both table-level and field-level lineages. If the EMR cluster is of a minor version earlier than V5.8.0 or V3.42.0, you can view only table-level lineages, and only for Spark-based EMR nodes that use Spark 2.x.
- If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. If you do not configure EMR-HOOK in the desired cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, EMR governance tasks cannot be run. Only EMR Hive and EMR Spark SQL services can be configured with EMR-HOOK. For more information, see Use the Hive extension feature to record data lineage and historical access information and Use the Spark SQL extension feature to record data lineage and historical access information.
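
The version thresholds in the limits above can be checked programmatically. The following is a minimal sketch, not an official DataWorks or EMR API: the function names are hypothetical, and it assumes that a version such as V5.9.1 is compared only against the minimum of its own major version line (3.x or 5.x).

```python
import re

# Minimum versions quoted from the limits above, keyed by major version line.
SQL_LINEAGE_MIN = {3: (3, 43, 1), 5: (5, 9, 1)}    # Hive, Spark, Spark SQL SQL tasks
SPARK_LINEAGE_MIN = {3: (3, 42, 0), 5: (5, 8, 0)}  # Spark-based EMR nodes


def parse_version(text):
    """Turn 'V5.9.1' or '5.9.1' into the tuple (5, 9, 1)."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized EMR version: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimums):
    """Assumption: compare a version only with the minimum of its own major line."""
    parsed = parse_version(version)
    minimum = minimums.get(parsed[0])
    return minimum is not None and parsed >= minimum


def spark_lineage_levels(version, spark_major):
    """Lineage levels available for Spark-based EMR nodes, per the note above."""
    if meets_minimum(version, SPARK_LINEAGE_MIN):
        return ["table", "field"]
    if spark_major == 2:
        # Older clusters: only Spark 2.x nodes, and only table-level lineage.
        return ["table"]
    return []
```

For example, `meets_minimum("V5.9.0", SQL_LINEAGE_MIN)` is false because V5.9.0 is earlier than V5.9.1, whereas `spark_lineage_levels("V5.8.0", 3)` reports both levels.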

Prerequisites

The identity that you want to use is prepared and granted the required permissions.
Only the following identities can register an EMR cluster. For information about how to grant permissions to a RAM user, see Grant permissions to RAM users.
- An Alibaba Cloud account
- A RAM user or RAM role that is assigned the Workspace Administrator role and has the AliyunEMRFullAccess policy attached
- A RAM user or RAM role that has the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies attached
An EMR cluster that meets your business requirements is purchased.
For information about the types of EMR clusters that you can register to DataWorks, see the Supported EMR cluster types section in this topic.
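
The identity requirement above reduces to a small rule. The sketch below only illustrates that rule; the `Identity` class and `can_register_emr_cluster` function are hypothetical and do not query RAM or DataWorks.

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    """Hypothetical description of the identity used to register a cluster."""
    kind: str  # "account", "ram_user", or "ram_role"
    workspace_roles: set = field(default_factory=set)
    policies: set = field(default_factory=set)


def can_register_emr_cluster(identity):
    """Apply the three accepted identity combinations listed above."""
    if identity.kind == "account":
        return True  # an Alibaba Cloud account can always register
    if identity.kind not in ("ram_user", "ram_role"):
        return False
    if "AliyunEMRFullAccess" not in identity.policies:
        return False  # required by both RAM combinations
    return ("Workspace Administrator" in identity.workspace_roles
            or "AliyunDataWorksFullAccess" in identity.policies)
```

In this sketch, a RAM user that has only the AliyunDataWorksFullAccess policy attached is rejected, because both RAM combinations also require AliyunEMRFullAccess.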

Precautions

If you want to isolate EMR data in the development environment from EMR data in the production environment by using a workspace that is in standard mode, you must associate different EMR clusters with the workspace in the development environment and the workspace in the production environment. In addition, the metadata of the EMR clusters must be stored by using one of the following methods:
- Method 1: Store the metadata in two different catalogs in DLF. We recommend that you use this method. For more information, see Use DLF for unified metadata storage.
- Method 2: Store the metadata in two different ApsaraDB RDS databases. For information about how to configure an ApsaraDB RDS database as the metadatabase of an EMR cluster, see Configure a self-managed ApsaraDB RDS for MySQL database.
You can register an EMR cluster in multiple workspaces within the same Alibaba Cloud account. If an existing EMR cluster within the current Alibaba Cloud account is registered in a workspace within the same account, you cannot register the EMR cluster in a workspace within another Alibaba Cloud account.
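
The isolation requirement for a workspace in standard mode can be expressed as a pre-registration check. This is a sketch under stated assumptions: `ClusterBinding`, its fields, and the sample identifiers are hypothetical and are not part of any DataWorks API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterBinding:
    """Hypothetical record of the cluster planned for one environment."""
    cluster_id: str
    metastore_type: str  # "DLF" or "RDS"
    metastore_name: str  # DLF catalog or ApsaraDB RDS database identifier


def isolation_problems(dev, prod):
    """Return the reasons why dev and prod are not isolated (empty list if none)."""
    problems = []
    if dev.cluster_id == prod.cluster_id:
        problems.append("development and production use the same EMR cluster")
    same_store = (dev.metastore_type, dev.metastore_name) == (
        prod.metastore_type, prod.metastore_name)
    if same_store:
        problems.append("development and production share one metadata store")
    return problems
```

An empty list means that the two environments use different clusters and different DLF catalogs or ApsaraDB RDS databases, as the precaution requires.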

Step 1: Go to the EMR cluster registration page

Go to the SettingCenter page.
Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the Management Center page, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Open Source Clusters. On the Open Source Clusters page, click Registering a cluster. In the dialog box that appears, click E-MapReduce to go to the EMR cluster registration page.

Step 2: Register an EMR cluster

On the Register E-MapReduce cluster page, configure cluster information.

Note

If your workspace is in standard mode, you must configure cluster information for the development environment and production environment. For information about workspace modes, see Differences between workspaces in basic mode and workspaces in standard mode.

Cluster Display Name: the name of the EMR cluster in DataWorks. The name must be unique within the current tenant.
Cloud Account To Which The Cluster Belongs: the type of the Alibaba Cloud account to which the EMR cluster you want to register in the current workspace belongs. Valid values:
- Current Alibaba Cloud primary account: the current Alibaba Cloud account
- Other Alibaba Cloud primary accounts: another Alibaba Cloud account

The parameters that you must configure vary based on the account type that you select. You can refer to the following two tables to configure the parameters.

Parameters to configure if you select Current Alibaba Cloud primary account

If you set the Cloud Account To Which The Cluster Belongs parameter to Current Alibaba Cloud primary account, you must configure the following parameters.

Parameter	Description
Cluster Type	The type of the EMR cluster that you want to register. DataWorks allows you to register only specific types of EMR clusters. For more information, see the Supported EMR cluster types section in this topic.
Cluster	The EMR cluster that you want to register.
Default Access Identity	The identity that you want to use to access the EMR cluster in the current workspace. Development environment: You can select Cluster account: hadoop or Cluster account mapped by task performer. Production environment: You can select Cluster account: hadoop, Cluster Account Mapped to Account of Node Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User. Note If you select Cluster Account Mapped to Account of Node Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User for the Default Access Identity parameter, you can refer to Configure mappings between tenant member accounts and EMR cluster accounts to configure a mapping between a DataWorks tenant member and a specified EMR cluster account. The mapped EMR cluster account is used to run EMR tasks in DataWorks. If no mapping is configured between a DataWorks tenant member and an EMR cluster account, DataWorks implements the following policies on task running: If you set the Default Access Identity parameter to Cluster Account Mapped to RAM User and select a RAM user from the RAM User drop-down list, the EMR cluster account that has the same name as the RAM user is automatically used to run EMR tasks in DataWorks. If LDAP or Kerberos authentication is enabled for the EMR cluster, the EMR tasks fail to be run. If you set the Default Access Identity parameter to Cluster Account Mapped to Alibaba Cloud Account, errors will be reported when EMR tasks are run in DataWorks.
Configuration files	If you set the Cluster Type parameter to HADOOP, you must also upload the configuration files that are required. You can obtain the configuration files in the EMR console. For more information, see Export and import service configurations. After you export a service configuration file, modify the name of the file based on the file upload requirements of the GUI. You can also log on to the EMR cluster that you want to register, and access the following paths to obtain the required configuration files: `/etc/ecm/hadoop-conf/core-site.xml /etc/ecm/hadoop-conf/hdfs-site.xml /etc/ecm/hadoop-conf/mapred-site.xml /etc/ecm/hadoop-conf/yarn-site.xml /etc/ecm/hive-conf/hive-site.xml /etc/ecm/spark-conf/spark-defaults.conf /etc/ecm/spark-conf/spark-env.sh`
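
If you obtain the configuration files by logging on to the cluster, a short script can gather them in one directory before you upload them. The following sketch uses the paths listed in the table above. The function name, the destination directory, and the `root` parameter (which exists only so that the function can be tried outside a cluster) are assumptions.

```python
import shutil
from pathlib import Path

# Paths taken from the Configuration files row above.
CONFIG_FILES = [
    "/etc/ecm/hadoop-conf/core-site.xml",
    "/etc/ecm/hadoop-conf/hdfs-site.xml",
    "/etc/ecm/hadoop-conf/mapred-site.xml",
    "/etc/ecm/hadoop-conf/yarn-site.xml",
    "/etc/ecm/hive-conf/hive-site.xml",
    "/etc/ecm/spark-conf/spark-defaults.conf",
    "/etc/ecm/spark-conf/spark-env.sh",
]


def collect_config_files(dest_dir, root="/"):
    """Copy each listed file found under root into dest_dir.

    Returns (copied, missing): copied file names and paths that were not found.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied, missing = [], []
    for path in CONFIG_FILES:
        source = Path(root) / path.lstrip("/")
        if source.is_file():
            shutil.copy2(source, dest / source.name)
            copied.append(source.name)
        else:
            missing.append(path)
    return copied, missing
```

On a cluster node, `collect_config_files("/tmp/emr-conf")` would stage the files for download. You may still need to rename them to match the upload requirements of the GUI.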

Parameters to configure if you select Other Alibaba Cloud primary accounts

If you set the Cloud Account To Which The Cluster Belongs parameter to Other Alibaba Cloud primary accounts, you must configure the following parameters.

Parameter	Description
Alibaba Cloud Primary Account UID	The UID of the Alibaba Cloud account to which the EMR cluster you want to register belongs.
Opposite RAM Role	The RAM role that you want to use to access the EMR cluster. The RAM role must meet the following conditions: The RAM role is created within the specified Alibaba Cloud account. The RAM role is authorized to access the DataWorks service activated within the current logon account. Note For information about how to register an EMR cluster across accounts, see Scenario: Register a cross-account EMR cluster.
Peer EMR Cluster Type	The type of the EMR cluster that you want to register. You can register only an EMR Hadoop cluster that is created in EMR on ECS across accounts.
Peer EMR Cluster	The EMR cluster that you want to register.
Configuration files	The configuration files that are required. You can configure the parameters that are displayed to upload the required configuration files. For information about how to obtain configuration files, see Export and import service configurations. After you export a service configuration file, modify the name of the file based on the file upload requirements of the GUI. You can also log on to the EMR cluster that you want to register, and access the following paths to obtain the required configuration files: `/etc/ecm/hadoop-conf/core-site.xml /etc/ecm/hadoop-conf/hdfs-site.xml /etc/ecm/hadoop-conf/mapred-site.xml /etc/ecm/hadoop-conf/yarn-site.xml /etc/ecm/hive-conf/hive-site.xml /etc/ecm/spark-conf/spark-defaults.conf /etc/ecm/spark-conf/spark-env.sh`
Default Access Identity	The identity that you want to use to access the EMR cluster in the current workspace. Development environment: You can select Cluster account: hadoop or Cluster account mapped by task performer. Production environment: You can select Cluster account: hadoop, Cluster Account Mapped to Account of Node Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User. Note If you select Cluster Account Mapped to Account of Node Owner, Cluster Account Mapped to Alibaba Cloud Account, or Cluster Account Mapped to RAM User for the Default Access Identity parameter, you can refer to Configure mappings between tenant member accounts and EMR cluster accounts to configure a mapping between a DataWorks tenant member and a specified EMR cluster account. The mapped EMR cluster account is used to run EMR tasks in DataWorks. If no mapping is configured between a DataWorks tenant member and an EMR cluster account, DataWorks implements the following policies on task running: If you set the Default Access Identity parameter to Cluster Account Mapped to RAM User and select a RAM user from the RAM User drop-down list, the EMR cluster account that has the same name as the RAM user is automatically used to run EMR tasks in DataWorks. If LDAP or Kerberos authentication is enabled for the EMR cluster, the EMR tasks fail to be run. If you set the Default Access Identity parameter to Cluster Account Mapped to Alibaba Cloud Account, errors will be reported when EMR tasks are run in DataWorks.
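
The Default Access Identity rules described in the tables above can be summarized as a lookup with fallbacks. The sketch below models only the behavior that this topic describes. The function, its parameters, and the exception class are hypothetical, and the case in which no mapping exists for the node owner option is treated as an error because this topic does not describe it.

```python
HADOOP = "Cluster account: hadoop"
NODE_OWNER = "Cluster Account Mapped to Account of Node Owner"
CLOUD_ACCOUNT = "Cluster Account Mapped to Alibaba Cloud Account"
RAM_USER = "Cluster Account Mapped to RAM User"


class AccessIdentityError(Exception):
    """Raised when no usable EMR cluster account can be determined."""


def resolve_cluster_account(option, member, mappings, auth_enabled=False):
    """Return the EMR cluster account that would run a task.

    option:       the Default Access Identity value
    member:       name of the DataWorks tenant member that the option refers to
    mappings:     dict of tenant member name -> EMR cluster account
    auth_enabled: True if LDAP or Kerberos authentication is enabled
    """
    if option == HADOOP:
        return "hadoop"
    if member in mappings:
        return mappings[member]  # a configured mapping always wins
    if option == RAM_USER:
        if auth_enabled:
            raise AccessIdentityError("no mapping, and LDAP or Kerberos is enabled")
        return member  # the same-name cluster account is used automatically
    if option == CLOUD_ACCOUNT:
        raise AccessIdentityError("no mapping for the Alibaba Cloud account")
    # Assumption: not covered by this topic, so fail instead of guessing.
    raise AccessIdentityError("behavior without a mapping is not described: " + option)
```

For example, `resolve_cluster_account(RAM_USER, "alice", {})` falls back to a cluster account named `alice`, and the same call with `auth_enabled=True` raises an error, mirroring the task failure described above.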

Step 3: Initialize a resource group

The first time you register an EMR cluster to DataWorks, or if the service configurations of your EMR cluster change or the version of a component in your EMR cluster is updated, you must initialize the resource group that you use. This ensures that the resource group can access the EMR cluster as expected and that EMR tasks can run with the current environment configurations of the resource group. For example, if you modify the core-site.xml configuration file of your EMR cluster, you must initialize the resource group. To initialize a resource group, perform the following steps:

Go to the Open Source Clusters page in SettingCenter. Find the desired EMR cluster that is registered to DataWorks and click Initialize Resource Group in the section that displays the information of the EMR cluster.
In the Initialize Resource Group dialog box, find the desired resource group and click Initialize.
After the initialization is complete, click Confirmation.

Note

DataWorks allows you to use only exclusive resource groups for scheduling to run EMR tasks. Therefore, you can select only an exclusive resource group for scheduling when you initialize a resource group.
Resource group initialization may cause running tasks to fail. We recommend that you initialize a resource group during off-peak hours unless immediate initialization is required. For example, if cluster configurations are modified, you must reinitialize the resource group immediately. Otherwise, a large number of tasks fail to run.
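
Because a change to a service configuration file such as core-site.xml requires reinitialization, it can help to detect such changes on the cluster side. The sketch below is one possible approach, not a DataWorks feature: it compares SHA-256 digests of the configuration files with a previously saved snapshot. The function names and the idea of keeping a snapshot are assumptions.

```python
import hashlib
from pathlib import Path


def config_fingerprint(paths):
    """Map each existing file path to the SHA-256 hex digest of its contents."""
    digests = {}
    for path in map(Path, paths):
        if path.is_file():
            digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(previous, current):
    """Return the sorted paths that were modified, added, or removed."""
    all_paths = previous.keys() | current.keys()
    return sorted(p for p in all_paths if previous.get(p) != current.get(p))
```

If `changed_files` returns a non-empty list after a cluster change, the resource group probably needs to be initialized again by following the steps above.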

What to do next

Data development: You can refer to General development process to configure the required component environments.
Configure identity mappings for the EMR cluster: If you log on to the DataWorks console as a RAM user, and you set the Default Access Identity parameter to a value other than Cluster account: hadoop when you register the EMR cluster to DataWorks, you must configure identity mappings for the EMR cluster. The identity mappings are used to control the permissions of the RAM user on the EMR cluster in DataWorks.
Specify global YARN queues: You can specify global YARN queues that can be used by each module of DataWorks and specify whether the global settings can overwrite the settings that are separately configured in each module.
Configure global Spark properties: You can refer to the official documentation for Spark to configure global Spark-related parameters. In addition, you can specify whether the global settings can overwrite the Spark-related parameters that are separately configured in each module of DataWorks and have the same names as the global parameters.
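
The interaction between global settings and module-level settings can be pictured as a dictionary merge. The following sketch shows one plausible reading of the overwrite option and is not the DataWorks implementation: the function name and the `global_overrides` flag are hypothetical, and `spark.executor.memory` and `spark.executor.cores` are used only as sample Spark property names.

```python
def effective_settings(global_settings, module_settings, global_overrides):
    """Merge two property dicts; the flag decides which side wins on equal names."""
    if global_overrides:
        # Global values replace module values that have the same name.
        return {**module_settings, **global_settings}
    # Module values are kept; global values only fill in missing names.
    return {**global_settings, **module_settings}


global_conf = {"spark.executor.memory": "4g", "spark.executor.cores": "2"}
module_conf = {"spark.executor.memory": "8g"}

merged_strict = effective_settings(global_conf, module_conf, global_overrides=True)
merged_loose = effective_settings(global_conf, module_conf, global_overrides=False)
```

In this sketch, `merged_strict` takes `spark.executor.memory` from the global settings, whereas `merged_loose` keeps the module-level value and adds only the missing `spark.executor.cores` entry.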