You can register an E-MapReduce (EMR) cluster across Alibaba Cloud accounts. This operation must be performed by using a RAM role. This topic describes how to use a RAM role to enable Alibaba Cloud Account A to register an EMR cluster that belongs to Alibaba Cloud Account B in DataWorks. This way, you can implement cross-account access to EMR data.
Prerequisites
Alibaba Cloud Account A and Alibaba Cloud Account B are created. For information about how to create an Alibaba Cloud account, see Create an Alibaba Cloud account.
Alibaba Cloud Account A: used to register an EMR cluster that belongs to Alibaba Cloud Account B in DataWorks.
Alibaba Cloud Account B: used to provide an EMR cluster.
An EMR cluster is created by using Alibaba Cloud Account B. For information about how to create an EMR cluster, see Create a cluster.
Precautions
Only EMR Hadoop clusters for which the Metadata parameter is not set to DLF Unified Metadata can be used.
Kerberos authentication is not supported.
For Spark, table lineages of SQL nodes are supported, but field lineages of SQL nodes are not.
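The precautions above can be turned into a quick pre-check before you attempt a registration. The following is a minimal sketch, not a DataWorks or EMR API: the ClusterInfo class, its field names, and the registration_blockers function are hypothetical and only encode the limits listed in this topic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterInfo:
    """Hypothetical summary of an EMR cluster; field names are illustrative."""
    cluster_type: str       # for example, "Hadoop"
    metadata_type: str      # value of the Metadata parameter
    kerberos_enabled: bool  # whether Kerberos authentication is turned on


def registration_blockers(cluster: ClusterInfo) -> List[str]:
    """Return the reasons why the cluster cannot be registered across accounts."""
    blockers = []
    if cluster.cluster_type != "Hadoop":
        blockers.append("only EMR Hadoop clusters can be used")
    if cluster.metadata_type == "DLF Unified Metadata":
        blockers.append("Metadata must not be set to DLF Unified Metadata")
    if cluster.kerberos_enabled:
        blockers.append("Kerberos authentication is not supported")
    return blockers
```

An empty list means that none of the listed precautions rules the cluster out. It does not guarantee that the registration succeeds.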
Alibaba Cloud Account B: Create a RAM role and authorize Alibaba Cloud Account A to assume the RAM role
In Alibaba Cloud Account B, create a RAM role that has permissions to access EMR resources. Alibaba Cloud Account B then authorizes Alibaba Cloud Account A to assume this role to access the EMR resources.
Create a RAM role.
Log on to the RAM console by using Alibaba Cloud Account B. Create a RAM role and add Alibaba Cloud Account A as a trusted Alibaba Cloud account for the role. Then, Alibaba Cloud Account A can assume the role to access the authorized resources. For information about how to create a RAM role, see Create a RAM role for a trusted Alibaba Cloud account.
Sample key configurations of a RAM role:
Set the RAM Role Name parameter to EMRRole.
Set the Select Trusted Alibaba Cloud Account parameter to Other Alibaba Cloud Account, and enter the ID of Alibaba Cloud Account A in the field that appears. You can log on to the RAM console by using Alibaba Cloud Account A, and move the pointer over the profile picture in the top navigation bar to obtain the ID of Alibaba Cloud Account A.
After the configuration is complete, Alibaba Cloud Account A can assume the EMRRole role and access the authorized resources.
Modify the trust policy of the EMRRole role.
Go to the details page of the EMRRole role and modify its trust policy to authorize Alibaba Cloud Account A to access EMR clusters that belong to Alibaba Cloud Account B. For information about how to modify the trust policy of a RAM role, see Edit the trust policy of a RAM role. The following code shows the trust policy document:
{
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "san******@emr.dataworks.aliyuncs.com"
        ]
      }
    }
  ],
  "Version": "1"
}
Note: san******@emr.dataworks.aliyuncs.com indicates the ID of Alibaba Cloud Account A.
Attach the AliyunDataWorksAccessingEMRReadOnlyPolicy policy to the EMRRole role.
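If you prepare the trust policy outside the console, you can generate the document from a template so that only the service principal changes. The following sketch uses only the Python standard library. The function names are hypothetical, and the suffix check is an assumption based on the sample policy in this topic, not a documented validation rule.

```python
import json

# Suffix taken from the sample trust policy in this topic.
PRINCIPAL_SUFFIX = "@emr.dataworks.aliyuncs.com"


def build_trust_policy(service_principal: str) -> dict:
    """Build the trust policy document with the given service principal."""
    return {
        "Statement": [
            {
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {"Service": [service_principal]},
            }
        ],
        "Version": "1",
    }


def render_trust_policy(service_principal: str) -> str:
    """Return the policy as JSON text that can be pasted into the policy editor."""
    if not service_principal.endswith(PRINCIPAL_SUFFIX):
        raise ValueError("service principal must end with " + PRINCIPAL_SUFFIX)
    return json.dumps(build_trust_policy(service_principal), indent=2)
```

For example, render_trust_policy("san******@emr.dataworks.aliyuncs.com") reproduces the sample document. Replace the masked value with the real value for Alibaba Cloud Account A.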
Alibaba Cloud Account A: Register an EMR cluster that belongs to Alibaba Cloud Account B
In this step, use Alibaba Cloud Account A to register an EMR cluster that belongs to Alibaba Cloud Account B in a workspace of Alibaba Cloud Account A. Before you perform the following steps, you must obtain the ID of Alibaba Cloud Account B.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the Select Cluster Type dialog box, click E-MapReduce. The Register EMR Cluster page appears.
Configure information about an EMR cluster.
Configure basic information about the EMR cluster.
Configure the parameters that are shown in the following figure as prompted. If you use a workspace in standard mode, you must register EMR clusters in the development and production environments. For information about workspaces in different modes, see Differences between workspaces in basic mode and workspaces in standard mode.
Configuration descriptions of key parameters:
Set the Alibaba Cloud Primary Account UID parameter to the ID of the Alibaba Cloud account to which the EMR cluster belongs. In this example, set the parameter to the ID of Alibaba Cloud Account B.
Set the Opposite RAM Role parameter to the RAM role that can be assumed by Alibaba Cloud Account A to access the EMR resources of Alibaba Cloud Account B. In this example, set the parameter to EMRRole.
Set the Peer EMR Cluster parameter to the EMR cluster that you want to register in DataWorks. In this example, you can select only EMR Hadoop clusters of V3.38.3 or V3.38.2 for which the Metadata parameter is not set to DLF Unified Metadata.
For more information about how to register an EMR cluster, see Register an EMR cluster in DataWorks.
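Before you fill in the registration page, it can help to collect the key parameter values in one place and check them against the rules described above. The following is a minimal sketch under stated assumptions: the ClusterRegistration class, the environment keys "development" and "production", and the check_registrations function are hypothetical, and the version set contains only the versions named in this example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Versions named in this example; adjust if your environment differs.
SUPPORTED_VERSIONS = {"V3.38.2", "V3.38.3"}


@dataclass(frozen=True)
class ClusterRegistration:
    primary_account_uid: str   # Alibaba Cloud Primary Account UID (Account B)
    opposite_ram_role: str     # Opposite RAM Role, for example "EMRRole"
    peer_cluster_version: str  # version of the Peer EMR Cluster


def check_registrations(standard_mode: bool,
                        by_env: Dict[str, ClusterRegistration]) -> List[str]:
    """Return a list of problems found in the planned registrations."""
    problems = []
    if standard_mode:
        # A workspace in standard mode needs a cluster in both environments.
        for env in ("development", "production"):
            if env not in by_env:
                problems.append(f"missing registration for the {env} environment")
    for env, reg in sorted(by_env.items()):
        if not reg.primary_account_uid:
            problems.append(f"{env}: Alibaba Cloud Primary Account UID is empty")
        if not reg.opposite_ram_role:
            problems.append(f"{env}: Opposite RAM Role is empty")
        if reg.peer_cluster_version not in SUPPORTED_VERSIONS:
            problems.append(f"{env}: unsupported version {reg.peer_cluster_version}")
    return problems
```

The check for the two environments applies only when standard_mode is True. For other workspace modes, the sketch validates whatever entries you pass in.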
Initialize the resource group that you want to use.
You must initialize the resource group that you use in any of the following cases: you register an EMR cluster to DataWorks for the first time, you modify the service configurations of your EMR cluster, such as configurations in the core-site.xml file, or you update the version of a component in your EMR cluster. This ensures that the resource group can properly access the EMR cluster and that EMR tasks can run as expected by using the current environment configurations of the resource group. To initialize a resource group, perform the following steps:
Note: DataWorks allows you to use serverless resource groups (recommended) or old-version exclusive resource groups for scheduling to run EMR tasks. Therefore, you can select a serverless resource group or an exclusive resource group for scheduling when you need to initialize a resource group.
Initializing a resource group may cause tasks that are in progress to fail. Therefore, we recommend that you initialize a resource group during off-peak hours unless otherwise required. For example, if cluster configurations are modified, you must reinitialize the affected resource group immediately. Otherwise, a large number of tasks may fail to run.
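The timing guidance above can be summarized as a small decision helper. This sketch is one reading of the text, not product behavior: the function name, its parameters, and the returned labels are hypothetical, and treating a component update like a first registration (off-peak) is an assumption because this topic requires immediate action only for configuration changes.

```python
def initialization_plan(first_registration: bool,
                        service_config_modified: bool,
                        component_updated: bool) -> str:
    """Suggest when to initialize the resource group."""
    if service_config_modified:
        # Modified cluster configurations call for immediate reinitialization.
        return "immediately"
    if first_registration or component_updated:
        # Initialization is required, but it can wait for off-peak hours.
        return "off-peak"
    return "not required"
```

The configuration check comes first so that it takes priority when several conditions are true at the same time.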
What to do next
After you register the EMR cluster, you can perform the following operations:
Configure mappings between tenant member accounts and EMR cluster accounts. If the default identity used to access the EMR cluster is a non-Hadoop account, you must configure mappings between tenant member accounts and EMR cluster accounts. This way, the RAM user that you use in DataWorks can access only resources on which the RAM user has permissions.
Configure a data synchronization node in Data Integration to synchronize data based on the EMR cluster. For more information, see Data Integration overview.
Go to Operation Center and Data Map to view more information about the cluster. For more information, see Operation Center overview and Data Map overview.