DataWorks lets you run Hive, MapReduce (MR), Presto, and Spark SQL tasks on E-MapReduce (EMR) clusters — including scheduling workflows and managing metadata. This topic describes how to register an EMR cluster with DataWorks, whether the cluster belongs to your current Alibaba Cloud account or a different one.
Supported cluster types
Limitations
- Permissions: Only the following identities can register an EMR cluster. For more information, see Grant permissions to a RAM user.
  - An Alibaba Cloud account
  - A RAM user or RAM role with the DataWorks Workspace Administrator role and the AliyunEMRFullAccess policy
  - A RAM user or RAM role with the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies
- Regions: EMR Serverless Spark is available only in China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), and US (Virginia).
- Task types: DataWorks does not support EMR Flink tasks.
- Task execution: Use serverless resource groups (recommended) or exclusive resource groups for scheduling (old version) to run EMR tasks.
- Data governance:
  - Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes support data lineage generation. For clusters running version 5.9.1, 3.43.1, or later, all these node types support table-level and field-level lineage.
    Note: For Spark nodes, table-level and field-level lineage requires cluster version 5.8.0, 3.42.0, or later. For earlier versions, only Spark 2.x supports table-level lineage.
  - To manage metadata for a DataLake or custom cluster in DataWorks, configure EMR-HOOK on the cluster. Without EMR-HOOK, metadata is not displayed in real time, audit logs are not generated, and data lineage is unavailable, which makes EMR-related governance tasks impossible. EMR-HOOK is currently supported only for the EMR Hive and EMR Spark SQL services. For more information, see Configure EMR-HOOK for Hive and Configure EMR-HOOK for Spark SQL.
- Kerberos authentication: For EMR clusters with Kerberos authentication enabled, add an inbound rule to the security group to allow UDP access from the vSwitch CIDR block associated with the resource group.
  Note: On the Basic Information tab of the EMR cluster, click the icon for Cluster Security Group to open the Security Group Details tab. Click Inbound in the Rule section, then select Add Rule. Set Protocol Type to Custom UDP. For Port Range, check /etc/krb5.conf in the EMR cluster for the KDC port. Set Destination to the vSwitch CIDR block associated with the resource group.
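To locate the KDC port for the Port Range field, a quick check of the cluster's krb5.conf can be sketched as follows. The file path comes from this topic; the `KRB5_CONF` override variable is only for illustration:

```shell
# Extract the KDC address (host:port) from the cluster's krb5.conf.
# The port after the colon (commonly 88) is the UDP port to allow
# in the security group inbound rule.
KRB5_CONF="${KRB5_CONF:-/etc/krb5.conf}"
kdc_addr=$(grep -E '^[[:space:]]*kdc[[:space:]]*=' "$KRB5_CONF" 2>/dev/null \
  | head -n1 | awk -F= '{gsub(/[[:space:]]/, "", $2); print $2}')
echo "KDC address: ${kdc_addr:-not found}"
```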
Usage notes
- Standard mode workspaces require two clusters: To isolate development and production environments, register two separate EMR clusters. Store their metadata using one of the following methods:
  - Method 1 (recommended for data lake solutions): Store metadata in two separate data catalogs in Data Lake Formation (DLF). For more information, see Switch the metastore type.
  - Method 2: Store metadata in two separate databases in Relational Database Service (RDS). For more information, see Configure a self-managed RDS database.
- Cross-workspace registration: One EMR cluster can be registered to multiple workspaces within the same Alibaba Cloud account, but not to workspaces across different Alibaba Cloud accounts.
- Network connectivity: If the DataWorks resource group cannot reach the EMR cluster, even when they share the same virtual private cloud (VPC) and vSwitch, check the cluster's security group rules. Add an inbound rule for the vSwitch CIDR block that covers the ports of common open source components. For more information, see Manage EMR cluster security groups.
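As a rough connectivity check, you can probe the cluster's component ports from a machine in the resource group's network. The hostname and port list below are placeholders (typical open source defaults for HDFS NameNode, YARN ResourceManager, and HiveServer2), not values taken from this topic:

```shell
# Probe common EMR component ports from the resource group's network.
# EMR_MASTER and the port list are placeholders; substitute your own.
EMR_MASTER="${EMR_MASTER:-emr-master-host}"
for port in 8020 8088 10000; do
  if timeout 3 bash -c "exec 3<>/dev/tcp/${EMR_MASTER}/${port}" 2>/dev/null; then
    echo "port ${port}: reachable"
  else
    echo "port ${port}: NOT reachable"
  fi
done
```

If a port is unreachable, review the security group inbound rules described above before suspecting the cluster itself.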
Prerequisites
Before you begin, ensure that you have:
- An EMR cluster already created. See EMR cluster configuration recommendations for guidance on selecting component settings.
- The required permissions: an Alibaba Cloud account, or a RAM user or RAM role with the AliyunEMRFullAccess policy plus either the Workspace Administrator role or the AliyunDataWorksFullAccess policy.
- A compatible resource group: a serverless resource group (recommended) or an exclusive resource group for scheduling (old version).
- (Standard mode workspaces only) Two separate EMR clusters: one for development, one for production.
Register an EMR cluster
Step 1: Open the cluster registration page
1. Go to the SettingCenter page: Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose More > Management Center. Select the target workspace from the drop-down list and click Go to Management Center.
2. In the left-side navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster and select E-MapReduce for Cluster Type To Register. The Register EMR Cluster page appears.
Step 2: Configure cluster information
On the Register EMR Cluster page, configure the cluster parameters.
For standard mode workspaces, configure cluster information separately for the development and production environments. For more information, see Differences between workspace modes.
Start by setting Display Name of Cluster — the name the cluster uses in DataWorks. The name must be unique.
Then, select an option for Alibaba Cloud Account To Which Cluster Belongs based on where your EMR cluster resides.
Current Alibaba Cloud account
Select this option if the EMR cluster and your DataWorks workspace belong to the same Alibaba Cloud account.
| Parameter | Description | Required |
|---|---|---|
| Cluster Type | The type of EMR cluster to register. For supported types, see Limitations. | Yes |
| Cluster | The EMR cluster to register. If you select EMR Serverless Spark, follow the on-screen instructions to select the E-MapReduce Workspace, default engine version, default resource queue, and other settings. | Yes |
| Default Access Identity | The identity used to access the EMR cluster in this workspace. Development environment: use the cluster account hadoop or the account mapped to the task executor. Production environment: use hadoop, or the account mapped to the task owner, Alibaba Cloud account, or RAM user. If the mapped account is not configured, DataWorks falls back as follows: if a RAM user runs the task, DataWorks uses an EMR cluster system account with the same name, and the task fails if LDAP or Kerberos authentication is enabled; if an Alibaba Cloud account runs the task, the task reports an error. For more information, see Configure cluster identity mappings. | Yes |
| Pass Proxy User Information | Whether to pass proxy user information when running tasks. Pass: permissions are verified based on the proxy user. In DataStudio and DataAnalysis, the task executor's account name is passed dynamically. In Operation Center, the default access identity's account name is passed. Do Not Pass: permissions are based on the authentication method configured during registration. For EMR Kyuubi tasks, proxy user information is passed using hive.server2.proxy.user. For EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks, it is passed using --proxy-user. | Yes |
| Configuration Files | Required if Cluster Type is HADOOP. Export configuration files from the EMR console (see Export and import service configurations), or retrieve them directly by logging on to the EMR cluster from these paths: /etc/ecm/hadoop-conf/core-site.xml, /etc/ecm/hadoop-conf/hdfs-site.xml, /etc/ecm/hadoop-conf/mapred-site.xml, /etc/ecm/hadoop-conf/yarn-site.xml, /etc/ecm/hive-conf/hive-site.xml, /etc/ecm/spark-conf/spark-defaults.conf, /etc/ecm/spark-conf/spark-env.sh. After exporting, rename the files as required by the upload UI. | Conditional |
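As a sketch of gathering the configuration files by hand, the `/etc/ecm/` paths listed above can be collected in one pass on the EMR master node (the output directory name here is arbitrary):

```shell
# Copy the service configuration files listed above into one directory
# for renaming and upload. Files that do not exist are reported.
outdir="emr-conf"
mkdir -p "$outdir"
for f in /etc/ecm/hadoop-conf/core-site.xml \
         /etc/ecm/hadoop-conf/hdfs-site.xml \
         /etc/ecm/hadoop-conf/mapred-site.xml \
         /etc/ecm/hadoop-conf/yarn-site.xml \
         /etc/ecm/hive-conf/hive-site.xml \
         /etc/ecm/spark-conf/spark-defaults.conf \
         /etc/ecm/spark-conf/spark-env.sh; do
  if [ -f "$f" ]; then cp "$f" "$outdir/"; else echo "missing: $f"; fi
done
```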
Another Alibaba Cloud account
Select this option if the EMR cluster belongs to a different Alibaba Cloud account. Cross-account registration supports only EMR on ECS: DataLake, EMR on ECS: Hadoop, and EMR on ECS: Custom cluster types. EMR Serverless Spark cannot be registered across accounts.
| Parameter | Description | Required |
|---|---|---|
| UID of Alibaba Cloud Account | The UID of the Alibaba Cloud account that owns the EMR cluster. | Yes |
| RAM Role | The RAM role used to access the EMR cluster. The role must be created in the other account and granted permissions to access the DataWorks service in your current account. For setup details, see Scenario: Register a cross-account EMR cluster. | Yes |
| EMR Cluster Type | The type of EMR cluster to register. Currently, only EMR on ECS: DataLake cluster, EMR on ECS: Hadoop cluster, and EMR on ECS: Custom cluster are supported for cross-account registration. | Yes |
| EMR Cluster | The EMR cluster from the other account to register to DataWorks. | Yes |
| Configuration Files | Configure the files as prompted on the UI. For details on obtaining configuration files, see Export and import service configurations. Alternatively, log on to the EMR cluster and retrieve the files from: /etc/ecm/hadoop-conf/core-site.xml, /etc/ecm/hadoop-conf/hdfs-site.xml, /etc/ecm/hadoop-conf/mapred-site.xml, /etc/ecm/hadoop-conf/yarn-site.xml, /etc/ecm/hive-conf/hive-site.xml, /etc/ecm/spark-conf/spark-defaults.conf, /etc/ecm/spark-conf/spark-env.sh. After exporting, rename the files as required by the upload UI. | Yes |
| Default Access Identity | The identity used to access the EMR cluster in this workspace. Development environment: use hadoop or the account mapped to the task owner. Production environment: use hadoop, or the account mapped to the task owner, Alibaba Cloud account, or RAM user. Fallback behavior when no mapping is configured: if a RAM user runs the task, DataWorks uses an EMR cluster system account with the same name, and the task fails if LDAP or Kerberos authentication is enabled; if an Alibaba Cloud account runs the task, the task reports an error. | Yes |
| Pass Proxy User Information | Whether to pass proxy user information when running tasks. Pass: permissions are verified based on the proxy user. In DataStudio and DataAnalysis, the task executor's account name is passed dynamically. In Operation Center, the default access identity's account name is passed. Do Not Pass: permissions are based on the authentication method configured during registration. For EMR Kyuubi tasks, proxy user information is passed using hive.server2.proxy.user. For EMR Spark tasks and non-JDBC-mode EMR Spark SQL tasks, it is passed using --proxy-user. | Yes |
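For a concrete picture of the two mechanisms named above: a Kyuubi (HiveServer2-compatible) JDBC URL carries the proxy user as a session property, while Spark-style submissions receive it as a launcher flag. Both lines are illustrative only; the host, port, and account names are placeholders, and DataWorks passes these values for you:

```
# Kyuubi / HiveServer2-compatible JDBC URL with a proxy user property
jdbc:hive2://emr-master-host:10009/default;hive.server2.proxy.user=task_executor

# Spark launcher flag equivalent
--proxy-user task_executor
```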
Step 3: Initialize a resource group
Initialize the resource group when you first register a cluster, change cluster service configurations (such as modifying core-site.xml), or upgrade a component version. This step ensures the resource group can connect to EMR and run tasks in the current environment.
1. On the Cluster Management page, find the tab of the registered EMR cluster and click Initialize Resource Group in the upper-right corner.
2. Locate the target resource group and click Initialize. Both serverless resource groups and exclusive resource groups for scheduling (old version) can be initialized.
3. Wait 1 to 2 minutes for initialization to complete, then click OK.

Note the following:
- If initialization fails, use the connectivity diagnosis tool to troubleshoot the cause.
- Initialization may cause running tasks to fail. Unless immediate reinitialization is necessary (for example, to prevent widespread task failures after a configuration change), initialize the resource group during off-peak hours.
What's next
After registering your EMR cluster, complete the following setup steps:
- Data development: Follow the Data development process guide to configure component environments.
- Configure cluster identity mappings: If the default access identity is not the hadoop account, configure identity mappings to control which resources RAM users can access in DataWorks.
- Set global YARN resource queues: Specify the YARN queues used by each module, and control whether workspace-level settings override module-level configurations.
- Set global Spark parameters: Customize global Spark parameters based on the official Spark documentation, and control whether workspace-level parameters override module-level settings for matching parameter names.
- Set Kyuubi connection information: If you want to use a custom account and password to log on to Kyuubi and run tasks, configure the Kyuubi connection details.
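As an illustration of the global Spark parameters step above, parameter names follow the standard spark-defaults.conf key/value form from the official Spark configuration documentation; the values here are arbitrary examples, not recommendations:

```
spark.executor.memory          4g
spark.executor.cores           2
spark.sql.shuffle.partitions   200
```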