This topic describes how to associate an E-MapReduce (EMR) cluster with a DataWorks workspace. After the association, you can create nodes such as EMR Hive, EMR MR, EMR Presto, and EMR Spark SQL nodes based on the EMR compute engine instance, configure EMR workflows, schedule the nodes, and manage metadata. These capabilities support efficient data production. You must associate an EMR cluster with a DataWorks workspace before you can create EMR nodes to develop data.

DataWorks provides two modes for you to associate an EMR cluster with a workspace: Shortcut mode and Security mode. The two modes can meet the security requirements of various enterprises. If you associate an EMR cluster with a workspace by using the Shortcut mode, you can create and run EMR nodes to generate data. If you associate an EMR cluster with a workspace by using the Security mode, you can create and run EMR nodes to generate data and manage permissions on the data to ensure higher security.

Shortcut mode

In Shortcut mode, if you run or schedule EMR nodes in DataWorks by using your Alibaba Cloud account or as a RAM user, the code is committed to the EMR cluster and run by the Hadoop user of the EMR cluster.
Notice
  • The Hadoop user has all the permissions on the EMR cluster. Proceed with caution when you use the Shortcut mode to associate an EMR cluster with a workspace.
  • Before you use the Shortcut mode to associate an EMR cluster with a workspace, you must attach the AliyunEMRDevelopAccess policy to workspace roles such as developers and administrators. This way, the roles can be used to create and run EMR nodes in DataStudio.
    • The AliyunEMRDevelopAccess policy is attached to Alibaba Cloud accounts by default.
    • To run EMR nodes as a RAM user, you must attach the AliyunEMRDevelopAccess policy to the RAM user. For more information, see Grant permissions to RAM users.
The Shortcut mode is suitable for workspaces that do not require strict permission management for users who run nodes.
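The policy requirement in the preceding notice can be expressed as a small pre-flight check. The following Python sketch is illustrative only: the `Identity` class and both helper functions are hypothetical names, not part of any DataWorks or RAM SDK, and the sets of attached policies are assumed to have been collected elsewhere (for example, from the RAM console).

```python
from dataclasses import dataclass, field

# Policy named in this topic as required for running EMR nodes in Shortcut mode.
REQUIRED_POLICY = "AliyunEMRDevelopAccess"


@dataclass
class Identity:
    """Hypothetical model of an Alibaba Cloud account or a RAM user."""
    name: str
    is_alibaba_cloud_account: bool = False
    attached_policies: set = field(default_factory=set)


def can_run_emr_nodes_in_shortcut_mode(identity: Identity) -> bool:
    """Return True if the identity satisfies the policy requirement.

    Alibaba Cloud accounts have the policy attached by default;
    RAM users need it attached explicitly.
    """
    if identity.is_alibaba_cloud_account:
        return True
    return REQUIRED_POLICY in identity.attached_policies


def users_missing_policy(identities):
    """List the names of identities that still need the policy attached."""
    return [i.name for i in identities
            if not can_run_emr_nodes_in_shortcut_mode(i)]
```

For example, `users_missing_policy(members)` over the members of a workspace returns the RAM users to whom the policy must still be attached before they can run EMR nodes in DataStudio.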
To associate an EMR cluster with a workspace in Shortcut mode, perform the following steps:
  1. Go to the Workspace Management page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. Select the region in which the workspace resides. Then, find the workspace and click Workspace Settings in the Actions column. In the Workspace Settings panel, click More. The Workspace Management page appears.
      Alternatively, you can find the workspace and click Data Analytics in the Actions column. On the DataStudio page, click the Workspace Manage icon in the upper-right corner. The Workspace Management page appears.
  2. In the Compute Engine Information section, click the E-MapReduce tab.
  3. On the E-MapReduce tab, click Add Instance.
  4. In the New EMR cluster dialog box, set the parameters.
    Parameters in the New EMR cluster dialog box vary based on the mode in which your DataWorks workspace runs. The following table describes the parameters for a DataWorks workspace in standard mode. You must set the parameters for both the production environment and the development environment.
    Parameter Description
    Instance Display Name The display name of the EMR compute engine instance.
    Region The region of the current workspace. The value of this parameter cannot be changed.
    Access Mode The access mode of the EMR cluster. Select Shortcut mode from the drop-down list.
    Scheduling access identity The identity that is used to commit the code of an EMR node to the EMR cluster. The code is committed when the node is committed to the scheduling system of DataWorks in the production environment. Valid values: Alibaba Cloud primary account and Alibaba Cloud sub-account.
    Note
    • This parameter is available only for the production environment.
    • Before you use the Shortcut mode to associate an EMR cluster with a workspace, you must attach the AliyunEMRDevelopAccess policy to workspace roles such as developers and administrators. This way, the roles can be used to create and run EMR nodes in DataStudio.
      • The AliyunEMRDevelopAccess policy is attached to Alibaba Cloud accounts by default.
      • To run EMR nodes as a RAM user, you must attach the AliyunEMRDevelopAccess policy to the RAM user. For more information, see Grant permissions to RAM users.
    Access identity The identity that is used to commit the code of an EMR node in the development environment to the EMR cluster. Default value: Task owner.
    Note
    • This parameter is available only for the development environment of a workspace in standard mode.
    • Task owner can be an Alibaba Cloud account or a RAM user.
      Before you use the Shortcut mode to associate an EMR cluster with a workspace, you must attach the AliyunEMRDevelopAccess policy to workspace roles such as developers and administrators. This way, the roles can be used to create and run EMR nodes in DataStudio.
      • The AliyunEMRDevelopAccess policy is attached to Alibaba Cloud accounts by default.
      • To run EMR nodes as a RAM user, you must attach the AliyunEMRDevelopAccess policy to the RAM user. For more information, see Grant permissions to RAM users.
    Cluster ID The ID of the EMR cluster that you want to associate with the workspace. Select an ID from the drop-down list. The EMR cluster with the selected ID is used as the runtime environment of EMR nodes.
    Project ID The ID of the EMR project that you want to associate with the workspace. Select an ID from the drop-down list. The EMR project with the selected ID is used as the runtime environment of EMR nodes.
    Note The IDs of EMR projects that are in Security mode are not displayed and cannot be selected.
    YARN resource queue The name of the resource queue in the EMR cluster. Unless otherwise specified, set this parameter to default.
    default task submit node The node from which an EMR job is submitted to the EMR cluster. If your EMR cluster is associated with a gateway cluster, set this parameter to the associated gateway cluster. Otherwise, set this parameter to the default value HEADER.
    Note If you set this parameter to the associated gateway cluster, all EMR nodes in the workspace that is associated with the current EMR cluster use the associated gateway cluster to submit EMR jobs by default. If a node does not need to submit EMR jobs by using the gateway cluster, you can add the USE_GATEWAY parameter to the advanced configurations of the node and set the parameter to False. For more information about the advanced configurations of EMR nodes, see the topics related to EMR nodes.
    Endpoint The endpoint of the EMR cluster. The value of this parameter cannot be changed.
  5. Select a resource group.
    1. Select an exclusive resource group for scheduling that connects to the DataWorks workspace. If no exclusive resource group for scheduling is available, create one. For more information about how to create an exclusive resource group for scheduling and configure network connectivity, see Create and use an exclusive resource group for scheduling.
    2. Click Test Connectivity to verify the network connectivity between the exclusive resource group for scheduling and the EMR cluster.
  6. Click Confirm.
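The interaction between the default task submit node parameter and the per-node USE_GATEWAY setting can be sketched as follows. This Python snippet is a simplified model, not DataWorks code: `resolve_submit_node` is a hypothetical helper, the gateway identifier passed to it is a placeholder, and it assumes that a node whose USE_GATEWAY parameter is set to False falls back to the HEADER node, which this topic does not state explicitly.

```python
from typing import Any, Mapping

# Default value of the "default task submit node" parameter.
HEADER = "HEADER"


def resolve_submit_node(default_submit_node: str,
                        advanced_config: Mapping[str, Any]) -> str:
    """Decide from where a single EMR node submits its job.

    default_submit_node: HEADER, or the identifier of the associated
        gateway cluster (workspace-level setting).
    advanced_config: the node's advanced configurations; may contain
        USE_GATEWAY as a bool or as a string such as "False".
    """
    if default_submit_node == HEADER:
        # No gateway cluster is configured, so USE_GATEWAY has no effect here.
        return HEADER

    use_gateway = advanced_config.get("USE_GATEWAY", True)
    if isinstance(use_gateway, str):
        use_gateway = use_gateway.strip().lower() != "false"

    # Assumption: a node that opts out of the gateway submits from HEADER.
    return default_submit_node if use_gateway else HEADER
```

With a gateway cluster configured as the default, a node without the USE_GATEWAY parameter submits through the gateway, and a node with USE_GATEWAY set to False opts out on its own without affecting other nodes in the workspace.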

Security mode

In Security mode, if you commit the code of EMR nodes by using an Alibaba Cloud account or as a RAM user to an EMR cluster, the code is run by a user that has the same name as the Alibaba Cloud account or RAM user. EMR Ranger can be used to manage the permissions of each Hadoop user in the EMR cluster. This ensures that different Alibaba Cloud accounts, node owners, or RAM users have different data permissions when they run EMR nodes in DataWorks. This provides higher data security.
Note
Before you use the Security mode to associate an EMR cluster with a workspace, you must add the credentials of workspace roles such as developers and administrators to the Lightweight Directory Access Protocol (LDAP) directory of the EMR cluster. In addition, you must attach the AliyunEMRDevelopAccess or AliyunEMRFullAccess policy and grant relevant data permissions to the workspace roles. This way, the roles can be used to create and run EMR nodes in DataStudio.
  • The credentials of Alibaba Cloud accounts are in the LDAP directory of the EMR cluster by default. The AliyunEMRDevelopAccess and AliyunEMRFullAccess policies are also attached to Alibaba Cloud accounts by default.
  • To run EMR nodes as a RAM user, you must add the credential of the RAM user to the LDAP directory of the EMR cluster. For more information, see the Add the credentials of specific RAM users to the LDAP directory of the EMR cluster step. In addition, you must attach the AliyunEMRDevelopAccess or AliyunEMRFullAccess policy to the RAM user. For more information, see Grant permissions to RAM users.
The Security mode is suitable for workspaces that require strict management and isolation of data permissions for users who run nodes.
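The difference between the two modes comes down to which cluster user runs the committed code. The following Python sketch models that rule under stated assumptions: `AccessMode` and `run_as_user` are hypothetical names, the Hadoop user is assumed to be named `hadoop`, and the LDAP directory is represented as a plain set of user names.

```python
from enum import Enum


class AccessMode(Enum):
    SHORTCUT = "Shortcut mode"
    SECURITY = "Security mode"


# Assumption: the Hadoop user of the EMR cluster is named "hadoop".
HADOOP_USER = "hadoop"


def run_as_user(mode: AccessMode, submitter: str, ldap_users: set) -> str:
    """Return the cluster user that runs code committed by `submitter`.

    Shortcut mode: every submitter's code runs as the Hadoop user.
    Security mode: code runs as a user with the same name as the
    submitter, whose credentials must be in the LDAP directory.
    """
    if mode is AccessMode.SHORTCUT:
        return HADOOP_USER
    if submitter not in ldap_users:
        raise LookupError(
            f"{submitter!r} is not in the LDAP directory of the EMR cluster")
    return submitter
```

Because Security mode yields a distinct cluster user for each submitter, EMR Ranger policies can grant different data permissions to each one. In Shortcut mode, every submitter shares the permissions of the Hadoop user.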
To use the Security mode to associate an EMR cluster with a workspace, perform the following steps:
  1. Turn on Security Mode for the EMR project.
    1. Log on to the EMR console.
    2. In the top navigation bar, click Data Platform.
    3. In the Projects section, find the project for which you want to enable the Security mode and click Edit Job in the Actions column.
    4. On the page that appears, click the Projects tab in the top navigation bar.
    5. In the left-side navigation pane, click General Configuration. On the General Configuration page, turn on Security Mode.
  2. Add the credentials of specific RAM users to the LDAP directory of the EMR cluster.
    1. Return to the homepage of the EMR console. In the top navigation bar, click Cluster Management.
    2. Find the cluster that you want to manage and click Details in the Actions column.
    3. In the left-side navigation pane, click Users.
    4. On the Users page, click Add User.
    5. In the Add User dialog box, set the parameters.
      We recommend that you add the credentials of the following RAM users to the LDAP directory of the EMR cluster:
      • RAM users that create, test, and run EMR nodes in DataStudio
      • RAM users that create, commit, and deploy EMR nodes in DataStudio
    6. Click OK.
  3. Configure EMR Ranger and manage the permissions of the Hadoop users that correspond to your Alibaba Cloud account and RAM users. For more information, see Integrate Ranger UserSync with an LDAP server and Integrate components with Ranger.
  4. Associate the EMR cluster with the current DataWorks workspace.
    1. Go to the Workspace Management page.
    2. In the Compute Engine Information section, click the E-MapReduce tab.
    3. On the E-MapReduce tab, click Add Instance.
    4. In the New EMR cluster dialog box, set the parameters.
      Parameters in the New EMR cluster dialog box vary based on the mode in which your DataWorks workspace runs. The following table describes the parameters for a DataWorks workspace in standard mode. You must set the parameters for both the production environment and the development environment.
      Parameter Description
      Instance Display Name The display name of the EMR compute engine instance.
      Region The region of the current workspace.
      Access Mode The access mode of the EMR cluster. Select Security mode from the drop-down list and click Confirm in the Please note message.
      Note You cannot use multiple modes to associate an EMR cluster with a DataWorks workspace at the same time. Proceed with caution when you change the access mode of the EMR cluster because a mode change leads to permission changes.
      Scheduling access identity The identity that is used to commit the code of an EMR node to the EMR cluster. The code is committed when the node is committed and deployed to the DataWorks scheduling system in the production environment. The Hadoop user that corresponds to this identity runs the code.
      Valid values: Task owner, Alibaba Cloud primary account, and Alibaba Cloud sub-account.
      • Task owner: commits and runs the code of an EMR node as the node owner. If you select this value, the data permissions of Hadoop users are isolated. Task owner can be an Alibaba Cloud account or a RAM user.
      • Alibaba Cloud primary account: commits the code of an EMR node to the EMR cluster by using an Alibaba Cloud account.
      • Alibaba Cloud sub-account: commits the code of an EMR node to the EMR cluster as a RAM user.
      Note
      • This parameter is available only for the production environment.
      • The credentials of Alibaba Cloud accounts are in the LDAP directory of the EMR cluster by default. The AliyunEMRDevelopAccess and AliyunEMRFullAccess policies are also attached to Alibaba Cloud accounts by default.
      • To run EMR nodes as a RAM user, you must add the credential of the RAM user to the LDAP directory of the EMR cluster. For more information, see the Add the credentials of specific RAM users to the LDAP directory of the EMR cluster step. In addition, you must attach the AliyunEMRDevelopAccess or AliyunEMRFullAccess policy to the RAM user. For more information, see Grant permissions to RAM users.
      Access identity The identity that is used to commit the code of an EMR node in the development environment to the EMR cluster. Default value: Task owner. The Hadoop user that corresponds to the user who runs the node runs the code.
      Note
      • This parameter is available only for the development environment of a workspace in standard mode.
      • Make sure that the credential of the user who runs the node is added to the LDAP directory of the EMR cluster. In addition, make sure that the AliyunEMRDevelopAccess or AliyunEMRFullAccess policy is attached to the user and relevant data permissions are granted to the user. This way, the user can run EMR nodes in DataStudio. Task owner can be an Alibaba Cloud account or a RAM user.
        • The credentials of Alibaba Cloud accounts are in the LDAP directory of the EMR cluster by default. The AliyunEMRDevelopAccess and AliyunEMRFullAccess policies are also attached to Alibaba Cloud accounts by default.
        • To run EMR nodes as a RAM user, you must add the credential of the RAM user to the LDAP directory of the EMR cluster. For more information, see the Add the credentials of specific RAM users to the LDAP directory of the EMR cluster step. In addition, you must attach the AliyunEMRDevelopAccess or AliyunEMRFullAccess policy to the RAM user. For more information, see Grant permissions to RAM users.
      Cluster ID The ID of the EMR cluster that you want to associate with the workspace. Select an ID from the drop-down list. The EMR cluster with the selected ID is used as the runtime environment of EMR nodes.
      Project ID The ID of the EMR project that you want to associate with the workspace. Select the ID of an EMR project in Security mode from the drop-down list.
      Note The IDs of the EMR projects that are not in Security mode are not displayed and cannot be selected.
      YARN resource queue The name of the resource queue in the EMR cluster. Unless otherwise specified, set this parameter to default.
      default task submit node The node from which an EMR job is submitted to the EMR cluster. If your EMR cluster is associated with a gateway cluster, set this parameter to the associated gateway cluster. Otherwise, set this parameter to the default value HEADER.
      Note If you set this parameter to the associated gateway cluster, all EMR nodes in the workspace that is associated with the current EMR cluster use the associated gateway cluster to submit EMR jobs by default. If a node does not need to submit EMR jobs by using the gateway cluster, you can add the USE_GATEWAY parameter to the advanced configurations of the node and set the parameter to False. For more information about the advanced configurations of EMR nodes, see the topics related to EMR nodes.
      Endpoint The endpoint of the EMR cluster. The value of this parameter cannot be changed.
    5. Select a resource group.
      1. Select an exclusive resource group for scheduling that connects to the DataWorks workspace. If no exclusive resource group for scheduling is available, create one. For more information about how to create an exclusive resource group for scheduling and configure network connectivity, see Create and use an exclusive resource group for scheduling.
      2. Click Test Connectivity to verify the network connectivity between the exclusive resource group for scheduling and the EMR cluster.
    6. Click Confirm.
  5. Configure mappings between Alibaba Cloud accounts or RAM users and LDAP accounts.

    After the EMR cluster in Security mode is associated with your DataWorks workspace, EMR nodes are run by using the LDAP account that is mapped to the identity specified by the Access identity parameter. You set the Access identity parameter when you associate the EMR cluster with your DataWorks workspace. You must configure mappings between Alibaba Cloud accounts or RAM users and LDAP accounts on the EMR Cluster Configuration page.

    1. After the EMR cluster is associated with your DataWorks workspace, the Please note message appears. In the Please note message, click Configure Development Environment and Configure Production Environment to configure mappings between Alibaba Cloud accounts or RAM users and LDAP accounts.
    2. On the EMR Cluster Configuration page, click Edit in the upper-right corner of the EMR cluster that is associated with your DataWorks workspace.
    3. In the Edit EMR Cluster Configuration dialog box, configure mappings between Alibaba Cloud accounts or RAM users and LDAP accounts. You can use one of the following methods to configure the mappings:
      • Reference an existing mapping: You can reference an existing mapping in the current workspace.
      • Create a mapping: In the Configure Account Mapping section, specify the Alibaba Cloud account and LDAP account between which you want to configure a mapping, and enter the password of the LDAP account.
        Note
        • An Alibaba Cloud account or a RAM user to whom the AliyunEMRFullAccess policy is attached can configure mappings for all members of this workspace. Other members of the workspace can configure mappings only for themselves.
        • You can add mappings between multiple Alibaba Cloud accounts and LDAP accounts. A single LDAP account can be mapped to multiple Alibaba Cloud accounts in DataWorks.
    4. Click Confirm. The mappings are configured.
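The mapping rules described in the last step can be modeled with a small lookup table. The following Python sketch is a hypothetical illustration, not the DataWorks implementation: `AccountMappings` and its methods are invented names, the `actor_has_full_access` flag stands in for "an Alibaba Cloud account or a RAM user to whom the AliyunEMRFullAccess policy is attached", and LDAP passwords are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AccountMappings:
    """Hypothetical table of cloud account -> LDAP account mappings."""
    _by_cloud_account: dict = field(default_factory=dict)

    def add(self, actor: str, actor_has_full_access: bool,
            cloud_account: str, ldap_account: str) -> None:
        """Record a mapping on behalf of `actor`.

        Actors with full access may map any workspace member;
        other members may configure a mapping only for themselves.
        """
        if not actor_has_full_access and actor != cloud_account:
            raise PermissionError(
                f"{actor!r} can configure a mapping only for itself")
        self._by_cloud_account[cloud_account] = ldap_account

    def ldap_account_for(self, cloud_account: str) -> Optional[str]:
        """Return the LDAP account that runs nodes for `cloud_account`."""
        return self._by_cloud_account.get(cloud_account)
```

Keying the table by cloud account allows several accounts to share one LDAP account, which matches the many-to-one mapping described above, while each cloud account resolves to exactly one LDAP account.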