After you bind an E-MapReduce cluster to a DataWorks workspace, you can create EMR nodes such as EMR Hive, EMR MR, EMR Presto, and EMR Spark SQL nodes, add them to workflows, and configure scheduling policies for the nodes. This simplifies metadata management and helps E-MapReduce users produce data more efficiently.

DataWorks allows you to bind an E-MapReduce cluster in Shortcut mode or Security mode to meet different security requirements of enterprises. In Shortcut mode, you create and run EMR nodes only to produce data. In Security mode, you can manage data permissions to ensure higher security.

Shortcut mode

If an E-MapReduce cluster is bound to a DataWorks workspace in Shortcut mode, the code of EMR nodes is delivered to the E-MapReduce cluster, regardless of whether the nodes are scheduled by DataWorks or run by using an Alibaba Cloud account or a RAM user. Then, a Hadoop user in the E-MapReduce cluster is used to run the code.
Notice
  • The Hadoop user has all permissions on the E-MapReduce cluster. Proceed with caution.
  • In Shortcut mode, you must attach the AliyunEMRDevelopAccess policy to workspace members such as developers and administrators, so that they can create and run EMR nodes in DataStudio.

Shortcut mode is suitable for workspaces that do not require data permission management or isolation between node executors.

To bind an E-MapReduce cluster in Shortcut mode, perform the following steps:
  1. Go to the Workspace Management page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, find the workspace to which you want to bind an E-MapReduce cluster and click Workspace Settings in the Actions column. In the Workspace Settings pane, click More. The Workspace Management page appears.
      You can also go to the Workspace Management page by using the following method: Find the workspace to which you want to bind an E-MapReduce cluster and click Data Analytics in the Actions column. On the DataStudio page, click the Workspace Manage icon in the upper-right corner. The Workspace Management page appears.
  2. In the Computing Engine information section, click the E-MapReduce tab.
  3. On the E-MapReduce tab, click Add instances.
  4. In the New EMR cluster dialog box, set the parameters as required.
    Parameters in the New EMR cluster dialog box vary based on the DataWorks workspace mode. The following table describes the parameters for a DataWorks workspace in standard mode. You must set the parameters for both the production environment and the development environment.
    Parameter | Description
    Instance display name | The display name of the E-MapReduce cluster to bind.
    Region | The region of the current workspace, which cannot be modified.
    Access Mode | The access mode of the E-MapReduce cluster. Select Shortcut mode from the drop-down list.
    Scheduling access identity | The identity that is used to deliver the code of an EMR node to the E-MapReduce cluster after the node is committed to the scheduling system of DataWorks in the production environment. Valid values: Alibaba Cloud primary account and Alibaba Cloud sub-account.
      Note If you select Alibaba Cloud sub-account, you must specify a RAM user to which the AliyunEMRDevelopAccess policy is attached.
    Access identity | The identity that is used to deliver the code of an EMR node in the development environment to the E-MapReduce cluster. Default value: Task owner.
      Note This parameter is available only when the workspace is in standard mode.
    Cluster ID | The ID of the E-MapReduce cluster to bind. Select an ID from the drop-down list. The selected E-MapReduce cluster is used as the runtime environment of EMR nodes.
    Project ID | The ID of the E-MapReduce project to bind. Select an ID from the drop-down list. The selected E-MapReduce project is used as the runtime environment of EMR nodes.
      Note E-MapReduce projects in Security mode are unavailable.
    YARN resource queue | The name of the resource queue in the E-MapReduce cluster. Unless otherwise specified, set the value to default.
    Endpoint | The endpoint of E-MapReduce, which cannot be modified.
  5. Click Confirm.
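
The constraints in the parameter table can be checked before you fill in the dialog box. The following Python snippet is a minimal sketch of such a check, not a DataWorks API: the ShortcutBinding class, the check_shortcut_binding function, and the placeholder IDs are hypothetical names introduced for illustration. Only the valid values, the default queue name, and the policy name come from the table above.

```python
from dataclasses import dataclass, field
from typing import List

# These values are copied from the parameter table; everything else is hypothetical.
SCHEDULING_IDENTITIES = ("Alibaba Cloud primary account", "Alibaba Cloud sub-account")
REQUIRED_POLICY = "AliyunEMRDevelopAccess"


@dataclass
class ShortcutBinding:
    """Hypothetical record of the dialog fields; not a DataWorks API object."""
    instance_display_name: str
    cluster_id: str
    project_id: str
    scheduling_access_identity: str
    # Policies attached to the RAM user chosen for scheduling (empty for the primary account).
    ram_user_policies: List[str] = field(default_factory=list)
    yarn_resource_queue: str = "default"


def check_shortcut_binding(binding: ShortcutBinding) -> List[str]:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if binding.scheduling_access_identity not in SCHEDULING_IDENTITIES:
        problems.append(
            "Scheduling access identity must be one of: " + ", ".join(SCHEDULING_IDENTITIES)
        )
    elif (binding.scheduling_access_identity == "Alibaba Cloud sub-account"
          and REQUIRED_POLICY not in binding.ram_user_policies):
        problems.append(f"The RAM user needs the {REQUIRED_POLICY} policy")
    if not binding.yarn_resource_queue:
        problems.append("YARN resource queue is empty; use 'default' unless otherwise specified")
    return problems


if __name__ == "__main__":
    # Placeholder IDs, not real cluster or project identifiers.
    binding = ShortcutBinding(
        "emr-shortcut", "<cluster-id>", "<project-id>", "Alibaba Cloud sub-account"
    )
    for problem in check_shortcut_binding(binding):
        print(problem)
```

The sketch only mirrors the rules stated in the table. It does not query RAM or E-MapReduce, so the list of attached policies must be supplied by the caller.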

Security mode

If an E-MapReduce cluster is bound to a DataWorks workspace in Security mode, the code of EMR nodes is delivered to the E-MapReduce cluster, regardless of whether the nodes are scheduled by DataWorks or run by using an Alibaba Cloud account or a RAM user. Then, a Hadoop user in the E-MapReduce cluster with the same name as the Alibaba Cloud account or RAM user is used to run the code. E-MapReduce Ranger can be used to manage permissions of each Hadoop user in the E-MapReduce cluster. This ensures that different Alibaba Cloud accounts or RAM users have different data permissions when they run EMR nodes in DataWorks.

Security mode is suitable for workspaces that require data permission management and isolation between node executors.
Note In Security mode, you must add the credentials of workspace members such as developers and administrators to the Lightweight Directory Access Protocol (LDAP) directory of the E-MapReduce cluster and grant relevant data permissions to the workspace members, so that they can create and run EMR nodes in DataStudio.
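
The difference between the two access modes comes down to which Hadoop user runs the delivered code. The following Python snippet is a minimal sketch of that rule as this topic describes it, not the actual DataWorks implementation. The resolve_hadoop_user function is a hypothetical name, and the account name hadoop for the shared Shortcut-mode user is an assumption.

```python
SHORTCUT_MODE = "Shortcut mode"
SECURITY_MODE = "Security mode"

# Assumption: the shared, fully privileged account used in Shortcut mode is named "hadoop".
SHARED_HADOOP_USER = "hadoop"


def resolve_hadoop_user(access_mode: str, submitter: str) -> str:
    """Return the Hadoop user that runs the code of an EMR node.

    submitter is the name of the Alibaba Cloud account or RAM user that
    delivers the code, whether the node is scheduled or run manually.
    """
    if access_mode == SHORTCUT_MODE:
        # Every submitter maps to the same shared user, so there is no isolation.
        return SHARED_HADOOP_USER
    if access_mode == SECURITY_MODE:
        # The Hadoop user has the same name as the submitter, so Ranger
        # policies can grant each submitter different data permissions.
        return submitter
    raise ValueError(f"Unknown access mode: {access_mode!r}")


if __name__ == "__main__":
    for mode in (SHORTCUT_MODE, SECURITY_MODE):
        print(mode, "->", resolve_hadoop_user(mode, "ram_user_a"))
```

Because Security mode maps each submitter to a Hadoop user of the same name, that user must already exist in the LDAP directory of the cluster. Otherwise, the node has no matching Hadoop user to run as.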
To bind an E-MapReduce cluster in Security mode, perform the following steps:
  1. Enable Security mode for the E-MapReduce project.
    1. Log on to the E-MapReduce console.
    2. Click Data Platform in the top navigation bar.
    3. In the Projects section, find the project for which you want to enable Security mode and click Edit Job in the Actions column.
    4. On the page that appears, click the Projects tab in the top navigation bar.
    5. In the left-side navigation pane, click General Settings. On the General Settings page, turn on Security Mode.
  2. Add the credentials of specific RAM users to the LDAP directory of the E-MapReduce cluster.
    1. Go back to the E-MapReduce console. Click Cluster Management in the top navigation bar.
    2. Find the cluster and click Details in the Actions column.
    3. In the left-side navigation pane, click Users.
    4. On the Users page, click Add User.
    5. In the Add User dialog box, set the parameters as required.
      We recommend that you add the credentials of the following RAM users to the LDAP directory of the E-MapReduce cluster:
      • RAM users that are used to create, test, and run EMR nodes in DataStudio.
      • RAM users that are used to create, commit, and publish EMR nodes in DataStudio.
    6. Click OK.
  3. Configure E-MapReduce Ranger and manage the permissions of the Hadoop users that correspond to your Alibaba Cloud account and RAM users. For more information, see Interconnect Ranger UserSync with an LDAP server.
  4. Bind the E-MapReduce cluster to the current DataWorks workspace.
    1. Go to the Workspace Management page.
    2. In the Computing Engine information section, click the E-MapReduce tab.
    3. On the E-MapReduce tab, click Add instances.
    4. In the New EMR cluster dialog box, set the parameters as required.
      Parameters in the New EMR cluster dialog box vary based on the DataWorks workspace mode. The following table describes the parameters for a DataWorks workspace in standard mode. You must set the parameters for both the production environment and the development environment.
      Parameter | Description
      Instance display name | The display name of the E-MapReduce cluster to bind.
      Region | The region of the current workspace.
      Access Mode | The access mode of the E-MapReduce cluster. Select Security mode from the drop-down list and click Confirm in the Please note message.
      Scheduling access identity | The identity that is used to deliver the code of an EMR node to the E-MapReduce cluster after the node is committed and published to the scheduling system of DataWorks in the production environment. The Hadoop user that corresponds to this identity is used to run the code. Valid values: Task owner, Alibaba Cloud primary account, and Alibaba Cloud sub-account.
        • Task owner: uses the node owner to deliver and run the code. If you select this option, data permissions of Hadoop users are isolated.
        • Alibaba Cloud primary account: uses an Alibaba Cloud account to deliver the code to the E-MapReduce cluster.
        • Alibaba Cloud sub-account: uses a RAM user to deliver the code to the E-MapReduce cluster.
      Access identity | The identity that is used to deliver the code of an EMR node in the development environment to the E-MapReduce cluster. Default value: Task owner. The corresponding Hadoop user in the E-MapReduce cluster is used to run the code.
        Note
        • This parameter is available only when the workspace is in standard mode.
        • To prevent node execution failures, add the credentials of the node executor to the LDAP directory of the E-MapReduce cluster and grant the node executor relevant data permissions.
      Cluster ID | The ID of the E-MapReduce cluster to bind. Select an ID from the drop-down list. The selected E-MapReduce cluster is used as the runtime environment of EMR nodes.
      Project ID | The ID of the E-MapReduce project to bind. Select the ID of an E-MapReduce project in Security mode from the drop-down list.
        Note E-MapReduce projects that are not in Security mode are unavailable.
      YARN resource queue | The name of the resource queue in the E-MapReduce cluster. Unless otherwise specified, set the value to default.
      Endpoint | The endpoint of E-MapReduce, which cannot be modified.
    5. Click Confirm.
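
In Security mode, a node fails if its executor has no credentials in the LDAP directory of the cluster, so it can help to list the expected executors before you publish nodes. The following Python snippet is a minimal sketch of such a check under stated assumptions: the production_executor and missing_from_ldap functions are hypothetical, the user names are placeholders, and the list of LDAP users is supplied by the caller rather than read from the cluster.

```python
from typing import Iterable, List, Optional

ACCOUNT_IDENTITIES = ("Alibaba Cloud primary account", "Alibaba Cloud sub-account")


def production_executor(scheduling_identity: str,
                        node_owner: str,
                        configured_account: Optional[str] = None) -> str:
    """Return the identity whose Hadoop user runs a published node."""
    if scheduling_identity == "Task owner":
        # The node owner delivers and runs the code.
        return node_owner
    if scheduling_identity in ACCOUNT_IDENTITIES:
        # One fixed account or RAM user delivers the code for every node.
        if configured_account is None:
            raise ValueError("An account or RAM user must be specified")
        return configured_account
    raise ValueError(f"Unknown scheduling access identity: {scheduling_identity!r}")


def missing_from_ldap(executors: Iterable[str], ldap_users: Iterable[str]) -> List[str]:
    """Return the executors that have no entry in the given LDAP user list."""
    known = set(ldap_users)
    return sorted({name for name in executors if name not in known})


if __name__ == "__main__":
    # Placeholder names; in practice these are RAM user names.
    owners = ["ram_user_a", "ram_user_b"]
    executors = [production_executor("Task owner", owner) for owner in owners]
    print(missing_from_ldap(executors, ldap_users=["ram_user_a"]))
```

Any name that the check reports must be added on the Users page of the cluster and granted data permissions in Ranger before the corresponding nodes can run.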