Before you develop E-MapReduce (EMR) nodes in DataWorks to run EMR jobs, you must configure the related settings in the DataWorks console to make sure that the EMR jobs can be run as expected. For example, you must purchase a resource group, add members to a workspace, and associate an EMR compute engine instance with the workspace. This topic describes how to configure the key items in the DataWorks console.

Background information

Before you develop EMR nodes in DataWorks to run EMR jobs, you must associate an EMR data lake cluster as an EMR compute engine instance with a DataWorks workspace. The data lake cluster is used to run EMR nodes in DataWorks. When you configure the key items in the DataWorks console, take note of the following points that are related to the association modes, resource groups, and permission management.
  • Select an association mode
    You can associate an EMR compute engine instance with a DataWorks workspace in shortcut or security mode. The shortcut mode supports rapid data processing, and the security mode ensures higher security based on data permission management. For more information, see Configure the access mode.
    • If you select the shortcut mode, only a Hadoop account can be used to run nodes in DataWorks.
    • If you select the security mode, you can specify the account that you want to use to run nodes in DataWorks. If Lightweight Directory Access Protocol (LDAP) authentication is enabled for your cluster or you want to isolate data permissions among different cluster accounts, you can configure mappings between workspace members and cluster accounts in DataWorks. For more information, see Configure mappings between workspace members and cluster accounts.
    For more information about the differences between the shortcut and security modes, see Differences between the shortcut mode and security mode.
  • Establish a network connection between a resource group and an EMR cluster and initialize the resource group
    To develop EMR nodes to run EMR jobs by using an exclusive resource group for scheduling, you must make sure that the following preparations are made:
    • A network connection is established between an exclusive resource group for scheduling and an EMR cluster. This way, you can access the EMR cluster from DataWorks. For more information, see Configure an exclusive resource group for scheduling.
    • The exclusive resource group for scheduling is initialized. A client tool for the EMR cluster is deployed and configured for the resource group. This way, you can run different types of EMR nodes such as EMR Hive nodes, EMR Spark nodes, and EMR Presto nodes. For more information, see Initialize the resource group with which you associate an EMR compute engine instance in the "Associate an EMR compute engine instance with the DataWorks workspace" section in this topic.
  • Manage permissions
    • Association permissions

      Only an account to which the AliyunEMRFullAccess policy is attached can be used to associate an EMR cluster as an EMR compute engine instance with a DataWorks workspace. For more information about users, roles, and permissions, see Overview of users, roles, and permissions.

    • Data development permissions

      Before users can develop data in DataWorks, they must be added as workspace members, and mappings between workspace members and EMR cluster accounts must be configured. This way, DataWorks can authenticate the EMR cluster accounts and manage their data permissions. Users then run nodes in DataWorks by using the mapped EMR cluster accounts, and data permissions are isolated among the members in the workspace. For more information about how to add a workspace member, see Add a workspace member. For more information about how to configure mappings between workspace members and cluster accounts, see Configure mappings between workspace members and cluster accounts.
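
The preparations listed above can be summarized as a checklist. The following Python sketch is illustrative only: the `Preparation` class, its field names, and the `unmet_prerequisites` function are hypothetical and are not part of any DataWorks or EMR API. It also assumes that account mappings need to be checked only in security mode.

```python
from dataclasses import dataclass, field
from typing import List, Set

REQUIRED_POLICY = "AliyunEMRFullAccess"  # policy named in the text above


@dataclass
class Preparation:
    """Hypothetical snapshot of the preparations; field names are illustrative."""
    network_connected: bool           # resource group can reach the EMR cluster
    resource_group_initialized: bool  # EMR client tools are deployed to the group
    users_are_members: bool           # developers were added as workspace members
    security_mode: bool               # True: security mode, False: shortcut mode
    mappings_configured: bool         # member-to-cluster-account mappings exist
    attached_policies: Set[str] = field(default_factory=set)


def unmet_prerequisites(p: Preparation) -> List[str]:
    """Return the preparations that still need to be completed."""
    problems = []
    if not p.network_connected:
        problems.append("connect the exclusive resource group for scheduling to the EMR cluster")
    if not p.resource_group_initialized:
        problems.append("initialize the exclusive resource group for scheduling")
    if REQUIRED_POLICY not in p.attached_policies:
        problems.append(f"attach the {REQUIRED_POLICY} policy to the associating account")
    if not p.users_are_members:
        problems.append("add the developers as workspace members")
    # Assumption: mappings are checked only in security mode, because shortcut
    # mode always runs nodes with the Hadoop account.
    if p.security_mode and not p.mappings_configured:
        problems.append("configure mappings between workspace members and cluster accounts")
    return problems
```

An empty list means that, under these assumptions, every preparation described in this section is in place.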

Limits

  • You can run EMR nodes in DataWorks by using only an exclusive resource group for scheduling.
  • Only an account to which the AliyunEMRFullAccess policy is attached can be used to associate an EMR cluster as an EMR compute engine instance with a DataWorks workspace.

Procedure

  1. Purchase and configure an exclusive resource group for scheduling.
    Before you develop EMR nodes to run EMR jobs in DataWorks, you must purchase an exclusive resource group for scheduling and connect the resource group to the virtual private cloud (VPC) where the current EMR cluster resides. For more information about how to purchase and configure an exclusive resource group for scheduling, see Overview.
  2. Add members to a workspace and assign roles to the workspace members.
    You must add users to a workspace as workspace members before the users can develop EMR nodes on the DataStudio page in DataWorks. For more information about how to add a workspace member, see Manage workspace-level roles and members.
  3. Go to the page on which you can associate an EMR compute engine instance with a DataWorks workspace.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. Go to the Workspace Management page to associate an EMR compute engine instance with a DataWorks workspace.
      Note If you do not have a DataWorks workspace, you must create a workspace. For more information, see Create a workspace and associate an EMR compute engine instance with the workspace.
      After you select the region where the desired workspace resides, you can use one of the following methods to go to the Workspace Management page:
      • On the Workspaces page, find the desired workspace, move the pointer over the More icon in the Actions column, and then select Workspace Settings. In the Workspace Settings panel, click More.
      • On the Workspaces page, find the desired workspace and click DataStudio in the Actions column. On the DataStudio page, click the Workspace Manage icon in the upper-right corner of the page.
  4. Associate an EMR compute engine instance with the DataWorks workspace.
    Associate an EMR data lake cluster as an EMR compute engine instance with the DataWorks workspace. For more information, see Associate an EMR compute engine instance with the DataWorks workspace.

Associate an EMR compute engine instance with the DataWorks workspace

  1. Add a compute engine instance.
    1. In the Compute Engine Information section of the Workspace Management page, click E-MapReduce.
    2. On the E-MapReduce tab, click Add Instance.
  2. Configure the EMR compute engine instance.
    In the Add EMR Cluster dialog box, configure the parameters. In this example, the parameter configurations in the production environment are used. The parameter configurations in the development environment are similar to those in the production environment.
    Note The configurations vary based on the mode in which the DataWorks workspace runs. A DataWorks workspace can run in basic mode or standard mode. You must configure the parameters for both the production and development environments for a DataWorks workspace in standard mode.
    1. Configure the access mode.
      Parameter | Description
      Instance Display Name | The name of the EMR compute engine instance.
      Region | The region where the EMR compute engine instance resides. The region where the current workspace resides is used by default and you cannot change the value.
      Access Mode | The mode in which you associate the EMR compute engine instance with the DataWorks workspace. Valid values: Shortcut mode and Security mode.
      • Shortcut mode: You can select this mode if you have no requirements for managing data permissions in the workspace. In this mode, you use a single account to perform operations.
      • Security mode: You can select this mode if you have requirements for managing data permissions in the workspace. In this mode, you can configure mappings between personal accounts and EMR cluster accounts to allow nodes to be committed and run by using the personal accounts and to isolate data permissions among the personal accounts in DataWorks.
        Note If LDAP authentication is enabled for your cluster, you must manually configure mappings between Alibaba Cloud accounts or RAM users and LDAP accounts. For more information about how to enable LDAP authentication, see Configure an EMR data lake cluster.
    2. Configure basic information.
      Parameter | Description
      Cluster access identity | The identity that you use to run EMR nodes in DataWorks. The available options depend on the value of the Access Mode parameter.
      • Shortcut mode

        In both the development and production environments in DataWorks, when an Alibaba Cloud account or a RAM user is used to run the code of a node or run a node as scheduled, the code is issued to an EMR cluster, and the Hadoop account in the EMR cluster is actually used to commit the node.

      • Security mode
        This mode is supported in a workspace in standard or basic mode. For a workspace in standard mode, you can specify different EMR cluster accounts that are used to commit and run nodes in development and production environments.
        • Workspace in standard mode
          • Development environment: By default, the user that runs nodes commits the nodes.
          • Production environment: You can select one of the following identities to commit and run scheduling nodes:
            • Task owner: the Linux or LDAP account that is mapped to the node owner is used to run nodes.
            • Alibaba Cloud primary account: the Linux or LDAP account that is mapped to an Alibaba Cloud account is used to run nodes.
            • Alibaba Cloud sub-account: the Linux or LDAP account that is mapped to a RAM user is used to run nodes.
        • Workspace in basic mode: You can select one of the following identities to commit and run scheduling nodes:
          • Task owner: the Linux or LDAP account that is mapped to the node owner is used to run nodes.
          • Alibaba Cloud primary account: the Linux or LDAP account that is mapped to an Alibaba Cloud account is used to run nodes.
          • Alibaba Cloud sub-account: the Linux or LDAP account that is mapped to a RAM user is used to run nodes.
        Note If LDAP authentication is enabled for your EMR cluster and you associate the EMR cluster with a DataWorks workspace in security mode, you must configure mappings between workspace members and cluster accounts after the association is complete. This ensures that different workspace members have different data permissions. For more information about how to enable LDAP authentication, see Configure an EMR data lake cluster.
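
The identity rules above can be read as a two-step lookup: first pick the DataWorks identity, then translate it into a cluster account through the configured mappings. The following Python sketch is a simplified model of that reading, not the actual DataWorks implementation. The `RunContext` class, the option strings such as `"task_owner"`, and the literal `"hadoop"` account name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunContext:
    """Hypothetical identities involved in one node run."""
    runner: str           # user who triggers the run
    node_owner: str       # owner of the node
    primary_account: str  # Alibaba Cloud account selected for scheduling
    sub_account: str      # RAM user selected for scheduling


def pick_dataworks_identity(workspace_mode: str, environment: str,
                            identity_option: str, ctx: RunContext) -> str:
    """Pick the DataWorks identity whose mapped cluster account runs the node."""
    # Development environment of a standard-mode workspace: the user who runs
    # the node is used by default.
    if workspace_mode == "standard" and environment == "development":
        return ctx.runner
    # Production environment, or a basic-mode workspace: the selected option decides.
    options = {
        "task_owner": ctx.node_owner,
        "primary_account": ctx.primary_account,
        "sub_account": ctx.sub_account,
    }
    return options[identity_option]


def resolve_cluster_account(access_mode: str, workspace_mode: str, environment: str,
                            identity_option: str, ctx: RunContext,
                            mappings: Dict[str, str]) -> str:
    """Return the cluster account that would commit and run the node."""
    if access_mode == "shortcut":
        # Shortcut mode always uses the Hadoop account ("hadoop" is a placeholder).
        return "hadoop"
    identity = pick_dataworks_identity(workspace_mode, environment, identity_option, ctx)
    if identity not in mappings:
        raise LookupError(f"no cluster account is mapped to {identity!r}")
    return mappings[identity]
```

In this model, a missing mapping in security mode is an error, which reflects why the mappings must be configured after the association is complete.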
    3. Configure the EMR compute engine instance.
      Parameter | Description
      Cluster ID | The ID of the EMR cluster that you want to associate with a DataWorks workspace. The EMR cluster serves as an EMR compute engine instance. Only data lake clusters in the current region are displayed.
      YARN resource queue | The YARN queue that is selected by default when you commit nodes in DataWorks by using the current EMR compute engine instance. Default value: default.
      Override DataStudio YARN resource queue | The queue rule that is used to run nodes.
      • If you select Override DataStudio YARN resource queue, all nodes are run based on the value of the YARN resource queue parameter.
      • If you do not select Override DataStudio YARN resource queue, the queue for each node is determined as follows:
        • If the queue parameter is configured on the Advanced Settings tab of an EMR node, the EMR node is run based on the value of the queue parameter.
        • If the queue parameter is not configured on the Advanced Settings tab of an EMR node, the EMR node is run based on the value of the YARN resource queue parameter.
      Note If Override DataStudio YARN resource queue is not displayed in the Add EMR Cluster dialog box, the version of your EMR cluster may not meet the requirements. In this case, submit a ticket to upgrade your EMR cluster.
      Endpoint | The endpoint of the EMR compute engine instance. You cannot modify the endpoint.
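
The queue rule described in this table can be expressed as a small decision function. The following Python sketch is illustrative only: the function name and parameter names are hypothetical, and the function models the rule as described here, not DataWorks internals.

```python
from typing import Optional


def effective_queue(instance_queue: str = "default",
                    override: bool = False,
                    node_queue: Optional[str] = None) -> str:
    """Return the YARN queue on which an EMR node would run.

    instance_queue: the queue configured for the EMR compute engine instance
    override:       whether the override option is selected
    node_queue:     the queue parameter on the node's Advanced Settings tab, if any
    """
    if override:
        # The instance-level queue wins for every node.
        return instance_queue
    if node_queue:
        # The node-level queue parameter is configured and takes effect.
        return node_queue
    # No node-level queue is configured: fall back to the instance-level queue.
    return instance_queue
```

For example, `effective_queue("etl", override=True, node_queue="adhoc")` returns `"etl"`, whereas the same call with `override=False` returns `"adhoc"`.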
    4. Initialize the resource group with which you associate an EMR compute engine instance.
      1. Select an exclusive resource group for scheduling that connects to the current DataWorks workspace.

        If no exclusive resource groups for scheduling are available, create an exclusive resource group for scheduling and configure network connectivity for the resource group. For more information, see Create and use an exclusive resource group for scheduling.

      2. Click Initialize to initialize the resource group and test the network connectivity between the exclusive resource group for scheduling and the EMR compute engine instance.

        You can also select multiple resource groups and initialize them at the same time.

      Note If the configurations of the EMR cluster that you use as an EMR compute engine instance change or the versions of the components in the EMR cluster change, you must initialize the resource group again in this dialog box.
  3. Configure mappings between workspace members and cluster accounts.
    After you associate the EMR compute engine instance with the DataWorks workspace in security mode, you must configure a mapping between an EMR cluster account and the identity that you specified for the Cluster access identity parameter. This way, the EMR cluster account is used to run EMR nodes in DataWorks. To configure mappings between workspace members and EMR cluster accounts, perform the following steps:
    1. Go to the EMR Config page.
      The following table describes the methods that you can use to go to the EMR Config page:
      Method | Procedure
      1 | After the EMR compute engine instance is associated with your DataWorks workspace, the Please note message appears. In the Please note message, click Configure Development Environment and Configure Production Environment.
      2 | Click Configure Account Mapping on the E-MapReduce tab in the Compute Engine Information section.
      3 | On the EMR Config page, click Edit in the upper-right corner of the EMR cluster that is associated with your DataWorks workspace.
    2. In the Edit EMR Cluster Configuration dialog box, configure mappings between workspace members and cluster accounts.
      The following table describes the methods that you can use to configure the mappings:
      Method | Procedure
      Reference created mappings | You can select the mappings that are created in the current workspace for Reference Mapping.
      Configure mappings | After you configure the Mapping Type parameter, you can select an Alibaba Cloud account or a RAM user and a mapped cluster account in the Configure Account Mapping section.
      Note
      • An Alibaba Cloud account or a RAM user to which the AliyunEMRFullAccess policy is attached can configure mappings for all members in the current workspace. Other members of the workspace can configure mappings only for themselves.
      • You can add multiple mappings between Alibaba Cloud accounts or RAM users and system accounts, or between Alibaba Cloud accounts or RAM users and LDAP accounts. In DataWorks, a single cluster account can be mapped to multiple Alibaba Cloud accounts or RAM users.
    3. Click Confirm. The mappings are configured.
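
The mapping rules in the preceding note can be modeled as follows. The following Python sketch is a simplified illustration, not a DataWorks API: the `MappingTable` class and its method names are hypothetical. It also assumes that each workspace member is mapped to a single cluster account, which the note does not state explicitly.

```python
from typing import Dict, List, Set


class MappingTable:
    """Hypothetical model of member-to-cluster-account mappings in a workspace."""

    def __init__(self, full_access_accounts: Set[str]):
        # Accounts to which the AliyunEMRFullAccess policy is attached.
        self.full_access_accounts = set(full_access_accounts)
        # Assumption: one cluster account per member.
        self.mappings: Dict[str, str] = {}

    def set_mapping(self, actor: str, member: str, cluster_account: str) -> None:
        """Map a member to a cluster account on behalf of the acting account."""
        # Accounts with full access can configure mappings for all members.
        # Other members can configure mappings only for themselves.
        if actor != member and actor not in self.full_access_accounts:
            raise PermissionError(f"{actor!r} can configure a mapping only for itself")
        self.mappings[member] = cluster_account

    def members_of(self, cluster_account: str) -> List[str]:
        """List the members mapped to a cluster account, in sorted order."""
        # A single cluster account can be shared by multiple members.
        return sorted(m for m, c in self.mappings.items() if c == cluster_account)
```

In this model, several members can share one cluster account, but a member without full access cannot change another member's mapping.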