Configure DataWorks

Last Updated: Sep 06, 2023

Before you develop E-MapReduce (EMR) nodes in DataWorks to run EMR jobs, you must configure the related settings in the DataWorks console to make sure that the EMR jobs can be run as expected. For example, you must purchase a resource group, add members to a workspace, and associate an EMR compute engine with the workspace. This topic describes how to configure the key items in the DataWorks console.

Background information

Before you develop EMR nodes in DataWorks to run EMR jobs, you must associate an EMR DataLake cluster as an EMR compute engine instance with a DataWorks workspace. The DataLake cluster is used to run EMR nodes in DataWorks. When you configure the key items in the DataWorks console, take note of the following points that are related to the association modes, resource groups, and permission management:

  • Select an association mode

    You can associate an EMR compute engine with a DataWorks workspace in shortcut or security mode. The shortcut mode supports rapid data processing, and the security mode ensures higher security based on data permission management.

    • In shortcut mode, a unified Hadoop account is used to run nodes in DataWorks.

    • In security mode, you can specify the account that you want to use to run nodes in DataWorks. If Lightweight Directory Access Protocol (LDAP) authentication is enabled for your cluster or you want to isolate data permissions among different cluster accounts, you can configure mappings between member accounts and cluster accounts in DataWorks. For more information, see Subsequent operation: Configure mappings between DataWorks member accounts and EMR cluster accounts.

    For more information about the differences between the shortcut and security modes, see Differences between the shortcut mode and security mode.

  • Establish a network connection between a resource group and an EMR cluster and initialize the resource group

    Before you use an exclusive resource group for scheduling to run EMR nodes, make sure that the following preparations are made:

    • A network connection is established between an exclusive resource group for scheduling and an EMR cluster. This way, you can access the EMR cluster from DataWorks. For more information, see Configure an exclusive resource group for scheduling.

    • The exclusive resource group for scheduling is initialized. A client tool for the EMR cluster is deployed and configured for the resource group. This way, you can run different types of EMR nodes such as EMR Hive nodes, EMR Spark nodes, and EMR Presto nodes. For more information, see Initialize the resource group with which you associate an EMR compute engine in the "Associate an EMR compute engine with the DataWorks workspace" section of this topic.

  • Manage permissions

    • Association permissions

      Only an account to which the AliyunEMRFullAccess policy is attached can be used to associate an EMR cluster as an EMR compute engine instance with a DataWorks workspace. For more information about users, roles, and permissions, see Overview of users, roles, and permissions.

    • Data development permissions

      Before users can develop data in DataWorks, they must be added as workspace members, and mappings between their member accounts and EMR cluster accounts must be configured. DataWorks can then authenticate the EMR cluster accounts and manage their data permissions. Members run nodes in DataWorks by using the mapped EMR cluster accounts, so data permissions are isolated among the members of the workspace. For information about how to add members to a workspace, see the "Add members to a workspace and assign roles to the workspace members" step in the Procedure section. For information about how to configure mappings between member accounts and cluster accounts, see Subsequent operation: Configure mappings between DataWorks member accounts and EMR cluster accounts.
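
The difference between the two association modes can be illustrated with a small Python sketch. It is not DataWorks code and does not call any DataWorks or EMR API. The names `EngineBinding` and `resolve_cluster_account`, the account name `hadoop`, and the choice to raise an error when a member has no mapping are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class AssociationMode(Enum):
    SHORTCUT = "shortcut"
    SECURITY = "security"


@dataclass
class EngineBinding:
    """Hypothetical model of an EMR compute engine associated with a workspace."""
    mode: AssociationMode
    # Shortcut mode: one unified Hadoop account runs every node (name is assumed).
    unified_account: str = "hadoop"
    # Security mode: DataWorks member account -> EMR cluster account.
    account_mappings: Dict[str, str] = field(default_factory=dict)


def resolve_cluster_account(binding: EngineBinding, member: str) -> str:
    """Return the cluster account that would run a node for the given member."""
    if binding.mode is AssociationMode.SHORTCUT:
        # All members share the same account, so there is no per-member isolation.
        return binding.unified_account
    # Security mode: each member runs nodes as the mapped cluster account.
    if member not in binding.account_mappings:
        # Assumption of this sketch: an unmapped member is treated as an error.
        raise LookupError(f"no EMR cluster account is mapped to member {member!r}")
    return binding.account_mappings[member]


shortcut = EngineBinding(mode=AssociationMode.SHORTCUT)
security = EngineBinding(
    mode=AssociationMode.SECURITY,
    account_mappings={"alice": "emr_alice"},
)
print(resolve_cluster_account(shortcut, "alice"))  # hadoop
print(resolve_cluster_account(security, "alice"))  # emr_alice
```

In the sketch, data permission isolation falls out of the lookup: in security mode two members resolve to different cluster accounts, whereas in shortcut mode every member resolves to the same one.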

Limits

  • You can run EMR nodes in DataWorks by using only an exclusive resource group for scheduling.

  • Only an account to which the AliyunEMRFullAccess policy is attached can be used to associate an EMR cluster as an EMR compute engine instance with a DataWorks workspace.

Procedure

  1. Purchase and configure an exclusive resource group for scheduling.

    Before you develop EMR nodes to run EMR jobs in DataWorks, you must purchase an exclusive resource group for scheduling and connect the resource group to the virtual private cloud (VPC) where the current EMR cluster resides. For more information about how to purchase and configure an exclusive resource group for scheduling, see Exclusive resource groups for scheduling.

  2. Add members to a workspace and assign roles to the workspace members.

    You must add users to a workspace as workspace members before the users can develop EMR nodes on the DataStudio page in DataWorks. For more information about how to add a workspace member, see Manage permissions on workspace-level services.

  3. Go to the page on which you can associate an EMR compute engine with a DataWorks workspace.

    1. Go to the Management Center page.

      Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane, click Workspace. On the Workspace page, click the Compute Engine Information tab.

  4. Associate an EMR compute engine with the DataWorks workspace.

    Associate an EMR DataLake cluster as an EMR compute engine instance with a DataWorks workspace. For more information, see Associate an EMR compute engine with a workspace.
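
The procedure above can be summarized as an ordered checklist. The following Python sketch only models that checklist. The `Setup` class, its field names, and `missing_steps` are hypothetical, and the values would have to be filled in by hand after checking the DataWorks console, because the sketch does not query any service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Setup:
    """Hypothetical snapshot of the settings that this topic asks you to configure."""
    has_exclusive_resource_group: bool    # step 1: resource group purchased
    resource_group_in_cluster_vpc: bool   # step 1: connected to the VPC of the EMR cluster
    members_added: bool                   # step 2: users added as workspace members
    account_has_required_policy: bool     # limit: the EMR policy named in this topic is attached
    engine_associated: bool               # step 4: EMR DataLake cluster associated


def missing_steps(setup: Setup) -> List[str]:
    """Return the unfinished items in the order in which the procedure lists them."""
    checks = [
        (setup.has_exclusive_resource_group,
         "Purchase an exclusive resource group for scheduling"),
        (setup.resource_group_in_cluster_vpc,
         "Connect the resource group to the VPC of the EMR cluster"),
        (setup.members_added,
         "Add members to the workspace and assign roles"),
        (setup.account_has_required_policy,
         "Attach the required EMR policy to the associating account"),
        (setup.engine_associated,
         "Associate the EMR compute engine with the workspace"),
    ]
    return [message for done, message in checks if not done]


current = Setup(
    has_exclusive_resource_group=True,
    resource_group_in_cluster_vpc=True,
    members_added=False,
    account_has_required_policy=True,
    engine_associated=False,
)
for item in missing_steps(current):
    print("TODO:", item)
```

An empty list from `missing_steps` means that every item modeled in the sketch is done. It does not prove that EMR nodes will run, because the resource group must also be initialized as described in the Background information section.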