DataWorks allows you to create E-MapReduce (EMR) nodes, such as Hive, MR, Presto, and Spark SQL nodes, based on an EMR compute engine instance. You can use these nodes to configure EMR workflows, schedule the nodes in a workflow, and manage metadata, which helps you generate data efficiently by using EMR. This topic describes the limits, development process, and required permissions for EMR nodes in DataWorks that are based on an EMR data lake cluster. We recommend that you read this topic before you develop an EMR node to run EMR jobs.

Background information

You can create an EMR data lake cluster only in the new EMR console. A data lake cluster is a big data computing cluster that allows you to analyze data in a flexible, reliable, and efficient manner. You can use a data lake cluster to build a scalable data pipeline. For more information about an EMR data lake cluster, see Configure a data lake cluster.

Limits

The following table describes the limits of EMR data lake clusters, DataWorks features, and EMR nodes that are used to run EMR jobs or perform other operations in DataWorks.
Item: Data lake cluster
Description: The EMR data lake cluster must run V3.41.0 or a later minor version, or V5.7.0 or a later minor version. If the cluster runs a minor version earlier than V3.41.0 or V5.7.0, specific DataWorks features cannot be used.
Item: DataWorks features
Description:
  • DataWorks does not allow you to develop EMR nodes to run Flink jobs.
  • Only EMR Hive, EMR Spark, and EMR Spark SQL nodes can be used to generate lineages. EMR Hive nodes can generate table-level and column-level lineages. EMR Spark nodes and EMR Spark SQL nodes can generate only table-level lineages, except as described in the following note.
    Note For Spark-based EMR nodes, if the EMR cluster to which the nodes belong runs V5.8.0 or a later minor version, or V3.42.0 or a later minor version, the nodes can generate table-level and column-level lineages. If the EMR cluster runs a minor version earlier than V5.8.0 or V3.42.0, only the Spark-based EMR nodes that use Spark 2.x can generate table-level lineages.
Item: EMR nodes in DataWorks
Description: EMR nodes can be run only on an exclusive resource group for scheduling.
Note If your EMR cluster or DataWorks exclusive resource group for scheduling was purchased before July 2022, submit a ticket to apply for an upgrade of the agent that is used to run EMR nodes in DataWorks to the latest version.
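The version rules in the limits above can be expressed as a small check. The following Python sketch is illustrative only. The names `parse_version`, `meets_minimum`, and `spark_lineage_levels` are hypothetical and are not part of any DataWorks or EMR API. The sketch assumes that version strings have the form V<major>.<minor>.<patch>, that only the V3 and V5 release lines are relevant, and that a Spark-based node on an older cluster that does not use Spark 2.x generates no lineage.

```python
import re

# Minimum versions taken from the limits above, keyed by major release line.
MIN_CLUSTER = {3: (3, 41, 0), 5: (5, 7, 0)}
MIN_SPARK_COLUMN_LINEAGE = {3: (3, 42, 0), 5: (5, 8, 0)}


def parse_version(text):
    """Parse a string such as 'V3.41.0' into a tuple such as (3, 41, 0)."""
    match = re.fullmatch(r"[Vv]?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimums):
    """Return True if the version is at or above the minimum of its release line.

    Release lines that are not listed (assumption) are treated as unsupported.
    """
    major = version[0]
    return major in minimums and version >= minimums[major]


def spark_lineage_levels(cluster_version, spark_major):
    """Return the lineage levels that a Spark-based EMR node can generate."""
    version = parse_version(cluster_version)
    if meets_minimum(version, MIN_SPARK_COLUMN_LINEAGE):
        return {"table", "column"}
    if spark_major == 2:
        # Older clusters: only nodes that use Spark 2.x generate table-level lineage.
        return {"table"}
    return set()  # Assumption: no lineage in any other case.
```

For example, `meets_minimum(parse_version("V3.40.9"), MIN_CLUSTER)` evaluates to False because the cluster is older than V3.41.0, so specific DataWorks features would be unavailable.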

Procedure

To develop an EMR node in DataWorks, perform the following steps:
  1. Make preparations
    Before you develop an EMR node in DataWorks, you must complete the required preparations on the EMR and DataWorks sides.
    Item: EMR
    Description: To prevent an error from being reported during the execution of an EMR node in DataWorks, you must make sure that the key configurations of the related EMR data lake cluster meet the requirements. For example, in the EMR data lake cluster, you must configure the settings of Lightweight Directory Access Protocol (LDAP), a Ranger whitelist, and a security policy to authenticate the identity of the account that you use to run the EMR node in DataWorks.
    Item: DataWorks
    Description:
    • You must associate an EMR cluster as an EMR compute engine instance with a DataWorks workspace in the DataWorks console. The EMR cluster is used to run EMR nodes in DataWorks.
      Note You can associate an EMR cluster with a DataWorks workspace in shortcut or security mode. The shortcut mode supports rapid data processing, and the security mode ensures higher security based on data permission management. You can select a mode based on the condition of the EMR cluster and your business requirements.
    • To ensure that an EMR node can normally run, you must purchase and configure an exclusive resource group for scheduling, add members to a workspace, assign roles to the members, and prepare the resources and permissions that are required to run the EMR node.
  2. Develop an EMR node
    1. Create a workflow.

      In DataStudio, you develop data by using nodes that are organized in workflows. Before you can create a node, you must create a workflow. For more information about how to create a workflow, see Create a workflow.

    2. Create an EMR node.
      In DataWorks, you develop data by using nodes. DataWorks encapsulates each type of job in an EMR cluster as a corresponding type of EMR node. You can select a type of EMR node and develop the node based on your business requirements. For more information about how to create an EMR node in DataWorks, see Create an EMR node.
      Note You can create the following types of EMR nodes in DataWorks: EMR Hive, EMR MR, EMR Spark SQL, EMR Spark, EMR Shell, EMR Presto, EMR Impala, and EMR Spark Streaming.
    3. Develop and debug the code for an EMR node.

      To ensure that the EMR node that you develop can be run in an efficient manner and fully utilize computing resources, we recommend that you debug the code for the node before you commit and deploy the node. For more information, see Debug and view a node.

    4. Commit and deploy the EMR node.
      After the debugging is complete, you can commit and deploy the EMR node to the production environment for scheduling. For more information, see Deploy nodes.
      Note The EMR node can be scheduled to run after the node is deployed to the production environment.

Additional information

After you develop an EMR node, you can manage EMR metadata, perform O&M and monitoring operations on the node, and monitor the quality of the data that the node generates in DataWorks. The following table describes these operations.
Operation: Manage EMR metadata
Description: After you associate an EMR cluster with a DataWorks workspace, Data Map automatically collects the metadata of the EMR cluster. On the Data Map page, you can view information such as the details of EMR tables and the source data for the EMR tables.
Note
  • Only EMR Hive, EMR Spark, and EMR Spark SQL nodes support the lineage feature in Data Map. EMR Hive nodes can generate table-level and column-level lineages. Spark-based EMR nodes can generate only table-level lineages, except as described in the next item.
  • For Spark-based EMR nodes, if the EMR cluster to which the nodes belong runs V5.8.0 or a later minor version, or V3.42.0 or a later minor version, the nodes can generate table-level and column-level lineages. If the EMR cluster runs a minor version earlier than V5.8.0 or V3.42.0, only the Spark-based EMR nodes that use Spark 2.x can generate table-level lineages.
  • If you cannot query the metadata of the EMR cluster in Data Map after you associate the EMR cluster with the DataWorks workspace, you can collect the metadata of the EMR cluster again in Data Map. For more information, see Collect metadata from an EMR data source.
References: Overview
Operation: Monitor data quality of EMR nodes in DataWorks
Description: Data Quality allows you to monitor the quality of data in tables that are generated by scheduling nodes. You can configure monitoring rules for the tables.
Note To configure monitoring rules for table data generated by an EMR node that is run based on an EMR data lake cluster, you must select the dqc_emr_plugin_datalake plug-in.
References: Overview
Operation: Perform O&M and monitoring operations on EMR nodes
Description: The intelligent monitoring feature allows you to monitor the status of scheduling nodes. You can configure alert rules to monitor the status of EMR nodes.
References: Overview

Permissions

This section describes the required permissions to develop EMR nodes in DataWorks:
  • EMR cluster authentication

    If you use a non-system account to run EMR nodes in DataWorks, you must enable the authentication service in the EMR cluster and add the non-system account to the authentication service.

  • Permissions to associate an EMR cluster with a DataWorks workspace

    Only an account to which the AliyunEMRFullAccess policy is attached can be used to associate an EMR cluster as an EMR compute engine instance with a DataWorks workspace in the DataWorks console. The EMR cluster is used to run EMR nodes in DataWorks.

  • Data permission management
    • DataWorks

      In DataWorks, you are granted permissions on the EMR compute engine based on the mapping between DataWorks workspace members and EMR cluster accounts. A member in a DataWorks workspace is authenticated to perform operations in an EMR cluster. Different data operation permissions are granted to different node owners, Alibaba Cloud accounts, and RAM users to isolate permissions when they run EMR nodes in DataWorks.

    • EMR

      The permissions that a workspace member has when the workspace member runs EMR nodes are determined by the mapped EMR cluster account. You can use the related components in the EMR cluster to manage the permissions of the EMR cluster account. For example, you can use Ranger to manage the permissions of an EMR cluster account that maps to an Alibaba Cloud account.

  • Permission management for DataWorks service modules

    If you want to run EMR nodes in DataWorks, you must be granted the permissions on DataWorks service modules such as DataStudio, Data Map, Data Quality, and intelligent monitoring. After you obtain the required permissions on the service modules, you can develop EMR nodes, perform O&M operations on the nodes, and monitor data quality of the nodes.

For more information about the required permissions to run EMR nodes in DataWorks, see Permission management for running EMR nodes in DataWorks.
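
The mapping-based permission model described under Data permission management can be pictured with a short Python sketch. It is a conceptual model only and does not call any DataWorks, EMR, or Ranger API. The class `EmrAccount`, the functions `resolve_account` and `can_run`, and all account, table, and action names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EmrAccount:
    """An EMR cluster account (hypothetical model)."""
    name: str
    # Privileges as (resource, action) pairs, for example as managed in Ranger.
    privileges: set = field(default_factory=set)


def resolve_account(member, mapping, accounts):
    """Return the EMR cluster account that a workspace member is mapped to."""
    try:
        return accounts[mapping[member]]
    except KeyError:
        raise LookupError(f"no EMR cluster account is mapped to {member!r}") from None


def can_run(member, resource, action, mapping, accounts):
    """A member's effective permissions are those of the mapped EMR account."""
    account = resolve_account(member, mapping, accounts)
    return (resource, action) in account.privileges


# Hypothetical example data: one RAM user mapped to one EMR cluster account.
accounts = {"emr_etl": EmrAccount("emr_etl", {("sales_db.orders", "select")})}
mapping = {"ram_user_alice": "emr_etl"}
```

With this data, `can_run("ram_user_alice", "sales_db.orders", "select", mapping, accounts)` returns True, and the same call with the action `"insert"` returns False. The sketch shows that the workspace member never holds data permissions directly; changing the privileges of the mapped EMR cluster account changes what the member can do.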