DataWorks allows you to create E-MapReduce (EMR) nodes, such as EMR Hive nodes, EMR MR nodes, EMR Presto nodes, and EMR Spark SQL nodes, based on an EMR compute engine instance. You can use these node types to perform operations such as configuring an EMR workflow, scheduling the nodes in a workflow, and managing workflow metadata. These features help EMR users generate data in an efficient manner. This topic provides an overview of how to use EMR in DataWorks. We recommend that you read this topic before you use EMR in DataWorks.

Limits

The limits on EMR-related features in DataWorks vary based on the version or configurations of your EMR cluster. The following list describes the requirements for each feature.

  • Feature: Associate an EMR cluster with a DataWorks workspace.
    Requirement: DataWorks no longer allows you to associate Hadoop clusters with DataWorks workspaces. However, Hadoop clusters that are already associated with your DataWorks workspace can still be used. We recommend that you associate data lake clusters with DataWorks workspaces for EMR job development. For more information, see Development process of an EMR node in DataWorks (Read this topic before you start).
  • Feature: In the Data Map service of DataWorks, view the output information of metadata tables and use the automatic recommendation feature of Data Map.
    Requirement: For an EMR V3.X cluster, the version of the cluster must be later than V3.33.0. For an EMR V4.X cluster, the version of the cluster must be later than V4.6. Clusters of other versions do not support these features.
  • Feature: Create and use an EMR JAR resource, Create an EMR table, and Create an EMR function.
    Requirement: The Kerberos or LDAP-based security mode is disabled for the EMR cluster.
  • Feature: Create and use node types other than EMR Hive, EMR Spark, EMR Spark SQL, and EMR Spark Streaming nodes.
    Requirement: The Kerberos or LDAP-based security mode is disabled for the EMR cluster. If the Kerberos or LDAP-based security mode is enabled, you can create and use only EMR Hive, EMR Spark, EMR Spark SQL, and EMR Spark Streaming nodes.

  In addition, take note of the following limits:
  • DataWorks does not allow you to develop EMR nodes to run Flink jobs.
  • You can use only EMR Hive nodes to collect EMR metadata and lineage information.
  • If your EMR clusters or exclusive resource groups for scheduling of DataWorks were created before August 1, 2021, you must submit a ticket to update the plug-in that DataWorks uses for EMR to the latest version.

Preparations

Before you run an EMR node in DataWorks, you must make the following preparations:
  • Purchase and configure an exclusive resource group for scheduling.

    Before you run an EMR node, you must purchase an exclusive resource group for scheduling and connect the resource group to the VPC in which the EMR cluster resides. For more information about how to purchase and configure an exclusive resource group for scheduling, see Overview.

  • Check the configurations of the EMR cluster.
    Before you run an EMR node in DataWorks, make sure that the configurations of the EMR cluster meet the following requirements. If they do not, an error may occur when you run the EMR node in DataWorks.
    • An EMR cluster is created, and an inbound rule that contains the following content is added to the security group to which the cluster belongs. A sample CLI command is provided after this list.
      • Action: Allow
      • Protocol type: Custom TCP
      • Port range: 8898/8898
      • Authorization object: 100.104.0.0/16
    • If you integrate Hive with Ranger in EMR, you must modify whitelist configurations and restart Hive before you develop EMR nodes in DataWorks. Otherwise, the error message Cannot modify spark.yarn.queue at runtime or Cannot modify SKYNET_BIZDATE at runtime is returned when you run EMR nodes.
      1. Modify the whitelist configurations by using custom parameters in EMR. You can append key-value pairs to the value of a custom parameter. The following example uses the custom parameter for Hive components:
        hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
        Note In the preceding code, ALISA.* and SKYNET.* are specific to DataWorks.
      2. After the whitelist configurations are modified, you must restart the Hive service to make the configurations take effect. For more information, see Restart a service.
    • The hadoop.http.authentication.simple.anonymous.allowed parameter in the configurations for Hadoop Distributed File System (HDFS) of the EMR cluster is set to true, and the HDFS and YARN services are restarted. A sample property snippet is provided after this list.
  • Associate the EMR cluster with a DataWorks workspace.
    You can create and develop an EMR node in DataWorks only after you associate an EMR cluster with a DataWorks workspace. When you associate an EMR cluster with a DataWorks workspace, select a mode to access the EMR cluster based on the configurations of the EMR cluster. Take note of the following items when you select the access mode. For more information, see Associate an EMR cluster with a DataWorks workspace as an EMR compute engine instance.
    • If LDAP authentication is disabled for the EMR cluster, select Shortcut mode.
    • If LDAP authentication is enabled for the EMR cluster, select Security mode.
      In this scenario, you must perform the following operations:
      • In the EMR console, restart the related services for which LDAP authentication is enabled.
      • Turn on Security Mode for the project in which the EMR cluster resides.
      • Set the Access Mode parameter to Security mode when you associate the EMR cluster with a DataWorks workspace.
      • Configure mappings between RAM users and LDAP accounts on the EMR Cluster Configuration page. This step is required; the mappings determine the LDAP account that DataWorks uses for each RAM user.
      • If LDAP authentication is enabled for the Impala service, perform the following operations. A Java connectivity check is sketched after this list.
        1. Download the Impala JDBC driver.

          If LDAP authentication is enabled, you must provide LDAP authentication credentials when you access Impala by using JDBC. In addition, you must download the Impala JDBC driver from the official website of Cloudera and add the driver to the /usr/lib/hive-current/lib directory. To download the Impala JDBC driver, click JDBC Driver for Impala.

        2. After you download the JAR package, copy the JAR package to the /usr/lib/flow-agent-current/zeppelin/interpreter/jdbc/ directory of the master node in the EMR cluster or the gateway cluster that is associated with the EMR cluster.
        3. Restart the FlowAgent component by clicking Flow Agent Daemon in the EMR console.
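For reference, the inbound security group rule in the preceding checklist can also be added by using the Alibaba Cloud CLI. The following command is a minimal sketch that assumes the ECS AuthorizeSecurityGroup API; the region ID and the security group ID sg-xxxxxxxx are placeholders that you must replace with your own values.

    # Allow DataWorks resource groups (100.104.0.0/16) to access port 8898.
    aliyun ecs AuthorizeSecurityGroup \
      --RegionId cn-hangzhou \
      --SecurityGroupId sg-xxxxxxxx \
      --IpProtocol tcp \
      --PortRange 8898/8898 \
      --SourceCidrIp 100.104.0.0/16 \
      --Policy accept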
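The HDFS requirement in the preceding checklist corresponds to the following property in core-site.xml. The snippet is for reference only; on an EMR cluster, modify the value on the HDFS service configuration page in the EMR console instead of editing the file directly, and then restart the HDFS and YARN services.

    <property>
      <name>hadoop.http.authentication.simple.anonymous.allowed</name>
      <value>true</value>
    </property>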
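After you deploy the Impala JDBC driver, you can verify that LDAP-authenticated access works by running a small connectivity check, such as the following Java sketch. The host name, port, database, user name, and password are placeholders, and the URL format assumes the Cloudera Impala JDBC driver, in which AuthMech=3 indicates user name and password (LDAP) authentication.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaLdapCheck {
        public static void main(String[] args) throws Exception {
            // AuthMech=3: authenticate with an LDAP user name and password.
            // The endpoint and credentials below are placeholders.
            String url = "jdbc:impala://emr-header-1:21050/default;AuthMech=3;"
                    + "UID=your_ldap_user;PWD=your_ldap_password";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT 1")) {
                if (rs.next()) {
                    // A returned value of 1 confirms that the LDAP login works.
                    System.out.println("Impala connectivity OK: " + rs.getInt(1));
                }
            }
        }
    }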

Create and debug an EMR node

After the preparations are complete, you can create, compile, and run an EMR node in DataWorks. When you create and debug an EMR node, take note of the following items:
  • Advanced parameters (a combined example is provided after this list)
    • "USE_GATEWAY":true: If you set this parameter to true, the node is automatically committed to the gateway cluster that is associated with the EMR cluster. The default value is false, which indicates that the node is committed to the master node of the EMR cluster.
    • "SPARK_CONF": "--conf spark.driver.memory=2g --conf xxx=xxx": the parameters that are required to run Spark jobs. You can configure multiple parameters in the --conf xxx=xxx format.
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
      Note Queues that you configure by using this parameter take precedence over the queues that you configure when you associate an EMR cluster with a DataWorks workspace.
    • "vcores": the number of CPU cores. Default value: 1. We recommend that you use the default value.
    • "memory": the memory that is allocated to the launcher. Default value: 2048. We recommend that you use the default value.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": the manner in which SQL statements are executed. The value false indicates that only one SQL statement is executed at a time. The value true indicates that multiple SQL statements are executed at a time.
  • Debugging
    If parameters are used in the code of the node, you must declare these parameters in the Parameters field on the Properties tab and click Run with Parameters to start debugging. A minimal example is provided after this list.
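For reference, the advanced parameters described above can be combined as shown in the following example. The values are illustrative only; whether each parameter takes effect depends on your cluster and node type, and the string form of the FLOW_SKIP_SQL_ANALYZE value is an assumption.

    {
      "USE_GATEWAY": true,
      "SPARK_CONF": "--conf spark.driver.memory=2g --conf spark.executor.cores=2",
      "queue": "default",
      "vcores": 1,
      "memory": 2048,
      "priority": 1,
      "FLOW_SKIP_SQL_ANALYZE": "false"
    }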
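The following minimal example illustrates the debugging workflow: declare bizdate=$bizdate in the Parameters field on the Properties tab, reference the variable in the node code, and then click Run with Parameters. The table and column names are placeholders.

    -- DataWorks replaces ${bizdate} with the value that is declared in the
    -- Parameters field, for example, bizdate=$bizdate.
    SELECT COUNT(*) FROM demo_table WHERE ds = '${bizdate}';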

Data Map

Before you use DataWorks to collect metadata from EMR, you must check whether the configurations of the EMR cluster associated with your DataWorks workspace meet the requirements.
  1. In the EMR console, check whether the configurations of the hive.metastore.pre.event.listeners and hive.exec.post.hooks parameters have taken effect. A quick way to check the current values is shown after this list.
  2. Configure the parameters that are required to collect metadata on the DataMap page in the DataWorks console. For more information, see Collect and view metadata.
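One quick way to check whether the two parameters in step 1 have taken effect is to print their current values from a Hive session on the cluster, as shown in the following sketch. The exact listener and hook classes that are expected depend on the version of your EMR cluster.

    -- Run in a Hive CLI or Beeline session on the EMR cluster:
    SET hive.metastore.pre.event.listeners;
    SET hive.exec.post.hooks;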

FAQ: Why does an EMR node fail to be committed?

  • Problem description
    The following error message is returned when an EMR node is committed in DataWorks.
    Note The error message does not indicate that the node failed to run.
    >>> [2021-07-29 07:49:05][INFO   ][InteractiveJobSubmitter]: Fail to submit job: ### ErrorCode: E00007
    java.io.IOException: Request Failed, code=500, message=
            at com.aliyun.emr.flow.agent.client.protocol.impl.FlowAgentClientRestProtocolImpl.exchange(FlowAgentClientRestProtocolImpl.java:146)
            ...
  • Possible causes

    The FlowAgent component of the EMR cluster is required to integrate EMR into DataWorks. If the preceding error message is returned, the cause may lie in the FlowAgent component.

  • Solution
    Restart the FlowAgent component.
    1. Visit the FlowAgent page.

      By default, the FlowAgent page cannot be accessed directly from the EMR console. To access the page, go to the page of another component and then modify the URL.

      HDFS is used in this example. After you go to the HDFS service page of an EMR cluster, the URL of the page is in the format https://emr.console.aliyun.com/#/cn-hangzhou/cluster/C-XXXXXXXXXXXXXX/service/HDFS. Change the component name HDFS at the end of the URL to EMRFLOW. The complete URL is https://emr.console.aliyun.com/#/cn-hangzhou/cluster/C-XXXXXXXXXXXXXX/service/EMRFLOW.

    2. Restart the FlowAgent component.

      In the upper-right corner of the FlowAgent page, choose Actions > Restart All Components.