After you associate an E-MapReduce (EMR) cluster with a DataWorks workspace, you can create EMR nodes such as EMR Hive, EMR MR, EMR Presto, and EMR Spark SQL nodes, configure properties for these nodes, and run them. This facilitates metadata management and data generation for EMR users. This topic describes the usage notes for EMR in DataWorks. We recommend that you read this topic before you use EMR in DataWorks.

Limits

The following limits apply to EMR clusters of different versions or configurations when you use EMR in DataWorks:
  • To view the output information of metadata tables and use the automatic recommendation feature of the Data Map service of DataWorks, the version of the EMR cluster must be later than V3.33.0 or V4.6.
  • To create and use EMR Streaming SQL nodes, the version of the EMR cluster must be V3.36 or V5.2.
  • To create and use EMR JAR resources, create EMR tables, or create EMR functions, the Kerberos or LDAP-based security mode must be disabled for the EMR cluster.
  • If the Kerberos or LDAP-based security mode is enabled for the EMR cluster, you can create and use only EMR Hive, EMR Spark, EMR Spark SQL, and EMR Spark Streaming nodes. To create and use other types of nodes, the security mode must be disabled.

In addition, take note of the following limits:
  • DataWorks does not support Flink tasks for EMR.
  • You can use only EMR Hive nodes in DataWorks to collect EMR metadata and lineage information.
  • If your EMR clusters or exclusive resource groups for scheduling of DataWorks were created before August 1, 2021, you must submit a ticket to update the plug-in that is used to run EMR nodes in DataWorks to the latest version.

Preparations

Before you run an EMR node in DataWorks, make the following preparations:
  • Purchase and configure an exclusive resource group for scheduling.

    Before you run an EMR node, you must purchase an exclusive resource group for scheduling and connect it to the VPC in which the EMR cluster resides. For more information about how to purchase and configure an exclusive resource group for scheduling, see Exclusive resource groups for scheduling.

  • Check the configurations of the EMR cluster.
    Before you run an EMR node in DataWorks, you must check whether the configurations of the EMR cluster meet the following requirements. Otherwise, an error may occur when you run an EMR node in DataWorks.
    • An EMR cluster is created, and an inbound rule that contains the following settings is added to the security group to which the EMR cluster belongs. You can also add the rule by calling the ECS API, as shown in the sketch after the settings:
      • Action: Allow
      • Protocol type: Custom TCP
      • Port range: 8898/8898
      • Authorization object: 100.104.0.0/16
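      The following minimal sketch adds the preceding rule through the AuthorizeSecurityGroup operation of the Alibaba Cloud Java SDK (aliyun-java-sdk-ecs). The region ID, security group ID, and credentials are placeholders that you must replace with your own values.

        import com.aliyuncs.DefaultAcsClient;
        import com.aliyuncs.IAcsClient;
        import com.aliyuncs.ecs.model.v20140526.AuthorizeSecurityGroupRequest;
        import com.aliyuncs.exceptions.ClientException;
        import com.aliyuncs.profile.DefaultProfile;

        public class AddEmrInboundRule {
            public static void main(String[] args) throws ClientException {
                // Placeholder region and credentials; replace them with your own values.
                DefaultProfile profile = DefaultProfile.getProfile(
                        "cn-hangzhou", "<your-access-key-id>", "<your-access-key-secret>");
                IAcsClient client = new DefaultAcsClient(profile);

                AuthorizeSecurityGroupRequest request = new AuthorizeSecurityGroupRequest();
                request.setSecurityGroupId("sg-xxxxxxxx");  // security group of the EMR cluster
                request.setIpProtocol("tcp");               // Protocol type: Custom TCP
                request.setPortRange("8898/8898");          // Port range
                request.setSourceCidrIp("100.104.0.0/16");  // Authorization object
                request.setPolicy("accept");                // Action: Allow

                client.getAcsResponse(request);
                System.out.println("Inbound rule added to the security group.");
            }
        }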
    • If you integrate Hive with Ranger in EMR, you must modify whitelist configurations and restart Hive before you develop EMR nodes in DataWorks. Otherwise, the error message Cannot modify spark.yarn.queue at runtime or Cannot modify SKYNET_BIZDATE at runtime is returned when you run EMR nodes.
      1. Modify the whitelist configurations by using custom parameters in EMR. You can append key-value pairs to the value of a custom parameter. In this example, the custom parameter for the Hive service is used. The following code provides an example:
        hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
        Note In the code, ALISA.* and SKYNET.* are configurations in DataWorks.
      2. After the whitelist configurations are modified, restart the Hive service to make the configurations take effect. For more information, see Restart a service.
    • Set the hadoop.http.authentication.simple.anonymous.allowed parameter to true on the HDFS page in the EMR console. Then, restart the HDFS and YARN components.
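      For reference, the target setting in key=value form is shown below. The value takes effect only after you restart the HDFS and YARN components.

        hadoop.http.authentication.simple.anonymous.allowed=true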
  • Associate the EMR cluster with a DataWorks workspace.
    You can create and develop an EMR node in DataWorks only after you associate an EMR cluster with a DataWorks workspace. When you associate an EMR cluster with a DataWorks workspace, select a mode to access the EMR cluster based on the configurations of the EMR cluster. Take note of the following items when you select the access mode. For more information, see Associate an EMR cluster with a DataWorks workspace.
    • If LDAP authentication is disabled for the EMR cluster, select Shortcut mode when you associate the EMR cluster with a DataWorks workspace.
    • If LDAP authentication is enabled for the EMR cluster, select Security mode when you associate the EMR cluster with a DataWorks workspace.
      In this scenario, you must perform the following operations:
      • In the EMR console, restart the components for which LDAP authentication is enabled.
      • Turn on Security Mode for the project in which the EMR cluster resides.
      • Set the Access Mode parameter to Security mode when you associate an EMR cluster with a DataWorks workspace.
      • Configure mappings between RAM users and LDAP accounts on the EMR Cluster Configuration page. This operation is required; do not skip it.
      • If LDAP authentication is enabled for the Impala component, perform the following operations before you run an EMR node in DataWorks:
        1. Download the Impala JDBC driver.

          After LDAP authentication is enabled, you must provide LDAP authentication credentials when you access Impala by using JDBC. Download the Impala JDBC driver from the official website of Cloudera and add the driver to the /usr/lib/hive-current/lib directory.

        2. After the download is complete, copy the JAR package to the /usr/lib/flow-agent-current/zeppelin/interpreter/jdbc/ directory of the header node or gateway node in the EMR cluster.
        3. Restart the FlowAgent component by clicking Flow Agent Daemon in the EMR console.
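        After the driver is in place, you can verify LDAP connectivity to Impala over JDBC. The following minimal Java sketch assumes the Cloudera Impala JDBC 4.1 driver; the host name, port, and credentials are placeholders, and AuthMech=3 selects LDAP user name and password authentication in the Cloudera driver.

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.sql.ResultSet;
          import java.sql.Statement;

          public class ImpalaLdapCheck {
              public static void main(String[] args) throws Exception {
                  // Load the Cloudera Impala JDBC 4.1 driver. The class name may differ
                  // depending on the driver version that you downloaded.
                  Class.forName("com.cloudera.impala.jdbc41.Driver");

                  // AuthMech=3 enables LDAP user name/password authentication.
                  // Replace the host, port, and credentials with your own values.
                  String url = "jdbc:impala://emr-header-1:21050;AuthMech=3;"
                          + "UID=<ldap-user>;PWD=<ldap-password>";

                  try (Connection conn = DriverManager.getConnection(url);
                       Statement stmt = conn.createStatement();
                       ResultSet rs = stmt.executeQuery("SELECT 1")) {
                      while (rs.next()) {
                          System.out.println("Impala responded: " + rs.getInt(1));
                      }
                  }
              }
          }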

Create and debug an EMR node

After the preparations are complete, you can create, develop, and run an EMR node in DataWorks. Take note of the following items when you create and debug an EMR node:
  • Advanced parameters
    • "USE_GATEWAY":true: If you set this parameter to true, the node is automatically committed to the master node of an EMR gateway cluster.
    • "SPARK_CONF": "--conf spark.driver.memory=2g --conf xxx=xxx": the parameters that are required to run Spark jobs. You can configure multiple parameters in the --conf xxx=xxx format.
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
      Note The queue that you configure here takes precedence over the queue that you specify when you associate the EMR cluster with a DataWorks workspace. Support for the parameters of the default queue will be added in later versions.
    • "vcores": the number of CPU cores. Default value: 1. We recommend that you use the default value.
    • "memory": the memory that is allocated to the launcher. Default value: 2048. We recommend that you use the default value.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": specifies how SQL statements are executed. A value of false indicates that only one SQL statement is executed at a time. A value of true indicates that multiple SQL statements are executed at a time.
  • Debugging
    If parameters are used in the code of the node, you must declare these parameters in the Parameters field on the Properties pane and click the Advanced run (run with parameters) icon to start debugging.
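    For example, if the code of an EMR Hive node references a variable named bizdate, declare the variable in the Parameters field and assign the built-in business date to it. The table name sample_table below is hypothetical.

      Node code:        SELECT * FROM sample_table WHERE dt = '${bizdate}';
      Parameters field: bizdate=$bizdate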

Data Map

Before you use DataWorks to collect metadata from EMR, you must check whether the configurations of the EMR cluster meet the requirements.
  1. In the EMR console, check whether the configurations of the hive.metastore.pre.event.listeners and hive.exec.post.hooks parameters take effect.
  2. Set the parameters that are required to collect metadata on the Data Map page in the DataWorks console. For more information, see Collect and view metadata.

FAQ: Why does the operation to commit an EMR job fail?

  • Problem description
    The following error message is returned when an EMR job fails to be committed in DataWorks.
    Note The error message does not indicate that the job failed to be run.
    >>> [2021-07-29 07:49:05][INFO ][InteractiveJobSubmitter]: Fail to submit job: ### ErrorCode: E00007
    java.io.IOException: Request Failed, code=500, message=
            at com.aliyun.emr.flow.agent.client.protocol.impl.FlowAgentClientRestProtocolImpl.exchange(FlowAgentClientRestProtocolImpl.java:146)
            ...
  • Possible causes

    The FlowAgent component of the EMR cluster is required to integrate EMR into DataWorks. If the preceding error message is returned, the cause may lie in the FlowAgent component.

  • Solution
    You can restart the FlowAgent component to fix the problem.
    1. Access the FlowAgent page.

      By default, the FlowAgent page cannot be directly accessed from the EMR console. Therefore, you must access the page of another component and then modify the URL to access the FlowAgent page.

      Take the HDFS Components page as an example. After you access the page, the URL is https://emr.console.aliyun.com/#/cn-hangzhou/cluster/C-XXXXXXXXXXXXXX/service/HDFS. Change the component name HDFS at the end of the URL to EMRFLOW. The complete URL is https://emr.console.aliyun.com/#/cn-hangzhou/cluster/C-XXXXXXXXXXXXXX/service/EMRFLOW.

    2. Restart the FlowAgent component.

      In the upper-right corner of the FlowAgent page, choose Actions > Restart All Components.