This topic describes how to create an E-MapReduce (EMR) Hive node. EMR Hive nodes allow you to use SQL-like statements to read data from, write data to, and manage large-scale data warehouses that are built on distributed storage. You can use EMR Hive nodes to efficiently analyze large amounts of log data.

Prerequisites

  • An Alibaba Cloud EMR cluster is created. The inbound rules of the security group to which the cluster belongs include the following rule:
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • An EMR compute engine instance is associated with the desired workspace. The EMR folder is displayed only after you associate an EMR compute engine instance with the workspace on the Workspace Management page. For more information, see Configure a workspace.
  • If you integrate Hive with Ranger in EMR, you must modify whitelist configurations and restart Hive before you develop EMR nodes in DataWorks. Otherwise, the error message Cannot modify spark.yarn.queue at runtime or Cannot modify SKYNET_BIZDATE at runtime is returned when you run EMR nodes.
    1. You can modify the whitelist configurations by using custom parameters in EMR. You can append key-value pairs to the value of a custom parameter. In this example, the custom parameter for Hive components is used. The following code provides an example:
      hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
      Note In the code, ALISA.* and SKYNET.* are configurations in DataWorks.
    2. After the whitelist configurations are modified, you must restart the Hive service to make the configurations take effect. For more information, see Restart a service.
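    To verify that the whitelist modification has taken effect, you can print the property from a Hive session after the restart. The following check is a minimal sketch and is not part of the official procedure:
      -- Print the effective value of the whitelist property. The output should
      -- include the appended entries, such as ALISA.* and SKYNET.*.
      set hive.security.authorization.sqlstd.confwhitelist.append;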
  • An exclusive resource group for scheduling is created, and the resource group is associated with the virtual private cloud (VPC) where the EMR cluster resides. For more information, see Create and use an exclusive resource group for scheduling.
    Note You can use only exclusive resource groups for scheduling to run EMR Hive nodes.

Procedure

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where your workspace resides, find the workspace, and then click Data Analytics in the Actions column.
  2. On the DataStudio page, move the pointer over the Create icon and choose EMR > EMR Hive.
    Alternatively, you can find the required workflow, right-click the workflow name, and then choose Create > EMR > EMR Hive.
  3. In the Create Node dialog box, set the Node Name and Location parameters.
    Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
  4. Click Commit.
  5. On the configuration tab of the EMR Hive node, write code for the node.
    -- SQL statement example 
    -- The size of SQL statements cannot exceed 130 KB. 
    show tables;
    -- Scheduling parameters are supported. 
    select '${var}';
    -- LIMIT 10000 is automatically added to the SELECT statement. 
    select * from userinfo;
    For more information about scheduling parameters, see Configure scheduling parameters.
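
    For example, a common pattern is to pass the data timestamp into the SQL statements. Assuming that you assign var=$[yyyymmdd-1] on the Properties tab (the parameter name var and the value expression here are illustrative), the following sketch resolves to the date of the previous day at run time:
    -- Assumes the assignment var=$[yyyymmdd-1] on the Properties tab.
    -- At run time, ${var} is replaced with the date of the previous day, such as 20230101.
    select '${var}';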

    If you want to change the values that are assigned to the parameters in the code, click Run with Parameters in the top toolbar. For more information about value assignment for the scheduling parameters, see Scheduling parameters.

    For more information about how to configure a Hive SQL job, see Configure a Hive SQL job.
    Note If multiple EMR compute engine instances are associated with the current workspace, you must select an EMR compute engine instance. If only one EMR compute engine instance is associated with the current workspace, you do not need to select one.
  6. In the right-side navigation pane, click Advanced Settings. On the Advanced Settings tab, configure the following parameters as needed.
    • "SPARK_CONF": "--conf spark.driver.memory=2g --conf xxx=xxx": the parameters that are required to run Spark jobs. You can configure multiple parameters in the --conf xxx=xxx format.
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
    • "vcores": the number of CPU cores. Default value:1.
    • "memory": the memory that is allocated to the launcher, in MB. Default value: 2048.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": specifies how SQL statements are executed. A value of false indicates that only one SQL statement is executed at a time. A value of true indicates that multiple SQL statements are executed at a time.
    • "USE_GATEWAY": specifies whether a gateway cluster is used to submit jobs on the current node. A value of true indicates that a gateway cluster is used to submit jobs. A value of false indicates that a gateway cluster is not used to submit jobs and jobs are submitted to the header node by default.
      Note If the EMR cluster to which the node belongs is not associated with a gateway cluster but you set the USE_GATEWAY parameter to true, jobs may fail to be submitted.
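    For example, whether a node body that contains multiple statements is submitted as a whole or statement by statement depends on the FLOW_SKIP_SQL_ANALYZE parameter. The following sketch uses illustrative table names:
    -- With FLOW_SKIP_SQL_ANALYZE set to true, the two statements below are executed at a time.
    -- With FLOW_SKIP_SQL_ANALYZE set to false, only one SQL statement is executed at a time.
    insert overwrite table ods_userinfo select * from userinfo;
    select count(*) from ods_userinfo;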
  7. In the right-side navigation pane, click Properties. On the Properties tab, you can configure properties for the EMR Hive node.
    For more information about how to configure basic properties for the EMR Hive node, see Configure basic properties.
  8. Save and commit the node.
    Notice You must set the Rerun and Parent Nodes parameters before you can commit the node.
    1. Click the Save icon in the toolbar to save the node.
    2. Click the Commit icon in the toolbar.
    3. In the Commit Node dialog box, enter your comments in the Change description field.
    4. Click OK.
    In a workspace in standard mode, you must click Deploy in the upper-right corner after you commit the node. For more information, see Deploy nodes.
  9. Test the node. For more information, see View auto triggered nodes.