
DataWorks: Create an EMR Hive node

Last Updated: Jan 22, 2024

This topic describes how to create an E-MapReduce (EMR) Hive node in DataWorks. EMR Hive nodes allow you to use SQL-like HiveQL statements to read data from, write data to, and manage data warehouses that store large datasets in a distributed storage system. You can use EMR Hive nodes to analyze large amounts of log data in an efficient manner.
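For example, the following minimal HiveQL sketch shows the kind of statements that an EMR Hive node runs. The access_log table and its columns are hypothetical:

  -- Create a hypothetical partitioned log table.
  CREATE TABLE IF NOT EXISTS access_log (
    ip STRING,
    url STRING,
    status INT
  ) PARTITIONED BY (dt STRING);

  -- Count requests by HTTP status for a single day partition.
  SELECT status, COUNT(*) AS cnt
  FROM access_log
  WHERE dt = '20240101'
  GROUP BY status;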

Prerequisites

  • A workflow is created in DataStudio.

    Development operations in different types of compute engines are performed based on workflows in DataStudio. Therefore, before you create a node, you must create a workflow. For more information, see Create a workflow.

  • An Alibaba Cloud EMR cluster is created and registered to DataWorks.

    Before you create an EMR node and use the EMR node to develop EMR tasks in DataWorks, you must register a desired EMR cluster to a DataWorks workspace. For more information, see Register an EMR cluster to DataWorks.

  • (Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Development or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add workspace members and assign roles to them.

  • An exclusive resource group for scheduling is purchased and configured. The configurations include association with a workspace and network configuration. For more information, see Create and use an exclusive resource group for scheduling.

Limits

  • Tasks on this type of node cannot be run on the shared resource group for scheduling.

  • If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. If you do not configure EMR-HOOK in the desired cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, EMR governance tasks cannot be run. For more information about how to configure EMR-HOOK, see Use the Hive extension feature to record data lineage and historical access information.

Procedure

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. Create an EMR Hive node.

    1. Move the pointer over the Create icon and choose Create Node > EMR > EMR Hive.

      Alternatively, you can find the desired workflow, right-click the name of the workflow, and then choose Create Node > EMR > EMR Hive.

    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters.

    3. Click Confirm. The configuration tab of the EMR Hive node appears.

  3. Use the EMR Hive node to develop data.

    1. Select a resource group.

      In the top toolbar, click the Run with Parameters icon. In the Parameters dialog box, select the exclusive resource group for scheduling that you created.

      Note
      • If you want to access a data source over the Internet or a VPC, you must use the resource group for scheduling that is connected to the data source. For more information, see Establish a network connection between a resource group and a data source.

      • If you want to change the resource group in subsequent operations, you can also click the Run with Parameters icon to change the resource group in the Parameters dialog box.

    2. Write code for the node.

      In the SQL editor, enter the code for the node. Example:

      show tables;
      select '${var}'; -- You can assign a scheduling parameter to the var variable.
      select * from userinfo;
      Note
      • The size of an SQL statement for the node cannot exceed 130 KB.

      • If you use an EMR Hive node to query data, a maximum of 10,000 data records can be returned and the total size of the returned data records cannot exceed 10 MB.

      • If multiple EMR compute engines are associated with your workspace, you must select one from the compute engines based on your business requirements. If only one EMR compute engine is associated with your workspace, you do not need to select a compute engine.

      If you want to change the scheduling parameter that is assigned to the variable in the code, click Run with Parameters in the top toolbar. For more information about the value assignment logic of scheduling parameters, see What are the differences in the value assignment logic of scheduling parameters among the Run, Run with Parameters, and Perform Smoke Testing in Development Environment modes?

      Note

      For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.
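      For example, a common pattern is to pass the data timestamp to the node through a scheduling parameter and use it to filter a partition. The following is a minimal sketch that assumes a hypothetical partitioned table named user_log and a scheduling parameter assignment such as bizdate=$[yyyymmdd-1] configured on the Properties tab:

      -- ${bizdate} is replaced with the resolved value of the bizdate
      -- scheduling parameter when the node is run.
      SELECT COUNT(*) AS pv
      FROM user_log
      WHERE dt = '${bizdate}';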

    3. Save and execute the SQL statements.

      In the top toolbar, click the Save icon to save the SQL statements. Then, click the Run icon to execute the SQL statements.

  4. Configure the parameters on the Advanced Settings tab.

    The following table describes the advanced parameters that you can configure for different types of EMR clusters. For more information about the configuration items that Hive supports, see the Apache Hive documentation.

    Cluster type: DataLake cluster or custom cluster (EMR on ECS)

    Advanced parameters:

    • "queue": the scheduling queue to which jobs are committed. Default value: default. For information about EMR YARN, see YARN schedulers.

    • "priority": the priority. Default value: 1.

    • "FLOW_SKIP_SQL_ANALYZE": the manner in which SQL statements are executed. Valid values:

      • true: Multiple SQL statements are executed at a time.

      • false: Only one SQL statement is executed at a time.

      Note

      This parameter is available only for testing in the development environment of a DataWorks workspace.

    • "DATAWORKS_SESSION_DISABLE": specifies whether to establish a JDBC connection every time an SQL statement is executed. This parameter is available for testing in the development environment of a DataWorks workspace. Valid values:

      • true: A JDBC connection is established every time an SQL statement is executed.

      • false: The same JDBC connection is used when different SQL statements are executed for the same node.

      Default value: false.

      Note

      If the DATAWORKS_SESSION_DISABLE parameter is set to false, the value of yarn applicationId for the EMR Hive node is not displayed. If you want the value of yarn applicationId to be displayed, you can set the DATAWORKS_SESSION_DISABLE parameter to true.

    Note

    You can also add a Hive connection parameter on the Advanced Settings tab for the EMR Hive node.
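    For reference, the advanced settings for a DataLake or custom cluster might look like the following, assuming that the Advanced Settings tab accepts one "key": value pair per line in the notation used above. The values are illustrative only:

      "queue": default
      "priority": 4
      "FLOW_SKIP_SQL_ANALYZE": false
      "DATAWORKS_SESSION_DISABLE": true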

    Cluster type: Hadoop cluster (EMR on ECS)

    Advanced parameters:

    • "queue": the scheduling queue to which jobs are committed. Default value: default. For information about EMR YARN, see YARN schedulers.

    • "priority": the priority. Default value: 1.

    • "FLOW_SKIP_SQL_ANALYZE": the manner in which SQL statements are executed. Valid values:

      • true: Multiple SQL statements are executed at a time.

      • false: Only one SQL statement is executed at a time.

    • "USE_GATEWAY": specifies whether to use a gateway cluster to commit jobs on the current node. Valid values:

      • true: Use a gateway cluster to commit jobs.

      • false: Do not use a gateway cluster to commit jobs. Jobs are automatically committed to the master node.

    Note

    If the EMR cluster to which the node belongs is not associated with a gateway cluster but the USE_GATEWAY parameter is set to true, jobs may fail to be committed.
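    Similarly, an illustrative sketch of the settings for a Hadoop cluster. Set "USE_GATEWAY" to true only if the EMR cluster is associated with a gateway cluster:

      "queue": default
      "priority": 1
      "FLOW_SKIP_SQL_ANALYZE": true
      "USE_GATEWAY": true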

  5. Configure scheduling properties.

    If you want the system to periodically run the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure properties for the node based on your business requirements.

  6. Commit and deploy the node.

    1. Click the Save icon in the top toolbar to save the node.

    2. Click the Submit icon in the top toolbar to commit the node.

    3. In the Commit Node dialog box, configure the Change description parameter.

    4. Click OK.

    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. On the left side of the top navigation bar, click Deploy. For more information, see Deploy nodes.

  7. View the node.

    1. Click Operation Center in the upper-right corner of the configuration tab of the node to go to Operation Center in the production environment.

    2. View the scheduled node. For more information, see View and manage auto triggered nodes.

    To view more information about the node, click Operation Center in the top navigation bar of the DataStudio page. For more information, see Overview.