This topic describes how to create an E-MapReduce (EMR) Hive node. EMR Hive nodes allow you to use SQL-like statements to read data from and write data to data warehouses with large datasets and manage the data warehouses. The data warehouses are stored in a distributed storage system. You can use EMR Hive nodes to analyze and develop large amounts of log data in an efficient manner.

Prerequisites

The preparations for creating a node are complete for EMR and DataWorks. The preparations vary based on the type of your EMR cluster. EMR provides the following types of clusters:

Limits

  • EMR Hive nodes can be run only on an exclusive resource group for scheduling.
  • DataWorks no longer allows you to associate an EMR Hadoop cluster with a DataWorks workspace. However, the EMR Hadoop clusters that are associated with your DataWorks workspace can still be used.

Procedure

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which the workspace in which you want to create a MySQL node resides. Find the workspace and click DataStudio in the Actions column.
  2. Create a workflow.
    If you have an existing workflow, skip this step.
    1. Move the pointer over the Create icon and select Create Workflow.
    2. In the Create Workflow dialog box, configure the Workflow Name parameter.
    3. Click Create.
  3. Create an EMR Hive node.
    1. Move the pointer over the Create icon icon and choose Create Node > EMR > EMR Hive.
      You can also find the desired workflow, right-click the workflow, and then choose Create Node > EMR > EMR Hive.
    2. In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters.
      Note The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
    3. Click Commit. Then, the configuration tab of the EMR Hive node appears.
  4. Use an EMR Hive node to develop data.
    1. Select a resource group.
      In the top toolbar, click the Run with Parameters icon. In the Parameters dialog box, select the created resource group for scheduling.
      Note
      • If you want to access a data source over the Internet or a VPC, you must use the resource group for scheduling that is connected to the data source. For more information, see Establish a network connection between a resource group and a data source.
      • If you want to change the resource group in subsequent operations, you can change the resource group in the Parameters dialog box.
    2. Write code for the EMR Hive node.
      In the SQL editor, enter the code for the EMR Hive node. Example:
      show tables;
      select '${var}'; -- You can assign a specific scheduling parameter to the var variable. 
      select * from userinfo ;
      Note
      • The size of the SQL statement for the node cannot exceed 130 KB.
      • If you use an EMR Hive node to query data, up to 10,000 data records can be returned and the total size of the returned data records cannot exceed 10 MB.
      • If multiple EMR compute engine instances are associated with the current workspace, you can select one EMR compute engine instance based on your business requirements. If only one EMR compute engine instance is associated with the current workspace, the EMR compute engine instance is automatically selected.
      If you want to change the scheduling parameter that is assigned to the variable in the code, click Run with Parameters in the top toolbar. For more information about the value assignment logic of scheduling parameters, see What are the differences in the value assignment logic of scheduling parameters among the Run, Run with Parameters, and Perform Smoke Testing in Development Environment modes?.
      Note For more information about how to use scheduling parameters, see Overview of scheduling parameters.
    3. Save and execute the SQL statements.
      In the top toolbar, click the Save icon to save the SQL statements. Then, click the Run icon to execute the SQL statements.
  5. Configure the parameters on the Advanced Settings tab.
    The following table describes the advanced parameters that are configured for different types of EMR clusters.
    Cluster type Advanced parameter
    EMR data lake cluster
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": the manner in which SQL statements are executed. Valid values:
      • true: Multiple SQL statements are executed at a time.
      • false: Only one SQL statement is executed at a time.
      Note This parameter is available only for testing in the data development environment of a DataWorks workspace.
    • "DATAWORKS_SESSION_DISABLE": specifies whether to establish a JDBC connection for each SQL statement that is executed to create the EMR Hive node. Valid values:
      • true: A JDBC connection is established every time an SQL statement is executed.
      • false: The same JDBC connection is used when different SQL statements are executed for the EMR Hive node.
      Default value: false.
      Note If the DATAWORKS_SESSION_DISABLE parameter is set to false, the value of yarn applicationId for the EMR Hive node is not printed. If you want the value of yarn applicationId to be printed, you can set the DATAWORKS_SESSION_DISABLE parameter to true.
    Note You can also add a Hive connection parameter on the Advanced Settings tab for the EMR Hive node.
    EMR Hadoop cluster
    • "queue": the scheduling queue to which jobs are committed. Default value: default.
    • "vcores": the number of vCPUs. Default value: 1.
    • "memory": the memory that is allocated to the launcher. Unit: MB. Default value: 2048.
    • "priority": the priority. Default value: 1.
    • "FLOW_SKIP_SQL_ANALYZE": the manner in which SQL statements are executed. Valid values:
      • true: Multiple SQL statements are executed at a time.
      • false: Only one SQL statement is executed at a time.
    • USE_GATEWAY: specifies whether a gateway cluster is used to commit jobs on the current node. Valid values:
      • true: A gateway cluster is used to commit jobs.
      • false: No gateway cluster is used to commit jobs. Jobs are automatically committed to the master node.
      Note If the EMR cluster to which the node belongs is not associated with a gateway cluster but the USE_GATEWAY parameter is set to true, jobs may fail to be committed.
  6. Configure scheduling properties for the EMR Presto node.
    If you want the system to periodically run the EMR Presto node, you can click Properties in the right-side navigation pane to configure properties for the node based on your business requirements.
  7. Commit and deploy the MySQL node.
    1. Click the Save icon in the top toolbar to save the node.
    2. Click the Submit icon in the top toolbar to commit the node.
    3. In the Commit Node dialog box, configure the Change description parameter.
    4. Click OK.
    If you use a workspace in standard mode, you must deploy the node in the production environment after you commit the node. On the left side of the top navigation bar, click Deploy. For more information, see Deploy nodes.
  8. View the MySQL node.
    1. Click Operation Center in the upper-right corner of the DataStudio page to go to Operation Center.
    2. View the scheduled MySQL node. For more information, see View and manage auto triggered nodes.