The DataWorks E-MapReduce (EMR) Hive node supports batch analysis of large-scale data. It operates on data stored in distributed systems to simplify big data processing and improve development efficiency. In an EMR Hive node, you can use SQL-like statements to read, write, and manage large datasets. This makes the node well suited for developing and analyzing tasks that involve massive amounts of log data.
Prerequisites
You have created an Alibaba Cloud EMR cluster and bound it to DataWorks. For more information, see Data Studio: Associate an EMR computing resource.
(Optional) If you are a Resource Access Management (RAM) user, ensure that you have been added to the workspace for task development and have been assigned the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions. Grant this role with caution. For more information about adding members, see Add members to a workspace.
If you use an Alibaba Cloud account, you can skip this step.
You have configured a Hive data source in DataWorks and verified its connectivity. For more information, see Data Source Management.
Limitations
This type of task can be scheduled only on serverless resource groups (Recommended) or exclusive resource groups.
To manage metadata for DataLake or custom clusters in DataWorks, you must first configure EMR-HOOK on the cluster. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Hive.
Note: If EMR-HOOK is not configured on the cluster, DataWorks cannot display real-time metadata, generate audit logs, show data lineage, or perform EMR-related administration tasks.
Step 1: Develop an EMR Hive node
You can develop the EMR Hive node on the node editing page.
Develop the SQL code
You can develop the task code in the SQL editing area. In your code, you can use the ${variable_name} format to define variables and then assign a value to each variable in the Scheduling Configuration pane > Scheduling Parameters section on the right side of the node editing page. This lets you dynamically pass parameters to the code in scheduling scenarios. For more information about the supported formats for scheduling parameters, see Supported formats for scheduling parameters. The following is an example.
SHOW TABLES;
SELECT '${var}'; -- Used with scheduling parameters.
SELECT * FROM userinfo;
Note: The maximum size of the SQL code for a node is 130 KB.
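To illustrate how the ${variable_name} placeholders and the 130 KB size limit interact, the following is a minimal Python sketch of the substitution step. The render_sql helper is an illustrative assumption, not the actual DataWorks implementation; DataWorks performs this resolution on its side at scheduling time.

```python
import re

MAX_SQL_BYTES = 130 * 1024  # DataWorks limit: SQL code must not exceed 130 KB


def render_sql(template: str, params: dict) -> str:
    """Replace ${name} placeholders with scheduling-parameter values.

    Hypothetical helper for illustration; mirrors the documented
    ${variable_name} syntax.
    """
    def repl(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"No value assigned to scheduling parameter ${{{name}}}")
        return str(params[name])

    sql = re.sub(r"\$\{(\w+)\}", repl, template)
    if len(sql.encode("utf-8")) > MAX_SQL_BYTES:
        raise ValueError("SQL code exceeds the 130 KB limit")
    return sql


print(render_sql("SELECT '${var}';", {"var": "20240101"}))  # SELECT '20240101';
```

In a real task, the value of ${var} would come from the Scheduling Parameters section, for example an assignment such as var=$bizdate.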
Step 2: Configure the EMR Hive node
(Optional) Configure advanced parameters
You can configure the node-specific advanced parameters listed in the following tables. These settings are located in the Scheduling Configuration pane on the right side of the node editing page.
The available advanced parameters vary depending on the EMR cluster type, as shown in the following tables.
You can also add custom open-source Hive property parameters in the Scheduling Configuration pane on the right.
DataLake clusters/Custom clusters: EMR on ECS
| Advanced parameter | Description |
| --- | --- |
| queue | The scheduling queue to which jobs are submitted. The default queue is default. For more information about EMR YARN, see Basic queue configuration. |
| priority | The priority of the job. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | The execution mode for SQL statements. Valid values: false (default): executes one SQL statement at a time. true: executes multiple SQL statements at a time. Note: This parameter is supported only for test runs in the development environment. |
| DATAWORKS_SESSION_DISABLE | Applies to direct test runs in the development environment. Valid values: true: creates a new JDBC connection each time the node is run. false (default): reuses the same JDBC connection when the node is run repeatedly. Note: If this parameter is set to false, the newly printed YARN application ID is not displayed when you rerun the node. |
| Other | You can also append custom Hive connection parameters directly in the advanced configuration section. |
Hadoop clusters: EMR on ECS
| Advanced parameter | Description |
| --- | --- |
| queue | The scheduling queue to which jobs are submitted. The default queue is default. For more information about EMR YARN, see Basic queue configuration. |
| priority | The priority of the job. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | The execution mode for SQL statements. Valid values: false (default): executes one SQL statement at a time. true: executes multiple SQL statements at a time. Note: This parameter is supported only for test runs in the development environment. |
| USE_GATEWAY | Specifies whether to submit jobs from this node through a gateway cluster. Valid values: true: submits jobs through the associated gateway cluster. false (default): submits jobs through the master node of the cluster. Note: If the cluster where this node resides is not associated with a gateway cluster, setting this parameter to true may cause job submission to fail. |
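The FLOW_SKIP_SQL_ANALYZE parameter above controls whether the SQL code is submitted statement by statement or as a whole. The following Python sketch shows the naive semicolon-based splitting this implies; the split_statements helper is a hypothetical illustration (it ignores semicolons inside string literals and comments), not the actual DataWorks parser.

```python
def split_statements(sql_script: str) -> list:
    """Naively split a SQL script on semicolons.

    Illustrative only: a real SQL analyzer must skip semicolons that
    appear inside string literals and comments.
    """
    return [s.strip() for s in sql_script.split(";") if s.strip()]


script = "SHOW TABLES; SELECT * FROM userinfo"
statements = split_statements(script)
# FLOW_SKIP_SQL_ANALYZE=false: each statement is submitted as its own job.
# FLOW_SKIP_SQL_ANALYZE=true: the whole script is submitted in one job.
print(statements)  # ['SHOW TABLES', 'SELECT * FROM userinfo']
```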
To run the node task on a schedule, you can configure its scheduling properties. For more information, see Node scheduling configuration.
Step 3: Test and run the node
Execute the SQL task
In Run Configuration, under Computing Resource, configure Computing Resource and Resource Group.
Note: You can also set the Scheduling CUs based on the resources that the task requires. The default value is 0.25. To access data sources over the public internet or in a Virtual Private Cloud (VPC), you must use a scheduling resource group that has passed the connectivity test with the data source. For more information, see Network connectivity solutions.
On the toolbar, click Run. In the parameter dialog box that appears, select your Hive data source to execute the SQL task.
Note: When you query data using an EMR Hive node, a query can return a maximum of 10,000 records, and the total data size cannot exceed 10 MB.
Click the Save button to save the node.
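The 10,000-record and 10 MB caps amount to a simple truncation rule on query results. The following Python snippet is an illustrative sketch of how such a preview limit might be enforced, assuming the truncate_preview helper; it is not the actual DataWorks logic.

```python
MAX_ROWS = 10_000             # maximum records returned by a query on an EMR Hive node
MAX_BYTES = 10 * 1024 * 1024  # maximum total size of returned data (10 MB)


def truncate_preview(rows):
    """Keep rows until either the row-count or the total-size limit is hit."""
    kept, total = [], 0
    for row in rows:
        size = len(str(row).encode("utf-8"))
        if len(kept) >= MAX_ROWS or total + size > MAX_BYTES:
            break
        kept.append(row)
        total += size
    return kept


sample = [("user", i) for i in range(20000)]
print(len(truncate_preview(sample)))  # 10000
```

If a result set exceeds either limit, only the truncated preview is shown; run the full query as a scheduled task and write the output to a table instead.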
More operations
After you configure the node task, you can publish the node. For more information, see Publish nodes or workflows.
After the task is published, you can view the status of the auto triggered task in the Operation Center. For more information, see Get started with Operation Center.
FAQ
Q: Why does a connection timeout (ConnectException) occur when I run a node?

A: Ensure network connectivity between the resource group and the cluster. Go to the computing resource list page to initialize the resource. In the dialog box that appears, click Re-initialize and verify that the initialization is successful.

