The DataWorks E-MapReduce (EMR) Hive node runs Hive SQL batch jobs on large-scale data stored in distributed systems. Use it to read, write, and manage large datasets — including massive log data — with SQL-like statements in scheduled or on-demand jobs.
Prerequisites
Before you begin, ensure that you have:
- An Alibaba Cloud EMR cluster bound to DataWorks. See Data Studio: Associate an EMR computing resource.
- A Hive data source configured in DataWorks with connectivity verified. See Data Source Management.
- (Optional, RAM users) The Developer or Workspace Administrator role in the target workspace. See Add members to a workspace.
Alibaba Cloud account users can skip the RAM user step. The Workspace Administrator role has extensive permissions — grant it with caution.
Limitations
- EMR Hive nodes can be scheduled only on serverless resource groups (recommended) or exclusive resource groups.
- To manage metadata for DataLake or custom clusters in DataWorks, configure EMR-HOOK on the cluster first. See Configure EMR-HOOK for Hive. Without EMR-HOOK configured, DataWorks cannot display real-time metadata, generate audit logs, show data lineage, or perform EMR-related administration tasks.
Step 1: Write the Hive SQL
Open the node editing page and write your Hive SQL in the SQL editing area.
To pass values dynamically at scheduling time, define variables using the ${variable_name} format. Assign values to these variables in the Scheduling Configuration pane under Scheduling Parameters on the right side of the page. For supported variable formats, see Supported formats for scheduling parameters.
Example:
| Statement | Description |
|---|---|
| `SHOW TABLES;` | Lists all tables in the current database. |
| `SELECT '${var}';` | Reads the value of the scheduling parameter `var` at runtime. |
| `SELECT * FROM userinfo;` | Queries all rows from the `userinfo` table. |

```sql
SHOW TABLES;
SELECT '${var}'; -- Use with scheduling parameters.
SELECT * FROM userinfo;
```
The maximum size of a single SQL statement is 130 KB.
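A common pattern is to use a scheduling parameter to scope a daily batch job to one business date. The sketch below is illustrative only: `ods_log` and `dwd_log_di` are hypothetical table names, and it assumes a scheduling parameter named `bizdate` has been assigned a date value under Scheduling Parameters.

```sql
-- Hypothetical daily job: ods_log (source) and dwd_log_di (target)
-- are illustrative names; ${bizdate} is replaced at scheduling time.
CREATE TABLE IF NOT EXISTS dwd_log_di (
  uid    STRING,
  action STRING
)
PARTITIONED BY (dt STRING);

-- Rebuild only the partition for the current business date.
INSERT OVERWRITE TABLE dwd_log_di PARTITION (dt = '${bizdate}')
SELECT uid, action
FROM ods_log
WHERE dt = '${bizdate}';
```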
Step 2: Configure advanced parameters (optional)
Advanced parameters control job submission behavior and session management. Set them in the Scheduling Configuration pane under EMR Node Parameters > DataWorks Parameters. The available parameters depend on your cluster type.
To configure open-source Spark property parameters, use the EMR Node Parameters > Spark Parameters section.
DataLake clusters and custom clusters (EMR on ECS)
| Parameter | Description | Default | Applies to |
|---|---|---|---|
| `queue` | Scheduling queue for job submission. See Basic queue configuration. | `default` | — |
| `priority` | Job priority. | `1` | — |
| `FLOW_SKIP_SQL_ANALYZE` | SQL execution mode. `true`: run multiple SQL statements at once. `false`: run one statement at a time. | `false` | Test runs in the development environment only |
| `DATAWORKS_SESSION_DISABLE` | Java Database Connectivity (JDBC) session behavior. `true`: create a new JDBC connection for each SQL statement (also prints the YARN applicationId). `false`: reuse the same JDBC connection across statements in the same node. | `false` | Direct test runs in the development environment only |
| Custom parameters | Append custom Hive connection parameters directly in the advanced configuration section. | — | — |
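To see the effect of `FLOW_SKIP_SQL_ANALYZE`, consider a node whose script contains several dependent statements, as in the hypothetical sketch below: with `true`, the statements are submitted as one batch; with the default `false`, they are submitted and executed one at a time.

```sql
-- Hypothetical multi-statement node; tmp_user_stats is an
-- illustrative table name.
DROP TABLE IF EXISTS tmp_user_stats;

CREATE TABLE tmp_user_stats AS
SELECT uid, COUNT(*) AS cnt
FROM userinfo
GROUP BY uid;

SELECT * FROM tmp_user_stats LIMIT 100;
```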
Hadoop clusters (EMR on ECS)
| Parameter | Description | Default | Applies to |
|---|---|---|---|
| `queue` | Scheduling queue for job submission. See Basic queue configuration. | `default` | — |
| `priority` | Job priority. | `1` | — |
| `FLOW_SKIP_SQL_ANALYZE` | SQL execution mode. `true`: run multiple SQL statements at once. `false`: run one statement at a time. | `false` | Test runs in the development environment only |
| `USE_GATEWAY` | `true`: submit jobs through the associated gateway cluster. `false`: submit jobs directly to the header node. If the cluster has no associated gateway cluster, setting this to `true` causes subsequent EMR job submissions to fail. | `false` | — |
To schedule the node, configure its scheduling properties. See Node scheduling configuration.
Step 3: Run the node
1. In Run Configuration, set Computing Resource and Resource Group. Set Scheduling CUs based on the resources your task requires. The default is 0.25. To access data sources over the public internet or in a Virtual Private Cloud (VPC), use a scheduling resource group that has passed the connectivity test with the data source. See Network connectivity solutions.
2. On the toolbar, select your Hive data source in the parameter dialog box, then click Run. A single query returns at most 10,000 records and 10 MB of data.
3. Click Save.
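Because on-page results are truncated, persist full result sets to a table rather than returning them with a bare SELECT. A minimal sketch, assuming the `userinfo` table from the earlier example and a hypothetical target table name:

```sql
-- Write the complete result set to a table instead of displaying it;
-- userinfo_backup is a hypothetical name. The record cap applies to
-- results returned to the editor, not to data written to tables.
CREATE TABLE userinfo_backup AS
SELECT * FROM userinfo;
```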
What's next
- Publish the node to production. See Publish nodes or workflows.
- Monitor auto-triggered tasks in Operation Center. See Get started with Operation Center.
Troubleshooting
Connection timeout (ConnectException) when running a node
Symptom: The node fails with a ConnectException error.
Cause: The resource group cannot reach the EMR cluster.
Resolution:
1. Go to the computing resource list page.
2. Locate the resource associated with the node.
3. Click Re-initialize in the dialog box.
4. Verify that initialization completes successfully before retrying the node.
