
DataWorks:EMR Hive node

Last Updated: Mar 26, 2026

The DataWorks E-MapReduce (EMR) Hive node runs Hive SQL batch jobs on large-scale data stored in distributed systems. Use it to read, write, and manage large datasets — including massive log data — with SQL-like statements in scheduled or on-demand jobs.

Prerequisites

Before you begin, make sure that all prerequisites for this node type are met.

Note: Alibaba Cloud account users can skip the RAM user step. The Workspace Administrator role has extensive permissions; grant it with caution.

Limitations

Without EMR-HOOK configured, DataWorks cannot display real-time metadata, generate audit logs, show data lineage, or perform EMR-related administration tasks.

Step 1: Write the Hive SQL

Open the node editing page and write your Hive SQL in the SQL editing area.

To pass values dynamically at scheduling time, define variables using the ${variable_name} format. Assign values to these variables in the Scheduling Configuration pane under Scheduling Parameters on the right side of the page. For supported variable formats, see Supported formats for scheduling parameters.

Example:

| Statement | Description |
| --- | --- |
| SHOW TABLES; | Lists all tables in the current database. |
| SELECT '${var}'; | Reads the value of a scheduling parameter at runtime. |
| SELECT * FROM userinfo; | Queries all rows from the userinfo table. |

SHOW TABLES;
SELECT '${var}';  -- Use with scheduling parameters.
SELECT * FROM userinfo;

The maximum size of a single SQL statement is 130 KB.
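The substitution of `${variable_name}` placeholders and the single-statement size limit can be sketched in a few lines. This is a hypothetical illustration only; `render_sql` and its behavior are assumptions for clarity, not DataWorks' actual implementation.

```python
# Conceptual sketch of how a scheduler might substitute ${var}
# placeholders before submitting Hive SQL. Hypothetical helper,
# not DataWorks' actual implementation.
import re

MAX_SQL_BYTES = 130 * 1024  # single-statement size limit noted above

def render_sql(template: str, params: dict) -> str:
    """Replace each ${name} placeholder with its scheduled value."""
    def repl(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value scheduled for ${{{name}}}")
        return str(params[name])
    sql = re.sub(r"\$\{(\w+)\}", repl, template)
    if len(sql.encode("utf-8")) > MAX_SQL_BYTES:
        raise ValueError("statement exceeds the 130 KB limit")
    return sql

print(render_sql("SELECT '${var}';", {"var": "2026-03-26"}))
# prints: SELECT '2026-03-26';
```

DataWorks supports richer parameter formats than this plain `${var}` form; see Supported formats for scheduling parameters.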

Step 2: Configure advanced parameters (optional)

Advanced parameters control job submission behavior and session management. Set them in the Scheduling Configuration pane under EMR Node Parameters > DataWorks Parameters. The available parameters depend on your cluster type.

To configure open-source Spark property parameters, use the EMR Node Parameters > Spark Parameters section.

DataLake clusters and custom clusters (EMR on ECS)

| Parameter | Description | Default | Applies to |
| --- | --- | --- | --- |
| queue | Scheduling queue for job submission. See Basic queue configuration. | default | |
| priority | Job priority. | 1 | |
| FLOW_SKIP_SQL_ANALYZE | SQL execution mode. true: run multiple SQL statements at once. false: run one statement at a time. | false | Test runs in the development environment only |
| DATAWORKS_SESSION_DISABLE | Java Database Connectivity (JDBC) session behavior. true: create a new JDBC connection for each SQL statement (also prints the YARN applicationId). false: reuse the same JDBC connection across statements in the same node. | false | Direct test runs in the development environment only |
| Custom parameters | Append custom Hive connection parameters directly in the advanced configuration section. | | |
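The difference between the two FLOW_SKIP_SQL_ANALYZE modes can be pictured with a simplified model. This sketch splits on semicolons for illustration only; it is not how DataWorks actually parses SQL (a real parser must also ignore semicolons inside string literals and comments).

```python
# Hypothetical model of the two execution modes toggled by
# FLOW_SKIP_SQL_ANALYZE. Illustrative only; not DataWorks code.

def plan_batches(script: str, flow_skip_sql_analyze: bool) -> list:
    """Return the units of work that would be submitted to Hive."""
    if flow_skip_sql_analyze:
        # true: the whole script is submitted as one multi-statement job.
        return [script.strip()]
    # false (default): each statement is submitted one at a time.
    statements = [s.strip() for s in script.split(";") if s.strip()]
    return [s + ";" for s in statements]

script = "SHOW TABLES;\nSELECT * FROM userinfo;"
print(plan_batches(script, flow_skip_sql_analyze=False))
# prints: ['SHOW TABLES;', 'SELECT * FROM userinfo;']
```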

Hadoop clusters (EMR on ECS)

| Parameter | Description | Default | Applies to |
| --- | --- | --- | --- |
| queue | Scheduling queue for job submission. See Basic queue configuration. | default | |
| priority | Job priority. | 1 | |
| FLOW_SKIP_SQL_ANALYZE | SQL execution mode. true: run multiple SQL statements at once. false: run one statement at a time. | false | Test runs in the development environment only |
| USE_GATEWAY | true: submit jobs through the associated gateway cluster. false: submit jobs directly to the header node. If the cluster has no associated gateway cluster, setting this to true causes subsequent EMR job submissions to fail. | false | |

To schedule the node, configure its scheduling properties. See Node scheduling configuration.

Step 3: Run the node

  1. In Run Configuration, set Computing Resource and Resource Group.

    Set Scheduling CUs based on the resources your task requires. The default is 0.25. To access data sources over the public internet or in a Virtual Private Cloud (VPC), use a scheduling resource group that has passed the connectivity test with the data source. See Network connectivity solutions.
  2. On the toolbar, select your Hive data source in the parameter dialog box, then click Run.

    A single query returns at most 10,000 records and 10 MB of data.
  3. Click Save.
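The result cap mentioned in step 2 (at most 10,000 records and 10 MB per query) can be pictured with a small truncation sketch. `MAX_ROWS`, `MAX_BYTES`, and `truncate_result` are illustrative names for this hedged example, not a DataWorks API.

```python
# Hypothetical illustration of the per-query display cap:
# at most 10,000 rows and 10 MB of data are returned.
MAX_ROWS = 10_000
MAX_BYTES = 10 * 1024 * 1024

def truncate_result(rows):
    """Keep rows until either the row count or the byte budget is hit."""
    shown, used = [], 0
    for row in rows:
        size = len(str(row).encode("utf-8"))
        if len(shown) >= MAX_ROWS or used + size > MAX_BYTES:
            break
        shown.append(row)
        used += size
    return shown

print(len(truncate_result(range(25_000))))
# prints: 10000
```

To see more data than the cap allows, narrow the query itself (for example, with a WHERE filter or LIMIT clause) rather than relying on the displayed result.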

Troubleshooting

Connection timeout (ConnectException) when running a node

Symptom: The node fails with a ConnectException error.


Cause: The resource group cannot reach the EMR cluster.

Resolution:

  1. Go to the computing resource list page.

  2. Locate the resource associated with the node.

  3. In the dialog box that appears, click Re-initialize.

  4. Verify that initialization completes successfully before retrying the node.
