
DataWorks:Create an EMR Hive node

Last Updated:Mar 26, 2026

An EMR Hive node lets you run SQL-like statements to read, write, and manage large datasets stored on a distributed storage system — making it well-suited for log data analysis and data warehouse development on E-MapReduce (EMR) clusters.
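The kind of warehouse development this enables can be sketched with a short Hive DDL example. The table name, columns, and storage path below are illustrative assumptions, not part of the product documentation:

```sql
-- Illustrative only: an external table over raw, tab-separated log files,
-- partitioned by day so each scheduled run scans a single partition.
CREATE EXTERNAL TABLE IF NOT EXISTS access_log (
  ip      STRING,
  request STRING,
  status  INT,
  bytes   BIGINT
)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/access_log';
```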

Prerequisites

Before you begin, make sure that an EMR cluster is registered to your DataWorks workspace as a computing resource.

Limitations

  • EMR Hive nodes can run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group.

  • To manage metadata for a DataLake or custom cluster in DataWorks, configure EMR-HOOK in your cluster first. Without EMR-HOOK, metadata is not displayed in real time, audit logs and data lineages are unavailable, and EMR governance tasks cannot run. See Configure EMR-HOOK for Hive.

Step 1: Create an EMR Hive node

  1. Go to the DataStudio page. Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.

  2. Create the node. Right-click the target workflow and choose Create Node > EMR > EMR Hive.

    Alternatively, hover over Create and select Create Node > EMR > EMR Hive.
  3. In the Create Node dialog box, configure the following parameters and click Confirm. The configuration tab for the EMR Hive node opens.

    • Name: The node name. The name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
    • Engine Instance: The EMR cluster to use for this node.
    • Node Type: The node type. Select EMR Hive.
    • Path: The workflow path where the node is saved.

Step 2: Develop an EMR Hive task

Double-click the node you created to open the task development page.

Write SQL code

Write Hive SQL in the SQL editor. The following example shows three common patterns in a single node:

show tables;             -- Lists all tables in the current database.
select '${var}';         -- Selects a scheduling parameter value; replace var with your parameter name.
select * from userinfo;  -- Queries all rows from the userinfo table.

  • show tables: Lists all tables in the current database.
  • select '${var}': Reads the value of a scheduling parameter at runtime. Define parameters under Scheduling Configuration > Scheduling Parameters in the right-side panel.
  • select * from userinfo: Returns all rows from the userinfo table.

All three statements run in sequence when the node executes. The ${variable_name} syntax lets you pass dynamic values at runtime. For supported parameter formats, see Supported formats for scheduling parameters.
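As a concrete sketch, suppose a daily task defines a scheduling parameter named bizdate with the value $[yyyymmdd-1] (the previous day's date) under Scheduling Configuration > Scheduling Parameters. The parameter name and the partitioned table are assumptions for illustration:

```sql
-- Assumes a scheduling parameter bizdate=$[yyyymmdd-1] is defined for the node.
-- At runtime, ${bizdate} is replaced with the previous day's date string,
-- so each daily run reads only that day's partition.
SELECT ip, COUNT(*) AS pv
FROM userinfo
WHERE ds = '${bizdate}'
GROUP BY ip;
```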

  • The total size of all SQL statements in a node cannot exceed 130 KB.
  • If multiple EMR computing resources are attached to your workspace, select one before you run the node. If only one is attached, no selection is needed.
  • To change parameter assignments before running, click Run With Parameters in the toolbar. For details on parameter assignment logic, see Differences in parameter assignment logic between Run, Run with Parameters, and smoke testing.

(Optional) Configure advanced parameters

Set node-specific properties in the Advanced Settings section. Available parameters differ by cluster type.

DataLake or custom cluster (EMR on ECS)

  • queue: The YARN scheduling queue for job submission. Default: default. Required: no. For queue configuration, see Basic queue configuration.
  • priority: The job priority. Default: 1. Required: no.
  • FLOW_SKIP_SQL_ANALYZE: Controls how SQL statements execute. true: runs multiple SQL statements in a single run. false: runs one statement at a time. Available only when testing workflows in the development environment. Default: false. Required: no.
  • DATAWORKS_SESSION_DISABLE: Controls JDBC connection behavior when running tests directly in the development environment. true: establishes a new JDBC connection for each SQL statement. false: reuses the same JDBC connection within a node. When set to false, the YARN applicationId for Hive is not printed; set this parameter to true to print it. Default: false. Required: no.
  • Others: Add custom Hive connection parameters as needed. Required: no.

Hadoop cluster (EMR on ECS)

  • queue: The YARN scheduling queue for job submission. Default: default. Required: no. For queue configuration, see YARN schedulers.
  • priority: The job priority. Default: 1. Required: no.
  • FLOW_SKIP_SQL_ANALYZE: Controls how SQL statements execute. true: runs multiple statements per execution. false: runs one statement at a time. Available only when testing workflows in the development environment. Default: false. Required: no.
  • USE_GATEWAY: Specifies whether to submit jobs through a gateway cluster. true: submits through the gateway cluster. false: submits directly to the master node. If the cluster has no associated gateway cluster, setting this parameter to true causes all EMR job submissions to fail. Default: false. Required: no.
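Many of these per-job settings have Hadoop-level equivalents that can also be set inline in the node's SQL with Hive SET statements. The following is a sketch, not part of the product documentation: the queue name analytics is illustrative, and the properties shown are standard Hadoop/Hive settings rather than the DataWorks advanced parameters above:

```sql
-- Route this job to a specific YARN queue and raise its MapReduce priority.
SET mapreduce.job.queuename=analytics;  -- queue name is illustrative
SET mapreduce.job.priority=HIGH;        -- VERY_HIGH, HIGH, NORMAL, LOW, or VERY_LOW
SELECT COUNT(*) FROM userinfo;
```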

Run the SQL task

  • To access a computing resource on the public internet or in a VPC, use a scheduling resource group that has passed a connectivity test with that computing resource. See Network connectivity solutions.
  • To switch to a different resource group, click Advanced Run and select another scheduling resource group.
  • Query results are capped at 10,000 records and 10 MB in total size.
  1. In the toolbar, click the Advanced Run icon. In the Parameters dialog box, select the scheduling resource group you created and click Run.

  2. Click the Save icon to save the SQL statements.

  3. (Optional) Run smoke testing. After submitting the node, run smoke testing in the development environment to validate the task before deployment. See Perform smoke testing.

Step 3: Configure scheduling properties

To run the task on a periodic schedule, click Properties in the right-side panel and configure scheduling settings based on your requirements. See Scheduling properties overview.

Important

Configure the Rerun and Parent Nodes parameters before committing the task.

Step 4: Deploy the task

Commit and deploy the task to activate periodic scheduling in production.

  1. Click the Save icon in the toolbar to save the task.

  2. Click the Commit icon in the toolbar to commit the task. In the Submit dialog box, enter a Change description. Decide whether to enable code review based on your team's requirements.

    With code review enabled, committed task code can only be deployed after it passes review. See Code review.
  3. (Standard mode workspaces only) Deploy the task to the production environment. After committing, click Deploy in the upper-right corner of the node configuration tab. See Deploy nodes.

What's next

After the task is committed and deployed, it runs on the schedule you configured. To monitor execution, click Operation Center in the upper-right corner of the node configuration tab. See View and manage auto triggered tasks.

FAQ

Why do I get a ConnectException error when I run a node?


The resource group and cluster lack network connectivity. Go to the computing resources list page, click Initialize Resource, and then click Re-initialize in the dialog box.
