An EMR Hive node lets you run SQL-like statements to read, write, and manage large datasets stored on a distributed storage system — making it well-suited for log data analysis and data warehouse development on E-MapReduce (EMR) clusters.
Prerequisites
Before you begin, ensure that you have:
- An Alibaba Cloud EMR cluster that is created and registered to DataWorks. See Associate an EMR computing resource.
- (Required for RAM users) The RAM user is added to the DataWorks workspace and assigned the Develop role. The Workspace Administrator role also works but grants more permissions than needed; assign it with caution. See Add workspace members and assign roles to them.
- A serverless resource group that is purchased and configured, including workspace association and network settings. See Create and use a serverless resource group.
- A workflow that is created in DataStudio. All node development in DataStudio is organized within workflows. See Create a workflow.
Limitations
EMR Hive nodes can run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group.
To manage metadata for a DataLake or custom cluster in DataWorks, configure EMR-HOOK in your cluster first. Without EMR-HOOK, metadata is not displayed in real time, audit logs and data lineages are unavailable, and EMR governance tasks cannot run. See Configure EMR-HOOK for Hive.
Step 1: Create an EMR Hive node
1. Go to the DataStudio page. Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.
2. Create the node. Right-click the target workflow and choose Create Node > EMR > EMR Hive. Alternatively, hover over Create and select Create Node > EMR > EMR Hive.
3. In the Create Node dialog box, configure the following parameters and click Confirm. The configuration tab for the EMR Hive node opens.
| Parameter | Description |
|---|---|
| Name | The node name. Can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.). |
| Engine Instance | The EMR cluster to use for this node. |
| Node Type | The node type. Select EMR Hive. |
| Path | The workflow path where the node is saved. |
Step 2: Develop an EMR Hive task
Double-click the node you created to open the task development page.
Write SQL code
Write Hive SQL in the SQL editor. The following example shows three common patterns in a single node:
```sql
show tables;             -- Lists all tables in the current database.
select '${var}';         -- Selects a scheduling parameter value; replace var with your parameter name.
select * from userinfo;  -- Queries all rows from the userinfo table.
```

| Statement | What it does |
|---|---|
| show tables | Lists all tables in the current database. |
| select '${var}' | Reads the value of a scheduling parameter at runtime. Define parameters under Scheduling Configuration > Scheduling Parameters in the right-side panel. |
| select * from userinfo | Returns all rows from the userinfo table. |
All three statements run in sequence when the node executes. The ${variable_name} syntax lets you pass dynamic values at runtime. For supported parameter formats, see Supported formats for scheduling parameters.
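As a sketch of a common use of this syntax, a scheduling parameter can drive a date-filtered query. The parameter name bizdate, its expression, and the dt column below are illustrative, assuming bizdate is defined under Scheduling Configuration > Scheduling Parameters for the node:

```sql
-- Illustrative only: assumes a scheduling parameter named bizdate
-- (for example, bizdate=$[yyyymmdd-1]) is defined for this node,
-- and that userinfo has a dt date column.
select count(*) from userinfo where dt = '${bizdate}';
```

At run time, DataWorks replaces ${bizdate} with the resolved value before the statement is submitted to Hive.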
The total size of all SQL statements in a node cannot exceed 130 KB.
If multiple EMR computing resources are attached to your workspace, select one before running. If only one is attached, no selection is needed.
To change parameter assignments before running, click Run With Parameters in the toolbar. For details on parameter assignment logic, see Differences in parameter assignment logic between Run, Run with Parameters, and smoke testing.
(Optional) Configure advanced parameters
Set node-specific properties in the Advanced Settings section. Available parameters differ by cluster type.
DataLake or custom cluster (EMR on ECS)
| Parameter | Description | Default | Required |
|---|---|---|---|
| queue | The YARN scheduling queue for job submission. For queue configuration, see Basic queue configuration. | default | No |
| priority | The job priority. | 1 | No |
| FLOW_SKIP_SQL_ANALYZE | Controls how SQL statements execute. true: runs multiple SQL statements in a single run. false: runs one statement at a time. Available only when testing workflows in the development environment. | false | No |
| DATAWORKS_SESSION_DISABLE | Controls JDBC connection behavior when running tests directly in the development environment. true: establishes a new JDBC connection for each SQL statement. false: reuses the same JDBC connection within a node. When set to false, the yarn applicationId for Hive is not printed; set to true to print it. | false | No |
| Others | Add custom Hive connection parameters as needed. | — | No |
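As an illustrative sketch only (the parameter names come from the table above; the values are examples, not recommendations, and the exact entry syntax depends on the Advanced Settings editor in your DataWorks version), the settings are key-value pairs along these lines:

```
queue: default
priority: 4
FLOW_SKIP_SQL_ANALYZE: true
```

Here FLOW_SKIP_SQL_ANALYZE: true would submit all statements in the node as a single run when testing in the development environment.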
Hadoop cluster (EMR on ECS)
| Parameter | Description | Default | Required |
|---|---|---|---|
| queue | The YARN scheduling queue for job submission. For queue configuration, see YARN schedulers. | default | No |
| priority | The job priority. | 1 | No |
| FLOW_SKIP_SQL_ANALYZE | Controls how SQL statements execute. true: runs multiple statements per execution. false: runs one statement at a time. Available only when testing workflows in the development environment. | false | No |
| USE_GATEWAY | Specifies whether to submit jobs through a gateway cluster. true: submits through the gateway cluster. false: submits directly to the master node. If the cluster has no associated gateway cluster, setting this parameter to true causes all EMR job submissions to fail. | false | No |
Run the SQL task
To access a computing resource on the public internet or in a VPC, use a scheduling resource group that has passed a connectivity test with that computing resource. See Network connectivity solutions.
To switch to a different resource group, click Run With Parameters and select another scheduling resource group.
Query results are capped at 10,000 records and 10 MB total.
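When a result set may exceed this cap, a common pattern is to materialize the full result into a table and preview only a sample in the editor. The snapshot table name below is illustrative:

```sql
-- Illustrative: persist the full result to a table instead of returning
-- it to the editor, then preview a small sample within the display cap.
create table if not exists userinfo_snapshot as
select * from userinfo;

select * from userinfo_snapshot limit 100;
```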
1. In the toolbar, click the Run icon. In the Parameters dialog box, select the scheduling resource group you created and click Run.
2. Click the Save icon to save the SQL statements.
3. (Optional) Perform smoke testing. After you commit the node, run smoke testing in the development environment to validate the task before deployment. See Perform smoke testing.
Step 3: Configure scheduling properties
To run the task on a periodic schedule, click Properties in the right-side panel and configure scheduling settings based on your requirements. See Scheduling properties overview.
Configure the Rerun and Parent Nodes parameters before committing the task.
Step 4: Deploy the task
Commit and deploy the task to activate periodic scheduling in production.
1. Click the Save icon in the toolbar to save the task.
2. Click the Commit icon in the toolbar to commit the task. In the Submit dialog box, enter a Change description. Decide whether to enable code review based on your team's requirements. With code review enabled, committed task code can be deployed only after it passes review. See Code review.
3. (Standard mode workspaces only) Deploy the task to the production environment. After committing, click Deploy in the upper-right corner of the node configuration tab. See Deploy nodes.
What's next
After the task is committed and deployed, it runs on the schedule you configured. To monitor execution, click Operation Center in the upper-right corner of the node configuration tab. See View and manage auto triggered tasks.
FAQ
Why do I get a ConnectException error when I run a node?

The resource group and cluster lack network connectivity. Go to the computing resources list page, click Initialize Resource, and then click Re-initialize in the dialog box.

