DataWorks: Create an EMR Trino node

Last Updated: Mar 30, 2026

Use an E-MapReduce (EMR) Trino node in DataWorks to run interactive SQL queries across multiple data sources — such as Hive, MySQL, and others — without moving data between systems. For background on Trino, see Trino.

Prerequisites

Before you begin, make sure that you have:

  • An Alibaba Cloud EMR cluster registered with your DataWorks workspace. Tasks cannot be created until the cluster is bound. See Bind an EMR compute engine in the legacy DataStudio.

  • A serverless resource group purchased, bound to your workspace, and configured with network access. See Use serverless resource groups.

  • A workflow created in DataStudio. All nodes must belong to a workflow. See Create a workflow.

  • (Optional) If developing as a RAM user, the RAM user added to the workspace with the Develop or Workspace Administrator role. The Workspace Administrator role carries extensive permissions — assign it with caution. See Add members to a workspace.

Limitations

  • EMR Trino tasks run only on a serverless resource group.

  • To manage metadata for a DataLake or custom cluster — including real-time metadata display, audit logs, data lineage, and EMR data governance tasks — configure the EMR-HOOK on the cluster first. See Use Hive extensions to record data lineage and access history.

  • If Lightweight Directory Access Protocol (LDAP) authentication is enabled for Trino, download the keystore file from the /etc/taihao-apps/trino-conf directory on the cluster's master node. Then upload it in the DataWorks console: More > Management Center > Cluster Management > Account Mappings > Edit Account Mappings > Upload Keystore File.

  • A single query run returns a maximum of 10,000 records and 10 MB of data.
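
Because a single run returns at most 10,000 records, it can help to cap or aggregate the result set in the query itself. A minimal sketch, reusing the hive.default.hive_table example table and its id column from the Write SQL section of this topic; the cap of 1000 rows is an arbitrary value chosen for illustration:

```sql
-- Cap the rows returned so the result stays well under the 10,000-record limit.
SELECT *
FROM hive.default.hive_table
LIMIT 1000;

-- Or aggregate on the cluster and return only a one-row summary.
SELECT count(*) AS row_count, count(DISTINCT id) AS distinct_ids
FROM hive.default.hive_table;
```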

Step 1: Create an EMR Trino node

  1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Development and O&M > Data Development, select your workspace from the drop-down list, and click Go to Data Development.

  2. In DataStudio, right-click the target workflow and choose Create Node > EMR > EMR Trino.

  3. In the Create Node dialog box, enter a Name, select an Engine Instance, a Node Type, and a Path, then click Confirm.

    Node names can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).

Step 2: Develop an EMR Trino task

Double-click the node to open the task development page.

Select an EMR cluster instance (optional)

If multiple EMR clusters are registered with your workspace, select the target cluster from the drop-down at the top of the node configuration page. If only one cluster is registered, it is selected automatically.

Configure connectors

Trino accesses data sources through connectors. Configure the connector for each data source that you plan to query before you write queries.

Write SQL

All Trino queries use a three-part path: <catalog>.<schema>.<table>. The catalog maps to the data source, the schema to the database, and the table to a specific table within that schema.
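
To check which catalogs, schemas, and tables are available before writing a query, Trino's metadata statements can be run in the same editor. A sketch, assuming a catalog named hive with a schema named default, as in the examples in this topic; the names on your cluster may differ:

```sql
-- List the catalogs (data sources) configured on the cluster.
SHOW CATALOGS;

-- List the schemas (databases) in the hive catalog.
SHOW SCHEMAS FROM hive;

-- List the tables in one schema, then inspect the columns of one table.
SHOW TABLES FROM hive.default;
DESCRIBE hive.default.hive_table;
```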

Enter your query in the editor. The following examples cover the most common patterns:

-- Query a Hive table
SELECT * FROM hive.default.hive_table;

-- Query a MySQL table
SELECT * FROM mysql.rt_data.rt_user;

-- Join a Hive table and a MySQL table
SELECT DISTINCT a.id, a.name, b.rt_name
FROM hive.default.hive_table a
INNER JOIN mysql.rt_data.rt_user b ON a.id = b.id;

-- Query a Hive table using a scheduling parameter
SELECT * FROM hive.default.${table_name};

DataWorks scheduling parameters let you pass dynamic values to your SQL at runtime. Define variables using the ${variable_name} format, then assign values in the Properties > Scheduling Parameter section of the right-side pane. See Supported formats for scheduling parameters and Configure and use scheduling parameters.
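
Scheduling parameters are often used to filter by date rather than to swap table names. A hedged sketch, assuming that hive_table has a string partition column named ds (hypothetical) and that a variable named bizdate (also an illustrative name) has been assigned a value in the Scheduling Parameter section; see the supported formats topic for the expressions that can be assigned:

```sql
-- ${bizdate} is replaced with the assigned value when the task runs.
-- The ds partition column is an assumption made for illustration.
SELECT id, name
FROM hive.default.hive_table
WHERE ds = '${bizdate}';
```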

Run the SQL task

Two run modes are available:

| Mode | When to use | Behavior |
| --- | --- | --- |
| Run | Routine execution using saved parameter values | Runs the task with the currently configured scheduling parameters |
| Advanced Run | One-off runs where you need to override parameter values | Opens a dialog to select a resource group and set parameter values for this run only |

To run the task:

  1. Click the Advanced Run (Run with Parameters) icon. In the Parameters dialog box, select your scheduling resource group and click Run.

    - The scheduling resource group must have passed a network connectivity test with the compute resources. See Network connectivity solutions.
    - Each query returns a maximum of 10,000 records with a total size limit of 10 MB.

  2. Click the Save icon to save your SQL.

Configure advanced settings (optional)

Adjust SQL execution behavior in the Advanced Settings section of the right-side pane:

| Parameter | Description | Default |
| --- | --- | --- |
| FLOW_SKIP_SQL_ANALYZE | Controls how multiple SQL statements are executed. Set to true to run all statements at once; false runs one statement at a time. | false |
| DATAWORKS_SESSION_DISABLE | Applies to test runs in the development environment. Set to true to create a new JDBC connection for each SQL execution; false reuses the same connection across all statements. | false |
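
FLOW_SKIP_SQL_ANALYZE only matters when a node contains more than one statement. A sketch of such a script, reusing the join from the Write SQL section; the user_summary table is hypothetical, and whether CREATE TABLE is permitted depends on the connector and on your permissions:

```sql
-- Statement 1: build a summary table from the join (hypothetical target table).
CREATE TABLE IF NOT EXISTS hive.default.user_summary AS
SELECT a.id, a.name, b.rt_name
FROM hive.default.hive_table a
INNER JOIN mysql.rt_data.rt_user b ON a.id = b.id;

-- Statement 2: read back a row count from the table created above.
SELECT count(*) AS row_count
FROM hive.default.user_summary;
```

With the default value false, the two statements run one at a time; with true, they are run together.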

Step 3: Configure task scheduling

Click Scheduling Configuration in the right-side pane and configure the scheduling properties. Configure the Rerun Property and Upstream Dependent Node before submitting. For full scheduling options, see Overview.

Step 4: Submit and deploy

  1. Click the Save icon to save the node.

  2. Click the Submit icon to submit the task. In the Submit dialog box, enter a Change description and choose whether to require a code review.

    - Configure the Rerun and Parent Nodes properties before submitting.
    - When code review is enabled, a reviewer must approve the submitted code before it can be deployed. This prevents unverified changes from reaching the production environment. See Code review.

  3. For workspaces in standard mode, click Deploy in the upper-right corner after submission to deploy the task to the production environment. See Deploy tasks.

What's next

After deployment, click Operation Center in the upper-right corner to monitor the task's scheduling status. See Manage periodic tasks.

Troubleshooting

The node run fails with a connection timeout

Root cause: The resource group cannot reach the EMR cluster due to a network connectivity issue.

Resolution:

  1. Open the computing resource list page.

  2. Find the affected resource and click Re-initialize.

  3. Wait for initialization to complete and verify that the status shows success before rerunning the task.
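
Once the status shows success, a trivial query is a quick way to confirm connectivity before rerunning the full task. A minimal sketch; it does not read from any data source table:

```sql
-- A constant query: it succeeds only if the node can reach Trino and run a statement.
SELECT 1;
```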