Create an EMR Trino node - DataWorks - Alibaba Cloud Documentation Center

Prerequisites

An Alibaba Cloud EMR cluster is created and registered with your DataWorks workspace.

To create EMR-related nodes and develop EMR tasks, you must register the EMR cluster with your DataWorks workspace. For more information, see Bind EMR compute resources in legacy DataStudio.
(Optional, for RAM users) The RAM user for task development has been added to the workspace and granted the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions and must be assigned with caution. For more information, see Add members to a workspace.
A serverless resource group is purchased and configured. The configuration includes binding the resource group to a workspace and setting up the network. For more information, see Use a serverless resource group.
A workflow is created in DataStudio.

In DataStudio, development operations for different engines are based on workflows. Therefore, you must create a workflow before you create a node. For more information, see Create a workflow.

Limits

This type of task can run only on a serverless resource group.
To manage metadata for DataLake or custom clusters in DataWorks, you must configure EMR-HOOK on the cluster. If EMR-HOOK is not configured, you cannot view metadata in real time, generate audit logs, display data lineage, or perform EMR-related governance tasks in DataWorks. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Hive.
If Lightweight Directory Access Protocol (LDAP) authentication is enabled for Trino, you must log on to the master node of the EMR cluster and download the keystore file from the /etc/taihao-apps/trino-conf directory. Log on to the DataWorks console. In the left-side navigation pane, click More > Management Center. Select the target workspace from the drop-down list and click Go to Management Center. In the left-side navigation pane, click Cluster Management. Find the registered EMR cluster. On the Account Mappings tab, click Edit Account Mappings, and then click Upload Keystore File to upload the file.

Step 1: Create an EMR Trino node

Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
Right-click the target workflow and choose Create Node > EMR > EMR Trino.
In the Create Node dialog box, enter a Name and select an Compute Engine Instance, a Node Type, and a Path. Click Confirm to go to the EMR Trino node editor page.

Note
A node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).

Step 2: Develop an EMR Trino task

Double-click the created node to go to the task development page and perform the following operations.

(Optional) Select an EMR cluster instance

If multiple EMR clusters are registered with your workspace, select the appropriate cluster from the drop-down list at the top of the node editor page. If only one EMR cluster is registered, DataWorks selects it by default.

To access domains that have an IP address whitelist, you must use a dedicated scheduling resource group.

Connector configuration

Before you query MySQL tables, you must configure the built-in EMR Trino connector. For more information, see MySQL connector.
Before you query Hive tables, you must configure the built-in EMR Trino connector. For more information, see Hive connector.
To query tables in other data sources, see Configure connectors for instructions on configuring the built-in connectors.

Edit SQL code

Enter the task code in the SQL editor. The following code provides an example.

-- Usage
-- SELECT * FROM <catalog>.<schema>.<table>;
-- Parameter description
-- <catalog>: the name of the data source to connect to.
-- <schema>: the name of the database to use.
-- <table>: the table to query.
-- Example: To view data in the hive_table table in the default database of a Hive data source:
-- Query a Hive table
SELECT * FROM hive.default.hive_table;
-- Example: To view data in the rt_user table in the user-created rt_data database of a MySQL data source:
-- Query a MySQL table  
SELECT * FROM mysql.rt_data.rt_user;
-- Join a Hive table and a MySQL table
SELECT DISTINCT a.id, a.name,b.rt_name FROM hive.default.hive_table a INNER JOIN mysql.rt_data.rt_user b ON a.id = b.id;
-- Query a Hive table using a scheduling parameter
SELECT * FROM hive.default.${table_name};

Note

DataWorks provides Scheduling Parameter to dynamically pass values to code in periodic scheduling scenarios. You can define variables in the ${variable_name} format in the code of a node task and assign values to the variables on the Scheduling Settings tab in the right-side navigation pane. In the Scheduling Parameter section, assign a value to the variable. For more information about the supported formats and configurations of scheduling parameters, see Supported formats of scheduling parameters and Configure and use scheduling parameters.
If you need to change the values of parameters in the code, click Advanced Run in the toolbar. For more information about how parameter values are assigned, see What is the difference in the value assignment logic among Run, Advanced Run, and smoke testing in the development environment?

Run the SQL task

In the toolbar, click the icon. In the Parameter dialog box, select a scheduling resource group and click Running.
Note
- To access compute resources over the public internet or in a VPC, you must use a scheduling resource group that can connect to the compute resources. For more information, see Network connectivity solutions.
- If you need to change the resource group for subsequent task runs, click the Running with Parameters icon and select the desired resource group.
- Queries on EMR Trino nodes return a maximum of 10,000 records, and the total data size cannot exceed 10 MB.
Click the icon to save the SQL code.

(Optional) Configure advanced parameters

If you need to adjust how the SQL statements are executed, click Advanced Settings in the right-side navigation pane.

Parameter

Description

FLOW_SKIP_SQL_ANALYZE

Specifies how SQL statements are executed. Valid values:

true: Runs multiple SQL statements at a time.
false (default): Runs one SQL statement at a time.

DATAWORKS_SESSION_DISABLE

This parameter applies to test runs in the development environment. Valid values:

true: Creates a new JDBC connection for each SQL statement run.
false (default): Reuses the same JDBC connection when a user runs different SQL statements in a node.

Step 3: Configure task scheduling

To periodically run the node task, click Scheduling on the right side of the node editing page and configure scheduling settings based on your needs. For more information, see Overview of task scheduling properties.

Note

You must configure the node’s Rerun attribute and Parent Nodes before you can submit the node.

Step 4: Commit and deploy the task

After you configure the node task, you must commit and deploy it. After the task is deployed, it runs periodically based on the scheduling configuration.

Click the icon in the toolbar to save the node.
Click the icon in the toolbar to commit the node task.

In the Submission dialog box, enter a Change Description and select whether to perform a code review after the node is committed.
Note
- You must set the Rerun attribute property and specify the Parent Nodes for the node before you can commit it.
- Code review helps control code quality and prevents errors caused by deploying unreviewed code to the production environment. If you enable code review, reviewers must approve the committed code before it can be deployed. For more information, see Code review.

If your workspace runs in standard mode, you must click Deploy in the upper-right corner of the node editor page after you commit the task. This deploys the task to the production environment. For more information, see Deploy tasks.

Next steps

After the task is committed and deployed, it runs periodically based on its scheduling configuration. You can click O&M in the upper-right corner of the node editor page to view the scheduling and operational status of the periodic task. For more information, see Manage periodic tasks.

FAQ

Q: A connection timeout occurs when I run the node. Why?
```
EMR execute task failed!
SQL: {"name":"dw20251018","type":"TRINO_SQL","launcher":{"allocationSpec":{}},"properties":{"envs":{"FLOW_SKIP_SQL_ANALYZE":false},"arguments":["select * from default.dim_customers;"],"tags":[]},"description":"DataWorks"}
TASK-MESSAGE:
FAILED: Error executing query
```
A: Ensure that the resource group and the cluster can connect to each other over the network. Go to the computing resources list and click Resource Initialization. In the dialog box that appears, click Re-initialize. Verify that the initialization is successful.

The dialog box displays the following message at the top: If this is the first time you bind a cluster or if the cluster service configuration (for example, the hive-site file) has changed, initialize the resource group. Otherwise, tasks may fail to run.