EMR Presto nodes let you write Presto SQL queries against an Alibaba Cloud E-MapReduce (EMR) cluster and run them on a recurring schedule in DataWorks.
Limitations
| Constraint | Details |
|---|---|
| Cluster type | Legacy Hadoop data lake clusters only. DataLake and Custom clusters are not supported. |
| Resource group | Serverless resource group or exclusive resource group for scheduling. Use a serverless resource group when possible. |
| SQL statement size | Each SQL statement cannot exceed 130 KB. |
| Query results | A single query returns at most 10,000 records and 10 MB of data. |
| Data lineage | Not supported for EMR Presto nodes. |
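The statement size limit can be checked locally before a script is pasted into a node. The following is a minimal sketch, not part of DataWorks: it assumes that 130 KB means 130 × 1024 bytes of UTF-8 text and that statements are separated by semicolons. The name `oversized_statements` is hypothetical, and the naive split does not handle semicolons inside string literals or comments.

```python
from typing import List

# Assumption: the documented 130 KB limit is interpreted as 130 * 1024 bytes.
MAX_STATEMENT_BYTES = 130 * 1024


def oversized_statements(script: str, limit: int = MAX_STATEMENT_BYTES) -> List[int]:
    """Return the 1-based positions of statements larger than `limit` bytes."""
    # Naive split: semicolons inside string literals or comments are not handled.
    statements = [s.strip() for s in script.split(";") if s.strip()]
    return [
        position
        for position, statement in enumerate(statements, start=1)
        if len(statement.encode("utf-8")) > limit
    ]


if __name__ == "__main__":
    script = "select 1; select * from userinfo;"
    # Lists the positions of statements that would need to be shortened or split.
    print(oversized_statements(script))
```

A non-empty result lists the statements to shorten or split before running the node.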
Prerequisites
Before you begin, make sure you have:
An EMR cluster created and registered to DataWorks. See DataStudio (old version): Associate an EMR computing resource.
(If using a RAM user) The RAM user added to the workspace with the Develop or Workspace Administrator role. The Workspace Administrator role grants more permissions than this task requires, so assign it with caution. See Add workspace members and assign roles to them.
A serverless resource group purchased and configured with workspace association and network settings. See Create and use a serverless resource group.
A workflow created in DataStudio. See Create a workflow.
Step 1: Create an EMR Presto node
Go to the DataStudio page: log in to the DataWorks console, select the region in the top navigation bar, and choose Data Development and O&M > Data Development in the left-side navigation pane. Then select your workspace from the drop-down list and click Go to Data Development.
In the workflow directory, right-click the target workflow and choose Create Node > EMR > EMR Presto.
Alternatively, hover over Create and select Create Node > EMR > EMR Presto.
In the Create Node dialog box, configure the following fields and click Confirm. The configuration tab for the EMR Presto node opens.
| Field | Description |
|---|---|
| Name | Name for the node. Allowed characters: uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.). |
| Engine Instance | The EMR computing resource to associate with this node. |
| Node Type | The type of compute node to use. |
| Path | The location within the workflow directory structure. |
Step 2: Develop an EMR Presto task
Double-click the node you created. The task development page opens, where you can write SQL, configure advanced parameters, and run the task.
Develop SQL code
Write Presto SQL in the SQL editor. Use ${variable_name} syntax to define variables in your code. Assign values to those variables under Scheduling Configuration > Scheduling Parameters in the right-side panel, in key=value format. This lets you pass dynamic values at scheduling time without modifying the code.
Example:
select '${var}'; -- ${var} is resolved at run time from Scheduling Parameters
select * from userinfo;

To assign a value to var, open Scheduling Configuration > Scheduling Parameters and add an entry such as var=2024-01-01. For supported parameter formats, see Supported formats of scheduling parameters.
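To make the substitution concrete, the sketch below mimics how a `${variable_name}` placeholder could be replaced by a `key=value` assignment. It is an illustration only, not the DataWorks implementation: the function names `parse_assignments` and `render` are hypothetical, and treating multiple assignments as whitespace-separated is an assumption.

```python
import re

# Matches ${name} placeholders, where name is letters, digits, or underscores.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def parse_assignments(text: str) -> dict:
    """Parse key=value pairs such as 'var=2024-01-01' into a dict.

    Assumption: multiple pairs are separated by whitespace.
    """
    pairs = {}
    for item in text.split():
        key, separator, value = item.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        pairs[key] = value
    return pairs


def render(sql: str, params: dict) -> str:
    """Replace every ${name} in `sql` with params[name]."""
    def replace(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value assigned for ${{{name}}}")
        return params[name]

    return _PLACEHOLDER.sub(replace, sql)


if __name__ == "__main__":
    params = parse_assignments("var=2024-01-01")
    print(render("select '${var}';", params))  # prints: select '2024-01-01';
```

Raising an error for an unassigned placeholder is a design choice of this sketch; it surfaces a missing scheduling parameter early instead of sending a literal `${...}` to the engine.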
If your workspace has multiple EMR computing resources attached, select the one to use for this node. If only one is attached, no selection is needed.
To test parameter resolution before scheduling, click Run with Parameters in the top toolbar. For details on how parameter values differ between Run, Run with Parameters, and smoke testing, see Differences in parameter assignment logic.
Configure advanced parameters (optional)
Hadoop cluster: EMR on ECS
In the Advanced Settings section, set the following parameters to control SQL execution behavior and job submission routing. These parameters apply to Hadoop clusters (EMR on ECS). For details on configuring them, see Spark Configuration.
| Parameter | Values | Default | Description |
|---|---|---|---|
| FLOW_SKIP_SQL_ANALYZE | true / false | false | Controls how SQL statements run. true: run all statements in a single batch. false: run one statement at a time. Applies only to test runs in the development environment. |
| USE_GATEWAY | true / false | false | Controls job submission routing. true: submit jobs through a gateway cluster. false: submit jobs directly to the header node. Setting this to true when no gateway cluster is associated causes job submission to fail. |
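The constraints in the table can be summarized as a small validation routine. This is a sketch under stated assumptions, not DataWorks code: the parameter names and defaults come from the table above, while the function `resolve_settings` and the `gateway_associated` flag are hypothetical stand-ins for what the platform checks at submission time.

```python
# Defaults taken from the parameter table above.
DEFAULTS = {
    "FLOW_SKIP_SQL_ANALYZE": "false",
    "USE_GATEWAY": "false",
}


def resolve_settings(overrides: dict, gateway_associated: bool) -> dict:
    """Merge overrides into the defaults and reject combinations that would fail.

    `gateway_associated` is a hypothetical flag: True if a gateway cluster is
    associated with the EMR cluster.
    """
    settings = dict(DEFAULTS)
    for name, value in overrides.items():
        if name not in DEFAULTS:
            raise ValueError(f"unknown parameter: {name}")
        if value not in ("true", "false"):
            raise ValueError(f"{name} must be 'true' or 'false', got {value!r}")
        settings[name] = value
    # Per the table, USE_GATEWAY=true without a gateway cluster makes submission fail.
    if settings["USE_GATEWAY"] == "true" and not gateway_associated:
        raise ValueError("USE_GATEWAY=true requires an associated gateway cluster")
    return settings


if __name__ == "__main__":
    print(resolve_settings({"FLOW_SKIP_SQL_ANALYZE": "true"}, gateway_associated=False))
```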
Run the SQL task
In the toolbar, click the run icon. In the Parameters dialog box, select the scheduling resource group and click Run.
The resource group must have passed a connectivity test with the EMR cluster. To access computing resources over the public internet or within a VPC, verify connectivity first. See Network connectivity solutions.
Click the save icon to save your SQL.
(Optional) Run smoke testing. Smoke testing in the development environment can be triggered when submitting the node or after submission. See Perform smoke testing.
Step 3: Configure scheduling properties
To run the task on a recurring schedule, click Properties in the right-side navigation pane and configure the scheduling settings.
Configure the Rerun and Parent Nodes parameters before committing the task.
For a full reference on scheduling options, see Overview.
Step 4: Deploy the task
Click the save icon to save the task.
Click the commit icon to commit the task. In the Submit dialog box, enter a Change description. If code review is enabled in your workspace, the task can only be deployed after the committed code passes review. See Code review.
Configure the Rerun and Parent Nodes parameters on the Properties tab before committing.
(Standard mode workspaces only) Deploy the task to the production environment. Click Deploy in the upper-right corner of the node configuration tab. See Deploy nodes.
What's next
After the task is committed and deployed, it runs automatically based on your scheduling configuration. To monitor scheduling status, click Operation Center in the upper-right corner of the node configuration tab. See View and manage auto triggered tasks.
FAQ
Why does "Error executing query" appear?
This error occurs when the cluster type is not supported. EMR Presto nodes work only with legacy Hadoop data lake clusters; DataLake and Custom clusters are not supported.
To resolve this:
In DataStudio, go to the computing resource list.
Verify that the cluster registered to your workspace is a legacy Hadoop data lake cluster.
Why does a connection timeout occur when the node runs?
This indicates a network connectivity issue between the resource group and the EMR cluster.
To resolve this:
In DataStudio, go to the computing resource list page.
Find the resource and click Re-initialize.
Confirm that the initialization completes successfully, then retry the task.
If the issue persists, verify that the resource group has passed a connectivity test with the EMR cluster. See Network connectivity solutions.