DataWorks: Create an EMR Presto node

Last Updated: Mar 26, 2026

EMR Presto nodes let you write Presto SQL queries against an Alibaba Cloud E-MapReduce (EMR) cluster and run them on a recurring schedule in DataWorks.

Limitations

| Constraint | Details |
| --- | --- |
| Cluster type | Legacy Hadoop data lake clusters only. DataLake and Custom clusters are not supported. |
| Resource group | Serverless resource group or exclusive resource group for scheduling. Use a serverless resource group when possible. |
| SQL statement size | Each SQL statement cannot exceed 130 KB. |
| Query results | A single query returns at most 10,000 records and 10 MB of data. |
| Data lineage | Not supported for EMR Presto nodes. |
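
The 130 KB limit on each SQL statement can be checked locally before a script is pasted into the editor. The following is a minimal sketch and not part of DataWorks. It assumes that 1 KB means 1,024 bytes, that size is measured on the UTF-8 encoding, and that statements are separated by semicolons. The function names are hypothetical.

```python
from typing import List

# Assumption: the 130 KB limit is counted as 130 * 1024 bytes of UTF-8 text.
MAX_STATEMENT_BYTES = 130 * 1024


def split_statements(sql: str) -> List[str]:
    """Naively split a script on semicolons.

    Semicolons inside string literals or comments are not handled.
    """
    return [part.strip() for part in sql.split(";") if part.strip()]


def oversized_statements(sql: str, limit: int = MAX_STATEMENT_BYTES) -> List[int]:
    """Return the 1-based positions of statements larger than the limit."""
    return [
        index
        for index, statement in enumerate(split_statements(sql), start=1)
        if len(statement.encode("utf-8")) > limit
    ]


script = "select 1; select * from userinfo;"
print(oversized_statements(script))  # prints []
```

An empty list means that no statement in the script exceeds the assumed limit. Because the split is naive, a script with semicolons inside string literals may be reported inaccurately.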

Prerequisites

Before you begin, make sure you have:

Step 1: Create an EMR Presto node

  1. Go to the DataStudio page. Log in to the DataWorks console. In the top navigation bar, select the region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.

  2. In the workflow directory, right-click the target workflow and choose Create Node > EMR > EMR Presto.

    Alternatively, hover over Create and select Create Node > EMR > EMR Presto.
  3. In the Create Node dialog box, configure the following fields and click Confirm. The configuration tab for the EMR Presto node opens.

    | Field | Description |
    | --- | --- |
    | Name | Name for the node. Allowed characters: uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.). |
    | Engine Instance | The EMR computing resource to associate with this node. |
    | Node Type | The type of compute node to use. |
    | Path | The location within the workflow directory structure. |

Step 2: Develop an EMR Presto task

Double-click the node you created. The task development page opens, where you can write SQL, configure advanced parameters, and run the task.

Develop SQL code

Write Presto SQL in the SQL editor. Use ${variable_name} syntax to define variables in your code. Assign values to those variables under Scheduling Configuration > Scheduling Parameters in the right-side panel, in key=value format. This lets you pass dynamic values at scheduling time without modifying the code.

Example:

select '${var}'; -- ${var} is resolved at run time from Scheduling Parameters
select * from userinfo;

To assign a value to var: open Scheduling Configuration > Scheduling Parameters and add an entry such as var=2024-01-01. For supported parameter formats, see Supported formats of scheduling parameters.
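
To make the substitution concrete, the sketch below imitates in plain Python how a ${variable_name} placeholder could be replaced by a key=value entry. It is only an illustration of the idea and does not reproduce how DataWorks resolves scheduling parameters. The helper names are hypothetical, and separating multiple entries with spaces is an assumption.

```python
import re

# Matches placeholders of the form ${name}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def parse_parameters(text: str) -> dict:
    """Parse key=value entries, assumed here to be separated by spaces."""
    pairs = (item.split("=", 1) for item in text.split())
    return {key: value for key, value in pairs}


def render(sql: str, parameters: dict) -> str:
    """Replace each ${name} with its value.

    Raises KeyError if a placeholder has no assigned value.
    """
    return PLACEHOLDER.sub(lambda match: parameters[match.group(1)], sql)


params = parse_parameters("var=2024-01-01")
print(render("select '${var}';", params))  # prints: select '2024-01-01';
```

Raising an error for an unassigned placeholder is a design choice of this sketch. It surfaces a missing entry early instead of sending a literal ${var} to the query engine.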

If your workspace has multiple EMR computing resources attached, select the one to use for this node. If only one is attached, no selection is needed.

To test parameter resolution before scheduling, click Run with Parameters in the top toolbar. For details on how parameter values differ between Run, Run with Parameters, and smoke testing, see Differences in parameter assignment logic.

Configure advanced parameters (optional)

Hadoop cluster: EMR on ECS

In the Advanced Settings section, set the following parameters to control SQL execution behavior and job submission routing. These parameters apply to Hadoop clusters (EMR on ECS). For more information about how to configure the parameters, see Spark Configuration.

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| FLOW_SKIP_SQL_ANALYZE | true / false | false | Controls how SQL statements run. true: run all statements in a single batch. false: run one statement at a time. Applies only to test runs in the development environment. |
| USE_GATEWAY | true / false | false | Controls job submission routing. true: submit jobs through a gateway cluster. false: submit jobs directly to the header node. Setting this to true when no gateway cluster is associated causes job submission to fail. |
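
The defaults and the USE_GATEWAY constraint can be expressed as a small local check. The sketch below is a hypothetical helper, not a DataWorks API. It merges overrides onto the documented defaults and rejects the one combination described as causing submission failure. The has_gateway_cluster flag is an assumed input that you would determine yourself.

```python
# Documented defaults for the two advanced parameters.
DEFAULTS = {"FLOW_SKIP_SQL_ANALYZE": "false", "USE_GATEWAY": "false"}


def resolve_settings(overrides: dict, has_gateway_cluster: bool) -> dict:
    """Merge overrides onto the defaults and validate the result."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    settings = {**DEFAULTS, **overrides}
    for name, value in settings.items():
        if value not in ("true", "false"):
            raise ValueError(f"{name} must be 'true' or 'false', got {value!r}")

    # USE_GATEWAY=true without an associated gateway cluster fails at submission.
    if settings["USE_GATEWAY"] == "true" and not has_gateway_cluster:
        raise ValueError("USE_GATEWAY=true requires an associated gateway cluster")
    return settings


print(resolve_settings({"FLOW_SKIP_SQL_ANALYZE": "true"}, has_gateway_cluster=False))
# prints {'FLOW_SKIP_SQL_ANALYZE': 'true', 'USE_GATEWAY': 'false'}
```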

Run the SQL task

  1. In the toolbar, click the Advanced Run icon. In the Parameters dialog box, select the scheduling resource group and click Run.

    The resource group must have passed a connectivity test with the EMR cluster. To access computing resources over the public internet or within a VPC, verify connectivity first. See Network connectivity solutions.
  2. Click the Save icon to save your SQL.

  3. (Optional) Run smoke testing. Smoke testing in the development environment can be triggered when submitting the node or after submission. See Perform smoke testing.

Step 3: Configure scheduling properties

To run the task on a recurring schedule, click Properties in the right-side navigation pane and configure the scheduling settings.

Configure the Rerun and Parent Nodes parameters before committing the task.

For a full reference on scheduling options, see Overview.

Step 4: Deploy the task

  1. Click the Save icon to save the task.

  2. Click the Submit icon to commit the task. In the Submit dialog box, enter a Change description. If code review is enabled in your workspace, the task can only be deployed after the committed code passes review. See Code review.

    Configure the Rerun and Parent Nodes parameters on the Properties tab before committing.
  3. (Standard mode workspaces only) Deploy the task to the production environment. Click Deploy in the upper-right corner of the node configuration tab. See Deploy nodes.

What's next

After the task is committed and deployed, it runs automatically based on your scheduling configuration. To monitor scheduling status, click Operation Center in the upper-right corner of the node configuration tab. See View and manage auto triggered tasks.

FAQ

Why does "Error executing query" appear?

This error occurs when the cluster type is not supported. EMR Presto nodes work only with legacy Hadoop data lake clusters; DataLake and Custom clusters are not supported.

To resolve this:

  1. In DataStudio, go to the computing resource list.

  2. Verify that the cluster registered to your workspace is a legacy Hadoop data lake cluster.

Why does a connection timeout occur when the node runs?

This indicates a network connectivity issue between the resource group and the EMR cluster.

To resolve this:

  1. In DataStudio, go to the computing resource list page.

  2. Find the resource and click Re-initialize.

  3. Confirm that the initialization completes successfully, then retry the task.

If the issue persists, verify that the resource group has passed a connectivity test with the EMR cluster. See Network connectivity solutions.