
DataWorks:EMR Presto node

Last Updated: Mar 26, 2026

Presto, also known as PrestoDB, is a flexible and scalable distributed SQL query engine that supports interactive analysis and queries of big data using standard SQL. The EMR Presto node in DataWorks lets you write and schedule Presto SQL tasks on an EMR cluster.

Prerequisites

Before you begin, make sure you have:

  • An Alibaba Cloud EMR cluster bound to DataWorks. See Data Studio: Associate an EMR computing resource.

  • (Optional) If you use a Resource Access Management (RAM) user, the user must be added to the target workspace and assigned the Developer or Workspace Administrator role.

    The Workspace Administrator role has extensive permissions. Grant it with caution. To add members to a workspace, see Add members to a workspace. Alibaba Cloud account users can skip this step.

Limitations

Supported cluster types

Only earlier versions of Hadoop-based data lake clusters are supported. DataLake clusters and custom clusters are not supported.

Supported resource groups

Run EMR Presto tasks on a Serverless resource group (recommended) or an exclusive resource group for scheduling.

Data lineage

Data lineage is not supported for EMR Presto nodes.

Develop an EMR Presto node

Step 1: Write SQL code

Write SQL in the SQL editor on the node edit page. To pass dynamic values at runtime, define variables using ${variable_name} syntax, then assign values in the Scheduling Parameters section of the Scheduling Configuration tab. For details, see Sources and expressions of scheduling parameters.

Example:

select '${var}'; -- Use with scheduling parameters.
select * from userinfo;

SQL statements are subject to the following limits:

  • A single SQL statement cannot exceed 130 KB.

  • A query returns at most 10,000 records, and the total size of returned data cannot exceed 10 MB.
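As an illustration of the variable syntax, the following sketch filters a table by date partition using a scheduling parameter. The table name `user_log`, the partition column `dt`, and the parameter name `bizdate` are hypothetical; the parameter would be assigned a value expression in the Scheduling Parameters section of the Scheduling Configuration tab.

```sql
-- Hypothetical example: user_log is a table partitioned by dt,
-- and bizdate is a scheduling parameter defined in
-- Scheduling Configuration (for example, with the expression $[yyyymmdd-1]).
-- At runtime, ${bizdate} is replaced with the assigned value.
SELECT dt, COUNT(*) AS pv
FROM user_log
WHERE dt = '${bizdate}'
GROUP BY dt;
```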

Step 2: Configure advanced parameters (optional)

On the Scheduling Configuration tab, configure the following parameters in EMR Node Parameters > DataWorks Parameters. These parameters apply to Hadoop clusters (EMR on ECS).

Hadoop cluster: EMR on ECS

| Parameter | Default | Description |
| --- | --- | --- |
| DATAWORKS_SESSION_DISABLE | false | Controls Java Database Connectivity (JDBC) session behavior. Set to true to create a new JDBC connection for each SQL run (required to print the yarn applicationId of Hive). Set to false to reuse the same JDBC connection across SQL statements in the same node. Applies to test runs in the development environment only. |
| FLOW_SKIP_SQL_ANALYZE | false | Controls how SQL statements are executed. Set to true to run multiple SQL statements per execution. Set to false to run one SQL statement at a time. Applies to test runs in the development environment only. |
| priority | 1 | Job priority. |
| queue | default | The YARN scheduling queue for job submission. For queue configuration details, see Basic queue configurations. |
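To make the FLOW_SKIP_SQL_ANALYZE setting concrete, consider a node body that contains two statements. With the parameter set to true, both statements are submitted in a single execution; with the default of false, each statement is executed on its own. The table name `orders` is hypothetical.

```sql
-- Hypothetical node body with two statements (table `orders` is illustrative).
-- FLOW_SKIP_SQL_ANALYZE=true: both statements are submitted in one execution.
-- FLOW_SKIP_SQL_ANALYZE=false (default): each statement runs separately.
SELECT COUNT(*) FROM orders;
SELECT MAX(order_date) FROM orders;
```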

To configure open-source Spark property parameters, use the EMR Node Parameters > Spark Parameters section. See Spark configuration for available properties.

Step 3: Run the SQL task

  1. In the Run Configuration section, set the Computing Resource and Resource Group.

    Optionally, adjust Scheduling CUs based on the resources required for task execution. The default value is 0.25. To access data sources over a public network or VPC, use a scheduling resource group that has passed the connectivity test with the data source. See Network connectivity solutions.
  2. In the parameter dialog box on the toolbar, select the data source and click Run.

Step 4: Configure scheduling

To run the node on a recurring schedule, configure its scheduling properties. See Configure node scheduling.

Step 5: Publish the node

Publish the node to make it active. See Publish nodes and workflows.

After publishing, monitor auto-triggered task runs in Operation Center. See Get started with Operation Center.

FAQ

Why does "Error executing query" appear?


The cluster is not an earlier version of a Hadoop-based data lake cluster, which is the only supported cluster type. Switch to a supported cluster and retry.

Why does a connection timeout occur when the node runs?

The resource group cannot reach the cluster. Go to the computing resource list page, find the resource, and click Re-initialize in the dialog box that appears. Verify that initialization succeeds.
