
DataWorks: EMR Spark SQL node

Last Updated: Mar 26, 2026

Create an E-MapReduce (EMR) Spark SQL node in Data Studio to run distributed SQL queries on your EMR cluster and improve job execution efficiency.

Prerequisites

Before you begin, make sure you have:

  • An Alibaba Cloud EMR cluster bound to DataWorks. See Data Studio: Associate an EMR computing resource.

  • (Resource Access Management (RAM) users only) Been added to the workspace and assigned the Developer or Workspace Administrator role. The Workspace Administrator role carries extensive permissions — grant it with caution. See Add members to a workspace. Alibaba Cloud account users can skip this step.

  • (Optional) A custom image based on the official dataworks_emr_base_task_pod image, if you need a custom component environment — for example, to replace Spark JAR packages or include specific libraries, files, or JAR packages. See Custom images.

Limitations

| Limitation | Details | Workaround |
| --- | --- | --- |
| Supported resource groups | Runs only on a serverless resource group (recommended) or an exclusive resource group for scheduling. To use a custom image in Data Studio, you must use a serverless resource group. | Switch to a serverless resource group. |
| EMR-HOOK requirement | To manage metadata for DataLake or custom clusters, configure EMR-HOOK on the cluster first. Without it, DataWorks cannot display metadata in real time, generate audit logs, display data lineage, or run EMR data governance tasks. See Configure EMR-HOOK for Spark SQL. | Configure EMR-HOOK before using data governance features. |
| Data lineage | Supported for EMR Serverless Spark clusters. Not supported for EMR on ACK Spark clusters. | Use a DataLake or custom cluster if data lineage is required. |
| Visual function registration | Supported for DataLake and custom clusters. Not supported for EMR on ACK Spark or EMR Serverless Spark clusters. | Use a DataLake or custom cluster if visual function registration is required. |

Usage notes

If Ranger access control is enabled for Spark in the EMR cluster bound to your workspace:

  • Tasks using the default image support Ranger access control automatically.

  • Tasks using a custom image require an image upgrade. To request the upgrade, submit a ticket.

Develop and run an EMR Spark SQL node

Step 1: Write SQL code

Write your SQL in the editor on the EMR Spark SQL node editing page.

Using scheduling variables

Define variables in your SQL using the ${variable_name} syntax. Assign values to these variables in the Scheduling Parameters section of the Scheduling Configuration panel.

| Scenario | How values are assigned |
| --- | --- |
| Scheduled run | DataWorks evaluates the expression you set (for example, ${yyyymmdd}) and passes the result at runtime. |
| Test run in Data Studio | Enter the value manually in the parameter dialog box before clicking Run. If no value is entered, the task fails with a variable substitution error. |

Example:

SHOW TABLES;

-- ${var} declares a variable named var.
-- In Scheduling Parameters, assign var an expression such as ${yyyymmdd}
-- to use the business date as a table name suffix.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
  ip  STRING COMMENT 'IP address',
  uid STRING COMMENT 'User ID'
)
PARTITIONED BY (dt STRING);
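
At runtime, DataWorks replaces the variable before the SQL is submitted: if var resolves to 20240101, the statement above creates userinfo_new_20240101. The following sketch shows how the same variable can also drive a partition write. It assumes a hypothetical source table named userinfo_src with matching columns:

-- userinfo_src is a hypothetical source table assumed to exist.
-- ${var} resolves to the same value as in the CREATE TABLE statement above.
INSERT OVERWRITE TABLE userinfo_new_${var} PARTITION (dt = '${var}')
SELECT ip, uid
FROM userinfo_src
WHERE dt = '${var}';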

For more information, see Sources and expressions of scheduling parameters.

Step 2: Configure EMR node parameters (optional)

In the Run Configuration panel, expand EMR Node Parameters to configure the following:

Parameter availability by cluster type

| Parameter | DataLake / custom (EMR on ECS) | EMR Serverless Spark | EMR on ACK | Hadoop (EMR on ECS) |
| --- | --- | --- | --- | --- |
| queue | Yes | No | No | Yes |
| priority | Yes | No | No | Yes |
| FLOW_SKIP_SQL_ANALYZE | Yes | Yes | Yes | Yes |
| ENABLE_SPARKSQL_JDBC | Yes | No | No | No |
| DATAWORKS_SESSION_DISABLE | Yes | Yes | No | No |
| SERVERLESS_RELEASE_VERSION | No | Yes | No | No |
| SERVERLESS_QUEUE_NAME | No | Yes | No | No |
| SERVERLESS_SQL_COMPUTE | No | Yes | No | No |
| USE_GATEWAY | No | No | No | Yes |
| Custom Spark properties (Other) | Yes | Yes | Yes | Yes |

DataLake and custom clusters (EMR on ECS)

| Parameter | Description |
| --- | --- |
| queue | The YARN scheduling queue for job submission. Default: default. See Basic queue configurations. |
| priority | Job priority. Default: 1. |
| FLOW_SKIP_SQL_ANALYZE | Controls how multiple SQL statements run during test runs in Data Studio. true: runs all statements at once. false (default): runs one statement at a time. This parameter applies only to test runs, not scheduled runs. |
| ENABLE_SPARKSQL_JDBC | Controls how SQL code is submitted. true: submits via JDBC (to Apache Kyuubi if the cluster has it, otherwise to the Spark Thrift Server). Both support data lineage, but the Spark Thrift Server does not return node output. false (default): submits using spark-submit cluster mode, which supports Spark 2 and Spark 3, data lineage, node output, and custom Spark parameters. Note: spark-submit cluster mode creates temporary files and directories in the /tmp directory of the EMR cluster's Hadoop Distributed File System (HDFS). Make sure you have read and write permissions on that directory. |
| DATAWORKS_SESSION_DISABLE | Controls JDBC session behavior during test runs. true: creates a new JDBC connection for each SQL statement (also prints the Hive YARN applicationId). false (default): reuses the same JDBC connection within a single node run. |
| Other | Custom Spark configuration properties. Format: spark.eventLog.enabled : false. DataWorks appends these as --conf key=value when submitting to the EMR cluster. To enable Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger here or in Set global Spark parameters. You can also set global Spark parameters at the workspace level and control whether they override node-level settings. |
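
The FLOW_SKIP_SQL_ANALYZE setting matters only when a node contains multiple statements. As a minimal illustration, reusing the hypothetical table from Step 1:

-- With FLOW_SKIP_SQL_ANALYZE=false (default), a test run in Data Studio
-- executes these statements one at a time, in order.
-- With FLOW_SKIP_SQL_ANALYZE=true, they are submitted together as one batch.
ALTER TABLE userinfo_new_${var} ADD IF NOT EXISTS PARTITION (dt = '${var}');
SELECT COUNT(*) FROM userinfo_new_${var} WHERE dt = '${var}';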

EMR Serverless Spark

For full parameter details, see Set parameters for submitting a Spark job.

| Parameter | Description |
| --- | --- |
| FLOW_SKIP_SQL_ANALYZE | Same as above. Applies only to test runs in Data Studio. |
| DATAWORKS_SESSION_DISABLE | Controls where the task runs during test runs in Data Studio. true: submits to a queue (the default queue configured when the compute resource was bound, unless SERVERLESS_QUEUE_NAME is set). false (default): submits to SQL Compute. Note: This parameter does not affect scheduled runs; scheduled runs always submit to a queue. |
| SERVERLESS_RELEASE_VERSION | Spark engine version. Defaults to the Default Engine Version set for the cluster in Computing Resources (Management Center). Takes effect only when the SQL Compute session for the cluster is not in the Running state in the EMR Serverless Spark console. |
| SERVERLESS_QUEUE_NAME | Resource queue for job submission. Defaults to the Default Resource Queue in Cluster Management (Management Center). To use this parameter for test runs in Data Studio, first set DATAWORKS_SESSION_DISABLE to true. For scheduled runs in Operation Center, tasks always go to a queue. Takes effect only when the SQL Compute session is not in the Running state in the EMR Serverless Spark console. Set this parameter directly on the node or via Set global Spark parameters. See Manage resource queues. |
| SERVERLESS_SQL_COMPUTE | SQL Compute session to use. Defaults to the Default SQL Compute set for the cluster in Computing Resources (Management Center). See Manage SQL sessions. |
| Other | Custom Spark configuration properties. Format: spark.eventLog.enabled : false. DataWorks appends these as --conf key=value. See Set global Spark parameters. |
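
For runtime-configurable properties, standard Spark SQL SET statements are an alternative to the Other field; they apply only to the current session. A minimal sketch, where the property and value are illustrative rather than a recommendation:

-- SET works only for runtime-configurable properties; static properties
-- (for example, event log settings) must still be passed through the
-- Other field as --conf values.
SET spark.sql.shuffle.partitions = 200;
SELECT uid, COUNT(*) AS pv FROM userinfo_new_${var} GROUP BY uid;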

EMR on ACK

| Parameter | Description |
| --- | --- |
| FLOW_SKIP_SQL_ANALYZE | Same as above. Applies only to test runs in Data Studio. |
| Other | Custom Spark configuration properties. Format: spark.eventLog.enabled : false. DataWorks appends these as --conf key=value. See Set global Spark parameters. |

Hadoop (EMR on ECS)

| Parameter | Description |
| --- | --- |
| queue | The YARN scheduling queue for job submission. Default: default. |
| priority | Job priority. Default: 1. |
| FLOW_SKIP_SQL_ANALYZE | Same as above. Applies only to test runs in Data Studio. |
| USE_GATEWAY | Controls whether jobs are submitted through a gateway cluster. true: submits via the gateway cluster. false (default): submits directly to the header node. Note: If no gateway cluster is associated with this node's cluster, setting this to true causes job submission to fail. |
| Other | Custom Spark configuration properties. Format: spark.eventLog.enabled : false. DataWorks appends these as --conf key=value. To enable Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger in Set global Spark parameters. |

Step 3: Run the node

  1. In the Run Configuration panel, select a Computing Resource and a Resource Group. Optionally, set Schedule CUs based on the resource requirements of your task. The default value is 0.25. To access data sources over a public network or VPC, use a scheduling resource group that has passed the connectivity test with that data source. See Network connectivity solutions.

  2. In the parameter dialog box in the toolbar, select the data source and click Run.

Step 4: Configure scheduling

To run the node on a schedule, configure its scheduling properties. See Configure scheduling for a node.

Step 5: Publish the node

After configuring the node, publish it to make it available for scheduled runs. See Publish a node or workflow.

After publishing, monitor the node's scheduled status in Operation Center. See Get started with Operation Center.

FAQ

Why does a "DlfMetaStoreClientFactory not found" error occur when running spark-submit in YARN-Cluster mode after enabling Kerberos?

See this troubleshooting entry for the cause and resolution.

The EMR Spark SQL node fails when SQL Compute is not running — what should I check?

Tasks submitted to SQL Compute in Data Studio require the SQL Compute instance to be in the Running state. Check the status in Manage SQL sessions and start the instance if it is stopped.

What's next