
DataWorks:Create an EMR Spark SQL node

Last Updated:Mar 26, 2026

An EMR Spark SQL node runs distributed SQL queries on your EMR cluster through DataWorks scheduling. Use it to process structured data, schedule recurring SQL jobs, and promote tested code to production—all within DataStudio.

Prerequisites

Before you begin, make sure you have:

  • An EMR cluster registered with DataWorks. See Legacy Data Development: Attach EMR computing resources.

  • A serverless resource group or exclusive resource group for scheduling, purchased and configured (attached to a workspace with network connectivity). See Add and use a Serverless resource group.

  • A RAM user added to the workspace with the Developer or Workspace Administrator role. Grant Workspace Administrator only when necessary—it carries extensive permissions. See Add workspace members and assign roles to them.

  • A business flow created in DataStudio. All node development is organized within business flows. See Create a business flow.

  • (Optional) A custom image, if you need a specific component environment (for example, to replace Spark JAR packages or bundle specific libraries). Build a custom image from the official dataworks_emr_base_task_pod image. See Custom images.

Limitations

  • EMR Spark SQL nodes run only on a serverless resource group or an exclusive resource group for scheduling. Use a serverless resource group unless you have a specific reason to use an exclusive resource group. If you use a custom image in DataStudio, use a serverless computing resource group.

  • To manage metadata for DataLake or custom clusters, configure EMR-HOOK on the cluster. Without it, DataWorks cannot display real-time metadata, generate audit logs, show data lineage, or perform EMR-related administration. See Configure EMR-HOOK for Spark SQL.

The following table summarizes feature availability by cluster type:

| Feature | DataLake/Custom cluster (EMR on ECS) | EMR Serverless Spark | Spark cluster (EMR on ACK) | Hadoop cluster (EMR on ECS) |
| --- | --- | --- | --- | --- |
| Data lineage | Supported | Supported | Not supported | Not supported |
| Function registration (visualization) | Supported | Not supported | Not supported | Not supported |
| JDBC submission (ENABLE_SPARKSQL_JDBC) | Supported | Not supported | Not supported | Not supported |
| Gateway submission (USE_GATEWAY) | Not supported | Not supported | Not supported | Supported |
| SQL Compute session control | Not supported | Supported | Not supported | Not supported |
| Custom Spark parameters | Supported | Supported | Supported | Supported |

Usage notes

These notes apply if Ranger access control is enabled for Spark in the EMR cluster associated with your workspace.

  • Default image: Ranger access control is enabled automatically.

  • Custom image: Submit a support ticket to request an image upgrade that supports Ranger access control.

Step 1: Create an EMR Spark SQL node

  1. Log on to the DataWorks console. In the top navigation bar, select your region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.

  2. In DataStudio, right-click the target business flow and choose New Node > EMR > EMR Spark SQL. Alternatively, hover over New and choose New Node > EMR > EMR Spark SQL.

  3. In the New Node dialog box, set Name, Engine Instance, Node Type, and Path, then click Confirm. The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).

The configuration tab for the new EMR Spark SQL node opens.

Step 2: Develop the SQL task

Write SQL code

In the SQL editing area, write your Spark SQL statements.

To pass dynamic values at runtime—such as date partitions—define variables using the ${variable_name} format, then assign values in Scheduling Configuration > Scheduling Parameters in the right-side pane. These scheduling parameters apply during production scheduled runs.

For development testing, assign constant values in the Run dialog box instead of using scheduling parameter values. This lets you validate SQL logic without depending on a live schedule.

SHOW TABLES;

-- Use ${var} to define a variable.
-- In Scheduling Parameters, assign a value such as $[yyyymmdd] to var to append a date suffix to the table name at runtime.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
  ip  STRING COMMENT 'IP address',
  uid STRING COMMENT 'User ID'
)
PARTITIONED BY (dt STRING);

A single SQL statement cannot exceed 130 KB. If multiple EMR computing resources are attached to your workspace, select the target computing resource before running.
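Conceptually, the variable mechanism above is a placeholder substitution: scheduling-parameter values replace `${name}` tokens in the node code before submission. The sketch below is an illustrative model only, not DataWorks internals; the function name and parameter dictionary are hypothetical.

```python
import re

# Illustrative sketch only (not DataWorks source): replace ${name}
# placeholders in node code with assigned scheduling-parameter values.
def substitute_variables(sql: str, params: dict) -> str:
    # An unassigned variable raises KeyError, surfacing the missing value.
    return re.sub(r"\$\{(\w+)\}", lambda m: params[m.group(1)], sql)

sql = "CREATE TABLE IF NOT EXISTS userinfo_new_${var} (ip STRING, uid STRING)"
print(substitute_variables(sql, {"var": "20260326"}))
# → CREATE TABLE IF NOT EXISTS userinfo_new_20260326 (ip STRING, uid STRING)
```

For development runs, the same substitution happens with the constant values you enter in the Run dialog box instead of scheduled values.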

Configure advanced parameters (optional)

On the Advanced Settings tab, set Spark-specific parameters for your cluster type. The available parameters differ by cluster type. For the full list of Spark configuration keys, see Spark Configuration.

DataLake cluster/Custom cluster: EMR on ECS

  • queue: The YARN scheduling queue for job submission. Default: default. See Basic queue configurations.

  • priority: Job priority. Default: 1.

  • FLOW_SKIP_SQL_ANALYZE: SQL execution mode. true: run all SQL statements in one batch. false (default): run one statement at a time. Applies to the development environment only.

  • ENABLE_SPARKSQL_JDBC: SQL submission method. true: submit via Java Database Connectivity (JDBC). If the cluster has Kyuubi, the job is submitted to Kyuubi over JDBC and custom Spark parameters are supported; otherwise it is submitted to the Spark Thrift Server. Both paths support metadata lineage, but Thrift Server submissions omit node output information. false (default): submit via spark-submit in cluster mode, which supports metadata lineage, output information, and custom Spark parameters for both Spark 2 and Spark 3. In spark-submit mode, DataWorks creates temporary files in the /tmp directory on the cluster's Hadoop Distributed File System (HDFS); make sure that directory has read and write permissions. Custom SparkConf parameters added here are automatically appended to the submitted command (for example, "spark.driver.memory": "2g").

  • DATAWORKS_SESSION_DISABLE: JDBC connection behavior for development-environment runs. true: create a new JDBC connection for each SQL statement. false (default): reuse the same JDBC connection across statements in a node. When set to false, the YARN applicationId is not printed; set to true to print it.

  • Other: Any Spark Configuration key-value pair. Format: "spark.eventLog.enabled": false. DataWorks converts it to --conf key=value before sending to the cluster. To enable Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger in Set global Spark parameters.
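The "Other" entry above describes a simple mapping: each key-value pair becomes a --conf flag on the submitted command. The sketch below illustrates that conversion; it is a hypothetical model, not DataWorks source code.

```python
# Illustrative sketch only: render an advanced-parameter map into
# spark-submit flags, per the "--conf key=value" conversion described above.
def to_spark_submit_conf(params: dict) -> list[str]:
    args = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Booleans are lowercased to match Spark's string convention.
            value = str(value).lower()
        args += ["--conf", f"{key}={value}"]
    return args

print(to_spark_submit_conf({"spark.eventLog.enabled": False,
                            "spark.driver.memory": "2g"}))
# → ['--conf', 'spark.eventLog.enabled=false', '--conf', 'spark.driver.memory=2g']
```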

EMR Serverless Spark cluster

For the full parameter reference, see Set parameters for submitting Spark jobs.

  • FLOW_SKIP_SQL_ANALYZE: SQL execution mode. Same as above.

  • DATAWORKS_SESSION_DISABLE: Job submission target for development-environment runs. true: submit to a resource queue (uses the default queue configured in Management Center > Cluster Management). false (default): submit to SQL Compute. This parameter takes effect only during execution in Data Development, not during scheduled runtime. When set to true, set SERVERLESS_QUEUE_NAME to specify the target queue.

  • SERVERLESS_RELEASE_VERSION: DPI engine version. Default: the Default DPI Engine Version set in Management Center > Cluster Management. This parameter takes effect only when the SQL Compute session for the registered cluster is not running in the EMR Serverless Spark console.

  • SERVERLESS_QUEUE_NAME: Resource queue for job submission. Default: the Default Resource Queue in Management Center > Cluster Management. When running in Data Development, you must first set DATAWORKS_SESSION_DISABLE to true so the job is submitted to a queue; only then does this parameter take effect. During scheduled execution in Operation Center, the task is forcibly submitted to a queue and cannot be submitted to SQL Compute. For queue management, see Manage resource queues. You can also set the queue globally via Set global Spark parameters.

  • SERVERLESS_SQL_COMPUTE: SQL Compute session to use. Default: the Default SQL Compute in Management Center > Cluster Management. To manage sessions, see Manage SQL sessions.

  • Other: Any Spark configuration key-value pair. Format: "spark.eventLog.enabled": "false". DataWorks converts it to --conf key=value. See Set global Spark parameters.
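The interaction between DATAWORKS_SESSION_DISABLE and SERVERLESS_QUEUE_NAME for development runs can be summarized as a small decision rule. The function below is a hypothetical illustration of that rule, not a DataWorks API; all names are invented for the sketch.

```python
from typing import Optional

# Illustrative decision sketch for Serverless Spark development runs:
# DATAWORKS_SESSION_DISABLE picks the target, and SERVERLESS_QUEUE_NAME
# overrides the default queue only when the session is disabled.
def submission_target(session_disable: bool,
                      queue_name: Optional[str],
                      default_queue: str,
                      default_sql_compute: str) -> str:
    if session_disable:
        # Jobs go to a resource queue; an explicit queue name wins.
        return f"queue:{queue_name or default_queue}"
    # Otherwise the job goes to the configured SQL Compute session.
    return f"sql_compute:{default_sql_compute}"

print(submission_target(True, "prod_queue", "default", "sc-default"))
# → queue:prod_queue
print(submission_target(False, "prod_queue", "default", "sc-default"))
# → sql_compute:sc-default
```

During scheduled runs in Operation Center, the queue branch is always taken, per the SERVERLESS_QUEUE_NAME description above.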

Spark cluster: EMR on ACK

  • FLOW_SKIP_SQL_ANALYZE: SQL execution mode. Same as above.

  • Other: Any Spark configuration key-value pair. Format: "spark.eventLog.enabled": false. DataWorks converts it to --conf key=value. See Set global Spark parameters.

Hadoop cluster: EMR on ECS

  • queue: YARN scheduling queue. Default: default. See Basic queue configurations.

  • priority: Job priority. Default: 1.

  • FLOW_SKIP_SQL_ANALYZE: SQL execution mode. Same as above.

  • USE_GATEWAY: Whether to submit the job through a gateway cluster. true: route through a gateway cluster. false (default): submit directly to the header node. Setting this to true on a cluster without an associated gateway cluster causes subsequent job submissions to fail.

  • Other: Any Spark configuration key-value pair. Format: "spark.eventLog.enabled": false. DataWorks converts it to --conf key=value. To enable Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger in Set global Spark parameters.

Save and run the task

DataLake cluster / Custom cluster

Click the Save icon to save the node. Then click the Run icon to run it.

In the Run dialog box, select a resource group that has passed the network connectivity test—this makes sure DataWorks can reach your Spark service. If your SQL uses variables, assign constant values for the test run.

To override parameter values before running, click Advanced Run on the toolbar. For details on assignment logic, see What is the difference in assignment logic among Run, Advanced Run, and smoke testing in the development environment?

For debugging guidance, see Task debugging process.

EMR Serverless Spark / Spark cluster (EMR on ACK) / Hadoop cluster

  1. Click the Advanced Run icon. In the Parameters dialog box, select your scheduling resource group and click Run. Queries return a maximum of 10,000 records and 10 MB of data.

    The resource group must have passed the connectivity test with the computing resource. For network setup options, see Network connectivity solutions. To switch resource groups for future runs, click the Advanced Run icon and select the target group.
  2. Click the Save icon to save.

  3. (Optional) Run a smoke test to validate the node before scheduling. See Perform a smoke test.

Step 3: Configure scheduling

To run the node on a recurring schedule, click Properties in the right-side pane and configure the scheduling options.

Configure Rerun and Parent Nodes before committing the task. If you need a custom component environment, create a custom image from the official dataworks_emr_base_task_pod image and use it in DataStudio.

For the full scheduling configuration reference, see Overview.

Step 4: Publish the node

After configuring the node, commit and deploy it. The system then runs it on the schedule you configured.

  1. Click the Save icon to save the node.

  2. Click the Commit icon to commit the node. In the Submit dialog box, enter a change description and choose whether to trigger a code review.

    Configure Rerun and Parent Nodes on the Properties tab before committing. If code review is enabled, the committed code must pass review before it can be deployed. See Code review.
  3. If your workspace is in standard mode, deploy the node to the production environment after committing. Click Deploy in the upper-right corner of the configuration tab. See Deploy nodes.

What's next

After deploying, the node runs automatically on its configured schedule. To monitor execution, click Operation Center in the upper-right corner of the configuration tab. See View and manage auto triggered tasks.

FAQ

The EMR Spark task is submitted to SQL Compute but fails immediately.

Check whether the SQL Compute service is in Running status. If it is stopped or starting, the submission fails. See Manage SQL sessions to view and start the session.

A `DlfMetaStoreClientFactory not found` error occurs after enabling Kerberos on the EMR cluster.

This error appears when using YARN-Cluster mode with spark-submit in an EMR Spark node after Kerberos is enabled. See the troubleshooting entry for the resolution steps.