An EMR Spark SQL node runs distributed SQL queries on your EMR cluster through DataWorks scheduling. Use it to process structured data, schedule recurring SQL jobs, and promote tested code to production—all within DataStudio.
Prerequisites
Before you begin, make sure you have:
An EMR cluster registered with DataWorks. See Legacy Data Development: Attach EMR computing resources.
A serverless resource group or exclusive resource group for scheduling, purchased and configured (attached to a workspace with network connectivity). See Add and use a Serverless resource group.
A RAM user added to the workspace with the Developer or Workspace Administrator role. Grant Workspace Administrator only when necessary—it carries extensive permissions. See Add workspace members and assign roles to them.
A business flow created in DataStudio. All node development is organized within business flows. See Create a business flow.
(Optional) A custom image, if you need a specific component environment (for example, to replace Spark JAR packages or bundle specific libraries). Build a custom image from the official dataworks_emr_base_task_pod image. See Custom images.
Limitations
EMR Spark SQL nodes run only on a serverless resource group or an exclusive resource group for scheduling. Use a serverless resource group unless you have a specific reason to use an exclusive resource group. If you use a custom image in DataStudio, use a serverless computing resource group.
To manage metadata for DataLake or custom clusters, configure EMR-HOOK on the cluster. Without it, DataWorks cannot display real-time metadata, generate audit logs, show data lineage, or perform EMR-related administration. See Configure EMR-HOOK for Spark SQL.
The following table summarizes feature availability by cluster type:
| Feature | DataLake/Custom cluster (EMR on ECS) | EMR Serverless Spark | EMR on ACK (Spark cluster) | Hadoop cluster (EMR on ECS) |
|---|---|---|---|---|
| Data lineage | Supported | Supported | Not supported | — |
| Function registration (visualization) | Supported | Not supported | Not supported | — |
| JDBC submission (ENABLE_SPARKSQL_JDBC) | Supported | — | — | — |
| Gateway submission (USE_GATEWAY) | — | — | — | Supported |
| SQL Compute session control | — | Supported | — | — |
| Custom Spark parameters | Supported | Supported | Supported | Supported |
Usage notes
These notes apply if Ranger access control is enabled for Spark in the EMR cluster associated with your workspace.
Default image: Ranger access control is enabled automatically.
Custom image: Submit a support ticket to request an image upgrade that supports Ranger access control.
Step 1: Create an EMR Spark SQL node
Log on to the DataWorks console. In the top navigation bar, select your region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.
In DataStudio, right-click the target business flow and choose New Node > EMR > EMR Spark SQL. Alternatively, hover over New and choose New Node > EMR > EMR Spark SQL.
In the New Node dialog box, set Name, Engine Instance, Node Type, and Path, then click Confirm. The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
The configuration tab for the new EMR Spark SQL node opens.
Step 2: Develop the SQL task
Write SQL code
In the SQL editing area, write your Spark SQL statements.
To pass dynamic values at runtime—such as date partitions—define variables using the ${variable_name} format, then assign values in Scheduling Configuration > Scheduling Parameters in the right-side pane. These scheduling parameters apply during production scheduled runs.
For development testing, assign constant values in the Run dialog box instead of using scheduling parameter values. This lets you validate SQL logic without depending on a live schedule.
SHOW TABLES;
-- Use ${var} to define a variable.
-- Assign ${yyyymmdd} to var to append a date suffix to table names at runtime.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
  ip STRING COMMENT 'IP address',
  uid STRING COMMENT 'User ID'
)
PARTITIONED BY (dt STRING);

A single SQL statement cannot exceed 130 KB. If multiple EMR computing resources are attached to your workspace, select the target computing resource before running.
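To illustrate how the variable flows through a scheduled run, a follow-up statement could load the day's data into the matching partition. This is a sketch: the source table userinfo_ods, its ds column, and the assumption that var resolves to a date string such as 20240101 are all hypothetical, not part of the original task.

```sql
-- Hypothetical follow-up step: populate the day's partition,
-- assuming var resolves to a date string (e.g. 20240101) at runtime.
INSERT OVERWRITE TABLE userinfo_new_${var} PARTITION (dt = '${var}')
SELECT ip, uid
FROM userinfo_ods          -- hypothetical source table
WHERE ds = '${var}';       -- hypothetical source partition column
```

Because ${var} appears in both the table name and the partition value, a single scheduling-parameter assignment keeps the two in sync across daily runs.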
Configure advanced parameters (optional)
On the Advanced Settings tab, set Spark-specific parameters for your cluster type. The available parameters differ by cluster type. For the full list of Spark configuration keys, see Spark Configuration.
DataLake cluster/Custom cluster: EMR on ECS
| Parameter | Description |
|---|---|
| queue | The YARN scheduling queue for job submission. Default: default. See Basic queue configurations. |
| priority | Job priority. Default: 1. |
| FLOW_SKIP_SQL_ANALYZE | SQL execution mode. true: run all SQL statements in one batch. false (default): run one statement at a time. Applies to the development environment only. |
| ENABLE_SPARKSQL_JDBC | SQL submission method. true: submit via Java Database Connectivity (JDBC). If the cluster has Kyuubi, submits to Kyuubi using JDBC, and custom Spark parameters are supported; otherwise submits to the Spark Thrift Server. Both paths support metadata lineage, but Thrift Server submissions omit node output information. false (default): submit via spark-submit cluster mode, which supports metadata lineage, output information, and custom Spark parameters for both Spark 2 and Spark 3. In spark-submit mode, DataWorks creates temporary files in the /tmp directory on the cluster's Hadoop Distributed File System (HDFS)—make sure that directory has read and write permissions. Custom SparkConf parameters added here are automatically appended to the submitted command (for example, "spark.driver.memory": "2g"). |
| DATAWORKS_SESSION_DISABLE | JDBC connection behavior for development-environment runs. true: create a new JDBC connection for each SQL statement. false (default): reuse the same JDBC connection across statements in a node. When set to false, the YARN applicationId is not printed; set to true to print it. |
| Other | Any Spark Configuration key-value pair. Format: "spark.eventLog.enabled": false. DataWorks converts it to --conf key=value before sending to the cluster. To set Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger in Set global Spark parameters. |
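As a sketch, the advanced settings for a DataLake cluster node that forces spark-submit mode and tunes executor resources might look like the following. The memory and instance values are illustrative assumptions, not recommendations:

```
"queue": "default"
"FLOW_SKIP_SQL_ANALYZE": "false"
"ENABLE_SPARKSQL_JDBC": "false"
"spark.driver.memory": "2g"
"spark.executor.instances": "4"
```

Keys that DataWorks does not recognize as its own (the last two here) are passed to the cluster as --conf key=value pairs on the generated spark-submit command.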
EMR Serverless Spark cluster
For the full parameter reference, see Set parameters for submitting Spark jobs.
| Parameter | Description |
|---|---|
| FLOW_SKIP_SQL_ANALYZE | SQL execution mode. Same as above. |
| DATAWORKS_SESSION_DISABLE | Job submission target for development-environment runs. true: submit to a resource queue (uses the default queue configured in Management Center > Cluster Management). false (default): submit to SQL Compute. This parameter takes effect only during execution in Data Development and does not take effect during scheduled runtime. When set to true, set SERVERLESS_QUEUE_NAME to specify the target queue. |
| SERVERLESS_RELEASE_VERSION | DPI engine version. Default: the Default DPI Engine Version set in Management Center > Cluster Management. This parameter takes effect only when the SQL Compute session for the registered cluster is not running in the EMR Serverless Spark console. |
| SERVERLESS_QUEUE_NAME | Resource queue for job submission. Default: the Default Resource Queue in Management Center > Cluster Management. When running in Data Development, you must first set DATAWORKS_SESSION_DISABLE to true so the job is submitted to a queue; only then does this parameter take effect. During scheduled execution in Operation Center, the task is forcibly submitted to a queue and cannot be submitted to SQL Compute. For queue management, see Manage resource queues. You can also set the queue globally via Set global Spark parameters. |
| SERVERLESS_SQL_COMPUTE | SQL Compute session to use. Default: the Default SQL Compute in Management Center > Cluster Management. To manage sessions, see Manage SQL sessions. |
| Other | Any Spark configuration key-value pair. Format: "spark.eventLog.enabled": "false". DataWorks converts it to --conf key=value. See Set global Spark parameters. |
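For an EMR Serverless Spark cluster, routing development-environment runs to a specific resource queue instead of SQL Compute could look like the sketch below. The queue name dev_queue and the engine version string are hypothetical placeholders; substitute the values configured for your cluster:

```
"FLOW_SKIP_SQL_ANALYZE": "false"
"DATAWORKS_SESSION_DISABLE": "true"
"SERVERLESS_QUEUE_NAME": "dev_queue"
```

With DATAWORKS_SESSION_DISABLE set to true, SERVERLESS_QUEUE_NAME takes effect for Data Development runs; during scheduled runs in Operation Center the job goes to a queue regardless.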
Spark cluster: EMR on ACK
| Parameter | Description |
|---|---|
| FLOW_SKIP_SQL_ANALYZE | SQL execution mode. Same as above. |
| Other | Any Spark configuration key-value pair. Format: "spark.eventLog.enabled": false. DataWorks converts it to --conf key=value. See Set global Spark parameters. |
Hadoop cluster: EMR on ECS
| Parameter | Description |
|---|---|
| queue | YARN scheduling queue. Default: default. See Basic queue configurations. |
| priority | Job priority. Default: 1. |
| FLOW_SKIP_SQL_ANALYZE | SQL execution mode. Same as above. |
| USE_GATEWAY | Whether to submit the job through a gateway cluster. true: route through a gateway cluster. false (default): submit directly to the header node. Setting this to true on a cluster without an associated gateway cluster causes subsequent job submissions to fail. |
| Other | Any Spark configuration key-value pair. Format: "spark.eventLog.enabled": false. DataWorks converts it to --conf key=value. To enable Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger in Set global Spark parameters. |
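For a Hadoop cluster, a minimal sketch that routes submission through a gateway cluster might look like this. The executor memory value is an illustrative assumption, and USE_GATEWAY must only be set to true if a gateway cluster is actually associated with the EMR cluster, or submissions will fail:

```
"queue": "default"
"priority": "1"
"USE_GATEWAY": "true"
"spark.executor.memory": "4g"
```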
Save and run the task
DataLake cluster / Custom cluster
Click the Save icon in the toolbar to save, then click the Run icon to run.
In the Run dialog box, select a resource group that has passed the network connectivity test—this makes sure DataWorks can reach your Spark service. If your SQL uses variables, assign constant values for the test run.
To override parameter values before running, click Advanced Run on the toolbar. For details on assignment logic, see What is the difference in assignment logic among Run, Advanced Run, and smoke testing in the development environment?
For debugging guidance, see Task debugging process.
Spark cluster (EMR on ACK) / Hadoop cluster
Click the Run icon. In the Parameters dialog box, select your scheduling resource group and click Run. Queries return a maximum of 10,000 records and 10 MB of data. The resource group must have passed the connectivity test with the computing resource. For network setup options, see Network connectivity solutions. To switch resource groups for future runs, click Run With Parameters and select the target group.
Click the Save icon to save.
(Optional) Run a smoke test to validate the node before scheduling. See Perform a smoke test.
Step 3: Configure scheduling
To run the node on a recurring schedule, click Properties in the right-side pane and configure the scheduling options.
Configure Rerun and Parent Nodes before committing the task. If you need a custom component environment, create a custom image from the official dataworks_emr_base_task_pod image and use it in DataStudio. For the full scheduling configuration reference, see Overview.
Step 4: Publish the node
After configuring the node, commit and deploy it. The system then runs it on the schedule you configured.
Click the Save icon to save, then click the Commit icon to commit. In the Submit dialog box, fill in the Change description and decide whether to trigger a code review.
Configure Rerun and Parent Nodes on the Properties tab before committing. If code review is enabled, the committed code must pass review before it can be deployed. See Code review.
If your workspace is in standard mode, deploy the node to the production environment after committing. Click Deploy in the upper-right corner of the configuration tab. See Deploy nodes.
What's next
After deploying, the node runs automatically on its configured schedule. To monitor execution, click Operation Center in the upper-right corner of the configuration tab. See View and manage auto triggered tasks.
FAQ
An EMR Spark task submitted to SQL Compute fails immediately.
Check whether the SQL Compute service is in Running status. If it is stopped or starting, the submission fails. See Manage SQL sessions to view and start the session.
A `DlfMetaStoreClientFactory not found` error occurs after enabling Kerberos on the EMR cluster.
This error appears when using YARN-Cluster mode with spark-submit in an EMR Spark node after Kerberos is enabled. See the troubleshooting entry for the resolution steps.