DataWorks: Serverless Spark SQL node

Last Updated: Jun 30, 2026

Create a Serverless Spark SQL node to process structured data with a distributed SQL query engine based on an EMR Serverless Spark compute resource. This approach improves job execution efficiency.

Usage notes

  • Compute resource limitations: Only bound EMR Serverless Spark compute resources are supported. You must ensure that the resource group and the compute resource can communicate over the network.

  • Resource group constraints: This task runs only in a Serverless resource group.

  • (Optional, for RAM users) The Resource Access Management (RAM) user who develops tasks must be added to the workspace and assigned the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information, see Add workspace members.

    If you use an Alibaba Cloud account (root account), skip this step.

Create a node

For instructions, see Create a node.

Develop the node

Develop your task code in the SQL editor. Reference tables in the catalog.database.tablename format. If you omit catalog, the cluster's default Data Catalog is used. If you omit catalog.database, the default database in the cluster's default Data Catalog is used.

For more information about Data Catalog, see Manage a Data Catalog in EMR Serverless Spark.

-- Replace <catalog.database.tablename> with your actual identifiers.
SELECT * FROM <catalog.database.tablename>;
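The defaulting rule for omitted name parts can be sketched in Python. This is only an illustration of the rule described above, not DataWorks or Spark code: the function name qualify_table and the example names main_catalog and default are hypothetical, and the sketch does not handle quoted identifiers that contain dots.

```python
def qualify_table(name: str, default_catalog: str, default_database: str) -> str:
    """Expand a one-, two-, or three-part table reference to catalog.database.tablename."""
    parts = name.split(".")
    if len(parts) == 3:
        # catalog.database.tablename: nothing to fill in.
        return name
    if len(parts) == 2:
        # database.tablename: the default Data Catalog is used.
        return f"{default_catalog}.{name}"
    if len(parts) == 1:
        # tablename only: the default database in the default Data Catalog is used.
        return f"{default_catalog}.{default_database}.{name}"
    raise ValueError(f"expected at most three name parts: {name!r}")


# Hypothetical defaults, for illustration only.
print(qualify_table("orders", "main_catalog", "default"))
print(qualify_table("sales.orders", "main_catalog", "default"))
```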

Define variables in your code by using the ${variable_name} format. Then, assign values to these variables in the Scheduling Parameters section of the Scheduling Settings pane. This enables dynamic parameter passing in scheduled scenarios. For more information about scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example.

SHOW TABLES;
-- ${var} defines a variable named var. If you assign the value ${yyyymmdd} to var, a scheduled task creates a table whose name has the business date as a suffix.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
  ip STRING COMMENT 'IP address',
  uid STRING COMMENT 'User ID'
)
PARTITIONED BY (
  dt STRING
); -- The table name suffix comes from the scheduling parameter.

Note

An SQL statement cannot exceed 130 KB.
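To preview what a statement looks like after variable replacement, the substitution can be approximated locally. The following Python sketch is not how DataWorks resolves parameters internally. The helper names render_sql and check_size are hypothetical, the date value is an example of an already resolved business date, and the size check assumes that 130 KB means 130 × 1024 bytes of UTF-8 text.

```python
import re

# Assumption: "130 KB" is interpreted as 130 * 1024 bytes of UTF-8 text.
MAX_SQL_BYTES = 130 * 1024

# Matches ${name} placeholders made of letters, digits, and underscores.
_VARIABLE = re.compile(r"\$\{(\w+)\}")


def render_sql(template: str, params: dict) -> str:
    """Replace every ${name} placeholder with its assigned value."""
    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError("no value assigned to variable " + name)
        return str(params[name])

    return _VARIABLE.sub(substitute, template)


def check_size(statement: str) -> None:
    """Raise ValueError if a single statement exceeds the assumed size limit."""
    size = len(statement.encode("utf-8"))
    if size > MAX_SQL_BYTES:
        raise ValueError(f"statement is {size} bytes, limit is {MAX_SQL_BYTES}")


template = "CREATE TABLE IF NOT EXISTS userinfo_new_${var} (ip STRING, uid STRING)"
statement = render_sql(template, {"var": "20260629"})
check_size(statement)
print(statement)
```

Raising an error for an unassigned variable is a design choice of this sketch. It surfaces a missing scheduling parameter before the statement is submitted.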

Debug the node

  1. In the Run Configuration section, select a Compute Resource and Resource Group.

    The following parameters are available:

    • Compute Resource: Select a bound EMR Serverless Spark compute resource. If no compute resource is available, select Create Compute Resource from the drop-down list.

    • Resource Group: Select a resource group that is associated with the workspace.

    • Script Parameters: When you configure the node content, define variables in the ${parameter_name} format. Then, configure the Parameter Name and Parameter Value in the Script Parameters section. The variables are replaced with their actual values at runtime. For more information, see Sources and expressions of scheduling parameters.

    • ServerlessSpark Node Parameters: Spark runtime parameters. For the supported parameters, see Appendix: DataWorks parameters.

  2. On the toolbar at the top of the node editor, click Run to run the SQL task.

    Important

    Before deployment, synchronize the ServerlessSpark Node Parameters from Run Configuration to the ServerlessSpark Node Parameters in Scheduling Settings.

SQL execution modes

DataWorks provides two SQL execution modes: running within the node and running from the workflow panel. These two modes have different execution contexts and may produce different results.

  • Run within the node: Select SQL statements in the SQL editor and click Run. Only the selected statements are executed. This mode uses the compute resource and parameters configured for the node and does not trigger upstream or downstream nodes.

  • Run from the workflow panel: Right-click the current node in the workflow panel and choose to run the current node and its downstream nodes. The nodes are executed sequentially based on their scheduling dependencies. This mode uses the compute resource and parameters specified in Scheduling Settings, which may differ from those used when you run within the node.

Common causes for differences between the two modes include different scheduling parameter values, inconsistent compute resource configurations, and the output of upstream nodes affecting the execution context of the current node. To troubleshoot, compare the scheduling parameters and compute resource configurations used by the two modes.
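One way to carry out this comparison is to write down the settings of both modes and list only the keys that differ. The Python sketch below assumes that you copy the values by hand from Run Configuration and Scheduling Settings. The function name diff_settings and all keys and values are hypothetical.

```python
def diff_settings(node_run: dict, workflow_run: dict) -> dict:
    """Return {key: (node_value, workflow_value)} for every setting that differs.

    A key that is missing on one side is reported with None for that side.
    """
    differences = {}
    for key in sorted(set(node_run) | set(workflow_run)):
        left = node_run.get(key)
        right = workflow_run.get(key)
        if left != right:
            differences[key] = (left, right)
    return differences


# Hypothetical values copied from the two configuration panes.
node_run = {"var": "20260629", "compute_resource": "spark_dev"}
workflow_run = {
    "var": "20260630",
    "compute_resource": "spark_dev",
    "SERVERLESS_QUEUE_NAME": "etl_queue",
}

for key, (node_value, workflow_value) in diff_settings(node_run, workflow_run).items():
    print(f"{key}: node run = {node_value!r}, workflow run = {workflow_value!r}")
```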

Next steps

  • Configure node scheduling: If you need to run a node periodically, configure its Scheduling Policy in the Scheduling Settings panel on the right.

  • Publish a node: To run a task in the production environment, click the publish icon to publish the node. A node runs on schedule only after it is published to the production environment.

  • Task O&M: After a task is published, you can monitor the status of its periodic runs in the Operation Center. For more information, see Get started with Operation Center.

Appendix: DataWorks parameters

Each advanced parameter is listed below, followed by its description.

FLOW_SKIP_SQL_ANALYZE

The SQL statement execution mode. Valid values:

  • true: Multiple SQL statements are executed at a time.

  • false (default): A single SQL statement is executed at a time.

Note

This parameter is supported only for test runs in the data development environment.

DATAWORKS_SESSION_DISABLE

The task submission method. During data development, tasks are submitted to SQL Compute for execution by default. You can use this parameter to specify whether the task is executed through SQL Compute or submitted to a queue for execution.

  • true: The task is submitted to a queue for execution. By default, the default queue specified when the compute resource was associated is used. When DATAWORKS_SESSION_DISABLE is set to true, you can configure the SERVERLESS_QUEUE_NAME parameter to specify the queue to which tasks are submitted during data development.

  • false (default): The task is submitted to SQL Compute for execution.

    Note

    This parameter takes effect only during data development execution, not during scheduled runs.

SERVERLESS_RELEASE_VERSION

The Spark engine version. By default, the Default Engine Version configured in the cluster settings under Compute Resource in Management Center is used. To set a different engine version for a specific task, configure this parameter.

Note

The SERVERLESS_RELEASE_VERSION parameter in the advanced configurations takes effect only when the SQL Compute (session) specified for the registered cluster is in a stopped state in the EMR Serverless Spark console.

SERVERLESS_QUEUE_NAME

The resource queue to which tasks are submitted. When a task is configured to be submitted to a queue for execution, the Default Resource Queue configured in the cluster settings under Clusters in Management Center is used by default. If you have resource isolation and management requirements, you can add queues. For more information, see Manage resource queues.

Configuration methods:

  • Specify the resource queue for task submission by configuring node parameters.

  • Specify the resource queue for task submission by configuring global Spark parameters.

Note

  • The SERVERLESS_QUEUE_NAME parameter in the advanced configurations takes effect only when the SQL Compute (session) specified for the registered cluster is in a stopped state in the EMR Serverless Spark console.

  • During data development execution: You must first set DATAWORKS_SESSION_DISABLE to true so that tasks are submitted to a queue for execution. Only then does the SERVERLESS_QUEUE_NAME parameter take effect for specifying the task queue.

  • During Operation Center scheduled execution: Tasks are forcibly submitted to a queue for execution and cannot be submitted to SQL Compute.

SERVERLESS_SQL_COMPUTE

The SQL Compute (SQL session) to use. By default, the Default SQL Compute configured in the cluster settings under Compute Resource in Management Center is used. To set a different SQL session for a specific task, configure this parameter. For information about how to create and manage SQL sessions, see Manage SQL Compute sessions.
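The interaction between DATAWORKS_SESSION_DISABLE, SERVERLESS_QUEUE_NAME, and SERVERLESS_SQL_COMPUTE can be summarized as a small decision function. The Python sketch below only restates the rules in this appendix and is not DataWorks code. The function name submission_target and the example queue and session names are hypothetical, and the sketch does not model the condition that some parameters take effect only while the SQL Compute session is stopped.

```python
def submission_target(scheduled, params, default_queue, default_sql_compute):
    """Return ("queue", name) or ("sql_compute", name) for a task run.

    scheduled: True for an Operation Center scheduled run,
               False for a run during data development.
    params:    advanced parameters configured for the node.
    """
    # Accept both a Python bool and the strings "true" / "false".
    session_disabled = str(params.get("DATAWORKS_SESSION_DISABLE", "false")).lower() == "true"

    # Scheduled runs are always submitted to a queue. Development runs are
    # submitted to a queue only when DATAWORKS_SESSION_DISABLE is true.
    if scheduled or session_disabled:
        return ("queue", params.get("SERVERLESS_QUEUE_NAME") or default_queue)

    # Otherwise the task is submitted to SQL Compute.
    return ("sql_compute", params.get("SERVERLESS_SQL_COMPUTE") or default_sql_compute)


# Hypothetical defaults, for illustration only.
print(submission_target(False, {}, "root_queue", "default_session"))
print(submission_target(
    False,
    {"DATAWORKS_SESSION_DISABLE": True, "SERVERLESS_QUEUE_NAME": "etl_queue"},
    "root_queue",
    "default_session",
))
```

Note that in this sketch a SERVERLESS_QUEUE_NAME value has no effect on a development run unless DATAWORKS_SESSION_DISABLE is also true, which mirrors the note above.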

Others

Custom Spark configuration parameters. Use this entry to add Spark properties.

Configuration format: "spark.eventLog.enabled": false. DataWorks automatically appends each parameter to the code submitted to the EMR cluster in the --conf key=value format.

Note

DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the priority of the global Spark parameters is higher than that of Spark parameters configured within a specific module. For more information about configuring global Spark parameters, see Configure global Spark parameters.
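The --conf key=value format described for custom Spark parameters can be illustrated with a short Python sketch. It only shows the shape of the resulting arguments and is not the code that DataWorks runs. The function name to_conf_args is hypothetical, and rendering Boolean values as lowercase true or false is an assumption based on the "spark.eventLog.enabled": false example.

```python
def to_conf_args(spark_conf: dict) -> list:
    """Turn {"key": value} Spark properties into ["--conf", "key=value", ...]."""
    args = []
    for key, value in spark_conf.items():
        if isinstance(value, bool):
            # Assumption: Booleans are written as lowercase true / false.
            value = str(value).lower()
        args += ["--conf", f"{key}={value}"]
    return args


print(to_conf_args({
    "spark.eventLog.enabled": False,
    "spark.sql.shuffle.partitions": 200,
}))
```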