DataWorks: Serverless Spark SQL node

Last Updated: Nov 13, 2025

You can create a Serverless Spark SQL node to process structured data. This node uses a distributed SQL query engine based on EMR Serverless Spark computing resources to execute jobs more efficiently.

Scope

  • Computing resource limitations: You can attach only EMR Serverless Spark computing resources. Ensure that network connectivity is available between the resource group and the computing resources.

  • Resource group limitation: You can run this type of job only in a Serverless resource group.

  • (Required if you use a RAM user to develop tasks) The RAM user is added to your DataWorks workspace as a member and is assigned the Development or Workspace Manager role. The Workspace Manager role has more permissions than necessary. Exercise caution when you assign the Workspace Manager role. For more information about how to add a member, see Add workspace members and assign roles to them.

    Note

    If you use an Alibaba Cloud account, you can skip this operation.

Create a node

For more information, see Create a node.

Develop the node

You can develop task code in the SQL editing area. The syntax supports catalog.database.tablename. If catalog is omitted, the task defaults to the cluster's default catalog. If catalog.database is omitted, the task defaults to the default database within the cluster's default catalog.

For more information about data catalogs, see Manage data catalogs in EMR Serverless Spark.
-- Replace <catalog.database.tablename> with your actual information.
SELECT * FROM <catalog.database.tablename>;
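If you omit the qualifiers, the defaults described above apply. A minimal sketch (tablename is a placeholder):

-- Omitting catalog and database resolves the name in the default database of the cluster's default catalog.
SELECT * FROM tablename;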

Define a variable in your code using the ${variable name} format. You can then assign a value to the variable on the right side of the node editing page, in the Scheduling Configuration section under Scheduling Parameters. This lets you dynamically pass parameters to your code in scheduling scenarios. For more information about the supported formats for scheduling parameters, see Supported formats for scheduling parameters. An example is provided below.

SHOW TABLES;
-- Define a variable named var using ${var}. If you assign the value ${yyyymmdd} to the variable,
-- a scheduled task can create a table whose name is suffixed with the data timestamp.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
  ip STRING COMMENT 'IP address',
  uid STRING COMMENT 'User ID'
)
PARTITIONED BY (
  dt STRING
); -- This can be used together with scheduling parameters.
Note

The maximum size of an SQL statement is 130 KB.

Test the node

  1. In the Debug Configuration section, select the Computing Resource and Resource Group.

    • Computing Resource: Select an attached EMR Serverless Spark computing resource. If no computing resources are available, select Create Computing Resource from the drop-down list.

    • Resource Group: Select a resource group that is attached to the workspace.

    • Script Parameters: When you configure the node content, you can define a variable in the ${Parameter Name} format. You must then configure the Parameter Name and Parameter Value in the Script Parameters section. When the task runs, the variable is dynamically replaced with its actual value, as shown in the sketch after this list. For more information, see Supported formats for scheduling parameters.

    • Serverless Spark Node Parameters: Runtime parameters for the Spark program. For the supported parameters, see Appendix: DataWorks parameters.
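The following sketch shows how a script parameter pairs with the code. The parameter name var and the value 20250101 are illustrative:

-- In Script Parameters, set Parameter Name to var and Parameter Value to 20250101.
-- At run time, ${var} below is replaced with 20250101.
SELECT * FROM userinfo_new_${var};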

  2. In the toolbar at the top of the node editing page, click Run to run the SQL task.

    Important

    Before you publish the node, you must synchronize the Serverless Spark Node Parameters under Debug Configuration with the Serverless Spark Node Parameters under Scheduling Configuration.

What to do next

  • Node Scheduling: If a node in a project folder must run on a recurring schedule, you can set a Scheduling Policy and configure the scheduling properties in the Scheduling Configuration section on the right side of the node.

  • Publish the node: To publish the task to the production environment, click the publish icon in the top toolbar. Nodes in a project folder are scheduled to run periodically only after they are published to the production environment.

  • Task O&M: After a task is published, you can view the running status of auto triggered tasks in the Operation Center. For more information, see Getting started with Operation Center.

Appendix: DataWorks parameters

The following advanced parameters are supported.

FLOW_SKIP_SQL_ANALYZE

The execution mode of SQL statements. Valid values:

  • true: Executes multiple SQL statements at a time.

  • false (default): Executes one SQL statement at a time.

Note

This parameter is only for testing the flow in the data development environment.
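For example, if a node contains the following two statements, setting FLOW_SKIP_SQL_ANALYZE to true submits them in one batch, whereas the default value false runs them one by one. A minimal sketch (demo_tbl is a placeholder):

CREATE TABLE IF NOT EXISTS demo_tbl (id BIGINT);
SELECT COUNT(*) FROM demo_tbl;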

DATAWORKS_SESSION_DISABLE

The task submission method. When you execute a task in Data Development, the task is submitted to SQL Compute for execution by default. You can use this parameter to specify whether the task is executed by SQL Compute or submitted to a queue.

  • true: The task is submitted to a queue for execution. By default, the queue specified when you attached the computing resource is used. In this case, you can also configure the SERVERLESS_QUEUE_NAME parameter to specify the queue to which tasks are submitted in Data Development.

  • false (default): The task is submitted to SQL Compute for execution.

    Note

    This parameter takes effect only during execution in Data Development. It does not take effect during scheduled runs.
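A hedged sketch of how these two related parameters might be set together in the advanced settings (my_dev_queue is a hypothetical queue name):

DATAWORKS_SESSION_DISABLE : true
SERVERLESS_QUEUE_NAME : my_dev_queue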

SERVERLESS_RELEASE_VERSION

Specifies the Spark engine version. The default value is the Default Engine Version that is configured for the cluster under Computing Resource in the Management Center. You can configure this parameter to specify different engine versions for different tasks.

Note

The SERVERLESS_RELEASE_VERSION parameter in the advanced settings takes effect only when the SQL Compute (session) specified for the registered cluster is not started in the EMR Serverless Spark console.

SERVERLESS_QUEUE_NAME

The resource queue to which tasks are submitted. The default is the Default Resource Queue configured for the cluster in Cluster Management of the Management Center. You can add queues to meet resource isolation and management requirements. For more information, see Manage Resource Queues.

Configuration methods:

  • During execution in Data Development: You must first set DATAWORKS_SESSION_DISABLE to true so that the task is submitted to a queue for execution. The SERVERLESS_QUEUE_NAME parameter takes effect only after you perform this step.

  • During scheduled execution in Operation Center: The task is forcibly submitted to a queue for execution and cannot be submitted to SQL Compute.

Note

The SERVERLESS_QUEUE_NAME parameter in the advanced settings takes effect only when the SQL Compute (session) specified for the registered cluster is not started in the EMR Serverless Spark console.

SERVERLESS_SQL_COMPUTE

Specifies the SQL Compute (SQL session), which by default is the Default SQL Compute configured for the cluster under Computing Resource in the Management Center. You can configure this parameter to specify different SQL sessions for different types of tasks. For more information, see Manage SQL sessions.

Other

Custom Spark configuration parameters. You can add Spark property parameters as needed.

Format: spark.eventLog.enabled : false. When the code is sent to the cluster, DataWorks automatically converts each property to the --conf key=value format.
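For example, based on the stated format, the property above would reach the engine as:

--conf spark.eventLog.enabled=false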

Note

You can configure global Spark parameters at the workspace level for each DataWorks service, and specify whether these global parameters take precedence over the Spark parameters configured for a single task in that service. For more information, see Configure global Spark parameters.