Create a Serverless Spark SQL node to process structured data with a distributed SQL query engine based on an EMR Serverless Spark compute resource. This approach improves job execution efficiency.
Usage notes
-
Compute resource limitations: Only bound EMR Serverless Spark compute resources are supported. You must ensure that the resource group and the compute resource can communicate over the network.
-
Resource group constraints: This task runs only in a Serverless resource group.
-
(Optional, for RAM users) The Resource Access Management (RAM) user that develops the task must be added to the workspace and assigned the Development or Workspace Administrator role. The Workspace Administrator role includes extensive permissions, so grant it with caution. For more information, see Add workspace members.
If you are using a root account, skip this step.
Create a node
For instructions, see Create a node.
Develop the node
Develop your task code in the SQL editor. The syntax supports catalog.database.tablename. If you omit catalog, the cluster's default Data Catalog is used. If you omit catalog.database, the default database in the cluster's default Data Catalog is used.
For more information about Data Catalog, see Manage a Data Catalog in EMR Serverless Spark.
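The three name resolution rules above can be illustrated with the following queries. This is a minimal sketch: my_catalog, sales_db, and orders are hypothetical identifiers, not names that exist in your cluster.

```sql
-- Hypothetical identifiers: my_catalog, sales_db, orders.

-- Fully qualified: catalog, database, and table are all explicit.
SELECT * FROM my_catalog.sales_db.orders LIMIT 10;

-- Catalog omitted: resolved against the cluster's default Data Catalog.
SELECT * FROM sales_db.orders LIMIT 10;

-- Catalog and database omitted: resolved against the default database
-- in the cluster's default Data Catalog.
SELECT * FROM orders LIMIT 10;
```

Using the fully qualified form in scheduled tasks avoids depending on the cluster's default settings.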
-- Replace <catalog.database.tablename> with your actual identifiers.
SELECT * FROM <catalog.database.tablename>;
Define variables in your code by using the ${variable_name} format. Then, assign values to these variables in the Scheduling Parameters section of the Scheduling Settings pane. This enables dynamic parameter passing in scheduled scenarios. For more information about scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example.
SHOW TABLES;
-- Use ${var} to define a variable named var. If you assign the value ${yyyymmdd} to this variable, you can use a scheduled task to create a table with the business date as a suffix.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
    ip STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
) PARTITIONED BY (
    dt STRING
); -- This can be used with scheduling parameters.
An SQL statement cannot exceed 130 KB.
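As a sketch of how the same variable might be reused after the table is created, the statements below load one partition and read it back. The example assumes that var is assigned ${yyyymmdd} in Scheduling Parameters, and that the placeholder is replaced textually before the statement reaches Spark. The source table userinfo_raw and its dt column are hypothetical.

```sql
-- Assumes ${var} is replaced with the business date in yyyymmdd format
-- before the statement is submitted.
-- userinfo_raw is a hypothetical source table with ip, uid, and dt columns.
INSERT OVERWRITE TABLE userinfo_new_${var} PARTITION (dt = '${var}')
SELECT ip, uid
FROM userinfo_raw
WHERE dt = '${var}';

-- Check how many rows were written for this business date.
SELECT COUNT(*) AS row_count
FROM userinfo_new_${var}
WHERE dt = '${var}';
```

Quoting the placeholder ('${var}') keeps the substituted value a string literal, which matches the STRING type of the dt partition column.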
Debug the node
-
In the Run Configuration section, select a Compute Resource and Resource Group.
Parameter
Description
Compute Resource
Select a bound EMR Serverless Spark compute resource. If no compute resource is available, select Create Compute Resource from the drop-down list.
Resource Group
Select a resource group that is associated with the workspace.
Script Parameters
When you configure the node content, define variables by using the ${parameter_name} format. Then, configure the Parameter Name and Parameter Value in the Script Parameters section. These variables are dynamically replaced with actual values at runtime. For more information, see Sources and expressions of scheduling parameters.
ServerlessSpark Node Parameters
Spark runtime parameters. The following types are supported:
-
DataWorks custom runtime parameters. For more information, see Appendix: DataWorks parameters.
-
Spark built-in property parameters. For more information, see Open-source Spark property parameters. You can directly load a Spark configuration template from Serverless Spark without manual input, which simplifies the configuration process and ensures consistency.
-
On the toolbar at the top of the node editor, click Run to run the SQL task.
Important: Before deployment, synchronize the ServerlessSpark Node Parameters from Run Configuration to the ServerlessSpark Node Parameters in Scheduling Settings.
SQL execution modes
DataWorks provides two SQL execution modes: running within the node and running from the workflow panel. These two modes have different execution contexts and may produce different results.
-
Run within the node: Select all SQL statements in the SQL editor and click Run. Only the selected SQL statements are executed. This mode uses the compute resource and parameters configured for the node and does not trigger dependency relationships with upstream or downstream nodes.
-
Run from the workflow panel: Right-click the current node in the workflow panel and run the current node and its downstream nodes. The current node and all its downstream nodes are executed sequentially based on scheduling dependencies. This mode uses the compute resource and parameters specified in Scheduling Settings, which may differ from those used when you run within the node.
Common causes for differences between the two modes include different scheduling parameter values, inconsistent compute resource configurations, and the output of upstream nodes affecting the execution context of the current node. To troubleshoot, compare the scheduling parameters and compute resource configurations used by the two modes.
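One way to compare the two modes is to have the node print the values it actually receives, then run it once in each mode and compare the output. The sketch below assumes that a variable named var is defined for the node and that the node accepts Spark SQL's SET statement for displaying a property. spark.sql.shuffle.partitions is used only as an example property, and current_catalog() requires Spark 3.1 or later.

```sql
-- Show the value that ${var} resolves to in the current execution mode.
SELECT '${var}' AS resolved_var;

-- Display the effective value of a Spark property (example property only).
SET spark.sql.shuffle.partitions;

-- Show which catalog and database unqualified table names resolve against.
SELECT current_catalog() AS catalog_name, current_database() AS database_name;
```

If the outputs differ between the two modes, the difference points to the parameter or compute resource setting that needs to be aligned.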
Next steps
-
Configure node scheduling: If you need to run a node periodically, configure its Scheduling Policy in the Scheduling Settings panel on the right.
-
Publish a node: To run a task in the production environment, click the publish icon to publish the node. A node runs on schedule only after it is published to the production environment.
-
Task O&M: After a task is published, you can monitor the status of its periodic runs in the Operation Center. For more information, see Get started with Operation Center.
Appendix: DataWorks parameters
Advanced parameter | Description |
FLOW_SKIP_SQL_ANALYZE | The SQL statement execution mode. Valid values:
Note This parameter is supported only for test runs in the data development environment. |
DATAWORKS_SESSION_DISABLE | The task submission method. During data development, tasks are submitted to SQL Compute for execution by default. You can use this parameter to specify whether the task is executed through SQL Compute or submitted to a queue for execution.
|
SERVERLESS_RELEASE_VERSION | The Spark engine version. By default, the Default Engine Version configured in the cluster settings under Compute Resource in Management Center is used. To set a different engine version for a specific task, configure this parameter. Note The |
SERVERLESS_QUEUE_NAME | The resource queue to which tasks are submitted. When a task is configured to be submitted to a queue for execution, the Default Resource Queue configured in the cluster settings under Clusters in Management Center is used by default. If you have resource isolation and management requirements, you can add queues. For more information, see Manage resource queues. Configuration methods:
Note
|
SERVERLESS_SQL_COMPUTE | The SQL Compute (SQL session) to use. By default, the Default SQL Compute configured in the cluster settings under Compute Resource in Management Center is used. To set a different SQL session for a specific task, configure this parameter. For information about how to create and manage SQL sessions, see Manage SQL Compute sessions. |
Others | Custom Spark Configuration parameters. Add Spark-specific property parameters. Configuration format: Note DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the priority of the global Spark parameters is higher than that of Spark parameters configured within a specific module. For more information about configuring global Spark parameters, see Configure global Spark parameters. |