A Serverless Spark SQL node provides a distributed SQL query engine that runs on an EMR Serverless Spark compute resource. You can use this node to process structured data and improve job execution efficiency.
Prerequisites
Compute resource requirements: You can only use an EMR Serverless Spark compute resource. Ensure network connectivity between the resource group and the compute resource.
Resource group: Only Serverless resource groups can be used to run this type of task.
(Optional) If you are a Resource Access Management (RAM) user, ensure that you have been added to the workspace for task development and have been assigned the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions. Grant this role with caution. For more information about adding members, see Add members to a workspace.
If you use an Alibaba Cloud account, you can skip this step.
Create a node
For more information, see Create a node.
Develop the node
Write your SQL code in the editor. The syntax catalog.database.tablename is supported. Omitting the catalog uses the cluster's default catalog. Omitting catalog.database uses the default catalog's default database.
For more information about catalogs, see Manage data catalogs in EMR Serverless Spark.
-- Replace <catalog.database.tablename> with your actual values
SELECT * FROM <catalog.database.tablename>;

Define variables in your code in the ${variable_name} format and assign their values in the Scheduling Parameters section of the Scheduling Configurations pane. This lets you dynamically pass parameters to scheduled tasks. For more information about how to use scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example.
SHOW TABLES;
-- Define a variable named var by using ${var}. If you assign the value ${yyyymmdd} to this variable, you can create a table with a business date suffix when the scheduled task runs.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
ip STRING COMMENT 'IP address',
uid STRING COMMENT 'User ID'
) PARTITIONED BY (
dt STRING
); -- This can be used with scheduling parameters.

The maximum size of an SQL statement is 130 KB.
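Building on the example above, a scheduled run can then write into the partition for the business date. The following is a minimal sketch, not part of the original example: the source table name source_userinfo is a hypothetical placeholder, and it assumes the scheduling parameter var resolves to a date string at run time.

```sql
-- Hypothetical follow-up to the example above. Assumes ${var} resolves to a
-- business date such as 20240101 when the scheduled task runs.
INSERT OVERWRITE TABLE userinfo_new_${var} PARTITION (dt = '${var}')
SELECT ip, uid
FROM source_userinfo;  -- source_userinfo is a placeholder table name
```

Because ${var} is replaced before execution, each daily run targets the table and partition for that business date.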
Debug the node
In the Run Configuration pane, select a Compute resource and a Resource group.
Compute resource: Select a bound EMR Serverless Spark compute resource. If no compute resource is available, select Create Compute Resource from the drop-down list.

Resource group: Select a resource group that is bound to the workspace.

Script parameter: If you define variables in the ${parameter_name} format in the node content, you must specify the Parameter Name and Parameter Value in the Script Parameter section. The variables are dynamically replaced with their actual values at runtime. For more information, see Sources and expressions of scheduling parameters.

Serverless Spark node parameters: The runtime parameters for the Spark application. The following types are supported:
Custom runtime parameters in DataWorks. For more information, see Appendix: DataWorks parameters.
Native Spark properties. For more information, see Open source Spark properties. You can directly load a Spark configuration template from EMR Serverless Spark without manual input. This simplifies configuration and ensures consistency.
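In addition to configuring Spark properties through node parameters, open source Spark SQL also supports setting session-level properties inline with SET statements. The following is a hedged sketch; the property names are standard open source Spark settings, and the values are examples only.

```sql
-- Session-level properties set inline; the values are illustrative only.
SET spark.sql.shuffle.partitions = 200;
SET spark.sql.adaptive.enabled = true;
-- Subsequent statements in the same session pick up these settings.
SHOW TABLES;
```

Inline SET statements affect only the current session, whereas node parameters apply to the whole Spark application.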
In the toolbar at the top of the node editor, click Run to run the SQL task.
Important: Before you deploy the node, you must copy the settings from the Run Configuration pane to the Serverless Spark Node Parameters section in the Scheduling Configurations pane.
Next steps
Schedule a node: If a node in the project folder needs to run periodically, you can set the Scheduling Policies and configure scheduling properties in the Scheduling section on the right side of the node page.
Publish a node: If the task needs to run in the production environment, click the publish icon to publish the task. A node in the project folder runs on a schedule only after it is published to the production environment.

Node O&M: After you publish the task, you can view the status of the auto triggered task in the Operation Center. For more information, see Get started with Operation Center.
Appendix: DataWorks parameters
FLOW_SKIP_SQL_ANALYZE: Specifies how SQL statements are executed. Note: This parameter is applicable only to test runs in the Data Development environment.

DATAWORKS_SESSION_DISABLE: Specifies the job submission method. When you run a job in Data Development, the job is submitted to SQL Compute by default. You can use this parameter to specify whether to submit the job to SQL Compute or a resource queue.

SERVERLESS_RELEASE_VERSION: Specifies the Spark engine version. By default, the job uses the default engine version configured for the cluster in the Compute Engines section of the Management Center. Use this parameter to specify a different engine version for a specific job.

SERVERLESS_QUEUE_NAME: Specifies the resource queue for job submission. By default, jobs are sent to the default resource queue configured for the cluster in the Cluster Management section of the Management Center. If you have resource isolation and management requirements, you can add queues and use this parameter to select a different queue. For more information, see Manage resource queues.

SERVERLESS_SQL_COMPUTE: Specifies the SQL Compute (SQL session). By default, the default SQL Compute instance configured for the cluster in the Compute Engines section of the Management Center is used. If you need to set different SQL sessions for different jobs, you can configure this parameter. For more information about how to create and manage SQL sessions, see Manage SQL sessions.

Others: Custom Spark configuration parameters. You can add Spark-specific properties. Note: DataWorks allows you to set global Spark parameters at the workspace level. These parameters are applied to all DataWorks modules. You can specify whether these global parameters take priority over module-specific Spark parameters. For more information about how to set global Spark parameters, see Configure global Spark parameters.
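As an illustration of the key=value form commonly used for Spark properties: the property names below are standard open source Spark properties and the values are examples only; the exact format accepted by the console may differ.

```
spark.executor.memory=4g
spark.executor.cores=2
spark.dynamicAllocation.enabled=true
```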