You can create a Serverless Spark SQL node to process structured data. The node uses the distributed SQL query engine provided by EMR Serverless Spark computing resources to run SQL jobs efficiently.
Scope
Computing resource limitation: You can attach only EMR Serverless Spark computing resources. Ensure that network connectivity is available between the resource group and the computing resources.
Resource group limitation: You can run this type of job only in a Serverless resource group.
(Required if you use a RAM user to develop tasks) The RAM user is added to your DataWorks workspace as a member and is assigned the Development or Workspace Manager role. The Workspace Manager role has more permissions than necessary. Exercise caution when you assign the Workspace Manager role. For more information about how to add a member, see Add workspace members and assign roles to them.
Note: If you use an Alibaba Cloud account, you can skip this operation.
Create a node
For more information, see Create a node.
Develop the node
You can develop task code in the SQL editing area. Table references support the catalog.database.tablename format. If the catalog is omitted, the cluster's default catalog is used. If both the catalog and the database are omitted, the default database in the cluster's default catalog is used.
For more information about data catalogs, see Manage data catalogs in EMR Serverless Spark.
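For example, the following hedged sketch shows how a table reference resolves depending on which qualifiers you include. The names my_catalog, my_db, and orders are placeholders for illustration only.
-- Fully qualified: reads from the specified catalog and database.
SELECT * FROM my_catalog.my_db.orders;
-- Catalog omitted: resolves against the cluster's default catalog.
SELECT * FROM my_db.orders;
-- Catalog and database omitted: resolves against the default database in the cluster's default catalog.
SELECT * FROM orders;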
-- Replace <catalog.database.tablename> with your actual information.
SELECT * FROM <catalog.database.tablename>;
Define a variable in your code in the ${variable name} format. You can then assign a value to the variable in Scheduling Parameters under the Scheduling Configuration section on the right side of the node editing page. This lets you dynamically pass values to your code in scheduling scenarios. For more information about the supported formats of scheduling parameters, see Supported formats for scheduling parameters. An example is provided below.
SHOW TABLES;
-- Define a variable named var using ${var}. If you assign the value ${yyyymmdd} to this variable, you can create a table with the data timestamp as a suffix using a scheduled task.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
    ip STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
) PARTITIONED BY (
    dt STRING
); -- This statement can be used with scheduling parameters.
Note: The maximum size of an SQL statement is 130 KB.
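Building on the example above, the same ${var} value can also be used as the partition value when a scheduled instance writes data. The following is a minimal sketch; source_userinfo and its ds column are hypothetical names used only for illustration.
-- Hypothetical example: write the data for the current data timestamp into the matching partition.
INSERT OVERWRITE TABLE userinfo_new_${var} PARTITION (dt = '${var}')
SELECT ip, uid
FROM source_userinfo
WHERE ds = '${var}';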
Test the node
In the Debug Configuration section, select the Computing Resource and Resource Group.
| Configuration Item | Description |
| --- | --- |
| Computing Resource | Select an attached EMR Serverless Spark computing resource. If no computing resource is available, select Create Computing Resource from the drop-down list. |
| Resource Group | Select a resource group that is attached to the workspace. |
| Script Parameters | When you configure the node content, you can define a variable in the ${Parameter Name} format. You must configure the Parameter Name and Parameter Value in the Script Parameters section. When the task runs, the variable is dynamically replaced with its actual value. For more information, see Supported formats for scheduling parameters. |
| Serverless Spark Node Parameters | Runtime parameters for the Spark program. The following parameters are supported: DataWorks custom runtime parameters (see Appendix: DataWorks parameters) and native Spark properties (see Open-source Spark properties). |
In the toolbar at the top of the node editing page, click Run to run the SQL task.
Important: Before you publish the node, you must synchronize the Serverless Spark Node Parameters in the Debug Configuration section with the Serverless Spark Node Parameters in the Scheduling Configuration section.
What to do next
Node Scheduling: If a node in a project folder must run on a recurring schedule, you can set a Scheduling Policy and configure the scheduling properties in the Scheduling Configuration section on the right side of the node.
Publish the node: To publish the task to the production environment, click the publish icon in the toolbar. Nodes in a project folder are scheduled to run periodically only after they are published to the production environment.
Task O&M: After a task is published, you can view the running status of auto triggered tasks in the Operation Center. For more information, see Getting started with Operation Center.
Appendix: DataWorks parameters
| Advanced parameter | Description |
| --- | --- |
| FLOW_SKIP_SQL_ANALYZE | The execution mode of SQL statements. Note: This parameter is used only for testing flows in the data development environment. |
| DATAWORKS_SESSION_DISABLE | The task submission method. When you run a task in Data Development, the task is submitted to the SQL compute for execution by default. You can use this parameter to specify whether the task is executed by the SQL compute or submitted to a queue. |
| SERVERLESS_RELEASE_VERSION | Specifies the Spark engine version. The default value is the Default Engine Version configured for the cluster under Computing Resource in the Management Center. You can configure this parameter to specify different engine versions for different tasks. |
| SERVERLESS_QUEUE_NAME | The resource queue to which tasks are submitted. The default value is the Default Resource Queue configured for the cluster in Cluster Management of the Management Center. You can add queues to meet resource isolation and management requirements. For more information, see Manage Resource Queues. |
| SERVERLESS_SQL_COMPUTE | Specifies the SQL compute (SQL session). The default value is the Default SQL Compute configured for the cluster under Computing Resource in the Management Center. You can configure this parameter to specify different SQL sessions for different types of tasks. For more information, see Manage SQL sessions. |
| Other | Custom Spark configuration parameters. You can add Spark-specific property parameters. Note: You can configure global Spark parameters at the workspace level for DataWorks services, and you can specify whether the workspace-level global Spark parameters take precedence over the Spark parameters configured for a single task in a specific DataWorks service. For more information, see Configure global Spark parameters. |
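For reference, native Spark properties follow the standard open-source spark.* naming. The following sketch shows runtime SQL configurations adjusted with SET statements inside a Spark SQL script; this is separate from the node parameter settings described above and is shown only to illustrate the property naming, with example values that you should adapt to your workload.
-- Illustrative only: adjust runtime SQL configurations from within a Spark SQL script.
SET spark.sql.shuffle.partitions=200;
SET spark.sql.adaptive.enabled=true;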