You can create an E-MapReduce (EMR) Spark SQL node to process structured data using a distributed SQL query engine, improving job execution efficiency.
Prerequisites
-
(Optional) If you need a custom component environment, create a custom image based on the official dataworks_emr_base_task_pod image before you start node development, and then use it in Data Studio. For example, when you create a custom image, you can replace Spark JAR packages or add dependencies on specific libraries, files, or JAR packages. For more information, see Create a custom image and Use images in data development.
-
Create an Alibaba Cloud EMR cluster and register it with DataWorks. For more information, see New Data Studio: Attach an EMR compute resource.
-
(Optional, for RAM users) The Resource Access Management (RAM) user for task development must be added to the workspace and assigned the Development or Workspace Administrator role (this role includes extensive permissions and must be granted with caution). For more information, see Add workspace members.
If you are using a root account, skip this step.
You can use a custom image to build a specific development environment for your task.
Limitations
-
EMR Spark SQL nodes can run only on a serverless resource group (recommended) or an exclusive resource group for scheduling. Using a custom image for data development requires a serverless resource group.
To manage metadata for a DataLake or custom cluster in DataWorks, you must first configure EMR-HOOK on the cluster. For more information, see Configure EMR-HOOK for Spark SQL.
Note: Without EMR-HOOK configured on the cluster, you cannot view real-time metadata, generate audit logs, view data lineage, or perform EMR-related governance tasks in DataWorks.
EMR on ACK Spark clusters do not support viewing data lineage. EMR Serverless Spark clusters support viewing data lineage.
Function registration through the UI is supported on DataLake and custom clusters, but not on EMR on ACK Spark or EMR Serverless Spark clusters.
Notes
If you have enabled Ranger access control for Spark in the EMR cluster associated with the current workspace:
-
Spark tasks that use the default image automatically support this feature.
-
To run a Spark task with a custom image, submit a ticket to technical support to request an image upgrade.
Procedure
On the EMR Spark SQL node configuration tab, perform the following development operations.
Develop SQL code
In the SQL editor, write your task code. Define variables using the ${variable_name} format and assign their values in the Scheduling Settings > Scheduling Parameters section on the right side of the node configuration tab to dynamically pass parameters to scheduled tasks. For more information about how to use scheduling parameters, see Sources and expressions of scheduling parameters. The following code provides an example.
SHOW TABLES;
-- Define a variable named var by using ${var}. If you set this variable to ${yyyymmdd}, you can create a table with the business date as a suffix.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
  ip STRING COMMENT 'IP address',
  uid STRING COMMENT 'User ID'
) PARTITIONED BY (
  dt STRING
); -- Can be used with scheduling parameters.
(Optional) Configure EMR node parameters
In the Run Configuration pane, configure the following parameters:
Spark parameter: Spark built-in property parameters. For more information, see open source Spark configuration. You can directly load a Spark configuration template from Serverless Spark without manual input, simplifying the configuration process and ensuring consistency.
DataWorks parameters: The advanced parameters that can be configured vary by EMR cluster type. The following table describes the details.
DataLake cluster/Custom cluster: EMR on ECS
Advanced parameter
Description
queue
The scheduling queue for job submission. The default value is default. For more information about EMR YARN, see Basic queue configurations.
priority
The priority. Default value: 1.
FLOW_SKIP_SQL_ANALYZE
The SQL statement execution mode. Valid values:
true: Multiple SQL statements are executed at a time.
false (default): A single SQL statement is executed at a time.
Note: This parameter is supported only for test runs in the data development environment.
ENABLE_SPARKSQL_JDBC
The method used to submit SQL code. Valid values:
true: SQL code is submitted through JDBC. If the EMR cluster does not have the Kyuubi service, SQL code is submitted to Spark Thrift-Server. If the EMR cluster has the Kyuubi service, SQL code is submitted to Kyuubi through JDBC, and custom Spark parameters are supported.
Both methods support metadata lineage. However, tasks submitted to Thrift-Server lack output information for the corresponding node tasks in the metadata.
false (default): SQL code is submitted using the Spark-submit cluster method. In this mode, both Spark2 and Spark3 support metadata lineage and output information. Custom Spark parameters are also supported.
Note: The Spark-submit cluster method creates temporary files and directories in the /tmp directory on the HDFS of the EMR cluster by default. Make sure that the directory has read and write permissions.
When you use the Spark-submit cluster method, you can directly append custom SparkConf parameters in the advanced configurations. DataWorks automatically adds the new parameters to the command when you submit the code. Example: "spark.driver.memory" : "2g".
DATAWORKS_SESSION_DISABLE
This parameter is applicable to test runs in the development environment. Valid values:
true: A new JDBC connection is created each time an SQL statement is run.
false (default): The same JDBC connection is reused when different SQL statements are run within a single node.
Note: When this parameter is set to false, the Hive yarn applicationId is not printed. To print the yarn applicationId, set this parameter to true.
Others
Custom Spark Configuration parameters. Add Spark-specific property parameters.
Configuration format: "spark.eventLog.enabled" : false, "spark.eventLog.memory" : "12g". DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the format --conf key=value. For more information about parameter configurations, see Configure global Spark parameters.
Note: DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the global Spark parameters take priority over the Spark parameters configured within a specific module.
To enable Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger to the global Spark parameter configurations so that Ranger access control takes effect.
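To make the --conf key=value mapping described above concrete, the following Python sketch shows one way such key-value pairs could be turned into command-line arguments. It is an illustration only, not the DataWorks implementation: the helper name render_conf_flags is hypothetical, the sketch assumes the comma-separated pairs parse as a JSON object once wrapped in braces, and rendering booleans in lowercase is an assumption.

```python
import json
from typing import List


def render_conf_flags(advanced_config: str) -> List[str]:
    """Hypothetical helper: turn '"key": value' pairs into --conf key=value arguments."""
    # Assumption: wrapping the comma-separated pairs in braces yields valid JSON.
    params = json.loads("{" + advanced_config + "}")
    flags = []
    for key, value in params.items():
        # Assumption: JSON booleans are rendered in lowercase (true/false).
        if isinstance(value, bool):
            value = str(value).lower()
        flags += ["--conf", f"{key}={value}"]
    return flags


# Example values taken from this topic.
flags = render_conf_flags('"spark.eventLog.enabled" : false, "spark.driver.memory" : "2g"')
print(" ".join(flags))
```

The sketch keeps the parameters in the order in which they were written, which makes the generated arguments easy to compare with the advanced configuration of the node.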
EMR Serverless Spark cluster
For information about related parameter settings, see Spark task parameter settings.
Advanced parameter
Description
FLOW_SKIP_SQL_ANALYZE
The SQL statement execution mode. Valid values:
true: Multiple SQL statements are executed at a time.
false (default): A single SQL statement is executed at a time.
Note: This parameter is supported only for test runs in the data development environment.
DATAWORKS_SESSION_DISABLE
The task submission method. During data development, tasks are submitted to SQL Compute for execution by default. You can use this parameter to specify whether the task is executed through SQL Compute or submitted to a queue for execution.
true: The task is submitted to a queue for execution. By default, the default queue specified when the compute resource was associated is used. When DATAWORKS_SESSION_DISABLE is set to true, you can configure the SERVERLESS_QUEUE_NAME parameter to specify the queue to which tasks are submitted during data development.
false (default): The task is submitted to SQL Compute for execution.
Note: This parameter takes effect only during data development execution, not during scheduled runs.
SERVERLESS_RELEASE_VERSION
The Spark engine version. By default, the Default Engine Version configured in the cluster settings under Compute Resource in Management Center is used. To set a different engine version for a specific task, configure this parameter.
Note: The SERVERLESS_RELEASE_VERSION parameter in the advanced configurations takes effect only when the SQL Compute (session) specified for the registered cluster is in a stopped state in the EMR Serverless Spark console.
SERVERLESS_QUEUE_NAME
The resource queue to which tasks are submitted. When a task is configured to be submitted to a queue for execution, the Default Resource Queue configured in the cluster settings under Clusters in Management Center is used by default. If you have resource isolation and management requirements, you can add queues. For more information, see Manage resource queues.
Configuration methods:
Specify the resource queue for task submission by configuring node parameters.
Specify the resource queue for task submission by configuring global Spark parameters.
Note: The SERVERLESS_QUEUE_NAME parameter in the advanced configurations takes effect only when the SQL Compute (session) specified for the registered cluster is in a stopped state in the EMR Serverless Spark console.
During data development execution: You must first set DATAWORKS_SESSION_DISABLE to true so that tasks are submitted to a queue for execution. Only then does the SERVERLESS_QUEUE_NAME parameter take effect for specifying the task queue.
During Operation Center scheduled execution: Tasks are forcibly submitted to a queue for execution and cannot be submitted to SQL Compute.
SERVERLESS_SQL_COMPUTE
The SQL Compute (SQL session) to use. By default, the Default SQL Compute configured in the cluster settings under Compute Resource in Management Center is used. To set a different SQL session for a specific task, configure this parameter. For information about how to create and manage SQL sessions, see Manage SQL Compute sessions.
Others
Custom Spark Configuration parameters. Add Spark-specific property parameters.
Configuration format: "spark.eventLog.enabled": false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the format --conf key=value.
Note: DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the global Spark parameters take priority over the Spark parameters configured within a specific module. For more information about configuring global Spark parameters, see Configure global Spark parameters.
Spark cluster: EMR on ACK
Advanced parameter
Description
FLOW_SKIP_SQL_ANALYZE
The SQL statement execution mode. Valid values:
true: Multiple SQL statements are executed at a time.
false (default): A single SQL statement is executed at a time.
Note: This parameter is supported only for test runs in the data development environment.
Others
Custom Spark Configuration parameters. Add Spark-specific property parameters.
Configuration format: "spark.eventLog.enabled": false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the format --conf key=value.
Note: DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the global Spark parameters take priority over the Spark parameters configured within a specific module. For more information about configuring global Spark parameters, see Configure global Spark parameters.
Hadoop cluster: EMR on ECS
Advanced parameter
Description
queue
The scheduling queue for job submission. The default value is default. For more information about EMR YARN, see Basic queue configurations.
priority
The priority. Default value: 1.
FLOW_SKIP_SQL_ANALYZE
The SQL statement execution mode. Valid values:
true: Multiple SQL statements are executed at a time.
false (default): A single SQL statement is executed at a time.
Note: This parameter is supported only for test runs in the data development environment.
USE_GATEWAY
Specifies whether to submit jobs through a Gateway cluster. Valid values:
true: Jobs are submitted through a Gateway cluster.
false (default): Jobs are not submitted through a Gateway cluster. Instead, jobs are submitted to the header node by default.
Note: If the cluster of the current node is not associated with a Gateway cluster, manually setting this parameter to true causes subsequent EMR job submissions to fail.
Others
Custom Spark Configuration parameters. Add Spark-specific property parameters.
Configuration format: "spark.eventLog.enabled": false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the format --conf key=value. For more information about parameter configurations, see Configure global Spark parameters.
Note: DataWorks allows you to configure global Spark parameters at the workspace level for each DataWorks module. You can specify whether the global Spark parameters take priority over the Spark parameters configured within a specific module.
To enable Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger to the global Spark parameter configurations so that Ranger access control takes effect.
Run an SQL task
In Run Configuration, select and configure the Compute Resource and Resource Group.
Note: You can also configure CUs for Scheduling based on the resources required for task execution. The default CU value is 0.25.
To access data sources over the Internet or in a VPC, use a resource group for scheduling that has passed the connectivity test with the data source. For more information, see Configure network connectivity.
In the toolbar parameter dialog, select the corresponding data source and click Run to run the SQL task.
(Optional) Configure assignment parameters
To pass the query results of this node to downstream nodes, go to the Node Context Parameters section in the Scheduling Settings pane on the right side, and click Add Assignment Parameter. The system automatically adds an output parameter named outputs. The value of this parameter is the query result of the last line of code in this node. For more information, see Use node context parameters.
To run the node task periodically, configure scheduling information based on your business requirements. For more information, see Configure schedule settings.
Note: If you need to customize the component environment, you can create a custom image based on the official image dataworks_emr_base_task_pod and use the image in Data Development. For example, when you create a custom image, you can replace Spark JAR packages or depend on specific libraries, files, or JAR packages.
After the node task is configured, you must deploy the node. For more information, see Deploy a node.
After the task is deployed, you can view the running status of scheduled tasks in Operation Center. For more information, see View scheduled tasks.
FAQ
When you run an EMR Spark SQL node task during data development and want to submit the task to SQL Compute for execution, make sure that the SQL Compute is in the Running state. Otherwise, the task fails. To check the SQL Compute status, see Manage SQL Compute sessions.
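The submission rules for EMR Serverless Spark clusters described in the parameter table can be summarized in a short Python sketch. It is only a reading of the rules stated in this topic, not DataWorks code: the function resolve_target, the run mode strings "development" and "scheduled", and the placeholder default names are all hypothetical, and the sketch does not model the condition that SERVERLESS_QUEUE_NAME takes effect only when the SQL Compute session is stopped.

```python
def resolve_target(run_mode, params,
                   default_queue="default_queue",
                   default_sql_compute="default_sql_compute"):
    """Hypothetical helper: return ('queue', name) or ('sql_compute', name)."""
    # DATAWORKS_SESSION_DISABLE defaults to false, which means SQL Compute is used.
    session_disabled = str(params.get("DATAWORKS_SESSION_DISABLE", "false")).lower() == "true"

    # Scheduled runs always go to a queue; development runs do so only when
    # DATAWORKS_SESSION_DISABLE is true.
    if run_mode == "scheduled" or session_disabled:
        return ("queue", params.get("SERVERLESS_QUEUE_NAME", default_queue))

    # Otherwise the task goes to SQL Compute, which must be in the Running state.
    return ("sql_compute", params.get("SERVERLESS_SQL_COMPUTE", default_sql_compute))


# A development run with the default settings targets SQL Compute.
print(resolve_target("development", {}))
# Setting SERVERLESS_QUEUE_NAME alone does not redirect a development run.
print(resolve_target("development", {"SERVERLESS_QUEUE_NAME": "dev_queue"}))
```

If a development run fails because the SQL Compute is not running, the sketch suggests two options consistent with this topic: start the SQL Compute, or set DATAWORKS_SESSION_DISABLE to true so that the task is submitted to a queue instead.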