You can create an E-MapReduce (EMR) Spark SQL node to process structured data using a distributed SQL query engine, improving job execution efficiency.
Prerequisites
To customize the component environment for a node, create a custom image based on the official dataworks_emr_base_task_pod image and use the image in Data Development. For more information, see Custom images. For example, you can replace a Spark JAR package or add dependencies on specific libraries, files, or JAR packages when you create a custom image.
You have created an Alibaba Cloud E-MapReduce (EMR) cluster and registered it with DataWorks. For more information, see Data Studio: Associate an EMR computing resource.
(Optional, required for RAM users) Add the Resource Access Management (RAM) user responsible for task development to the Workspace and assign them the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information about adding members, see Add members to a workspace.
If you are using an Alibaba Cloud account, you can skip this step.
If your task requires a specific environment, you can use the DataWorks custom image feature to build an image with the necessary components. For more information, see Custom images.
Limitations
This task type runs only on a Serverless resource group (recommended) or an exclusive resource group for scheduling. If you use an image in Data Development, you must use a Serverless resource group.
To manage metadata for DataLake or custom clusters in DataWorks, you must configure EMR-HOOK on the cluster. For more information, see Configure EMR-HOOK for Spark SQL.
Note: If EMR-HOOK is not configured on the cluster, you cannot view metadata in real time, generate audit logs, display data lineage, or perform EMR-related data governance tasks in DataWorks.
EMR on ACK Spark clusters do not support viewing data lineage, while EMR Serverless Spark clusters do.
DataLake and custom clusters support visual function registration, but EMR on ACK and EMR Serverless Spark clusters do not.
Notes
If you enabled Ranger access control for Spark in the EMR cluster that is bound to the current workspace:
This feature is available by default when you run Spark tasks that use the default image.
To run Spark tasks that use a custom image, you must submit a ticket to upgrade the image to support this feature.
Procedure
In the EMR Spark SQL node editor, complete the following steps.
Develop SQL code
Write your task code in the SQL editor. You can define variables by using the ${variable_name} format in your code. Assign a value to the variable in the Schedule section under Parameters on the right side of the node editor page to enable dynamic parameter passing for scheduled jobs. For more information about scheduling parameters, see Sources and expressions of scheduling parameters. The following is an example.

SHOW TABLES;
-- Defines a variable named var by using ${var}. If you assign ${yyyymmdd} to this variable, you can create a table with the business date as a suffix.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
    ip STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
)
PARTITIONED BY (
    dt STRING
); -- Can be used with scheduling parameters.
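To make the substitution concrete, the following sketch shows how the statement above resolves at run time if var is assigned the business date 20250101. The date value is only an illustration:

-- Resolved statement, assuming var = 20250101 (illustrative business date).
CREATE TABLE IF NOT EXISTS userinfo_new_20250101 (
    ip STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
)
PARTITIONED BY (
    dt STRING
);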
(Optional) Configure advanced parameters
You can configure the advanced parameters listed in the following tables in the EMR Node Parameters or DataWorks Parameters section on the right side of the page.
Note: The available advanced parameters vary based on the EMR cluster type, as shown in the following tables.
Additional open-source Spark properties can be configured in the Spark Parameters section under EMR Node Parameters on the right side of the page.
DataLake and custom clusters: EMR on ECS
Parameter
Description
queue
The scheduling queue for job submission. The default queue is default. For more information about EMR YARN, see Basic queue configuration.
priority
The job priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE
Specifies how SQL statements are executed. Valid values:
true: Executes multiple SQL statements at a time.
false (default): Executes one SQL statement at a time.
Note: This parameter is applicable only to test runs in the Data Development environment.
ENABLE_SPARKSQL_JDBC
Specifies the method for submitting SQL code. Valid values:
true: Submits SQL code by using JDBC (Java Database Connectivity). If the EMR cluster does not have a Kyuubi service, the SQL code is submitted to the Spark Thrift Server. If the EMR cluster has a Kyuubi service, the SQL code is submitted to Kyuubi by using JDBC. This mode supports custom Spark parameters. Both methods support metadata and data lineage. However, when jobs are submitted to the Thrift Server, the output information of the corresponding node is not included in the metadata.
false (default): Submits SQL code by using the spark-submit cluster mode. In this mode, both Spark 2 and Spark 3 support metadata, data lineage, and output information. You can also customize Spark parameters.
Note: By default, the spark-submit cluster mode creates temporary files and directories in the /tmp directory of the EMR cluster HDFS. You must have read and write permissions for this directory.
When you use the spark-submit cluster mode, you can add custom SparkConf parameters in the advanced settings. DataWorks automatically adds these parameters to the command when it submits the code. For example, "spark.driver.memory" : "2g".
DATAWORKS_SESSION_DISABLE
Specifies the JDBC connection behavior for test runs in the Data Development environment. Valid values:
true: Creates a new JDBC connection for each SQL statement execution.
false (default): Reuses the same JDBC connection when you run different SQL statements within a single node.
Note: If this parameter is set to false, the Hive yarn applicationId is not printed in the logs. To print the yarn applicationId, set this parameter to true.
Others
Custom Spark Configuration parameters. You can add Spark-specific properties.
Use the following format: spark.eventLog.enabled : false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the --conf key=value format (see the example after this table). For more information about parameter configuration, see Configure global Spark parameters.
Note: DataWorks allows you to set global Spark parameters at the workspace level. These parameters are applied to all DataWorks modules. You can specify whether these global parameters take priority over module-specific Spark parameters.
To enable Ranger permission control, add the spark.hadoop.fs.oss.authorization.method=ranger configuration in the Spark Parameters section to ensure that permission control takes effect.
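For illustration only, a combined set of advanced parameter entries for a DataLake or custom cluster might look like the following. All values are examples in the key : value format described above, not recommendations:

queue : default
priority : 1
FLOW_SKIP_SQL_ANALYZE : false
ENABLE_SPARKSQL_JDBC : false
"spark.driver.memory" : "2g"
spark.eventLog.enabled : false

Because ENABLE_SPARKSQL_JDBC is false in this sketch, the job is submitted in spark-submit cluster mode, and the two Spark properties are appended to the submission command in the --conf key=value format.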
EMR Serverless Spark
For information about how to set the relevant parameters, see Set parameters for submitting a Spark job.
Parameter
Description
FLOW_SKIP_SQL_ANALYZE
Specifies how SQL statements are executed. Valid values:
true: Executes multiple SQL statements at a time.
false (default): Executes one SQL statement at a time.
Note: This parameter is applicable only to test runs in the Data Development environment.
DATAWORKS_SESSION_DISABLE
Specifies the job submission method. When you run a job in Data Development, the job is submitted to SQL Compute by default. You can use this parameter to specify whether to submit the job to SQL Compute or a resource queue.
true: The job is submitted to a resource queue. By default, it uses the default queue specified when the compute resources were bound. If you set DATAWORKS_SESSION_DISABLE to true, you can also configure the SERVERLESS_QUEUE_NAME parameter to specify the queue for job submission during development and execution.
false (default): The job is submitted to SQL Compute.
Note: This parameter takes effect only when you run jobs in Data Development. It does not take effect for scheduled runs.
SERVERLESS_RELEASE_VERSION
Specifies the Spark engine version. By default, the job uses the default engine version configured for the cluster in the Compute Engines section of the Management Center. Use this parameter to specify a different engine version for a specific job.
Note: The SERVERLESS_RELEASE_VERSION parameter in the advanced settings takes effect only when the SQL Compute (session) specified for the registered cluster is not started in the EMR Serverless Spark console.
SERVERLESS_QUEUE_NAME
Specifies the resource queue for job submission. By default, jobs are sent to the default resource queue configured for the cluster in the Cluster Management section of the Management Center. If you have resource isolation and management requirements, you can add queues and use this parameter to select a different queue. For more information, see Manage resource queues.
Configuration methods:
Set node parameters to specify the resource queue for job submission.
Specify the resource queue for job submission by using global Spark parameters. For more information, see Configure global Spark parameters.
Note: The SERVERLESS_QUEUE_NAME parameter in the advanced settings takes effect only when the SQL Compute (session) specified for the registered cluster is not started in the EMR Serverless Spark console.
When running a job in Data Development: You must set DATAWORKS_SESSION_DISABLE to true to submit the job to a queue. The SERVERLESS_QUEUE_NAME parameter takes effect only in this scenario.
When running a scheduled job from the Operation Center: The job is always submitted to a queue and cannot be submitted to SQL Compute.
SERVERLESS_SQL_COMPUTE
Specifies the SQL Compute (SQL session). By default, the default SQL Compute instance configured for the cluster in the Compute Engines section of the Management Center is used. If you need to set different SQL sessions for different jobs, you can configure this parameter. For more information about how to create and manage SQL sessions, see Manage SQL sessions.
Others
Custom Spark Configuration parameters. You can add Spark-specific properties.
Use the following format: spark.eventLog.enabled : false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the --conf key=value format (see the example following this table).
Note: DataWorks allows you to set global Spark parameters at the workspace level. These parameters are applied to all DataWorks modules. You can specify whether these global parameters take priority over module-specific Spark parameters. For more information about how to set global Spark parameters, see Configure global Spark parameters.
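As a sketch only, the following entries would route test runs in Data Development to a resource queue instead of SQL Compute. The queue name dev_queue is a hypothetical example and must be replaced with a queue that exists in your EMR Serverless Spark workspace:

FLOW_SKIP_SQL_ANALYZE : false
DATAWORKS_SESSION_DISABLE : true
SERVERLESS_QUEUE_NAME : dev_queue
spark.executor.memory : 4g

As noted above, SERVERLESS_QUEUE_NAME takes effect in Data Development only when DATAWORKS_SESSION_DISABLE is set to true. Scheduled runs triggered from the Operation Center are always submitted to a queue.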
EMR on ACK Spark
Parameter
Description
FLOW_SKIP_SQL_ANALYZE
Specifies how SQL statements are executed. Valid values:
true: Executes multiple SQL statements at a time.
false (default): Executes one SQL statement at a time.
Note: This parameter is applicable only to test runs in the Data Development environment.
Others
Custom Spark Configuration parameters. You can add Spark-specific properties.
Use the following format: spark.eventLog.enabled : false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the --conf key=value format.
Note: DataWorks allows you to set global Spark parameters at the workspace level. These parameters are applied to all DataWorks modules. You can specify whether these global parameters take priority over module-specific Spark parameters. For more information about how to set global Spark parameters, see Configure global Spark parameters.
EMR on ECS Hadoop
Parameter
Description
queue
The scheduling queue for job submission. The default queue is default. For more information about EMR YARN, see Basic queue configuration.
priority
The job priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE
Specifies how SQL statements are executed. Valid values:
true: Executes multiple SQL statements at a time.
false (default): Executes one SQL statement at a time.
Note: This parameter is applicable only to test runs in the Data Development environment.
USE_GATEWAY
Specifies whether to submit jobs for this node through a gateway cluster. Valid values:
true: Submits jobs through a gateway cluster.
false (default): Submits jobs to the header node instead of a gateway cluster.
Note: If you set this parameter to true but the cluster is not associated with a gateway cluster, the EMR job submission fails.
Others
Custom Spark Configuration parameters. You can add Spark-specific properties.
Use the following format: spark.eventLog.enabled : false. DataWorks automatically appends the parameters to the code submitted to the EMR cluster in the --conf key=value format (see the example following this table). For more information about parameter configuration, see Configure global Spark parameters.
Note: DataWorks allows you to set global Spark parameters at the workspace level. These parameters are applied to all DataWorks modules. You can specify whether these global parameters take priority over module-specific Spark parameters.
To enable Ranger permission control, add the spark.hadoop.fs.oss.authorization.method=ranger configuration in the Spark Parameters section to ensure that permission control takes effect.
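For example, a Hadoop cluster that submits jobs through a gateway cluster and a non-default YARN queue might use entries such as the following. The queue name analytics is hypothetical and must already exist in YARN; the other values are illustrations only:

queue : analytics
priority : 1
FLOW_SKIP_SQL_ANALYZE : false
USE_GATEWAY : true
spark.eventLog.enabled : false

As described above, if USE_GATEWAY is set to true but the cluster is not associated with a gateway cluster, the job submission fails.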
Run SQL job
In the Run Configuration section, configure the compute engine and resource group.
Note: You can also configure the scheduling CUs based on the job's resource requirements. The default value is 0.25.
To access a data source in a public network or a VPC, you must use a scheduling resource group that has established connectivity with the data source. For more information, see Network connectivity solutions.
In the parameter dialog box on the toolbar, select the appropriate data source and click Run to execute the SQL job.
To run the task on a recurring basis, configure its schedule based on your business requirements. For more information about the configuration, see Node scheduling configuration.
Note: To customize the component environment, create a custom image based on the official dataworks_emr_base_task_pod image and use the image in Data Development. For more information, see Custom images. For example, you can replace a Spark JAR package or add dependencies on specific libraries, files, or JAR packages when you create the custom image.
After you configure the task, you must deploy the node. For more information, see Node and workflow deployment.
After the task is deployed, you can view its status in Operation Center. For more information, see Getting started with Operation Center.
FAQ
When you run an EMR Spark SQL task in Data Development and the job is submitted to SQL Compute, make sure that the SQL Compute instance is in the Running state. Otherwise, the job fails. To view the status of an SQL Compute instance, see Manage SQL sessions.