Spark is a general-purpose big data analytics engine known for its high performance, ease of use, and broad applicability. It supports complex in-memory computing, making it ideal for building large-scale, low-latency data analytics applications. DataWorks provides the EMR Spark node, which lets you develop and schedule Spark jobs. This topic explains how to configure and use an EMR Spark node and provides examples of its functionality.
Prerequisites
To customize the component environment for a node, create a custom image based on the official dataworks_emr_base_task_pod image and use the image in Data Development. For example, you can replace a Spark JAR package or add dependencies on specific libraries, files, or JAR packages when you create the custom image. For more information, see Custom images.
You have created an Alibaba Cloud E-MapReduce (EMR) cluster and registered it with DataWorks. For more information, see Data Studio: Associate an EMR computing resource.
(Optional, required for RAM users) Add the Resource Access Management (RAM) user responsible for task development to the Workspace and assign them the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information about adding members, see Add members to a workspace.
If you are using an Alibaba Cloud account, you can skip this step.
If your job requires a specific development environment, use the custom image feature in DataWorks to build an image with the necessary components. For more information, see Custom images.
Limitations
This task type runs only on a Serverless resource group (recommended) or an exclusive resource group for scheduling. If you use an image in Data Development, you must use a Serverless resource group.
To manage metadata in DataWorks for a DataLake or custom cluster, you must first configure EMR-HOOK on the cluster. For more information, see Configure EMR-HOOK for Spark SQL.
Note: If EMR-HOOK is not configured on the cluster, you cannot view metadata in real time, generate audit logs, display data lineage, or perform EMR-related data governance tasks in DataWorks.
You cannot view the lineage of Spark clusters that are deployed on E-MapReduce on Container Service for Kubernetes (EMR on ACK). You can view the lineage of EMR Serverless Spark clusters.
On EMR on ACK and EMR Serverless Spark clusters, you can reference resources only from Object Storage Service (OSS) by using OSS REF and can upload resources only to OSS. Uploading resources to Hadoop Distributed File System (HDFS) is not supported.
DataLake and custom clusters support referencing OSS resources by using OSS REF, uploading resources to OSS, and uploading resources to HDFS.
Notes
If you enabled Ranger access control for Spark in the EMR cluster that is bound to the current workspace, note the following:
Ranger access control takes effect by default when you run Spark tasks that use the default image.
To run Spark tasks that use a custom image, you must submit a ticket to upgrade the image to support this feature.
Develop and package the Spark job
Before scheduling an EMR Spark job in DataWorks, you must first develop the job code in E-MapReduce (EMR), compile it, and generate a JAR package. For more information about how to develop an EMR Spark job, see Overview.
To schedule the EMR Spark job, you must upload the JAR package to DataWorks.
Procedure
On the EMR Spark node editing page, follow these steps to configure your job.
Develop a Spark job
Choose one of the following options based on your use case.
Option 1: Upload and reference EMR JAR
You can upload and reference resources from your local machine in Data Studio. After you compile your EMR Spark job, obtain the JAR package. Choose a storage method based on the JAR package size.
Upload the JAR package to create an EMR resource in DataWorks and submit it, or store it directly in HDFS on EMR. EMR on ACK and EMR Serverless Spark clusters do not support uploading resources to HDFS.
JAR smaller than 500 MB
Create an EMR JAR resource.
If the JAR package is smaller than 500 MB, you can upload it from your local machine to create an EMR JAR resource in DataWorks. This allows for visual management in the DataWorks console. After creating the resource, you must submit it. For more information, see Create and use an EMR resource.
Upload the JAR package from your local machine to the directory where JAR resources are stored. For more information, see Resource management.
Click Upload to upload the JAR resource.
Select the Storage Path, Data Source, and Resource Group.
Click Save.

Reference the EMR JAR resource.
Open the created EMR Spark node to go to the code editor.
In the left-side navigation pane, find the resource you want to reference, right-click it, and select Reference Resource.
After you select the resource, a reference to it is automatically added to the code editor of the EMR Spark node:
##@resource_reference{"spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar"} spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jarThis code confirms the reference. In this code,
spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jaris the name of the EMR JAR resource that you uploaded.Add the
spark-submitcommand to the EMR Spark node code. The following code provides an example.NoteDo not add comments to your job code, as they will cause an error when the node runs. Modify your code based on the following example.
##@resource_reference{"spark-examples_2.11-2.4.0.jar"} spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100Noteorg.apache.spark.examples.SparkPi: The main class of the job in your compiled JAR package.spark-examples_2.11-2.4.0.jar: The name of the EMR JAR resource that you uploaded.You can use the other parameters as shown in the example or run the
spark-submit --helpcommand to view the help documentation and modify the command as needed.If you need to use simplified parameters for the spark-submit command in the Spark node, you must add them to the code. For example, add
--executor-memory 2G.Spark nodes only support submitting jobs by using YARN in cluster mode.
For jobs submitted by using
spark-submit, set thedeploy-modetocluster moderather thanclient mode.
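The following is a minimal sketch of a node body with such simplified parameters added to the preceding example. The parameter values are illustrative only and should be sized to your own cluster and workload.
##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 2G --executor-cores 1 spark-examples_2.11-2.4.0.jar 100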
JAR 500 MB or more
Create an EMR JAR resource.
If the JAR package is 500 MB or larger, you cannot upload it from your local machine to create a DataWorks resource. Instead, store the JAR package in HDFS on EMR and record its storage path. This allows you to reference the path when you schedule the Spark job in DataWorks.
Upload the JAR package from your local machine to the directory where JAR resources are stored. For more information, see Resource management.
Click Upload to upload the JAR resource.
Select the Storage Path, Data Source, and Resource Group.
Click Save.

Reference the EMR JAR resource.
If the JAR package is stored in HDFS, reference it by specifying its path in the EMR Spark node's code.
Double-click the created EMR Spark node to open the code editor.
Write the spark-submit command. The following code provides an example.
spark-submit --master yarn --deploy-mode cluster --name SparkPi --driver-memory 4G --driver-cores 1 --num-executors 5 --executor-memory 4G --executor-cores 1 --class org.apache.spark.examples.JavaSparkPi hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
Note:
hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar: The actual path of the JAR package in HDFS.
org.apache.spark.examples.JavaSparkPi: The main class of the job in your compiled JAR package.
The other parameters are for the EMR cluster and must be configured based on your actual cluster settings. You can also run the spark-submit --help command to view the help documentation and modify the command as needed.
If you need to use simplified parameters for the spark-submit command in the Spark node, you must add them to the code. For example, add --executor-memory 2G.
Spark nodes only support submitting jobs by using YARN in cluster mode.
For jobs submitted by using spark-submit, set the deploy-mode to cluster mode rather than client mode.
Option 2: Directly reference OSS resource
You can directly reference an OSS resource in the node by using OSS REF. When the EMR node runs, DataWorks automatically loads the referenced OSS resource for the job to use. This method is often used for scenarios such as running JAR dependencies in an EMR job or when an EMR job depends on a script.
Develop the JAR resource.
Prepare code dependencies.
You can find the required code dependencies in the /usr/lib/emr/spark-current/jars/ path on the master node of your EMR cluster. The following example uses Spark 3.4.2. In your IDEA project, add the specified pom dependencies and reference the relevant plug-ins.
Add pom dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.4.2</version>
    </dependency>
    <!-- Apache Spark SQL -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.4.2</version>
    </dependency>
</dependencies>
Reference plug-ins:
<build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.7.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <configuration>
                <recompileMode>incremental</recompileMode>
            </configuration>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
The following code provides an example.
package com.aliyun.emr.example.spark

import org.apache.spark.sql.SparkSession

object SparkMaxComputeDemo {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession.
    val spark = SparkSession.builder()
      .appName("HelloDataWorks")
      .getOrCreate()

    // Print the Spark version.
    println(s"Spark version: ${spark.version}")
  }
}
After you edit the Scala code, generate a JAR package.
The generated JAR package in this example is SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar.
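A minimal packaging sketch, assuming the project above is a Maven project whose artifactId is SparkWorkOSS and whose version is 1.0-SNAPSHOT (adjust the coordinates to your own project):
mvn clean package
# With the maven-assembly-plugin configuration above, the shaded JAR is generated at
# target/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar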
Upload the JAR resource.
After developing the code, log on to the OSS console. In the navigation pane on the left, click Bucket List.
Click the name of the destination bucket to go to the File Management page.
This example uses the onaliyun-bucket-2 bucket.
Click Create Directory to create a directory for the JAR resource.
Set the Directory Name to emr/jars to create the directory.
Upload the JAR resource to the directory.
Go to the directory and click Upload File. In the Files to Upload section, click Scan Files, select the SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar file, and then click Upload File.
Reference the JAR resource.
Edit the code to reference the JAR resource.
On the editing page of the created EMR Spark node, edit the code to reference the JAR resource.
spark-submit --class com.aliyun.emr.example.spark.SparkMaxComputeDemo --master yarn ossref://onaliyun-bucket-2/emr/jars/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar
The parameters in the command are described as follows:
class: The full name of the main class to run.
master: The mode in which the Spark application runs.
ossref file path: The format is ossref://{endpoint}/{bucket}/{object}.
endpoint: The public endpoint for Object Storage Service (OSS). If left empty, the OSS bucket must be in the same region as the EMR cluster.
bucket: The OSS container used to store objects. Each bucket has a unique name. Log on to the OSS console to view all buckets under the current account.
object: A specific object (file name or path) stored in a bucket.
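For clarity, the ossref path in the preceding command maps to these parts (the values come from this example and will differ in your environment):
ossref://onaliyun-bucket-2/emr/jars/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar
endpoint: omitted, so the OSS bucket must be in the same region as the EMR cluster
bucket: onaliyun-bucket-2
object: emr/jars/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar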
Run the EMR Spark node job.
After editing, click the Run icon in the toolbar and select the Serverless resource group that you created to run the EMR Spark node. After the job is complete, record the applicationId that is printed in the console, for example, application_1730367929285_xxxx.
View the result.
Create an EMR Shell node and run the yarn logs -applicationId application_1730367929285_xxxx command on the node to view the result.
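As a small sketch, the EMR Shell node body could filter the collected logs for the line printed by the demo job. The applicationId below is the example value recorded above; replace it with your actual value.
yarn logs -applicationId application_1730367929285_xxxx | grep "Spark version"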
(Optional) Configure advanced parameters
You can configure the parameters described in the following tables in the EMR Node Parameters and DataWorks Parameters sections of the pane on the right side of the node.
Note: The advanced parameters that you can configure vary based on the type of your EMR cluster, as shown in the following tables.
You can configure more open source Spark properties in the EMR Node Parameters and Spark Parameters sections of the pane.
DataLake and custom (ECS)
Advanced parameter
Description
queue
The scheduling queue for submitting jobs. The default queue is default.
If you configured a workspace-level YARN resource queue when you registered the EMR cluster with the DataWorks workspace, the following logic applies:
If Prioritize Global Configuration is set to Yes, DataWorks uses the queue configured during EMR cluster registration to run the Spark job.
If this option is not selected, DataWorks uses the queue configured in the EMR Spark node to run the Spark job.
For more information about EMR YARN, see Basic queue configurations. For more information about queue configuration during EMR cluster registration, see Configure a global YARN queue.
priority
The job priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE
The execution mode for SQL statements. Valid values:
true: Executes multiple SQL statements at a time.
false (default): Executes one SQL statement at a time.
Note: This parameter is supported only for test runs in the data development environment.
Others
You can add custom Spark parameters in the advanced settings. For example, you can add spark.eventLog.enabled : false. DataWorks automatically formats the parameter as --conf key=value before sending it to the EMR cluster (see the example after this table).
You can also configure global Spark parameters. For more information, see Configure global Spark parameters.
Note: To enable Ranger permission control, add the configuration spark.hadoop.fs.oss.authorization.method=ranger in Configure global Spark parameters to ensure that it takes effect.
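For example, if you add the advanced parameter spark.eventLog.enabled : false, the parameter is expected to reach the cluster roughly as an additional --conf flag on the generated spark-submit command. The rest of the command below is abbreviated from the earlier example for illustration only.
spark-submit --conf spark.eventLog.enabled=false --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster spark-examples_2.11-2.4.0.jar 100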
Spark (ACK)
Advanced parameter
Description
FLOW_SKIP_SQL_ANALYZE
The execution mode for SQL statements. Valid values:
true: Executes multiple SQL statements at a time.
false: Executes one SQL statement at a time.
Note: This parameter is supported only for test runs in the data development environment.
Others
You can add custom Spark parameters in the advanced settings. For example, you can add spark.eventLog.enabled : false. DataWorks automatically formats the parameter as --conf key=value before sending it to the EMR cluster.
You can also configure global Spark parameters. For more information, see Configure global Spark parameters.
Hadoop (ECS)
Advanced parameter
Description
queue
The scheduling queue for submitting jobs. The default queue is default.
If you configured a workspace-level YARN resource queue when you registered the EMR cluster with the DataWorks workspace, the following logic applies:
If Prioritize Global Configuration is set to Yes, DataWorks uses the queue configured during EMR cluster registration to run the Spark job.
If this option is not selected, DataWorks uses the queue configured in the EMR Spark node to run the Spark job.
For more information about EMR YARN, see Basic queue configurations. For more information about queue configuration during EMR cluster registration, see Configure a global YARN queue.
priority
The job priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE
The execution mode for SQL statements. Valid values:
true: Executes multiple SQL statements at a time.
false: Executes one SQL statement at a time.
Note: This parameter is supported only for test runs in the data development environment.
USE_GATEWAY
Specifies whether to submit the node's job through a gateway cluster. Valid values:
true: Submits the job through a gateway cluster.
false: Does not submit the job through a gateway cluster. The job is submitted to the header node by default.
Note: If the node's cluster is not associated with a gateway cluster, setting this parameter to true causes the job submission to fail.
Others
You can add custom Spark parameters in the advanced settings. For example, you can add spark.eventLog.enabled : false. DataWorks automatically formats the parameter as --conf key=value before sending it to the EMR cluster.
You can also configure global Spark parameters. For more information, see Configure global Spark parameters.
Note: To enable Ranger permission control, add the configuration spark.hadoop.fs.oss.authorization.method=ranger in Configure global Spark parameters to ensure that it takes effect.
EMR Serverless Spark
For information about how to set the related parameters, see Set parameters for submitting a Spark job.
Advanced parameter
Description
queue
The scheduling queue for submitting jobs. The default queue is dev_queue.
priority
The job priority. The default value is 1.
FLOW_SKIP_SQL_ANALYZE
The execution mode for SQL statements. Valid values:
true: Executes multiple SQL statements at a time.
false: Executes one SQL statement at a time.
Note: This parameter is supported only for test runs in the data development environment.
SERVERLESS_RELEASE_VERSION
The Spark engine version. By default, the system uses the Default Engine Version configured for the cluster in Cluster Management in the Management Center. To use a different engine version for a specific job, you can override the default here.
SERVERLESS_QUEUE_NAME
Specifies the resource queue. By default, the system uses the Default Resource Queue configured for the cluster in Cluster Management in the Management Center. If you need to isolate or manage resources, you can specify a different queue here. For more information, see Manage resource queues.
Others
You can add custom Spark parameters in the advanced settings. For example, you can add spark.eventLog.enabled : false. DataWorks automatically formats the parameter as --conf key=value before sending it to the EMR cluster.
You can also configure global Spark parameters. For more information, see Configure global Spark parameters.
Run the Spark job
In the Compute Resource section of the Run Configuration, configure the Compute Resource and DataWorks Resource Group.
Note: You can also configure Scheduling CU based on the resources required for job execution. The default CU is 0.25.
To access data sources in a public network or a VPC, you must use a scheduling resource group that has established connectivity to the data source. For more information, see Network connectivity solutions.
In the parameter dialog box on the toolbar, select the corresponding data source and click Run to run the Spark job.
To run the node job on a regular basis, configure scheduling information based on your business requirements. For more information, see Node scheduling configuration.
Note: To customize the component environment, create a custom image based on the official dataworks_emr_base_task_pod image and use the image in Data Development. For more information, see Custom images. For example, you can replace a Spark JAR package or add dependencies on specific libraries, files, or JAR packages when you create the custom image.
After configuring the node, you must publish it. For more information, see Node and workflow deployment.
After publishing the node, you can view the status of its scheduled task in the Operations Center. For more information, see Getting started with Operation Center.
FAQ
Q: Why does a connection timeout error occur when I run a node?
A: Verify the network connectivity between the resource group and the cluster. Go to the computing resources page, find the computing resource, and click Initialize Resource. In the dialog box that appears, click Re-initialize and make sure that the initialization succeeds.

