Spark is a general-purpose big data analytics engine known for its high performance, ease of use, and broad applicability. It supports complex in-memory computing, which makes it ideal for building large-scale, low-latency data analytics applications. DataWorks provides the Serverless Spark Batch node that lets you develop and periodically schedule Spark jobs on an EMR Serverless Spark cluster.
Prerequisites
- Compute resource limits: Only EMR Serverless Spark compute resources are supported. Ensure that the resource group and the compute resource are connected over the network.
- Resource group constraints: This task runs only in a Serverless resource group.
- (Optional, for RAM users) The Resource Access Management (RAM) user for task development must be added to the workspace and assigned the Development or Workspace Administrator role. The Workspace Administrator role includes extensive permissions, so grant it with caution. For more information, see Add workspace members.
  If you are using a root account, skip this step.
Create a node
For instructions, see Create a node.
Develop the node
Before you develop a Serverless Spark Batch job, you must first develop the Spark job code in E-MapReduce (EMR) and compile it into a JAR package. For guidance on Spark development, see Spark tutorials.
Choose the approach that best fits your use case:
Option 1: Upload and reference EMR JAR
DataWorks lets you upload a resource from your local computer to Data Studio and reference it. After compiling a Serverless Spark Batch job, you get the compiled JAR package. Choose a storage method based on the JAR package size. If the JAR package is smaller than 500 MB, you can upload it from your local computer as a DataWorks EMR JAR resource.
- Create an EMR JAR resource.
  - In the left-side navigation pane, click the icon to go to the Resource Management page.
  - On the Resource Management page, click the icon, select , and name the resource spark-examples_2.11-2.4.0.jar.
  - Click Click Upload and upload the spark-examples_2.11-2.4.0.jar file.
  - Select a Storage Path, Data Sources, and Resource Group.
    Important: For the data source, you must select the bound EMR Serverless Spark cluster.
  - Click Save.
- Reference the EMR JAR resource.
  - Open the Serverless Spark Batch node that you created and stay on the code editor page.
  - In the Resource Management pane on the left, find the resource you want to use, right-click it, and select Insert Resource Path.
  - The resource is successfully referenced when the following code is automatically added to the code editor of the Serverless Spark Batch node:
    ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
    spark-examples_2.11-2.4.0.jar
    In the preceding code, spark-examples_2.11-2.4.0.jar is the name of the EMR JAR resource that you uploaded.
  - Modify the node code by adding the spark-submit command. The following example shows the modified code.
    Important:
    - Modify the job code exactly as shown in the example. Do not add comments, because comments cause the node to fail.
    - For EMR Serverless Spark, you do not need to set the deploy-mode parameter for the spark-submit command. Only cluster mode is supported.
    ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
    spark-submit --class org.apache.spark.examples.SparkPi spark-examples_2.11-2.4.0.jar 100
    | Command | Description |
    | class | The main class of the application in the JAR file. In this example, the value is org.apache.spark.examples.SparkPi. |
    Note: For more information about parameters, see Submit a job using spark-submit.
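The node code in Option 1 has a fixed shape: a resource reference line followed by a spark-submit line. The following Python helper is a minimal sketch of how that text could be generated before you paste it into the node editor. The function name build_node_code is hypothetical and is not part of DataWorks, and the sketch assumes that the reference and the command sit on separate lines. It deliberately emits no comments and no deploy-mode option, in line with the notes above.

```python
def build_node_code(jar_name, main_class, app_args=()):
    """Return the node body: a resource reference line, then a spark-submit line.

    Hypothetical helper for illustration only; it just formats text.
    """
    if not jar_name.endswith(".jar"):
        raise ValueError(f"expected a .jar resource name, got {jar_name!r}")
    # Line 1: the reference that Insert Resource Path adds to the editor.
    reference = '##@resource_reference{"%s"}' % jar_name
    # Line 2: spark-submit with the main class, the JAR, and application arguments.
    # No deploy-mode option is added and no comments are emitted.
    submit = " ".join(
        ["spark-submit", "--class", main_class, jar_name, *map(str, app_args)]
    )
    return reference + "\n" + submit


if __name__ == "__main__":
    print(
        build_node_code(
            "spark-examples_2.11-2.4.0.jar",
            "org.apache.spark.examples.SparkPi",
            [100],
        )
    )
```

Running the script prints the same two lines as the SparkPi example above, so the output can be compared against the documented form before it is used.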
Option 2: Reference OSS resource
You can directly reference a resource stored in Object Storage Service (OSS). When you run the node, DataWorks automatically loads the specified OSS resource for local use. This approach is typically used when an EMR job requires JAR dependencies or depends on scripts.
- Develop the JAR resource: This topic uses SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar as an example.
- Upload the JAR resource.
  - Log on to the OSS console. In the left-side navigation pane, click Buckets.
  - Click the name of the target bucket to go to the file management page.
  - Click Create Directory to create a directory to store the JAR resource.
  - Go to the directory and upload the SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar file to the bucket.
- Reference the JAR resource.
  - In the code editor for the Serverless Spark Batch node that you created, write the code to reference the JAR resource.
    Important: In the following code, the OSS bucket name is mybucket and the directory is emr. Replace them with your actual values.
    spark-submit --class com.aliyun.emr.example.spark.SparkMaxComputeDemo oss://mybucket/emr/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar
    Parameter description:
    | Parameter | Description |
    | class | The fully qualified name of the main class to run. |
    | oss file path | The format is oss://{bucket}/{object}. bucket: A container that stores objects in OSS. Every bucket has a unique name. You can log on to the OSS console to view all buckets under your current account. object: A specific file or path stored in a bucket. |
    Note: For more information about parameters, see Submit a job using spark-submit.
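A malformed OSS path is an easy mistake to make when you replace mybucket and emr with your own values. The following Python sketch splits a path of the form oss://{bucket}/{object} into its two parts so that it can be checked before it goes into the spark-submit command. The names OssPath and parse_oss_path are hypothetical, and the check covers only the overall shape of the path, not the naming rules that OSS applies to buckets.

```python
from typing import NamedTuple


class OssPath(NamedTuple):
    bucket: str
    object_key: str


def parse_oss_path(path):
    """Split 'oss://{bucket}/{object}' into bucket and object parts.

    Illustrative check only; it does not contact OSS or validate bucket names.
    """
    prefix = "oss://"
    if not path.startswith(prefix):
        raise ValueError(f"path must start with {prefix!r}: {path!r}")
    # Everything up to the first '/' is the bucket; the rest is the object.
    bucket, _, object_key = path[len(prefix):].partition("/")
    if not bucket or not object_key:
        raise ValueError(f"path must contain a bucket and an object: {path!r}")
    return OssPath(bucket, object_key)


if __name__ == "__main__":
    jar = "oss://mybucket/emr/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar"
    parsed = parse_oss_path(jar)
    print(parsed.bucket)
    print(parsed.object_key)
```

For the example path in this topic, the bucket part is mybucket and the object part is the emr directory followed by the JAR file name.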
Debug the node
- In the Run Configuration pane, configure settings such as Compute resource and Resource group.
  | Parameter | Description |
  | Compute resource | Select a bound EMR Serverless Spark compute resource. If no compute resources are available, you can select Create Compute Resource from the drop-down list. |
  | Resource group | Select a resource group that is bound to the workspace. |
  | Script parameters | You can define variables in the node's code by using the ${Parameter name} format. You must then define the Parameter name and Parameter Value in the Script Parameters section. At runtime, DataWorks replaces these variables with their specified values. For more information, see Scheduling parameters. |
  | ServerlessSpark node parameters | Runtime parameters for the Spark program. The following types are supported: custom runtime parameters for DataWorks (for more information, see Appendix: DataWorks parameters) and Spark built-in properties (for more information, see Open-source Spark properties and Custom Spark Conf parameters). The configuration format is as follows: "spark.eventLog.enabled": false. DataWorks automatically adds this setting to the code that is submitted to Serverless Spark in the --conf key=value format. |
  Note: DataWorks allows you to set global Spark parameters at the workspace level. You can specify whether the global Spark parameters have a higher priority than the parameters set within a specific module. For details, see Configure global Spark parameters.
- In the toolbar at the top of the node editor, click Run.
  Important: Before you publish, you must synchronize the ServerlessSpark node parameters in the Run Configuration to the ServerlessSpark node parameters in the Scheduling Settings.
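The two parameter mechanisms described for the Run Configuration pane can be pictured with a small Python sketch: ${name} placeholders in the node code are replaced with the values you define, and each Spark property is turned into a --conf key=value pair. This is only an illustration of the documented behavior, not the code DataWorks runs. The function names and the bizdate variable are hypothetical.

```python
import re

# Matches placeholders such as ${bizdate}.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def substitute(code, params):
    """Replace every ${name} in code with the value defined for name."""

    def replace(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value defined for ${{{name}}}")
        return str(params[name])

    return _PLACEHOLDER.sub(replace, code)


def to_conf_flags(properties):
    """Turn {'key': value} pairs into ['--conf', 'key=value', ...]."""
    flags = []
    for key, value in properties.items():
        # Render booleans in lowercase, as in "spark.eventLog.enabled": false.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        flags += ["--conf", f"{key}={text}"]
    return flags


if __name__ == "__main__":
    code = "spark-submit --class org.apache.spark.examples.SparkPi app.jar ${bizdate}"
    print(substitute(code, {"bizdate": "20240101"}))
    print(" ".join(to_conf_flags({"spark.eventLog.enabled": False})))
```

The sketch raises an error for a placeholder with no defined value. How DataWorks itself treats an undefined variable is not stated in this topic, so that behavior is a design choice of the example.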
Next steps
- Configure node scheduling: If you need to run a node periodically, configure its Scheduling Policy in the Scheduling Settings panel on the right.
- Publish a node: To run a task in the production environment, click the icon to publish the node. A node runs on schedule only after it is published to the production environment.
- Task O&M: After a task is published, you can monitor the status of its periodic runs in the Operation Center. For more information, see Get started with Operation Center.
Related documentation
Appendix: DataWorks parameters
| Parameter | Description |
| SERVERLESS_QUEUE_NAME | Specifies the resource queue to which the job is submitted. By default, jobs are submitted to the Default Resource Queue configured for the cluster under Clusters in the Management Center. If you need to isolate and manage resources, you can add queues. For more information, see Manage resource queues. Configuration methods: |