DataWorks: Serverless Spark Batch node

Last Updated: Jun 21, 2026

Spark is a general-purpose big data analytics engine known for its high performance, ease of use, and broad applicability. It supports complex in-memory computing, which makes it ideal for building large-scale, low-latency data analytics applications. DataWorks provides the Serverless Spark Batch node that lets you develop and periodically schedule Spark jobs on an EMR Serverless Spark cluster.

Prerequisites

  • Compute resource limits: Only an EMR Serverless Spark compute resource is supported. Ensure that the resource group and the compute resource are connected over the network.

  • Resource group constraints: This task runs only in a Serverless resource group.

  • (Optional, for RAM users) The Resource Access Management (RAM) user for task development must be added to the workspace and assigned the Development or Workspace Administrator role. The Workspace Administrator role includes extensive permissions, so grant it with caution. For more information, see Add workspace members.

    If you are using a root account, skip this step.

Create a node

For instructions, see Create a node.

Develop the node

Note

Before you develop a Serverless Spark Batch job, you must first develop the Spark job code in E-MapReduce (EMR) and compile it into a JAR package. For guidance on Spark development, see Spark tutorials.

Choose the approach that best fits your use case:

Option 1: Upload and reference EMR JAR

DataWorks lets you upload a resource from your local computer to Data Studio and reference it in a node. Compiling your Spark job produces a JAR package, and the right storage method depends on its size. If the JAR package is smaller than 500 MB, you can upload it from your local computer as a DataWorks EMR JAR resource.
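
The size rule above can be checked before you start the upload. The following is a minimal sketch, not part of DataWorks: the function names are hypothetical, "500 MB" is assumed to mean 500 × 1024 × 1024 bytes, and falling back to OSS (Option 2) for larger packages is an assumption rather than something this topic states.

```python
import os

# Assumption: "500 MB" is read as 500 * 1024 * 1024 bytes. The topic does not
# say whether the limit is binary or decimal.
MAX_LOCAL_UPLOAD_BYTES = 500 * 1024 * 1024


def choose_storage(jar_size_bytes: int) -> str:
    """Suggest a storage method for a JAR package of the given size."""
    if jar_size_bytes < MAX_LOCAL_UPLOAD_BYTES:
        # Option 1: upload from the local computer as an EMR JAR resource.
        return "emr-jar-resource"
    # Assumed fallback: store the JAR in OSS and reference it (Option 2).
    return "oss"


def choose_storage_for_file(path: str) -> str:
    """Same decision, based on the size of a file on disk."""
    return choose_storage(os.path.getsize(path))
```

For example, `choose_storage_for_file("spark-examples_2.11-2.4.0.jar")` returns the suggestion for a local copy of the example JAR.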

  1. Create an EMR JAR resource.

    1. In the left-side navigation pane, click the image icon to go to the Resource Management page.

    2. On the Resource Management page, click the image icon, select New Resource, and name the resource spark-examples_2.11-2.4.0.jar.

    3. Click the Click Upload button and upload the spark-examples_2.11-2.4.0.jar file.

    4. Select a Storage Path, Data Sources, and Resource Group.

      Important

      For the data source, you must select the bound EMR Serverless Spark cluster.

    5. Click Save.

  2. Reference the EMR JAR resource.

    1. Open the Serverless Spark Batch node that you created and stay on the code editor page.

    2. In the Resource Management pane on the left, find the resource you want to use, right-click it, and select Insert Resource Path.

    3. The resource is successfully referenced when the following code is automatically added to the code editor of the Serverless Spark Batch node:

      ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
      spark-examples_2.11-2.4.0.jar

      In the preceding code, spark-examples_2.11-2.4.0.jar is the name of the EMR JAR resource that you uploaded.

    4. Modify the node code by adding the spark-submit command. The following example shows the modified code.

      Important
      • Modify the job code exactly as shown in the example. Do not add comments, as this will cause the node to fail.

      • For EMR Serverless Spark, you do not need to set the deploy-mode parameter for the spark-submit command. Only cluster mode is supported.

      ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
      spark-submit --class org.apache.spark.examples.SparkPi spark-examples_2.11-2.4.0.jar 100

      Command | Description
      class   | The main class of the application in the JAR file. In this example, the value is org.apache.spark.examples.SparkPi.

      Note

      For more information about parameters, see Submit a job using spark-submit.
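
If you generate node code for several jobs, the two-line layout from Option 1 can be assembled programmatically. The sketch below is illustrative only: `render_node_code` is a hypothetical helper, not a DataWorks API. It emits the resource reference line followed by a single spark-submit line, and it adds no comments and no deploy-mode option, in line with the notes above.

```python
def render_node_code(resource_name, main_class, app_args=()):
    """Build node code that references an EMR JAR resource and submits it.

    Hypothetical helper: it only reproduces the text layout shown in this
    topic and does not call any DataWorks service.
    """
    # First line: the reference that DataWorks inserts for the resource.
    reference = '##@resource_reference{"%s"}' % resource_name
    # Second line: spark-submit with the main class, the JAR, and job arguments.
    command = ["spark-submit", "--class", main_class, resource_name]
    command.extend(str(arg) for arg in app_args)
    return reference + "\n" + " ".join(command)


if __name__ == "__main__":
    print(render_node_code(
        "spark-examples_2.11-2.4.0.jar",
        "org.apache.spark.examples.SparkPi",
        [100],
    ))
```

Run as a script, this prints the same two lines as the SparkPi example above.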

Option 2: Reference OSS resource

You can directly reference a resource stored in Object Storage Service (OSS). When you run the node, DataWorks automatically loads the specified OSS resource for local use. This approach is typically used when an EMR job requires JAR dependencies or depends on scripts.

  1. Develop the JAR resource: This topic uses SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar as an example.

  2. Upload the JAR resource.

    1. Log on to the OSS console. In the left-side navigation pane, click Buckets.

    2. Click the name of the target bucket to go to the file management page.

    3. Click Create Directory to create a directory to store the JAR resource.

    4. Go to the directory and upload the SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar file to the bucket.

  3. Reference the JAR resource.

    1. In the code editor for the Serverless Spark Batch node that you created, write the code to reference the JAR resource.

      Important

      In the following code, the OSS bucket name is mybucket and the directory is emr. Replace them with your actual values.

      spark-submit --class com.aliyun.emr.example.spark.SparkMaxComputeDemo oss://mybucket/emr/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar

      Parameter description:

      Parameter     | Description
      class         | The fully qualified name of the main class to run.
      OSS file path | The format is oss://{bucket}/{object}.

      • bucket: A container that stores objects in OSS. Every bucket has a unique name. You can log on to the OSS console to view all buckets under your current account.

      • object: A specific file or path stored in a bucket.

      Note

      For more information about parameters, see Submit a job using spark-submit.
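
The oss://{bucket}/{object} format described above can be built and validated with plain string handling. This is a minimal sketch with hypothetical helper names. It checks only the shape of the path and does not contact OSS or verify that the bucket or object exists.

```python
OSS_PREFIX = "oss://"


def build_oss_path(bucket: str, obj: str) -> str:
    """Join a bucket name and an object key into oss://{bucket}/{object}."""
    return OSS_PREFIX + bucket + "/" + obj.lstrip("/")


def parse_oss_path(path: str):
    """Split oss://{bucket}/{object} into a (bucket, object) tuple."""
    if not path.startswith(OSS_PREFIX):
        raise ValueError("path must start with oss://: %r" % path)
    # Everything up to the first "/" is the bucket; the rest is the object.
    bucket, _, obj = path[len(OSS_PREFIX):].partition("/")
    if not bucket or not obj:
        raise ValueError("path must contain a bucket and an object: %r" % path)
    return bucket, obj
```

For the example in this topic, `parse_oss_path` separates the path into the bucket `mybucket` and the object `emr/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar`.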

Debug the node

  1. In the Run Configuration pane, configure settings such as Compute Resource and Resource Group.

    Parameter | Description
    Compute resource | Select a bound EMR Serverless Spark compute resource. If no compute resources are available, you can select Create Compute Resource from the drop-down list.
    Resource group | Select a resource group that is bound to the workspace.
    Script parameters | You can define variables in the node's code by using the ${Parameter name} format. You must then define the Parameter name and Parameter Value in the Script Parameters section. At runtime, DataWorks replaces these variables with their specified values. For more information, see Scheduling parameters.
    ServerlessSpark node parameters | Runtime parameters for the Spark program. The following types are supported: The configuration format is as follows: "spark.eventLog.enabled": false. DataWorks automatically adds this setting to the code that is submitted to Serverless Spark in the --conf key=value format.

    Note

    DataWorks allows you to set global Spark parameters at the workspace level. You can specify whether the global Spark parameters have a higher priority than the parameters set within a specific module. For details, see Configure global Spark parameters.

  2. In the toolbar at the top of the node editor, click Run.

    Important

    Before you publish, you must synchronize the ServerlessSpark node parameters in the Run Configuration to the ServerlessSpark node parameters in the Scheduling Settings.
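
The two parameter mechanisms in the Run Configuration can be pictured with a small model. The sketch below is not DataWorks code: the function names are hypothetical, and details such as rendering booleans in lowercase or raising an error for an undefined variable are assumptions. It shows a ${name} placeholder being replaced with a script parameter value, and a node parameter such as "spark.eventLog.enabled": false being turned into a --conf key=value option.

```python
import re

# Matches placeholders written as ${name}.
_VARIABLE = re.compile(r"\$\{([^}]+)\}")


def substitute_script_parameters(code: str, script_params: dict) -> str:
    """Replace each ${name} in the node code with its script parameter value."""
    def replace(match):
        name = match.group(1)
        if name not in script_params:
            # Assumption: an undefined variable is treated as an error here.
            raise KeyError("no value defined for script parameter %r" % name)
        return str(script_params[name])
    return _VARIABLE.sub(replace, code)


def to_conf_options(node_params: dict) -> list:
    """Render node parameters as --conf key=value options."""
    options = []
    for key, value in node_params.items():
        if isinstance(value, bool):
            # Assumption: booleans are rendered in lowercase, as Spark expects.
            value = "true" if value else "false"
        options.append("--conf %s=%s" % (key, value))
    return options
```

With the script parameter `rounds` set to 100, the line `spark-submit --class org.apache.spark.examples.SparkPi spark-examples_2.11-2.4.0.jar ${rounds}` becomes the same line ending in `100`.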

Next steps

  • Configure node scheduling: If you need to run a node periodically, configure its Scheduling Policy in the Scheduling Settings panel on the right.

  • Publish a node: To run a task in the production environment, click the publish icon to publish the node. A node runs on schedule only after it is published to the production environment.

  • Task O&M: After a task is published, you can monitor the status of its periodic runs in the Operation Center. For more information, see Get started with Operation Center.

Appendix: DataWorks parameters

Parameter | Description
SERVERLESS_QUEUE_NAME | Specifies the resource queue to which the job is submitted. By default, jobs are submitted to the Default Resource Queue configured for the cluster under Clusters in the Management Center. If you need to isolate and manage resources, you can add queues. For more information, see Manage resource queues.

Configuration methods:

  • Specify the resource queue for the job by using node parameters.

  • Specify the resource queue for the job by using global Spark parameters.
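
Because the queue can be set in two places, it helps to reason about which value applies. The sketch below is a simplified model rather than DataWorks behavior: the function names and the queue names `dev_queue` and `shared_queue` are hypothetical, and the merge logic only illustrates the configurable priority between global Spark parameters and node parameters described in this topic.

```python
QUEUE_KEY = "SERVERLESS_QUEUE_NAME"


def effective_parameters(node_params: dict, global_params: dict,
                         global_has_priority: bool) -> dict:
    """Merge global and node parameters; the higher-priority side wins."""
    if global_has_priority:
        return {**node_params, **global_params}
    return {**global_params, **node_params}


def resolve_queue(node_params: dict, global_params: dict,
                  global_has_priority: bool = False):
    """Return the configured queue name, or None when neither side sets one.

    None stands for the default case, in which the job goes to the Default
    Resource Queue configured for the cluster.
    """
    merged = effective_parameters(node_params, global_params, global_has_priority)
    return merged.get(QUEUE_KEY)


if __name__ == "__main__":
    node = {QUEUE_KEY: "dev_queue"}          # hypothetical queue name
    workspace = {QUEUE_KEY: "shared_queue"}  # hypothetical queue name
    print(resolve_queue(node, workspace))
    print(resolve_queue(node, workspace, global_has_priority=True))
```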