
DataWorks: Serverless Spark Batch node

Last Updated: Feb 08, 2026

Spark is a high-performance, easy-to-use engine for large-scale data analytics. Its support for complex in-memory computing makes it well suited to building large-scale, low-latency data analysis applications. DataWorks provides Serverless Spark Batch nodes that let you develop and periodically schedule Spark tasks on EMR Serverless Spark clusters.

Applicability

  • Computing resource limitations: You can attach only EMR Serverless Spark computing resources. Ensure that network connectivity is available between the resource group and the computing resources.

  • Resource group: Only Serverless resource groups can be used to run this type of task.

  • (Optional) If you are a Resource Access Management (RAM) user, ensure that you have been added to the workspace for task development and have been assigned the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions. Grant this role with caution. For more information about adding members, see Add members to a workspace.

    If you use an Alibaba Cloud account, you can skip this step.

Create a node

For more information, see Create a node.

Develop a node

Note

Before you develop a Serverless Spark Batch task, you must first develop the Spark task code in EMR and compile it into a Java Archive (JAR) package. For more information about Spark development, see Spark Tutorials.
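For reference, the following is a minimal sketch in Scala of what such a task can look like, in the spirit of the SparkPi example that is used later in this topic. The package and object names are illustrative; compile the project with your build tool, such as Maven or sbt, to produce the JAR package.

  package com.example

  import org.apache.spark.sql.SparkSession
  import scala.util.Random

  // A minimal Spark batch task: a Monte Carlo estimate of Pi.
  // Package this object into a JAR and submit it with spark-submit.
  object SparkPiDemo {
    def main(args: Array[String]): Unit = {
      // On EMR Serverless Spark, the master and deploy mode are managed by
      // the platform, so the application only needs to name itself.
      val spark = SparkSession.builder.appName("SparkPiDemo").getOrCreate()
      val slices = if (args.length > 0) args(0).toInt else 2
      val n = 100000 * slices
      val hits = spark.sparkContext
        .parallelize(1 to n, slices)
        .map { _ =>
          // Sample a random point in the square [-1, 1] x [-1, 1] and
          // count it as a hit if it falls inside the unit circle.
          val x = Random.nextDouble() * 2 - 1
          val y = Random.nextDouble() * 2 - 1
          if (x * x + y * y <= 1) 1 else 0
        }
        .reduce(_ + _)
      println(s"Pi is roughly ${4.0 * hits / n}")
      spark.stop()
    }
  }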

Choose an option based on your scenario:

Option 1: Upload and reference an EMR JAR resource

In DataWorks, you can upload a resource from your local machine to DataStudio and then reference it in the node. After you compile the Spark task, obtain the JAR package. We recommend that you choose a storage method based on the size of the package: if the JAR package is smaller than 500 MB, you can upload it from your local machine as a DataWorks EMR JAR resource.

  1. Create an EMR JAR resource.

    1. In the navigation pane, click the Resource Management icon to open the Resource Management page.

    2. On the Resource Management page, choose Create Resource > EMR JAR and enter the name spark-examples_2.11-2.4.0.jar.

    3. Click Upload to upload spark-examples_2.11-2.4.0.jar.

    4. Select a Storage Path, Data Source, and Resource Group.

      Important

      For Data Source, select the bound Serverless Spark cluster.

    5. Click the Save button.


  2. Reference the EMR JAR resource.

    1. Open the code editor for the created Serverless Spark Batch node.

    2. In the navigation pane, expand Resource Management. Find the resource that you want to reference, right-click the resource, and then select Reference Resource.

    3. After you select the resource, the following reference statement is automatically added to the code editor of the Serverless Spark Batch node. This indicates that the resource is successfully referenced.

      ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
      spark-examples_2.11-2.4.0.jar

      In this statement, spark-examples_2.11-2.4.0.jar is the name of the EMR JAR resource that you uploaded.

    4. Rewrite the code of the Serverless Spark Batch node to add the spark-submit command. The following code provides an example.

      Important
      • The code editor for Serverless Spark Batch nodes does not support comments. Rewrite the task code based on the following example and do not add comments. Otherwise, an error occurs when you run the node.

      • For EMR Serverless Spark, you do not need to specify the deploy-mode parameter in the spark-submit command. Only cluster mode is supported.

      ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
      spark-submit --class org.apache.spark.examples.SparkPi spark-examples_2.11-2.4.0.jar 100

      Parameter description:

      • class: The main class of the task in the compiled JAR package. In this example, the main class is org.apache.spark.examples.SparkPi.

      Note

      For more information about the parameters, see Submit a task using spark-submit.
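      For reference, the following sketch shows where additional spark-submit options are placed: options come before the JAR path, and application arguments come after it. The option values shown here are illustrative only.

      spark-submit --class org.apache.spark.examples.SparkPi --conf spark.executor.memory=4g --conf spark.executor.cores=2 spark-examples_2.11-2.4.0.jar 1000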

Option 2: Directly reference an OSS resource

You can directly reference an OSS resource in the node. When you run the EMR node, DataWorks automatically downloads the OSS resource at runtime. This method is typically used when an EMR task depends on JAR packages or scripts.

  1. Develop a JAR resource: This topic uses SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar as an example.

  2. Upload the JAR resource.

    1. Log on to the OSS console. In the navigation pane, click Buckets.

    2. Click the name of the destination bucket to open the file management page.

    3. Click Create Directory to create a folder to store the JAR resource.

    4. Go to the folder and upload the SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar file to the bucket.
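    Alternatively, you can upload the file from the command line with the ossutil tool. The following sketch assumes that ossutil is installed and configured with your credentials, and uses the example bucket mybucket and folder emr that appear in the next step:

      ossutil cp SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar oss://mybucket/emr/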

  3. Reference the JAR resource.

    1. On the editor page for the created Serverless Spark Batch node, edit the code to reference the JAR resource.

      Important

      In the following code, the OSS bucket name is mybucket and the folder is emr. Replace them with your actual bucket name and folder path.

      spark-submit --class com.aliyun.emr.example.spark.SparkMaxComputeDemo oss://mybucket/emr/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar

      Parameter description:

      • class: The full name of the main class to run.

      • OSS file path: The path is in the oss://{bucket}/{object} format.

        • bucket: A container in OSS for storing objects. Each bucket has a unique name. Log on to the OSS console to view all buckets that belong to the current account.

        • object: A specific object, such as a file name or path, stored in a bucket.

      Note

      For more information about the parameters, see Submit a task using spark-submit.

Debug the node

  1. In the Run Configuration section, configure parameters such as Computing Resource and Resource Group.

    • Computing Resource: Select a bound EMR Serverless Spark computing resource. If no computing resources are available, select Create Computing Resource from the drop-down list.

    • Resource Group: Select a resource group that is bound to the workspace.

    • Script Parameters: When you configure the node content, you can define variables in the ${ParameterName} format. You must then specify the Parameter Name and Parameter Value in the Script Parameters section. The variables are dynamically replaced with their actual values at runtime; see the example after this list. For more information, see Sources and expressions of scheduling parameters.

    • ServerlessSpark Node Parameters: The runtime parameters for the Spark program. Configure each parameter as a key-value pair in the spark.eventLog.enabled : false format. DataWorks automatically adds each pair to the code that is submitted to Serverless Spark in the --conf key=value format.

      Note

      DataWorks lets you configure global Spark parameters for each module at the workspace level. You can also specify whether these global parameters take precedence over module-level parameters. For more information, see Configure global Spark parameters.
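    For example, assume that the node code ends with a variable argument (the parameter name slices and the value 100 are hypothetical):

      spark-submit --class org.apache.spark.examples.SparkPi spark-examples_2.11-2.4.0.jar ${slices}

    If you set Parameter Name to slices and Parameter Value to 100 in the Script Parameters section, the node runs with ${slices} replaced by 100. Similarly, a ServerlessSpark node parameter that you enter as spark.eventLog.enabled : false is submitted as --conf spark.eventLog.enabled=false.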

  2. On the toolbar of the node editor, click Run.

    Important

    Before you publish the node, you must synchronize the ServerlessSpark Node Parameters in the Run Configuration section to the ServerlessSpark Node Parameters in the Scheduling configuration.

Next steps

  • Schedule a node: If a node in the project folder needs to run periodically, you can set the Scheduling Policies and configure scheduling properties in the Scheduling section on the right side of the node page.

  • Publish a node: If the task needs to run in the production environment, click the publish icon to publish the task. A node in the project folder runs on a schedule only after it is published to the production environment.

  • Node O&M: After you publish the task, you can view the status of the auto triggered task in the Operation Center. For more information, see Get started with Operation Center.

References

Appendix: DataWorks parameters

  • SERVERLESS_QUEUE_NAME: Specifies the resource queue to which the task is submitted. By default, tasks are submitted to the Default Resource Queue that is configured for the cluster in the Cluster Management section of the Management Center. If you require resource isolation and management, you can add queues. For more information, see Manage resource queues.

Configuration methods: