
DataWorks:Serverless PySpark node

Last Updated: Mar 25, 2026

The Serverless PySpark node lets you develop and run distributed PySpark jobs on EMR Serverless Spark without managing cluster infrastructure. After completing this guide, you will have:

  • Uploaded a Python resource file to DataWorks

  • Written a PySpark job in the Python code editor

  • Configured a spark-submit command to submit the job

  • Run the job and reviewed its output logs

Prerequisites

Before you begin, make sure you have:

How it works

The Serverless PySpark node uses a dual-pane editor:

  • Upper pane (Python code editor): Write your core business logic. Reference uploaded resource files such as .py modules.

  • Lower pane (Submit command editor): Enter the spark-submit command to submit the job to EMR Serverless Spark.

When you run the node, DataWorks automatically resolves code dependencies, injects resources, and submits the job using the built-in spark-submit tool. For parameter specifications, common options, and best practices, see Submit a job by using spark-submit.

Create a node

  1. Log on to the DataWorks console. Switch to the target region, then in the left-side navigation pane, choose Data Development and O&M > Data Development. Select the desired workspace from the drop-down list and click Go to Data Development.

  2. On the Data Development (Data Studio) page, create a Serverless PySpark node.

  3. Set the Path and Name for the node. This guide uses serverless_pyspark_test1 as the node name.

Develop the node

The following example uses the Monte Carlo method to estimate the value of pi (π) in a distributed manner.

Step 1: Upload dependent resources

Upload your custom Python file to the Resource Management module in DataWorks so you can reference it in your node code. This example uses utils.py, which defines the core logic for Monte Carlo simulation sampling within a single Spark task.
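The actual utils.py is user-supplied and not shown in this guide, but a minimal sketch of what it might contain looks like the following. It assumes estimate_pi_in_task(n), the function imported by the node code below, draws n uniform random points in the unit square and returns how many land inside the quarter circle of radius 1.

```python
# Hypothetical sketch of utils.py; the real file is whatever you upload.
import random


def estimate_pi_in_task(num_samples):
    """Sample points in the unit square and count those inside the
    quarter circle x^2 + y^2 <= 1. Runs entirely within one Spark task."""
    inside = 0
    for _ in range(num_samples):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside
```

Because the area ratio of the quarter circle to the unit square is π/4, the expression `4.0 * estimate_pi_in_task(n) / n` approximates π, which is exactly the aggregation the driver code performs across partitions.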

For more information about uploading resources, see EMR resources and functions.

  1. On the resource management page in Data Development, click Create Resource, select EMR File as the resource type, and set the resource Name.

  2. Click Re-Upload to upload the example utils.py file.

  3. Select the Storage path, Connection, and Resource Groups, then click Save.


Step 2: Write the Python code

In the Python code editor, enter the following code. This program distributes Monte Carlo sampling across partitions in a Spark cluster and aggregates the results to estimate pi.

Note

Resource references through the graphical user interface (GUI) are not currently supported. To reference a resource, add the resource reference statement directly to your Python code: ##@resource_reference{"ResourceName"}.

##@resource_reference{"utils.py"}
from pyspark.sql import SparkSession
from utils import estimate_pi_in_task
import sys

def main():
    # Create a SparkSession
    spark = SparkSession.builder.appName("EstimatePi").getOrCreate()

    sc = spark.sparkContext

    # Total number of samples
    total_samples = int(sys.argv[1])
    num_partitions = ${test1}

    # Samples per partition
    samples_per_partition = total_samples // num_partitions

    # Create an RDD where each partition runs estimate_pi_in_task once
    rdd = sc.parallelize(range(num_partitions), num_partitions)

    # Map each partition to execute the sampling task
    inside_counts = rdd.map(lambda _: estimate_pi_in_task(samples_per_partition))

    # Aggregate the results from all partitions
    total_inside = inside_counts.sum()
    pi_estimate = 4.0 * total_inside / total_samples

    print(f"Total samples: {total_samples}")
    print(f"Samples within the circle: {total_inside}")
    print(f"Estimated value of π: {pi_estimate:.6f}")

    spark.stop()

if __name__ == "__main__":
    main()
Note

Only the entire Python file can be submitted as a single Spark job. Running a selection of code is not supported.

The script uses two types of parameters:

| Parameter | Type | Value | Description |
| --- | --- | --- | --- |
| sys.argv[1] | Command-line parameter | The value following the script name in the spark-submit command | Total number of samples, for example, 10000 |
| ${test1} | Scheduling parameter | DataWorks dynamically replaces this variable with its value at runtime | Number of partitions, for example, 100 |
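The interplay between the two parameter types can be illustrated with a plain-Python sketch. The helper resolve_parameters and the literal value 100 below are hypothetical stand-ins: DataWorks substitutes ${test1} textually before the script runs, while sys.argv[1] arrives from the spark-submit command line.

```python
def resolve_parameters(argv, scheduled_num_partitions):
    """Mimic how the node script receives its two parameters.

    argv[1] plays the role of the command-line parameter (total samples);
    scheduled_num_partitions stands in for the ${test1} scheduling
    parameter, which DataWorks has already replaced with a number
    by the time the script executes.
    """
    total_samples = int(argv[1])
    samples_per_partition = total_samples // scheduled_num_partitions
    return total_samples, samples_per_partition


total, per_partition = resolve_parameters(
    ["serverless_pyspark_test1.py", "10000"], 100
)
print(total, per_partition)  # 10000 total samples, 100 per partition
```

With the example values from the table, each of the 100 partitions performs 10000 // 100 = 100 samples.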

Step 3: Write the spark-submit command

In the Submit command editor, enter the following command. It uses the built-in spark-submit tool to package the Python script and its dependencies and submit the job to EMR Serverless Spark.

spark-submit \
  --py-files utils.py \
  serverless_pyspark_test1.py 10000
Important

Follow these rules to avoid job submission failures:

  • Filename consistency: The main Python script filename in the spark-submit command must match the Name of the current node with a .py extension. For example, if the node is named serverless_pyspark_test1, use serverless_pyspark_test1.py.

  • Dependency declaration: Declare all external .py files explicitly using the --py-files option. For parameter specifications, common options, and best practices, see Submit a job by using spark-submit.

Run the node

  1. In the Run Settings pane, configure the compute resource and resource group.

    | Parameter | Description |
    | --- | --- |
    | Compute resource | Select a bound EMR Serverless Spark compute resource. If none is available, select Create Compute Resource from the drop-down list. |
    | Resource group | Select a resource group bound to the workspace. |
    | Script Parameters | If your node code defines variables in ${ParameterName} format, enter a value in the Value for This Run field. This value applies only to the current run and takes precedence; if it is not specified, the system uses the Parameter Value configured in the scheduling settings. |
  2. In the toolbar, click Run. DataWorks merges the complete Python script including resource references, submits it to EMR Serverless Spark using spark-submit, and returns the execution logs and results.

    Note

    After the run, log on to the EMR Serverless Spark console to view job details. The Job History page in the Operations Center displays execution status, duration, and resource usage. To view detailed logs, use the Spark UI. For more information, see Step 5: View the Spark UI.

What's next

  • Schedule the node: To run the node on a recurring schedule, configure the Scheduling Policy and related scheduling properties in the Scheduling pane.

  • Publish the node: To run the node in the production environment, click the Publish icon to start the publishing process. Periodic scheduling takes effect only after the node is published to the production environment.

  • Node O&M: After publishing, monitor the auto-triggered task status in the Operations Center. For more information, see Get started with Operation Center.

Related topics