The Serverless PySpark node lets you develop and run distributed PySpark jobs on EMR Serverless Spark without managing cluster infrastructure. After completing this guide, you will have:
Uploaded a Python resource file to DataWorks
Written a PySpark job in the Python code editor
Configured a spark-submit command to submit the job
Ran the job and reviewed its output logs
Prerequisites
Before you begin, make sure you have:
An EMR Serverless Spark compute resource with network connectivity to the resource group
A Serverless resource group bound to your workspace
A primary account or the Dev or Workspace Administrator role in the workspace. To add members, see Add members to a workspace
How it works
The Serverless PySpark node uses a dual-pane editor:
Upper pane (Python code editor): Write your core business logic. Reference uploaded resource files such as .py modules.
Lower pane (Submit command editor): Enter the spark-submit command to submit the job to EMR Serverless Spark.
When you run the node, DataWorks automatically resolves code dependencies, injects resources, and submits the job using the built-in spark-submit tool. For parameter specifications, common options, and best practices, see Submit a job by using spark-submit.
Create a node
Log on to the DataWorks console. Switch to the target region, then in the left-side navigation pane, choose Data Development and O&M > Data Development. Select the desired workspace from the drop-down list and click Go to Data Development.
On the Data Development (Data Studio) page, create a Serverless PySpark node.
Set the Path and Name for the node. This guide uses serverless_pyspark_test1 as the node name.
Develop the node
The following example uses the Monte Carlo method to estimate the value of pi (π) in a distributed manner.
Step 1: Upload dependent resources
Upload your custom Python file to the Resource Management module in DataWorks so you can reference it in your node code. This example uses utils.py, which defines the core logic for Monte Carlo simulation sampling within a single Spark task.
For more information about uploading resources, see EMR resources and functions.
On the resource management page in Data Development, click Create Resource, select EMR File as the resource type, and set the resource Name.
Click Re-Upload to upload the example utils.py file.
Select the Storage path, Connection, and Resource Groups, then click Save.
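For reference, the following is a minimal sketch of what utils.py might contain. The actual file contents and the exact signature of estimate_pi_in_task are assumptions inferred from how the node code in Step 2 uses them: the function samples random points in the unit square and counts how many land inside the quarter circle of radius 1.

```python
# utils.py -- a hypothetical sketch of the dependency file, not the shipped example.
import random

def estimate_pi_in_task(num_samples):
    """Monte Carlo sampling for one Spark task.

    Draws num_samples random points in the unit square and returns
    the count of points that fall inside the quarter circle of radius 1.
    """
    inside = 0
    for _ in range(num_samples):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside
```

Because the function is self-contained and uses only the standard library, it ships to executors cleanly when declared with --py-files.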

Step 2: Write the Python code
In the Python code editor, enter the following code. This program distributes Monte Carlo sampling across partitions in a Spark cluster and aggregates the results to estimate pi.
Resource references through the graphical user interface (GUI) are not currently supported. To reference a resource, add the resource reference statement directly to your Python code: ##@resource_reference{"ResourceName"}.
##@resource_reference{"utils.py"}
from pyspark.sql import SparkSession
from utils import estimate_pi_in_task
import sys

def main():
    # Create a SparkSession
    spark = SparkSession.builder.appName("EstimatePi").getOrCreate()
    sc = spark.sparkContext
    # Total number of samples
    total_samples = int(sys.argv[1])
    num_partitions = ${test1}
    # Samples per partition
    samples_per_partition = total_samples // num_partitions
    # Create an RDD where each partition runs estimate_pi_in_task once
    rdd = sc.parallelize(range(num_partitions), num_partitions)
    # Map each partition to execute the sampling task
    inside_counts = rdd.map(lambda _: estimate_pi_in_task(samples_per_partition))
    # Aggregate the results from all partitions
    total_inside = inside_counts.sum()
    pi_estimate = 4.0 * total_inside / total_samples
    print(f"Total samples: {total_samples}")
    print(f"Samples within the circle: {total_inside}")
    print(f"Estimated value of π: {pi_estimate:.6f}")
    spark.stop()

if __name__ == "__main__":
    main()

Only the entire Python file can be submitted as a single Spark job. Running a selection of code is not supported.
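Because the script's logic depends only on a command-line argument and a scheduling parameter, you can rehearse the same split/map/aggregate flow locally before submitting. The sketch below is illustrative only: it replaces sc.parallelize with a serial loop, and its estimate_pi_in_task is a stand-in for the function in utils.py, not that file's actual contents.

```python
# Spark-free rehearsal of the node's split/map/aggregate arithmetic.
import random

def estimate_pi_in_task(num_samples):
    # Stand-in for utils.estimate_pi_in_task (assumed behavior).
    inside = 0
    for _ in range(num_samples):
        if random.random() ** 2 + random.random() ** 2 <= 1.0:
            inside += 1
    return inside

def estimate_pi_locally(total_samples, num_partitions):
    # Mirrors the node code: fixed samples per partition, then a sum,
    # which is the serial equivalent of rdd.map(...).sum().
    samples_per_partition = total_samples // num_partitions
    total_inside = sum(estimate_pi_in_task(samples_per_partition)
                       for _ in range(num_partitions))
    return 4.0 * total_inside / total_samples

print(estimate_pi_locally(100000, 100))  # prints an estimate close to 3.14
```

If the local estimate looks reasonable, the distributed run should differ only in where the sampling executes.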
The script uses two types of parameters:
| Parameter | Type | Value | Description |
|---|---|---|---|
| sys.argv[1] | Command-line parameter | The value following the script name in the spark-submit command | Total number of samples, for example, 10000 |
| ${test1} | Scheduling parameter | DataWorks dynamically replaces this variable with its value at runtime | Number of partitions, for example, 100 |
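One detail worth noting: total_samples // num_partitions is integer division, so when the total is not an exact multiple of the partition count, the remainder samples are silently dropped and the effective sample size shrinks. The values below are illustrative:

```python
# Effective sample size under integer division (illustrative values only).
total_samples = 10000   # from sys.argv[1]
num_partitions = 100    # value substituted for ${test1} at runtime

samples_per_partition = total_samples // num_partitions
print(samples_per_partition)                    # 100
print(samples_per_partition * num_partitions)   # 10000: nothing dropped

# A non-divisible split drops the remainder:
print(10000 // 300)   # 33 samples per partition
print(33 * 300)       # 9900 effective samples; 100 are dropped
```

Choosing a partition count that divides the sample total keeps the estimate based on exactly the number of samples you requested.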
Step 3: Write the spark-submit command
In the Submit command editor, enter the following command. The command uses the built-in spark-submit tool to package the Python script and its dependencies and submit them to EMR Serverless Spark.
spark-submit \
--py-files utils.py \
serverless_pyspark_test1.py 10000

Follow these rules to avoid job submission failures:
Filename consistency: The main Python script filename in the spark-submit command must match the Name of the current node with a .py extension. For example, if the node is named serverless_pyspark_test1, use serverless_pyspark_test1.py.
Dependency declaration: Declare all external .py files explicitly using the --py-files option. For parameter specifications, common options, and best practices, see Submit a job by using spark-submit.
Run the node
In the Run Settings pane, configure the compute resource and resource group.
| Parameter | Description |
|---|---|
| Compute resource | Select a bound EMR Serverless Spark compute resource. If none are available, select Create Compute Resource from the drop-down list. |
| Resource group | Select a resource group bound to the workspace. |
| Script Parameters | If your node code defines variables in the ${ParameterName} format, enter a value in the Value for This Run field. The Value for This Run is effective only for the current run. The system prioritizes the Value for This Run. If it is not specified, the system uses the Parameter Value from the scheduling settings. The parameter value is synchronized with the value configured in the scheduling settings. |

In the toolbar, click Run. DataWorks merges the complete Python script, including resource references, submits it to EMR Serverless Spark using spark-submit, and returns the execution logs and results.

Note: After the run, log on to the EMR Serverless Spark console to view job details. The Job History page in the Operations Center displays execution status, duration, and resource usage. To view detailed logs, use the Spark UI. For more information, see Step 5: View the Spark UI.
What's next
Schedule the node: To run the node on a recurring schedule, configure the Scheduling Policy and related scheduling properties in the Scheduling pane.
Publish the node: To run the node in the production environment, click the icon to start the publishing process. Periodic scheduling takes effect only after the node is published to the production environment.
Node O&M: After publishing, monitor the auto-triggered task status in the Operations Center. For more information, see Get started with Operation Center.