All Products
Search
Document Center

DataWorks:EMR PySpark node

Last Updated:Mar 26, 2026

The EMR PySpark node lets you run Python-based Spark jobs directly from DataWorks. It combines a Python code editor and a spark-submit command editor in a single dual-pane interface, and integrates with Alibaba Cloud E-MapReduce (EMR) semi-managed clusters and EMR Serverless Spark (fully managed).

Prerequisites

Before you begin, ensure that you have:

  • A supported compute resource. The node supports two cluster types:

    • EMR compute resources (semi-managed): Only DataLake clusters and Custom clusters are supported. To enable EMR PySpark nodes on a semi-managed cluster, submit a ticket and specify the cluster type, EMR version, and your use case. We will assess the feasibility and provide support.

    • EMR Serverless Spark compute resources (fully managed)

  • A Serverless resource group. Only Serverless resource groups are supported.

  • The Dev or Workspace Administrator role. Log in as a root account or a RAM user with the Dev or Workspace Administrator role. To add members, see Add a member to a workspace.

How it works

The EMR PySpark node uses a dual-pane editor:

  • Upper pane: A Python code editor for your core business logic. Reference uploaded resource files (such as .py modules) using the ##@resource_reference annotation.

  • Lower pane: A command editor where you write the spark-submit command to submit your job to an EMR cluster.

When you click Run, DataWorks merges the Python script and its referenced resources, submits the job to the EMR cluster using spark-submit, and displays the execution logs and results.

Only the entire Python file can be submitted as a Spark job. Running a selected portion of the code is not supported.

Create a node

  1. Log on to the DataWorks console and switch to the target region. In the left-side navigation pane, click Data Development and O&M > Data Development. Select a workspace from the drop-down list and click Go to Data Development.

  2. On the Data Studio page, create an EMR PySpark node.

  3. Set the Path and Name for the node. This example uses emr_pyspark_test as the node name.

Develop a node

The following example estimates the value of pi (π) using the Monte Carlo method on a distributed cluster.

Step 1: Upload dependent resources

Upload your custom Python file to the Resource Management module so the node can reference it at runtime.

  1. In Data Development, go to the resource management page and click Create Resource. Select EMR File as the resource type and set the resource Name. For detailed steps, see EMR resources and functions.

  2. Click Re-Upload to upload the example utils.py file. This file defines the Monte Carlo sampling logic executed within each Spark task.

  3. Select the Storage path, Connection, and Resource Groups, then click Save.

    image

Step 2: Write the Python code

In the Python code editor (upper pane), write the following code. The program divides the workload across partitions, runs Monte Carlo sampling in parallel on worker nodes, and then aggregates the results to estimate pi.

##@resource_reference{"utils.py"}
from pyspark.sql import SparkSession
from utils import estimate_pi_in_task
import sys

def main():
    # Create a SparkSession
    spark = SparkSession.builder \
        .appName("EstimatePi") \
        .getOrCreate()

    sc = spark.sparkContext

    # Total number of samples
    total_samples = int(sys.argv[1])
    num_partitions = ${test1}

    # Number of samples per partition
    samples_per_partition = total_samples // num_partitions

    # Create an RDD where each partition executes estimate_pi_in_task once
    rdd = sc.parallelize(range(num_partitions), num_partitions)

    # Map each partition to execute the sampling task
    inside_counts = rdd.map(lambda _: estimate_pi_in_task(samples_per_partition))

    # Aggregate the results from all partitions
    total_inside = inside_counts.sum()
    pi_estimate = 4.0 * total_inside / total_samples

    print(f"Total samples: {total_samples}")
    print(f"Samples inside the circle: {total_inside}")
    print(f"Estimated value of Pi: {pi_estimate:.6f}")

    spark.stop()

if __name__ == "__main__":
    main()

The code uses two types of parameters:

ParameterTypeDescriptionExample value
sys.argv[1]Command-line argumentPassed after the script name in the spark-submit command10000 (total samples)
${test1}Script parameterReplaced by DataWorks with the actual value at runtime100 (number of partitions)

Step 3: Write the spark-submit command

In the command editor (lower pane), write the spark-submit command. The required parameters differ depending on your cluster type.

For semi-managed clusters (EMR compute resources) — uses the standard open-source Apache Spark spark-submit tool:

# Submit to a semi-managed EMR cluster (DataLake or Custom)
spark-submit \
  --master yarn \         # Specify YARN as the Spark cluster manager.
  --deploy-mode cluster \ # Specify the cluster deployment mode.
  --py-files utils.py \  # Explicitly declare the dependent Python resource file.
  emr_pyspark_test.py 10000  # Main entry file of the program.

For fully managed clusters (EMR Serverless Spark) — uses the Alibaba Cloud EMR Serverless spark-submit toolkit, which handles cluster management automatically:

# Submit to EMR Serverless Spark
spark-submit \
  --py-files utils.py \
  emr_pyspark_test.py 10000

For a full list of supported parameters:

Run a node

  1. In the Run Settings section, configure the following parameters:

    ParameterDescription
    Compute resourceSelect an EMR compute resource (semi-managed) or EMR Serverless Spark compute resource (fully managed). If none are available, select create a compute resource from the drop-down list.
    Resource groupSelect a resource group bound to the workspace.
    Script ParameterIf your code uses ${parameter_name} variables, specify the Value for this run here. DataWorks replaces the variable with this value at runtime. This value takes precedence over the default parameter value and applies only to the current execution.
  2. In the toolbar, click Run. DataWorks merges the Python script and its referenced resources, submits the job to the EMR cluster, and displays the execution logs and results.

What's next

  • Schedule a node: Set Scheduling Policies in the Scheduling section on the right side of the node page to run the node on a recurring schedule.

  • Publish a node: Click the image icon to publish the node to the production environment. A node runs on a schedule only after it is published.

  • Node O&M: After publishing, monitor the auto-triggered task status in Operation Center. See Get started with Operation Center.

Related topics