EMR PySpark 节点 - DataWorks - Alibaba Cloud Documentation Center

The EMR PySpark node lets you run Python-based Spark jobs directly from DataWorks. It combines a Python code editor and a spark-submit command editor in a single dual-pane interface, and integrates with Alibaba Cloud E-MapReduce (EMR) semi-managed clusters and EMR Serverless Spark (fully managed).

Prerequisites

Before you begin, ensure that you have:

A supported compute resource. The node supports two cluster types:
- EMR compute resources (semi-managed): Only DataLake clusters and Custom clusters are supported. To enable EMR PySpark nodes on a semi-managed cluster, submit a ticket and specify the cluster type, EMR version, and your use case. We will assess the feasibility and provide support.
- EMR Serverless Spark compute resources (fully managed)
A Serverless resource group. Only Serverless resource groups are supported.
The Dev or Workspace Administrator role. Log in as a root account or a RAM user with the Dev or Workspace Administrator role. To add members, see Add a member to a workspace.

How it works

The EMR PySpark node uses a dual-pane editor:

Upper pane: A Python code editor for your core business logic. Reference uploaded resource files (such as .py modules) using the ##@resource_reference annotation.
Lower pane: A command editor where you write the spark-submit command to submit your job to an EMR cluster.

When you click Run, DataWorks merges the Python script and its referenced resources, submits the job to the EMR cluster using spark-submit, and displays the execution logs and results.

Only the entire Python file can be submitted as a Spark job. Running a selected portion of the code is not supported.

Create a node

Log on to the DataWorks console and switch to the target region. In the left-side navigation pane, click Data Development and O&M > Data Development. Select a workspace from the drop-down list and click Go to Data Development.
On the Data Studio page, create an EMR PySpark node.
Set the Path and Name for the node. This example uses emr_pyspark_test as the node name.

Develop a node

The following example estimates the value of pi (π) using the Monte Carlo method on a distributed cluster.

Step 1: Upload dependent resources

Upload your custom Python file to the Resource Management module so the node can reference it at runtime.

In Data Development, go to the resource management page and click Create Resource. Select EMR File as the resource type and set the resource Name. For detailed steps, see EMR resources and functions.
Click Re-Upload to upload the example utils.py file. This file defines the Monte Carlo sampling logic executed within each Spark task.
Select the Storage path, Connection, and Resource Groups, then click Save.

Step 2: Write the Python code

In the Python code editor (upper pane), write the following code. The program divides the workload across partitions, runs Monte Carlo sampling in parallel on worker nodes, and then aggregates the results to estimate pi.

##@resource_reference{"utils.py"}
from pyspark.sql import SparkSession
from utils import estimate_pi_in_task
import sys

def main():
    # Create a SparkSession
    spark = SparkSession.builder \
        .appName("EstimatePi") \
        .getOrCreate()

    sc = spark.sparkContext

    # Total number of samples
    total_samples = int(sys.argv[1])
    num_partitions = ${test1}

    # Number of samples per partition
    samples_per_partition = total_samples // num_partitions

    # Create an RDD where each partition executes estimate_pi_in_task once
    rdd = sc.parallelize(range(num_partitions), num_partitions)

    # Map each partition to execute the sampling task
    inside_counts = rdd.map(lambda _: estimate_pi_in_task(samples_per_partition))

    # Aggregate the results from all partitions
    total_inside = inside_counts.sum()
    pi_estimate = 4.0 * total_inside / total_samples

    print(f"Total samples: {total_samples}")
    print(f"Samples inside the circle: {total_inside}")
    print(f"Estimated value of Pi: {pi_estimate:.6f}")

    spark.stop()

if __name__ == "__main__":
    main()

The code uses two types of parameters:

Parameter	Type	Description	Example value
`sys.argv[1]`	Command-line argument	Passed after the script name in the `spark-submit` command	`10000` (total samples)
`${test1}`	Script parameter	Replaced by DataWorks with the actual value at runtime	`100` (number of partitions)

Step 3: Write the spark-submit command

In the command editor (lower pane), write the spark-submit command. The required parameters differ depending on your cluster type.

For semi-managed clusters (EMR compute resources) — uses the standard open-source Apache Spark spark-submit tool:

# Submit to a semi-managed EMR cluster (DataLake or Custom)
spark-submit \
  --master yarn \         # Specify YARN as the Spark cluster manager.
  --deploy-mode cluster \ # Specify the cluster deployment mode.
  --py-files utils.py \  # Explicitly declare the dependent Python resource file.
  emr_pyspark_test.py 10000  # Main entry file of the program.

For fully managed clusters (EMR Serverless Spark) — uses the Alibaba Cloud EMR Serverless spark-submit toolkit, which handles cluster management automatically:

# Submit to EMR Serverless Spark
spark-submit \
  --py-files utils.py \
  emr_pyspark_test.py 10000

For a full list of supported parameters:

Semi-managed clusters: see the Apache Spark documentation on submitting applications
EMR Serverless Spark: see Submit a job by using spark-submit

Run a node

In the Run Settings section, configure the following parameters:

Parameter	Description
Compute resource	Select an EMR compute resource (semi-managed) or EMR Serverless Spark compute resource (fully managed). If none are available, select create a compute resource from the drop-down list.
Resource group	Select a resource group bound to the workspace.
Script Parameter	If your code uses `${parameter_name}` variables, specify the Value for this run here. DataWorks replaces the variable with this value at runtime. This value takes precedence over the default parameter value and applies only to the current execution.

In the toolbar, click Run. DataWorks merges the Python script and its referenced resources, submits the job to the EMR cluster, and displays the execution logs and results.

What's next

Schedule a node: Set Scheduling Policies in the Scheduling section on the right side of the node page to run the node on a recurring schedule.
Publish a node: Click the icon to publish the node to the production environment. A node runs on a schedule only after it is published.
Node O&M: After publishing, monitor the auto-triggered task status in Operation Center. See Get started with Operation Center.

DataWorks:EMR PySpark node