
E-MapReduce:Quick start for PySpark development

Last Updated: Mar 26, 2026

Write a PySpark script with your business logic and submit it to EMR Serverless Spark to run as a batch job. This tutorial walks you through the full process with a provided sample script that processes data in OSS using the Apache Spark framework.

In this tutorial, you:

  1. Download the sample Python file and data file.

  2. Upload the Python file to EMR Serverless Spark and the data file to OSS.

  3. Create a PySpark job, configure it, and run it.

  4. View the execution logs to verify the job completed successfully.

  5. Publish the job so it can be used in a workflow.

  6. Check the job details in the Spark UI.

Prerequisites

Before you begin, ensure that you have:

  • An EMR Serverless Spark workspace that you can open in the EMR console.

  • An OSS bucket in which to store the sample data file.

Step 1: download the sample files

This tutorial uses two sample files:

  • DataFrame.py — A PySpark script that uses the Apache Spark framework to process data in OSS.

  • employee.csv — A sample dataset containing employee names, departments, and salaries.

Download both files before proceeding:

Step 2: upload the files

Upload the Python file to EMR Serverless Spark and the data file to OSS separately.

Upload the Python file to EMR Serverless Spark

  1. Log on to the EMR console.

  2. In the left navigation pane, choose EMR Serverless > Spark.

  3. On the Spark page, click the name of your workspace.

  4. In the left navigation pane, click Artifacts.

  5. On the Artifacts page, click Upload File.

  6. In the Upload File dialog box, click the upload area to select DataFrame.py, or drag the file into the area.

Upload the data file to OSS

Upload employee.csv to an OSS bucket. For detailed steps, see Upload files.

Note the OSS path after the upload completes — you'll use it when configuring the job in the next step. The path follows this format: oss://<yourBucketName>/employee.csv.
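
If you assemble the path in a script rather than by hand, the format above can be expressed as a tiny helper. The function name and the bucket and key values are placeholders, not part of any OSS SDK:

```python
def oss_path(bucket, key):
    # Compose the oss://<yourBucketName>/<objectKey> form the job expects.
    return f"oss://{bucket}/{key}"


# For example, oss_path("my-bucket", "employee.csv")
# yields "oss://my-bucket/employee.csv".
```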

Step 3: create and run the job

  1. On the EMR Serverless Spark page, click Development in the left navigation pane.

  2. On the Development tab, click the icon for creating a job.

  3. In the dialog box, enter a name, select Application(Batch) > PySpark for the type, and click OK.

  4. In the upper-right corner, select a queue.

    To add or manage queues, see Manage resource queues.

  5. Configure the following parameters, then click Run. Keep the default settings for all other parameters.

    • Main Python Resource: Select DataFrame.py, the file you uploaded to the Artifacts page.

    • Execution Parameters: Enter the OSS path to employee.csv. For example: oss://<yourBucketName>/employee.csv.
  6. After the job runs, scroll to the Execution Records section below the editor. Click Logs in the Actions column.

  7. On the Log Exploration tab, confirm the job completed without errors.

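Inside the script, the Execution Parameters configured above arrive as ordinary command-line arguments. A sketch of reading the first one, assuming this tutorial's convention of passing a single OSS path (the helper name is hypothetical):

```python
import sys


def parse_execution_parameters(argv):
    # Execution Parameters from the job configuration are appended to the
    # script's argv; this tutorial passes one OSS path as the first one.
    if len(argv) < 2:
        raise SystemExit("usage: DataFrame.py <oss://yourBucket/employee.csv>")
    return argv[1]
```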

Step 4: publish the job

Publishing a job makes it available as a node in a workflow for scheduling.

  1. Click Publish in the upper-right corner of the job tab.

  2. In the Publish Job dialog box, enter the release information and click OK.

Step 5: view the Spark UI

After the job completes successfully, you can view its execution details in the Spark UI.

  1. In the left navigation pane, click Job History.

  2. On the Application page, click Spark UI in the Actions column for the job.

  3. On the Spark Jobs page, review the job details.


What's next