
E-MapReduce: Quick start for PySpark development

Last Updated: Dec 04, 2025

You can develop PySpark jobs by writing a Python script with your business logic and uploading it to EMR Serverless Spark. This topic provides an example to guide you through the development process.

Prerequisites

Procedure

Step 1: Prepare test files

In EMR Serverless Spark, you can develop Python files on an on-premises machine or another development platform and then submit them to EMR Serverless Spark for execution. This quick start provides test files to help you quickly become familiar with PySpark jobs. Download the test files so that you can use them in the following steps.

Click DataFrame.py and employee.csv to download the test files.

Note
  • The DataFrame.py file contains code that uses the Apache Spark framework to process data in OSS.

  • The employee.csv file contains sample data, including employee names, departments, and salaries.
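For reference, the following is a minimal sketch of what a PySpark script like DataFrame.py might look like. It is not the downloaded file itself: the column names (name, department, salary) and the aggregation are assumptions for illustration only. The script reads the OSS path of employee.csv from its first command-line argument, which corresponds to the Execution Parameters value that you configure in Step 3.

    # Hypothetical sketch of a script similar to DataFrame.py; the downloaded
    # file may differ. Column names name, department, and salary are assumed.
    import sys

    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # The OSS path of employee.csv is passed in through Execution Parameters,
        # for example oss://<yourBucketName>/employee.csv.
        input_path = sys.argv[1]

        spark = SparkSession.builder.appName("PySparkQuickStart").getOrCreate()

        # Read the CSV file, assuming a header row: name,department,salary.
        df = spark.read.csv(input_path, header=True, inferSchema=True)

        # Example aggregation: average salary per department.
        df.groupBy("department").avg("salary").show()

        spark.stop()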

Step 2: Upload the test files

  1. Upload the Python file to EMR Serverless Spark.

    1. Go to the resource upload page.

      1. Log on to the EMR console.

      2. In the navigation pane on the left, choose EMR Serverless > Spark.

      3. On the Spark page, click the name of the target workspace.

      4. On the EMR Serverless Spark page, in the left navigation pane, click Artifacts.

    2. On the Artifacts page, click Upload File.

    3. In the Upload File dialog box, click the upload area to select the Python file, or drag the file into the area.

      In this example, upload the DataFrame.py file.

  2. Upload the data file (employee.csv) to an OSS bucket in the Object Storage Service (OSS) console. For more information, see Upload files.
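    If you prefer to upload the data file from code instead of the console, the following is a minimal sketch that uses the OSS Python SDK (oss2). The endpoint, bucket name, and credential environment variables are placeholders and assumptions; replace them with your own values.

      # Hypothetical alternative to the console upload; assumes `pip install oss2`
      # and that your credentials are exported as environment variables.
      import os

      import oss2

      auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"], os.environ["OSS_ACCESS_KEY_SECRET"])
      # Replace the endpoint and bucket name with your own values.
      bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<yourBucketName>")

      # Upload the local employee.csv as oss://<yourBucketName>/employee.csv.
      bucket.put_object_from_file("employee.csv", "employee.csv")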

Step 3: Develop and run the job

  1. On the EMR Serverless Spark page, click Development in the navigation pane on the left.

  2. On the Development tab, click the icon to create a job.

  3. In the dialog box that appears, enter a name, select Application (Batch) > PySpark for Type, and then click OK.

  4. In the upper-right corner, select a queue.

    For more information about how to add a queue, see Manage resource queues.

  5. On the new job tab, configure the following parameters. Keep the default settings for the other parameters. Then, click Run.

    Main Python Resource: Select the Python file that you uploaded on the Artifacts page in the previous step. In this example, select DataFrame.py.

    Execution Parameters: Enter the OSS path of the data file (employee.csv) that you uploaded to OSS. Example: oss://<yourBucketName>/employee.csv. This value is passed to the script as a command-line argument (see the note after this procedure).

  6. After the job runs, go to the Execution Records section in the lower part of the tab and click Logs in the Actions column of the job.

  7. On the Log Exploration tab, you can view the log information.

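The value that you enter in Execution Parameters is passed to the Python script as a command-line argument. Assuming the script follows the pattern sketched in Step 1, you can also sanity-check the same logic on a machine that has PySpark installed and a downloaded copy of employee.csv before you rerun the job:

    # Local sanity check; assumes `pip install pyspark`, that employee.csv is in
    # the working directory, and that the columns are name, department, salary.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("local-check").getOrCreate()
    df = spark.read.csv("employee.csv", header=True, inferSchema=True)
    df.groupBy("department").avg("salary").show()
    spark.stop()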

Step 4: Publish the job

Important

A published job can be used as a node in a workflow.

  1. After the job runs, click Publish on the right.

  2. In the Publish Job dialog box, enter the release information and click OK.

Step 5: View the Spark UI

After the job runs successfully, you can view its status on the Spark UI.

  1. In the navigation pane on the left, click Job History.

  2. On the Application page, in the Actions column for the target job, click Spark UI.

  3. On the Spark Jobs page, you can view the job details.

