Build an Interactive PySpark Notebook for Data Analytics - E-MapReduce

EMR Serverless Spark supports interactive development through notebooks. This guide walks you through creating a notebook, loading sample data, running PySpark code, visualizing results, and publishing your first notebook.

Prerequisites

Before you begin, ensure that you have:

An Alibaba Cloud account — see Alibaba Cloud account registrationAlibaba Cloud account registration
The required roles granted to your account — see Grant roles to an Alibaba Cloud account
A workspace created — see Create a workspace
A notebook session instance running — see Manage notebook sessions

Step 1: Prepare a test file

Download employee.csv to use in this guide. The file contains employee names, departments, and salaries.

Step 2: Upload the test file to OSS

Notebooks in EMR Serverless Spark read data from Object Storage Service (OSS). Upload employee.csv to an OSS bucket so the notebook can access it. See Upload files for instructions.

Note the OSS path of the uploaded file — you will use it in the next step.

Step 3: Create and run a notebook

Create a notebook

On the EMR Serverless Spark page, click Development in the left navigation pane.
On the Development tab, click the icon.
In the dialog box, enter a name, select Interactive Analytic > Notebook as the type, and then click OK.

Attach a notebook session

In the upper-right corner, select a running notebook session instance from the drop-down list.

Multiple notebooks can share a single notebook session instance, so you can access the same session resources from multiple notebooks simultaneously without creating a new session for each one. To create a new session instead, select Create Notebook Session from the drop-down list. For details, see Manage notebook sessions.

Run a PySpark job

The following code reads the CSV file from OSS, displays the first five rows, and calculates the total salary per department.

Copy the following code into the Python cell of the notebook. Replace oss://path/to/file with the OSS path of the file you uploaded in Step 2.

# Read the CSV file from OSS
df = spark.read.option("delimiter", ",").option("header", True).csv("oss://path/to/file")
# Display the first five rows
df.show(5)
# Calculate total salary by department
sum_salary_per_department = df.groupBy("department").agg({"salary": "sum"}).show()

Click Execute All Cells to run the notebook. To run a single cell, click the icon next to that cell.
(Optional) To inspect job execution, hover over the icon for the current session in the session drop-down list and click Spark UI. The Spark Jobs page opens, where you can view detailed Spark job information.

Visualize data with third-party libraries

Notebook sessions come with matplotlib, numpy, and pandas pre-installed. The following example uses matplotlib to plot a simple line chart.

For other third-party libraries, see Use third-party Python libraries in a notebook.

Copy the following code into a cell.

import matplotlib.pyplot as plt

l = sc.parallelize(range(20)).collect()
plt.plot(l)
plt.ylabel('some numbers')
plt.show()

Click Execute All Cells to run the notebook. To run a single cell, click the icon next to that cell.

Step 4: Publish the notebook

After the notebook finishes running, click Publish in the upper-right corner. In the Publish dialog box, configure the parameters and click OK to save the notebook as a new version.

What's next

Manage notebook sessions — stop your session to avoid unnecessary charges, or adjust session resources.
Use third-party Python libraries in a notebook — install additional libraries beyond the pre-installed matplotlib, numpy, and pandas.