
E-MapReduce:Use Jupyter Notebook to interact with EMR Serverless Spark

Last Updated: Dec 01, 2025

Jupyter Notebook is a powerful interactive development tool. You can use it to write and run code on a web interface and view the results in real time, without having to precompile code or run scripts separately. This topic describes how to build a development environment that interacts efficiently with E-MapReduce (EMR) Serverless Spark.

Background

Apache Livy lets applications interact with Spark over a RESTful API, which significantly simplifies communication between Spark and application servers. For more information about the Livy API, see REST API.
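As an illustration of what a Livy REST call looks like, the following sketch builds the POST /sessions request that creates a PySpark session. The basic-auth header mirrors the `%spark add -a username -p <token>` options used later in this topic and is an assumption about the gateway's authentication scheme; verify it against your gateway before relying on it.

```python
import base64
import json
from urllib import request

def create_livy_session(endpoint: str, token: str) -> request.Request:
    """Build a Livy REST API request (POST /sessions) that creates a PySpark session.

    `endpoint` is the Livy gateway endpoint and `token` is the gateway token.
    The basic-auth scheme below is an assumption based on the sparkmagic
    `-a username -p <token>` convention; check your gateway's documentation.
    """
    payload = {
        "kind": "pyspark",        # session language: pyspark, spark (Scala), or sparkr
        "driverMemory": "4g",     # illustrative resource settings, adjust as needed
        "numExecutors": 2,
    }
    credentials = base64.b64encode(f"username:{token}".encode("utf-8")).decode("ascii")
    return request.Request(
        url=f"https://{endpoint}/sessions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )

# Sending the request (requires network access to the gateway):
# with request.urlopen(create_livy_session("<endpoint>", "<token>")) as resp:
#     print(json.load(resp))  # the response includes the session id and initial state
```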

When you use Jupyter Notebook for development, you can use the methods described in the following table to interact with EMR Serverless Spark. You can select a method based on your business requirements.

• Method 1: Use a Docker image to quickly build and start an environment

  Select this method if you want to quickly build a standalone development environment or need to reproduce the same settings on different machines.

• Method 2: Use the sparkmagic plugin to build and start an environment

  The sparkmagic plugin for Jupyter Notebook interacts with Spark by calling RESTful APIs and supports Livy, Lighter, and Ilum as backends. You can configure the plugin to call the Livy API of Serverless Spark and build a development environment that interacts with remote Spark clusters.

Prerequisites

A workspace is created.

Method 1: Use a Docker image to quickly build and start an environment

Step 1: Create a gateway and a token

  1. Create and start a gateway.

    1. Go to the Gateways page.

      1. Log on to the EMR console.

      2. In the left-side navigation pane, choose EMR Serverless > Spark.

      3. On the Spark page, click the name of the target workspace.

      4. On the EMR Serverless Spark page, click O&M Center > Gateway in the left-side navigation pane.

    2. Click the Livy Gateway tab.

    3. On the Livy Gateway page, click Create Livy Gateway.

    4. On the Create Gateway page, enter a Name (for example, Livy-gateway) and click Create.

      You can configure other parameters based on your business requirements. For more information about the parameters, see Manage gateways.

    5. On the Livy Gateway page, find the created gateway and click Start in the Actions column.

  2. Create a token.

    1. On the Gateway page, find Livy-gateway and click Tokens in the Actions column.

    2. Click Create Token.

    3. In the Create Token dialog box, enter a Name (for example, Livy-token) and click OK.

    4. Copy the token.

      Important

      After the token is created, you must immediately copy the token. You can no longer view the token after you leave the page. If the token expires or is lost, reset the token or create a new token.

Step 2: Use Docker to pull and start an image

  1. Run the following command to pull the image:

    docker pull emr-registry-registry.cn-hangzhou.cr.aliyuncs.com/serverless-spark-public/emr-spark-jupyter:latest
  2. Run the following command to start a container from the image:

    docker run -p <host_port>:8888 emr-registry-registry.cn-hangzhou.cr.aliyuncs.com/serverless-spark-public/emr-spark-jupyter:latest <endpoint> <token>

    The following list describes the parameters.

    • <host_port>: the port on the host that is mapped to port 8888 in the container.

    • <endpoint>: the endpoint of the Livy gateway. To view it, go to the Livy Gateway page, click the name of the created Livy gateway, and then view the endpoint on the Overview tab.

    • <token>: the token that you copied in Step 1.

    After the image is started, the following information is returned:

    [I 2024-09-23 05:38:14.429 ServerApp] jupyter_lsp | extension was successfully linked.
    [I 2024-09-23 05:38:14.432 ServerApp] jupyter_server_terminals | extension was successfully linked.
    [I 2024-09-23 05:38:14.436 ServerApp] jupyterlab | extension was successfully linked.
    [I 2024-09-23 05:38:14.439 ServerApp] notebook | extension was successfully linked.
    [I 2024-09-23 05:38:14.439 ServerApp] Writing Jupyter server cookie secret to /root/.local/share/jupyter/runtime/jupyter_cookie_secret
    [I 2024-09-23 05:38:14.596 ServerApp] notebook_shim | extension was successfully linked.
    [I 2024-09-23 05:38:14.624 ServerApp] notebook_shim | extension was successfully loaded.
    [I 2024-09-23 05:38:14.625 ServerApp] jupyter_lsp | extension was successfully loaded.
    [I 2024-09-23 05:38:14.626 ServerApp] jupyter_server_terminals | extension was successfully loaded.
    [I 2024-09-23 05:38:14.627 LabApp] JupyterLab extension loaded from /root/miniforge3/envs/livy/lib/python3.8/site-packages/jupyterlab
    [I 2024-09-23 05:38:14.627 LabApp] JupyterLab application directory is /root/miniforge3/envs/livy/share/jupyter/lab
    [I 2024-09-23 05:38:14.628 LabApp] Extension Manager is 'pypi'.
    [I 2024-09-23 05:38:14.637 ServerApp] jupyterlab | extension was successfully loaded.
    [I 2024-09-23 05:38:14.640 ServerApp] notebook | extension was successfully loaded.
    [I 2024-09-23 05:38:14.640 ServerApp] Serving notebooks from local directory: /root
    [I 2024-09-23 05:38:14.640 ServerApp] Jupyter Server 2.14.2 is running at:
    [I 2024-09-23 05:38:14.640 ServerApp] http://6eca53b95ca2:8888/lab?token=258c0dd75e22a10fb6e2c87ac738c2a7ba6a314c6b******
    [I 2024-09-23 05:38:14.640 ServerApp]     http://127.0.0.1:8888/lab?token=258c0dd75e22a10fb6e2c87ac738c2a7ba6a314c6b******
  3. Access the UI of Jupyter.

    Copy http://127.0.0.1:8888/lab?token=258c0dd75e22a10fb6e2c87ac738c2a7ba6a314c6b****** from the returned information into the address bar of your browser to connect to EMR Serverless Spark by using Jupyter.

    Note
    • If you are connecting to EMR Serverless Spark from a remote server, you must replace 127.0.0.1 with the actual IP address of the server.

    • If <host_port> is not 8888, replace 8888 in the URL with the actual port number.

Step 3: Test the connectivity

  1. On the JupyterLab page, click PySpark in the Notebook section.

    image

  2. Run the following code to query all accessible databases:

    spark.sql("show databases").show()
    

    The output that is shown in the following figure is returned.

    image

Method 2: Use the sparkmagic plugin to build and start an environment

Step 1: Create a gateway and a token

  The operations are the same as those in Method 1. Create and start a Livy gateway, and then create and copy a token. For more information, see Step 1: Create a gateway and a token in Method 1.

Step 2: Install and enable the sparkmagic plugin

  1. Run the following command to install the sparkmagic plugin:

    pip install sparkmagic
  2. Enable the sparkmagic plugin based on the Jupyter environment that you use.

    • If you use Jupyter Notebook, run the following command:

      jupyter nbextension enable --py --sys-prefix widgetsnbextension
    • If you use JupyterLab, run the following command:

      jupyter labextension install "@jupyter-widgets/jupyterlab-manager"

For more information about the sparkmagic plugin, see sparkmagic.

Step 3: Configure and start a Spark session

  1. Access the UI of Jupyter. For more information, see JupyterLab.

  2. Import the sparkmagic plugin.

    %load_ext sparkmagic.magics
  3. Modify the startup configurations of the session.

    1. Extend the startup timeout period to prevent timeout failures caused by resource scheduling delays.

      import sparkmagic.utils.configuration as conf
      conf.override("livy_session_startup_timeout_seconds", 1000)
      
    2. Optional. Customize Spark resource configurations.

      You can add parameters such as ttl and conf. For more information, see Livy Docs - REST API.

      The following configuration example shows how to modify the resource configurations of a driver.

      %%spark config
      {
         "conf": {
             "spark.driver.cores": "1",
             "spark.driver.memory": "7g"
         }
      }
      
  4. Create a session.

    Create a Spark session in Jupyter Notebook by using Python or Scala based on your business requirements.

    %spark add -s <session_name> -l python -u https://<endpoint> -a username -p <token>
    %spark add -s <session_name> -l scala -u https://<endpoint> -a username -p <token>

    Replace the following parameters based on your actual environment.

    • <session_name>: the name of the Spark session. You can specify a custom name.

    • <endpoint>: the Endpoint(Public) or Endpoint(Private) that you obtained on the Overview tab. If you use an internal endpoint, make sure that the machine on which Jupyter runs and the Livy gateway are deployed in the same region, and change https:// to http:// in the command.

    • <token>: the token that you copied in Step 1.

    The following figure shows the output when you create a Spark session by using Python.

    image

    Wait 1 to 5 minutes until idle appears in the State column, which indicates that the session is created and ready for use. The session details are displayed on the UI, and you can then use PySpark for interactive development. You can also log on to the EMR Serverless Spark console and view the session information on the Sessions tab of the Livy gateway.
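    The endpoint rule described above (https:// for public endpoints, http:// for internal ones) can be captured in a small helper. `normalize_endpoint` is a hypothetical convenience function, not part of sparkmagic; it only prepares the URL that you pass to `%spark add -u`.

```python
def normalize_endpoint(endpoint: str, internal: bool = False) -> str:
    """Return the URL to pass to `%spark add -u`.

    Public endpoints are reached over https://, while internal (same-region)
    endpoints must use http://. This helper is illustrative, not part of
    sparkmagic.
    """
    # Strip any scheme that was included in the value copied from the console.
    for prefix in ("https://", "http://"):
        if endpoint.startswith(prefix):
            endpoint = endpoint[len(prefix):]
            break
    scheme = "http" if internal else "https"
    return f"{scheme}://{endpoint}"

# For example:
# normalize_endpoint("gw.example.com")                     -> "https://gw.example.com"
# normalize_endpoint("https://gw.internal", internal=True) -> "http://gw.internal"
```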

  5. Verify the session.

    After you create the session, you can use %%spark to run code. For example, run the following code to view all databases in the current Spark environment:

    %%spark
    spark.sql("show databases").show()
    

    The output that is shown in the following figure is returned.

    image

(Optional) Step 4: Release the session resources

  • Automatically release the session resources.

    If the created session is idle for two hours, the session is automatically terminated.

  • Manually release the session resources.

    • Release the session resources by using the sparkmagic plugin.

      %spark delete -s <session_name>
    • Release the session resources in the EMR Serverless Spark console.

      On the Sessions tab of the Livy gateway, find the created session and click Close in the Actions column.

      image
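Closing a session in the console corresponds to the Livy REST API's DELETE /sessions/{id} call. The sketch below builds that request with Python's standard library; the basic-auth header mirrors the `%spark add -a username -p <token>` convention used earlier and is an assumption about the gateway's authentication scheme.

```python
import base64
from urllib import request

def delete_livy_session(endpoint: str, token: str, session_id: int) -> request.Request:
    """Build a Livy REST API request (DELETE /sessions/{id}) that releases a session.

    The basic-auth scheme is an assumption; verify it against your gateway.
    """
    credentials = base64.b64encode(f"username:{token}".encode("utf-8")).decode("ascii")
    return request.Request(
        url=f"https://{endpoint}/sessions/{session_id}",
        headers={"Authorization": f"Basic {credentials}"},
        method="DELETE",
    )

# Sending the request (requires network access to the gateway):
# request.urlopen(delete_livy_session("<endpoint>", "<token>", 0))
```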