Run Interactive Spark Jobs in DataWorks Notebook with Livy Gateway - E-MapReduce

Considerations

Before you begin, review the following constraints:

Only workspaces that use the new version of DataWorks Data Studio are supported.
Only Serverless resource groups are supported. For more information, see Use a Serverless resource group.
Only Python can be used to connect to EMR Serverless Spark compute resources.
Personal development environment instances created before 2025-12-01 do not support this feature. Create a new personal development environment instance instead.
The personal development environment instance must be at version 0.5.69 or later. To check the version, open the personal development environment, press CMD+SHIFT+P, and enter ABOUT. If an upgrade is available, follow the on-screen prompt to upgrade with one click.
A single task in a Serverless resource group supports a maximum of 64 CU (Compute Units). Keep tasks at or below 16 CU to avoid resource shortages and startup failures.
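
The version and CU constraints above can be expressed as a small pre-flight check. This is a minimal sketch: the function and constant names are hypothetical, and the version string and CU figure are values you look up yourself. DataWorks does not provide this helper.

```python
MIN_INSTANCE_VERSION = (0, 5, 69)  # minimum personal development environment version
MAX_TASK_CU = 64                   # per-task maximum in a Serverless resource group
RECOMMENDED_TASK_CU = 16           # suggested ceiling to avoid startup failures


def parse_version(text):
    """Turn a dotted version string such as '0.5.69' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def check_preflight(instance_version, task_cu):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if parse_version(instance_version) < MIN_INSTANCE_VERSION:
        problems.append("instance version %s is older than 0.5.69" % instance_version)
    if task_cu > MAX_TASK_CU:
        problems.append("%d CU exceeds the 64 CU per-task maximum" % task_cu)
    elif task_cu > RECOMMENDED_TASK_CU:
        problems.append("%d CU is above the recommended 16 CU" % task_cu)
    return problems
```

Comparing versions as integer tuples avoids the string-comparison trap in which "0.5.7" sorts after "0.5.69".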

Prerequisites

Before you begin, ensure that you have:

A Serverless Spark workspace. For more information, see Create a workspace.
EMR Serverless Spark compute resources attached to your DataWorks workspace. For more information, see Attach EMR Serverless Spark compute resources.
A Serverless Spark Livy Gateway. For more information, see Create a Livy Gateway.
A DataWorks personal development environment instance. For more information, see Create a personal development environment instance.
A Notebook node in Data Studio (New).

How it works

DataWorks Notebook connects to EMR Serverless Spark through the Livy Gateway — a REST-based service that accepts task submissions and status queries over HTTP and is compatible with multiple programming languages. When you run the %emr_serverless_spark Magic Command in a Python cell, the Notebook connects to the Livy Gateway and creates a Spark Session. Subsequent cells in the same Notebook reuse this session.
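
To illustrate the kind of REST exchange described above, the following sketch builds the requests that the open-source Apache Livy REST API defines for creating a session, running a statement, and querying a session. It only constructs the requests and sends nothing. Whether the Livy Gateway exposes exactly these paths, and how it expects the token to be passed, is an assumption here. The function names are hypothetical. In the Notebook, the %emr_serverless_spark Magic Command handles this exchange for you.

```python
import json


def create_session_request(spark_conf):
    """Describe the call that asks Livy for a new PySpark session."""
    body = {"kind": "pyspark", "conf": dict(spark_conf)}
    return ("POST", "/sessions", json.dumps(body, sort_keys=True))


def run_statement_request(session_id, code):
    """Describe the call that submits one snippet of code to a session."""
    path = "/sessions/%d/statements" % session_id
    return ("POST", path, json.dumps({"code": code}))


def session_info_request(session_id):
    """Describe the call that fetches a session, including its state."""
    return ("GET", "/sessions/%d" % session_id, None)
```

Each function returns a (method, path, body) tuple, so the shape of the exchange can be inspected without a running gateway.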

When you access Data Lake Formation (DLF) data, authentication is based on the identity of the code executor — the Livy Gateway token is created with that identity.

In the production environment, Notebook tasks submitted to the Operation Center for scheduling bypass the Livy Gateway entirely and run as batch jobs via spark-submit.

Create a Notebook node

Go to the Data Studio (New) page: open the DataWorks workspace list page, switch to the destination region in the top navigation bar, find your workspace, and click Quick Access > Data Studio in the Actions column.
Create a Notebook in one of the following locations:
- Project Folder: In the navigation pane, click the icon to go to the Data Development page. Click the icon and select Notebook.
- Personal Folder: In the navigation pane, click the icon to create a new Notebook file.
- One-time Tasks: In the navigation pane, click the icon to go to the manual page. Under One-time Tasks, click the icon and select New Node > Notebook.

Connect to Serverless Spark

Use the %emr_serverless_spark Magic Command in a Python cell to establish a connection.

Magic commands

Magic command	Description
`%emr_serverless_spark`	Connects to the Livy Gateway (starts it automatically if stopped) and creates a Spark Session. On success, the Spark Session details appear in the Spark UI of the output area.
`%emr_serverless_spark info`	Shows Livy Gateway details. Click the Web UI link for a full view.
`%emr_serverless_spark stop`	Clears the Spark Session and stops the Livy Gateway. If multiple users share this gateway, run this command with caution.
`%emr_serverless_spark delete`	Deletes the Livy Gateway. If multiple users share this gateway, run this command with caution.
`%emr_serverless_spark refresh_token`	Creates a new Livy token. Run this if an administrator accidentally deletes the token on the Livy Gateway page.

Run the connection command

Enter the following command in a Python cell, then select the compute resource and Livy Gateway in the lower-right corner of the cell.

Basic usage — connect with default settings:

%emr_serverless_spark

Parameterized usage — pass Spark configuration parameters:

%%emr_serverless_spark
{
  "spark_conf": {
    "spark.emr.serverless.environmentId": "<EMR Serverless Spark runtime environment ID>",
    "spark.emr.serverless.network.service.name": "<EMR Serverless Spark network connectivity ID>",
    "spark.driver.cores": "1",
    "spark.driver.memory": "8g",
    "spark.executor.cores": "1",
    "spark.executor.memory": "2g",
    "spark.driver.maxResultSize": "32g"
  }
}

Replace the placeholder values:

Placeholder	Description
`<EMR Serverless Spark runtime environment ID>`	The runtime environment ID of your EMR Serverless Spark instance
`<EMR Serverless Spark network connectivity ID>`	The network connectivity ID of your EMR Serverless Spark instance
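
Before pasting a configuration into the cell, it can help to sanity-check the property map. The sketch below is a conservative check under stated assumptions: it takes the inner map of Spark properties, relies on the fact that Spark property names are case-sensitive and begin with a lowercase spark. prefix, and accepts only sizes written as a number plus a unit such as 8g or 512m. The function name is hypothetical, and the check does not replace validation by Serverless Spark itself.

```python
import re

# A number followed by k, m, g, or t, with an optional trailing "b" (for example 8g, 512mb).
_SIZE = re.compile(r"^\d+(k|m|g|t)b?$", re.IGNORECASE)


def validate_spark_conf(conf):
    """Return a list of problems found in a map of Spark properties."""
    problems = []
    for key, value in conf.items():
        if not key.startswith("spark."):
            problems.append("%s: property names start with lowercase 'spark.'" % key)
            continue
        if key.endswith(".memory") or key.endswith(".maxResultSize"):
            if not _SIZE.match(str(value)):
                problems.append("%s: %r is not a size such as '8g'" % (key, value))
        elif key.endswith(".cores"):
            if not str(value).isdigit() or int(value) < 1:
                problems.append("%s: %r is not a positive integer" % (key, value))
    return problems
```

An empty list means the map passed these checks. Properties that the function does not recognize are left unchecked.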

Spark parameter priority

When the same Spark parameter is set both in Management Center and in the Notebook, the Global Configuration First option in Management Center > Serverless Spark > Spark determines which value takes effect:

Enabled: Management Center configurations take the highest priority and overwrite all parameters with the same name set in the Notebook.
Disabled: Spark parameters set in the Notebook cell take precedence over global configurations with the same name.
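
The priority rule can be modeled as a dictionary merge in which the higher-priority side is applied last. This is an illustrative sketch with hypothetical names, not the platform's implementation. It assumes that parameters set on only one side are kept in both modes, which the wording about parameters "with the same name" suggests.

```python
def resolve_spark_conf(global_conf, notebook_conf, global_first):
    """Merge the two parameter maps; the higher-priority side wins on name clashes."""
    if global_first:
        # Global Configuration First enabled: Management Center values overwrite.
        merged = dict(notebook_conf)
        merged.update(global_conf)
    else:
        # Disabled: values from the Notebook cell overwrite.
        merged = dict(global_conf)
        merged.update(notebook_conf)
    return merged
```

For example, if Management Center sets spark.executor.memory to 4g and the Notebook sets it to 2g, the result is 4g with the option enabled and 2g with it disabled.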

Session behavior

After you run %emr_serverless_spark:

The Notebook is set to use Serverless Spark as the compute resource. Subsequent cells are limited to Python, Markdown, and EMR Spark SQL types.
The first execution creates a Spark Session attached to the personal development environment instance you select. All subsequent executions in the same Notebook reuse this session.

Run SQL and PySpark code

After a successful connection, write and execute code directly in the Notebook.

Submit SQL with an EMR Spark SQL cell

Add an EMR Spark SQL cell and write SQL statements directly. The cell reuses the session from %emr_serverless_spark — no compute resource selection is needed.

Submit PySpark code with a Python cell

Add a new Python cell and write PySpark code. No %%spark prefix is required.
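
As a minimal sketch of such a cell, the following code builds a small DataFrame and aggregates it. The sample data and function names are hypothetical. The code assumes that a SparkSession object is available in the Notebook after the connection succeeds; the variable name spark used in the usage note is an assumption, not something this document confirms.

```python
def demo_rows():
    """Return a few (city, amount) tuples to use as sample input."""
    return [("hangzhou", 120), ("beijing", 80), ("hangzhou", 30)]


def total_by_city(spark, rows):
    """Sum the amounts per city on the cluster and return a plain dict."""
    df = spark.createDataFrame(rows, schema=["city", "amount"])
    # groupBy(...).sum("amount") names the result column "sum(amount)".
    totals = df.groupBy("city").sum("amount").collect()
    return {row["city"]: row["sum(amount)"] for row in totals}
```

In a Python cell, you would call total_by_city(spark, demo_rows()), where spark stands for the session object created by the connection.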

Publish to production

In production, Notebook tasks containing %emr_serverless_spark are submitted to the destination compute resource as batch jobs. The production environment does not use the Livy Gateway.

Before publishing, verify that the runtime image selected in the Scheduling Configuration contains all dependencies required to run your Notebook. To create a compatible image, see Create a DataWorks image based on a personal development environment.
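
One way to check dependencies is to run a short script inside the candidate image and list the modules that cannot be found. This is a sketch that uses only the Python standard library. The function name is hypothetical, and the list of required modules is something you compile from your own Notebook.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

Pass top-level module names only, such as pandas rather than pandas.io. An empty result means every listed module can be located in the current environment.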

Publish your Notebook based on its location:

Project Folder: Save the Notebook, then click the icon to publish. After publishing, find the task under Task O&M > Auto Triggered Task O&M > Auto Triggered Task in the Operation Center.
Personal Folder: Save the Notebook, click the icon to move it to the Project Folder, then click the icon to publish. After publishing, find the task under Task O&M > Auto Triggered Task O&M > Auto Triggered Task in the Operation Center.
One-time Tasks: Save the Notebook, then click the icon to publish. After publishing, find the task under Task O&M > One-time Task O&M > One-time Task in the Operation Center.

To unpublish a Notebook, right-click the node, select Delete, and follow the on-screen instructions to unpublish or delete the Notebook. For more information, see Unpublish a task.

What's next

Node scheduling configuration: Configure scheduling properties (such as a scheduled run time) to periodically run a Notebook from the Project Folder in the production environment.
Publish a node or workflow: Learn more about publishing options and workflow publishing.