
E-MapReduce:Develop a batch or streaming task

Last Updated: Mar 26, 2026

EMR Serverless Spark supports four task types for batch and streaming workloads: JAR, PySpark, SQL, and Spark Submit. This page walks you through creating, configuring, and publishing a task in the Data Development console.

Prerequisites

Before you begin, ensure that you have created a workspace.

Create and configure a task

Step 1: Open Data Development

  1. Log on to the E-MapReduce console.

  2. In the left navigation pane, choose EMR Serverless > Spark.

  3. On the Spark page, click the name of the target workspace.

  4. On the EMR Serverless Spark page, click Data Development in the left navigation pane.

Step 2: Create a task

  1. On the Development tab, click the icon for creating a new task.

  2. In the dialog box, enter a Name, select Batch Job or Streaming Job, and click OK.

  3. In the upper-right corner, select a queue.

Step 3: Configure task parameters

Select the tab for your task type and configure the parameters in the editor.

JAR

  • Main JAR Resource: The primary JAR package. Select Workspace (pre-uploaded on the Files page) or OSS.

  • Engine Version: The Spark version. See Introduction to engine versions.

  • Main Class: The main class to use when submitting the Spark task.

  • Execution Parameters: Runtime configuration items or custom parameters passed to the main class. Separate multiple parameters with spaces.

  • Timeout: The maximum allowed run time. The system stops the task if it exceeds this limit. Leave blank for no timeout.

  • Network Connection: An existing network connection for accessing data sources or external services within your Virtual Private Cloud (VPC). See Interconnect EMR Serverless Spark with other VPCs.

  • Mount Integrated File Directory: Disabled by default. When enabled, mounts the managed file directory to the task, allowing direct read and write access. Before enabling, add a file directory on the Files page under the Integrated File Directory tab. See Integrated file directory.

  • Mount to Executor: When enabled, also mounts the file directory to executors. Resource consumption varies with file usage.

  • File Resources: Files distributed to executors via the --files parameter. Select Workspace or OSS.

  • Archive Resources: Archive files unpacked and distributed to executors via the --archives parameter. Select Workspace or OSS.

  • JAR Resources: JAR dependency files added via the --jars parameter. Select Workspace or OSS.

  • Tags: Key-value pairs for task management.
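
Execution Parameters are separated by spaces. For instance, a hypothetical JAR task whose main class reads an input path and an iteration count might use the following (the argument names are illustrative, not defined by EMR):

```
--input oss://<YourBucket>/input.csv --iterations 100
```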

PySpark

PySpark tasks share most parameters with JAR tasks. The following parameters differ:

  • Main Python Resources: The primary Python file. Select Workspace or OSS.

  • Runtime Environment: Pre-configured resources based on the selected environment.

  • Pyfiles Resources: Python dependency files distributed via the --py-files parameter. Select Workspace or OSS.

  • File Resources: Files distributed to all executor nodes in the cluster. Select Workspace or OSS.

PySpark tasks do not have a Main Class parameter. All other parameters — Engine Version, Execution Parameters, Timeout, Network Connection, Mount Integrated File Directory, Mount to Executor, Archive Resources, JAR Resources, Tags, and Spark resource parameters — are the same as for JAR tasks.

SQL

SQL tasks share most parameters with JAR tasks. The following parameter differs:

  • SQL File: The SQL file to submit. Select Workspace (pre-uploaded on the Files page) or OSS.

SQL tasks do not have Main Class or Execution Parameters. All other parameters — Engine Version, Timeout, Network Connection, Mount Integrated File Directory, Mount to Executor, Tags, and Spark resource parameters — are the same as for JAR tasks.

Spark Submit

  • Engine Version: The Spark version. See Introduction to engine versions.

  • Script: Your spark-submit script. Example:

      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.memory=2g \
      oss://<YourBucket>/spark-examples_2.12-3.5.2.jar

  • Timeout: The maximum allowed run time. Leave blank for no timeout.

  • Network Connection: An existing VPC network connection for accessing data sources or external services.

  • Mount Integrated File Directory: See the JAR section for details.

  • Mount to Executor: See the JAR section for details.

  • Tags: Key-value pairs for task management.

Spark resource parameters (spark.driver.*, spark.executor.*, Dynamic Resource Allocation, More Memory Configurations, and Spark Configuration) apply to Spark Submit tasks in the same way as JAR tasks.

Spark resource parameters

All task types share the following Spark resource parameters:

  • spark.driver.cores: Number of CPU cores for the driver.

  • spark.driver.memory: Memory available to the driver.

  • spark.executor.cores: Number of virtual CPU cores per executor.

  • spark.executor.memory: Memory available to each executor.

  • spark.executor.instances: Number of executors to allocate.

  • Dynamic Resource Allocation: Disabled by default. When enabled, set Minimum Number of Executors (default: 2) and Maximum Number of Executors (default: 10 if spark.executor.instances is not set).

  • Spark Configuration: Additional Spark configuration as key-value pairs, with each key separated from its value by a space. For example: key value.
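
The Dynamic Resource Allocation settings above can also be expressed as Spark Configuration key-value pairs. This is a sketch that assumes the standard open-source Spark property names are honored by the service:

```
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 10
```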

More memory configurations:

  • spark.driver.memoryOverhead: Non-heap memory per driver. Default: max(384 MB, 10% × spark.driver.memory).

  • spark.executor.memoryOverhead: Non-heap memory per executor. Default: max(384 MB, 10% × spark.executor.memory).

  • spark.memory.offHeap.size: Off-heap memory available to Spark. Default: 1 GB. Takes effect only when spark.memory.offHeap.enabled is set to true. When using the Fusion Engine, off-heap memory is enabled by default and set to 1 GB.
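
As a sanity check on the overhead defaults above, the max(384 MB, 10%) rule can be sketched in Python. This is an illustration of the formula only, not EMR code; the function name is made up:

```python
def default_memory_overhead(memory_mb: float) -> float:
    """Default non-heap overhead: max(384 MB, 10% of the configured memory)."""
    return max(384.0, 0.10 * memory_mb)

# A driver with 8 GB (8192 MB) of memory gets 819.2 MB of overhead by default;
# a small 2 GB (2048 MB) executor falls back to the 384 MB floor.
print(default_memory_overhead(8192))  # 819.2
print(default_memory_overhead(2048))  # 384.0
```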

Mount Integrated File Directory: resource consumption

When Mount Integrated File Directory is enabled, the mount operation consumes driver resources equal to the greater of:

  • Fixed resources: 0.3 CPU core + 1 GB memory

  • Dynamic resources: 10% of spark.driver resources (0.1 × spark.driver cores and memory)

Example: If spark.driver is configured with 4 CPU cores and 8 GB memory, the dynamic resources are 0.4 CPU core + 0.8 GB memory. The actual consumed resources are max(0.3 core + 1 GB, 0.4 core + 0.8 GB) = 0.4 CPU core + 1 GB memory.
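
The per-dimension max in the example above can be sketched as a small Python helper. This is illustrative only; mount_overhead is a made-up name, not an EMR API:

```python
def mount_overhead(driver_cores: float, driver_memory_gb: float) -> tuple:
    """Resources consumed by mounting the integrated file directory:
    per dimension, the greater of the fixed floor (0.3 core, 1 GB)
    and 10% of the configured spark.driver resources."""
    cores = max(0.3, 0.10 * driver_cores)
    memory_gb = max(1.0, 0.10 * driver_memory_gb)
    return cores, memory_gb

# spark.driver with 4 cores and 8 GB: the dynamic share is 0.4 core + 0.8 GB,
# so the consumed resources are 0.4 CPU core + 1 GB memory.
print(mount_overhead(4, 8))  # (0.4, 1.0)
```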

Enabling mounting attaches the directory to the driver only by default. To also mount to executors, enable Mount to Executor.
Important

After mounting an integrated NAS file directory, configure a network connection. The VPC of the network connection must match the VPC where the NAS mount target resides.

Step 4: (Optional) Review version information

On the right side of the task development page, click the Version Information tab to view version history or compare versions.

Run and publish the task

  1. Click Run. After the task runs, go to the Execution Records area at the bottom of the page and click Details. You are redirected to the Overview page to view task details.

  2. Click Publish in the upper-right corner.

  3. In the Publish dialog box, enter Remarks and click OK.

FAQ

How do I set an automatic retry policy for failed streaming tasks?

Add these two Spark configuration items under Spark Configuration:

spark.emr.serverless.streaming.fail.retry.interval 60    # Retry interval in seconds
spark.emr.serverless.streaming.fail.retry.time 3         # Maximum number of retries

The example above retries up to 3 times with a 60-second interval between attempts.

What's next