
E-MapReduce:Develop a batch or streaming task

Last Updated: Mar 26, 2026

EMR Serverless Spark supports four task types for batch and streaming workloads: JAR, PySpark, SQL, and Spark Submit. This page walks you through creating, configuring, and publishing a task in the Data Development console.

Prerequisites

Before you begin, ensure that you have created a workspace.

Create and configure a task

Step 1: Open Data Development

  1. Log on to the E-MapReduce console.

  2. In the left navigation pane, choose EMR Serverless > Spark.

  3. On the Spark page, click the name of the target workspace.

  4. On the EMR Serverless Spark page, click Data Development in the left navigation pane.

Step 2: Create a task

  1. On the Development tab, click the icon for creating a new task.

  2. In the dialog box, enter a Name, select Batch Job or Streaming Job, and click OK.

  3. In the upper-right corner, select a queue.

Step 3: Configure task parameters

Select the tab for your task type and configure the parameters in the editor.

JAR

  • Main JAR Resource: The primary JAR package. Select Workspace (pre-uploaded on the Files page) or OSS.

  • Engine Version: The Spark version. See Introduction to engine versions.

  • Main Class: The main class to use when submitting the Spark task.

  • Execution Parameters: Runtime configuration items or custom parameters passed to the main class. Separate multiple parameters with spaces.

  • Timeout: The maximum allowed run time. The system stops the task if it exceeds this limit. Leave blank for no timeout.

  • Network Connection: An existing network connection for accessing data sources or external services within your Virtual Private Cloud (VPC). See Interconnect EMR Serverless Spark with other VPCs.

  • Mount Integrated File Directory: Disabled by default. When enabled, mounts the managed file directory to the task, allowing direct read and write access. Before enabling, add a file directory on the Files page under the Integrated File Directory tab. See Integrated file directory.

  • Mount to Executor: When enabled, also mounts the file directory to executors. Resource consumption varies with file usage.

  • File Resources: Files distributed to executors via the --files parameter. Select Workspace or OSS.

  • Archive Resources: Archive files unpacked and distributed to executors via the --archives parameter. Select Workspace or OSS.

  • JAR Resources: JAR dependency files added via the --jars parameter. Select Workspace or OSS.

  • Tags: Key-value pairs for task management.
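
Execution Parameters are separated by spaces. For instance, a hypothetical JAR task whose main class reads an input path and an iteration count might use the following (the argument names are illustrative, not defined by EMR):

```
--input oss://<YourBucket>/input.csv --iterations 100
```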

PySpark

PySpark tasks share most parameters with JAR tasks. The following parameters differ:

  • Main Python Resources: The primary Python file. Select Workspace or OSS.

  • Runtime Environment: Pre-configured resources based on the selected environment.

  • Pyfiles Resources: Python dependency files distributed via the --py-files parameter. Select Workspace or OSS.

  • File Resources: Files distributed to all executor nodes in the cluster. Select Workspace or OSS.

PySpark tasks do not have a Main Class parameter. All other parameters — Engine Version, Execution Parameters, Timeout, Network Connection, Mount Integrated File Directory, Mount to Executor, Archive Resources, JAR Resources, Tags, and Spark resource parameters — are the same as for JAR tasks.

SQL

SQL tasks share most parameters with JAR tasks. The following parameter differs:

  • SQL File: The SQL file to submit. Select Workspace (pre-uploaded on the Files page) or OSS.

SQL tasks do not have Main Class or Execution Parameters. All other parameters — Engine Version, Timeout, Network Connection, Mount Integrated File Directory, Mount to Executor, Tags, and Spark resource parameters — are the same as for JAR tasks.

Spark Submit

  • Engine Version: The Spark version. See Introduction to engine versions.

  • Script: Your spark-submit script. Example:

      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.memory=2g \
      oss://<YourBucket>/spark-examples_2.12-3.5.2.jar

  • Timeout: The maximum allowed run time. Leave blank for no timeout.

  • Network Connection: An existing VPC network connection for accessing data sources or external services.

  • Mount Integrated File Directory: See the JAR section for details.

  • Mount to Executor: See the JAR section for details.

  • Tags: Key-value pairs for task management.

Spark resource parameters (spark.driver.*, spark.executor.*, Dynamic Resource Allocation, More Memory Configurations, and Spark Configuration) apply to Spark Submit tasks in the same way as JAR tasks.

Spark resource parameters

All task types share the following Spark resource parameters:

  • spark.driver.cores: Number of CPU cores for the driver.

  • spark.driver.memory: Memory available to the driver.

  • spark.executor.cores: Number of virtual CPU cores per executor.

  • spark.executor.memory: Memory available to each executor.

  • spark.executor.instances: Number of executors to allocate.

  • Dynamic Resource Allocation: Disabled by default. When enabled, set Minimum Number of Executors (default: 2) and Maximum Number of Executors (default: 10 if spark.executor.instances is not set).

  • Spark Configuration: Additional Spark configuration as key-value pairs, with each key separated from its value by a space. For example: key value.
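
The Dynamic Resource Allocation settings above can also be expressed as Spark Configuration key-value pairs. This is a sketch that assumes the standard open-source Spark property names are honored by the service:

```
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 10
```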

More memory configurations:

  • spark.driver.memoryOverhead: Non-heap memory per driver. Default: max(384 MB, 10% × spark.driver.memory).

  • spark.executor.memoryOverhead: Non-heap memory per executor. Default: max(384 MB, 10% × spark.executor.memory).

  • spark.memory.offHeap.size: Off-heap memory available to Spark. Default: 1 GB. Takes effect only when spark.memory.offHeap.enabled is set to true. When using the Fusion Engine, off-heap memory is enabled by default and set to 1 GB.
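
As a sanity check on the overhead defaults above, the max(384 MB, 10%) rule can be sketched in Python. This is an illustration of the formula only, not EMR code; the function name is made up:

```python
def default_memory_overhead(memory_mb: float) -> float:
    """Default non-heap overhead: max(384 MB, 10% of the configured memory)."""
    return max(384.0, 0.10 * memory_mb)

# A driver with 8 GB (8192 MB) of memory gets 819.2 MB of overhead by default;
# a small 2 GB (2048 MB) executor falls back to the 384 MB floor.
print(default_memory_overhead(8192))  # 819.2
print(default_memory_overhead(2048))  # 384.0
```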

Mount Integrated File Directory: resource consumption

When Mount Integrated File Directory is enabled, the mount operation consumes driver resources equal to the greater of:

  • Fixed resources: 0.3 CPU core + 1 GB memory

  • Dynamic resources: 10% of spark.driver resources (0.1 × spark.driver cores and memory)

Example: If spark.driver is configured with 4 CPU cores and 8 GB memory, the dynamic resources are 0.4 CPU core + 0.8 GB memory. The actual consumed resources are max(0.3 core + 1 GB, 0.4 core + 0.8 GB) = 0.4 CPU core + 1 GB memory.
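
The per-dimension max in the example above can be sketched as a small Python helper. This is illustrative only; mount_overhead is a made-up name, not an EMR API:

```python
def mount_overhead(driver_cores: float, driver_memory_gb: float) -> tuple:
    """Resources consumed by mounting the integrated file directory:
    per dimension, the greater of the fixed floor (0.3 core, 1 GB)
    and 10% of the configured spark.driver resources."""
    cores = max(0.3, 0.10 * driver_cores)
    memory_gb = max(1.0, 0.10 * driver_memory_gb)
    return cores, memory_gb

# spark.driver with 4 cores and 8 GB: the dynamic share is 0.4 core + 0.8 GB,
# so the consumed resources are 0.4 CPU core + 1 GB memory.
print(mount_overhead(4, 8))  # (0.4, 1.0)
```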

Enabling mounting attaches the directory to the driver only by default. To also mount to executors, enable Mount to Executor.
Important

After mounting an integrated NAS file directory, configure a network connection. The VPC of the network connection must match the VPC where the NAS mount target resides.

Step 4: (Optional) Review version information

On the right side of the task development page, click the Version Information tab to view version history or compare versions.

Run and publish the task

  1. Click Run. After the task runs, go to the Execution Records area at the bottom of the page and click Details. You are redirected to the Overview page to view task details.

  2. Click Publish in the upper-right corner.

  3. In the Publish dialog box, enter Remarks and click OK.

FAQ

How do I set an automatic retry policy for failed streaming tasks?

Add these two Spark configuration items under Spark Configuration:

spark.emr.serverless.streaming.fail.retry.interval 60    # Retry interval in seconds
spark.emr.serverless.streaming.fail.retry.time 3         # Maximum number of retries

The example above retries up to 3 times with a 60-second interval between attempts.

What's next