Develop and deploy a Spark JAR batch job - E-MapReduce

This topic describes how to build and deploy a Spark JAR job on E-MapReduce (EMR) Serverless Spark, from Maven configuration through execution and publishing.

EMR Serverless Spark does not provide an integrated development environment (IDE) for JAR packages. Build and package your Spark application on a local or standalone development platform before uploading it.

Prerequisites

Before you begin, ensure that you have:

A workspace. See Workspace Management.
A JAR file built from your Spark application.

Step 1: Configure Maven dependencies

In the pom.xml of your Maven project, add the Spark dependencies with scope set to provided. The EMR Serverless Spark runtime already includes these libraries, so setting provided prevents duplicate packaging and version conflicts while keeping the dependencies available during compilation and testing.

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>3.5.2</version>
    <scope>provided</scope>
</dependency>
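Because provided-scope libraries are left out of the packaged JAR, the application can only load Spark classes where the runtime supplies them. The following is a minimal sketch of a classpath probe that illustrates this. The class name ProvidedScopeCheck and its method are hypothetical helpers, not part of Spark or EMR.

```java
/** Hypothetical helper: reports whether a class expected from a provided-scope dependency is loadable. */
final class ProvidedScopeCheck {

    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: look the class up without running its static initializers.
            Class.forName(className, false, ProvidedScopeCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String sparkSession = "org.apache.spark.sql.SparkSession";
        System.out.println(sparkSession + " available: " + isOnClasspath(sparkSession));
    }
}
```

Run with a plain JVM and no Spark libraries on the classpath, the probe reports that the class is unavailable. Inside a runtime that ships Spark, the same probe would be expected to find it.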

Code examples

The following two examples are used throughout this guide. Each targets a different main class, which you specify when configuring the job.

Example 1: Query a Data Lake Formation (DLF) table

Main class: org.example.HiveTableAccess

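package org.example;

import org.apache.spark.sql.SparkSession;

public class HiveTableAccess {
    public static void main(String[] args) {
        // Enable Hive support so that Spark SQL can access tables registered in the catalog.
        SparkSession spark = SparkSession.builder()
                .appName("DlfTableAccessExample")
                .enableHiveSupport()
                .getOrCreate();
        // Run the query and print the first rows of the result.
        spark.sql("SELECT * FROM test_table").show();
        spark.stop();
    }
}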
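Example 1 hard-codes the table name test_table. One possible variation, sketched here under stated assumptions, reads the table name from the first program argument and validates it before building the SQL string. The class TableQuery, its method, and the identifier rule are hypothetical and are not part of the sample JAR.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: builds the query string for Example 1 from program arguments. */
final class TableQuery {

    // Accepts "table" or "database.table" made of letters, digits, and underscores.
    private static final Pattern IDENTIFIER =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*(\\.[A-Za-z_][A-Za-z0-9_]*)?");

    /** Uses args[0] as the table name if present, otherwise falls back to defaultTable. */
    static String selectAll(String[] args, String defaultTable) {
        String table = (args.length >= 1) ? args[0] : defaultTable;
        if (!IDENTIFIER.matcher(table).matches()) {
            throw new IllegalArgumentException("Invalid table name: " + table);
        }
        return "SELECT * FROM " + table;
    }

    public static void main(String[] args) {
        // In the Spark job, this string would be passed to spark.sql(...).
        System.out.println(selectAll(args, "test_table"));
    }
}
```

Rejecting anything that is not a plain identifier keeps the argument from injecting extra SQL into the statement.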

Example 2: Calculate the approximate value of pi (π)

Main class: org.example.JavaSparkPi

package org.example;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;

/**
 * Computes an approximation to pi
 * Usage: JavaSparkPi [partitions]
 */
public final class JavaSparkPi {

  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaSparkPi")
      .getOrCreate();

    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
    int n = 100000 * slices;
    List<Integer> l = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      l.add(i);
    }

    JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

    int count = dataSet.map(integer -> {
      double x = Math.random() * 2 - 1;
      double y = Math.random() * 2 - 1;
      return (x * x + y * y <= 1) ? 1 : 0;
    }).reduce((integer, integer2) -> integer + integer2);

    System.out.println("Pi is roughly " + 4.0 * count / n);

    spark.stop();
  }
}
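The estimate in Example 2 rests on a simple ratio: points drawn uniformly from the square [-1, 1] × [-1, 1] fall inside the unit circle with probability π/4, so four times the observed fraction approximates π. The following plain-Java sketch reproduces the same calculation without Spark, which can help when checking the logic locally. The class LocalPi and the fixed seed are assumptions made for illustration.

```java
import java.util.Random;

/** Hypothetical single-JVM version of the Monte Carlo estimate used in JavaSparkPi. */
final class LocalPi {

    /** Draws the given number of random points and returns 4 * (fraction inside the unit circle). */
    static double estimatePi(int samples, long seed) {
        Random random = new Random(seed); // seeded so that repeated runs give the same result
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = random.nextDouble() * 2 - 1;
            double y = random.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) {
                inside++;
            }
        }
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        // Same argument handling as JavaSparkPi: an optional slice count, defaulting to 2.
        int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
        int n = 100000 * slices;
        System.out.println("Pi is roughly " + estimatePi(n, 42L));
    }
}
```

In the Spark version, the loop body runs inside map across partitions and reduce sums the per-point results. The arithmetic is otherwise the same.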

Click SparkExample-1.0-SNAPSHOT.jar to download a prebuilt test JAR package.

Step 2: Upload the JAR package

Log on to the EMR console.
In the left navigation pane, choose EMR Serverless > Spark.
On the Spark page, click the name of your workspace.
In the left navigation pane of the workspace, click Artifacts.
On the Artifacts page, click Upload File.
In the Upload File dialog box, click the upload area to select a local JAR package, or drag the package into the area. This guide uses SparkExample-1.0-SNAPSHOT.jar as an example.

Step 3: Create and run a job

In the left navigation pane, click Development.
On the Development tab, click the icon to create a new job.
Enter a name, set Type to Application(Batch) > JAR, and click OK.
In the upper-right corner, select a resource queue. For instructions on adding a queue, see Manage resource queues.

Configure the following parameters, leave the remaining settings at their defaults, and click Run.

Parameter	Description
Main JAR Resource	Select the JAR package uploaded in Step 2. In this example, select SparkExample-1.0-SNAPSHOT.jar.
Main Class	The entry point class for your Spark job. Enter `org.example.JavaSparkPi` for the pi example, or `org.example.HiveTableAccess` for the DLF table query.
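The Main Class value must be the fully qualified name of a class that exists in the uploaded JAR. As an optional local check before uploading, the sketch below looks for the matching .class entry in a JAR file. The class MainClassCheck is a hypothetical helper that uses only the standard java.util.jar API. It is not an EMR tool.

```java
import java.io.IOException;
import java.util.jar.JarFile;

/** Hypothetical pre-upload check: does the JAR contain the class named in Main Class? */
final class MainClassCheck {

    /** Converts a fully qualified class name to its entry path inside a JAR. */
    static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns true if the JAR at jarPath has an entry for the given class. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(toEntryName(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MainClassCheck <jar-path> <main-class>");
            return;
        }
        System.out.println(args[1] + " found: " + containsClass(args[0], args[1]));
    }
}
```

For example, the arguments SparkExample-1.0-SNAPSHOT.jar and org.example.JavaSparkPi check for the entry org/example/JavaSparkPi.class.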

After the job runs, go to the Execution Records section and click Logs in the Actions column to view the output.

Step 4: Publish the job

Important

Publishing a job makes it available as a node in a workflow.

After the job completes, click Publish in the upper-right corner.
In the dialog box, enter the release information and click OK.

(Optional) Step 5: View the Spark UI

After the job runs successfully, inspect its execution details on the Spark UI.

In the left navigation pane, click Job History.
On the Application page, find your job and click Spark UI in the Actions column.
On the Spark Jobs page, view the job details.

What's next

After publishing, use your job as a scheduled node in a workflow. See Manage workflows for details. For a complete walkthrough of job orchestration, see Get started with SparkSQL development.