EMR Serverless Spark New Features & Updates (August 2024) - E-MapReduce

This topic describes the features and updates released for E-MapReduce (EMR) Serverless Spark on August 20, 2024.

Overview

The August 20, 2024 release of EMR Serverless Spark adds platform updates in eight areas, including Spark SQL tasks, Apache Airflow and DolphinScheduler scheduling, CloudMonitor integration, and RAM user access control. It also ships engine version esr-2.2 (Spark 3.3.1, Scala 2.12), which brings Fusion acceleration covering 26 operators, 240 expressions, and 12 data types, a JindoSDK upgrade to NextArch 6.5.1, and an RPC retry mechanism for task startup under unstable network conditions.

Platform updates

Task development

The following task types are now supported:

Spark SQL
Application (batch): Java Archive (JAR), PySpark, SQL, and Spark Submit
Application (streaming): JAR and PySpark

Integration with other ecosystems

DataWorks: Associate EMR Serverless Spark with a DataWorks workspace.
Scheduling: Apache Airflow operators are supported, including compatibility with livy_operator. DolphinScheduler operators are also supported.
Metadata: Use external Hive Metastores to store metadata.
API access: The Spark Thrift Server provides Java Database Connectivity (JDBC) API access. The Livy service provides RESTful API access.
Command-line submission: The spark_submit command is available.
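
The RESTful access that the Livy service provides can be sketched as follows. This is a minimal sketch, assuming the service exposes the standard Apache Livy `POST /batches` endpoint. The endpoint URL, JAR path, class name, and Spark setting are hypothetical placeholders, and any authentication header that your gateway requires is not shown.

```python
import json
import urllib.request


def build_livy_batch_request(endpoint, file, class_name=None, args=None, conf=None):
    """Build (but do not send) a POST /batches request in the Apache Livy REST format."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration only.
request = build_livy_batch_request(
    endpoint="https://livy.example.com",
    file="oss://example-bucket/jars/app.jar",
    class_name="com.example.SparkApp",
    args=["2024-08-20"],
    conf={"spark.executor.memory": "4g"},
)
# To submit, pass the request to urllib.request.urlopen(), then poll
# GET /batches/{id} until the batch reaches a terminal state.
```

Building the request separately from sending it makes the payload easy to inspect before anything reaches the cluster.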

Notebook

Supported languages: PySpark, Python, and Markdown.
Data visualization is supported.

Workflow

Scheduled task types: Spark SQL, Application JAR, and PySpark.
Integration with CloudMonitor lets you monitor workflow and node status and configure alerting.
Manage workflows in topology view and grid view.

Task history

Collect statistics on memory usage and CPU utilization per task.

Resource management

Manage SQL computes, notebook computes, gateways, Spark Thrift Servers, and queues from a single interface.

Access control

Manage RAM user permissions at the workspace level.

Resource observation

Monitor compute unit (CU), CPU, and memory metrics in real time, scoped to the workspace or queue.
Filter and analyze metrics by time range.

Engine updates

esr-2.2 (Spark 3.3.1, Scala 2.12)

Fusion acceleration

Supports 26 common Spark operators. See the Operators section in the Fusion engine topic.
Supports 240 common Spark expressions. See the Expressions section in the Fusion engine topic.
Supports 12 basic data types. See the Data types section in the Fusion engine topic.
Celeborn is supported.
Parquet and Paimon formats can be read.
Operators and expressions not covered by Fusion acceleration fall back to the Java runtime environment.
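
The fallback behavior can be pictured with a small sketch. It is a conceptual illustration only, not the Fusion engine's implementation: the operator names and the `FUSION_SUPPORTED` set are hypothetical stand-ins for the supported-operator list in the Fusion engine topic, and the real engine may decide fallback at a different granularity.

```python
# Hypothetical subset of operators; the actual list is in the Fusion engine topic.
FUSION_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate", "Sort"}


def plan_execution(operators, supported=FUSION_SUPPORTED):
    """Label each operator in a plan as accelerated ('fusion') or fallback ('java')."""
    return [(op, "fusion" if op in supported else "java") for op in operators]


# A made-up plan in which one operator is outside the supported set.
plan = ["Scan", "Filter", "Generate", "HashAggregate"]
labeled = plan_execution(plan)
for op, runtime in labeled:
    print(f"{op}: {runtime}")
```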

Bug fixes and improvements

Paimon: Update and delete operations on append-only tables are now supported.
Hudi: Fixed an issue where jobs marked with TIMELINE_SERVER_BASED could not be terminated.
Spark UI and logs: Optimized log retrieval performance.
JindoSDK: Updated to NextArch 6.5.1. Committer optimization is supported when Fusion acceleration is disabled.
pandas and Matplotlib: Images for pandas and Matplotlib are now supported.
Network stability: A remote procedure call (RPC) retry mechanism ensures all tasks start successfully even under unstable network conditions.
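
The general idea behind an RPC retry mechanism can be sketched generically. The following is a minimal illustration of retry with exponential backoff, not the mechanism the engine ships. The names `call_with_retry` and `flaky_rpc`, and the attempt and delay values, are hypothetical.

```python
import time


def call_with_retry(rpc, max_attempts=5, base_delay=0.01, retry_on=(ConnectionError,)):
    """Call rpc(); on a retryable error, wait with exponential backoff and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated RPC that fails twice before succeeding.
calls = {"count": 0}


def flaky_rpc():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network drop")
    return "task started"


result = call_with_retry(flaky_rpc)
```

A bounded retry of this kind absorbs transient failures. If the network stays down for every attempt, the last error is still raised to the caller.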