This topic describes the August 20, 2024 release of E-MapReduce (EMR) Serverless Spark.
Overview
The August 20, 2024 release of EMR Serverless Spark adds eight platform features, including Spark SQL tasks, Apache Airflow and DolphinScheduler scheduling, CloudMonitor integration, and RAM user access control. It also ships engine version esr-2.2 (Spark 3.3.1, Scala 2.12), which provides Fusion acceleration covering 26 operators, 240 expressions, and 12 data types, a JindoSDK upgrade to NextArch 6.5.1, and an RPC retry mechanism that ensures all tasks start successfully even under unstable network conditions.
Platform updates
Task development
The following task types are now supported:
- Spark SQL
- Application (batch): Java Archive (JAR), PySpark, SQL, and Spark Submit
- Application (streaming): JAR and PySpark
Integration with other ecosystems
- DataWorks: Associate EMR Serverless Spark with a DataWorks workspace.
- Scheduling: Apache Airflow operators are supported, including compatibility with livy_operator. DolphinScheduler operators are also supported.
- Metadata: Use external Hive Metastores to store metadata.
- API access: The Spark Thrift Server provides Java Database Connectivity (JDBC) API access. The Livy service provides RESTful API access.
- Command-line submission: The spark_submit command is available.
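As an illustration of the RESTful access listed above, the following Python sketch builds a batch submission request in the format defined by the open source Apache Livy REST API (POST /batches with a JSON body). It is a minimal sketch under assumptions: the endpoint URL, OSS path, class name, and Spark setting are hypothetical placeholders, the helper build_livy_batch_request is not part of any SDK, and the Livy service in EMR Serverless Spark may require authentication or differ in details that these release notes do not cover. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_livy_batch_request(endpoint, file, class_name=None, args=None, conf=None):
    """Build (but do not send) a POST /batches request in Apache Livy's format."""
    payload = {"file": file}  # "file" is the only required field for a Livy batch
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration only; replace them with real ones.
request = build_livy_batch_request(
    endpoint="https://livy.example.invalid",
    file="oss://example-bucket/jobs/wordcount.jar",
    class_name="com.example.WordCount",
    args=["oss://example-bucket/input/"],
    conf={"spark.executor.memory": "4g"},
)
```

Passing the resulting object to urllib.request.urlopen would send it. Authentication headers, which a managed gateway typically requires, are deliberately left out of this sketch.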
Notebook
- Supported languages: PySpark, Python, and Markdown.
- Data visualization is supported.
Workflow
- Scheduled task types: Spark SQL, Application JAR, and PySpark.
- Integration with CloudMonitor lets you monitor workflow and node status and configure alerting.
- Manage workflows in topology view and grid view.
Task history
Collect statistics on memory usage and CPU utilization per task.
Resource management
Manage SQL computes, notebook computes, gateways, Spark Thrift Servers, and queues from a single interface.
Access control
Manage RAM user permissions at the workspace level.
Resource observation
- Real-time monitoring of compute unit (CU), CPU, and memory metrics, scoped to the workspace or queue.
- Filter and analyze metrics by time range.
Engine updates
esr-2.2 (Spark 3.3.1, Scala 2.12)
Fusion acceleration
- Supports 26 common Spark operators. See the Operators section in the Fusion engine topic.
- Supports 240 common Spark expressions. See the Expressions section in the Fusion engine topic.
- Supports 12 basic data types. See the Data types section in the Fusion engine topic.
- Celeborn is supported.
- Parquet and Paimon formats can be read.
- Operators and expressions not covered by Fusion acceleration fall back to the Java Runtime environment.
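The fallback rule in the last item can be pictured with a toy Python sketch. It is not the Fusion implementation: the operator names in NATIVE_OPERATORS are a hypothetical subset chosen for illustration (the actual list is in the Fusion engine topic), and assign_runtime is an invented helper that only shows the idea of routing each operator either to the accelerated path or back to the Java runtime.

```python
# Hypothetical subset of accelerated operators, for illustration only.
NATIVE_OPERATORS = {"Scan", "Filter", "Project", "HashAggregate"}


def assign_runtime(plan):
    """Label each operator in a plan as 'native' (accelerated) or 'jvm' (fallback)."""
    return [(op, "native" if op in NATIVE_OPERATORS else "jvm") for op in plan]


# "CustomOperator" stands in for any operator that the accelerated path does not cover.
labels = assign_runtime(["Scan", "Filter", "CustomOperator", "HashAggregate"])
```

In this sketch only the uncovered operator is labeled for the Java runtime, while the rest of the plan stays on the accelerated path.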
Bug fixes and improvements
- Paimon: Update and delete operations on append-only tables are now supported.
- Hudi: Fixed an issue where jobs marked with TIMELINE_SERVER_BASED could not be terminated.
- Spark UI and logs: Optimized log retrieval performance.
- JindoSDK: Updated to NextArch 6.5.1. Committer optimization is supported when Fusion acceleration is disabled.
- pandas and Matplotlib: Images for pandas and Matplotlib are now supported.
- Network stability: An RPC (remote procedure call) retry mechanism ensures all tasks start successfully even under unstable network conditions.
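The release notes do not describe how the RPC retry mechanism works internally. As a general illustration of the technique, the following Python sketch retries a failing call with exponential backoff and jitter. The function call_with_retry, its parameters, and the simulated flaky_start_task are all hypothetical and are not taken from the engine.

```python
import random
import time


def call_with_retry(rpc, max_attempts=5, base_delay=0.2, max_delay=5.0, sleep=time.sleep):
    """Call rpc() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Exponential backoff capped at max_delay, with random jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Simulated RPC that fails twice before succeeding.
attempts = {"count": 0}


def flaky_start_task():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network glitch")
    return "started"


# The no-op sleep keeps the example fast; omit it to use real delays.
result = call_with_retry(flaky_start_task, sleep=lambda seconds: None)
```

A bounded number of attempts keeps a persistent outage from blocking the caller indefinitely, and jitter spreads out retries from many clients that fail at the same moment.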