E-MapReduce: Release notes for EMR Serverless Spark on November 25, 2024

Last Updated: Dec 26, 2024

This topic describes the release notes for E-MapReduce (EMR) Serverless Spark on November 25, 2024.

Overview

On November 25, 2024, the latest version of EMR Serverless Spark was released, featuring platform improvements, ecosystem integration, improved performance, and enhanced engine capabilities.

Platform updates

Workflow

Notebook jobs can be scheduled in a workflow.

Job history

  • Stdout and Stderr logs can be viewed on the Development Job Runs tab of the Job History page in the EMR console.

  • The resources consumed to run a job, such as the memory size, CPU cores, and compute units (CUs), can be viewed on the same tab.

Session management

  • Internal endpoints are supported by Spark Thrift Servers.

  • Custom JAR packages are supported for Spark Thrift Servers whose engine version is esr-2.4 or later.

  • LDAP authentication and Ranger authentication are supported for Spark Thrift Servers (see the connection sketch after this list).

  • The creation time and start time of a session can be viewed.
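
For reference, a Spark Thrift Server with LDAP authentication enabled can be reached from any Hive-compatible client. The following is a minimal sketch that uses the PyHive library; the endpoint, port, and credentials are placeholders rather than values defined by this release.

```python
from pyhive import hive  # pip install "pyhive[hive]"

# Placeholder endpoint and LDAP credentials; replace them with the
# values configured for your Spark Thrift Server.
conn = hive.connect(
    host="<thrift-server-endpoint>",
    port=10000,  # placeholder port
    username="<ldap-user>",
    password="<ldap-password>",
    auth="LDAP",
)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
```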

Gateway management

  • Internal endpoints are supported by Livy gateways.

  • Spark sessions that are created by using the Livy interface can be viewed, and the Spark UI of these sessions can be accessed (see the sketch after this list).

  • The driver logs of Spark sessions that are created by using the Livy interface can be viewed by using a specified gateway. This applies only to Livy gateways whose version is esr-2.2.2 or later.

  • Custom runtime environments are available for Livy gateways.
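
The Livy REST API itself is unchanged by this release. The following sketch creates a PySpark session through a Livy gateway and submits one statement; the gateway URL is a placeholder, and any authentication the gateway requires is omitted.

```python
import time

import requests

LIVY = "https://<livy-gateway-endpoint>"  # placeholder gateway URL

# Create a PySpark session through the gateway.
session_id = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()["id"]

# Poll until the session is ready to accept statements.
while requests.get(f"{LIVY}/sessions/{session_id}").json()["state"] != "idle":
    time.sleep(5)

# Submit one statement; the session, its Spark UI, and (on esr-2.2.2 or
# later gateways) its driver log are then visible from the console.
resp = requests.post(
    f"{LIVY}/sessions/{session_id}/statements",
    json={"code": "spark.range(10).count()"},
)
print(resp.json())
```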

Data development

  • The maximum size of a notebook is increased.

  • The code of one notebook can be run in another notebook.

  • Custom runtime environments are available for PySpark jobs.

Others

  • Creation of folders is supported.

  • Comments can be added in the Spark Configuration field, as shown in the example after this list.

  • A Spark driver can be viewed after the spark_submit command is run.
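
For example, assuming the Spark Configuration field follows the spark-defaults.conf convention in which lines that start with # are treated as comments (the release note does not spell out the syntax), a commented configuration could look like this:

```
# Tune shuffle parallelism for this job
spark.sql.shuffle.partitions  200
# Give each executor more memory
spark.executor.memory         4g
```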

Engine updates

esr-3.0.0 (Spark 3.4.3, Scala 2.12)

  • Fusion acceleration

    • Data of complex data types in Parquet files can be read and processed.

    • Data can be written to a table in the Parquet format. Only engines whose version is esr-3.0.0 or later support this feature.

    • Three arguments can be configured for the parse_url function (see the SQL sketch after this list).

    • Data of the void data type is supported by the Parquet data source.

    • Encoding based on Parquet v2 is supported.

    • Conversion between data of the TIMESTAMP type and data of the STRING type is supported.

    • All time zone formats are supported by from_unix_timestamp and to_unix_timestamp.

    • WriteFileExec can be used to merge small files.

    • RankTopK operators are supported.

    • The following issue is fixed: UDFs in a generate operator are not pulled out to the upstream project operator.

    • Fusion and Celeborn can be used to implement both Java and native shuffling in a single process.
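
Two of the items above map directly to SQL. The following PySpark sketch uses standard Spark syntax for the three-argument form of parse_url and for a time zone-aware to_unix_timestamp conversion; the values are illustrative only, and which time zone formats are accepted depends on the engine (this sketch uses the standard 'Z' offset pattern).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Three-argument parse_url: extract a single query parameter by key.
spark.sql("""
    SELECT parse_url('https://example.com/path?user=alice&id=7',
                     'QUERY', 'user') AS user_param
""").show()  # -> alice

# to_unix_timestamp with an explicit zone offset in the input string.
spark.sql("""
    SELECT to_unix_timestamp('2024-11-25 00:00:00 +0800',
                             'yyyy-MM-dd HH:mm:ss Z') AS ts
""").show()
```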

  • Java Runtime

    • The max_pt expression is supported (see the sketch after this list).

    • The url_decode function with try semantics is supported.

    • The issue that Magic Committer cannot correctly process escape characters is fixed.

    • Concurrent jobs can be run to write data to different partitions of a table.

    • Automatic unarchiving of data in OSS is supported.

    • The following issue is fixed: When spark.read is used to read data in the Snappy format, the local library of Snappy fails to be loaded.

    • Data in the native Snappy format can be read.
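
The max_pt expression typically returns the largest partition value of a partitioned table so that a query reads only the most recent partition. The sketch below assumes a MaxCompute-style signature, max_pt('<table>'); the table and partition column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table `sales_log` partitioned by `pt`: read only the
# most recent partition, assuming a MaxCompute-style max_pt signature.
df = spark.sql("""
    SELECT *
    FROM sales_log
    WHERE pt = max_pt('sales_log')
""")
df.show()
```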

  • Paimon

    • JAR packages in custom lake formats are supported.

    • The format can be specified to determine the storage format of tables (see the sketch after this list).

    • Nested column pruning is supported.

    • The execution efficiency of count(*) is optimized.

    • Database migration is supported.
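
In Paimon, the storage format of a table's data files is controlled by the file.format table property (for example, parquet or orc). A minimal sketch, assuming a Spark session with a Paimon catalog already configured and an illustrative table name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pick the data-file format for a new Paimon table via `file.format`.
spark.sql("""
    CREATE TABLE paimon_orders (
        order_id BIGINT,
        amount   DOUBLE
    ) USING paimon
    TBLPROPERTIES ('file.format' = 'orc')
""")
```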

esr-2.4.0 (Spark 3.3.1, Scala 2.12)