All Products
Search
Document Center

E-MapReduce:Enhanced features of Spark in EMR

Last Updated:Mar 26, 2026

Alibaba Cloud E-MapReduce (EMR) runs on Elastic Compute Service (ECS) instances and enhances open source Apache Hadoop and Apache Spark with additional capabilities. This page lists the Spark enhancements shipped in each EMR release, organized by EMR series.

EMR V5.x series

EMR versionSpark versionChanges
EMR V5.17.0Spark 3.4.2Spark updated to 3.4.2.
EMR V5.16.0Spark 3.3.1Security vulnerabilities in the Commons Text library are fixed.
EMR V5.15.1Spark 3.3.1Configurations related to jdo are removed from the hive-site.xml file.
EMR V5.12.1Spark 3.3.1OSS-HDFS is now the default storage backend for Spark History Server. Object Storage Service (OSS) or OSS-HDFS can be used to store data for Spark3 Native Engine.
EMR V5.10.0Spark 3.3.1Spark updated to 3.3.1.
EMR V5.9.0Spark 3.3.0Spark updated to 3.3. Kerberos authentication is now supported.
EMR V5.8.0Spark 3.2.1LDAP authentication can be enabled with a single click.
EMR V5.6.0Spark 3.2.1Spark updated to 3.2.1.
EMR V5.5.0Spark 3.2.0See EMR V5.5.0 details.
EMR V5.4.0Spark 3.1.2See EMR V5.4.0 details.
EMR V5.3.0Spark 3.1.1The incompatibility between Spark and Delta Lake is fixed.
EMR V5.2.1Spark 3.1.1See EMR V5.2.1 details.

EMR V5.5.0 (Spark 3.2.0)

New features and improvements

  • Conditional COUNT DISTINCT: An IF expression can now be used inside COUNT DISTINCT, and the CASE WHEN syntax for COUNT DISTINCT is optimized. This feature is disabled by default. To enable it, set spark.sql.optimizer.rewriteConditionalDistinctAggregates to true.

  • Shuffle Hash Join fallback: Spark can fall back from Shuffle Hash Join to Sort Merge Join when Sort Merge Join is more efficient. To enable this behavior, set spark.sql.join.preferSortMergeJoin to false and spark.sql.join.enableShuffledHashJoinFallback to true.

  • Automatic small-file merging: Small output files in non-dynamic partitions are merged automatically. This feature is disabled by default. To enable it, set spark.sql.adaptive.merge.output.small.files.enabled to true.

  • Automatic concurrency adjustment for GROUPING SETS and DISTINCT: Spark automatically adjusts parallelism in queries that use GROUPING SETS or DISTINCT. This feature is disabled by default. To enable it, set spark.sql.execution.optimizeExpand to true.

  • Hive on Spark improvements: Performance and compatibility of Hive on Spark are optimized.

  • Time travel syntax: The time travel syntax is now supported.

  • JindoSDK integration: Spark is adapted to JindoSDK.

EMR V5.4.0 (Spark 3.1.2)

New features and improvements

  • Spark updated to 3.1.2.

  • Distinct computing performance in Spark SQL is optimized for aggregation operators that contain multiple count(distinct case ... when ...) expressions.

Bug fixes

  • Fixed an array-index out-of-bounds error that occurred when some statistics required by Adaptive Query Execution (AQE) were missing.

  • Fixed errors related to AQE and data caching in specific scenarios.

EMR V5.2.1 (Spark 3.1.1)

Important

In EMR V5.2.1, Spark 3.1.1 and Kudu 1.11.1 are incompatible with each other.

New features and improvements

  • Delta Lake and Hudi are now supported.

  • Remote Shuffle Service is now supported.

  • Livy is now supported.

  • Parameter names on the spark-defaults tab of the Configure tab in the EMR console are optimized.

  • Cost-based optimization (CBO), dynamic partition pruning, and Z-order features deliver 50% higher performance compared to standard Spark 3.

  • Log Service, DataHub, and Message Queue for Apache RocketMQ can be used as data sources.

EMR V4.x series

EMR versionSpark versionChanges
EMR V4.10.0Spark 2.4.8See EMR V4.10.0 details.
EMR V4.9.0Spark 2.4.7See EMR V4.9.0 details.
EMR V4.8.0Spark 2.4.7See EMR V4.8.0 details.
EMR V4.6.0Spark 2.4.7Spark updated to 2.4.7. jQuery updated to 3.5.1. Hive automatically updates table and partition sizes. Spark metadata and job information can be sent to DataWorks.
EMR V4.5.0Spark 2.4.5Metadata from Alibaba Cloud Data Lake Formation (DLF) is now supported.
EMR V4.3.0Spark 2.4.5Spark updated to 2.4.5. Delta Lake updated to 0.6.0. Fixed an issue where PySpark failed to run after Ranger Hive was enabled.

EMR V4.10.0 (Spark 2.4.8)

New features and improvements

  • Spark updated to 2.4.8.

  • Default configurations for Thrift Server are optimized.

  • Parameter names on the spark-defaults tab of the Configure tab in the EMR console are optimized.

  • Hive on Spark improvements.

  • The Zstandard compression algorithm is now supported.

Bug fixes

  • Fixed an issue where adaptive execution did not take effect in some scenarios.

  • Fixed an issue where Spark and Hive handled statistical aggregate functions differently.

  • Fixed an issue where Spark could not read valid CHAR type data from a Hive ORC table.

  • Fixed an array-index out-of-bounds error that occurred when some statistics required by AQE were missing.

  • Fixed errors related to AQE and data caching in specific scenarios.

  • Removed Log4j Metrics Appender, which had an invalid configuration.

  • Fixed a null pointer exception that occurred when SparkContext started.

EMR V4.9.0 (Spark 2.4.7)

Bug fixes

  • Fixed an issue where adaptive execution did not take effect in some scenarios.

  • Fixed an issue where Spark and Hive handled statistical aggregate functions differently.

  • Fixed an issue where Spark could not read valid CHAR type data from a Hive ORC table.

EMR V4.8.0 (Spark 2.4.7)

New features and improvements

  • Default configurations are optimized.

  • Window-based top-k queries can be pushed down, improving performance.

  • Reading from and writing to Hive tables in CSV or JSON format is enhanced.

  • All column names can be omitted in an ANALYZE statement.

  • LDAP authentication can be enabled or disabled with a single click.

  • Spark Beeline is easier to use.

EMR V3.x series

EMR versionSpark versionChanges
EMR V3.51.0Spark 3.4.2Spark updated to 3.4.2.
EMR V3.50.0Spark 3.3.1Security vulnerabilities in the Commons Text library are fixed.
EMR V3.49.0Spark 3.3.1Configurations related to jdo are removed from the hive-site.xml file.
EMR V3.46.1Spark 3.3.1OSS-HDFS is now the default storage backend for Spark History Server. OSS or OSS-HDFS can be used to store data for Spark3 Native Engine.
EMR V3.44.0Spark 3.3.1Spark updated to 3.3.1.
EMR V3.43.0Spark 3.3.0Spark updated to 3.3. Kerberos authentication is now supported.
EMR V3.40.0Spark 3.2.1Spark updated to 3.2.1.
EMR V3.39.1Spark 2.4.8Hive on Spark improvements. Spark adapted to JindoSDK.
EMR V3.38.1Spark 2.4.8Removed Log4j Metrics Appender (invalid configuration). Fixed a null pointer exception that occurred when SparkContext started.
EMR V3.38.0Spark 2.4.8See EMR V3.38.0 details.
EMR V3.37.0Spark 2.4.7The incompatibility between Spark and Delta Lake is fixed.
EMR V3.36.1Spark 2.4.7Parameter names on the spark-defaults tab of the Configure tab in the EMR console are optimized. Log collection performance is optimized. Zstandard compression is now supported.
EMR V3.35.0Spark 2.4.7Fixed issues where adaptive execution did not take effect, Spark and Hive handled aggregate stats differently, and Spark could not read CHAR type data from Hive ORC tables.
EMR V3.34.0Spark 2.4.7See EMR V3.34.0 details.
EMR V3.33.0Spark 2.4.7Spark updated to 2.4.7. jQuery updated to 3.5.1. Hive automatically updates table and partition sizes. Spark metadata and job information can be sent to DataWorks.
EMR V3.32.0Spark 2.4.5The data collection feature of JindoTable can be enabled or disabled.
EMR V3.30.0Spark 2.4.5See EMR V3.30.0 details.
EMR V3.29.0Spark 2.4.5Spark updated to 2.4.5.2.0. Third-party metastore support added. The datalake metastore-client parameter is added.
EMR V3.28.0Spark 2.4.5Spark updated to 2.4.5. Compatible with Streaming SQL scripts from DataFactory. Delta Lake 0.6.0 is now supported.
EMR V3.27.0Spark 2.4.3Partition fields of the date type are now supported in cube operations. The stack depth in spark-submit scripts is increased.
EMR V3.25.0Spark 2.4.3See EMR V3.25.0 details.
EMR V3.24.0Spark 2.4.3Delta Lake parameters are now supported. The Spark plugin can be configured in Ranger. JindoCube updated to 0.3.0.
EMR V3.23.0Spark 2.4.3See EMR V3.23.0 details.
EMR V3.22.0Spark 2.4.3See EMR V3.22.0 details.

EMR V3.38.0 (Spark 2.4.8)

New features and improvements

  • Spark updated to 2.4.8.

  • Both Spark 2.4.8 and Spark 3.1.2 are now supported.

    Note

    Delta Lake and Remote Shuffle Service are not supported with Spark 3 in this release.

  • Distinct computing performance in Spark SQL is optimized for aggregation operators that contain multiple count(distinct case ... when ...) expressions.

Bug fixes

  • Fixed an array-index out-of-bounds error that occurred when some statistics required by AQE were missing.

  • Fixed errors related to AQE and data caching in specific scenarios.

EMR V3.34.0 (Spark 2.4.7)

New features and improvements

  • Default configurations are optimized.

  • Window-based top-k queries can be pushed down, improving performance.

  • Reading from and writing to Hive tables in CSV or JSON format is enhanced.

  • All column names can be omitted in an ANALYZE statement.

  • LDAP authentication can be enabled or disabled with a single click.

  • Spark Beeline is easier to use.

EMR V3.30.0 (Spark 2.4.5)

New features and improvements

  • Metadata from Alibaba Cloud Data Lake Formation (DLF) is now supported.

  • Has dependencies updated to 2.0.1.

  • Delta is now separately deployed; the Delta JAR package is removed from the main distribution.

  • Logs are stored in HDFS directories.

Bug fixes

  • Fixed issues caused by backticks (` ``) in Streaming SQL.

EMR V3.25.0 (Spark 2.4.3)

New features and improvements

  • Delta Lake parameters such as spark.sql.extensions can now be configured in the EMR console.

  • Delta table data can be read using Hive without manually configuring InputFormat.

  • The ALTER TABLE SET TBLPROPERTIES and ALTER TABLE UNSET TBLPROPERTIES statements are now supported.

EMR V3.23.0 (Spark 2.4.3)

New features and improvements

  • MERGE INTO syntax is now supported.

  • SCAN and STREAM syntaxes are now supported.

  • Exactly-once semantics (EOS) for the Structured Streaming Kafka sink are now supported.

  • Delta upgraded to version 0.4.0.

Bug fixes

  • Fixed an issue where IsolatedClassLoader could not load classes in certain cases (Spark SQL Thrift Server).

  • Refactored Spark transaction code to improve stability.

  • Fixed an issue where optimized row columnar (ORC) files could not be read or written after the built-in Hive was upgraded to version 2.3.

EMR V3.22.0 (Spark 2.4.3)

New features and improvements

Relational cache

A relational cache accelerates queries by pre-computing data. During a query, Spark Optimizer automatically detects and applies an appropriate relational cache, optimizing the SQL execution plan. Use cases include multidimensional online analytical processing (MOLAP), report generation, dashboard creation, and cross-cluster data synchronization.

  • DDL operations supported: CACHE, UNCACHE, ALTER, and SHOW. All Spark data sources and formats are supported.

  • Caches can be updated automatically or with the REFRESH command. Incremental caching by partition is supported.

  • The SQL execution plan is automatically optimized when a relational cache is available.

Streaming SQL

  • Stream Query Writer parameters are normalized.

  • Schema compatibility checks for Kafka data tables are optimized.

  • A schema is automatically registered with Schema Registry for Kafka data tables that do not have an existing schema.

  • Log output when a Kafka schema is incompatible is improved.

  • The restriction that streaming SQL queries only support Kafka and LogHub data sources is removed.

Delta (added in this release)

The Delta component is now available. Use Spark to create a Delta data source for streaming writes, transactional reads and writes, data verification, and data rollback.

  • Read and write Delta data using the DataFrame API.

  • Read and write Delta data using the Structured Streaming API, with Delta as the source or sink.

  • Use the Delta API to update, delete, merge, vacuum, and optimize data.

  • Use SQL statements to create Delta tables, import data, and query Delta tables.

Bug fixes

  • Fixed an issue where the column name had to be explicitly specified when writing query results to a Kafka data table.

  • Resolved JAR conflicts, including the servlet conflict.