Alibaba Cloud E-MapReduce (EMR) runs on Elastic Compute Service (ECS) instances and enhances open source Apache Hadoop and Apache Spark with additional capabilities. This page lists the Spark enhancements shipped in each EMR release, organized by EMR series.
EMR V5.x series
| EMR version | Spark version | Changes |
|---|---|---|
| EMR V5.17.0 | Spark 3.4.2 | Spark updated to 3.4.2. |
| EMR V5.16.0 | Spark 3.3.1 | Security vulnerabilities in the Commons Text library are fixed. |
| EMR V5.15.1 | Spark 3.3.1 | Configurations related to jdo are removed from the hive-site.xml file. |
| EMR V5.12.1 | Spark 3.3.1 | OSS-HDFS is now the default storage backend for Spark History Server. Object Storage Service (OSS) or OSS-HDFS can be used to store data for Spark3 Native Engine. |
| EMR V5.10.0 | Spark 3.3.1 | Spark updated to 3.3.1. |
| EMR V5.9.0 | Spark 3.3.0 | Spark updated to 3.3. Kerberos authentication is now supported. |
| EMR V5.8.0 | Spark 3.2.1 | LDAP authentication can be enabled with a single click. |
| EMR V5.6.0 | Spark 3.2.1 | Spark updated to 3.2.1. |
| EMR V5.5.0 | Spark 3.2.0 | See EMR V5.5.0 details. |
| EMR V5.4.0 | Spark 3.1.2 | See EMR V5.4.0 details. |
| EMR V5.3.0 | Spark 3.1.1 | The incompatibility between Spark and Delta Lake is fixed. |
| EMR V5.2.1 | Spark 3.1.1 | See EMR V5.2.1 details. |
EMR V5.5.0 (Spark 3.2.0)
New features and improvements
Conditional COUNT DISTINCT: An
IFexpression can now be used insideCOUNT DISTINCT, and theCASE WHENsyntax forCOUNT DISTINCTis optimized. This feature is disabled by default. To enable it, setspark.sql.optimizer.rewriteConditionalDistinctAggregatestotrue.Shuffle Hash Join fallback: Spark can fall back from Shuffle Hash Join to Sort Merge Join when Sort Merge Join is more efficient. To enable this behavior, set
spark.sql.join.preferSortMergeJointofalseandspark.sql.join.enableShuffledHashJoinFallbacktotrue.Automatic small-file merging: Small output files in non-dynamic partitions are merged automatically. This feature is disabled by default. To enable it, set
spark.sql.adaptive.merge.output.small.files.enabledtotrue.Automatic concurrency adjustment for GROUPING SETS and DISTINCT: Spark automatically adjusts parallelism in queries that use
GROUPING SETSorDISTINCT. This feature is disabled by default. To enable it, setspark.sql.execution.optimizeExpandtotrue.Hive on Spark improvements: Performance and compatibility of Hive on Spark are optimized.
Time travel syntax: The time travel syntax is now supported.
JindoSDK integration: Spark is adapted to JindoSDK.
EMR V5.4.0 (Spark 3.1.2)
New features and improvements
Spark updated to 3.1.2.
Distinct computing performance in Spark SQL is optimized for aggregation operators that contain multiple
count(distinct case ... when ...)expressions.
Bug fixes
Fixed an array-index out-of-bounds error that occurred when some statistics required by Adaptive Query Execution (AQE) were missing.
Fixed errors related to AQE and data caching in specific scenarios.
EMR V5.2.1 (Spark 3.1.1)
In EMR V5.2.1, Spark 3.1.1 and Kudu 1.11.1 are incompatible with each other.
New features and improvements
Delta Lake and Hudi are now supported.
Remote Shuffle Service is now supported.
Livy is now supported.
Parameter names on the spark-defaults tab of the Configure tab in the EMR console are optimized.
Cost-based optimization (CBO), dynamic partition pruning, and Z-order features deliver 50% higher performance compared to standard Spark 3.
Log Service, DataHub, and Message Queue for Apache RocketMQ can be used as data sources.
EMR V4.x series
| EMR version | Spark version | Changes |
|---|---|---|
| EMR V4.10.0 | Spark 2.4.8 | See EMR V4.10.0 details. |
| EMR V4.9.0 | Spark 2.4.7 | See EMR V4.9.0 details. |
| EMR V4.8.0 | Spark 2.4.7 | See EMR V4.8.0 details. |
| EMR V4.6.0 | Spark 2.4.7 | Spark updated to 2.4.7. jQuery updated to 3.5.1. Hive automatically updates table and partition sizes. Spark metadata and job information can be sent to DataWorks. |
| EMR V4.5.0 | Spark 2.4.5 | Metadata from Alibaba Cloud Data Lake Formation (DLF) is now supported. |
| EMR V4.3.0 | Spark 2.4.5 | Spark updated to 2.4.5. Delta Lake updated to 0.6.0. Fixed an issue where PySpark failed to run after Ranger Hive was enabled. |
EMR V4.10.0 (Spark 2.4.8)
New features and improvements
Spark updated to 2.4.8.
Default configurations for Thrift Server are optimized.
Parameter names on the spark-defaults tab of the Configure tab in the EMR console are optimized.
Hive on Spark improvements.
The Zstandard compression algorithm is now supported.
Bug fixes
Fixed an issue where adaptive execution did not take effect in some scenarios.
Fixed an issue where Spark and Hive handled statistical aggregate functions differently.
Fixed an issue where Spark could not read valid
CHARtype data from a Hive ORC table.Fixed an array-index out-of-bounds error that occurred when some statistics required by AQE were missing.
Fixed errors related to AQE and data caching in specific scenarios.
Removed Log4j Metrics Appender, which had an invalid configuration.
Fixed a null pointer exception that occurred when SparkContext started.
EMR V4.9.0 (Spark 2.4.7)
Bug fixes
Fixed an issue where adaptive execution did not take effect in some scenarios.
Fixed an issue where Spark and Hive handled statistical aggregate functions differently.
Fixed an issue where Spark could not read valid
CHARtype data from a Hive ORC table.
EMR V4.8.0 (Spark 2.4.7)
New features and improvements
Default configurations are optimized.
Window-based top-k queries can be pushed down, improving performance.
Reading from and writing to Hive tables in CSV or JSON format is enhanced.
All column names can be omitted in an
ANALYZEstatement.LDAP authentication can be enabled or disabled with a single click.
Spark Beeline is easier to use.
EMR V3.x series
| EMR version | Spark version | Changes |
|---|---|---|
| EMR V3.51.0 | Spark 3.4.2 | Spark updated to 3.4.2. |
| EMR V3.50.0 | Spark 3.3.1 | Security vulnerabilities in the Commons Text library are fixed. |
| EMR V3.49.0 | Spark 3.3.1 | Configurations related to jdo are removed from the hive-site.xml file. |
| EMR V3.46.1 | Spark 3.3.1 | OSS-HDFS is now the default storage backend for Spark History Server. OSS or OSS-HDFS can be used to store data for Spark3 Native Engine. |
| EMR V3.44.0 | Spark 3.3.1 | Spark updated to 3.3.1. |
| EMR V3.43.0 | Spark 3.3.0 | Spark updated to 3.3. Kerberos authentication is now supported. |
| EMR V3.40.0 | Spark 3.2.1 | Spark updated to 3.2.1. |
| EMR V3.39.1 | Spark 2.4.8 | Hive on Spark improvements. Spark adapted to JindoSDK. |
| EMR V3.38.1 | Spark 2.4.8 | Removed Log4j Metrics Appender (invalid configuration). Fixed a null pointer exception that occurred when SparkContext started. |
| EMR V3.38.0 | Spark 2.4.8 | See EMR V3.38.0 details. |
| EMR V3.37.0 | Spark 2.4.7 | The incompatibility between Spark and Delta Lake is fixed. |
| EMR V3.36.1 | Spark 2.4.7 | Parameter names on the spark-defaults tab of the Configure tab in the EMR console are optimized. Log collection performance is optimized. Zstandard compression is now supported. |
| EMR V3.35.0 | Spark 2.4.7 | Fixed issues where adaptive execution did not take effect, Spark and Hive handled aggregate stats differently, and Spark could not read CHAR type data from Hive ORC tables. |
| EMR V3.34.0 | Spark 2.4.7 | See EMR V3.34.0 details. |
| EMR V3.33.0 | Spark 2.4.7 | Spark updated to 2.4.7. jQuery updated to 3.5.1. Hive automatically updates table and partition sizes. Spark metadata and job information can be sent to DataWorks. |
| EMR V3.32.0 | Spark 2.4.5 | The data collection feature of JindoTable can be enabled or disabled. |
| EMR V3.30.0 | Spark 2.4.5 | See EMR V3.30.0 details. |
| EMR V3.29.0 | Spark 2.4.5 | Spark updated to 2.4.5.2.0. Third-party metastore support added. The datalake metastore-client parameter is added. |
| EMR V3.28.0 | Spark 2.4.5 | Spark updated to 2.4.5. Compatible with Streaming SQL scripts from DataFactory. Delta Lake 0.6.0 is now supported. |
| EMR V3.27.0 | Spark 2.4.3 | Partition fields of the date type are now supported in cube operations. The stack depth in spark-submit scripts is increased. |
| EMR V3.25.0 | Spark 2.4.3 | See EMR V3.25.0 details. |
| EMR V3.24.0 | Spark 2.4.3 | Delta Lake parameters are now supported. The Spark plugin can be configured in Ranger. JindoCube updated to 0.3.0. |
| EMR V3.23.0 | Spark 2.4.3 | See EMR V3.23.0 details. |
| EMR V3.22.0 | Spark 2.4.3 | See EMR V3.22.0 details. |
EMR V3.38.0 (Spark 2.4.8)
New features and improvements
Spark updated to 2.4.8.
Both Spark 2.4.8 and Spark 3.1.2 are now supported.
NoteDelta Lake and Remote Shuffle Service are not supported with Spark 3 in this release.
Distinct computing performance in Spark SQL is optimized for aggregation operators that contain multiple
count(distinct case ... when ...)expressions.
Bug fixes
Fixed an array-index out-of-bounds error that occurred when some statistics required by AQE were missing.
Fixed errors related to AQE and data caching in specific scenarios.
EMR V3.34.0 (Spark 2.4.7)
New features and improvements
Default configurations are optimized.
Window-based top-k queries can be pushed down, improving performance.
Reading from and writing to Hive tables in CSV or JSON format is enhanced.
All column names can be omitted in an
ANALYZEstatement.LDAP authentication can be enabled or disabled with a single click.
Spark Beeline is easier to use.
EMR V3.30.0 (Spark 2.4.5)
New features and improvements
Metadata from Alibaba Cloud Data Lake Formation (DLF) is now supported.
Has dependencies updated to 2.0.1.
Delta is now separately deployed; the Delta JAR package is removed from the main distribution.
Logs are stored in HDFS directories.
Bug fixes
Fixed issues caused by backticks (`
``) in Streaming SQL.
EMR V3.25.0 (Spark 2.4.3)
New features and improvements
Delta Lake parameters such as
spark.sql.extensionscan now be configured in the EMR console.Delta table data can be read using Hive without manually configuring
InputFormat.The
ALTER TABLE SET TBLPROPERTIESandALTER TABLE UNSET TBLPROPERTIESstatements are now supported.
EMR V3.23.0 (Spark 2.4.3)
New features and improvements
MERGE INTOsyntax is now supported.SCANandSTREAMsyntaxes are now supported.Exactly-once semantics (EOS) for the Structured Streaming Kafka sink are now supported.
Delta upgraded to version 0.4.0.
Bug fixes
Fixed an issue where
IsolatedClassLoadercould not load classes in certain cases (Spark SQL Thrift Server).Refactored Spark transaction code to improve stability.
Fixed an issue where optimized row columnar (ORC) files could not be read or written after the built-in Hive was upgraded to version 2.3.
EMR V3.22.0 (Spark 2.4.3)
New features and improvements
Relational cache
A relational cache accelerates queries by pre-computing data. During a query, Spark Optimizer automatically detects and applies an appropriate relational cache, optimizing the SQL execution plan. Use cases include multidimensional online analytical processing (MOLAP), report generation, dashboard creation, and cross-cluster data synchronization.
DDL operations supported:
CACHE,UNCACHE,ALTER, andSHOW. All Spark data sources and formats are supported.Caches can be updated automatically or with the
REFRESHcommand. Incremental caching by partition is supported.The SQL execution plan is automatically optimized when a relational cache is available.
Streaming SQL
Stream Query Writer parameters are normalized.
Schema compatibility checks for Kafka data tables are optimized.
A schema is automatically registered with Schema Registry for Kafka data tables that do not have an existing schema.
Log output when a Kafka schema is incompatible is improved.
The restriction that streaming SQL queries only support Kafka and LogHub data sources is removed.
Delta (added in this release)
The Delta component is now available. Use Spark to create a Delta data source for streaming writes, transactional reads and writes, data verification, and data rollback.
Read and write Delta data using the DataFrame API.
Read and write Delta data using the Structured Streaming API, with Delta as the source or sink.
Use the Delta API to update, delete, merge, vacuum, and optimize data.
Use SQL statements to create Delta tables, import data, and query Delta tables.
Bug fixes
Fixed an issue where the column name had to be explicitly specified when writing query results to a Kafka data table.
Resolved JAR conflicts, including the servlet conflict.