The Fusion engine is a vectorized SQL execution engine built into EMR Serverless Spark. In TPC-DS benchmark tests, it delivers roughly three times the performance of open source Spark. The Fusion engine is fully compatible with open source Spark, so no code changes are required. To enable it, turn on the Use Fusion Acceleration switch when creating a session.
The Fusion engine accelerates Spark SQL and DataFrame jobs. It supports most operators, expressions, and data types, as detailed below.
Limitations
The Fusion engine does not accelerate the following job types:
Resilient Distributed Dataset (RDD) jobs
Jobs that use user-defined functions (UDFs)
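Because UDF-based jobs fall back to vanilla Spark execution, rewriting a UDF in terms of built-in expressions can restore acceleration. As a hedged sketch (the `users` table, `full_name` column, and `normalize_name` UDF are hypothetical), a simple normalization UDF can often be replaced with the built-in `lower` and `trim` functions, both of which appear in the supported expression tables below:

```sql
-- Instead of a user-defined function such as normalize_name(full_name),
-- express the same logic with built-in string functions, which the
-- Fusion engine can accelerate:
SELECT lower(trim(full_name)) AS normalized_name
FROM users;
```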
Supported storage formats
Parquet
Paimon
ORC (partial support)
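Storing a table in a supported format such as Parquet keeps its scans eligible for acceleration. A minimal sketch (the table and column names are hypothetical):

```sql
-- Parquet is a Fusion-supported storage format, so scans of this
-- table can run through the vectorized engine.
CREATE TABLE orders (
  order_id BIGINT,
  amount   DECIMAL(10, 2),
  order_ts TIMESTAMP
) USING parquet;
```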
Supported operators
| Type | Operators |
|---|---|
| Source | FileSourceScanExec, HiveTableScanExec, BatchScanExec, InMemoryTableScanExec |
| Sink | DataWritingCommandExec |
| Common operation | FilterExec, ProjectExec, SortExec, UnionExec |
| Aggregation | HashAggregateExec |
| Join | BroadcastHashJoinExec, ShuffledHashJoinExec, SortMergeJoinExec, BroadcastNestedLoopJoinExec, CartesianProductExec |
| Window | WindowExec, WindowTopK |
| Exchange | ShuffleExchangeExec, ReusedExchangeExec, BroadcastExchangeExec, CoalesceExec |
| Limit | GlobalLimitExec, LocalLimitExec, TakeOrderedAndProjectExec |
| Subquery | SubqueryBroadcastExec |
| Others | ExpandExec, GenerateExec |
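You can check which physical operators a query plan contains with Spark's `EXPLAIN` statement and compare them against the table above. A minimal sketch (the `orders` table is hypothetical):

```sql
-- EXPLAIN FORMATTED prints the physical plan; operators such as
-- FileSourceScanExec and HashAggregateExec in the output are
-- candidates for Fusion acceleration.
EXPLAIN FORMATTED
SELECT order_id, count(*) FROM orders GROUP BY order_id;
```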
Unsupported operators
| Type | Operators |
|---|---|
| Aggregation | ObjectHashAggregateExec, SortAggregateExec |
| Exchange | CustomShuffleReaderExec |
| Pandas | AggregateInPandasExec, FlatMapGroupsInPandasExec, ArrowEvalPythonExec, MapInPandasExec, WindowInPandasExec |
| Others | CollectLimitExec, RangeExec, SampleExec |
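When the optimizer selects one of these operators, that part of the plan runs on vanilla Spark. For example, a bare `LIMIT` without an `ORDER BY` typically compiles to `CollectLimitExec` (a hedged sketch; the table name is hypothetical):

```sql
-- A plain LIMIT usually becomes CollectLimitExec, which the Fusion
-- engine does not accelerate; adding an ORDER BY instead yields
-- TakeOrderedAndProjectExec, which is supported.
SELECT * FROM orders LIMIT 10;
```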
Supported expressions
| Type | Expressions |
|---|---|
| Comparison/Logic | =, ==, !=, &lt;&gt;, &lt;, &lt;=, &gt;, &gt;=, &lt;=&gt;, is null, is not null, between, and, or, \|\|, !, negative, nullif |
| Arithmetic | %, +, -, *, /, isnan, mod, negative, not, positive, abs, acos, acosh, asin, asinh, atan, atan2, atanh, cbrt, ceil, ceiling, cos, cosh, degrees, e, exp, floor, ln, log, log10, log2, pi, pmod, pow, power, radians, rand, random, rint, round, shiftleft, shiftright, sign, signum, sin, sqrt, tan, tanh |
| Bitwise | ^, \|, &, ~, bit_and, bit_count, bit_or, bit_xor, bit_length |
| Conditional | case, if, when |
| Set | in, find_in_set |
| String | ascii, char, chr, char_length, character_length, concat, instr, lcase, lower, length, locate, lpad, ltrim, overlay, replace, reverse, rtrim, split, split_part, substr, substring, trim, ucase, upper, like, regexp, regexp_extract, regexp_extract_all, regexp_like, regexp_replace, rlike |
| Aggregation | aggregate, approx_count_distinct, avg, collect_list, collect_set, corr, count, covar_pop, covar_samp, first, first_value, kurtosis, last, last_value, max, max_by, mean, min, regr_avgx, regr_avgy, regr_count, regr_r2, regr_intercept, regr_slope, regr_sxy, regr_sxx, regr_syy, skewness, std, stddev, stddev_pop, stddev_samp, sum, var_pop, var_samp, variance |
| Window | cume_dist, dense_rank, lag, lead, nth_value, ntile, percent_rank, rank, row_number |
| Time | add_months, current_date, current_timestamp, current_timezone, date, date_add, date_format, date_from_unix_date, date_sub, datediff, day, dayofmonth, dayofweek, dayofyear, from_unixtime, from_utc_timestamp, hour, last_day, make_date, minute, month, next_day, now, quarter, second, timestamp_micros, timestamp_millis, to_date, to_unix_timestamp, unix_seconds, unix_millis, unix_micros, weekday, weekofyear, year |
| JSON | get_json_object, json_array_length |
| Array | array, array_contains, array_distinct, array_except, array_intersect, array_join, array_max, array_min, array_position, array_remove, array_repeat, array_sort, arrays_overlap, arrays_zip, element_at, exists, filter, forall, flatten, shuffle, size, sort_array |
| Map | map, get_map_value, map_from_arrays, map_keys, map_values, map_zip_with, named_struct, struct, str_to_map |
| Encoding | crc32, hash, md5, sha1, sha2 |
| Others | current_catalog, current_database, greatest, least, monotonically_increasing_id, nanvl, spark_partition_id, stack, uuid, rand |
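A query built entirely from supported expressions can run fully vectorized. A minimal sketch using functions from the tables above (`orders`, `amount`, and `order_ts` are hypothetical names):

```sql
-- date_format, avg, and stddev_pop all appear in the supported
-- expression tables, so this aggregation is eligible for acceleration.
SELECT date_format(order_ts, 'yyyy-MM') AS order_month,
       avg(amount)        AS avg_amount,
       stddev_pop(amount) AS amount_stddev
FROM orders
WHERE amount IS NOT NULL
GROUP BY date_format(order_ts, 'yyyy-MM');
```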
Supported data types
Byte, Short, Int, and Long
Boolean
String and Binary
Decimal
Float and Double
Date and Timestamp
Unsupported data types
Struct
Array
Map
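To keep a table fully eligible for acceleration, declare its columns using only supported types; struct, array, and map columns cause the affected operators to fall back to vanilla Spark. A hedged sketch with hypothetical names:

```sql
-- Every column uses a Fusion-supported data type.
CREATE TABLE events (
  event_id   BIGINT,   -- Long
  is_valid   BOOLEAN,  -- Boolean
  payload    STRING,   -- String
  score      DOUBLE,   -- Double
  event_date DATE      -- Date
) USING parquet;
```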