The Fusion engine is a vectorized SQL execution engine built into EMR Serverless Spark. In TPC-DS benchmark tests, it delivers roughly three times the performance of open source Spark. The Fusion engine is fully compatible with open source Spark, so no code changes are required. To enable it, turn on the Use Fusion Acceleration switch when creating a session.
The Fusion engine accelerates Spark SQL and DataFrame jobs. It supports most operators, expressions, and data types, as detailed below.
Limitations
The Fusion engine does not accelerate the following job types:
Resilient Distributed Dataset (RDD) jobs
Jobs that use user-defined functions (UDFs)
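Because UDF-based jobs fall back to vanilla Spark execution, rewriting a UDF in terms of built-in expressions can restore acceleration. As a hedged sketch (the `users` table, `full_name` column, and `normalize_name` UDF are hypothetical), a simple normalization UDF can often be replaced with the built-in `lower` and `trim` functions, both of which appear in the supported expression tables below:

```sql
-- Instead of a user-defined function such as normalize_name(full_name),
-- express the same logic with built-in string functions, which the
-- Fusion engine can accelerate:
SELECT lower(trim(full_name)) AS normalized_name
FROM users;
```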
Supported storage formats
Parquet
Paimon
ORC (partial support)
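Storing a table in a supported format such as Parquet keeps its scans eligible for acceleration. A minimal sketch (the table and column names are hypothetical):

```sql
-- Parquet is a Fusion-supported storage format, so scans of this
-- table can run through the vectorized engine.
CREATE TABLE orders (
  order_id BIGINT,
  amount   DECIMAL(10, 2),
  order_ts TIMESTAMP
) USING parquet;
```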
Supported operators
| Type | Operators |
|---|---|
| Source | FileSourceScanExec, HiveTableScanExec, BatchScanExec, InMemoryTableScanExec |
| Sink | DataWritingCommandExec |
| Common operation | FilterExec, ProjectExec, SortExec, UnionExec |
| Aggregation | HashAggregateExec |
| Join | BroadcastHashJoinExec, ShuffledHashJoinExec, SortMergeJoinExec, BroadcastNestedLoopJoinExec, CartesianProductExec |
| Window | WindowExec, WindowTopK |
| Exchange | ShuffleExchangeExec, ReusedExchangeExec, BroadcastExchangeExec, CoalesceExec |
| Limit | GlobalLimitExec, LocalLimitExec, TakeOrderedAndProjectExec |
| Subquery | SubqueryBroadcastExec |
| Others | ExpandExec, GenerateExec |
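You can check which physical operators a query plan contains with Spark's `EXPLAIN` statement and compare them against the table above. A minimal sketch (the `orders` table is hypothetical):

```sql
-- EXPLAIN FORMATTED prints the physical plan; operators such as
-- FileSourceScanExec and HashAggregateExec in the output are
-- candidates for Fusion acceleration.
EXPLAIN FORMATTED
SELECT order_id, count(*) FROM orders GROUP BY order_id;
```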
Unsupported operators
| Type | Operators |
|---|---|
| Aggregation | ObjectHashAggregateExec, SortAggregateExec |
| Exchange | CustomShuffleReaderExec |
| Pandas | AggregateInPandasExec, FlatMapGroupsInPandasExec, ArrowEvalPythonExec, MapInPandasExec, WindowInPandasExec |
| Others | CollectLimitExec, RangeExec, SampleExec |
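When the optimizer selects one of these operators, that part of the plan runs on vanilla Spark. For example, a bare `LIMIT` without an `ORDER BY` typically compiles to `CollectLimitExec` (a hedged sketch; the table name is hypothetical):

```sql
-- A plain LIMIT usually becomes CollectLimitExec, which the Fusion
-- engine does not accelerate; adding an ORDER BY instead yields
-- TakeOrderedAndProjectExec, which is supported.
SELECT * FROM orders LIMIT 10;
```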
Supported expressions
| Type | Expressions |
|---|---|
| Comparison/Logic | =, ==, !=, &lt;&gt;, &lt;, &lt;=, &gt;, &gt;=, &lt;=&gt;, is null, is not null, between, and, or, \|\|, !, negative, nullif |
| Arithmetic | %, +, -, *, /, isnan, mod, negative, not, positive, abs, acos, acosh, asin, asinh, atan, atan2, atanh, cbrt, ceil, ceiling, cos, cosh, degrees, e, exp, floor, ln, log, log10, log2, pi, pmod, pow, power, radians, rand, random, rint, round, shiftleft, shiftright, sign, signum, sin, sqrt, tan, tanh |
| Bitwise | ^, \|, &, ~, bit_and, bit_count, bit_or, bit_xor, bit_length |
| Conditional | case, if, when |
| Set | in, find_in_set |
| String | ascii, char, chr, char_length, character_length, concat, instr, lcase, lower, length, locate, lpad, ltrim, overlay, replace, reverse, rtrim, split, split_part, substr, substring, trim, ucase, upper, like, regexp, regexp_extract, regexp_extract_all, regexp_like, regexp_replace, rlike |
| Aggregation | aggregate, approx_count_distinct, avg, collect_list, collect_set, corr, count, covar_pop, covar_samp, first, first_value, kurtosis, last, last_value, max, max_by, mean, min, regr_avgx, regr_avgy, regr_count, regr_r2, regr_intercept, regr_slope, regr_sxy, regr_sxx, regr_syy, skewness, std, stddev, stddev_pop, stddev_samp, sum, var_pop, var_samp, variance |
| Window | cume_dist, dense_rank, lag, lead, nth_value, ntile, percent_rank, rank, row_number |
| Time | add_months, current_date, current_timestamp, current_timezone, date, date_add, date_format, date_from_unix_date, date_sub, datediff, day, dayofmonth, dayofweek, dayofyear, from_unixtime, from_utc_timestamp, hour, last_day, make_date, minute, month, next_day, now, quarter, second, timestamp_micros, timestamp_millis, to_date, to_unix_timestamp, unix_seconds, unix_millis, unix_micros, weekday, weekofyear, year |
| JSON | get_json_object, json_array_length |
| Array | array, array_contains, array_distinct, array_except, array_intersect, array_join, array_max, array_min, array_position, array_remove, array_repeat, array_sort, arrays_overlap, arrays_zip, element_at, exists, filter, forall, flatten, shuffle, size, sort_array |
| Map | map, get_map_value, map_from_arrays, map_keys, map_values, map_zip_with, named_struct, struct, str_to_map |
| Encoding | crc32, hash, md5, sha1, sha2 |
| Others | current_catalog, current_database, greatest, least, monotonically_increasing_id, nanvl, spark_partition_id, stack, uuid, rand |
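A query built entirely from supported expressions can run fully vectorized. A minimal sketch using functions from the tables above (`orders`, `amount`, and `order_ts` are hypothetical names):

```sql
-- date_format, avg, and stddev_pop all appear in the supported
-- expression tables, so this aggregation is eligible for acceleration.
SELECT date_format(order_ts, 'yyyy-MM') AS order_month,
       avg(amount)        AS avg_amount,
       stddev_pop(amount) AS amount_stddev
FROM orders
WHERE amount IS NOT NULL
GROUP BY date_format(order_ts, 'yyyy-MM');
```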
Supported data types
Byte, Short, Int, and Long
Boolean
String and Binary
Decimal
Float and Double
Date and Timestamp
Unsupported data types
Struct
Array
Map
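To keep a table fully eligible for acceleration, declare its columns using only supported types; struct, array, and map columns cause the affected operators to fall back to vanilla Spark. A hedged sketch with hypothetical names:

```sql
-- Every column uses a Fusion-supported data type.
CREATE TABLE events (
  event_id   BIGINT,   -- Long
  is_valid   BOOLEAN,  -- Boolean
  payload    STRING,   -- String
  score      DOUBLE,   -- Double
  event_date DATE      -- Date
) USING parquet;
```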