E-MapReduce:Integrate Hudi with Spark SQL - E-MapReduce

Spark SQL support for Hudi was introduced in EMR Hudi 0.8.0, letting you query and modify Hudi tables with standard SQL — no Spark DataFrame code required. This significantly reduces the usage costs of Hudi and is useful when you want a lower-barrier entry point to Hudi on EMR, or when your team prefers SQL-based workflows over DataFrame APIs.

Limitations

Spark SQL read/write for Hudi requires one of the following EMR cluster versions:

EMR V3.36.0 or a later minor version
EMR V5.2.0 or a later minor version

Start Spark SQL

The required startup flags depend on your Spark and Hudi versions. Use the table below to identify your configuration before running the startup command.

Spark version	Hudi version	Required `--conf` flags
Spark 2 or Spark 3	Earlier than 0.11	`spark.serializer`, `spark.sql.extensions`
Spark 3	0.11 or later	`spark.serializer`, `spark.sql.extensions`, `spark.sql.catalog.spark_catalog`

Spark 2 or Spark 3, Hudi earlier than 0.11:

spark-sql \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

Spark 3, Hudi 0.11 or later:

spark-sql \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'

Examples

The following examples walk through the full lifecycle of a Hudi table: create, inspect, write, and query.

Create a table

CREATE TABLE h0 (
  id    BIGINT,
  name  STRING,
  price DOUBLE,
  ts    LONG
) USING hudi
TBLPROPERTIES (
  primaryKey="id",
  preCombineField="ts"
);

primaryKey uniquely identifies each record within the table. Hudi uses it for indexing and upsert lookups.
preCombineField is the ordering field — typically an event timestamp — that Hudi uses to resolve conflicts when multiple versions of the same record exist.

Inspect the table schema

DESC FORMATTED h0;

Expected output:

_hoodie_commit_time       string
_hoodie_commit_seqno      string
_hoodie_record_key        string
_hoodie_partition_path    string
_hoodie_file_name         string
id                        bigint
name                      string
price                     double
ts                        bigint

Note Hudi automatically adds five metadata fields prefixed with _hoodie_ to every table. These fields track commit history, record identity, partition location, and file information — they support Hudi's indexing and incremental query features.

Insert data

INSERT INTO h0 VALUES (1, 'a1', 10, 1000), (2, 'a2', 11, 1000);

Update data

UPDATE h0 SET name = 'a1_new' WHERE id = 1;

Delete data

DELETE FROM h0 WHERE id = 1;

Query data

After the insert, update, and delete operations above, query the table to confirm the results.

Query specific columns:

SELECT id, name, price, ts FROM h0;

Expected output:

2    a2    11.0    1000

Query all columns (including metadata fields):

SELECT * FROM h0;

Expected output:

4.820221130150621338    20221130150621338_0_1    id:2        40d6507e-0579-42ce-a10f-c5e07a3981e5-0_0-29-2007_2022113015062****.parquet    2    a2    11.0    1000

Note This is a non-partitioned table, so _hoodie_partition_path is blank. The first four values in the output are the Hudi metadata fields.