Spark SQL support for Hudi was introduced in EMR Hudi 0.8.0, letting you query and modify Hudi tables with standard SQL, with no Spark DataFrame code required. This lowers the barrier to adopting Hudi on EMR and is a good fit when your team prefers SQL-based workflows over DataFrame APIs.
Limitations
Spark SQL read/write for Hudi requires one of the following EMR cluster versions:
- EMR V3.36.0 or a later minor version
- EMR V5.2.0 or a later minor version
Start Spark SQL
The required startup flags depend on your Spark and Hudi versions. Use the table below to identify your configuration before running the startup command.
| Spark version | Hudi version | Required --conf flags |
|---|---|---|
| Spark 2 or Spark 3 | Earlier than 0.11 | spark.serializer, spark.sql.extensions |
| Spark 3 | 0.11 or later | spark.serializer, spark.sql.extensions, spark.sql.catalog.spark_catalog |
Spark 2 or Spark 3, Hudi earlier than 0.11:
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
Spark 3, Hudi 0.11 or later:
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
Examples
The following examples walk through the full lifecycle of a Hudi table: create, inspect, write, and query.
Create a table
CREATE TABLE h0 (
id BIGINT,
name STRING,
price DOUBLE,
ts LONG
) USING hudi
TBLPROPERTIES (
primaryKey="id",
preCombineField="ts"
);
- primaryKey uniquely identifies each record within the table. Hudi uses it for indexing and upsert lookups.
- preCombineField is the ordering field, typically an event timestamp, that Hudi uses to resolve conflicts when multiple versions of the same record exist.
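The effect of preCombineField can be seen by writing two versions of the same key in a single batch; a hedged sketch against the h0 table above (the row values are illustrative):

```sql
-- Two versions of id=3 in one INSERT. With primaryKey set, Hudi
-- de-duplicates by key and keeps the row with the larger
-- preCombineField value (ts), so only 'a3_v2' should survive.
INSERT INTO h0 VALUES (3, 'a3_v1', 20, 1000), (3, 'a3_v2', 21, 2000);

SELECT id, name, price, ts FROM h0 WHERE id = 3;
-- expected: a single row for id=3, the version with the larger ts
```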
Inspect the table schema
DESC FORMATTED h0;
Expected output:
_hoodie_commit_time string
_hoodie_commit_seqno string
_hoodie_record_key string
_hoodie_partition_path string
_hoodie_file_name string
id bigint
name string
price double
ts bigint
Hudi automatically adds metadata fields prefixed with _hoodie_ to every table. These fields track commit history, record identity, partition location, and file information; they support Hudi's indexing and incremental query features.
Insert data
INSERT INTO h0 VALUES (1, 'a1', 10, 1000), (2, 'a2', 11, 1000);
Update data
UPDATE h0 SET name = 'a1_new' WHERE id = 1;
Delete data
DELETE FROM h0 WHERE id = 1;
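Beyond single-statement UPDATE and DELETE, Hudi's Spark SQL DML also supports MERGE INTO (available since Hudi 0.9.0), which combines the update and insert paths into one statement. A minimal sketch against the h0 table above, with a hypothetical inline source row:

```sql
-- Upsert a single source row into h0. The ON condition must match
-- the table's primaryKey (id) for Hudi to route the merge correctly.
MERGE INTO h0 AS t
USING (
  SELECT 2 AS id, 'a2_new' AS name, 12.0 AS price, CAST(2000 AS BIGINT) AS ts
) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```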
Query data
After the insert, update, and delete operations above, query the table to confirm the results.
Query specific columns:
SELECT id, name, price, ts FROM h0;
Expected output:
2 a2 11.0 1000
Query all columns (including metadata fields):
SELECT * FROM h0;
Expected output:
20221130150621338 20221130150621338_0_1 id:2 40d6507e-0579-42ce-a10f-c5e07a3981e5-0_0-29-2007_2022113015062****.parquet 2 a2 11.0 1000
Because the table is not partitioned, _hoodie_partition_path is blank. The first four values in the output are Hudi metadata fields.
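The metadata fields can also be used in query predicates; for example, a sketch that filters rows by commit time (the literal below is a placeholder, not a value taken from the output above):

```sql
-- _hoodie_commit_time is stored as a string timestamp, so it can be
-- compared lexicographically to select rows written after a commit.
SELECT id, name, price, ts
FROM h0
WHERE _hoodie_commit_time > '20221130000000000';
```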