E-MapReduce: Use Hudi

Last Updated: Feb 27, 2026

Apache Hudi enables record-level inserts, updates, and deletes on data lake storage, along with change data capture for incremental processing. This topic describes how to create a Hudi-enabled Spark SQL session in EMR Serverless Spark and perform read and write operations on a Hudi table. For more information, see Apache Hudi.

Prerequisites

Before you begin, ensure that you have:

Step 1: Create an SQL session

  1. Log on to the EMR console.

  2. In the left-side navigation pane, choose EMR Serverless > Spark.

  3. On the Spark page, click the name of the workspace that you want to manage.

  4. In the left-side navigation pane of the EMR Serverless Spark page, choose O&M Center > Sessions.

  5. On the SQL Session tab, click Create SQL Session.

  6. On the Create SQL Session page, expand the Custom Configuration section. In the Spark Configuration editor, add the following configuration, and then click Create. For more information, see Manage SQL sessions.

       spark.sql.extensions             org.apache.spark.sql.hudi.HoodieSparkSessionExtension
       spark.sql.catalog.spark_catalog  org.apache.spark.sql.hudi.catalog.HoodieCatalog
       spark.serializer                 org.apache.spark.serializer.KryoSerializer

     > Note: This configuration uses the default catalog of the workspace. To use an external Hive Metastore as a catalog instead, see Use EMR Serverless Spark to connect to an external Hive Metastore.

     | Parameter | Value | Description |
     | --- | --- | --- |
     | spark.sql.extensions | org.apache.spark.sql.hudi.HoodieSparkSessionExtension | Registers Hudi SQL extensions with Spark, enabling Hudi-specific syntax such as USING hudi and TBLPROPERTIES. |
     | spark.sql.catalog.spark_catalog | org.apache.spark.sql.hudi.catalog.HoodieCatalog | Replaces the default Spark catalog with the Hudi catalog so that Spark can manage Hudi tables through standard SQL. |
     | spark.serializer | org.apache.spark.serializer.KryoSerializer | Uses Kryo serialization instead of the default Java serializer. Kryo serialization is recommended for Hudi to achieve optimal performance. |
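
After a task is attached to this session (see Step 2), one way to check that the settings took effect is to read them back with the Spark SQL SET command. This is a minimal sketch and assumes that the session allows these keys to be queried.

```sql
-- Each statement returns the key together with its current value in the session.
SET spark.sql.extensions;
SET spark.sql.catalog.spark_catalog;
SET spark.serializer;
```

If a value does not match the configuration above, the session was probably created without the custom configuration and may need to be recreated.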

Step 2: Write data to and read data from a Hudi table

  1. In the left-side navigation pane of the EMR Serverless Spark page, click Development.

  2. On the Development tab, click the icon for creating a new task.

  3. In the New dialog box, set Name to users_task, keep the default Type value SparkSQL, and then click OK.

  4. On the users_task tab, paste the following SQL statements into the code editor. The comments describe what each statement does.

       -- Create a database for the example.
       CREATE DATABASE IF NOT EXISTS ss_hudi_db;

       -- Create a Copy-on-Write Hudi table that uses id as the primary key.
       CREATE TABLE ss_hudi_db.hudi_tbl (id INT, name STRING) USING hudi TBLPROPERTIES (
         type = 'cow',
         primaryKey = 'id'
       );

       -- Insert two records.
       INSERT INTO ss_hudi_db.hudi_tbl VALUES (1, 'a'), (2, 'b');

       -- Read the records back, ordered by primary key.
       SELECT id, name FROM ss_hudi_db.hudi_tbl ORDER BY id;

       -- Clean up the table and the database.
       DROP TABLE ss_hudi_db.hudi_tbl;

       DROP DATABASE ss_hudi_db;

     Hudi table types

     The type property in TBLPROPERTIES specifies the storage type. Hudi supports two table types:

     | Type | Behavior | Best for |
     | --- | --- | --- |
     | cow (Copy-on-Write) | Each write creates a new version of the data file. Reads are fast because data is always in columnar format. | Read-heavy workloads |
     | mor (Merge-on-Read) | Writes go to a delta log and are merged at read time. Writes are faster, but reads incur a merge overhead. | Write-heavy or streaming workloads |

     The example above uses cow. Choose the type that matches your read/write pattern.

     Primary key

     The primaryKey property specifies which column uniquely identifies each record. Hudi uses the primary key to handle upserts and deduplication. In this example, id serves as the primary key.

  5. Select a database from the Default Database drop-down list and the created SQL session from the SQL Session drop-down list.

  6. Click Run. Expected output: the SELECT statement returns two rows, (1, a) and (2, b).
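
The walkthrough above only inserts into and reads from a cow table. As a rough sketch of the record-level updates and deletes mentioned at the start of this topic, the statements below create a Merge-on-Read table and modify individual records by primary key. The table name hudi_mor_tbl, the ts column, and the preCombineField property are assumptions made for illustration and are not part of the original example. Some Hudi versions require a precombine field for UPDATE, so one is declared here. Whether every statement is supported depends on the Hudi version bundled with your engine version.

```sql
CREATE DATABASE IF NOT EXISTS ss_hudi_db;

-- Hypothetical Merge-on-Read table. ts is an assumed ordering column that Hudi
-- uses to pick the latest version when several records share the same key.
CREATE TABLE ss_hudi_db.hudi_mor_tbl (id INT, name STRING, ts BIGINT) USING hudi TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
);

INSERT INTO ss_hudi_db.hudi_mor_tbl VALUES (1, 'a', 1000), (2, 'b', 1000);

-- Record-level update: only the record with id = 1 changes.
UPDATE ss_hudi_db.hudi_mor_tbl SET name = 'a2', ts = 1001 WHERE id = 1;

-- Record-level delete: removes the record with id = 2.
DELETE FROM ss_hudi_db.hudi_mor_tbl WHERE id = 2;

SELECT id, name, ts FROM ss_hudi_db.hudi_mor_tbl ORDER BY id;

-- Drop only the example table; the database may contain other tables.
DROP TABLE ss_hudi_db.hudi_mor_tbl;
```

Run these statements in the same way as users_task, with the Hudi-enabled SQL session selected. To try the same operations on a Copy-on-Write table, change type to 'cow'.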
