E-MapReduce:Access Delta Lake and Hudi data from Hive

Last Updated: Mar 26, 2026

Hive cannot write data to Delta Lake or Apache Hudi, but it can query data from both formats via external tables. This topic describes how to use Hive in an E-MapReduce (EMR) cluster to query Delta Lake and Hudi data.

Prerequisites

Before you begin, ensure that you have:

  • A cluster with Hive, Delta Lake, and Hudi services installed. For more information, see Create a cluster.

  • Logged in to the cluster. For more information, see Log on to a cluster.

Limitations

To use Hive to read Hudi data, your cluster must meet the following version requirements.

EMR version                          Hive access to Hudi
V3.36.0 and later minor versions     Supported
V5.2.0 and later minor versions      Supported
Earlier versions                     Not supported. Use Spark or Flink to query Hudi data instead.

Query Delta Lake data using Hive

Hive can read Delta Lake tables created in Spark. The following steps walk you through creating a Delta Lake table in Spark, inserting data, and querying that data from Hive.

Step 1: Create a Delta Lake table in Spark

  1. Open the Spark CLI.

    spark-sql
  2. Create a Delta Lake table named delta_table.

    create table delta_table (id int) using delta;
  3. Insert data into the table.

    insert into delta_table values 0,1,2,3,4;
  4. Verify the data by querying the table in Spark.

    select * from delta_table;

    The output is similar to:

    2
    3
    4
    0
    1
    Time taken: 1.847 seconds, Fetched 5 row(s)
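
By default, using delta creates a managed table under the Spark warehouse directory. If you want the table files in a directory you control, Spark SQL also accepts a location clause. The following is a minimal sketch; the table name delta_table_ext and the OSS path are placeholders for illustration, so substitute a bucket and path from your environment.

create table delta_table_ext (id int)
using delta
location 'oss://<your-bucket>/path/to/delta_table_ext';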

Step 2: Query Delta Lake data in Hive

  1. Open the Hive CLI.

    hive
  2. (Optional) Inspect the table metadata.

    desc formatted delta_table;
  3. Query the Delta Lake table.

    select * from delta_table;

    The output is similar to:

    OK
    2
    3
    4
    0
    1
    Time taken: 1.897 seconds, Fetched: 5 row(s)

    The data returned by Hive matches the data in the Delta Lake table in Spark, confirming that Hive can successfully read Delta Lake data.
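
On the read path, the Delta Lake table behaves like an ordinary Hive table, so standard HiveQL predicates and aggregates work as well. A minimal sketch against the delta_table created above; with the sample rows 0 through 4, it returns 3.

select count(*) from delta_table where id > 1;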

Query Hudi data using Hive

Important

Only Spark or Flink can write data to Hudi. Hive supports read-only access.

Step 1: Prepare Hudi data using Spark

Before querying from Hive, create and populate a Hudi table using Spark SQL.

Start Spark SQL

The startup command depends on the Spark and Hudi versions in your cluster.

  • For Spark 2 or Spark 3 with Hudi earlier than 0.11:

    spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  • For Spark 3 with Hudi 0.11 or later:

    spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
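
If you start spark-sql frequently, you can keep these settings in a properties file instead of repeating the --conf flags. This is a sketch that assumes your spark-sql build passes through the standard spark-submit --properties-file option; the file name spark-hudi.conf is a placeholder.

# spark-hudi.conf: the same settings as the Spark 3 with Hudi 0.11+ command above
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.sql.extensions             org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.hudi.catalog.HoodieCatalog

spark-sql --properties-file spark-hudi.conf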

Create and populate a Hudi table

  1. Create a Hudi table named h0.

    create table h0 (
      id bigint,
      name string,
      price double,
      ts long
    ) using hudi
    tblproperties (
      primaryKey="id",
      preCombineField="ts"
    );
  2. Inspect the table schema.

    desc formatted h0;

    The output is similar to:

    _hoodie_commit_time       string
    _hoodie_commit_seqno      string
    _hoodie_record_key        string
    _hoodie_partition_path    string
    _hoodie_file_name         string
    id                        bigint
    name                      string
    price                     double
    ts                        bigint

    Hudi automatically adds five metadata fields to every table: _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name. These fields appear in all query results that use select *; a query that reads them by name follows these steps.

  3. Insert, update, and delete data.

    -- insert
    insert into h0 values (1, 'a1', 10, 1000), (2, 'a2', 11, 1000);
    
    -- update
    update h0 set name = 'a1_new' where id = 1;
    
    -- delete
    delete from h0 where id = 1;
  4. Verify the data. After the update and delete operations, only row id=2 remains.

    Note

    h0 is a non-partitioned table, so _hoodie_partition_path is empty. In the select * output further below, the first five values correspond to the five Hudi metadata fields.

    select id, name, price, ts from h0;

    The output is similar to:

    2    a2    11.0    1000

    To include the Hudi metadata fields in the result, run:

    select * from h0;

    The output is similar to:

    20221130150621338    20221130150621338_0_1    id:2        40d6507e-0579-42ce-a10f-c5e07a3981e5-0_0-29-2007_2022113015062****.parquet    2    a2    11.0    1000
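
The metadata fields can also be read by name, which is often clearer than select *. A minimal sketch against the h0 table created above:

select _hoodie_commit_time, _hoodie_record_key, id, name from h0;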

For more Spark examples for Hudi, see the Spark Guide.

Step 2: Query Hudi data in Hive

Open the Hive CLI, or connect to Hive ThriftServer (HiveServer2) with Beeline, and then run a query against the Hudi table. A Beeline connection example follows the query.

-- select
select * from h0;
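
If the read succeeds, only the row with id=2 is returned, matching the state of the table after the update and delete in step 1. To connect with Beeline instead of the Hive CLI, the following sketch assumes HiveServer2 is listening on its default port 10000 on the current node:

beeline -u jdbc:hive2://localhost:10000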
