Hive cannot write data to Delta Lake or Apache Hudi, but it can query data from both formats via external tables. This topic describes how to use Hive in an E-MapReduce (EMR) cluster to query Delta Lake and Hudi data.
Prerequisites
Before you begin, ensure that you have:
- A cluster with Hive, Delta Lake, and Hudi services installed. For more information, see Create a cluster.
- Logged in to the cluster. For more information, see Log on to a cluster.
Limitations
To use Hive to read Hudi data, your cluster must meet the following version requirements.
| EMR version | Hive access to Hudi |
|---|---|
| V3.36.0 and later minor versions | Supported |
| V5.2.0 and later minor versions | Supported |
| Earlier versions | Not supported. Use Spark or Flink to query Hudi data instead. |
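The EMR version is shown in the EMR console. As a quick sanity check from a shell on the cluster, you can also print the Hive build that is installed; `--version` is a standard Hive CLI flag:

```shell
# Print the installed Hive version (standard Hive CLI flag).
hive --version
```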
Query Delta Lake data using Hive
Hive can read Delta Lake tables created in Spark. The following steps walk you through creating a Delta Lake table in Spark, inserting data, and querying that data from Hive.
Step 1: Create a Delta Lake table in Spark
- Open the Spark CLI.

  ```shell
  spark-sql
  ```

- Create a Delta Lake table named `delta_table`.

  ```sql
  create table delta_table (id int) using delta;
  ```

- Insert data into the table.

  ```sql
  insert into delta_table values 0,1,2,3,4;
  ```

- Verify the data by querying the table in Spark.

  ```sql
  select * from delta_table;
  ```

  The output is similar to:

  ```
  2
  3
  4
  0
  1
  Time taken: 1.847 seconds, Fetched 5 row(s)
  ```
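If you prefer to script this step rather than type statements into the interactive shell, the same SQL can be passed on the command line; the sketch below uses the standard `spark-sql -e` flag with semicolon-separated statements:

```shell
# Non-interactive variant of the steps above (sketch; -e is a standard spark-sql flag).
spark-sql -e "
  create table if not exists delta_table (id int) using delta;
  insert into delta_table values 0,1,2,3,4;
  select * from delta_table;
"
```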
Step 2: Query Delta Lake data in Hive
- Open the Hive CLI.

  ```shell
  hive
  ```

- (Optional) Inspect the table metadata.

  ```sql
  desc formatted delta_table;
  ```

- Query the Delta Lake table.

  ```sql
  select * from delta_table;
  ```

  The output is similar to:

  ```
  OK
  2
  3
  4
  0
  1
  Time taken: 1.897 seconds, Fetched: 5 row(s)
  ```

  The data returned by Hive matches the data in the Delta Lake table in Spark, confirming that Hive can successfully read Delta Lake data.
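If your cluster runs HiveServer2, the same check works through Beeline. A minimal sketch, assuming HiveServer2 listens on the default port 10000 on the local node (adjust the host, port, and credentials for your cluster):

```shell
# Query the Delta Lake table through Beeline (host and port are assumptions).
beeline -u "jdbc:hive2://localhost:10000" -e "select * from delta_table;"
```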
Query Hudi data using Hive
Only Spark or Flink can write data to Hudi. Hive supports read-only access.
Step 1: Prepare Hudi data using Spark
Before querying from Hive, create and populate a Hudi table using Spark SQL.
Start Spark SQL
The startup command depends on the Spark and Hudi versions in your cluster.
- For Spark 2 or Spark 3 with a Hudi version earlier than 0.11:

  ```shell
  spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  ```

- For Spark 3 with Hudi 0.11 or later:

  ```shell
  spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
  ```
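To avoid passing these flags on every launch, the same properties can live in the Spark configuration instead. A sketch of the equivalent `spark-defaults.conf` entries for the Spark 3 / Hudi 0.11+ case (in EMR, service configuration is typically edited through the console rather than the file directly):

```properties
# Equivalent spark-defaults.conf entries (Spark 3 with Hudi 0.11 or later).
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.sql.extensions              org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog   org.apache.spark.sql.hudi.catalog.HoodieCatalog
```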
Create and populate a Hudi table
- Create a Hudi table named `h0`.

  ```sql
  create table h0 (
    id bigint,
    name string,
    price double,
    ts long
  ) using hudi
  tblproperties (
    primaryKey = "id",
    preCombineField = "ts"
  );
  ```

- Inspect the table schema.

  ```sql
  desc formatted h0;
  ```

  The output is similar to:

  ```
  _hoodie_commit_time     string
  _hoodie_commit_seqno    string
  _hoodie_record_key      string
  _hoodie_partition_path  string
  _hoodie_file_name       string
  id                      bigint
  name                    string
  price                   double
  ts                      bigint
  ```

  Hudi automatically adds five metadata fields to every table: `_hoodie_commit_time`, `_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, and `_hoodie_file_name`. These fields appear in all query results that use `select *`.

- Insert, update, and delete data.

  ```sql
  -- insert
  insert into h0 values (1, 'a1', 10, 1000), (2, 'a2', 11, 1000);

  -- update
  update h0 set name = 'a1_new' where id = 1;

  -- delete
  delete from h0 where id = 1;
  ```

- Verify the data. After the update and delete operations, only the row with `id = 2` remains.

  ```sql
  select id, name, price, ts from h0;
  ```

  The output is similar to:

  ```
  2       a2      11.0    1000
  ```

  To include the Hudi metadata fields in the result, run:

  ```sql
  select * from h0;
  ```

  The output is similar to:

  ```
  4.820221130150621338    20221130150621338_0_1   id:2    40d6507e-0579-42ce-a10f-c5e07a3981e5-0_0-29-2007_2022113015062****.parquet      2       a2      11.0    1000
  ```

  Note: `h0` is a non-partitioned table, so `_hoodie_partition_path` is empty. The first five values in the output correspond to the five Hudi metadata fields.
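In addition to the separate insert, update, and delete statements above, Hudi's Spark SQL integration supports upserts through `merge into`. A sketch following the Apache Hudi Spark Guide (verify the exact syntax against your Hudi version):

```sql
-- Upsert: update the row if the key matches, insert it otherwise.
merge into h0 as target
using (
  select 2 as id, 'a2_new' as name, 12.0 as price, 1001 as ts
) as source
on target.id = source.id
when matched then update set *
when not matched then insert *;
```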
For more Spark examples for Hudi, see the Spark Guide.
Step 2: Query Hudi data in Hive
Open the Hive CLI, or connect to Hive ThriftServer through Beeline, then run a query against the Hudi table.

```sql
-- select
select * from h0;
```
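A table created as above defaults to Hudi's copy-on-write type, which Hive can usually read without extra session settings. For merge-on-read tables, the Apache Hudi documentation additionally recommends switching `hive.input.format` away from Hive's default combine format so that the table's Hudi input format is honored. A sketch (verify the setting against your Hudi version):

```sql
-- Needed for merge-on-read tables, per the Apache Hudi docs (verify for your version):
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

select * from h0;
```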
What's next
- To explore more advanced Hudi operations in Spark, see the Apache Hudi Spark Guide.