Hive cannot write data to Delta Lake or Apache Hudi, but it can query data from both formats via external tables. This topic describes how to use Hive in an E-MapReduce (EMR) cluster to query Delta Lake and Hudi data.
Prerequisites
Before you begin, ensure that you have:
- A cluster with Hive, Delta Lake, and Hudi services installed. For more information, see Create a cluster.
- Logged in to the cluster. For more information, see Log on to a cluster.
Limitations
To use Hive to read Hudi data, your cluster must meet the following version requirements.
| EMR version | Hive access to Hudi |
|---|---|
| V3.36.0 and later minor versions | Supported |
| V5.2.0 and later minor versions | Supported |
| Earlier versions | Not supported. Use Spark or Flink to query Hudi data instead. |
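The EMR version is shown in the EMR console. As a quick sanity check from a shell on the cluster, you can also print the Hive build that is installed; `--version` is a standard Hive CLI flag:

```shell
# Print the installed Hive version (standard Hive CLI flag).
hive --version
```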
Query Delta Lake data using Hive
Hive can read Delta Lake tables created in Spark. The following steps walk you through creating a Delta Lake table in Spark, inserting data, and querying that data from Hive.
Step 1: Create a Delta Lake table in Spark
- Open the Spark CLI.

  ```shell
  spark-sql
  ```

- Create a Delta Lake table named `delta_table`.

  ```sql
  create table delta_table (id int) using delta;
  ```

- Insert data into the table.

  ```sql
  insert into delta_table values 0,1,2,3,4;
  ```

- Verify the data by querying the table in Spark.

  ```sql
  select * from delta_table;
  ```

  The output is similar to:

  ```
  2
  3
  4
  0
  1
  Time taken: 1.847 seconds, Fetched 5 row(s)
  ```
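If you prefer to script this step rather than type statements into the interactive shell, the same SQL can be passed on the command line; the sketch below uses the standard `spark-sql -e` flag with semicolon-separated statements:

```shell
# Non-interactive variant of the steps above (sketch; -e is a standard spark-sql flag).
spark-sql -e "
  create table if not exists delta_table (id int) using delta;
  insert into delta_table values 0,1,2,3,4;
  select * from delta_table;
"
```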
Step 2: Query Delta Lake data in Hive
- Open the Hive CLI.

  ```shell
  hive
  ```

- (Optional) Inspect the table metadata.

  ```sql
  desc formatted delta_table;
  ```

- Query the Delta Lake table.

  ```sql
  select * from delta_table;
  ```

  The output is similar to:

  ```
  OK
  2
  3
  4
  0
  1
  Time taken: 1.897 seconds, Fetched: 5 row(s)
  ```

  The data returned by Hive matches the data in the Delta Lake table in Spark, confirming that Hive can successfully read Delta Lake data.
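If your cluster runs HiveServer2, the same check works through Beeline. A minimal sketch, assuming HiveServer2 listens on the default port 10000 on the local node (adjust the host, port, and credentials for your cluster):

```shell
# Query the Delta Lake table through Beeline (host and port are assumptions).
beeline -u "jdbc:hive2://localhost:10000" -e "select * from delta_table;"
```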
Query Hudi data using Hive
Only Spark or Flink can write data to Hudi. Hive supports read-only access.
Step 1: Prepare Hudi data using Spark
Before querying from Hive, create and populate a Hudi table using Spark SQL.
Start Spark SQL
The startup command depends on the Spark and Hudi versions in your cluster.
- For Spark 2 or Spark 3 with a Hudi version earlier than 0.11:

  ```shell
  spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  ```

- For Spark 3 with Hudi 0.11 or later:

  ```shell
  spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
  ```
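To avoid passing these flags on every launch, the same properties can live in the Spark configuration instead. A sketch of the equivalent `spark-defaults.conf` entries for the Spark 3 / Hudi 0.11+ case (in EMR, service configuration is typically edited through the console rather than the file directly):

```properties
# Equivalent spark-defaults.conf entries (Spark 3 with Hudi 0.11 or later).
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.sql.extensions              org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog   org.apache.spark.sql.hudi.catalog.HoodieCatalog
```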
Create and populate a Hudi table
- Create a Hudi table named `h0`.

  ```sql
  create table h0 (
    id bigint,
    name string,
    price double,
    ts long
  ) using hudi
  tblproperties (
    primaryKey = "id",
    preCombineField = "ts"
  );
  ```

- Inspect the table schema.

  ```sql
  desc formatted h0;
  ```

  The output is similar to:

  ```
  _hoodie_commit_time     string
  _hoodie_commit_seqno    string
  _hoodie_record_key      string
  _hoodie_partition_path  string
  _hoodie_file_name       string
  id                      bigint
  name                    string
  price                   double
  ts                      bigint
  ```

  Hudi automatically adds five metadata fields to every table: `_hoodie_commit_time`, `_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, and `_hoodie_file_name`. These fields appear in all query results that use `select *`.

- Insert, update, and delete data.

  ```sql
  -- insert
  insert into h0 values (1, 'a1', 10, 1000), (2, 'a2', 11, 1000);

  -- update
  update h0 set name = 'a1_new' where id = 1;

  -- delete
  delete from h0 where id = 1;
  ```

- Verify the data. After the update and delete operations, only the row with `id = 2` remains.

  ```sql
  select id, name, price, ts from h0;
  ```

  The output is similar to:

  ```
  2       a2      11.0    1000
  ```

  To include the Hudi metadata fields in the result, run:

  ```sql
  select * from h0;
  ```

  The output is similar to:

  ```
  4.820221130150621338    20221130150621338_0_1   id:2    40d6507e-0579-42ce-a10f-c5e07a3981e5-0_0-29-2007_2022113015062****.parquet      2       a2      11.0    1000
  ```

  Note: `h0` is a non-partitioned table, so `_hoodie_partition_path` is empty. The first five values in the output correspond to the five Hudi metadata fields.
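In addition to the separate insert, update, and delete statements above, Hudi's Spark SQL integration supports upserts through `merge into`. A sketch following the Apache Hudi Spark Guide (verify the exact syntax against your Hudi version):

```sql
-- Upsert: update the row if the key matches, insert it otherwise.
merge into h0 as target
using (
  select 2 as id, 'a2_new' as name, 12.0 as price, 1001 as ts
) as source
on target.id = source.id
when matched then update set *
when not matched then insert *;
```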
For more Spark examples for Hudi, see the Spark Guide.
Step 2: Query Hudi data in Hive
Open the Hive CLI, or connect to Hive ThriftServer through Beeline, then run a query against the Hudi table.

```sql
-- select
select * from h0;
```
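A table created as above defaults to Hudi's copy-on-write type, which Hive can usually read without extra session settings. For merge-on-read tables, the Apache Hudi documentation additionally recommends switching `hive.input.format` away from Hive's default combine format so that the table's Hudi input format is honored. A sketch (verify the setting against your Hudi version):

```sql
-- Needed for merge-on-read tables, per the Apache Hudi docs (verify for your version):
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

select * from h0;
```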
What's next
- To explore more advanced Hudi operations in Spark, see the Apache Hudi Spark Guide.