
E-MapReduce: Use Hive to access data of Delta Lake and Hudi

Last Updated: May 18, 2023

You cannot use Hive to write data to Delta Lake or Hudi, but you can query Delta Lake and Hudi data by using Hive external tables. This topic describes how to use Hive in your E-MapReduce (EMR) cluster to access data of Delta Lake and Hudi.

Prerequisites

  • A cluster that contains the Hive, Delta Lake, and Hudi services is created. For more information, see Create a cluster.

  • You have logged on to the cluster. For more information, see Log on to a cluster.

Limits

Only E-MapReduce (EMR) V3.36.0 and later minor versions or EMR V5.2.0 and later minor versions allow you to use Hive to read Hudi data.
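
The EMR version of your cluster is shown in the EMR console. To confirm the component versions installed on the cluster, the following commands are a minimal sketch, assuming the standard EMR environment in which the Hive and Spark clients are on the PATH:

# Print the installed Hive version
hive --version

# Print the installed Spark version
spark-submit --version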

Use Hive to access Delta Lake data

  1. Run the following command to open the Spark SQL CLI (EMR preconfigures it for Delta Lake; a configuration sketch for self-managed Spark follows this procedure):

    spark-sql
  2. Create a table in Spark and query data in the table.

    1. Execute the following statement to create a Delta Lake table named delta_table in Spark:

      create table delta_table (id int) using delta;
    2. Execute the following statement to insert data into the table:

      insert into delta_table values 0,1,2,3,4;
    3. Execute the following statement to query data in the table:

      select * from delta_table;

      The following output is returned:

      2
      3
      4
      0
      1
      Time taken: 1.847 seconds, Fetched 5 row(s)
  3. Query Delta Lake data in Hive.

    1. Run the following command to open the Hive CLI:

      hive
    2. Run the following command to view detailed information about the Delta Lake table in Hive:

      desc formatted delta_table;
    3. Run the following command to query data of the Delta Lake table in Hive:

      select * from delta_table;

      The following output is returned:

      OK
      2
      3
      4
      0
      1
      Time taken: 1.897 seconds, Fetched: 5 row(s)
      Note

      The preceding output shows that the data queried in Hive is the same as the data in the Delta Lake table in Spark. This confirms that you can use Hive to access Delta Lake data.
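
In an EMR cluster, the spark-sql CLI used in step 1 is already configured for Delta Lake, which is why no extra options are required. If you want to reproduce this walkthrough on self-managed Spark, a minimal sketch of the startup command looks like the following. The two settings are the standard open source Delta Lake configurations; whether you must also supply the Delta Lake package depends on your installation.

# Start Spark SQL with the open source Delta Lake settings (not needed on EMR)
spark-sql \
--conf 'spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog'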

Use Hive to access Hudi data

Important

Only EMR V3.36.0 and later minor versions or EMR V5.2.0 and later minor versions allow you to use Hive to read Hudi data.

Use Spark to read data from and write data to Hudi

  • Methods to start Spark SQL

    • If you use Spark 2 or Spark 3 with a Hudi version earlier than 0.11, run the following command:

      spark-sql \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
      --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
    • If you use Spark 3 and Hudi 0.11 or later, run the following command:

      spark-sql \
      --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
      --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
      --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
  • Example

    • Create a table.

      create table h0 (
        id bigint,
        name string,
        price double,
        ts long
      ) using hudi
      tblproperties (
        primaryKey="id",
        preCombineField="ts"
      );
    • View the details of the table.

      desc formatted h0;

      The following output is returned:

      _hoodie_commit_time       string
      _hoodie_commit_seqno      string
      _hoodie_record_key        string
      _hoodie_partition_path    string
      _hoodie_file_name         string
      id                        bigint
      name                      string
      price                     double
      ts                        bigint
      Note

      By default, the _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name fields are added by Hudi.

    • Perform operations on the table.

      -- insert
      insert into h0 values (1, 'a1', 10, 1000), (2, 'a2', 11, 1000);
      
      -- update
      update h0 set name = 'a1_new' where id = 1;
      
      -- delete
      delete from h0 where id = 1;
    • Query data in the table.

      • Example 1

        select id, name, price, ts from h0;

        The following output is returned:

        2    a2    11.0    1000
      • Example 2

        select * from h0;

        The following output is returned:

        4.820221130150621338    20221130150621338_0_1    id:2        40d6507e-0579-42ce-a10f-c5e07a3981e5-0_0-29-2007_2022113015062****.parquet    2    a2    11.0    1000
        Note

        Because the table is not partitioned, the _hoodie_partition_path field is left empty. The first four values in the preceding output are the values of the default fields that Hudi adds.

    For more examples of using Spark to read data from and write data to Hudi, see Spark Guide.
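
The h0 table above is not partitioned. As a point of comparison, the following is a minimal sketch of a partitioned Hudi table; the table name h1 and the dt partition column are illustrative names, and the DDL follows the Hudi Spark SQL syntax.

-- Create a partitioned Hudi table. h1 and dt are illustrative names.
create table h1 (
  id bigint,
  name string,
  price double,
  ts bigint,
  dt string
) using hudi
tblproperties (
  primaryKey="id",
  preCombineField="ts"
)
partitioned by (dt);

-- Insert one row. The last value populates the dt partition column.
insert into h1 values (1, 'a1', 10, 1000, '2023-05-18');

For such a table, the _hoodie_partition_path field in the query output should contain a value derived from dt, such as dt=2023-05-18, instead of being empty.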

Use Hive to query Hudi data

Note

You can use only Spark or Flink to read data from and write data to Hudi. You can use Hive only to query Hudi data.

Run the hive command to open the Hive CLI, or use Beeline to connect to Hive ThriftServer (a connection sketch follows the sample statement). Then, execute an SQL statement to query data. Sample statement:

-- select
select * from h0;
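
For reference, the following is a minimal Beeline sketch, assuming that Hive ThriftServer (HiveServer2) listens on its default port 10000; the hostname placeholder must be replaced with the address of your cluster's master node.

# Connect to HiveServer2 with Beeline; the hostname is illustrative
beeline -u "jdbc:hive2://<master-node-hostname>:10000"

From the Beeline prompt, the same select * from h0; statement queries the Hudi data.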