You cannot use Hive to write data to Delta Lake or Hudi, but you can query Delta Lake and Hudi data by using Hive external tables. This topic describes how to use Hive in an E-MapReduce (EMR) cluster to access Delta Lake and Hudi data.

Prerequisites

An EMR Hadoop cluster is created. For more information, see Create a cluster.

Limits

Only EMR V3.36.0 and later minor versions or EMR V5.2.0 and later minor versions allow you to use Hive to read Hudi data.

Use Hive to access Delta Lake data

  1. Open the Spark CLI.
    1. Log on to the master node of your EMR cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to open the Spark CLI:
      spark-sql
  2. Create a table in Spark and query data in the table.
    1. Run the following command to create a Delta Lake table named delta_table in Spark:
      create table delta_table (id int) using delta location "/tmp/delta_table";
    2. Run the following command to insert data into the table:
      insert into delta_table values 0,1,2,3,4;
    3. Run the following command to query data in the table:
      select * from delta_table;
      Information similar to the following is returned:
      2
      3
      4
      0
      1
      Time taken: 1.847 seconds, Fetched 5 row(s)
  3. Query Delta Lake data in Hive.
    1. Run the following command to open the Hive CLI:
      hive
    2. Run the following command to view detailed information about the Delta Lake table in Hive:
      desc formatted delta_table;
      The following information is returned:
      OK
      # col_name              data_type               comment
      
      id                      int
      
      # Detailed Table Information
      Database:               default
      Owner:                  root
      CreateTime:             Sat Jul 24 14:20:44 CST 2021
      LastAccessTime:         UNKNOWN
      Retention:              0
      Location:               hdfs://emr-header-1.cluster-238095:9000/tmp/delta_table
      Table Type:             EXTERNAL_TABLE
      Table Parameters:
              EXTERNAL                TRUE
              delta.database          default
              delta.syncMetadataToCatalog     true
              delta.table             delta_table
              delta.tableType         EXTERNAL_TABLE
              last_modified_time      1627107644
              numFiles                0
              spark.sql.sources.provider      delta
              spark.sql.sources.schema.numParts       1
              spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}
              totalSize               0
              transient_lastDdlTime   1627107644
      
      # Storage Information
      SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
      InputFormat:            io.delta.hive.DeltaInputFormat
      OutputFormat:           io.delta.hive.DeltaOutputFormat
      Compressed:             No
      Num Buckets:            0
      Bucket Columns:         []
      Sort Columns:           []
      Storage Desc Params:
              path                    hdfs://emr-header-1.cluster-238095:9000/tmp/delta_table
              serialization.format    1
      Time taken: 1.09 seconds, Fetched: 37 row(s)                                  
    3. Run the following command to view data of the Delta Lake table in Hive:
      select * from delta_table;
      The following information is returned:
      OK
      2
      3
      4
      0
      1
      Time taken: 1.897 seconds, Fetched: 5 row(s)
      Note The preceding information shows that the data queried in Hive is the same as the data in the Delta Lake table in Spark. This indicates that you have used Hive to access Delta Lake data.
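If a Delta Lake table is not synchronized to the Hive metastore, for example because it was created outside the cluster, you can also map the Delta Lake directory to a Hive external table yourself. The following statements are a minimal sketch that reuses the /tmp/delta_table directory from the preceding steps and assumes that the Delta Lake Hive connector, which provides the io.delta.hive.DeltaStorageHandler class, is available to Hive. The table name delta_table_hive is a placeholder.

  -- Minimal sketch: map a Hive external table to the Delta Lake directory.
  -- Assumes the Delta Lake Hive connector (io.delta.hive.DeltaStorageHandler) is on the Hive classpath.
  create external table delta_table_hive (id int)
  stored by 'io.delta.hive.DeltaStorageHandler'
  location '/tmp/delta_table';

  -- Query the mapped table in the same way as the synchronized table.
  select * from delta_table_hive;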

Use Hive to access Hudi data

Notice Only EMR V3.36.0 and later minor versions or EMR V5.2.0 and later minor versions allow you to use Hive to read Hudi data.
  1. Open the Spark CLI.
    1. Log on to the master node of your EMR cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to open the Spark CLI:
      spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
      --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  2. Create a Hudi table in Spark and query data in the table.
    1. Run the following command to create a Hudi table in Spark:
      create table h0 (
                     id bigint,
                     name string,
                     price double
                   ) using hudi;  
      You can run the following command to view the table schema:
      desc formatted h0;
      The returned information contains the following fields. These fields are the default Hudi system fields and are synchronized to the table metadata.
      _hoodie_commit_time string  NULL
      _hoodie_commit_seqno  string  NULL
      _hoodie_record_key  string  NULL
      _hoodie_partition_path  string  NULL
      _hoodie_file_name string  NULL
    2. Run the following command to insert data into the table:
      insert into h0 select 1, 'a1', 10;
    3. Run the following command to query data in the table:
      select * from h0;
      The following information is returned:
      OK
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      20210728180336  20210728180336_0_1      74641e17-75d6-4243-be71-a5aa98b5c1af            49062904-52da-402c-82c1-84c04d3c2a4c-0_0-6-6_20210728180336.parquet     1       a1      10.0
      Time taken: 2.001 seconds, Fetched: 1 row(s)
  3. Query Hudi data in Hive.
    1. Run the following command to open the Hive CLI:
      hive
    2. Run the following command to query data of the Hudi table in Hive:
      select * from h0;
      The following information is returned:
      OK
      SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
      SLF4J: Defaulting to no-operation (NOP) logger implementation
      SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
      20210728180336  20210728180336_0_1      74641e17-75d6-4243-be71-a5aa98b5c1af            49062904-52da-402c-82c1-84c04d3c2a4c-0_0-6-6_20210728180336.parquet     1       a1      10.0
      Time taken: 2.001 seconds, Fetched: 1 row(s)
      Note The preceding information shows that the data queried in Hive is the same as the data in the Hudi table in Spark. This indicates that you have used Hive to access Hudi data. If you do not want to view the default system fields of Hudi in Hive, you can create an external table in Hive that points to the data directory of the Hudi table and defines only the business columns, as shown in the sketch at the end of this topic.
      You can also run the following command to view the schema of the Hudi table in Hive:
      desc formatted h0;
      The following information is returned:
      OK
      # col_name              data_type             comment             
           
      _hoodie_commit_time   string                                    
      _hoodie_commit_seqno  string                                    
      _hoodie_record_key    string                                    
      _hoodie_partition_path  string                                    
      _hoodie_file_name     string                                    
      id                    bigint                                    
      name                  string                                    
      price                 double                                    
           
      # Detailed Table Information     
      Database:             default                
      Owner:                root                   
      CreateTime:           Thu Jun 24 16:50:48 CST 2021   
      LastAccessTime:       UNKNOWN                
      Retention:            0                      
      Location:             hdfs://emr-header-1.cluster-23****:9000/user/hive/warehouse/h0   
      Table Type:           MANAGED_TABLE          
      Table Parameters:    
        last_commit_time_sync 20210624165059      
        numFiles              1                   
        spark.sql.create.version  2.4.7               
        spark.sql.sources.provider  hudi                
        spark.sql.sources.schema.numParts 1                   
        spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}
        totalSize             434752              
        transient_lastDdlTime 1624524648          
           
      # Storage Information    
      SerDe Library:        org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
      InputFormat:          org.apache.hudi.hadoop.HoodieParquetInputFormat  
      OutputFormat:         org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
      Compressed:           No                     
      Num Buckets:          -1                     
      Bucket Columns:       []                     
      Sort Columns:         []                     
      Storage Desc Params:     
        path                  hdfs://emr-header-1.cluster-234173:9000/user/hive/warehouse/h0
        serialization.format  1                   
      Time taken: 0.204 seconds, Fetched: 40 row(s)
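The following statements are a minimal sketch of the external table mentioned in the note above. The external table exposes only the business columns of the Hudi table in Hive; the SerDe, input format, and output format classes are taken from the preceding desc formatted h0; output. The table name h0_ext is a placeholder, and you must replace the location with the actual data directory of your Hudi table.

  -- Minimal sketch: a Hive external table that hides the Hudi system fields.
  create external table h0_ext (
    id bigint,
    name string,
    price double
  )
  row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  stored as
    inputformat 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  -- Replace with the actual data directory of the Hudi table.
  location '/user/hive/warehouse/h0';

  -- Queries on h0_ext return only the id, name, and price columns.
  select * from h0_ext;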