EMR allows you to read data from a Delta table by using Hive. Two methods are available: DeltaInputFormat, which is available only in EMR, and Spark SQL. This topic describes how to use Hive to read data from a Delta table.

Use DeltaInputFormat to read data from a Delta table (for EMR only)

  1. Use the Hive client to create an external table in your Hive metastore. Specify a Delta directory for the external table.
    CREATE EXTERNAL TABLE delta_tbl(id bigint, `date` string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
    OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
    LOCATION '/tmp/delta_table';
    Note
    • If the Delta table from which you want to read data is a partitioned table, create a partitioned external table in Hive by using the PARTITIONED BY clause.
    • When a new partition is added to the Delta table, run the MSCK REPAIR TABLE command to synchronize the partition information to the external table in Hive.
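    For example, for a Delta table partitioned by a date column, the partitioned external table and the partition synchronization might look like the following. This is a sketch: the partition column name and the directory path are assumptions, not values from your environment.

    ```sql
    -- Partitioned external table over a partitioned Delta directory
    -- (the `dt` column and path are hypothetical).
    CREATE EXTERNAL TABLE delta_tbl_partitioned(id bigint)
    PARTITIONED BY (`dt` string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
    OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
    LOCATION '/tmp/delta_table_partitioned';

    -- Synchronize newly added Delta partitions into the Hive metastore.
    MSCK REPAIR TABLE delta_tbl_partitioned;
    ```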
  2. Start the Hive client to read data from the Delta table.
    -- This parameter must be set in versions earlier than EMR V3.25.0.
    SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    SELECT * FROM delta_tbl LIMIT 10;

Use Spark SQL to query the Delta table created by using Hive

Spark SQL cannot access the Delta table created in the preceding section, because the table metadata does not contain the information that Spark requires. To make the table accessible from Spark SQL, add the required properties to the CREATE TABLE statement in Hive.
CREATE EXTERNAL TABLE delta_tbl(id bigint, `date` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES("path" = "/tmp/delta_table")
STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
LOCATION '/tmp/delta_table'
TBLPROPERTIES("spark.sql.sources.provider" = "delta");
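With the additional SERDEPROPERTIES and TBLPROPERTIES in place, the same table can be queried from the Spark SQL client as well. A minimal sketch, assuming the table created above:

```sql
-- Run in the spark-sql CLI; reads the same Delta directory that Hive reads.
SELECT * FROM delta_tbl LIMIT 10;
```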
Note Hive cannot read Delta tables that are created by using the USING delta syntax of Spark SQL.
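For reference, a table created with the USING syntax in Spark SQL looks like the following; the table name and path here are hypothetical. A table defined this way is readable from Spark SQL but not from Hive through DeltaInputFormat:

```sql
-- Created from Spark SQL; not readable from Hive.
CREATE TABLE delta_spark_tbl(id bigint, `date` string)
USING delta
LOCATION '/tmp/delta_spark_table';
```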