After you use Spark in E-MapReduce (EMR) to create a Delta table, the table is synchronized to the Hive metastore. You can then query data in the Delta table in the same way as you query data in other tables. You can also create an external table in Hive to query data in an existing Delta table. This topic describes how to use Spark to create a Delta table and how to use Hive, Presto, or Impala to query data in the table. It also describes how to create an external table in Hive that maps to an existing Delta table.

Use Spark to create a Delta table and query data in the table

  1. Open the Spark CLI.
    1. Log on to the master node of your EMR cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to open the Spark CLI:
      spark-sql
  2. Create a table in Spark and query data in the table.
    1. Run the following command to create a Delta Lake table named delta_table in Spark:
      create table delta_table (id int) using delta location "/tmp/delta_table";
    2. Run the following command to insert data into the table:
      insert into delta_table values 0,1,2,3,4;
    3. Run the following command to query data in the table:
      select * from delta_table;
      Information similar to the following output is returned:
      2
      3
      4
      0
      1
      Time taken: 1.847 seconds, Fetched 5 row(s)
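    To confirm that the table was created in the Delta format, you can run the DESCRIBE DETAIL command, which Delta Lake supports in Spark SQL. This is an optional check; the exact output varies by cluster:

    ```sql
    -- Run in the Spark CLI (spark-sql). DESCRIBE DETAIL is a Delta Lake command
    -- that returns table metadata such as the format and the storage location.
    DESCRIBE DETAIL delta_table;
    -- The format column shows delta, and the location column shows the
    -- /tmp/delta_table path that was specified when the table was created.
    ```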
  3. Query data in the Delta table.
    • Use Hive to query data in the Delta table
      Notice In EMR V3.37.0 and later minor versions, you must add a custom parameter before you can use Hive to query a Delta table. You can add the parameter in the EMR console or by running commands. The parameter name varies based on the execution engine of Hive. If you use Hive on MapReduce, add the hive.input.format parameter on the Configure tab of the Hive service and set the parameter to io.delta.hive.HiveInputFormat. If you use Hive on Tez, add the hive.tez.input.format parameter on the same tab and set the parameter to io.delta.hive.HiveInputFormat. For more information about how to add a custom parameter, see Add parameters.
      1. Run the following command to open the Hive CLI:
        hive
      2. Run the following command to query data in the Delta table:
        select * from delta_table;
        The following information is returned:
        2
        3
        4
        0
        1
        Time taken: 2.937 seconds, Fetched: 5 row(s)
        Note The data that is returned is the same as the data that you inserted into the Delta table by using Spark. This indicates that Hive can access the Delta table.
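      If you prefer to set the custom parameter for a single session instead of in the EMR console, you can run a SET command in the Hive CLI before the query. The following sketch is based on the parameter names in the notice above; run only the command that matches your execution engine:

      ```sql
      -- For Hive on MapReduce:
      SET hive.input.format=io.delta.hive.HiveInputFormat;
      -- For Hive on Tez:
      SET hive.tez.input.format=io.delta.hive.HiveInputFormat;
      -- Then query the Delta table as usual:
      SELECT * FROM delta_table;
      ```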
    • Use Presto to query data in the Delta table
      1. Run the following command to open the Presto CLI:
        presto --server emr-header-1:9090 --catalog hive --schema default
      2. Run the following command to query data in the Delta table:
        select * from delta_table;
        The following information is returned:
         id
        ----
          2
          3
          4
          0
          1
        (5 rows)
        Note The data that is returned is the same as the data that you inserted into the Delta table by using Spark. This indicates that Presto can access the Delta table.
    • Use Impala to query data in the Delta table
      1. Run the following command to open the Impala CLI:
        impala-shell
      2. Run the following command to query data in the Delta table:
        select * from delta_table;
        The following information is returned:
        +----+
        | id |
        +----+
        | 2  |
        | 3  |
        | 4  |
        | 0  |
        | 1  |
        +----+                             
        Note The data that is returned is the same as the data that you inserted into the Delta table by using Spark. This indicates that Impala can access the Delta table.

Create an external table in Hive

EMR allows you to create an external table in Hive to query data in an existing Delta table. Example:
CREATE EXTERNAL TABLE delta_tbl (name string, age int, city string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES("path" = "/path/to/delta")
STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
LOCATION '/path/to/delta'
TBLPROPERTIES("spark.sql.sources.provider" = "delta");
If your cluster is of EMR V3.37.0 or a later minor version, you can also use a Hive storage handler to create the external table. Example:
CREATE EXTERNAL TABLE delta_tbl123 (name string, age int, city string)
STORED BY 'io.delta.hive.DeltaStorageHandler'
WITH SERDEPROPERTIES("path" = "/path/to/delta")
LOCATION '/path/to/delta'
TBLPROPERTIES("spark.sql.sources.provider" = "delta");
Note The path property in the SERDEPROPERTIES clause and the spark.sql.sources.provider property in the TBLPROPERTIES clause of the CREATE EXTERNAL TABLE statement ensure that you can also use Spark SQL to query data after the external table is created. After you create the external table, you can use Hive, Presto, or Impala to query data in the Delta table that is associated with the external table. The query operations are the same as the operations described in Step 3.
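As the note states, the external table is also queryable from Spark SQL once the path and spark.sql.sources.provider properties are set. A minimal sketch, assuming the delta_tbl table from the example above; the column values in the filter are illustrative:

```sql
-- Run in the Spark CLI (spark-sql). Spark resolves delta_tbl as a Delta
-- table and reads the data files at the path that is specified in
-- SERDEPROPERTIES and LOCATION.
SELECT name, age FROM delta_tbl WHERE age > 30;
```

Replace the column names and the filter condition with ones that match the schema of your table.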