
E-MapReduce:Query data in a Delta table

Last Updated: Dec 28, 2023

After you use Spark in E-MapReduce (EMR) to create a Delta table, the table metadata is synchronized to Hive Metastore, and you can query the Delta table in the same manner as other tables. You can also create an external table in Hive that points to an existing Delta table. This topic describes how to use Spark to create a Delta table, how to use Hive, Presto, or Trino to query data in the table, and how to create an external table in Hive for an existing Delta table.

Limits

This topic applies to EMR V3.37.0 and later minor versions, EMR V5.3.0 and later minor versions, and EMR V4.10.0.

Use Spark to create a Delta table and query data in the table

  1. Open the Spark CLI.
    1. Log on to the master node of your EMR cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to open the Spark CLI:
      spark-sql
  2. Create a table in Spark and query data in the table.
    1. Run the following command to create a Delta Lake table named delta_table in Spark:
      create table delta_table (id int) using delta location "/tmp/delta_table";
    2. Run the following command to insert data into the table:
      insert into delta_table values 0,1,2,3,4;
    3. Run the following command to query data in the table:
      select * from delta_table;
      The following output is returned:
      2
      3
      4
      0
      1
      Time taken: 1.847 seconds, Fetched 5 row(s)
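      If you want to confirm that the table uses the Delta format, you can also run the following Spark SQL command and check that the Provider field in the output is delta (the exact output fields vary by version):
      desc formatted delta_table;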
  3. Query data in the Delta table.

    Use Hive to query data in the Delta table

    Important
    • In EMR V3.37.0 and later minor versions, if you want to use Hive to query data in a Delta table, you must add a custom parameter in the EMR console or by running commands. The parameter that you must add in the EMR console varies based on the execution engine of Hive:
      • Hive on MapReduce: On the hive-site.xml subtab of the Configure tab for the Hive service, add the hive.input.format parameter and set it to io.delta.hive.HiveInputFormat.
      • Hive on Tez: Add the hive.tez.input.format parameter and set it to io.delta.hive.HiveInputFormat.
      For more information, see Add parameters.

    • If you want to execute the ADD COLUMN statement to add a column to a Delta table, we recommend that you add the new column before the partition columns. This prevents data exceptions when you use Hive to query data in the Delta table. For more information, see ALTER TABLE. An example is provided after this note.
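
    For example, the following statement is a sketch that assumes a hypothetical Delta table named delta_table_partitioned that contains an id column and is partitioned by a ds column. The AFTER clause positions the new city column before the partition column:

      alter table delta_table_partitioned add columns (city string after id);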

    1. Run the following command to open the Hive CLI:

      hive
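
      If you did not add the parameter in the EMR console, you can instead set it for the current session in the Hive CLI. For example, for Hive on MapReduce (for Hive on Tez, set hive.tez.input.format instead):

      set hive.input.format=io.delta.hive.HiveInputFormat;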
    2. Run the following command to query data in the Delta table:

      select * from delta_table;

      The following output is returned:

      2
      3
      4
      0
      1
      Time taken: 2.937 seconds, Fetched: 5 row(s)
      Note

      The returned data is the same as the data that you inserted into the Delta table by using Spark. This indicates that Hive successfully accessed the Delta table.

    Use Presto to query data in the Delta table

    1. Run the following command to open the Presto CLI:

      presto --server master-1-1:8889 --catalog delta --schema default --user hadoop

      Parameters in the preceding command:

      • --server master-1-1:8889: the hostname (or IP address) and port number of the Presto service.

      • --catalog delta: the name of the data source to which you want to connect.

      • --schema default: the name of the database that you want to use. In this example, a default database is used. You can modify this parameter based on your business requirements.

      • --user hadoop: the username for authentication.
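
      If you only want to run a single query instead of entering the interactive CLI, the Presto CLI also provides the --execute option. A minimal sketch based on the preceding connection parameters:

      presto --server master-1-1:8889 --catalog delta --schema default --user hadoop --execute 'select * from delta_table'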

    2. Run the following command to query data in the Delta table:

      select * from delta_table;

      The following output is returned:

       id
      ----
        2
        3
        4
        0
        1
      (5 rows)
      Note

      The returned data is the same as the data that you inserted into the Delta table by using Spark. This indicates that Presto successfully accessed the Delta table.

    Use Trino to query data in the Delta table

    1. Run the following command to open the Trino CLI:

      trino --server master-1-1:9090 --catalog delta --schema default --user hadoop

      Parameters in the preceding command:

      • --server master-1-1:9090: the hostname (or IP address) and port number of the Trino service.

      • --catalog delta: the name of the data source to which you want to connect.

      • --schema default: the name of the database that you want to use. In this example, a default database is used. You can modify this parameter based on your business requirements.

      • --user hadoop: the username for authentication.
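
      Similar to Presto, the Trino CLI supports non-interactive execution by using the --execute option. You can also change the output format, for example to CSV:

      trino --server master-1-1:9090 --catalog delta --schema default --user hadoop --execute 'select * from delta_table' --output-format CSV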

    2. Run the following command to query data in the Delta table:

      select * from delta_table;

      The following output is returned:

       id
      ----
        2
        3
        4
        0
        1
      (5 rows)
      Note

      The returned data is the same as the data that you inserted into the Delta table by using Spark. This indicates that Trino successfully accessed the Delta table.

Create an external table in Hive

EMR allows you to create an external table in Hive to query data in an existing Delta table. Sample code:

CREATE EXTERNAL TABLE delta_tbl (name string, age int, city string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES("path" = "/path/to/delta")
STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
LOCATION '/path/to/delta'
TBLPROPERTIES("spark.sql.sources.provider" = "delta");

If your cluster is of EMR V3.37.0 or a later minor version, you can also use Hive Storage Handler to create an external table. Sample code:

CREATE EXTERNAL TABLE delta_tbl123 (name string, age int, city string)
STORED BY 'io.delta.hive.DeltaStorageHandler'
WITH SERDEPROPERTIES("path" = "/path/to/delta")
LOCATION '/path/to/delta'
TBLPROPERTIES("spark.sql.sources.provider" = "delta");
Note

In the preceding statements, the path property in SERDEPROPERTIES and the spark.sql.sources.provider property in TBLPROPERTIES are configured to ensure that Spark SQL can also query the external table. After you create the external table, you can use Hive, Presto, or Trino to query it in the same manner as you query a Delta table that is created by using Spark. For more information, see Step 3.
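
For example, after you create the external table, you can open the Spark CLI and query the delta_tbl table from the first sample in the same way:

spark-sql
select * from delta_tbl;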