
E-MapReduce:Query data in a Delta table

Last Updated: Dec 28, 2023

After you use Spark in E-MapReduce (EMR) to create a Delta table, the table metadata is synchronized to Hive Metastore, and you can query the Delta table in the same manner as other tables. You can also create an external table in Hive that points to an existing Delta table. This topic describes how to use Spark to create a Delta table, how to use Hive, Presto, or Trino to query data in the table, and how to create an external table in Hive for an existing Delta table.

Limits

This topic applies to EMR V3.37.0 and later minor versions, EMR V5.3.0 and later minor versions, and EMR V4.10.0.

Use Spark to create a Delta table and query data in the table

  1. Open the Spark CLI.
    1. Log on to the master node of your EMR cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to open the Spark CLI:
      spark-sql
  2. Create a table in Spark and query data in the table.
    1. Run the following command to create a Delta Lake table named delta_table in Spark:
      create table delta_table (id int) using delta location "/tmp/delta_table";
    2. Run the following command to insert data into the table:
      insert into delta_table values 0,1,2,3,4;
    3. Run the following command to query data in the table:
      select * from delta_table;
      The following output is returned:
      2
      3
      4
      0
      1
      Time taken: 1.847 seconds, Fetched 5 row(s)
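      If you want to confirm that the table uses the Delta format, you can also run the following Spark SQL command and check that the Provider field in the output is delta (the exact output fields vary by version):
      desc formatted delta_table;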
  3. Query data in the Delta table.

    Use Hive to query data in the Delta table

    Important
    • In EMR V3.37.0 and later minor versions, if you want to use Hive to query data in a Delta table, you must add a custom parameter in the EMR console or by running commands. The parameter that you must add in the EMR console varies based on the execution engine of Hive:
      • Hive on MapReduce: On the hive-site.xml subtab of the Configure tab for the Hive service, add the hive.input.format parameter and set it to io.delta.hive.HiveInputFormat.
      • Hive on Tez: Add the hive.tez.input.format parameter and set it to io.delta.hive.HiveInputFormat.
      For more information, see Add parameters.

    • If you want to execute the ADD COLUMN statement to add a column to a Delta table, we recommend that you add the new column before the partition columns. This prevents data exceptions when you use Hive to query data in the Delta table. For more information, see ALTER TABLE. An example is provided after this note.
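
    For example, the following statement is a sketch that assumes a hypothetical Delta table named delta_table_partitioned that contains an id column and is partitioned by a ds column. The AFTER clause positions the new city column before the partition column:

      alter table delta_table_partitioned add columns (city string after id);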

    1. Run the following command to open the Hive CLI:

      hive
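
      If you did not add the parameter in the EMR console, you can instead set it for the current session in the Hive CLI. For example, for Hive on MapReduce (for Hive on Tez, set hive.tez.input.format instead):

      set hive.input.format=io.delta.hive.HiveInputFormat;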
    2. Run the following command to query data in the Delta table:

      select * from delta_table;

      The following output is returned:

      2
      3
      4
      0
      1
      Time taken: 2.937 seconds, Fetched: 5 row(s)
      Note

      The returned data is the same as the data that you inserted into the Delta table by using Spark. This indicates that Hive successfully accessed the Delta table.

    Use Presto to query data in the Delta table

    1. Run the following command to open the Presto CLI:

      presto --server master-1-1:8889 --catalog delta --schema default --user hadoop

      Parameters in the preceding command:

      • --server master-1-1:8889: the hostname (or IP address) and port number of the Presto service.

      • --catalog delta: the name of the data source to which you want to connect.

      • --schema default: the name of the database that you want to use. In this example, a default database is used. You can modify this parameter based on your business requirements.

      • --user hadoop: the username for authentication.
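
      If you only want to run a single query instead of entering the interactive CLI, the Presto CLI also provides the --execute option. A minimal sketch based on the preceding connection parameters:

      presto --server master-1-1:8889 --catalog delta --schema default --user hadoop --execute 'select * from delta_table'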

    2. Run the following command to query data in the Delta table:

      select * from delta_table;

      The following output is returned:

       id
      ----
        2
        3
        4
        0
        1
      (5 rows)
      Note

      The returned data is the same as the data that you inserted into the Delta table by using Spark. This indicates that Presto successfully accessed the Delta table.

    Use Trino to query data in the Delta table

    1. Run the following command to open the Trino CLI:

      trino --server master-1-1:9090 --catalog delta --schema default --user hadoop

      Parameters in the preceding command:

      • --server master-1-1:9090: the hostname (or IP address) and port number of the Trino service.

      • --catalog delta: the name of the data source to which you want to connect.

      • --schema default: the name of the database that you want to use. In this example, a default database is used. You can modify this parameter based on your business requirements.

      • --user hadoop: the username for authentication.
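
      Similar to Presto, the Trino CLI supports non-interactive execution by using the --execute option. You can also change the output format, for example to CSV:

      trino --server master-1-1:9090 --catalog delta --schema default --user hadoop --execute 'select * from delta_table' --output-format CSV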

    2. Run the following command to query data in the Delta table:

      select * from delta_table;

      The following output is returned:

       id
      ----
        2
        3
        4
        0
        1
      (5 rows)
      Note

      The returned data is the same as the data that you inserted into the Delta table by using Spark. This indicates that Trino successfully accessed the Delta table.

Create an external table in Hive

EMR allows you to create an external table in Hive to query data in an existing Delta table. Sample code:

CREATE EXTERNAL TABLE delta_tbl (name string, age int, city string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES("path" = "/path/to/delta")
STORED AS INPUTFORMAT 'io.delta.hive.DeltaInputFormat'
OUTPUTFORMAT 'io.delta.hive.DeltaOutputFormat'
LOCATION '/path/to/delta'
TBLPROPERTIES("spark.sql.sources.provider" = "delta");

If your cluster is of EMR V3.37.0 or a later minor version, you can also use Hive Storage Handler to create an external table. Sample code:

CREATE EXTERNAL TABLE delta_tbl123 (name string, age int, city string)
STORED BY 'io.delta.hive.DeltaStorageHandler'
WITH SERDEPROPERTIES("path" = "/path/to/delta")
LOCATION '/path/to/delta'
TBLPROPERTIES("spark.sql.sources.provider" = "delta");
Note

In the preceding statements, the path property in SERDEPROPERTIES and the spark.sql.sources.provider property in TBLPROPERTIES are configured to ensure that Spark SQL can also query the external table. After you create the external table, you can use Hive, Presto, or Trino to query it in the same manner as you query a Delta table that is created by using Spark. For more information, see Step 3.
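
For example, after you create the external table, you can open the Spark CLI and query the delta_tbl table from the first sample in the same way:

spark-sql
select * from delta_tbl;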