You can use Spark SQL to read data from or write data to Hudi 0.8.0 in E-MapReduce (EMR), which reduces the effort required to use Hudi. This topic describes how to use Spark SQL to read data from or write data to Hudi.
Prerequisites
An EMR cluster is created. For more information, see Create a cluster.
Limits
You can use Spark SQL to read data from or write data to Hudi only in EMR V3.36.0 and later minor versions, or in EMR V5.2.0 and later minor versions.
Open the Spark SQL CLI
- Log on to your EMR cluster in SSH mode. For more information, see Log on to a cluster.
- Run the following command to open the Spark SQL CLI:
spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
If the output contains the following information, the Spark SQL CLI is started:
spark-sql>
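For scripted use, the same two settings can be passed to a non-interactive invocation. The following is a minimal sketch, not a documented EMR procedure. It assumes that the `spark-sql` client on the cluster accepts the `-e` option for an inline statement, as the standard Spark distribution does. The names `run_hudi_sql`, `HUDI_SERIALIZER`, `HUDI_EXTENSIONS`, and `DRY_RUN` are hypothetical and introduced only for illustration.

```shell
# Hudi-related settings, copied from the interactive command above.
HUDI_SERIALIZER='spark.serializer=org.apache.spark.serializer.KryoSerializer'
HUDI_EXTENSIONS='spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

# run_hudi_sql (hypothetical helper): run one SQL statement and exit.
# When DRY_RUN=1, the command is only printed, which is useful for checking
# the arguments on a machine that has no Spark client installed.
run_hudi_sql() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'spark-sql --conf %s --conf %s -e %s\n' \
      "$HUDI_SERIALIZER" "$HUDI_EXTENSIONS" "$1"
  else
    spark-sql --conf "$HUDI_SERIALIZER" --conf "$HUDI_EXTENSIONS" -e "$1"
  fi
}
```

For example, after the function is defined in the current shell, `run_hudi_sql 'show tables;'` would run a single statement without opening the interactive prompt.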
Execute SQL statements
- Create a table
create table h0 (
  id bigint,
  name string,
  price double
) using hudi
options (
  primaryKey = 'id',
  preCombineField = 'id'
);
If the output contains information similar to the following example, the table is created:
Time taken: 1.258 seconds
- Insert data into a table
insert into h0 select 1, 'a1', 10;
If the output contains information similar to the following example, data is inserted into the table:
Time taken: 7.294 seconds
- Query data
select id,name,price from h0;
If the output contains information similar to the following example, the query operation succeeds:
1 a1 10.0
Time taken: 1.219 seconds, Fetched 1 row(s)
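The three statements in this topic can also be collected in a file and run in one pass. The following is a minimal sketch under stated assumptions: it assumes that the `spark-sql` client accepts the `-f` option for a SQL file, as the standard Spark distribution does. The helper name `write_demo_sql` and the path `/tmp/hudi_demo.sql` are hypothetical.

```shell
# write_demo_sql (hypothetical helper): write the create, insert, and query
# statements from this topic to the file given as the first argument.
write_demo_sql() {
  cat > "$1" <<'EOF'
create table h0 (
  id bigint,
  name string,
  price double
) using hudi
options (
  primaryKey = 'id',
  preCombineField = 'id'
);
insert into h0 select 1, 'a1', 10;
select id, name, price from h0;
EOF
}

# Example run on the cluster (left as comments because it needs Spark and Hudi):
# write_demo_sql /tmp/hudi_demo.sql
# spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
#   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
#   -f /tmp/hudi_demo.sql
```

Because the file starts with a plain `create table` statement, a second run is expected to fail if the table h0 already exists.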