You can use Spark SQL to read data from and write data to Hudi 0.8.0 tables in E-MapReduce (EMR). Working in SQL reduces the effort required to use Hudi. This topic describes how to open the Spark SQL CLI and then create, write to, and query a Hudi table.

Prerequisites

An EMR cluster is created. For more information, see Create a cluster.

Limits

You can use Spark SQL to read data from or write data to Hudi only in EMR V3.36.0 or a later minor version, or in EMR V5.2.0 or a later minor version.

Open the Spark SQL CLI

  1. Log on to your EMR cluster by using SSH. For more information, see Log on to a cluster.
  2. Run the following command to open the Spark SQL CLI:
    spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
    If the output contains the following information, the Spark SQL CLI is started:
    spark-sql>

Execute SQL statements

  • Create a table
    create table h0 (
      id bigint,
      name string,
      price double
    ) using hudi
    options (
      primaryKey = 'id',
      preCombineField = 'id'
    );
    If the output contains information similar to the following example, the table is created:
    Time taken: 1.258 seconds
  • Insert data into a table
    insert into h0 select 1, 'a1', 10;
    If the output contains information similar to the following example, data is inserted into the table:
    Time taken: 7.294 seconds
  • Query data
    select id,name,price from h0;
    If the output contains information similar to the following example, the query operation succeeds:
    1  a1  10.0
    Time taken: 1.219 seconds, Fetched 1 row(s)
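
The statements above can also be run in one pass instead of being typed at the prompt. The following is a minimal sketch, not an EMR-specific procedure: it assumes the -f option of the spark-sql command, which reads statements from a file, and it reuses the --conf options shown earlier in this topic. The file path /tmp/hudi_quickstart.sql and the SQL_FILE variable are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical location for the statement file; any writable path works.
SQL_FILE="${SQL_FILE:-/tmp/hudi_quickstart.sql}"

# Write the statements from this topic to the file.
cat > "$SQL_FILE" <<'EOF'
create table h0 (
  id bigint,
  name string,
  price double
) using hudi
options (
  primaryKey = 'id',
  preCombineField = 'id'
);
insert into h0 select 1, 'a1', 10;
select id, name, price from h0;
EOF

# Run the file with the same --conf options used for the interactive CLI.
# The run is skipped when spark-sql is not on the PATH (for example, off-cluster).
if command -v spark-sql >/dev/null 2>&1; then
  spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    -f "$SQL_FILE"
else
  echo "spark-sql not found; statements written to $SQL_FILE"
fi
```

If the table h0 already exists from the interactive steps, the create table statement in this file is expected to fail. In that case, drop the table first or use a different table name in the file.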