You can use Spark SQL to read data from or write data to E-MapReduce (EMR) Hudi 0.8.0, which significantly lowers the effort required to use Hudi. This topic describes how to use Spark SQL to read data from or write data to Hudi.

Prerequisites

An EMR Hadoop cluster is created. For more information, see Create a cluster.

Limits

You can use Spark SQL to read data from or write data to Hudi only in EMR V3.36.0 and later minor versions, or in EMR V5.2.0 and later minor versions.

Open the Spark SQL CLI

  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
  2. Run the following command to open the Spark SQL CLI:
    spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
    If the following prompt appears, the Spark SQL CLI is open:
    spark-sql>

Execute SQL statements

  • Create a table
    create table h0 (
      id bigint,
      name string,
      price double
    ) using hudi;
    If the output contains information similar to the following example, the table is created.
    Time taken: 1.258 seconds
  • Insert data into a table
    insert into h0 select 1, 'a1', 10;
    If the output contains information similar to the following example, data is inserted into the table.
    Time taken: 7.294 seconds
  • Query data
    select id,name,price from h0;
    If the output contains information similar to the following example, the query operation succeeds.
    1  a1  10.0
    Time taken: 1.219 seconds, Fetched 1 row(s)
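
The three statements above can also be run in one pass instead of typing them at the prompt. The following is a minimal sketch, not part of the original procedure: it assumes that the spark-sql CLI accepts the -f option (as the standard Spark SQL CLI does), that the table h0 does not exist yet, and it uses a hypothetical file name, hudi_quickstart.sql.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name (an assumption for this sketch); any writable path works.
SQL_FILE="${SQL_FILE:-hudi_quickstart.sql}"

# Write the same three statements used in this topic to a file.
cat > "$SQL_FILE" <<'EOF'
create table h0 (
  id bigint,
  name string,
  price double
) using hudi;
insert into h0 select 1, 'a1', 10;
select id, name, price from h0;
EOF

# Run the file with the same --conf options used to open the interactive CLI.
# The run is skipped when spark-sql is not on PATH (for example, outside the cluster).
if command -v spark-sql >/dev/null 2>&1; then
  spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    -f "$SQL_FILE"
else
  echo "spark-sql not found; wrote $SQL_FILE only" >&2
fi
```

If you already created h0 interactively, the create table statement in the file is expected to fail because the table exists. In that case, remove that statement from the file or use a different table name.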