This topic describes the data definition language (DDL) statements that are supported after Hudi is integrated with Spark SQL.

Background information

When you use Spark SQL to create a Hudi table, you can configure parameters in options to configure the table. The following table describes the parameters.
Parameter Description
primaryKey The primary key column. If you use multiple fields as primary key columns, separate the fields with commas (,).
type The type of the table. Valid values:
  • cow: the Copy-On-Write type. This is the default value.
  • mor: the Merge-On-Read type.
preCombineField The version field.

This parameter corresponds to the DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY field in the Hudi table.

payloadClass Default value: DefaultHoodieRecordPayload.

This parameter corresponds to the DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY field in the Hudi table.

Prerequisites

An EMR Hadoop cluster is created. For more information, see Create a cluster.

Limits

Only EMR V3.36.0 and later minor versions or EMR V5.2.0 and later minor versions allow you to use Spark SQL to read data from or write data to Hudi.

Open the Spark SQL CLI

  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
  2. Run the following command to open the Spark SQL CLI:
    spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
    If the output contains the following information, the Spark SQL CLI is opened:
    spark-sql>

Create a non-partitioned table

Examples:
  • Create a non-partitioned table that is of the cow type and uses id as the primary key column.
    create table if not exists h0(
      id bigint, 
      name string, 
      price double
    ) using hudi
    options (
      type = 'cow',
      primaryKey = 'id'
    );
  • Create a non-partitioned table that is of the mor type and uses id and name as the primary key columns.
    create table if not exists h0(
      id bigint, 
      name string, 
      price double
    ) using hudi
    options (
      type = 'mor',
      primaryKey = 'id,name'
    );
  • Create a non-partitioned table of the cow type.
    create table if not exists h0(
      id bigint, 
      name string, 
      price double
    ) using hudi
    options (
      type = 'cow'
    );

Create a partitioned table

Example:
create table if not exists  h_p0 (
id bigint,
name string,
dt string, 
hh string  
) using hudi
location '/xx/xxx/h_p0'
options (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 ) 
partitioned by (dt, hh);
Note In the command, the location parameter specifies the storage path of the table. The path can be an Object Storage Service (OSS) storage path or a Hadoop Distributed File System (HDFS) storage path. The primary key column is id, the partition fields are dt and hh, and the version field is ts.

Create an external table

You can create an external table based on an existing Hudi table. Example:
create table h0
using hudi
location '/xx/xx/h0';

CTAS statement

This section provides examples on how to use the CREATE TABLE AS SELECT (CTAS) statement.

  • Example 1:
    create table if not exists h1 using hudi 
    as 
    select 1 as id, 'a1' as name, 10 as price;
  • Example 2:
    create table if not exists h2 using hudi
    partitioned by (dt)
    location '/xx/xx/h2'
    options (
      type = 'mor',
      primaryKey = 'id,name'
    ) as
    select 1 as id, 'a1' as name, 20 as price, '2021-01-03' as dt;