All Products
Search
Document Center

E-MapReduce:DDL statements

Last Updated:Oct 09, 2023

This topic describes the table creation statements that are supported after Hudi is integrated with Spark SQL.

Background information

When you use Spark SQL to create a Hudi table, you can configure parameters in options to configure the table. The following table describes the parameters.

Important

In a version later than Hudi 0.10, options is replaced with tblproperties.

Parameter

Description

Required

primaryKey

The primary key column. If you use multiple fields as primary key columns, separate the fields with commas (,).

Yes.

type

The type of the table. Valid values:

  • cow: the Copy-On-Write type. This is the default value.

  • mor: the Merge-On-Read type.

No.

preCombineField

The version field.

This parameter corresponds to the DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY field in the Hudi table.

We recommend that you configure this parameter. Otherwise, the upsert operation is not supported.

payloadClass

Default value: DefaultHoodieRecordPayload.

This parameter corresponds to the DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY field in the Hudi table.

No.

Prerequisites

An EMR Hadoop cluster is created. For more information, see Create a cluster.

Limits

Only EMR V3.36.0 and later minor versions or EMR V5.2.0 and later minor versions allow you to use Spark SQL to read data from or write data to Hudi.

Open the Spark SQL CLI

  1. Log on to your EMR cluster in SSH mode. For more information, see Log on to a cluster.
  2. Run the following command to open the Spark SQL CLI:
    spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
    If the output contains the following information, the Spark SQL CLI is opened:
    spark-sql>

Create a non-partitioned table

Examples:

  • Create a non-partitioned table that is of the cow type and uses id as the primary key column.

    create table if not exists h0(
      id bigint, 
      name string, 
      price double
    ) using hudi
    options (
      type = 'cow',
      primaryKey = 'id'
    );
  • Create a non-partitioned table that is of the mor type and uses id and name as the primary key columns.

    create table if not exists h0(
      id bigint, 
      name string, 
      price double
    ) using hudi
    options (
      type = 'mor',
      primaryKey = 'id,name'
    );
  • Create a non-partitioned table of the cow type.

    create table if not exists h0(
      id bigint, 
      name string, 
      price double
    ) using hudi
    options (
      type = 'cow'
    );

Create a partitioned table

Example:

create table if not exists h_p0 (
id bigint,
name string,
dt string, 
hh string
) using hudi
location 'oss://xxx/h_p0'
options (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'id'
 )
partitioned by (dt, hh);
Note

In the command, the location parameter specifies the storage path of the table. The path can be an Object Storage Service (OSS) storage path or a Hadoop Distributed File System (HDFS) storage path. The primary key column is id, the partition fields are dt and hh, and the version field is id.

Create an external table

You can create an external table based on an existing Hudi table. Example:

create table h0
using hudi
location '/xx/xx/h0';

CTAS statement

This section provides examples on how to use the CREATE TABLE AS SELECT (CTAS) statement.

  • Example 1:

    create table if not exists h1 using hudi 
    as 
    select 1 as id, 'a1' as name, 10 as price;
  • Example 2:

    create table if not exists h2 using hudi
    partitioned by (dt)
    location '/xx/xx/h2'
    options (
      type = 'mor',
      primaryKey = 'id,name'
    ) as
    select 1 as id, 'a1' as name, 20 as price, '2021-01-03' as dt;