This topic describes the table creation statements that are supported after Hudi is integrated with Spark SQL.
Background information
When you use Spark SQL to create a Hudi table, you can configure parameters in options to configure the table. The following table describes the parameters.
In a version later than Hudi 0.10, options is replaced with tblproperties.
Parameter | Description | Required |
primaryKey | The primary key column. If you use multiple fields as primary key columns, separate the fields with commas (,). | Yes. |
type | The type of the table. Valid values:
| No. |
preCombineField | The version field. This parameter corresponds to the DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY field in the Hudi table. | We recommend that you configure this parameter. Otherwise, the upsert operation is not supported. |
payloadClass | Default value: DefaultHoodieRecordPayload. This parameter corresponds to the DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY field in the Hudi table. | No. |
Prerequisites
An E-MapReduce (EMR) cluster that contains the Spark and Hudi services is created. For more information, see Create a cluster.
limits
Only clusters of EMR V3.36.0 or a later minor version and clusters of EMR V5.2.0 or a later minor version allow you to use Spark SQL to read data from or write data to Hudi.
Methods to start Spark SQL
- Spark 2 or Spark 3 and Hudi of a version earilier than 0.11
spark-sql \ --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' - Spark 3 and Hudi 0.11 or later
spark-sql \ --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \ --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
Create a non-partitioned table
Examples:
Create a non-partitioned table that is of the cow type and uses id as the primary key column.
create table if not exists h0( id bigint, name string, price double ) using hudi options ( type = 'cow', primaryKey = 'id' );Create a non-partitioned table that is of the mor type and uses id and name as the primary key columns.
create table if not exists h0( id bigint, name string, price double ) using hudi options ( type = 'mor', primaryKey = 'id,name' );Create a non-partitioned table of the cow type.
create table if not exists h0( id bigint, name string, price double ) using hudi options ( type = 'cow' );
Create a partitioned table
Example:
create table if not exists h_p0 (
id bigint,
name string,
dt string,
hh string
) using hudi
location 'oss://xxx/h_p0'
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'id'
)
partitioned by (dt, hh);In the command, the location parameter specifies the storage path of the table. The path can be an Object Storage Service (OSS) storage path or a Hadoop Distributed File System (HDFS) storage path. The primary key column is id, the partition fields are dt and hh, and the version field is id.
Create an external table
You can create an external table based on an existing Hudi table. Example:
create table h0
using hudi
location '/xx/xx/h0';CTAS statement
This section provides examples on how to use the CREATE TABLE AS SELECT (CTAS) statement.
Example 1:
create table if not exists h1 using hudi as select 1 as id, 'a1' as name, 10 as price;Example 2:
create table if not exists h2 using hudi partitioned by (dt) location '/xx/xx/h2' options ( type = 'mor', primaryKey = 'id,name' ) as select 1 as id, 'a1' as name, 20 as price, '2021-01-03' as dt;