This topic describes the table creation statements that are supported after Hudi is integrated with Spark SQL.
Background information
When you use Spark SQL to create a Hudi table, you can configure parameters in options to configure the table. The following table describes the parameters.
In a version later than Hudi 0.10, options is replaced with tblproperties.
Parameter | Description | Required |
primaryKey | The primary key column. If you use multiple fields as primary key columns, separate the fields with commas (,). | Yes. |
type | The type of the table. Valid values:
| No. |
preCombineField | The version field. This parameter corresponds to the DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY field in the Hudi table. | We recommend that you configure this parameter. Otherwise, the upsert operation is not supported. |
payloadClass | Default value: DefaultHoodieRecordPayload. This parameter corresponds to the DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY field in the Hudi table. | No. |
Prerequisites
An EMR Hadoop cluster is created. For more information, see Create a cluster.
Limits
Only EMR V3.36.0 and later minor versions or EMR V5.2.0 and later minor versions allow you to use Spark SQL to read data from or write data to Hudi.
Open the Spark SQL CLI
- Log on to your EMR cluster in SSH mode. For more information, see Log on to a cluster.
- Run the following command to open the Spark SQL CLI:
spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
If the output contains the following information, the Spark SQL CLI is opened:spark-sql>
Create a non-partitioned table
Examples:
Create a non-partitioned table that is of the cow type and uses id as the primary key column.
create table if not exists h0( id bigint, name string, price double ) using hudi options ( type = 'cow', primaryKey = 'id' );
Create a non-partitioned table that is of the mor type and uses id and name as the primary key columns.
create table if not exists h0( id bigint, name string, price double ) using hudi options ( type = 'mor', primaryKey = 'id,name' );
Create a non-partitioned table of the cow type.
create table if not exists h0( id bigint, name string, price double ) using hudi options ( type = 'cow' );
Create a partitioned table
Example:
create table if not exists h_p0 (
id bigint,
name string,
dt string,
hh string
) using hudi
location 'oss://xxx/h_p0'
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'id'
)
partitioned by (dt, hh);
In the command, the location parameter specifies the storage path of the table. The path can be an Object Storage Service (OSS) storage path or a Hadoop Distributed File System (HDFS) storage path. The primary key column is id, the partition fields are dt and hh, and the version field is id.
Create an external table
You can create an external table based on an existing Hudi table. Example:
create table h0
using hudi
location '/xx/xx/h0';
CTAS statement
This section provides examples on how to use the CREATE TABLE AS SELECT (CTAS) statement.
Example 1:
create table if not exists h1 using hudi as select 1 as id, 'a1' as name, 10 as price;
Example 2:
create table if not exists h2 using hudi partitioned by (dt) location '/xx/xx/h2' options ( type = 'mor', primaryKey = 'id,name' ) as select 1 as id, 'a1' as name, 20 as price, '2021-01-03' as dt;