Delta Lake and Hudi are mainstream data lake table formats. This topic describes how to use Spark to read data from and write data to Delta Lake and Hudi.
Background information
For more information about Delta Lake and Hudi, see Delta Lake documentation and Hudi documentation.
Preparations
Environments
Add Project Object Model (POM) dependencies that are related to Delta Lake or Hudi to your project.
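For example, a Maven project can declare the connector dependencies as shown below. This is an illustrative sketch: the artifact IDs and versions must match the Spark and Scala versions deployed in your cluster, so check the official Delta Lake and Hudi release notes before copying them.

```xml
<!-- Illustrative POM snippet; align the versions with your cluster's Spark and Scala versions. -->
<dependencies>
  <!-- Delta Lake connector, built for Spark 3.x with Scala 2.12 -->
  <dependency>
    <groupId>io.delta</groupId>
    <artifactId>delta-core_2.12</artifactId>
    <version>2.4.0</version>
  </dependency>
  <!-- Hudi Spark bundle, built for Spark 3.x with Scala 2.12 -->
  <dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-spark3-bundle_2.12</artifactId>
    <version>0.14.0</version>
  </dependency>
</dependencies>
```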
Parameters
Delta Lake parameters
spark.sql.extensions    io.delta.sql.DeltaSparkSessionExtension
If Spark 3 is deployed in your cluster, you also need to configure the following parameter:
spark.sql.catalog.spark_catalog    org.apache.spark.sql.delta.catalog.DeltaCatalog
Hudi parameters
spark.serializer    org.apache.spark.serializer.KryoSerializer
spark.sql.extensions    org.apache.spark.sql.hudi.HoodieSparkSessionExtension
If Spark 3 is deployed in your cluster, you also need to configure the following parameter:
spark.sql.catalog.spark_catalog    org.apache.spark.sql.hudi.catalog.HoodieCatalog
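The parameters above are typically passed with --conf when you launch Spark. The following is a sketch for a Spark 3 cluster; it assumes the Delta Lake or Hudi JARs are already on the classpath:

```
# Delta Lake on Spark 3 (sketch; assumes the Delta Lake JARs are on the classpath)
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

# Hudi on Spark 3 (sketch; assumes the Hudi bundle JAR is on the classpath)
spark-sql \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
```

You can also set these parameters in spark-defaults.conf so that every session picks them up automatically.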
Use Spark to read data from and write data to Delta Lake
Spark SQL syntax
The following example describes how to use Spark SQL to read data from and write data to Delta Lake.
-- Create a table.
create table delta_tbl (id int, name string) using delta;
-- Insert data into the table.
insert into delta_tbl values (1, 'a1'), (2, 'a2');
-- Update data in the table.
update delta_tbl set name = 'a1_new' where id = 1;
-- Delete data from the table.
delete from delta_tbl where id = 1;
-- Query data from the table.
select * from delta_tbl;
Spark Dataset syntax
The following example describes how to use Spark Dataset to read data from and write data to Delta Lake.
// Write data to the table.
import spark.implicits._ // Required for toDF(); imported by default in spark-shell.
val df = Seq((1, "a1"), (2, "a2")).toDF("id", "name")
df.write.format("delta").save("/tmp/delta_tbl")
// Read data from the table.
spark.read.format("delta").load("/tmp/delta_tbl")
Use Spark to read data from and write data to Hudi
Spark SQL syntax
The following example describes how to use Spark SQL to read data from and write data to Hudi.
-- Create a table.
create table hudi_tbl (
id bigint,
name string,
price double,
ts long
) using hudi
tblproperties (
primaryKey="id",
preCombineField="ts"
);
-- Insert data into the table.
insert into hudi_tbl values (1, 'a1', 10.0, 1000), (2, 'a2', 11.0, 1000);
-- Update data in the table.
update hudi_tbl set name = 'a1_new' where id = 1;
-- Delete data from the table.
delete from hudi_tbl where id = 1;
-- Query data from the table.
select * from hudi_tbl;
Spark Dataset syntax
The following example describes how to use Spark Dataset to read data from and write data to Hudi.
// Write data to the table.
import org.apache.hudi.DataSourceWriteOptions._
import spark.implicits._ // Required for toDF(); imported by default in spark-shell.
val df = Seq((1, "a1", 10.0, 1000), (2, "a2", 11.0, 1000)).toDF("id", "name", "price", "ts")
df.write.format("hudi").
option(PRECOMBINE_FIELD.key(), "ts").
option(RECORDKEY_FIELD.key(), "id").
option(PARTITIONPATH_FIELD.key(), "").
option("hoodie.table.name", "hudi_tbl").
mode("append").
save("/tmp/hudi_tbl")
// Read data from the table.
spark.read.format("hudi").load("/tmp/hudi_tbl")