
E-MapReduce:Use Spark to read data from or write data to Delta Lake and Hudi

Last Updated:Oct 30, 2023

Delta Lake and Hudi are mainstream data lake products. This topic describes how to use Spark to read data from and write data to Delta Lake and Hudi.

Background information

For more information about Delta Lake and Hudi, see Delta Lake documentation and Hudi documentation.

Preparations

Environments

Add Project Object Model (POM) dependencies that are related to Delta Lake or Hudi to your project.
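
    For example, if you build your project with Maven, the dependencies look similar to the following. The versions shown are illustrative; choose the artifact and version that match the Spark and Scala versions in your cluster.

    <!-- Delta Lake: match the artifact version to your Spark/Scala version. -->
    <dependency>
        <groupId>io.delta</groupId>
        <artifactId>delta-core_2.12</artifactId>
        <version>2.0.0</version>
    </dependency>

    <!-- Hudi: use the Spark bundle that matches your Spark/Scala version. -->
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-spark3-bundle_2.12</artifactId>
        <version>0.12.0</version>
    </dependency>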

Parameters

  • Delta Lake parameters

    spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension

    If Spark 3 is deployed in your cluster, you also need to configure the following parameter:

    spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
  • Hudi parameters

    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension

    If Spark 3 is deployed in your cluster, you also need to configure the following parameter:

    spark.sql.catalog.spark_catalog org.apache.spark.sql.hudi.catalog.HoodieCatalog
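
You can set these parameters in the Spark configuration of your cluster, or pass them with the --conf option when you launch a Spark client. The following sketch shows a spark-sql session configured for Hudi on Spark 3; for Delta Lake, substitute the Delta Lake values listed above.

    spark-sql \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
      --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog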

Use Spark to read data from and write data to Delta Lake

Spark SQL syntax

The following example describes how to use Spark SQL to read data from and write data to Delta Lake.

-- Create a table.
create table delta_tbl (id int, name string) using delta;

-- Insert data into the table.
insert into delta_tbl values (1, 'a1'), (2, 'a2');

-- Update data in the table.
update delta_tbl set name = 'a1_new' where id = 1;

-- Delete data from the table.
delete from delta_tbl where id = 1;

-- Query data from the table.
select * from delta_tbl;

Spark Dataset syntax

The following example describes how to use Spark Dataset to read data from and write data to Delta Lake.

// Write data to the table.
val df = Seq((1, "a1"), (2, "a2")).toDF("id", "name")
df.write.format("delta").save("/tmp/delta_tbl")

// Read data from the table.
spark.read.format("delta").load("/tmp/delta_tbl")

Use Spark to read data from and write data to Hudi

Spark SQL syntax

The following example describes how to use Spark SQL to read data from and write data to Hudi.

-- Create a table.
create table hudi_tbl (
  id bigint,
  name string,
  price double,
  ts long
) using hudi
tblproperties (
  primaryKey="id",
  preCombineField="ts"
);

-- Insert data into the table.
insert into hudi_tbl values (1, 'a1', 10.0, 1000), (2, 'a2', 11.0, 1000);

-- Update data in the table.
update hudi_tbl set name = 'a1_new' where id = 1;

-- Delete data from the table.
delete from hudi_tbl where id = 1;

-- Query data from the table.
select * from hudi_tbl;

Spark Dataset syntax

The following example describes how to use Spark Dataset to read data from and write data to Hudi.

// Write data to the table.
import org.apache.hudi.DataSourceWriteOptions._

val df = Seq((1, "a1", 10.0, 1000), (2, "a2", 11.0, 1000)).toDF("id", "name", "price", "ts")

df.write.format("hudi").
option(PRECOMBINE_FIELD.key(), "ts").
option(RECORDKEY_FIELD.key(), "id").
option(PARTITIONPATH_FIELD.key(), "").
option("hoodie.table.name", "hudi_tbl").
mode("append").
save("/tmp/hudi_tbl")

// Read data from the table.
spark.read.format("hudi").load("/tmp/hudi_tbl")