You can use Paimon to build a data lake on HDFS or OSS and use the Spark compute engine to analyze the data. This topic describes how to use Spark SQL in EMR to read data from and write data to Paimon tables.
Prerequisites
You have created a DataLake or Custom EMR cluster with Spark and Paimon installed. For more information, see Create a cluster.
Limitations
-
You can use Spark SQL to read from and write to Paimon on EMR clusters running EMR-3.46.0 or later, or EMR-5.12.0 or later.
-
Only Spark SQL in Spark 3 can read data from and write data to Paimon through a catalog.
Procedure
Step 1: Configure a catalog
Spark reads from and writes to Paimon tables through a catalog. Two types of catalogs are available: the Paimon catalog and spark_catalog. You can choose one based on your use case.
-
Paimon catalog: Manages metadata in the Paimon format. You can use it only to query and write to Paimon tables.
-
spark_catalog: The default built-in catalog in Spark. It typically manages the metadata of internal tables in Spark SQL and lets you query and write both Paimon and non-Paimon tables.
Paimon catalog
You can store metadata in a file system such as HDFS or an object storage service such as OSS. You can also synchronize metadata to DLF and Hive to make Paimon accessible to other services.
The spark.sql.catalog.paimon.warehouse parameter specifies the warehouse root path. If the root path does not exist, the system automatically creates it. If the root path exists, you can use this catalog to access the existing tables in the path.
-
Log on to the master node of your cluster over SSH. For more information, see Log on to a cluster.
-
Select a catalog to configure based on the metadata type. Run the corresponding command to start Spark SQL.
Filesystem catalog
A filesystem catalog stores metadata in a file system or object storage.
spark-sql --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \ --conf spark.sql.catalog.paimon.metastore=filesystem \ --conf spark.sql.catalog.paimon.warehouse=oss://<yourBucketName>/warehouse \ --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensionsNote-
spark.sql.catalog.paimon: Defines a catalog named paimon. -
spark.sql.catalog.paimon.metastore: Specifies the metastore type for the catalog. Setting the value tofilesystemmeans that metadata is stored in the file system. -
spark.sql.catalog.paimon.warehouse: Configures the location of the data warehouse. Replace<yourBucketName>with the name of your OSS bucket. For more information about how to create a bucket, see Create a bucket.
DLF catalog
A DLF catalog synchronizes metadata to DLF.
ImportantWhen you create the cluster, Metadata must be set to DLF Unified Metadata.
spark-sql --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \ --conf spark.sql.catalog.paimon.metastore=dlf \ --conf spark.sql.catalog.paimon.warehouse=oss://<yourBucketName>/warehouse \ --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensionsNote-
spark.sql.catalog.paimon: Defines a catalog named paimon. -
spark.sql.catalog.paimon.metastore: Specifies the metastore type for the catalog. Setting the value todlfmeans that metadata is synchronized to Data Lake Formation (DLF). -
spark.sql.catalog.paimon.warehouse: Configures the location of the data warehouse. Replace<yourBucketName>with the name of your OSS bucket. For more information about how to create a bucket, see Create a bucket.
Hive catalog
A Hive catalog synchronizes metadata to Hive Metastore. Tables created in the Hive catalog can be queried directly in Hive. For more information about querying Paimon from Hive, see Integrate Paimon with Hive.
spark-sql --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \ --conf spark.sql.catalog.paimon.metastore=hive \ --conf spark.sql.catalog.paimon.uri=thrift://master-1-1:9083 \ --conf spark.sql.catalog.paimon.warehouse=oss://<yourBucketName>/warehouse \ --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensionsNote-
spark.sql.catalog.paimon: Defines a catalog named paimon. -
spark.sql.catalog.paimon.metastore: Specifies the metastore type for the catalog. Setting the value tohivemeans that metadata is synchronized to Hive Metastore. -
spark.sql.catalog.paimon.uri: The address and port of the Hive Metastore service. The valuethrift://master-1-1:9083means that Spark SQL connects to the Hive Metastore service that runs on themaster-1-1host and listens on port 9083 to obtain metadata. -
spark.sql.catalog.paimon.warehouse: Configures the location of the data warehouse. Replace<yourBucketName>with the name of your OSS bucket. For more information about how to create a bucket, see Create a bucket.
-
spark_catalog
-
Log on to the master node of your cluster over SSH. For more information, see Log on to a cluster.
-
Run the following command to configure the catalog and start Spark SQL.
spark-sql --conf spark.sql.catalog.spark_catalog=org.apache.paimon.spark.SparkGenericCatalog \ --conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensionsNote-
spark.sql.catalog.spark_catalog: Defines a catalog named spark_catalog. -
The warehouse root path for
spark_catalogis specified by thespark.sql.warehouse.dirparameter. In most cases, you do not need to modify this parameter.
-
Step 2: Read and write Paimon tables
Run the following Spark SQL statements to create a table in the catalog and then read data from and write data to the table.
Paimon catalog
To access a Paimon table, use the format paimon.<db_name>.<tbl_name>, where <db_name> is the database name and <tbl_name> is the table name.
-- Create a database.
CREATE DATABASE IF NOT EXISTS paimon.ss_paimon_db;
-- Create a Paimon table.
CREATE TABLE paimon.ss_paimon_db.paimon_tbl (id INT, name STRING) USING paimon;
-- Write data to the Paimon table.
INSERT INTO paimon.ss_paimon_db.paimon_tbl VALUES (1, "apple"), (2, "banana"), (3, "cherry");
-- Query the write result.
SELECT * FROM paimon.ss_paimon_db.paimon_tbl ORDER BY id;
-- Drop the database.
DROP DATABASE paimon.ss_paimon_db CASCADE;
If you receive a metastore: Failed to connect to the Metastore Server error when you create a database after you configure a Hive catalog, this means the Hive Metastore service is not running. You must run the following command to start the service. After the service starts, run the command to configure the Hive catalog again.
hive --service metastore &
If you selected DLF Unified Metadata when you created the cluster, we recommend configuring a DLF catalog to synchronize the metadata to DLF.
spark_catalog
You can use spark_catalog.<db_name>.<tbl_name> to access both Paimon and non-Paimon tables. Because spark_catalog is the default built-in catalog in Spark, you can omit the catalog name and access tables directly using <db_name>.<tbl_name>. In these formats, <db_name> is the database name and <tbl_name> is the table name.
-- Create databases.
CREATE DATABASE IF NOT EXISTS ss_paimon_db;
CREATE DATABASE IF NOT EXISTS ss_parquet_db;
-- Create a Paimon table and a Parquet table.
CREATE TABLE ss_paimon_db.paimon_tbl (id INT, name STRING) USING paimon;
CREATE TABLE ss_parquet_db.parquet_tbl USING parquet AS SELECT 3, "cherry";
-- Write data to the Paimon table.
INSERT INTO ss_paimon_db.paimon_tbl VALUES (1, "apple"), (2, "banana");
INSERT INTO ss_paimon_db.paimon_tbl SELECT * FROM ss_parquet_db.parquet_tbl;
-- Query the write result.
SELECT * FROM ss_paimon_db.paimon_tbl ORDER BY id;
-- Drop the databases.
DROP DATABASE ss_paimon_db CASCADE;
DROP DATABASE ss_parquet_db CASCADE;
The query returns the following result:
1 apple
2 banana
3 cherry
FAQ
Related documentation
For more information about the usage and configuration of Paimon, see the Apache Paimon documentation.
