Iceberg is an open table format for data lakes. You can use Iceberg to quickly build your own data lake storage service on Hadoop Distributed File System (HDFS) or Alibaba Cloud Object Storage Service (OSS). This topic describes how to read data from and write data to an Iceberg table in EMR Serverless Spark.
Prerequisites
A workspace is created. For more information, see Create a Workspace.
Procedure
Only Spark SQL jobs and notebooks can be used to read data from and write data to Iceberg tables. In this topic, a Spark SQL job is used.
Step 1: Create a session
Go to the Sessions page.
Log on to the EMR console.
In the left-side navigation pane, choose EMR Serverless > Spark.
On the Spark page, click the name of the workspace that you want to manage.
In the left-side navigation pane of the EMR Serverless Spark page, choose Operation Center > Sessions.
On the SQL Sessions tab, click Create SQL Session.
On the Create SQL Session page, add the following code in the Spark Configuration section based on the catalog that you use and click Create. For more information, see Manage SQL sessions.
Catalogs are required when you read data from or write data to Iceberg in EMR Serverless Spark. You can specify a catalog based on your business requirements.
Catalog types
Type
Description
Iceberg catalog
The catalog used to manage metadata in the Iceberg format. You can use Iceberg catalogs only to query data from and write data to Iceberg tables.
Data Lake Formation (DLF) 1.0 catalogs, Hive Metastore catalogs, and file system catalogs are supported. You can specify a catalog based on your business requirements.
To access an Iceberg table, you must specify the table name in the <catalogName>.<Database name>.<Table name> format.
Important: <catalogName> specifies the name of the catalog. You can specify a catalog name based on your business requirements. We recommend that you use the default catalog name iceberg.
spark_catalog
The default catalog of a workspace, which can be used to query data from Iceberg tables and non-Iceberg tables.
The default catalog of a workspace is used.
For information about how to use an external Hive Metastore as a catalog, see Use EMR Serverless Spark to connect to an external Hive Metastore.
To access an Iceberg table or a non-Iceberg table, you must specify the table name in the <Database name>.<Table name> format.
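For example, the two naming conventions differ only in the catalog prefix. The following hedged sketch uses an Iceberg catalog named iceberg and the workspace default spark_catalog; the database and table names are placeholders:

```sql
-- Through an Iceberg catalog named "iceberg": three-part name
SELECT id, name FROM iceberg.ss_iceberg_db.iceberg_tbl;

-- Through the default spark_catalog of the workspace: two-part name
SELECT id, name FROM ss_iceberg_db.iceberg_tbl;
```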
Catalog configuration
Use an Iceberg catalog
DLF 1.0
Metadata is stored in DLF 1.0.
spark.sql.extensions                          org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.<catalogName>               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalogName>.catalog-impl  org.apache.iceberg.aliyun.dlf.hive.DlfCatalog
Hive Metastore
Metadata is stored in a specific Hive Metastore.
spark.sql.extensions                          org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.<catalogName>               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalogName>.catalog-impl  org.apache.iceberg.hive.HiveCatalog
spark.sql.catalog.<catalogName>.uri           thrift://<yourHMSUri>:<port>
Parameter
Description
thrift://<yourHMSUri>:<port>
The Uniform Resource Identifier (URI) of the Hive Metastore. Configure this parameter in the thrift://<IP address of a Hive Metastore>:9083 format. <IP address of a Hive Metastore> specifies the internal IP address of the Hive Metastore. For information about how to specify an external Hive Metastore, see Use EMR Serverless Spark to connect to an external Hive Metastore.
File system
Metadata is stored in a file system.
spark.sql.extensions                     org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.<catalogName>          org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalogName>.type     hadoop
spark.sql.catalog.<catalogName>.warehouse oss://<yourBucketName>/warehouse
Use a Spark catalog
spark.sql.extensions              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog   org.apache.iceberg.spark.SparkSessionCatalog
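With this Spark catalog configuration, Iceberg tables reside in the default catalog of the workspace, so no catalog prefix is required in table names. A minimal sketch, assuming a hypothetical database ss_db and table tbl:

```sql
-- Two-part names: the default spark_catalog is used implicitly
CREATE DATABASE IF NOT EXISTS ss_db;
CREATE TABLE ss_db.tbl (id INT, name STRING) USING iceberg;
INSERT INTO ss_db.tbl VALUES (1, "a");
SELECT id, name FROM ss_db.tbl;
```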
Find the created session and click Start in the Actions column.
Step 2: Read data from and write data to an Iceberg table
Go to the data development page of EMR Serverless Spark.
In the left-side navigation pane of the EMR Serverless Spark page, click Data Development.
On the Development tab, click the create icon.
In the Create dialog box, set the Name parameter to users_task, use the default value SparkSQL for the Type parameter, and then click OK.
On the users_task tab, copy the following code to the code editor:
CREATE DATABASE IF NOT EXISTS iceberg.ss_iceberg_db;

CREATE TABLE iceberg.ss_iceberg_db.iceberg_tbl (id INT, name STRING) USING iceberg;

INSERT INTO iceberg.ss_iceberg_db.iceberg_tbl VALUES (1, "a"), (2, "b");

SELECT id, name FROM iceberg.ss_iceberg_db.iceberg_tbl ORDER BY id;
You can run the following commands to drop the table and database if you no longer need them.
DROP TABLE iceberg.ss_iceberg_db.iceberg_tbl;

DROP DATABASE iceberg.ss_iceberg_db;
Select a database from the Default Database drop-down list and the created SQL session from the SQL Sessions drop-down list.
Click Run. The following figure shows the output.
References
For information about how to develop and orchestrate SQL jobs, see Get started with the development of Spark SQL jobs.
For more information about Iceberg, see Apache Iceberg.
For more information about SQL sessions, see Manage SQL sessions.
For information about notebook sessions, see Manage notebook sessions.