
E-MapReduce:Use Iceberg

Last Updated: Mar 25, 2026

Apache Iceberg is an open table format for data lakes that supports ACID transactions, schema evolution, and time travel on large datasets stored in object storage or HDFS. This topic walks you through reading from and writing to Iceberg tables in E-MapReduce (EMR) Serverless Spark using a Spark SQL session.

Prerequisites

Before you begin, make sure that you have created an EMR Serverless Spark workspace.

Choose a catalog type

Every Iceberg read/write operation in EMR Serverless Spark requires a catalog. Choose a catalog type based on where you store Iceberg metadata.

| Catalog type | Metadata storage | Minimum engine version | Configuration required |
| --- | --- | --- | --- |
| Data Catalog | Managed by Data Lake Formation (DLF) | None | None (managed through the console) |
| DLF | DLF | esr-4.7.0 / esr-3.6.0 | Spark configuration parameters |
| DLF-Legacy | DLF-Legacy (formerly DLF 1.0) | esr-4.3.0 / esr-3.3.0 / esr-2.7.0 | Spark configuration parameters |
| Hive Metastore | Hive Metastore | esr-4.3.0 / esr-3.3.0 / esr-2.7.0 | Spark configuration parameters |
| File system | Object Storage Service (OSS) or Hadoop Distributed File System (HDFS) | None | Spark configuration parameters |
Note

Both Spark SQL and Notebook support reading from and writing to Iceberg tables. This topic uses a Spark SQL session as an example.

Step 1: Create a SQL session

  1. Log on to the EMR console.

  2. In the left-side navigation pane, choose EMR Serverless > Spark.

  3. Click the name of the workspace you want to manage.

  4. In the left-side navigation pane, choose Operation Center > Sessions.

  5. On the SQL Sessions tab, click Create SQL Session.

  6. On the Create SQL Session page, go to the Spark Configuration section and add the catalog configuration for your chosen catalog type. Then click Create. For general session settings, see Manage SQL sessions. For data catalog management, see Manage data catalogs.

Data Catalog (no configuration required)

Click Add Catalog on the Catalogs page in the console. The catalog becomes available for selection directly in Spark SQL — no Spark configuration parameters needed.

DLF

Note

Requires engine version esr-4.7.0, esr-3.6.0, or later.

Add the following parameters to the Spark Configuration section:

spark.sql.extensions                                          org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.iceberg_catalog                             org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_catalog.catalog-impl                org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.iceberg_catalog.uri                         http://<region-id>-vpc.dlf.aliyuncs.com
spark.sql.catalog.iceberg_catalog.warehouse                   <catalog-name>
spark.sql.catalog.iceberg_catalog.io-impl                     org.apache.iceberg.rest.DlfFileIO
spark.sql.catalog.iceberg_catalog.rest.auth.type              sigv4
spark.sql.catalog.iceberg_catalog.rest.auth.sigv4.delegate-auth-type  none
spark.sql.catalog.iceberg_catalog.rest.signing-region         <region-id>
spark.sql.catalog.iceberg_catalog.rest.signing-name           DlfNext
spark.sql.catalog.iceberg_catalog.rest.access-key-id          <access-key-id>
spark.sql.catalog.iceberg_catalog.rest.secret-access-key      <access-key-secret>

Replace the following placeholders:

  • <region-id>: The region where your DLF service is located, for example, cn-hangzhou.

  • <catalog-name>: The name of your DLF catalog. Do not use a catalog created by data sharing.

  • <access-key-id>: The AccessKey ID of your Alibaba Cloud account or RAM user.

  • <access-key-secret>: The AccessKey secret of your Alibaba Cloud account or RAM user.

The static fields (org.apache.iceberg.spark.SparkCatalog, org.apache.iceberg.rest.RESTCatalog, org.apache.iceberg.rest.DlfFileIO, sigv4, none, DlfNext) do not need to be changed.

| Parameter | Description | Example |
| --- | --- | --- |
| spark.sql.extensions | Enables the Iceberg Spark extensions. | org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions |
| spark.sql.catalog.iceberg_catalog | Registers a Spark catalog named iceberg_catalog. | org.apache.iceberg.spark.SparkCatalog |
| spark.sql.catalog.iceberg_catalog.catalog-impl | Sets the catalog implementation to the Iceberg REST catalog. | org.apache.iceberg.rest.RESTCatalog |
| spark.sql.catalog.iceberg_catalog.uri | The REST API endpoint of the DLF Iceberg service. Format: http://<region-id>-vpc.dlf.aliyuncs.com. | http://cn-hangzhou-vpc.dlf.aliyuncs.com |
| spark.sql.catalog.iceberg_catalog.warehouse | The name of the associated DLF catalog. | <catalog-name> |
| spark.sql.catalog.iceberg_catalog.io-impl | Uses the DLF-customized FileIO implementation. | org.apache.iceberg.rest.DlfFileIO |
| spark.sql.catalog.iceberg_catalog.rest.auth.type | Enables SigV4 signature authentication for REST requests. | sigv4 |
| spark.sql.catalog.iceberg_catalog.rest.auth.sigv4.delegate-auth-type | Disables delegated authentication. The client signs requests directly with the AccessKey ID and AccessKey secret. | none |
| spark.sql.catalog.iceberg_catalog.rest.signing-region | The region used for signing. Must match the DLF service region. | cn-hangzhou |
| spark.sql.catalog.iceberg_catalog.rest.signing-name | The service name used for signing. | DlfNext |
| spark.sql.catalog.iceberg_catalog.rest.access-key-id | The AccessKey ID of your Alibaba Cloud account or RAM user. | <access-key-id> |
| spark.sql.catalog.iceberg_catalog.rest.secret-access-key | The AccessKey secret of your Alibaba Cloud account or RAM user. | <access-key-secret> |
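After the session starts with this configuration, a quick way to check that the catalog is reachable is to list its namespaces. This sketch assumes the catalog name iceberg_catalog from the configuration above:

```sql
-- List the namespaces (databases) in the configured DLF catalog
SHOW NAMESPACES IN iceberg_catalog;
```

If the request is authorized correctly, the query returns the databases in the DLF catalog; an authentication or endpoint error here usually points to a wrong region, AccessKey pair, or catalog name.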

DLF-Legacy (formerly DLF 1.0)

Note

Requires engine version esr-4.3.0, esr-3.3.0, esr-2.7.0, or later.

Metadata is stored in DLF-Legacy (formerly DLF 1.0). Add the following parameters to the Spark Configuration section:

spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.<catalog-name>                  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalog-name>.catalog-impl     org.apache.iceberg.aliyun.dlf.hive.DlfCatalog
spark.sql.catalog.<catalog-name>.dlf.catalog.id   <catalog-name>

Replace the following placeholder:

  • <catalog-name>: The name you assign to this catalog. Use the same value in both the parameter key and dlf.catalog.id.
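For example, with a hypothetical catalog name dlf_catalog, the suffix of each parameter key and the dlf.catalog.id value must all use the same name:

```
spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.dlf_catalog                     org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.dlf_catalog.catalog-impl        org.apache.iceberg.aliyun.dlf.hive.DlfCatalog
spark.sql.catalog.dlf_catalog.dlf.catalog.id      dlf_catalog
```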

Hive Metastore

Note

Requires engine version esr-4.3.0, esr-3.3.0, esr-2.7.0, or later.

Metadata is stored in a specific Hive Metastore. Add the following parameters to the Spark Configuration section:

spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.<catalog-name>                  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalog-name>.catalog-impl     org.apache.iceberg.hive.HiveCatalog
spark.sql.catalog.<catalog-name>.uri              thrift://<hms-ip>:<port>

Replace the following placeholders:

  • <catalog-name>: The name you assign to this catalog.

  • <hms-ip>: The internal IP address of your Hive Metastore. For an external Metastore, see Connect to an external Hive Metastore service.

  • <port>: The Hive Metastore port, typically 9083.
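For example, with a hypothetical catalog name hms_catalog and a Hive Metastore listening at the hypothetical internal address 172.16.0.10 on the default port, the configuration looks like this:

```
spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.hms_catalog                     org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hms_catalog.catalog-impl        org.apache.iceberg.hive.HiveCatalog
spark.sql.catalog.hms_catalog.uri                 thrift://172.16.0.10:9083
```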

File system

Metadata is stored directly in the file system using HadoopCatalog — no Hive Metastore needed. Add the following parameters to the Spark Configuration section:

spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.<catalog-name>                  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalog-name>.type             hadoop
spark.sql.catalog.<catalog-name>.warehouse        oss://<bucket-name>/warehouse

Replace the following placeholders:

  • <catalog-name>: The name you assign to this catalog.

  • <bucket-name>: The name of your OSS bucket.
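For example, with a hypothetical catalog name hadoop_catalog and a hypothetical bucket named my-iceberg-bucket:

```
spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.hadoop_catalog                  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_catalog.type             hadoop
spark.sql.catalog.hadoop_catalog.warehouse        oss://my-iceberg-bucket/warehouse
```

With a Hadoop catalog, a table such as hadoop_catalog.db.tbl keeps its data and metadata files under the warehouse path, in this example oss://my-iceberg-bucket/warehouse/db/tbl/.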

Step 2: Run Iceberg SQL

Note

All examples use iceberg_catalog — the catalog name configured in Step 1. If you used a different catalog name, replace iceberg_catalog accordingly.

  1. On the EMR Serverless Spark page, click Data Development in the left-side navigation pane.

  2. On the Development tab, click the icon for creating a new task.

  3. In the New dialog box, enter a name (for example, users_task), leave the type as SparkSQL, and click OK.

  4. Paste the following SQL into the new tab.

    Note

    If you omit the database name, tables are created in the default database of the catalog. Create a separate database to organize your tables.

    -- Create a database
    CREATE DATABASE IF NOT EXISTS iceberg_catalog.db;
    
    -- Create a non-partitioned table
    CREATE TABLE iceberg_catalog.db.tbl (
        id BIGINT NOT NULL COMMENT 'unique id',
        data STRING
    )
    USING iceberg;
    
    -- Insert data into the non-partitioned table
    INSERT INTO iceberg_catalog.db.tbl VALUES
    (1, 'Alice'),
    (2, 'Bob'),
    (3, 'Charlie');
    
    -- Query all rows
    SELECT * FROM iceberg_catalog.db.tbl;
    
    -- Query with a filter
    SELECT * FROM iceberg_catalog.db.tbl WHERE id = 2;
    
    -- Update a row
    UPDATE iceberg_catalog.db.tbl SET data = 'David' WHERE id = 3;
    
    -- Confirm the update
    SELECT * FROM iceberg_catalog.db.tbl WHERE id = 3;
    
    -- Delete a row
    DELETE FROM iceberg_catalog.db.tbl WHERE id = 1;
    
    -- Confirm the deletion
    SELECT * FROM iceberg_catalog.db.tbl;
    
    -- Create a partitioned table
    CREATE TABLE iceberg_catalog.db.part_tbl (
        id BIGINT,
        data STRING,
        category STRING,
        ts TIMESTAMP,
        dt DATE
    )
    USING iceberg
    PARTITIONED BY (dt, category);
    
    -- Insert data into the partitioned table
    INSERT INTO iceberg_catalog.db.part_tbl VALUES
      (1 , 'data-01', 'A', timestamp'2026-01-01 10:00:00', date'2026-01-01'),
      (2 , 'data-02', 'A', timestamp'2026-01-01 11:00:00', date'2026-01-01'),
      (3 , 'data-03', 'A', timestamp'2026-01-02 09:30:00', date'2026-01-02'),
      (4 , 'data-04', 'B', timestamp'2026-01-02 12:15:00', date'2026-01-02'),
      (5 , 'data-05', 'B', timestamp'2026-01-03 08:05:00', date'2026-01-03'),
      (6 , 'data-06', 'B', timestamp'2026-01-03 14:20:00', date'2026-01-03'),
      (7 , 'data-07', 'C', timestamp'2026-01-04 16:45:00', date'2026-01-04'),
      (8 , 'data-08', 'C', timestamp'2026-01-04 18:10:00', date'2026-01-04'),
      (9 , 'data-09', 'C', timestamp'2026-01-05 07:55:00', date'2026-01-05'),
      (10, 'data-10', 'A', timestamp'2026-01-05 13:35:00', date'2026-01-05');
    
    -- Query all rows from the partitioned table
    SELECT * FROM iceberg_catalog.db.part_tbl;
    
    -- Query by date partition
    SELECT * FROM iceberg_catalog.db.part_tbl WHERE dt = '2026-01-01';
    
    -- Query by category
    SELECT * FROM iceberg_catalog.db.part_tbl WHERE category = 'A';
    
    -- Query with multiple partition filters
    SELECT * FROM iceberg_catalog.db.part_tbl
    WHERE dt = '2026-01-01'
      AND category = 'A';
    
    -- Aggregate by category
    SELECT category, COUNT(*) AS count
    FROM iceberg_catalog.db.part_tbl
    GROUP BY category;
    
    -- Drop the database (use with caution — the database must be empty first)
    -- DROP DATABASE iceberg_catalog.db;
  5. In the session drop-down list, select the SQL session you created, and click Run. Results appear below the editor after the job completes.

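Because Iceberg records every write as a snapshot, you can also inspect a table's history and query past versions from the same session. The queries below use Iceberg's standard metadata tables and time-travel syntax in Spark SQL; the timestamp and snapshot ID are illustrative values, so take a real snapshot ID from the snapshots query first:

```sql
-- List the snapshots committed to the table
SELECT committed_at, snapshot_id, operation
FROM iceberg_catalog.db.tbl.snapshots;

-- Query the table as of a point in time
SELECT * FROM iceberg_catalog.db.tbl TIMESTAMP AS OF '2026-01-01 12:00:00';

-- Query the table as of a specific snapshot ID (replace with an ID from the snapshots query)
SELECT * FROM iceberg_catalog.db.tbl VERSION AS OF 4593561184435464000;
```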

What's next