
E-MapReduce:Use Iceberg

Last Updated: Mar 25, 2026

Apache Iceberg is an open table format for data lakes that supports ACID transactions, schema evolution, and time travel on large datasets stored in object storage or HDFS. This topic walks you through reading from and writing to Iceberg tables in E-MapReduce (EMR) Serverless Spark using a Spark SQL session.

Prerequisites

Before you begin, make sure that you have created an EMR Serverless Spark workspace.

Choose a catalog type

Every Iceberg read/write operation in EMR Serverless Spark requires a catalog. Choose a catalog type based on where you store Iceberg metadata.

| Catalog type | Metadata storage | Minimum engine version | Configuration required |
| --- | --- | --- | --- |
| Data Catalog | Managed by Data Lake Formation (DLF) | None | None (managed through the console) |
| DLF | DLF | esr-4.7.0 / esr-3.6.0 | Spark configuration parameters |
| DLF-Legacy | DLF-Legacy (formerly DLF 1.0) | esr-4.3.0 / esr-3.3.0 / esr-2.7.0 | Spark configuration parameters |
| Hive Metastore | Hive Metastore | esr-4.3.0 / esr-3.3.0 / esr-2.7.0 | Spark configuration parameters |
| File system | Object Storage Service (OSS) or Hadoop Distributed File System (HDFS) | None | Spark configuration parameters |
Note

Both Spark SQL and Notebook support reading from and writing to Iceberg tables. This topic uses a Spark SQL session as an example.

Step 1: Create a SQL session

  1. Log on to the EMR console.

  2. In the left-side navigation pane, choose EMR Serverless > Spark.

  3. Click the name of the workspace you want to manage.

  4. In the left-side navigation pane, choose Operation Center > Sessions.

  5. On the SQL Sessions tab, click Create SQL Session.

  6. On the Create SQL Session page, go to the Spark Configuration section and add the catalog configuration for your chosen catalog type. Then click Create. For general session settings, see Manage SQL sessions. For data catalog management, see Manage data catalogs.

Data Catalog (no configuration required)

Click Add Catalog on the Catalogs page in the console. The catalog becomes available for selection directly in Spark SQL — no Spark configuration parameters needed.

DLF

Note

Requires engine version esr-4.7.0, esr-3.6.0, or later.

Add the following parameters to the Spark Configuration section:

spark.sql.extensions                                          org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.iceberg_catalog                             org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_catalog.catalog-impl                org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.iceberg_catalog.uri                         http://<region-id>-vpc.dlf.aliyuncs.com
spark.sql.catalog.iceberg_catalog.warehouse                   <catalog-name>
spark.sql.catalog.iceberg_catalog.io-impl                     org.apache.iceberg.rest.DlfFileIO
spark.sql.catalog.iceberg_catalog.rest.auth.type              sigv4
spark.sql.catalog.iceberg_catalog.rest.auth.sigv4.delegate-auth-type  none
spark.sql.catalog.iceberg_catalog.rest.signing-region         <region-id>
spark.sql.catalog.iceberg_catalog.rest.signing-name           DlfNext
spark.sql.catalog.iceberg_catalog.rest.access-key-id          <access-key-id>
spark.sql.catalog.iceberg_catalog.rest.secret-access-key      <access-key-secret>

Replace the following placeholders:

  • <region-id>: The region where your DLF service is located, for example, cn-hangzhou.

  • <catalog-name>: The name of your DLF catalog. Do not use a catalog created by data sharing.

  • <access-key-id>: The AccessKey ID of your Alibaba Cloud account or RAM user.

  • <access-key-secret>: The AccessKey secret of your Alibaba Cloud account or RAM user.

The static fields (org.apache.iceberg.spark.SparkCatalog, org.apache.iceberg.rest.RESTCatalog, org.apache.iceberg.rest.DlfFileIO, sigv4, none, DlfNext) do not need to be changed.

| Parameter | Description | Example |
| --- | --- | --- |
| spark.sql.extensions | Enables the Iceberg Spark extensions. | org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions |
| spark.sql.catalog.iceberg_catalog | Registers a Spark catalog named iceberg_catalog. | org.apache.iceberg.spark.SparkCatalog |
| spark.sql.catalog.iceberg_catalog.catalog-impl | Sets the catalog implementation to the Iceberg REST catalog. | org.apache.iceberg.rest.RESTCatalog |
| spark.sql.catalog.iceberg_catalog.uri | The REST API endpoint of the DLF Iceberg service. Format: http://<region-id>-vpc.dlf.aliyuncs.com. | http://cn-hangzhou-vpc.dlf.aliyuncs.com |
| spark.sql.catalog.iceberg_catalog.warehouse | The name of the associated DLF catalog. | <catalog-name> |
| spark.sql.catalog.iceberg_catalog.io-impl | Uses the DLF-customized FileIO implementation. | org.apache.iceberg.rest.DlfFileIO |
| spark.sql.catalog.iceberg_catalog.rest.auth.type | Enables SigV4 signature authentication for REST requests. | sigv4 |
| spark.sql.catalog.iceberg_catalog.rest.auth.sigv4.delegate-auth-type | Disables delegated authentication. The client signs requests directly with the AccessKey ID and AccessKey secret. | none |
| spark.sql.catalog.iceberg_catalog.rest.signing-region | The region used for signing. Must match the DLF service region. | cn-hangzhou |
| spark.sql.catalog.iceberg_catalog.rest.signing-name | The service name used for signing. | DlfNext |
| spark.sql.catalog.iceberg_catalog.rest.access-key-id | The AccessKey ID of your Alibaba Cloud account or RAM user. | <access-key-id> |
| spark.sql.catalog.iceberg_catalog.rest.secret-access-key | The AccessKey secret of your Alibaba Cloud account or RAM user. | <access-key-secret> |
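After the session starts with this configuration, a quick way to check that the catalog is reachable is to list its namespaces. This sketch assumes the catalog name iceberg_catalog from the configuration above:

```sql
-- List the namespaces (databases) in the configured DLF catalog
SHOW NAMESPACES IN iceberg_catalog;
```

If the request is authorized correctly, the query returns the databases in the DLF catalog; an authentication or endpoint error here usually points to a wrong region, AccessKey pair, or catalog name.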

DLF-Legacy (formerly DLF 1.0)

Note

Requires engine version esr-4.3.0, esr-3.3.0, esr-2.7.0, or later.

Metadata is stored in DLF-Legacy (formerly DLF 1.0). Add the following parameters to the Spark Configuration section:

spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.<catalog-name>                  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalog-name>.catalog-impl     org.apache.iceberg.aliyun.dlf.hive.DlfCatalog
spark.sql.catalog.<catalog-name>.dlf.catalog.id   <catalog-name>

Replace the following placeholder:

  • <catalog-name>: The name you assign to this catalog. Use the same value in both the parameter key and dlf.catalog.id.
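For example, with a hypothetical catalog name dlf_catalog, the suffix of each parameter key and the dlf.catalog.id value must all use the same name:

```
spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.dlf_catalog                     org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.dlf_catalog.catalog-impl        org.apache.iceberg.aliyun.dlf.hive.DlfCatalog
spark.sql.catalog.dlf_catalog.dlf.catalog.id      dlf_catalog
```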

Hive Metastore

Note

Requires engine version esr-4.3.0, esr-3.3.0, esr-2.7.0, or later.

Metadata is stored in a specific Hive Metastore. Add the following parameters to the Spark Configuration section:

spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.<catalog-name>                  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalog-name>.catalog-impl     org.apache.iceberg.hive.HiveCatalog
spark.sql.catalog.<catalog-name>.uri              thrift://<hms-ip>:<port>

Replace the following placeholders:

  • <catalog-name>: The name you assign to this catalog.

  • <hms-ip>: The internal IP address of your Hive Metastore. For an external Metastore, see Connect to an external Hive Metastore service.

  • <port>: The Hive Metastore port, typically 9083.
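For example, with a hypothetical catalog name hms_catalog and a Hive Metastore listening at the hypothetical internal address 172.16.0.10 on the default port, the configuration looks like this:

```
spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.hms_catalog                     org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hms_catalog.catalog-impl        org.apache.iceberg.hive.HiveCatalog
spark.sql.catalog.hms_catalog.uri                 thrift://172.16.0.10:9083
```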

File system

Metadata is stored directly in the file system using HadoopCatalog — no Hive Metastore needed. Add the following parameters to the Spark Configuration section:

spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.<catalog-name>                  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalog-name>.type             hadoop
spark.sql.catalog.<catalog-name>.warehouse        oss://<bucket-name>/warehouse

Replace the following placeholders:

  • <catalog-name>: The name you assign to this catalog.

  • <bucket-name>: The name of your OSS bucket.
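For example, with a hypothetical catalog name hadoop_catalog and a hypothetical bucket named my-iceberg-bucket:

```
spark.sql.extensions                              org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.hadoop_catalog                  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_catalog.type             hadoop
spark.sql.catalog.hadoop_catalog.warehouse        oss://my-iceberg-bucket/warehouse
```

With a Hadoop catalog, a table such as hadoop_catalog.db.tbl keeps its data and metadata files under the warehouse path, in this example oss://my-iceberg-bucket/warehouse/db/tbl/.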

Step 2: Run Iceberg SQL

Note

All examples use iceberg_catalog — the catalog name configured in Step 1. If you used a different catalog name, replace iceberg_catalog accordingly.

  1. On the EMR Serverless Spark page, click Data Development in the left-side navigation pane.

  2. On the Development tab, click the icon for creating a new task.

  3. In the New dialog box, enter a name (for example, users_task), leave the type as SparkSQL, and click OK.

  4. Paste the following SQL into the new tab.

    Note

    If you omit the database name, tables are created in the default database of the catalog. Create a separate database to organize your tables.

    -- Create a database
    CREATE DATABASE IF NOT EXISTS iceberg_catalog.db;
    
    -- Create a non-partitioned table
    CREATE TABLE iceberg_catalog.db.tbl (
        id BIGINT NOT NULL COMMENT 'unique id',
        data STRING
    )
    USING iceberg;
    
    -- Insert data into the non-partitioned table
    INSERT INTO iceberg_catalog.db.tbl VALUES
    (1, 'Alice'),
    (2, 'Bob'),
    (3, 'Charlie');
    
    -- Query all rows
    SELECT * FROM iceberg_catalog.db.tbl;
    
    -- Query with a filter
    SELECT * FROM iceberg_catalog.db.tbl WHERE id = 2;
    
    -- Update a row
    UPDATE iceberg_catalog.db.tbl SET data = 'David' WHERE id = 3;
    
    -- Confirm the update
    SELECT * FROM iceberg_catalog.db.tbl WHERE id = 3;
    
    -- Delete a row
    DELETE FROM iceberg_catalog.db.tbl WHERE id = 1;
    
    -- Confirm the deletion
    SELECT * FROM iceberg_catalog.db.tbl;
    
    -- Create a partitioned table
    CREATE TABLE iceberg_catalog.db.part_tbl (
        id BIGINT,
        data STRING,
        category STRING,
        ts TIMESTAMP,
        dt DATE
    )
    USING iceberg
    PARTITIONED BY (dt, category);
    
    -- Insert data into the partitioned table
    INSERT INTO iceberg_catalog.db.part_tbl VALUES
      (1 , 'data-01', 'A', timestamp'2026-01-01 10:00:00', date'2026-01-01'),
      (2 , 'data-02', 'A', timestamp'2026-01-01 11:00:00', date'2026-01-01'),
      (3 , 'data-03', 'A', timestamp'2026-01-02 09:30:00', date'2026-01-02'),
      (4 , 'data-04', 'B', timestamp'2026-01-02 12:15:00', date'2026-01-02'),
      (5 , 'data-05', 'B', timestamp'2026-01-03 08:05:00', date'2026-01-03'),
      (6 , 'data-06', 'B', timestamp'2026-01-03 14:20:00', date'2026-01-03'),
      (7 , 'data-07', 'C', timestamp'2026-01-04 16:45:00', date'2026-01-04'),
      (8 , 'data-08', 'C', timestamp'2026-01-04 18:10:00', date'2026-01-04'),
      (9 , 'data-09', 'C', timestamp'2026-01-05 07:55:00', date'2026-01-05'),
      (10, 'data-10', 'A', timestamp'2026-01-05 13:35:00', date'2026-01-05');
    
    -- Query all rows from the partitioned table
    SELECT * FROM iceberg_catalog.db.part_tbl;
    
    -- Query by date partition
    SELECT * FROM iceberg_catalog.db.part_tbl WHERE dt = '2026-01-01';
    
    -- Query by category
    SELECT * FROM iceberg_catalog.db.part_tbl WHERE category = 'A';
    
    -- Query with multiple partition filters
    SELECT * FROM iceberg_catalog.db.part_tbl
    WHERE dt = '2026-01-01'
      AND category = 'A';
    
    -- Aggregate by category
    SELECT category, COUNT(*) AS count
    FROM iceberg_catalog.db.part_tbl
    GROUP BY category;
    
    -- Drop the database (use with caution — the database must be empty first)
    -- DROP DATABASE iceberg_catalog.db;
  5. In the session drop-down list, select the SQL session you created, and click Run. Results appear below the editor after the job completes.

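Because Iceberg records every write as a snapshot, you can also inspect a table's history and query past versions from the same session. The queries below use Iceberg's standard metadata tables and time-travel syntax in Spark SQL; the timestamp and snapshot ID are illustrative values, so take a real snapshot ID from the snapshots query first:

```sql
-- List the snapshots committed to the table
SELECT committed_at, snapshot_id, operation
FROM iceberg_catalog.db.tbl.snapshots;

-- Query the table as of a point in time
SELECT * FROM iceberg_catalog.db.tbl TIMESTAMP AS OF '2026-01-01 12:00:00';

-- Query the table as of a specific snapshot ID (replace with an ID from the snapshots query)
SELECT * FROM iceberg_catalog.db.tbl VERSION AS OF 4593561184435464000;
```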

What's next