An Iceberg Catalog is an external catalog that StarRocks has supported since version 2.4.
Background
With an Iceberg Catalog, you can:
- Query data in Iceberg directly through an Iceberg Catalog without creating tables manually.
- Process, model, and import data from Iceberg into StarRocks using INSERT INTO or asynchronous materialized views (available in v2.5 and later).
- Create or drop Iceberg databases and tables from StarRocks, and write data from StarRocks tables into Parquet-formatted Iceberg tables using INSERT INTO (available in v3.1 and later).
To ensure proper access to data in Iceberg, your StarRocks cluster must be able to access the storage system and metadata service of the Iceberg cluster. StarRocks currently supports the following storage systems and metadata services:
- Storage systems: Hadoop Distributed File System (HDFS) or Alibaba Cloud Object Storage Service (OSS).
- Metadata services: Hive Metastore (HMS), Data Lake Formation (DLF) 1.0 (Legacy), and DLF.
Usage notes
When you query Iceberg data from StarRocks, note the following:

| File format | Compression format | Iceberg table version |
| --- | --- | --- |
| Parquet | SNAPPY, LZ4, ZSTD, GZIP, and NO_COMPRESSION | |
| ORC | ZLIB, SNAPPY, LZO, LZ4, ZSTD, and NO_COMPRESSION | |
Create an Iceberg catalog
Syntax
```sql
CREATE EXTERNAL CATALOG <catalog_name>
[COMMENT <comment>]
PROPERTIES
(
    "type" = "iceberg",
    MetastoreParams
)
```

Parameters
The parameters vary depending on the metadata service that Iceberg uses.
Use HMS
`catalog_name`: The name of the Iceberg catalog. This parameter is required. The name must meet the following requirements:
- It must consist of letters (a-z or A-Z), digits (0-9), or underscores (_), and must start with a letter.
- The total length cannot exceed 64 characters.
- The catalog name is case-sensitive.
`comment`: The description of the Iceberg catalog. This parameter is optional.

`type`: The type of the data source. Set this to `iceberg`.

`MetastoreParams`: The parameters for StarRocks to access the metadata service of the Iceberg cluster.

| Property | Description |
| --- | --- |
| iceberg.catalog.type | The type of catalog in Iceberg. The value must be `hive`. |
| hive.metastore.uris | The URI of the Hive Metastore. The format is `thrift://<ip_address_of_hive_metastore>:<port>`. The default port is 9083. |
Use DLF 1.0 (Legacy)
`catalog_name`: The name of the Iceberg catalog. This parameter is required. The name must meet the following requirements:
- It must consist of letters (a-z or A-Z), digits (0-9), or underscores (_), and must start with a letter.
- The total length cannot exceed 64 characters.
- The catalog name is case-sensitive.
`comment`: The description of the Iceberg catalog. This parameter is optional.

`type`: The type of the data source. Set this to `iceberg`.

`MetastoreParams`: The parameters for StarRocks to access the metadata service of the Iceberg cluster.

| Property | Description |
| --- | --- |
| iceberg.catalog.type | The type of catalog in Iceberg. The value must be `dlf`. |
| dlf.catalog.id | The ID of an existing data catalog in DLF. If you do not configure the `dlf.catalog.id` parameter, the system uses the default DLF catalog. |
Use DLF
If you use DLF, you must use a configured Resource Access Management (RAM) user to perform operations in StarRocks Manager. For more information, see Use a DLF Catalog.
`catalog_name`: The name of the Iceberg Catalog. This parameter is required. The name must meet the following requirements:
- It must start with a letter and can contain only letters (a-z or A-Z), digits (0-9), and underscores (_).
- The length cannot exceed 64 characters.
`comment`: A description of the Iceberg Catalog. This parameter is optional.

`type`: The type of the data source. Set this to `iceberg`.

`CatalogParams`: The parameters for StarRocks to access the metadata of the Iceberg cluster. The parameter settings vary depending on the metadata type used by the Iceberg cluster.

| Parameter | Required | Description |
| --- | --- | --- |
| <catalog_name> | Yes | The name of the DLF data catalog, for example, `dlf_catalog`. |
| type | Yes | The catalog type. For an Iceberg data source, enter the fixed value `iceberg`. |
| uri | Yes | The REST API address of DLF. The format is `http://<VPC_Endpoint>/iceberg`, where `<VPC_Endpoint>` is the VPC endpoint of DLF in the specified region. For specific values, see Service endpoints. For example: `http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg`. |
| iceberg.catalog.type | Yes | The Iceberg Catalog type. For a DLF scenario, enter the fixed value `dlf_rest`. |
| warehouse | Yes | The name of the Iceberg Catalog. You can obtain it from the Data Catalog page of the Data Lake Formation console. |
| rest.signing-region | Yes | The Region ID of the DLF service, for example, `cn-hangzhou`. |
Example
The following examples create an Iceberg catalog for each supported metadata service.
Use HMS
```sql
CREATE EXTERNAL CATALOG iceberg_catalog_hms
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083"
);
```

Use DLF 1.0 (Legacy)
```sql
CREATE EXTERNAL CATALOG iceberg_catalog_dlf
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "dlf",
    "dlf.catalog.id" = "sr_dlf"
);
```

Use DLF
```sql
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "dlf_rest",
    "uri" = "http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg",
    "warehouse" = "iceberg_test",
    "rest.signing-region" = "cn-hangzhou"
);
```

View an Iceberg catalog
You can run SHOW CATALOGS to query all catalogs in the current StarRocks cluster.

```sql
SHOW CATALOGS;
```

You can also run SHOW CREATE CATALOG to view the creation statement of an external catalog. For example, run the following command to view the creation statement of the Iceberg catalog iceberg_catalog_hms.

```sql
SHOW CREATE CATALOG iceberg_catalog_hms;
```

Create an Iceberg database
Similar to the internal catalog of StarRocks, if you have the CREATE DATABASE permission on an Iceberg catalog, you can use CREATE DATABASE to create a database in that catalog. This feature is available from v3.1.
You can use the GRANT and REVOKE statements to grant and revoke permissions for users and roles.
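For example, the following statement grants the privilege to create databases in a catalog. This is a sketch: the user identity is illustrative, and the exact privilege names and scopes can vary across StarRocks versions, so check the GRANT reference for your version.

```sql
-- Allow user 'jack' to create databases in the catalog iceberg_catalog_hms.
-- The user identity 'jack'@'%' is a placeholder for illustration.
GRANT CREATE DATABASE ON CATALOG iceberg_catalog_hms TO USER 'jack'@'%';
```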
Syntax
Switch to the target Iceberg catalog and then run the following statement to create an Iceberg database.
```sql
CREATE DATABASE <database_name>
[PROPERTIES ("location" = "<prefix>://<path_to_database>/<database_name.db>/")]
```

Parameters
The `location` parameter specifies the file path of the database. It supports HDFS and Alibaba Cloud OSS:
- If you use HDFS as the storage system, set `prefix` to `hdfs`.
- If you use Alibaba Cloud OSS as the storage system, set `prefix` to `oss`.

If you do not specify the `location` parameter, StarRocks creates the database in the default path of the current Iceberg catalog.
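For example, the following statement creates a database at an explicit OSS path. The bucket and database names are illustrative placeholders, not values from your environment.

```sql
-- Create a database whose files live under an OSS prefix.
-- "my-bucket" and "iceberg_db" are hypothetical names.
CREATE DATABASE iceberg_db
PROPERTIES ("location" = "oss://my-bucket/warehouse/iceberg_db.db/");
```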
Switch between Iceberg catalogs and databases
You can switch to the target Iceberg catalog and database in the following ways:

- First, use SET CATALOG to specify the Iceberg catalog for the current session, and then use USE to specify the database.

  ```sql
  -- Switch the active catalog for the current session.
  SET CATALOG <catalog_name>;
  -- Specify the active database for the current session.
  USE <db_name>;
  ```

- Run USE to directly switch the session to a specific database in the target Iceberg catalog.

  ```sql
  USE <catalog_name>.<db_name>;
  ```
Drop an Iceberg database
Similar to an internal StarRocks database, if you have the DROP permission on an Iceberg database, you can use DROP DATABASE to drop it. This feature is available from v3.1 and only supports dropping empty databases.
You can use the GRANT and REVOKE statements to grant and revoke permissions for users and roles.
The drop database operation does not delete the corresponding file path on HDFS or OSS. Switch to the target Iceberg catalog and then run the following statement to drop an Iceberg database.

```sql
DROP DATABASE <database_name>;
```

Drop an Iceberg catalog
You can run DROP CATALOG to drop an external catalog. For example, run the following command to drop iceberg_catalog_hms.

```sql
DROP CATALOG iceberg_catalog_hms;
```

Create an Iceberg table
Similar to an internal StarRocks database, if you have the CREATE TABLE permission on an Iceberg database, you can use CREATE TABLE or CREATE TABLE AS SELECT (CTAS) to create a table in that database. This feature is available from v3.1. Switch to the target Iceberg catalog and database, and then use the following syntax to create an Iceberg table.
Syntax
```sql
CREATE TABLE [IF NOT EXISTS] [database.]table_name
(column_definition1 [, column_definition2, ...,
 partition_column_definition1, partition_column_definition2, ...])
[partition_desc]
[PROPERTIES ("key" = "value", ...)]
[AS SELECT query]
```

Parameters
column_definition
The syntax for `column_definition` is as follows.

```sql
col_name col_type [COMMENT 'comment']
```

The parameters are described in the following table.

| Parameter | Description |
| --- | --- |
| col_name | The name of the column. |
| col_type | The data type of the column. The following data types are currently supported: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, DATE, DATETIME, CHAR, VARCHAR[(length)], ARRAY, MAP, and STRUCT. The LARGEINT, HLL, and BITMAP types are not supported. |

Note: The default value for all non-partition key columns is NULL. Partition key columns must be declared last in the column list and cannot be NULL.

partition_desc
The syntax for `partition_desc` is as follows.

```sql
PARTITION BY (par_col1 [, par_col2...])
```

Currently, StarRocks only supports identity transforms. This means a partition is created for each unique partition value.

Note: Partition key columns must be declared last in the column list. They support data types other than FLOAT, DOUBLE, DECIMAL, and DATETIME. NULL values are not supported.
PROPERTIES
You can declare the properties of the Iceberg table in the PROPERTIES clause in the `"key" = "value"` format. For more information, see Iceberg table properties. The following table lists some common properties.

| Property | Description |
| --- | --- |
| location | The file path of the Iceberg table. When you use HMS as the metadata service, you do not need to specify the `location` parameter. |
| file_format | The file format of the Iceberg table. Currently, only the Parquet format is supported. Default value: `parquet`. |
| compression_codec | The compression format for Iceberg tables. The supported formats are SNAPPY, GZIP, ZSTD, and LZ4. Default value: `gzip`. This property has been deprecated since version 3.2.3. In later versions, the compression algorithm for writing to Iceberg tables is controlled by the `connector_sink_compression_codec` session variable. |
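In v3.2.3 and later, you can therefore set the write compression for the current session instead of using the table property. A minimal sketch, with an illustrative codec value:

```sql
-- Use ZSTD compression for subsequent writes to Iceberg tables in this session.
SET connector_sink_compression_codec = 'zstd';
```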
Examples
Create a non-partitioned table `unpartition_tbl` that contains the `id` and `score` columns.

```sql
CREATE TABLE unpartition_tbl
(
    id int,
    score double
);
```

Create a partitioned table `partition_tbl_1` that contains the `action`, `id`, and `dt` columns, and define `id` and `dt` as partition key columns.

```sql
CREATE TABLE partition_tbl_1
(
    action varchar(20),
    id int NOT NULL,
    dt date NOT NULL
)
PARTITION BY (id, dt);
```

Query data from the source table `partition_tbl_1` and create a partitioned table `partition_tbl_2` based on the query result. Define `id` and `dt` as the partition key columns for `partition_tbl_2`.

```sql
CREATE TABLE partition_tbl_2
PARTITION BY (id, dt)
AS SELECT * FROM partition_tbl_1;
```
View an Iceberg table schema
You can view the schema of an Iceberg table in the following ways.
View the table schema.

```sql
DESC[RIBE] <catalog_name>.<database_name>.<table_name>;
```

View the table schema and the storage location of the table file from the CREATE statement.

```sql
SHOW CREATE TABLE <catalog_name>.<database_name>.<table_name>;
```
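For example, using the catalog created earlier and an illustrative database name `iceberg_db`:

```sql
-- View the columns of partition_tbl_1.
DESC iceberg_catalog_hms.iceberg_db.partition_tbl_1;
-- View the full creation statement, including the storage location.
SHOW CREATE TABLE iceberg_catalog_hms.iceberg_db.partition_tbl_1;
```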
Insert data into an Iceberg table
Similar to an internal StarRocks table, if you have the INSERT permission on an Iceberg table, you can use INSERT to write data from a StarRocks table into that Iceberg table. Currently, you can only write data to Parquet-formatted Iceberg tables. This feature is available from v3.1.
You can use the GRANT and REVOKE statements to grant and revoke permissions for users and roles.
Switch to the target Iceberg catalog and database, and then run the following syntax to write data from a StarRocks table into a Parquet-formatted Iceberg table.
Syntax
```sql
INSERT {INTO | OVERWRITE} <table_name>
[ (column_name [, ...]) ]
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }

-- Write data to a specified partition.
INSERT {INTO | OVERWRITE} <table_name>
PARTITION (par_col1=<value> [, par_col2=<value>...])
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
```

Partition key columns cannot be NULL. Therefore, you must ensure that partition key columns have values during data loading.
Parameters
| Parameter | Description |
| --- | --- |
| INTO | Appends data to the target table. |
| OVERWRITE | Overwrites the data in the target table. |
| column_name | The target columns for the import. You can specify one or more columns. When you specify multiple columns, separate them with commas (`,`). |
| expression | An expression that assigns a value to the corresponding column. |
| DEFAULT | Assigns the default value to the corresponding column. |
| query | A query statement whose result is loaded into the target table. The query statement supports any SQL query syntax that StarRocks supports. |
| PARTITION | The target partition for the data load. You must specify all partition key columns of the target table. The order of the specified partition key columns can differ from the order defined when the table was created. When you specify a partition, you cannot specify the target columns using `column_name`. |
Examples
The following write statements use the default Parquet format as an example.
Insert the following three rows of data into the `partition_tbl_1` table.

```sql
INSERT INTO partition_tbl_1
VALUES
    ("buy", 1, "2023-09-01"),
    ("sell", 2, "2023-09-02"),
    ("buy", 3, "2023-09-03");
```

Insert the result data of a SELECT query that includes a simple calculation into the `partition_tbl_1` table in the specified column order.

```sql
INSERT INTO partition_tbl_1 (id, action, dt) SELECT 1+1, 'buy', '2023-09-03';
```

Insert the result data of a SELECT query that reads data from the `partition_tbl_1` table itself.

```sql
INSERT INTO partition_tbl_1
SELECT 'buy', 1, date_add(dt, INTERVAL 2 DAY)
FROM partition_tbl_1
WHERE id = 1;
```

Insert the result data of a SELECT query into the partition where `dt='2023-09-01'` and `id=1` in the `partition_tbl_2` table.

Method 1:

```sql
INSERT INTO partition_tbl_2 SELECT 'order', 1, '2023-09-01';
```

Method 2:

```sql
INSERT INTO partition_tbl_2 PARTITION (dt='2023-09-01', id=1) SELECT 'order';
```

Overwrite all values in the `action` column with `close` in the partition where `dt='2023-09-01'` and `id=1` in the `partition_tbl_1` table.

Method 1:

```sql
INSERT OVERWRITE partition_tbl_1 SELECT 'close', 1, '2023-09-01';
```

Method 2:

```sql
INSERT OVERWRITE partition_tbl_1 PARTITION (dt='2023-09-01', id=1) SELECT 'close';
```
Query data from an Iceberg table
Run SHOW DATABASES to view the databases in the Iceberg cluster that belong to the specified catalog.

```sql
SHOW DATABASES FROM <catalog_name>;
```

Switch to the target Iceberg catalog and database.

Run SELECT to query the target table in the target database.

```sql
SELECT count(*) FROM <table_name> LIMIT 10;
```
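Putting these steps together, a complete session might look like the following. The catalog, database, and table names are illustrative, reusing the names from the earlier examples.

```sql
-- List the databases visible through the catalog.
SHOW DATABASES FROM iceberg_catalog_hms;
-- Switch to the target catalog and database.
SET CATALOG iceberg_catalog_hms;
USE iceberg_db;
-- Query the target table.
SELECT count(*) FROM partition_tbl_1;
```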
Drop an Iceberg table
Similar to an internal StarRocks table, if you have the DROP permission on an Iceberg table, you can use DROP TABLE to drop it. This feature is available from v3.1.
You can use the GRANT and REVOKE statements to grant and revoke permissions for users and roles.
The drop table operation does not delete the corresponding file path and data on HDFS or OSS. A force delete, which is performed by adding the FORCE keyword, deletes the data on HDFS or OSS but does not delete the corresponding file path. Switch to the target Iceberg catalog and database, and then run the following statement to drop an Iceberg table.
```sql
DROP TABLE <table_name> FORCE;
```

References
For more information about Iceberg, see Overview.