An Iceberg Catalog is an external catalog that StarRocks has supported since version 2.4.
Background
With an Iceberg Catalog, you can:
- Query data in Iceberg directly through an Iceberg Catalog without creating tables manually.
- Process, model, and import data from Iceberg into StarRocks using INSERT INTO or asynchronous materialized views (available in v2.5 and later).
- Create or drop Iceberg databases and tables from StarRocks, and write data from StarRocks tables into Parquet-formatted Iceberg tables using INSERT INTO (available in v3.1 and later).
To ensure proper access to data in Iceberg, your StarRocks cluster must be able to access the storage system and metadata service of the Iceberg cluster. StarRocks currently supports the following storage systems and metadata services:
- Storage systems: Hadoop Distributed File System (HDFS) or Alibaba Cloud Object Storage Service (OSS).
- Metadata services: Hive Metastore (HMS), Data Lake Formation (DLF) 1.0 (Legacy), and DLF.
Usage notes
When you query Iceberg data from StarRocks, note the following:

| File format | Compression format | Iceberg table version |
| --- | --- | --- |
| Parquet | SNAPPY, LZ4, ZSTD, GZIP, and NO_COMPRESSION | |
| ORC | ZLIB, SNAPPY, LZO, LZ4, ZSTD, and NO_COMPRESSION | |
Create an Iceberg catalog
Syntax
```sql
CREATE EXTERNAL CATALOG <catalog_name>
[COMMENT <comment>]
PROPERTIES
(
    "type" = "iceberg",
    MetastoreParams
)
```

Parameters
The parameters vary depending on the metadata service that Iceberg uses.
Use HMS
`catalog_name`: The name of the Iceberg catalog. This parameter is required. The name must meet the following requirements:
- It must consist of letters (a-z or A-Z), digits (0-9), or underscores (_), and must start with a letter.
- The total length cannot exceed 64 characters.
- The catalog name is case-sensitive.
`comment`: The description of the Iceberg catalog. This parameter is optional.

`type`: The type of the data source. Set this to `iceberg`.

`MetastoreParams`: The parameters for StarRocks to access the metadata service of the Iceberg cluster.

| Property | Description |
| --- | --- |
| iceberg.catalog.type | The type of catalog in Iceberg. The value must be `hive`. |
| hive.metastore.uris | The URI of the Hive Metastore. The format is `thrift://<ip_address_of_hive_metastore>:<port>`. The default port is 9083. |
Use DLF 1.0 (Legacy)
`catalog_name`: The name of the Iceberg catalog. This parameter is required. The name must meet the following requirements:
- It must consist of letters (a-z or A-Z), digits (0-9), or underscores (_), and must start with a letter.
- The total length cannot exceed 64 characters.
- The catalog name is case-sensitive.
`comment`: The description of the Iceberg catalog. This parameter is optional.

`type`: The type of the data source. Set this to `iceberg`.

`MetastoreParams`: The parameters for StarRocks to access the metadata service of the Iceberg cluster.

| Property | Description |
| --- | --- |
| iceberg.catalog.type | The type of catalog in Iceberg. The value must be `dlf`. |
| dlf.catalog.id | The ID of an existing data catalog in DLF. If you do not configure the `dlf.catalog.id` parameter, the system uses the default DLF catalog. |
Use DLF
If you use DLF, you must use a configured Resource Access Management (RAM) user to perform operations in StarRocks Manager. For more information, see Use a DLF Catalog.
`catalog_name`: The name of the Iceberg Catalog. This parameter is required. The name must meet the following requirements:
- It must start with a letter and can contain only letters (a-z or A-Z), digits (0-9), and underscores (_).
- The length cannot exceed 64 characters.
`comment`: A description of the Iceberg Catalog. This parameter is optional.

`type`: The type of the data source. Set this to `iceberg`.

`CatalogParams`: The parameters for StarRocks to access the metadata of the Iceberg cluster. The parameter settings vary depending on the metadata type used by the Iceberg cluster.

| Parameter | Required | Description |
| --- | --- | --- |
| <catalog_name> | Yes | The name of the DLF data catalog, for example, `dlf_catalog`. |
| type | Yes | The catalog type. For an Iceberg data source, enter the fixed value `iceberg`. |
| uri | Yes | The REST API address of DLF. The format is `http://<VPC_Endpoint>/iceberg`, where `<VPC_Endpoint>` is the VPC endpoint of DLF in the specified region. For specific values, see Service endpoints. For example: `http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg`. |
| iceberg.catalog.type | Yes | The Iceberg Catalog type. For a DLF scenario, enter the fixed value `dlf_rest`. |
| warehouse | Yes | The name of the Iceberg Catalog. You can obtain it from the Data Catalog page of the Data Lake Formation console. |
| rest.signing-region | Yes | The Region ID of the DLF service, for example, `cn-hangzhou`. |
Example
The following examples create an Iceberg catalog for each supported metadata service.
Use HMS
```sql
CREATE EXTERNAL CATALOG iceberg_catalog_hms
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083"
);
```

Use DLF 1.0 (Legacy)
```sql
CREATE EXTERNAL CATALOG iceberg_catalog_dlf
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "dlf",
    "dlf.catalog.id" = "sr_dlf"
);
```

Use DLF
```sql
CREATE EXTERNAL CATALOG iceberg_catalog
PROPERTIES
(
    "type" = "iceberg",
    "iceberg.catalog.type" = "dlf_rest",
    "uri" = "http://cn-hangzhou-vpc.dlf.aliyuncs.com/iceberg",
    "warehouse" = "iceberg_test",
    "rest.signing-region" = "cn-hangzhou"
);
```

View an Iceberg catalog
You can run SHOW CATALOGS to query all catalogs in the current StarRocks cluster.

```sql
SHOW CATALOGS;
```

You can also run SHOW CREATE CATALOG to view the creation statement of an external catalog. For example, run the following command to view the creation statement of the Iceberg catalog iceberg_catalog_hms.

```sql
SHOW CREATE CATALOG iceberg_catalog_hms;
```

Create an Iceberg database
Similar to the internal catalog of StarRocks, if you have the CREATE DATABASE permission on an Iceberg catalog, you can use CREATE DATABASE to create a database in that catalog. This feature is available from v3.1.
You can use the GRANT and REVOKE statements to grant and revoke permissions for users and roles.
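For example, the following statement grants the privilege to create databases in a catalog. This is a sketch: the user identity is illustrative, and the exact privilege names and scopes can vary across StarRocks versions, so check the GRANT reference for your version.

```sql
-- Allow user 'jack' to create databases in the catalog iceberg_catalog_hms.
-- The user identity 'jack'@'%' is a placeholder for illustration.
GRANT CREATE DATABASE ON CATALOG iceberg_catalog_hms TO USER 'jack'@'%';
```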
Syntax
Switch to the target Iceberg catalog and then run the following statement to create an Iceberg database.
```sql
CREATE DATABASE <database_name>
[PROPERTIES ("location" = "<prefix>://<path_to_database>/<database_name.db>/")]
```

Parameters
The `location` parameter specifies the file path of the database. It supports HDFS and Alibaba Cloud OSS:
- If you use HDFS as the storage system, set `prefix` to `hdfs`.
- If you use Alibaba Cloud OSS as the storage system, set `prefix` to `oss`.

If you do not specify the `location` parameter, StarRocks creates the database in the default path of the current Iceberg catalog.
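For example, the following statement creates a database at an explicit OSS path. The bucket and database names are illustrative placeholders, not values from your environment.

```sql
-- Create a database whose files live under an OSS prefix.
-- "my-bucket" and "iceberg_db" are hypothetical names.
CREATE DATABASE iceberg_db
PROPERTIES ("location" = "oss://my-bucket/warehouse/iceberg_db.db/");
```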
Switch between Iceberg catalogs and databases
You can switch to the target Iceberg catalog and database in the following ways:

- First, use SET CATALOG to specify the Iceberg catalog for the current session, and then use USE to specify the database.

  ```sql
  -- Switch the active catalog for the current session.
  SET CATALOG <catalog_name>;
  -- Specify the active database for the current session.
  USE <db_name>;
  ```

- Run USE to directly switch the session to a specific database in the target Iceberg catalog.

  ```sql
  USE <catalog_name>.<db_name>;
  ```
Drop an Iceberg database
Similar to an internal StarRocks database, if you have the DROP permission on an Iceberg database, you can use DROP DATABASE to drop it. This feature is available from v3.1 and only supports dropping empty databases.
You can use the GRANT and REVOKE statements to grant and revoke permissions for users and roles.
The drop database operation does not delete the corresponding file path on HDFS or OSS. Switch to the target Iceberg catalog and then run the following statement to drop an Iceberg database.

```sql
DROP DATABASE <database_name>;
```

Drop an Iceberg catalog
You can run DROP CATALOG to drop an external catalog. For example, run the following command to drop iceberg_catalog_hms.

```sql
DROP CATALOG iceberg_catalog_hms;
```

Create an Iceberg table
Similar to an internal StarRocks database, if you have the CREATE TABLE permission on an Iceberg database, you can use CREATE TABLE or CREATE TABLE AS SELECT (CTAS) to create a table in that database. This feature is available from v3.1. Switch to the target Iceberg catalog and database, and then use the following syntax to create an Iceberg table.
Syntax
```sql
CREATE TABLE [IF NOT EXISTS] [database.]table_name
(column_definition1 [, column_definition2, ...,
 partition_column_definition1, partition_column_definition2, ...])
[partition_desc]
[PROPERTIES ("key" = "value", ...)]
[AS SELECT query]
```

Parameters
column_definition
The syntax for `column_definition` is as follows.

```sql
col_name col_type [COMMENT 'comment']
```

The parameters are described in the following table.

| Parameter | Description |
| --- | --- |
| col_name | The name of the column. |
| col_type | The data type of the column. The following data types are currently supported: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, DATE, DATETIME, CHAR, VARCHAR[(length)], ARRAY, MAP, and STRUCT. The LARGEINT, HLL, and BITMAP types are not supported. |

Note: The default value for all non-partition key columns is NULL. Partition key columns must be declared last in the column list and cannot be NULL.

partition_desc
The syntax for `partition_desc` is as follows.

```sql
PARTITION BY (par_col1 [, par_col2...])
```

Currently, StarRocks only supports identity transforms. This means a partition is created for each unique partition value.

Note: Partition key columns must be declared last in the column list. They support data types other than FLOAT, DOUBLE, DECIMAL, and DATETIME. NULL values are not supported.
PROPERTIES
You can declare the properties of the Iceberg table in the PROPERTIES clause in the `"key" = "value"` format. For more information, see Iceberg table properties. The following table lists some common properties.

| Property | Description |
| --- | --- |
| location | The file path of the Iceberg table. When you use HMS as the metadata service, you do not need to specify the `location` parameter. |
| file_format | The file format of the Iceberg table. Currently, only the Parquet format is supported. Default value: `parquet`. |
| compression_codec | The compression format for Iceberg tables. The supported formats are SNAPPY, GZIP, ZSTD, and LZ4. Default value: `gzip`. This property has been deprecated since version 3.2.3. In later versions, the compression algorithm for writing to Iceberg tables is controlled by the `connector_sink_compression_codec` session variable. |
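In v3.2.3 and later, you can therefore set the write compression for the current session instead of using the table property. A minimal sketch, with an illustrative codec value:

```sql
-- Use ZSTD compression for subsequent writes to Iceberg tables in this session.
SET connector_sink_compression_codec = 'zstd';
```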
Examples
Create a non-partitioned table `unpartition_tbl` that contains the `id` and `score` columns.

```sql
CREATE TABLE unpartition_tbl
(
    id int,
    score double
);
```

Create a partitioned table `partition_tbl_1` that contains the `action`, `id`, and `dt` columns, and define `id` and `dt` as partition key columns.

```sql
CREATE TABLE partition_tbl_1
(
    action varchar(20),
    id int NOT NULL,
    dt date NOT NULL
)
PARTITION BY (id, dt);
```

Query data from the source table `partition_tbl_1` and create a partitioned table `partition_tbl_2` based on the query result. Define `id` and `dt` as the partition key columns for `partition_tbl_2`.

```sql
CREATE TABLE partition_tbl_2
PARTITION BY (id, dt)
AS SELECT * FROM partition_tbl_1;
```
View an Iceberg table schema
You can view the schema of an Iceberg table in the following ways.
View the table schema.

```sql
DESC[RIBE] <catalog_name>.<database_name>.<table_name>;
```

View the table schema and the storage location of the table file from the CREATE statement.

```sql
SHOW CREATE TABLE <catalog_name>.<database_name>.<table_name>;
```
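For example, using the catalog created earlier and an illustrative database name `iceberg_db`:

```sql
-- View the columns of partition_tbl_1.
DESC iceberg_catalog_hms.iceberg_db.partition_tbl_1;
-- View the full creation statement, including the storage location.
SHOW CREATE TABLE iceberg_catalog_hms.iceberg_db.partition_tbl_1;
```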
Insert data into an Iceberg table
Similar to an internal StarRocks table, if you have the INSERT permission on an Iceberg table, you can use INSERT to write data from a StarRocks table into that Iceberg table. Currently, you can only write data to Parquet-formatted Iceberg tables. This feature is available from v3.1.
You can use the GRANT and REVOKE statements to grant and revoke permissions for users and roles.
Switch to the target Iceberg catalog and database, and then run the following syntax to write data from a StarRocks table into a Parquet-formatted Iceberg table.
Syntax
```sql
INSERT {INTO | OVERWRITE} <table_name>
[ (column_name [, ...]) ]
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }

-- Write data to a specified partition.
INSERT {INTO | OVERWRITE} <table_name>
PARTITION (par_col1=<value> [, par_col2=<value>...])
{ VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
```

Partition key columns cannot be NULL. Therefore, you must ensure that partition key columns have values during data loading.
Parameters
| Parameter | Description |
| --- | --- |
| INTO | Appends data to the target table. |
| OVERWRITE | Overwrites the data in the target table. |
| column_name | The target columns for the import. You can specify one or more columns. When you specify multiple columns, separate them with commas (`,`). |
| expression | An expression that assigns a value to the corresponding column. |
| DEFAULT | Assigns the default value to the corresponding column. |
| query | A query statement whose result is loaded into the target table. The query statement supports any SQL query syntax that StarRocks supports. |
| PARTITION | The target partition for the data load. You must specify all partition key columns of the target table. The order of the specified partition key columns can differ from the order defined when the table was created. When you specify a partition, you cannot specify the target columns using `column_name`. |
Examples
The following write statements use the default Parquet format as an example.
Insert the following three rows of data into the `partition_tbl_1` table.

```sql
INSERT INTO partition_tbl_1
VALUES
    ("buy", 1, "2023-09-01"),
    ("sell", 2, "2023-09-02"),
    ("buy", 3, "2023-09-03");
```

Insert the result data of a SELECT query that includes a simple calculation into the `partition_tbl_1` table in the specified column order.

```sql
INSERT INTO partition_tbl_1 (id, action, dt) SELECT 1+1, 'buy', '2023-09-03';
```

Insert the result data of a SELECT query that reads data from the `partition_tbl_1` table itself.

```sql
INSERT INTO partition_tbl_1
SELECT 'buy', 1, date_add(dt, INTERVAL 2 DAY)
FROM partition_tbl_1
WHERE id = 1;
```

Insert the result data of a SELECT query into the partition where `dt='2023-09-01'` and `id=1` in the `partition_tbl_2` table.

Method 1:

```sql
INSERT INTO partition_tbl_2 SELECT 'order', 1, '2023-09-01';
```

Method 2:

```sql
INSERT INTO partition_tbl_2 PARTITION (dt='2023-09-01', id=1) SELECT 'order';
```

Overwrite all values in the `action` column with `close` in the partition where `dt='2023-09-01'` and `id=1` in the `partition_tbl_1` table.

Method 1:

```sql
INSERT OVERWRITE partition_tbl_1 SELECT 'close', 1, '2023-09-01';
```

Method 2:

```sql
INSERT OVERWRITE partition_tbl_1 PARTITION (dt='2023-09-01', id=1) SELECT 'close';
```
Query data from an Iceberg table
Run SHOW DATABASES to view the databases in the Iceberg cluster that belong to the specified catalog.

```sql
SHOW DATABASES FROM <catalog_name>;
```

Switch to the target Iceberg catalog and database.

Run SELECT to query the target table in the target database.

```sql
SELECT count(*) FROM <table_name> LIMIT 10;
```
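Putting these steps together, a complete session might look like the following. The catalog, database, and table names are illustrative, reusing the names from the earlier examples.

```sql
-- List the databases visible through the catalog.
SHOW DATABASES FROM iceberg_catalog_hms;
-- Switch to the target catalog and database.
SET CATALOG iceberg_catalog_hms;
USE iceberg_db;
-- Query the target table.
SELECT count(*) FROM partition_tbl_1;
```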
Drop an Iceberg table
Similar to an internal StarRocks table, if you have the DROP permission on an Iceberg table, you can use DROP TABLE to drop it. This feature is available from v3.1.
You can use the GRANT and REVOKE statements to grant and revoke permissions for users and roles.
The drop table operation does not delete the corresponding file path and data on HDFS or OSS. A force delete, which is performed by adding the FORCE keyword, deletes the data on HDFS or OSS but does not delete the corresponding file path. Switch to the target Iceberg catalog and database, and then run the following statement to drop an Iceberg table.
```sql
DROP TABLE <table_name> FORCE;
```

References
For more information about Iceberg, see Overview.