
ApsaraDB for SelectDB: Iceberg data source

Last Updated: Feb 10, 2025

This topic describes how to integrate an Iceberg data source with ApsaraDB for SelectDB by using a catalog to perform federated analysis on the Iceberg data source.

Prerequisites

  • All nodes in your Iceberg cluster are connected to the SelectDB instance.

  • You have basic knowledge of catalogs and understand the operations that you can perform on catalogs. For more information, see Data lakehouse.

Usage notes

  • You can query only data in Iceberg V1 and V2 tables.

    Note

    Iceberg V2 tables support only position deletes; equality deletes are not supported.

  • ApsaraDB for SelectDB can only read data from external catalogs.

Procedure

In this example, an external catalog named iceberg_catalog is created. You can change the name based on your business requirements.

Step 1: Connect to an ApsaraDB for SelectDB instance

Connect to the ApsaraDB for SelectDB instance. For more information, see Connect to an ApsaraDB for SelectDB instance by using a MySQL client.

Step 2: Create an Iceberg catalog

SelectDB allows you to integrate external data sources by creating external catalogs. The API that SelectDB uses to access the metadata of an Iceberg data source varies based on the catalog type. Create a catalog of the type that fits your scenario.

Access metadata of an Iceberg data source by using the Hive API

If your Iceberg data source is created based on Hive and you want SelectDB to access the metadata of the Iceberg data source by using the Hive API, you can create a catalog of the Hive metastore type to integrate the Iceberg data source. The procedure for creating an Iceberg catalog is the same as that for creating a Hive catalog in SelectDB. The following sample code provides an example. For more information, see Hive data source.

CREATE CATALOG iceberg_catalog PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
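After the catalog is created, SelectDB caches the metadata of the Iceberg data source. If the metadata in the Hive metastore changes later, you can refresh the cached metadata. A minimal sketch, assuming the catalog name above:

-- Refresh the cached metadata of the catalog.
REFRESH CATALOG iceberg_catalog;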

Access metadata of an Iceberg data source by using the Iceberg API

If you want SelectDB to access the metadata of an Iceberg data source by using the Iceberg API, create a catalog based on the type of service that manages the Iceberg metadata.

The services that can access the metadata of an Iceberg data source by using the Iceberg API include Hadoop Distributed File System (HDFS), Hive, REST, and Data Lake Formation (DLF).

Important

The distributed file system parameters in the Iceberg catalog map one-to-one to the parameters in the hdfs-site.xml configuration file of the Iceberg data source, with the same names and meanings. Therefore, the values that you specify for these parameters in the Iceberg catalog must match the values in the hdfs-site.xml file.

Hadoop catalog

-- Create a catalog in a non-high availability (HA) Hadoop cluster.
CREATE CATALOG iceberg_hadoop PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'hdfs://your-host:8020/dir/key'
);
-- Create a catalog in an HA Hadoop cluster.
CREATE CATALOG iceberg_hadoop_ha PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'hdfs://your-nameservice/dir/key',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

If your data is stored in an Object Storage Service (OSS) bucket, you must configure the following parameters in the PROPERTIES clause.

"oss.access_key" = "ak"
"oss.secret_key" = "sk"
"oss.endpoint" = "oss-cn-beijing-internal.aliyuncs.com"
"oss.region" = "oss-cn-beijing"
Parameters

Note

If your data is stored in an OSS bucket, the OSS-related parameters are required.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The catalog type. Set the value to iceberg. |
| warehouse | Yes | The HDFS path of the data warehouse. |
| iceberg.catalog.type | Yes | The type of the Iceberg catalog. Set the value to hadoop. |
| dfs.nameservices | No | The names of the nameservices. |
| dfs.ha.namenodes.[nameservice ID] | No | The IDs of the NameNodes. |
| dfs.namenode.rpc-address.[nameservice ID].[name node ID] | No | The Remote Procedure Call (RPC) URLs of the NameNodes. The number of URLs is the same as the number of NameNodes. |
| dfs.client.failover.proxy.provider.[nameservice ID] | No | The Java class that implements active NameNode connection for the HDFS client. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
| oss.region | No | The ID of the region in which OSS data is accessed. |
| oss.endpoint | No | The endpoint that is used to access OSS data. For more information about how to obtain the endpoint, see Regions and endpoints. |
| oss.access_key | No | The AccessKey ID that is used to access OSS data. |
| oss.secret_key | No | The AccessKey secret that is used to access OSS data. |

Hive metastore

CREATE CATALOG iceberg_catalog PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

If your data is stored in an OSS bucket, you must configure the following parameters in the PROPERTIES clause.

"oss.access_key" = "ak"
"oss.secret_key" = "sk"
"oss.endpoint" = "oss-cn-beijing-internal.aliyuncs.com"
"oss.region" = "oss-cn-beijing"
Parameters

Note

If your data is stored in an OSS bucket, the OSS-related parameters are required.

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The catalog type. Set the value to iceberg. |
| iceberg.catalog.type | Yes | The type of the Iceberg catalog. Set the value to hms. |
| hive.metastore.uris | Yes | The Uniform Resource Identifier (URI) of the Hive metastore. |
| hadoop.username | No | The username that is used to log on to the HDFS. |
| dfs.nameservices | No | The names of the nameservices. |
| dfs.ha.namenodes.[nameservice ID] | No | The IDs of the NameNodes. |
| dfs.namenode.rpc-address.[nameservice ID].[name node ID] | No | The RPC URLs of the NameNodes. The number of URLs is the same as the number of NameNodes. |
| dfs.client.failover.proxy.provider.[nameservice ID] | No | The Java class that implements active NameNode connection for the HDFS client. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
| oss.region | No | The ID of the region in which OSS data is accessed. |
| oss.endpoint | No | The endpoint that is used to access OSS data. For more information about how to obtain the endpoint, see Regions and endpoints. |
| oss.access_key | No | The AccessKey ID that is used to access OSS data. |
| oss.secret_key | No | The AccessKey secret that is used to access OSS data. |

REST catalog

To use REST as an Iceberg catalog, you must deploy a REST service that implements the Iceberg REST catalog API for accessing Iceberg metadata.

CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='rest',
    'uri' = 'http://172.21.0.1:8181'
);

If your data is stored in HDFS and the HA mode is enabled, you must configure the HDFS HA-related parameters when you create the Iceberg catalog.

CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='rest',
    'uri' = 'http://172.21.0.1:8181',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.1:8020',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.2:8020',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The catalog type. Set the value to iceberg. |
| uri | Yes | The URI of the REST service. |
| iceberg.catalog.type | Yes | The type of the Iceberg catalog. Set the value to rest. |
| dfs.nameservices | No | The names of the nameservices. |
| dfs.ha.namenodes.[nameservice ID] | No | The IDs of the NameNodes. |
| dfs.namenode.rpc-address.[nameservice ID].[name node ID] | No | The RPC URLs of the NameNodes. The number of URLs is the same as the number of NameNodes. The value of this parameter is the same as that in the hdfs-site.xml file. |
| dfs.client.failover.proxy.provider.[nameservice ID] | No | The Java class that implements active NameNode connection for the HDFS client. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |

Step 3: Query the catalog

You can execute the following statement to check whether the catalog is created.

SHOW CATALOGS; -- Check whether the catalog is created.
+--------------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId    | CatalogName     | Type     | IsCurrent | CreateTime              | LastUpdateTime      | Comment                |
+--------------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | iceberg_catalog | iceberg  |           | 2024-08-06 17:09:08.058 | 2024-07-19 18:04:37 |                        |
|            0 | internal        | internal | yes       | UNRECORDED              | NULL                | Doris internal catalog |
+--------------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+
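You can also inspect the properties with which a catalog was created. A minimal sketch, assuming the catalog name above:

-- Display the CREATE CATALOG statement of the catalog, including its properties.
SHOW CREATE CATALOG iceberg_catalog;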

Step 4: Query data in the Iceberg databases and tables

  • Query data in the Iceberg databases and tables in the external catalog that you created.

    Note

    After you connect to the ApsaraDB for SelectDB instance, the internal catalog is used by default.

    1. Switch to the external catalog that you want to manage.

      SWITCH iceberg_catalog;
    2. Query data.

      After you switch to the external catalog, you can query and access the data in the external catalog in the same way as you query and access data in the internal catalog. For example, you can perform the following operations:

      • Execute the SHOW DATABASES; statement to query a list of databases.

      • Execute the USE test_db; statement to switch to the test_db database.

      • Execute the SHOW TABLES; statement to query a list of tables in a database.

      • Query the data of a table.

        A snapshot is generated in Apache Iceberg every time a write operation is performed on an Iceberg table.

        By default, SelectDB reads data only from the snapshot of the latest version in Iceberg. You can use the FOR TIME AS OF and FOR VERSION AS OF clauses to read data from historical snapshots based on the time when a snapshot was generated or on the snapshot ID. Example:

        -- Query data from the snapshot of the latest version that is created for the test_t table. 
        SELECT * FROM test_t;
        -- Query data from a snapshot that is generated at the specified point in time for the test_t table. 
        SELECT * FROM test_t FOR TIME AS OF "2022-10-07 17:20:37";
        -- Query data in the test_t table based on the ID of a specific snapshot that is generated for the table. 
        SELECT * FROM test_t FOR VERSION AS OF 868895038****72;

        If you do not know the metadata of an Iceberg table, such as the operation history, generated snapshots, and file metadata, you can use the ICEBERG_META function to query the metadata of the Iceberg table, as shown in the example after this list. For more information about the ICEBERG_META function, see ICEBERG_META.

  • Query data in the Iceberg databases and tables from the internal catalog by specifying fully qualified table names.

    -- Query data from the snapshot of the latest version that is created for the test_t table in the test_db database in the iceberg_catalog catalog. 
    SELECT * FROM iceberg_catalog.test_db.test_t;
    -- Query data from a snapshot that is generated at the specified point in time for the test_t table. 
    SELECT * FROM iceberg_catalog.test_db.test_t FOR TIME AS OF "2022-10-07 17:20:37";
    -- Query data in the test_t table based on the ID of a specific snapshot that is generated for the table. 
    SELECT * FROM iceberg_catalog.test_db.test_t FOR VERSION AS OF 868895038****72;
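The following is a minimal sketch of the ICEBERG_META table function that is mentioned above. It assumes the catalog, database, and table names used in this topic and queries the snapshot history of the test_t table:

-- Query the snapshots that are generated for the test_t table.
SELECT * FROM iceberg_meta(
    "table" = "iceberg_catalog.test_db.test_t",
    "query_type" = "snapshots"
);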

Migrate data

After the Iceberg data source is integrated, you can execute INSERT INTO statements to migrate historical data from the Iceberg data source to the SelectDB instance. For more information, see Import data by using INSERT INTO statements.
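For example, the following minimal sketch copies all data from the Iceberg table test_t into a table in the internal catalog. The target database and table names are placeholders, and the target table must already exist with a compatible schema.

-- Copy data from the external Iceberg table into an existing internal table.
-- target_db and target_table are placeholders.
INSERT INTO internal.target_db.target_table
SELECT * FROM iceberg_catalog.test_db.test_t;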

Data type mappings

The column mappings between an Iceberg data source and a SelectDB instance are the same as those between a Hive data source and a SelectDB instance. For more information, see Hive data source.