This topic describes how to integrate an Iceberg data source with ApsaraDB for SelectDB by using a catalog to perform federated analysis on the Iceberg data source.
Prerequisites
All nodes in your Iceberg cluster are connected to the SelectDB instance.
All nodes in the Iceberg cluster reside in the same virtual private cloud (VPC) as the SelectDB instance. If the nodes in the data source cluster reside in a different VPC, you must connect them to the SelectDB instance. For more information, see What do I do if a connection fails to be established between an ApsaraDB for SelectDB instance and a data source?
The IP addresses of all nodes in the Iceberg cluster are added to the IP address whitelist of the SelectDB instance. For more information, see Configure an IP address whitelist.
The IP addresses of the SelectDB instance in its VPC are added to the IP address whitelist of the Iceberg cluster if the Iceberg cluster supports a whitelist mechanism.
To obtain the IP addresses of the SelectDB instance in the VPC to which the instance belongs, perform the operations described in How do I view the IP addresses in the VPC to which my ApsaraDB SelectDB instance belongs?
To obtain the public IP address of the SelectDB instance, run the ping command against the public endpoint of the SelectDB instance.
You have basic knowledge of catalogs and understand the operations that you can perform on catalogs. For more information, see Data lakehouse.
Usage notes
You can query only data in Iceberg tables of V1 and V2.
Note: Iceberg V2 tables support only position deletes and do not support equality deletes. ApsaraDB for SelectDB can only read data in external catalogs.
Procedure
In this example, an external catalog named iceberg_catalog is created. You can change the name based on your business requirements.
Step 1: Connect to an ApsaraDB for SelectDB instance
Connect to a SelectDB instance. For more information, see Connect to an ApsaraDB for SelectDB instance by using a MySQL client.
Step 2: Create an Iceberg catalog
SelectDB allows you to integrate external data sources by creating external catalogs. The API that is used to access the metadata of an Iceberg data source varies based on the catalog type. Create a catalog of the required type.
Access metadata of an Iceberg data source by using the Hive API
If your Iceberg data source is created based on Hive and you want SelectDB to access the metadata of the Iceberg data source by using the Hive API, you can create a catalog of the Hive metastore type to integrate the Iceberg data source. The procedure for creating an Iceberg catalog is the same as that for creating a Hive catalog in SelectDB. The following sample code provides an example. For more information, see Hive data source.
CREATE CATALOG iceberg_catalog PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
Access metadata of an Iceberg data source by using the Iceberg API
If you want SelectDB to access the metadata of an Iceberg data source by using the Iceberg API, you can create a catalog based on the service type of Iceberg.
The services that can access the metadata of an Iceberg data source by using the Iceberg API include Hadoop Distributed File System (HDFS), Hive, REST, and Data Lake Formation (DLF).
The distributed file system parameters in an Iceberg catalog map to the parameters of the same names in the hdfs-site.xml configuration file of the Iceberg data source, and their descriptions are the same. Therefore, set these parameters in the Iceberg catalog to the same values as in the hdfs-site.xml configuration file.
Hadoop catalog
-- Create a catalog in a non-high availability (HA) Hadoop cluster.
CREATE CATALOG iceberg_hadoop PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'hdfs://your-host:8020/dir/key'
);

-- Create a catalog in an HA Hadoop cluster.
CREATE CATALOG iceberg_hadoop_ha PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 'hdfs://your-nameservice/dir/key',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
If your data is stored in an Object Storage Service (OSS) bucket, you must also configure the following parameters in the PROPERTIES clause. A combined example is provided after the parameter table below.
"oss.access_key" = "ak"
"oss.secret_key" = "sk"
"oss.endpoint" = "oss-cn-beijing-internal.aliyuncs.com"
"oss.region" = "oss-cn-beijing"
Parameters
If your data is stored in an OSS bucket, the OSS-related parameters are required.
| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The catalog type. Set the value to iceberg. |
| warehouse | Yes | The HDFS path of the data warehouse. |
| iceberg.catalog.type | Yes | The type of the Iceberg catalog. Set the value to hadoop. |
| dfs.nameservices | No | The names of the nameservices. |
| dfs.ha.namenodes.[nameservice ID] | No | The IDs of the NameNodes. |
| dfs.namenode.rpc-address.[nameservice ID].[name node ID] | No | The Remote Procedure Call (RPC) addresses of the NameNodes. The number of addresses must be the same as the number of NameNodes. |
| dfs.client.failover.proxy.provider.[nameservice ID] | No | The Java class that the HDFS client uses to connect to the active NameNode. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
| oss.region | No | The ID of the region in which the OSS data resides. |
| oss.endpoint | No | The endpoint that is used to access OSS. |
| oss.access_key | No | The AccessKey ID that is used to access OSS. |
| oss.secret_key | No | The AccessKey secret that is used to access OSS. |
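If the warehouse of a Hadoop catalog resides in an OSS bucket, the OSS parameters are specified together with the catalog properties in a single CREATE CATALOG statement. The following is a minimal sketch; the oss:// warehouse path, endpoint, region, and AccessKey pair are placeholders that you must replace with your own values.
-- Create a Hadoop catalog whose warehouse resides in an OSS bucket.
-- The warehouse path and OSS credentials below are placeholders.
CREATE CATALOG iceberg_hadoop_oss PROPERTIES (
    'type' = 'iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'oss://your-bucket/dir/key',
    'oss.access_key' = 'ak',
    'oss.secret_key' = 'sk',
    'oss.endpoint' = 'oss-cn-beijing-internal.aliyuncs.com',
    'oss.region' = 'oss-cn-beijing'
);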
Hive metastore
CREATE CATALOG iceberg_catalog PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
If your data is stored in an OSS bucket, you must also configure the following parameters in the PROPERTIES clause. A combined example is provided after the parameter table below.
"oss.access_key" = "ak"
"oss.secret_key" = "sk"
"oss.endpoint" = "oss-cn-beijing-internal.aliyuncs.com"
"oss.region" = "oss-cn-beijing"
Parameters
If your data is stored in an OSS bucket, the OSS-related parameters are required.
| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The catalog type. Set the value to iceberg. |
| iceberg.catalog.type | Yes | The type of the Iceberg catalog. Set the value to hms. |
| hive.metastore.uris | Yes | The Uniform Resource Identifier (URI) of the Hive metastore. |
| hadoop.username | No | The username that is used to log on to HDFS. |
| dfs.nameservices | No | The names of the nameservices. |
| dfs.ha.namenodes.[nameservice ID] | No | The IDs of the NameNodes. |
| dfs.namenode.rpc-address.[nameservice ID].[name node ID] | No | The RPC addresses of the NameNodes. The number of addresses must be the same as the number of NameNodes. |
| dfs.client.failover.proxy.provider.[nameservice ID] | No | The Java class that the HDFS client uses to connect to the active NameNode. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
| oss.region | No | The ID of the region in which the OSS data resides. |
| oss.endpoint | No | The endpoint that is used to access OSS. |
| oss.access_key | No | The AccessKey ID that is used to access OSS. |
| oss.secret_key | No | The AccessKey secret that is used to access OSS. |
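Similarly, if the data of an HMS-based Iceberg catalog is stored in an OSS bucket, the OSS parameters are appended to the Hive metastore properties. The following is a minimal sketch; the metastore URI, endpoint, region, and AccessKey pair are placeholders.
-- Create an HMS-based Iceberg catalog whose data resides in an OSS bucket.
-- The metastore URI and OSS credentials below are placeholders.
CREATE CATALOG iceberg_hms_oss PROPERTIES (
    'type' = 'iceberg',
    'iceberg.catalog.type' = 'hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'oss.access_key' = 'ak',
    'oss.secret_key' = 'sk',
    'oss.endpoint' = 'oss-cn-beijing-internal.aliyuncs.com',
    'oss.region' = 'oss-cn-beijing'
);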
REST catalog
To use a REST catalog, you must deploy a REST service that implements the Iceberg REST API to provide access to the Iceberg metadata.
CREATE CATALOG iceberg PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type'='rest',
'uri' = 'http://172.21.0.1:8181'
);
If your data is stored in HDFS and the high availability (HA) mode is enabled, you must configure the HDFS HA-related parameters when you create the Iceberg catalog.
CREATE CATALOG iceberg PROPERTIES (
'type'='iceberg',
'iceberg.catalog.type'='rest',
'uri' = 'http://172.21.0.1:8181',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.1:8020',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.2:8020',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
Parameters
| Parameter | Required | Description |
| --- | --- | --- |
| type | Yes | The catalog type. Set the value to iceberg. |
| uri | Yes | The URI of the REST service. |
| iceberg.catalog.type | Yes | The type of the Iceberg catalog. Set the value to rest. |
| dfs.nameservices | No | The names of the nameservices. |
| dfs.ha.namenodes.[nameservice ID] | No | The IDs of the NameNodes. |
| dfs.namenode.rpc-address.[nameservice ID].[name node ID] | No | The RPC addresses of the NameNodes. The number of addresses must be the same as the number of NameNodes. The values must be the same as those in the hdfs-site.xml file. |
| dfs.client.failover.proxy.provider.[nameservice ID] | No | The Java class that the HDFS client uses to connect to the active NameNode. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider. |
Step 3: Query the catalog
You can execute the following statement to check whether the catalog is created.
SHOW CATALOGS; -- Check whether the catalog is created.
Sample output:
+--------------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId    | CatalogName     | Type     | IsCurrent | CreateTime              | LastUpdateTime      | Comment                |
+--------------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | iceberg_catalog | iceberg  |           | 2024-08-06 17:09:08.058 | 2024-07-19 18:04:37 |                        |
| 0            | internal        | internal | yes       | UNRECORDED              | NULL                | Doris internal catalog |
+--------------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+
Step 4: Query data in the Iceberg databases and tables
Query data in the Iceberg databases and tables in the external catalog that you created.
Note: After you connect to the ApsaraDB for SelectDB instance, the internal catalog is used by default.
Switch to the external catalog that you want to manage.
SWITCH iceberg_catalog;
Query data.
After you switch to the external catalog, you can query and access the data in the external catalog in the same way as you query and access data in the internal catalog. For example, you can perform the following operations:
Execute the SHOW DATABASES; statement to query a list of databases.
Execute the USE test_db; statement to switch to the test_db database.
Execute the SHOW TABLES; statement to query a list of tables in a database.
Query the data of a table.
A snapshot is generated in Apache Iceberg every time a write operation is performed on an Iceberg table.
By default, SelectDB reads data only from the snapshot of the latest version in Iceberg. You can use the FOR TIME AS OF and FOR VERSION AS OF clauses to read data from historical snapshots based on the time when the snapshots were generated or the snapshot IDs. Example:
-- Query data from the snapshot of the latest version that is created for the test_t table.
SELECT * FROM test_t;
-- Query data from a snapshot that is generated at the specified point in time for the test_t table.
SELECT * FROM test_t FOR TIME AS OF "2022-10-07 17:20:37";
-- Query data in the test_t table based on the ID of a specific snapshot that is generated for the table.
SELECT * FROM test_t FOR VERSION AS OF 868895038****72;
If you do not know the metadata of an Iceberg table, such as the operation history, generated snapshots, and file metadata, you can use the ICEBERG_META function to query the metadata of the Iceberg table. For more information, see ICEBERG_META.
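For example, the following is a minimal sketch of querying the snapshot history of the test_t table, assuming the iceberg_meta table-valued function syntax of Apache Doris, on which SelectDB is based:
-- Query the snapshots that are generated for the test_t table.
SELECT * FROM iceberg_meta(
    "table" = "iceberg_catalog.test_db.test_t",
    "query_type" = "snapshots"
);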
Alternatively, query data in the Iceberg databases and tables from the internal catalog by using fully qualified table names.
-- Query data from the snapshot of the latest version that is created for the test_t table in the test_db database of the iceberg_catalog catalog.
SELECT * FROM iceberg_catalog.test_db.test_t;
-- Query data from a snapshot that is generated at the specified point in time for the test_t table.
SELECT * FROM iceberg_catalog.test_db.test_t FOR TIME AS OF "2022-10-07 17:20:37";
-- Query data in the test_t table based on the ID of a specific snapshot that is generated for the table.
SELECT * FROM iceberg_catalog.test_db.test_t FOR VERSION AS OF 868895038****72;
Migrate data
After the Iceberg data source is integrated, you can execute INSERT INTO statements to migrate historical data from the Iceberg data source to the SelectDB instance. For more information, see Import data by using INSERT INTO statements.
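The following is a minimal sketch, assuming that a table named test_t with a schema compatible with the Iceberg source table already exists in the test_db database of the internal catalog:
-- Migrate all historical rows from the Iceberg table to the internal table.
-- The destination table internal.test_db.test_t is a placeholder.
INSERT INTO internal.test_db.test_t
SELECT * FROM iceberg_catalog.test_db.test_t;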
Data type mappings
The column mappings between an Iceberg data source and a SelectDB instance are the same as those between a Hive data source and a SelectDB instance. For more information, see Hive data source.