ApsaraDB for SelectDB supports federated queries against external data sources through catalogs. This topic explains how to connect a Hive cluster to SelectDB using a Hive catalog so you can query Hive data directly without moving it.
Use cases
| Scenario | Description |
|---|---|
| Query acceleration | Run SelectDB's distributed query engine directly against Hive data without ETL. |
| Data integration | Read Hive data and load it into SelectDB internal tables using INSERT INTO statements. |
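For the data integration scenario, a single INSERT INTO ... SELECT statement is enough to copy rows from Hive into SelectDB. The sketch below is hypothetical: it assumes a Hive catalog named hive_catalog (created later in this topic) and an existing internal table target_db.target_t with a compatible schema.

```sql
-- Hypothetical sketch: copy rows from a Hive table exposed through
-- the catalog hive_catalog into an existing SelectDB internal table.
INSERT INTO internal.target_db.target_t
SELECT id, name, age
FROM hive_catalog.test_db.test_t;
```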
Supported configurations
| Dimension | Supported |
|---|---|
| Hive versions | Hive 1, Hive 2, Hive 3 |
| Table types | Managed tables, external tables, some Hive views |
| Metadata formats | Hive, Iceberg, and Hudi metadata stored in a Hive metastore |
| Access mode | Read-only. SelectDB cannot write to external catalogs. |
| Storage backends | HDFS (non-HA and HA), Object Storage Service (OSS) |
Prerequisites
Before you begin, make sure that:
All nodes in your Hive cluster are in the same virtual private cloud (VPC) as the SelectDB instance, or you have established cross-VPC connectivity. For help with connection failures, see What do I do if a connection fails to be established between an ApsaraDB for SelectDB instance and a data source?
The IP addresses of all Hive cluster nodes are added to the SelectDB instance's IP address whitelist. See Configure an IP address whitelist.
If the Hive cluster enforces its own whitelist, the SelectDB instance's VPC IP addresses are also added to the Hive cluster's whitelist. To find the SelectDB instance's VPC IP address, see How do I view the IP addresses in the VPC to which my ApsaraDB SelectDB instance belongs?
If the Hive cluster uses HDFS as storage, the following ports are open so that data can transfer between the Hive cluster and SelectDB:
| Parameter | Default port |
|---|---|
| hive.metastore.uris | 9083 |
| dfs.namenode.rpc-address | 8020 |
| dfs.datanode.address | 9866 |
You are familiar with catalog concepts and operations. See Data lakehouse.
Sample environment
The walkthrough in this topic uses the following setup. Substitute your own values wherever these parameters appear.
| Parameter | Value |
|---|---|
| High availability (HA) mode | Non-HA |
| Hive cluster storage | HDFS |
| Source database | test_db |
| Source table | test_t |
Prepare the source data
Log on to the Hive cluster you want to query.
Create a database named test_db.
CREATE DATABASE IF NOT EXISTS test_db;
Create a table named test_t.
CREATE TABLE IF NOT EXISTS test_t (
    id INT,
    name STRING,
    age INT
);
Insert sample data.
INSERT INTO TABLE test_t VALUES
    (1, 'Alice', 25),
    (2, 'Bob', 30),
    (3, 'Charlie', 35),
    (4, 'David', 40),
    (5, 'Eve', 45);
Connect a Hive cluster to SelectDB
Step 1: Connect to an ApsaraDB for SelectDB instance
Connect to your SelectDB instance using a MySQL client. See Connect to an ApsaraDB for SelectDB instance by using a MySQL client.
Step 2: Create a Hive catalog
Create an external catalog that maps to your Hive cluster. SelectDB automatically discovers the databases and tables in the catalog and maps Hive column types to SelectDB types. For the full type mapping table, see Column data type mappings.
Select the configuration that matches your storage backend.
HDFS-based Hive (non-HA)
Use this configuration when your HDFS has a single active NameNode.
CREATE CATALOG hive_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://<metastore-host>:9083'
);
Example:
CREATE CATALOG hive_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://master-1-1.c-7fa25a1a****.cn-hangzhou.emr.aliyuncs.com:9083'
);
| Parameter | Required | Description |
|---|---|---|
| type | Yes | Catalog type. Set to hms. |
| hive.metastore.uris | Yes | Hive metastore URI in the format thrift://<host>:<port>. Default port: 9083. Run SET hive.metastore.uris; on a Hive client to get the URI. |
HDFS-based Hive (HA)
Use this configuration when your HDFS uses high availability with multiple NameNodes.
CREATE CATALOG <catalog_name> PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://<metastore-host>:9083',
'hadoop.username' = '<hadoop-username>',
'dfs.nameservices' = '<nameservice-id>',
'dfs.ha.namenodes.<nameservice-id>' = '<nn1>,<nn2>',
'dfs.namenode.rpc-address.<nameservice-id>.<nn1>' = '<host1>:<port1>',
'dfs.namenode.rpc-address.<nameservice-id>.<nn2>' = '<host2>:<port2>',
'dfs.client.failover.proxy.provider.<nameservice-id>' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
| Parameter | Required | Description |
|---|---|---|
| type | Yes | Catalog type. Set to hms. |
| hive.metastore.uris | Yes | Hive metastore URI. Format: thrift://<host>:<port>. Default port: 9083. |
| hadoop.username | No | Username used to access HDFS. |
| dfs.nameservices | No | NameService ID. Must match the value in hdfs-site.xml. |
| dfs.ha.namenodes.[nameservice ID] | No | NameNode IDs. Must match the value in hdfs-site.xml. |
| dfs.namenode.rpc-address.[nameservice ID].[NameNode ID] | No | Remote procedure call (RPC) address of each NameNode. Must match the values in hdfs-site.xml. |
| dfs.client.failover.proxy.provider.[nameservice ID] | No | Java class used to fail over to the active NameNode. Use org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider in most cases. |
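As a concrete illustration, the following is a hedged example of an HA catalog definition. All hostnames, the username, and the nameservice ID hacluster are hypothetical placeholders; each dfs.* value must match your cluster's hdfs-site.xml.

```sql
-- Hypothetical HA example: hostnames and the nameservice ID 'hacluster'
-- are placeholders; copy the real values from hdfs-site.xml.
CREATE CATALOG hive_catalog_ha PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://master-1-1.example.internal:9083',
    'hadoop.username' = 'hadoop',
    'dfs.nameservices' = 'hacluster',
    'dfs.ha.namenodes.hacluster' = 'nn1,nn2',
    'dfs.namenode.rpc-address.hacluster.nn1' = 'master-1-1.example.internal:8020',
    'dfs.namenode.rpc-address.hacluster.nn2' = 'master-1-2.example.internal:8020',
    'dfs.client.failover.proxy.provider.hacluster' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```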
OSS-based Hive
Use this configuration when Hive data is stored in Object Storage Service (OSS). The syntax is identical to the HDFS variant; only the storage parameters differ.
CREATE CATALOG hive_catalog PROPERTIES (
"type" = "hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
"oss.endpoint" = "oss-cn-beijing.aliyuncs.com",
"oss.access_key" = "<your-access-key-id>",
"oss.secret_key" = "<your-access-key-secret>"
);
| Parameter | Required | Description |
|---|---|---|
| type | Yes | Catalog type. Set to hms. |
| hive.metastore.uris | Yes | Hive metastore URI. Format: thrift://<host>:<port>. |
| oss.endpoint | Yes | OSS endpoint for the region where your data is stored. See Regions and endpoints. |
| oss.access_key | Yes | AccessKey ID used to access OSS. |
| oss.secret_key | Yes | AccessKey secret used to access OSS. |
Step 3: Verify the catalog
Confirm the catalog was created successfully.
SHOW CATALOGS;
Expected output:
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId | CatalogName | Type | IsCurrent | CreateTime | LastUpdateTime | Comment |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | hive_catalog | hms | | 2024-07-19 17:09:08.058 | 2024-07-19 18:04:37 | |
| 0 | internal | internal | yes | UNRECORDED | NULL | Doris internal catalog |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
Step 4: Query Hive data
After connecting, SelectDB defaults to the internal catalog. Switch to the Hive catalog to access Hive data.
Both query styles below are equivalent; use whichever fits your workflow.
-- Style 1: Switch catalog, then use database and table
SWITCH hive_catalog;
USE test_db;
SELECT * FROM test_t;
-- Style 2: Use the fully qualified table name (no switch required)
SELECT * FROM hive_catalog.test_db.test_t;
To explore the catalog before running queries:
SWITCH hive_catalog;
SHOW DATABASES; -- list all databases in the Hive catalog
USE test_db;
SHOW TABLES;    -- list all tables in test_db
After switching to the external catalog, you navigate databases and tables the same way as in the internal catalog.
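Metadata for an external catalog is cached, so tables or partitions added on the Hive side after the catalog is created may not appear immediately. In Apache Doris, on which SelectDB is built, the cache can be refreshed manually; the statements below are a sketch, so verify them against your SelectDB version.

```sql
-- Refresh cached metadata (Apache Doris syntax; assumed to apply to SelectDB).
REFRESH CATALOG hive_catalog;           -- refresh the whole catalog
REFRESH DATABASE hive_catalog.test_db;  -- or refresh a single database
```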
Migrate data from Hive to SelectDB
After the Hive catalog is connected, use INSERT INTO statements to migrate historical Hive data into SelectDB internal tables. See Import data by using INSERT INTO statements.
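The migration described above can be sketched end to end. The target table layout below is an assumption for illustration: the key columns, bucket count, and replication_num must be chosen for your own schema and cluster, and the internal database test_db is assumed to already exist.

```sql
-- Hedged sketch: create an internal table, then copy the Hive rows into it.
-- Keys, buckets, and replication_num are illustrative assumptions.
CREATE TABLE internal.test_db.test_t_local (
    id INT,
    name VARCHAR(64),
    age INT
)
DUPLICATE KEY (id)
DISTRIBUTED BY HASH (id) BUCKETS 10
PROPERTIES ('replication_num' = '1');

INSERT INTO internal.test_db.test_t_local
SELECT id, name, age
FROM hive_catalog.test_db.test_t;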
Column data type mappings
SelectDB automatically maps Hive column types to SelectDB types when you create a catalog. The mappings below apply to Hive, Iceberg, and Hudi metadata stored in a Hive metastore.
Complex types can be nested. For example:
array<map<string, int>>
map<string, array<int>>
struct<col1: array<int>, col2: map<int, date>>
| Hive metastore type | SelectDB type | Note |
|---|---|---|
| boolean | boolean | |
| tinyint | tinyint | |
| smallint | smallint | |
| int | int | |
| bigint | bigint | |
| float | float | |
| double | double | |
| char | char | |
| varchar | varchar | |
| decimal | decimal | |
| date | date | |
| timestamp | datetime | |
| array<type> | array<type> | Supports nesting, for example array<map<string, int>>. |
| map<KeyType, ValueType> | map<KeyType, ValueType> | Supports nesting, for example map<string, array<int>>. |
| struct<col1: Type1, col2: Type2, ...> | struct<col1: Type1, col2: Type2, ...> | Supports nesting, for example struct<col1: array<int>, col2: map<int, date>>. |
| Other | Unsupported | |
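To check how the columns of a specific Hive table were mapped, you can describe the table through the catalog. This sketch reuses the sample table from this topic; substitute your own catalog, database, and table names.

```sql
-- Inspect the SelectDB types assigned to a Hive table's columns.
SWITCH hive_catalog;
DESC test_db.test_t;
```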
What's next
Data lakehouse: learn more about external catalogs and federated query concepts.
Import data by using INSERT INTO statements: move Hive data into SelectDB internal tables.