Starting with V2.2, Hologres supports accessing data lake data in OSS using Hive Metastore (HMS). If you use an EMR cluster to build a data lake on OSS, you can configure Hologres to accelerate read and write operations for OSS and OSS-HDFS data.
Prerequisites
-
You have activated OSS. For more information, see Quick Start.
-
You have created an EMR data lake cluster and created test data. For more information, see Create a cluster. The EMR cluster must meet the following conditions:
-
The Hive version is 3.1.3 or later.
-
Kerberos identity authentication is disabled.
-
For Metadata, you have selected Self-managed RDS or Built-in MySQL.
-
-
You have purchased a Hologres instance, enabled data lake acceleration, and created a database. For more information, see Purchase a Hologres instance and Create a database.
NoteTo enable data lake acceleration, go to the Hologres instance list. In the Actions column of the target instance, click Data Lake Acceleration and confirm your choice.
-
A network connection is established.
Submit a network connection request. After your request is received, the Alibaba Cloud Hologres support team will contact you to help you complete the following steps to establish a network connection:
Log on to the VPC console to create a reverse endpoint. For more information, see Access Alibaba Cloud services. For the Endpoint Service parameter, select Other Endpoint Services and enter the name of the endpoint service in the region where the EMR cluster is located. The endpoint service names for each region are as follows.
Region
Endpoint service name
Beijing
com.aliyuncs.privatelink.cn-beijing.epsrv-2zeokrydzjd6kx3cbwmb
Shanghai
com.aliyuncs.privatelink.cn-shanghai.epsrv-uf61fvlfwta7f7dv9n3x
Zhangjiakou
com.aliyuncs.privatelink.cn-zhangjiakou.epsrv-8vbno4k4wwvys0eg2swp
Note-
If an endpoint service name is not available for your region, the Hologres team will create one for you and provide the name after you submit the network connection request.
-
A virtual private cloud (VPC) is an isolated network environment built on Alibaba Cloud. VPCs are logically isolated from each other and from classic networks, and cannot communicate by default. Because the Hologres service was created before VPCs, it is deployed in a classic network. Therefore, you must configure a reverse endpoint to establish a network connection.
-
The current network configuration uses an IP address for the connection. If the IP address of the EMR cluster changes, you must reconfigure the connection.
-
Limitations
-
Read-only secondary Hologres instances do not support the data lake acceleration feature.
-
Operations such as
UPDATE,DELETE, orTRUNCATEare not supported on foreign tables. -
Mapping foreign tables from HMS using Auto Load is not supported.
-
Hive clusters with Kerberos identity authentication enabled are not supported.
Procedure
-
Execute a SQL command to create the extension.
Creating an extension requires superuser permissions. This operation affects the entire database and needs to be performed only once per database.
CREATE EXTENSION IF NOT EXISTS hive_fdw; -
Create a foreign server based on
hive_fdwand configure the endpoint information.CREATE SERVER IF NOT EXISTS <server_name> FOREIGN DATA WRAPPER hive_fdw OPTIONS ( hive_metastore_uris 'thrift://<IP address of Hive Metastore>:<port number>', oss_endpoint 'oss-<nation>-<region>-internal.aliyuncs.com | <bucket>.oss-<nation>-<region>.oss-dls.aliyuncs.com' );Parameter
Required
Description
Example
server_name
Yes
A custom name for the foreign server.
hive_server
hive_metastore_uris
Yes
The URI of the Hive Metastore. The format is
thrift://<IP address of Hive Metastore>:<port number>. The default port number is 9083.NoteLog on to the E-MapReduce console. In the Actions column of the target cluster, click Node Management. On the Node Management tab, find the Internal IP of the master node. This IP address is the IP address of the Hive metastore.
thrift://172.16.0.250:9083
oss_endpoint
Yes
The endpoint of OSS. Choose one based on your business needs:
-
Native OSS storage: For better access performance, use the internal endpoint of OSS.
-
OSS-HDFS storage: Only internal network access is supported.
NoteLog on to the OSS console. Go to the overview page of the bucket. In the Access Ports area, find the OSS endpoint address.
-
OSS
oss-cn-shanghai-internal.aliyuncs.com -
OSS-HDFS
<bucket_name>.cn-beijing.oss-dls.aliyuncs.com
-
-
(Optional) Create a user mapping.
Hologres lets you use the
CREATE USER MAPPINGcommand to map a user to a specific foreign server. For example, the owner of a foreign server can useCREATE USER MAPPINGto allow a Resource Access Management (RAM) user (123xxx) to access external data in OSS. For more information aboutCREATE USER MAPPING, see the PostgreSQL documentation.CREATE USER mapping FOR <account> server <server_name> options ( dlf_access_id '<yourAccessKeyId>', dlf_access_key '<yourAccessKeySecret>', oss_access_id '<yourAccessKeyId>', oss_access_key '<yourAccessKeySecret>' );The following is an example:
-- Create a user mapping for the current user CREATE USER mapping FOR current_user server <server_name> options ( dlf_access_id 'yourAccessKeyId', dlf_access_key 'yourAccessKeySecret', oss_access_id 'yourAccessKeyId', oss_access_key 'yourAccessKeySecret' ); -- Create a user mapping for RAM user 123xxx CREATE USER mapping FOR "p4_123xxx" server <server_name> options ( dlf_access_id 'yourAccessKeyId', dlf_access_key 'yourAccessKeySecret', oss_access_id 'yourAccessKeyId', oss_access_key 'yourAccessKeySecret' ); -- Delete the user mapping Drop USER MAPPING FOR CURRENT_USER server <server_name>; Drop USER MAPPING FOR "p4_123xxx" server <server_name>; -
Create a foreign table.
Hologres supports the following commands to create foreign tables:
-
CREATE FOREIGN TABLE: Creates a single foreign table. This command lets you customize the table by specifying a subset of columns. It is suitable for scenarios where you need to create only a few foreign tables or do not need to map all fields.
-
IMPORT FOREIGN SCHEMA: Creates foreign tables in batches. This command is suitable for scenarios where you need to create multiple foreign tables or perform batch mapping from an external data source.
Note-
Hologres supports reading partitioned tables in OSS. The supported data types for partition keys are TEXT, VARCHAR, and INT. When you use CREATE FOREIGN TABLE, you can define the partition fields as regular fields because this command only maps the schema without storing data. When you use IMPORT FOREIGN SCHEMA, you do not need to define the table fields because the system automatically handles the field mapping.
-
If a table in the external schema has the same name as an existing internal table in Hologres, IMPORT FOREIGN SCHEMA skips the creation of that foreign table. In this case, use CREATE FOREIGN TABLE to define a unique name for the foreign table.
-- Using CREATE FOREIGN TABLE CREATE FOREIGN TABLE <holo_schema_name>.<table_name> ( { column_name data_type } [, ... ] ] ) ) SERVER <hive_server_name> OPTIONS ( schema_name '<ext_db_name>', table_name '<ext_table_name>' ); -- Using IMPORT FOREIGN SCHEMA IMPORT FOREIGN SCHEMA <ext_db_name> [ { limit TO | EXCEPT } ( table_name [, ...] ) ] FROM server <hive_server_name> INTO <holo_schema_name> options( if_table_exist 'update', if_unsupported_type 'error' ); -
-
Query the foreign table.
After the foreign table is created, you can directly query it to read data from OSS.
-
Non-partitioned table
SELECT * FROM <holo_schema>.<hive_table>; -
Partitioned table
SELECT * FROM <holo_schema>.<hive_partition_table> WHERE <partition_key> = '<partition_value>';
-