In Hologres V1.3.26 and later, you can read data from and write data to OSS-HDFS. This topic describes how to build a data lake based on OSS-HDFS.
Background information
OSS-HDFS (JindoFS) is a cloud-native data lake storage service. OSS-HDFS is seamlessly integrated with compute engines in the Hadoop ecosystem and provides better performance than native OSS storage in scenarios in which offline extract, transform, and load (ETL) operations are performed based on Hive and Spark. For more information about OSS-HDFS, see What is OSS-HDFS?. In Hologres, you can use Data Lake Formation (DLF) to manage metadata of data in OSS-HDFS. You can also use JindoSDK to access data in OSS-HDFS and write data back to OSS-HDFS. The write-back feature supports only tables of the ORC, Parquet, CSV, or SequenceFile format. For more information about how to use DLF to read data from and write data to OSS, see Use DLF to read data from and write data to OSS.
Prerequisites
DLF is activated. For more information, see Quick start. For more information about the regions in which DLF is supported, see Supported regions and endpoints.
OSS-HDFS is enabled and data is prepared. For more information, see Enable OSS-HDFS and grant access permissions.
Data lake acceleration is enabled for your Hologres instance. For more information, see Configure the environment.
Procedure
Obtain the endpoint of the bucket for which OSS-HDFS is enabled.
You need to configure the endpoint of the bucket for which OSS-HDFS is enabled in Hologres. You can obtain the endpoint on the Overview page of the bucket in the OSS console.
Create foreign servers and configure related information.
Important: If you want to access data that is stored in OSS-HDFS together with data from other environments, you must configure a separate foreign server for each environment. For example, to access both OSS-HDFS data and native OSS data, create one foreign server for the foreign tables that map to tables in OSS-HDFS and another foreign server for the tables in OSS. For the OSS-HDFS foreign server, set oss_endpoint to the endpoint of the bucket for which OSS-HDFS is enabled, and specify this foreign server when you create the foreign table and import data into it. For the native OSS foreign server, set oss_endpoint to the endpoint that is used to access OSS.
After you obtain the endpoint of the bucket, you can configure the endpoints of DLF and OSS in Hologres. Syntax:
CREATE EXTENSION IF NOT EXISTS dlf_fdw;

CREATE SERVER IF NOT EXISTS <servername> FOREIGN DATA WRAPPER dlf_fdw
OPTIONS (
    dlf_region 'cn-<region>',
    dlf_endpoint 'dlf-share.cn-<region>.aliyuncs.com',
    oss_endpoint 'cn-<region>.oss-dls.aliyuncs.com' -- The endpoint of the bucket for which OSS-HDFS is enabled.
);
The following list describes the parameters.

servername: The name of the foreign server. Example: dlf_server.

dlf_region: The ID of the region where DLF resides. For more information about the regions in which DLF is supported, see Supported regions and endpoints. Example: cn-beijing.

dlf_endpoint: The endpoint of DLF. We recommend that you use the internal endpoint of DLF for better access performance. If you want to access DLF across regions, you must use a public endpoint. Important: Access by using a public endpoint incurs network fees and performance overhead. For more information about network fees, see Billing overview. Example: dlf-share.cn-beijing.aliyuncs.com.

oss_endpoint: The endpoint of the bucket for which OSS-HDFS is enabled, which you obtained in Step 1. OSS-HDFS can be accessed only over an internal endpoint, which means that you cannot access OSS-HDFS across regions. Note: If you create a foreign server for native OSS instead, we recommend that you use the internal endpoint of OSS for better access performance; if you want to access OSS from Hologres or DLF across regions, you must use a public endpoint. For more information, see OSS domain names. Example: cn-hangzhou.oss-dls.aliyuncs.com.
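For example, to access both OSS-HDFS and native OSS from the same instance, you can create two foreign servers that differ only in their oss_endpoint values. The following sketch assumes the cn-hangzhou region; the server names dlf_server_hdfs and dlf_server_oss are illustrative, and your actual endpoints may differ:

```sql
-- Hypothetical example: two foreign servers, one for OSS-HDFS and one for native OSS.
CREATE EXTENSION IF NOT EXISTS dlf_fdw;

-- Foreign server for OSS-HDFS: oss_endpoint is the OSS-HDFS endpoint of the bucket.
CREATE SERVER IF NOT EXISTS dlf_server_hdfs FOREIGN DATA WRAPPER dlf_fdw
OPTIONS (
    dlf_region 'cn-hangzhou',
    dlf_endpoint 'dlf-share.cn-hangzhou.aliyuncs.com',
    oss_endpoint 'cn-hangzhou.oss-dls.aliyuncs.com'
);

-- Foreign server for native OSS: oss_endpoint is the internal OSS endpoint.
CREATE SERVER IF NOT EXISTS dlf_server_oss FOREIGN DATA WRAPPER dlf_fdw
OPTIONS (
    dlf_region 'cn-hangzhou',
    dlf_endpoint 'dlf-share.cn-hangzhou.aliyuncs.com',
    oss_endpoint 'oss-cn-hangzhou-internal.aliyuncs.com'
);
```

Foreign tables that map to OSS-HDFS data then reference dlf_server_hdfs, and foreign tables that map to native OSS data reference dlf_server_oss.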
Create a foreign table and use the foreign table to read data from and write data to OSS-HDFS.
Create a single foreign table. Create a foreign table to map the metadata table named dlf_oss_test in the DLF metadatabase named dlfpro. The foreign table resides in the public schema of Hologres.
-- Method 1: Create the foreign table directly.
CREATE FOREIGN TABLE dlf_oss_test (
    id text,
    pt text
)
SERVER dlf_server -- The foreign server that you created.
OPTIONS (
    schema_name 'dlfpro',
    table_name 'dlf_oss_test'
);

-- Method 2: Import the table definition from DLF.
IMPORT FOREIGN SCHEMA dlfpro LIMIT TO
(
    dlf_oss_test
)
FROM SERVER dlf_server INTO public
OPTIONS (if_table_exist 'update');
Create multiple foreign tables at a time. Map all tables in the DLF metadatabase named dlfpro to the public schema of Hologres to create multiple foreign tables in Hologres. These foreign tables are named after the tables in the dlfpro metadatabase in DLF.
Map all metadata tables in the dlfpro metadatabase.
IMPORT FOREIGN SCHEMA dlfpro
FROM SERVER dlf_server INTO public
OPTIONS (if_table_exist 'update');
Map the specified metadata tables in the dlfpro metadatabase.
IMPORT FOREIGN SCHEMA dlfpro LIMIT TO
(
    table1,
    table2,
    tablen
)
FROM SERVER dlf_server INTO public
OPTIONS (if_table_exist 'update');
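To verify which foreign tables the import created, you can query the standard information_schema view. This is a quick sanity check and assumes the public schema used above:

```sql
-- List the foreign tables in the public schema together with their foreign server.
SELECT foreign_table_name, foreign_server_name
FROM information_schema.foreign_tables
WHERE foreign_table_schema = 'public';
```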
Query data.
After you create one or more foreign tables, you can query them to read data from OSS-HDFS. The foreign tables appear under the corresponding schema in Hologres and in the table schema directory in the HoloWeb console. You can also write data back to foreign tables in the ORC, Parquet, CSV, or SequenceFile format.
Query data from a non-partitioned table.
SELECT * FROM dlf_oss_test;
Query data from a partitioned table.
SELECT * FROM <partition_table> WHERE dt = '2013';
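Write data back to OSS-HDFS. The following statement is a sketch that assumes the dlf_oss_test foreign table defined earlier (columns id and pt); inserting into a foreign table writes the rows to the underlying OSS-HDFS storage in the table's storage format:

```sql
-- Write two rows back to OSS-HDFS through the foreign table.
INSERT INTO dlf_oss_test (id, pt)
VALUES
    ('1001', '2013'),
    ('1002', '2014');
```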
For more information, see Procedures.