
Hologres: Build a data lake based on OSS-HDFS

Last Updated: Feb 21, 2024

In Hologres V1.3.26 and later, you can read data from and write data to OSS-HDFS. This topic describes how to build a data lake based on OSS-HDFS.

Background information

OSS-HDFS (JindoFS) is a cloud-native data lake storage service. It is seamlessly integrated with compute engines in the Hadoop ecosystem and delivers better performance than native OSS storage in offline extract, transform, and load (ETL) scenarios based on Hive and Spark. For more information about OSS-HDFS, see What is OSS-HDFS?. In Hologres, you can use Data Lake Formation (DLF) to manage the metadata of data stored in OSS-HDFS, and use JindoSDK to read data from and write data back to OSS-HDFS. The write-back feature supports only tables in the ORC, Parquet, CSV, or SequenceFile format. For more information about how to use DLF to read data from and write data to OSS, see Use DLF to read data from and write data to OSS.

Prerequisites

Procedure

  1. Obtain the endpoint of the bucket for which OSS-HDFS is enabled.

    You must configure the endpoint of the bucket for which OSS-HDFS is enabled in Hologres. You can obtain the endpoint on the Overview page of the bucket in the OSS console.

  2. Create foreign servers and configure related information.

    Important

    If you want to access data that is stored in OSS-HDFS and data that is stored in other environments, such as native OSS, you must create a separate foreign server for each environment. For example, to access both OSS-HDFS data and OSS data, create one foreign server for the foreign tables that map to tables in OSS-HDFS and another foreign server for the foreign tables that map to tables in OSS. For the OSS-HDFS foreign server, set oss_endpoint to the endpoint of the bucket for which OSS-HDFS is enabled, and specify this foreign server when you create foreign tables and import data. For the OSS foreign server, set oss_endpoint to the endpoint that is used to access OSS.

    After you obtain the endpoint of the bucket, you can configure the endpoints of DLF and OSS in Hologres. Syntax:

    CREATE EXTENSION IF NOT EXISTS dlf_fdw;
    
    CREATE SERVER IF NOT EXISTS <servername> FOREIGN DATA WRAPPER dlf_fdw OPTIONS (
        dlf_region 'cn-<region>',
        dlf_endpoint 'dlf-share.cn-<region>.aliyuncs.com',
        oss_endpoint 'oss-cn-<region>.oss-dls.aliyuncs.com' -- The endpoint of the bucket for which OSS-HDFS is enabled.
    );
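
    To make the template concrete, the following is an illustrative sketch for a setup that accesses both OSS-HDFS data and native OSS data in the cn-beijing region. The server names and endpoint values are assumptions for illustration only; replace them with the endpoints obtained for your own buckets.

```sql
-- Hypothetical example: two foreign servers, one per storage environment.
-- dlf_server_hdfs targets the bucket with OSS-HDFS enabled, so its
-- oss_endpoint is the OSS-HDFS (oss-dls) endpoint obtained in Step 1.
CREATE SERVER IF NOT EXISTS dlf_server_hdfs FOREIGN DATA WRAPPER dlf_fdw OPTIONS (
    dlf_region 'cn-beijing',
    dlf_endpoint 'dlf-share.cn-beijing.aliyuncs.com',
    oss_endpoint 'cn-beijing.oss-dls.aliyuncs.com'
);

-- dlf_server_oss targets native OSS, so its oss_endpoint is the
-- internal OSS endpoint of the region.
CREATE SERVER IF NOT EXISTS dlf_server_oss FOREIGN DATA WRAPPER dlf_fdw OPTIONS (
    dlf_region 'cn-beijing',
    dlf_endpoint 'dlf-share.cn-beijing.aliyuncs.com',
    oss_endpoint 'oss-cn-beijing-internal.aliyuncs.com'
);
```

    Foreign tables created against dlf_server_hdfs then read from and write to OSS-HDFS, while foreign tables created against dlf_server_oss use native OSS.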

    The following list describes the parameters.

    • servername: the name of the foreign server. Example: dlf_server.

    • dlf_region: the ID of the region where DLF resides. For more information about the regions in which DLF is supported, see Supported regions and endpoints. Example: cn-beijing.

    • dlf_endpoint: the endpoint of DLF. We recommend that you use the internal endpoint of DLF for better access performance. If you want to access DLF across regions, you must use a public endpoint. Example: dlf-share.cn-beijing.aliyuncs.com.

      Important

      Access by using a public endpoint incurs network fees and performance overhead. For more information about network fees, see Billing overview.

    • oss_endpoint: the endpoint of the bucket for which OSS-HDFS is enabled, obtained in Step 1. OSS-HDFS can be accessed only over an internal endpoint, which means that you cannot access OSS-HDFS across regions. Example: cn-hangzhou.oss-dls.aliyuncs.com.

      Note

      If you use native OSS to create a foreign server, set oss_endpoint to the endpoint of OSS. We recommend that you use the internal endpoint of OSS to improve access performance. If you want to access OSS from Hologres or DLF across regions, you must use a public endpoint. For more information, see OSS domain names.
  3. Create a foreign table and use the foreign table to read data from and write data to OSS-HDFS.

    • Create a single foreign table. The following statements create a foreign table in the public schema of Hologres that maps to the metadata table named dlf_oss_test in the DLF metadatabase named dlfpro.

      -- Method 1:
      CREATE FOREIGN TABLE dlf_oss_test
      (
        id text,
        pt text
      )
      SERVER dlf_server -- The created foreign server.
      options
      (
        schema_name 'dlfpro',
        table_name 'dlf_oss_test'
      );
      
      
      -- Method 2:
      IMPORT FOREIGN SCHEMA dlfpro LIMIT TO
      (
        dlf_oss_test
      )
      FROM SERVER dlf_server INTO public options (if_table_exist 'update');
    • Create multiple foreign tables at a time. Map tables in the DLF metadatabase named dlfpro to the public schema of Hologres to create the corresponding foreign tables. The foreign tables are named after the tables in the dlfpro metadatabase in DLF.

      • Map all metadata tables in the dlfpro metadatabase.

        IMPORT FOREIGN SCHEMA dlfpro
        FROM SERVER dlf_server INTO public options (if_table_exist 'update');
      • Map specific metadata tables in the dlfpro metadatabase.

        IMPORT FOREIGN SCHEMA dlfpro LIMIT TO
        (
          table1,
          table2,
          tablen
        )
        FROM SERVER dlf_server INTO public options (if_table_exist 'update');
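
    Because Hologres is compatible with PostgreSQL, you can check which foreign tables the import created by querying the standard catalog views. The following is a minimal sketch, assuming the information_schema views are available in your instance:

```sql
-- List the foreign tables in the public schema and the foreign server
-- that each table is bound to.
SELECT foreign_table_name, foreign_server_name
FROM information_schema.foreign_tables
WHERE foreign_table_schema = 'public';
```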
  4. Query data.

    After one or more foreign tables are created, you can query them to read data from OSS-HDFS. The foreign tables are visible in the corresponding schema in Hologres and in the table directory in the HoloWeb console. You can also write data back to foreign tables in the CSV, Parquet, or ORC format.

    • Query data from a non-partitioned table.

      SELECT * FROM dlf_oss_test;
    • Query data from a partitioned table.

      SELECT * FROM <partition_table> WHERE dt = '2013';
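
    Writing data back can be sketched as a plain INSERT into the foreign table. The following assumes a hypothetical internal table named local_src whose columns match the dlf_oss_test foreign table created above; the write lands in OSS-HDFS in the format of the mapped table.

```sql
-- Hypothetical write-back: rows inserted into the foreign table are
-- written to the underlying OSS-HDFS storage.
INSERT INTO dlf_oss_test (id, pt)
SELECT id, pt
FROM local_src;
```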

For more information, see Procedures.