This topic describes how to use Data Lake Formation (DLF) to read data from and write data to Object Storage Service (OSS) in Hologres.

Background information

As enterprises advance their digital transformation, the amount of data that they need to store grows exponentially, and traditional data analysis faces great challenges in terms of cost, scale, and data diversity. Hologres works with DLF and OSS to provide the data lake acceleration service DLF_FDW based on data lakehouses. This service helps enterprises implement cost-effective storage of large amounts of data, centralized metadata management, and efficient data analysis and data insights.

Alibaba Cloud DLF is a fully managed service that helps you build data lakes and data lakehouses in the cloud. The service provides centralized metadata management, centralized permission and security management, and convenient data ingestion and exploration capabilities for data lakes in the cloud. For more information about DLF, see Overview.

DLF uses Alibaba Cloud OSS as its source of cloud data lakes. OSS is a secure, cost-effective, and highly reliable service that can store large amounts of data and all types of files. It can provide 99.9999999999% (twelve 9's) of data durability and has become the standard for data lake storage. For more information about OSS, see What is OSS?.

Hologres is seamlessly integrated with DLF and OSS. Hologres allows you to use DLF to read data from and write data to OSS by using foreign tables. Foreign tables map fields rather than store data. Hologres can directly accelerate reads and writes of various types of data stored in OSS buckets. This reduces development and O&M costs, breaks down data silos, and enables business insights.

Features

In Hologres V1.1 and later, files in the ORC, Parquet, CSV, or SequenceFile format can be read from OSS. In Hologres V1.3 and later, files in the HUDI or Delta format can be read from OSS, and files in the ORC, Parquet, CSV, or SequenceFile format can be written to OSS. You can view the version of your Hologres instance on the instance details page in the Hologres console. If the version of your Hologres instance is earlier than V1.1, join DingTalk group 32314975 for technical support or submit a ticket to apply for updating your instance.

Hologres V1.3.26 and later allow you to read data from and write data to OSS-HDFS. OSS-HDFS (JindoFS) is a cloud-native data lake storage service. Compared with native OSS storage, OSS-HDFS is seamlessly integrated with compute engines in the Hadoop ecosystem and provides better performance in offline extract, transform, load (ETL) of big data based on Hive and Spark. OSS-HDFS is fully compatible with Hadoop Distributed File System (HDFS) APIs and supports Portable Operating System Interface (POSIX). You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields. For more information about OSS-HDFS, see Overview of the OSS-HDFS service.

  • The following table describes the file formats that are supported by Hologres.
    File format     Hologres V1.1            Hologres V1.3
    HUDI (COW)      Not supported            Data reads supported
    Delta           Not supported            Data reads supported
    CSV             Data reads supported     Data reads and writes supported
    Parquet         Data reads supported     Data reads and writes supported
    ORC             Data reads supported     Data reads and writes supported
    SequenceFile    Data reads supported     Data reads and writes supported
  • Data type mappings

    For more information about the mappings between DLF and Hologres data types, see Data types.

Prerequisites

  • DLF is activated. For more information, see Quick start. For more information about regions in which DLF is supported, see Supported regions and endpoints.
  • OSS is activated, a bucket is created, and data is uploaded to the bucket. For more information, see Get started with OSS.
  • A bucket policy is configured to authorize you to access the data in the OSS bucket. If you do not have the required permissions, the foreign table that was created cannot be used to query the data in the OSS bucket. For more information about OSS authorization, see Configure and manage bucket policies.
  • If you want to use the OSS-HDFS service, you must enable OSS-HDFS. For more information, see Enable OSS-HDFS and grant access permissions.

Limits

  • You can read data only from an OSS bucket that resides in the same region as your Hologres instance.
  • When you export data to OSS, you cannot execute the INSERT ON CONFLICT, UPDATE, or DELETE statement.
  • You can read data from or write data to partitioned tables in DLF. However, only columns of the TEXT, VARCHAR, and INT types can be used as partition keys.
  • Only Hologres V1.3 and later allow you to write data to OSS objects in the ORC, Parquet, CSV, and SequenceFile formats.
  • Data lake acceleration is not supported for read-only secondary instances.

Usage notes

  • Configure the environment.
    1. Configure the backend to enable DLF_FDW.

      Before you use the data lake acceleration service, you must configure the backend to enable DLF_FDW. After DLF_FDW is enabled, the current system resources (1 CPU core and 4 GB of memory) are used by default. You do not need to purchase additional resources.

      Log on to the Hologres console, go to the Instances or Instance Details page, click Data Lake Acceleration in the Actions column of the instance that you want to manage, and then click Enable in the message that appears. Then, DLF_FDW is automatically configured in the backend, and the Hologres instance is automatically restarted. After the instance is restarted, you can use the data lake acceleration service.
      Note You can configure the backend to enable DLF_FDW in the Hologres console. If the Data Lake Acceleration button is unavailable, submit a ticket or join DingTalk group 32314975 to configure the backend.
    2. Install an extension.
      Before you use DLF to read data from and write data to OSS, you must assume the superuser role and execute the following statement to install an extension in your Hologres database. An extension is installed at the database level. For each database, you need to install an extension only once.
      create extension if not exists dlf_fdw;
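      If you are not sure whether the extension is already installed in the current database, you can check before running the statement again. The following query is a minimal sketch that assumes the standard PostgreSQL pg_extension catalog is accessible to your account:
      -- Check whether the dlf_fdw extension is already installed in the current database.
      select extname, extversion from pg_extension where extname = 'dlf_fdw';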
    3. Create a server.
      After the extension is installed, you must create a server for Hologres to connect to DLF and OSS. You can use one of the following methods to create a server. We recommend that you use the internal endpoints of DLF and OSS for better performance.
      • Use the default DLF data catalog and native OSS storage to create a server. The following sample code provides an example:
        -- View the existing server.
        select * from pg_foreign_server;
        
        -- Delete the existing server.
        drop server <server_name> cascade;
        
        -- Create a server.
        CREATE SERVER IF NOT EXISTS <server_name> FOREIGN DATA WRAPPER dlf_fdw OPTIONS (
            dlf_region 'cn-<region>',
            dlf_endpoint 'dlf-share.cn-<region>.aliyuncs.com',
            oss_endpoint 'oss-cn-<region>-internal.aliyuncs.com'
        );
      • Use OSS-HDFS for data lake storage.
        Create an external server to connect Hologres to DLF and OSS-HDFS. The following sample code provides an example:
        -- Delete the existing server.
        drop server <servername> cascade;
        
        -- Create a server.
        CREATE SERVER IF NOT EXISTS <server_name> FOREIGN DATA WRAPPER dlf_fdw OPTIONS (
            dlf_region 'cn-<region>',
            dlf_endpoint 'dlf-share.cn-<region>.aliyuncs.com',
            oss_endpoint 'oss-cn-<region>.oss-dls.aliyuncs.com' -- The domain name of the bucket for which OSS-HDFS is enabled.
        );
      • Parameters
        Parameter: server_name
        Description: The name of the server.
        Example: oss_server

        Parameter: dlf_region
        Description: The ID of the region where DLF resides. Valid values:
        • China (Beijing): cn-beijing
        • China (Hangzhou): cn-hangzhou
        • China (Shanghai): cn-shanghai
        • China (Shenzhen): cn-shenzhen
        • China (Zhangjiakou): cn-zhangjiakou
        • Singapore: ap-southeast-1
        • Germany (Frankfurt): eu-central-1
        • US (Virginia): us-east-1
        Example: cn-hangzhou

        Parameter: dlf_endpoint
        Description: The internal endpoint of DLF. This endpoint is used by MaxCompute and Hologres to access DLF. For more information, see Supported regions and endpoints.
        Example: dlf-share.cn-shanghai.aliyuncs.com

        Parameter: oss_endpoint
        Description:
        • The internal endpoint of OSS. For more information, see OSS domain names.
        • The domain name of the bucket for which OSS-HDFS is enabled.
        Example:
        • oss-cn-shanghai-internal.aliyuncs.com
        • cn-hangzhou.oss-dls.aliyuncs.com
    4. (Optional) Create a user mapping.
      By default, the AccessKey ID and AccessKey secret of your Alibaba Cloud account are used to access DLF metadata and read table data. For more information about how to obtain the AccessKey ID and AccessKey secret of your Alibaba Cloud account, see the "Create an AccessKey pair" step of the Create an Alibaba Cloud account topic. If you do not want to use your Alibaba Cloud account, you can use the following statement to create a user mapping and specify the account that you want to use to access data. Make sure that the specified account has the required permissions to query data. For more information, see CREATE USER MAPPING.
      create user mapping for <The unique ID (UID) of the account> server hm_dlf_server options
      (
          dlf_access_id 'accessid', 
          dlf_access_key 'accesskey',
          oss_access_id 'accessid', 
          oss_access_key 'accesskey'
      );
      Examples:
      -- Create a user mapping for your Alibaba Cloud account.
      create user mapping for current_user server hm_dlf_server options
      (
          dlf_access_id 'LTAI5txxx', 
          dlf_access_key 'y8LUUyyy',
          oss_access_id 'LTAI5txxx', 
          oss_access_key 'y8LUUyyy'
      );
      
      -- Create a user mapping for the RAM user whose UID is 123xxx.
      create user mapping for "p4_123xxx" server hm_dlf_server options
      (
          dlf_access_id 'LIlY5txxx', 
          dlf_access_key 'KsjkXKyyy',
          oss_access_id 'LIlY5txxx', 
          oss_access_key 'KsjkXKyyy'
      );
      
      -- Delete user mappings.
      drop user mapping for current_user server hm_dlf_server;
      drop user mapping for "p4_123xxx" server hm_dlf_server;
  • Read DLF data that is stored in OSS.
    In this example, the DLF data source is used. Before you start, you must prepare a metadata table in DLF and make sure that data is extracted from OSS to the table. To use DLF to read OSS data by creating a foreign table in Hologres, perform the following steps:
    1. Create a foreign table in the Hologres instance.
      After the server is created, you can execute the IMPORT FOREIGN SCHEMA statement to create one or more foreign tables in Hologres to read the OSS data that is extracted by DLF.
      • Sample code
        IMPORT FOREIGN SCHEMA remote_schema
            [ { limit to | except } ( table_name [, ...] ) ]
            from server server_name
            into local_schema
            [ options ( option 'value' [, ... ] ) ]
      • Parameters
        Parameter Description
        remote_schema The name of the metadatabase that is created in DLF.
        table_name The name of the metadata table that is created in DLF.
        server_name The name of the server that is created in Hologres.
        local_schema The name of the schema in Hologres.
        options The options that you can set when you use the IMPORT FOREIGN SCHEMA statement. For more information, see IMPORT FOREIGN SCHEMA.
      • Examples
        • Create a single foreign table.
          Create a foreign table that maps the metadata table named dlf_oss_test in the DLF metadatabase named dlfpro. The foreign table resides in the public schema of Hologres. The if_table_exist option is set to update so that, if a foreign table with the same name already exists, the existing foreign table is updated.
          IMPORT FOREIGN SCHEMA dlfpro LIMIT TO
          (
            dlf_oss_test
          )
          FROM SERVER dlf_server INTO public options (if_table_exist 'update');
        • Create multiple foreign tables at a time.

          Map all tables in the DLF metadatabase named dlfpro to the public schema of Hologres to create multiple foreign tables in Hologres. These foreign tables are named after the tables in the dlfpro metadatabase in DLF.

          • Import all tables in the DLF metadatabase.
            IMPORT FOREIGN SCHEMA dlfpro
            FROM SERVER dlf_server INTO public options (if_table_exist 'update');
          • Import multiple tables.
            IMPORT FOREIGN SCHEMA dlfpro limit to
            (
              table1,
              table2,
              tablen
            )
            FROM SERVER dlf_server INTO public options (if_table_exist 'update');
    2. Query data.
      After one or more foreign tables are created, you can query data in these tables to read OSS data.
      select * from dlf_oss_test;
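      You can also run standard SQL filters and aggregations directly on the foreign table. The following query is a hedged sketch that assumes the dlf_oss_test table contains the uuid, name, and price columns used in the internal-table example later in this topic; the filter value is for illustration only:
      -- Aggregate OSS data through the foreign table without importing it into Hologres.
      select name, count(*) as cnt, sum(price) as total_price
      from dlf_oss_test
      where price > 10
      group by name;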
  • Import data to an internal table in Hologres.
    You can import data from a foreign table to an internal table in Hologres. This way, you can query the data directly in Hologres, which provides a better query experience.
    1. Create an internal table in Hologres.
      The internal table that you want to create in Hologres must use the same schema as the foreign table. The following sample code provides an example:
      CREATE TABLE IF NOT EXISTS holo_dlf_oss_test (
          uuid int,
          name text,
          price double precision
      );
    2. Synchronize data from a foreign table to an internal table.
      Execute the INSERT INTO ... SELECT ... statement to synchronize data from a foreign table to an internal table. The following sample code provides an example:
      INSERT INTO holo_dlf_oss_test SELECT uuid, name, price FROM dlf_oss_test;
    3. Query data in an internal table.
      Execute the following statement to query data in an internal table:
      select * from holo_dlf_oss_test;
  • Write data back to OSS.
    If data is updated in Hologres and you want to process it by using an external engine such as E-MapReduce (EMR), you can write the data back to OSS. To do so, directly insert data into a foreign table by executing an SQL statement.
    1. Write data back to OSS.
      Execute the following SQL statement to write the data in an internal table in Hologres back to OSS:
      insert into <foreign_table_name> (<col_name>, ...) select <col_name>, ... from <holo_table_name>;
      Parameters
      Parameter Description
      foreign_table_name The name of the foreign table.
      holo_table_name The name of the source internal table in Hologres.
      col_name The name of the column in the foreign table or internal table.
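      For example, based on the foreign table dlf_oss_test and the internal table holo_dlf_oss_test created earlier in this topic, the write-back statement looks like the following sketch:
      -- Write the data of the Hologres internal table back to OSS through the foreign table.
      insert into dlf_oss_test (uuid, name, price)
      select uuid, name, price from holo_dlf_oss_test;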
    2. Query the data that is written to OSS.
      After the data is written to OSS, you can execute the following SQL statement in the Hive or Spark engine of EMR to query the written data:
      SELECT * FROM <foreign_table_name> WHERE <col_name> = value;
      If the expected data is returned, Hologres has written the data back to OSS and EMR can read it.

FAQ

When you create a foreign table to read OSS data that is extracted by DLF, the ERROR: babysitter not ready,req:name:"HiveAccess" error message appears.
  • Possible cause

    The backend is not configured.

  • Solution

    Submit a ticket or join DingTalk group 32314975 to configure the backend.