All Products
Search
Document Center

Hologres:Use DLF to read data from and write data to OSS

Last Updated:Apr 15, 2024

Hologres works with Data Lake Formation (DLF) and Object Storage Service (OSS) to provide the data lake acceleration service. The service supports flexible data access and analysis and delivers efficient data processing capabilities. This significantly accelerates queries and analysis of data in OSS data lakes. This topic describes how to use DLF to read data from and write data to OSS in Hologres.

Prerequisites

  • DLF is activated. For more information, see Quick start. For more information about regions in which DLF is supported, see Supported regions and endpoints.

  • OSS is activated, a bucket is created, and data is uploaded to the bucket. For more information, see Getting started with OSS.

  • A bucket policy is configured to authorize you to access the data in the OSS bucket by using foreign tables. If you do not have the required permissions, the foreign table that you create cannot be used to query the data in the OSS bucket. For more information about how to configure a bucket policy, see Configure and manage bucket policies.

  • Optional. OSS-HDFS is enabled. If you want to use OSS-HDFS, this prerequisite must be met. For more information, see Enable OSS-HDFS and grant access permissions.

Usage notes

  • You can execute the INSERT INTO statement to export Hologres data to OSS. Other statements such as INSERT ON CONFLICT, UPDATE, and DELETE cannot be used to export Hologres data to OSS.

  • Only Hologres V1.3 and later allow you to write data to OSS, and only data in the ORC, Parquet, CSV, and SequenceFile formats can be written to OSS.

  • Data lake acceleration is not supported for read-only secondary instances.

  • You can execute the IMPORT FOREIGN SCHEMA statement to import data of partitioned tables in OSS to Hologres. You can query data from up to 512 partitions in Hologres at a time. You must add filter conditions to limit the number of partitions to 512 per query.

  • DLF enables Hologres to load data in specific partitions of source tables to the memory and cache of Hologres by using foreign tables. To ensure query performance, you need to filter data in partitions to limit the maximum size of data to 200 GB per query.

  • The UPDATE, DELETE, and TRUNCATE statements cannot be executed on foreign tables.

Procedures

Configure the environment

  1. Enable DLF_FDW for a Hologres instance at the backend.

    Log on to the Hologres console, go to the Instances page, find the instance for which you want to enable DLF_FDW, and then click Data Lake Acceleration in the Actions column. In the message that appears, click Enable. You can also click the instance for which you want to enable DLF_FDW. On the Instance Details page, turn on Data Lake Acceleration in the Basic Information section. In this case, DLF_FDW is automatically enabled at the backend, and the Hologres instance is automatically restarted. After the instance is restarted, you can use the data lake acceleration service.

    Note

    If the Data Lake Acceleration button is unavailable, manually upgrade your Hologres instance in the Hologres console or join a Hologres DingTalk group to apply for an instance upgrade. For more information about how to manually upgrade a Hologres instance, see Instance upgrades. For more information about how to join a Hologres DingTalk group, see Obtain online support for Hologres.

    After DLF_FDW is enabled, the current system resources (one CPU core and 4 GB of memory) are used by default. You do not need to purchase additional resources.

  2. Install an extension.

    Before you use DLF to read data from OSS, you must use the account assigned with the superuser role and execute the following statement to install an extension in your Hologres database. An extension is installed at the database level. For each database, you need to install an extension only once.

    create extension if not exists dlf_fdw;
  3. Create a foreign server.

    Hologres supports the DLF multi-catalog feature. If you have only one E-MapReduce (EMR) cluster, use the default DLF catalog to create a foreign server. If you have multiple EMR clusters, use a custom catalog to specify the EMR cluster that you want your Hologres instance to connect to. You can also use native OSS or OSS-HDFS as the data source. You can use one of the following methods to create a foreign server:

    • Use the default DLF catalog and native OSS to create a foreign server. Sample statements:

      -- View existing servers. (meta_warehouse_server and odps_server are built-in servers and cannot be modified or dropped.)
      select * from pg_foreign_server;
      
      -- Drop an existing server.
      drop server <server_name> cascade;
      
      -- Create a foreign server.
      CREATE SERVER IF NOT EXISTS <server_name> FOREIGN data wrapper dlf_fdw options (
          dlf_region 'cn-<region>',
          dlf_endpoint 'dlf-share.cn-<region>.aliyuncs.com',
          oss_endpoint 'oss-cn-<region>-internal.aliyuncs.com'
      );
    • Use OSS-HDFS for data lake storage.

      • Obtain the endpoint of the OSS-HDFS bucket.

        If you want to use DLF_FDW to access data stored in OSS-HDFS, you must configure the endpoint of the OSS-HDFS bucket. If OSS-HDFS is enabled for your bucket, you can obtain the endpoint of the OSS-HDFS bucket from the Overview page of the bucket in the OSS console.

      • Create a foreign server and configure the endpoint.

        After you obtain the endpoint of the OSS-HDFS bucket, you can configure DLF_FDW OSS_Endpoint. Sample statements:

        CREATE EXTENSION IF NOT EXISTS dlf_fdw;
        
        CREATE SERVER IF NOT EXISTS <server_name> FOREIGN data wrapper dlf_fdw options (
            dlf_region 'cn-<region>',
            dlf_endpoint 'dlf-share.cn-<region>.aliyuncs.com',
            oss_endpoint 'oss-cn-<region>.oss-dls.aliyuncs.com' -- The endpoint of the OSS-HDFS bucket.
        );
      • The following table describes the parameters in the preceding syntax.

        Parameter

        Description

        Example

        server_name

        The name of the server.

        dlf_server

        dlf_region

        The ID of the region where DLF resides. Valid values:

        • cn-beijing

        • cn-hangzhou

        • cn-shanghai

        • cn-shenzhen

        • cn-zhangjiakou

        • ap-southeast-1

        • eu-central-1

        • us-east-1

        • ap-southeast-5

        cn-hangzhou

        dlf_endpoint

        The DLF endpoint. We recommend that you use the internal endpoint of DLF for better access performance.

        • China (Beijing): dlf-share.cn-beijing.aliyuncs.com

        • China (Hangzhou): dlf-share.cn-hangzhou.aliyuncs.com

        • China (Shanghai): dlf-share.cn-shanghai.aliyuncs.com

        • China (Shenzhen): dlf-share.cn-shenzhen.aliyuncs.com

        • China (Zhangjiakou): dlf-share.cn-zhangjiakou.aliyuncs.com

        • Singapore: dlf-share.ap-southeast-1.aliyuncs.com

        • Germany (Frankfurt): dlf-share.eu-central-1.aliyuncs.com

        • US (Virginia): dlf-share.us-east-1.aliyuncs.com

        • Indonesia (Jakarta): dlf-share.ap-southeast-5.aliyuncs.com

        dlf-share.cn-shanghai.aliyuncs.com

        oss_endpoint

        • The OSS endpoint. If you use native OSS for storage, we recommend that you use the internal endpoint of OSS for better access performance.

        • If you use OSS-HDFS for storage, only an internal endpoint can be used for access.

        • OSS

          oss-cn-shanghai-internal.aliyuncs.com
        • OSS-HDFS

          cn-hangzhou.oss-dls.aliyuncs.com
  4. Optional. Create a user mapping.

    Hologres allows you to execute the CREATE USER MAPPING statement to enable a user to access DLF and OSS. For example, the owner of a foreign server can execute the CREATE USER MAPPING statement to enable the RAM user whose UID is 123xxx to access data in OSS.

    You must make sure that the specified account has the required permissions to query data. For more information, see CREATE USER MAPPING.

    create user mapping for <Unique ID (UID) of the account> server <server_name> options
    (
        dlf_access_id 'accessid', 
        dlf_access_key 'accesskey',
        oss_access_id 'accessid', 
        oss_access_key 'accesskey'
    );

    Examples:

    -- Create a user mapping for your Alibaba Cloud account.
    create user mapping for current_user server <server_name> options
    (
        dlf_access_id 'LTAI5txxx', 
        dlf_access_key 'y8LUUyyy',
        oss_access_id 'LTAI5txxx', 
        oss_access_key 'y8LUUyyy'
    );
    
    -- Create a user mapping for the Resource Access Management (RAM) user whose UID is 123xxx.
    create user mapping for "p4_123xxx" server <server_name> options
    (
        dlf_access_id 'LIlY5txxx', 
        dlf_access_key 'KsjkXKyyy',
        oss_access_id 'LIlY5txxx', 
        oss_access_key 'KsjkXKyyy'
    );
    
    -- Drop the user mappings.
    Drop USER MAPPING for CURRENT_USER server <server_name>;
    Drop USER MAPPING for "p4_123xxx" server <server_name>;

Read data from OSS

In this example, the DLF data source is used. Before you start, you must prepare a metadata table in DLF and make sure that data is extracted from OSS to the table. This section describes how to use DLF to read OSS data by using foreign tables in Hologres.

  1. Create one or more foreign tables in the Hologres instance.

    After the server is created, you can use the CREATE FOREIGN TABLE statement to create a foreign table in Hologres or use the IMPORT FOREIGN SCHEMA statement to create one or more foreign tables in Hologres to read the OSS data that is extracted by DLF.

    Note

    If the name of an OSS foreign table that you want to create is the same as that of a Hologres internal table, the IMPORT FOREIGN SCHEMA statement cannot be used to create the foreign table. In this case, you can use the CREATE FOREIGN TABLE statement to create a foreign table with a name that is unique in Hologres.

    Hologres allows you to read data from partitioned tables in OSS. However, only columns of the TEXT, VARCHAR, and INT data types can be used as partition keys. Foreign tables created by using the CREATE FOREIGN TABLE statement are used to map fields rather than store data, and partition fields in the foreign tables can be specified as common fields. Foreign tables created by using the IMPORT FOREIGN SCHEMA statement automatically process table field mappings, and you do not need to configure table fields.

    • Syntax

      -- Method 1
      CREATE FOREIGN TABLE [ IF NOT EXISTS ] oss_table_name ( [
      { column_name data_type }
        [, ... ]
        ] )
      SERVER <server_name>
      options
      (
        schema_name '<dlf_database_name>',
        table_name '<dlf_table_name>'
      );
      
      -- Method 2
      IMPORT FOREIGN SCHEMA schema_name
          [ { limit to | except } ( table_name [, ...] ) ]
          from server <server_name>
          into local_schema
          [ options ( option 'value' [, ... ] ) ]
    • Parameters

      Parameter

      Description

      remote_schema

      The name of the metadatabase that is created in DLF.

      table_name

      The name of the metadata table that is created in DLF.

      server_name

      The name of the server that is created in Hologres.

      local_schema

      The name of the schema in Hologres.

      options

      The options that you can configure when you use the IMPORT FOREIGN SCHEMA statement. For more information, see IMPORT FOREIGN SCHEMA.

    • Examples

      • Create a single foreign table.

        Create a foreign table to map the metadata table named dlf_oss_test in the DLF metadatabase named dlfpro. The foreign table resides in the public schema of Hologres. Configure the if_table_exist option to check whether a foreign table with the same name exists. If such a foreign table exists, update the existing foreign table.

        -- Method 1
        CREATE FOREIGN TABLE dlf_oss_test_ext
        (
          id text,
          pt text
        )
        SERVER <server_name>
        options
        (
          schema_name 'dlfpro',
          table_name 'dlf_oss_test'
        );
        
        -- Method 2
        IMPORT FOREIGN SCHEMA dlfpro LIMIT TO
        (
          dlf_oss_test
        )
        FROM SERVER <server_name> INTO public options (if_table_exist 'update');
      • Create multiple foreign tables at a time.

        Map all tables in the DLF metadatabase named dlfpro to the public schema of Hologres to create multiple foreign tables in Hologres. These foreign tables are named after the tables in the dlfpro metadatabase in DLF.

        • Import data of all tables in the dlfpro metadatabase to the foreign tables.

          IMPORT FOREIGN SCHEMA dlfpro
          FROM SERVER <server_name> INTO public options (if_table_exist 'update');
        • Import data of multiple tables in the dlfpro metadatabase to the foreign tables.

          IMPORT FOREIGN SCHEMA dlfpro
          (
            table1,
            table2,
            tablen
          )
          FROM SERVER <server_name> INTO public options (if_table_exist 'update');
  2. Query data.

    After one or more foreign tables are created, you can query data in these foreign tables to read OSS data.

    • Non-partitioned table

      SELECT * FROM dlf_oss_test;
    • Partitioned table

      SELECT * FROM partition_table where dt = '2013';

Import data to a Hologres internal table

You can import data from a foreign table to a Hologres internal table. This way, you can directly query the data with better user experience.

  1. Create an internal table in Hologres.

    The internal table that you want to create in Hologres must use the same schema as the foreign table. Sample statement:

    CREATE TABLE IF NOT EXISTS holo_dlf_oss_test(
     uuid int,
     name string,
     price double);
  2. Synchronize data from a foreign table to the internal table.

    Execute the INSERT INTO ... SELECT ... statement to synchronize data from the foreign table to the internal table. Sample statement:

    INSERT INTO holo_dlf_oss_test SELECT uuid,name,price FROM dlf_oss_test;
  3. Query data in the internal table.

    Execute the following statement to query data in the internal table:

    select * from holo_dlf_oss_test;

Export data to OSS

If data is updated, you need to write the data back to OSS and then process the data by using an external engine such as EMR. You can execute an SQL statement to directly insert data into OSS by using a foreign table.

  1. Write data back to OSS.

    Execute the following SQL statement to write the data in a Hologres internal table back to OSS:

    insert into <foreign_table_name>(<col_name>,......) select <col_name>,...... from <holo_table_name>;

    The following table describes the parameters in the preceding statement.

    Parameter

    Description

    foreign_table_name

    The name of the foreign table.

    holo_table_name

    The name of the Hologres internal table.

    col_name

    The name of the column.

  2. Query the data that is written back to OSS.

    After the data is written back to OSS, you can execute the following SQL statement in the Hive or Spark engine of EMR to query the written data:

    SELECT * FROM <foreign_table_name> WHERE <col_name> = value;

    If a success message is returned, Hologres has written the data back to OSS and EMR can read the data.

FAQ

When I create a foreign table to read OSS data that is extracted by DLF, the ERROR: babysitter not ready,req:name:"HiveAccess" error message is displayed. What do I do?

  • Cause

    DLF_FDW is not enabled at the backend.

  • Solution

    On the Instances page in the Hologres console, click Data Lake Acceleration in the Actions column of the desired instance to enable DLF_FDW at the backend.

References

Accelerate queries on data in data lakes