All Products
Search
Document Center

Use OSS foreign tables to access OSS data

Last Updated: Jan 21, 2022

This topic describes how to use Object Storage Service (OSS) foreign tables to access OSS data. OSS foreign tables are developed based on the PostgreSQL Foreign Data Wrapper (FDW) framework to access OSS data for data analysis.

Overview

You can use OSS foreign tables to perform the following operations:

  • Import OSS data to the internal row-oriented tables or column-oriented tables of the AnalyticDB for PostgreSQL instance for accelerated data analysis.

  • Query and analyze large amounts of OSS data.

  • Join OSS foreign tables to internal tables for data analysis.

OSS foreign tables allow you to access data objects in the ORC, Parquet, JSON, JSON Lines, and CSV formats. Access to GZIP- or standard Snappy-compressed CSV objects is also supported. You can partition an OSS foreign table based on one or more columns to filter out undesired partitions when you query a specific partition.

OSS data sources include business applications, log archives of Alibaba Cloud Log Service, and extract, transform, load (ETL) operations of Data Lake Analytics (DLA).

Differences between OSS foreign tables and OSS external tables

  • AnalyticDB for PostgreSQL allows you to use OSS external tables to import and export data. However, OSS external tables do not meet the analysis requirements of large amounts of OSS data.

  • OSS foreign tables are developed based on the PostgreSQL FDW framework and support ORC and CSV objects. Access to GZIP-compressed CSV objects is also supported. OSS foreign tables can be partitioned based on one or more fields. You can collect the statistics for OSS foreign tables so that the optimizer can generate an optimal execution plan.

  • Foreign tables are superior to external tables in terms of performance, features, and stability. Therefore, the Greenplum community plans to replace external tables with foreign tables.

Use an OSS foreign table

To use an OSS foreign table, you must define OSS FOREIGN TABLE on OSS FOREIGN SERVER by using the USER MAPPING user. You can use CREATE USER MAPPING to specify the user, CREATE SERVER to specify the OSS server, and CREATE FOREIGN TABLE to specify the OSS foreign table.

For example, you can use the ossutil command line tool to view the following information about a TPC-H lineitem table in an OSS bucket. The following command provides an example on how to view the information by using ossutil:

ossutil ls oss://adbpg-tpch/data/tpch_data_10x/lineitem.tbl

In the command:

  • adbpg-tpch: the name of the OSS bucket.

  • data/tpch_data_10x/...: the path of the object in the bucket.

A similar list is returned:

LastModifiedTime                   Size(B)  StorageClass   ETAG                                  ObjectName
2020-03-12 09:29:48 +0800 CST    144144997      Standard   1F426F2FFC70A0262D2D69183BC3A0BD-57   oss://adbpg-tpch/data/tpch_data_10x/lineitem.tbl.1
2020-03-12 09:29:58 +0800 CST    145177420      Standard   CFE2CFF1C8059547DC9F1711E77F74DD-57   oss://adbpg-tpch/data/tpch_data_10x/lineitem.tbl.10
2020-03-10 21:23:24 +0800 CST    145355168      Standard   35C6227D1C29F1236A92A4D5D7922625-57   oss://adbpg-tpch/data/tpch_data_10x/lineitem.tbl.11
... ...

The following section describes how to create and use an OSS foreign table in detail.

Create an OSS server

To create an OSS server, define an OSS server you want to access. You must specify the endpoint parameter.

Syntax:

CREATE SERVER oss_serv              -- The name of the OSS server.
    FOREIGN DATA WRAPPER oss_fdw
    OPTIONS (
        endpoint '<oss endpoint>',  -- The endpoint of the OSS server.
        bucket '<oss bucket>'       -- The bucket that contains the data object.
  );

The following table describes the values of OPTIONS. For more information, see CREATE SERVER.

Option

Type

Required

Description

oss endpoint

String

Yes

The endpoint of the OSS server.

Note

If you want to use an Alibaba Cloud server to access your AnalyticDB for PostgreSQL instance, use an internal endpoint to prevent incurring Internet traffic. An internal endpoint contains the keyword internal.

oss bucket

String

Optional

The bucket that contains the data object. You must create a bucket in OSS in advance.

Note

You must specify a bucket for either an OSS server or an OSS table. If you specify a bucket for both, the bucket value for the OSS table overwrites that for the OSS server.

Note
  • The following fault tolerance parameters can be specified when you access OSS. You can retain the default values.

  • If the following parameters use default values, a timeout is triggered if the transmission rate is less than 1 KB/s for consecutive 1,500 seconds. For more information, see Error handling.

speed_limit

Numeric value

Optional

Specifies the minimum transmission rate.

speed_time

Numeric value

Optional

Specifies the maximum duration for maintaining the minimum transmission rate.

connect_timeout

Numeric value

Optional

Specifies the connection timeout period.

dns_cache_timeout

Numeric value

Optional

Specifies the timeout period for DNS resolution.

Create a user mapping to the OSS server

After you create an OSS server, you must create a user that accesses the OSS server. To define the mapping from an AnalyticDB for PostgreSQL user to the OSS server user, you can create an OSS user mapping.

Syntax

  • Create a user mapping.

    CREATE USER MAPPING FOR { username | USER | CURRENT_USER | PUBLIC }
        SERVER servername
        [ OPTIONS ( option 'value' [, ... ] ) ]
  • Delete a user mapping.

    DROP USER MAPPING [ IF EXISTS ] FOR { user_name | USER | CURRENT_USER | PUBLIC }
        SERVER server_name

Parameter options

The following table describes the parameter options. For more information, see CREATE USER MAPPING.

Option

Required

Description

id

Yes

Specifies the account ID used to access OSS.

key

Yes

Specifies the AccessKey secret of the account used to access OSS.

username

Optional

Specifies the name of an existing user that is mapped to the foreign server.

CURRENT_USER or USER

Optional

Specifies that the mapping is created or deleted for the current user.

PUBLIC

Optional

Specifies all roles, including roles to create.

Examples

CREATE USER MAPPING FOR PUBLIC  -- Create a user mapping to the OSS server for all users.
    SERVER oss_serv                 -- Specify the OSS server that you want to access.
    OPTIONS ( 
      id '<oss access id>',         -- Specify the AccessKey ID of the account used to access OSS.
      key '<oss access key>'        -- Specify the AccessKey secret of the account used to access OSS.
    );

  • Create an OSS foreign table

    After you create an OSS server and a user used to access the OSS server, you can define an OSS foreign table. OSS foreign tables allow you to access data objects in a variety of formats to meet the requirements of different business scenarios.

    Objects in the following formats are supported:

    Syntax

    • Create an OSS foreign table

      CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
          column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ] [ COLLATE collation ] [ column_constraint [ ... ] ]
            [, ... ]
      ] )
          SERVER server_name
        [ OPTIONS ( option 'value' [, ... ] ) ]
    • Delete an OSS foreign table

      DROP FOREIGN TABLE [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ]

    Parameter options

    The following two tables describe the options. For more information, see CREATE FOREIGN TABLE.

    Common options

    Option

    Type

    Required

    Remarks

    filepath

    String

    Yes. Select one of the three parameters.

    Note

    The three parameters specify the path of the object in the OSS bucket.

    This path directs to a single object.

    prefix

    String

    None

    dir

    String

    None

    bucket

    String

    Optional

    You must specify a bucket for either an OSS server or an OSS table. If you specify a bucket for both, the bucket value for the OSS table overwrites that for the OSS server.

    format

    String

    Yes

    The format of the object. Valid values:

    • CSV.

    • TEXT.

    • ORC.

    • PARQUET.

    • JSON. For more information about JSON standards, see Introducing JSON.

    • JSONLINE. You can view JSON Lines as line feed-delimited JSON. All data that can be read by using JSON Lines can be read by using JSON, but not vice versa. We recommend that you use JSON Lines if possible. For more information about JSON Lines standards, see JSONLINE.

    Options for CSV and TEXT

    Note

    Unless otherwise specified, the options described in the following table apply only to the CSV and TEXT formats. The parameter configurations are invalid for formats such as ORC and Parquet.

    Option

    Type

    Required

    Default value

    Remarks

    filetype

    String

    Optional

    plain

    Valid values:

    • plain: The system reads only the raw binary data.

    • gzip: The system reads the raw binary data and decompresses the package by using GZIP.

    • snappy: The system reads raw binary data and decompresses the package by using standard Snappy. You can read only standard Snappy-compressed objects. You cannot read objects compressed by using Hadoop-Snappy. For more information, see snappy/format_description.txt.

      You can also use this parameter to specify the compression type of the input object in the JSON or JSON Lines format. Valid values: plain and gzip.

    log_errors

    Boolean

    Optional

    false

    Specifies whether to record errors in log files.

    For more information, see Error tolerance mechanism.

    segment_reject_limit

    Numeric value

    Optional

    None

    Specifies the maximum number of error lines before the execution is aborted. If the value contains a percent sign (%), it indicates the percentage of error lines. Otherwise, the value indicates the number of error lines.

    Examples:

    • segment_reject_limit = '10' indicates that execution is aborted when the number of error lines in a segment exceeds 10.

    • segment_reject_limit = '10' indicates that execution is aborted when the number of error lines in a segment is more than 10% of the processed lines.

    The following table describes the options for formatting. For more information, see COPY.

    header

    Boolean

    Optional

    false

    Specifies whether the source file contains the header row. This option applies only to the CSV format.

    delimiter

    String

    Optional

    • Default value for TEXT objects: the Tab key.

    • Default value for CSV objects: a comma (,).

    The field delimiter. Only single-byte characters are allowed.

    quote

    String

    Optional

    Double quotation marks (").

    The column quotation marks.

    • This parameter applies only to the CSV format.

    • Only single-byte characters are allowed.

    escape

    String

    Optional

    By default, the value is the same as the value of the quote parameter.

    Specifies the character that appears before a character that is the same as the value of the quote parameter.

    • Only single-byte characters are allowed.

    • This parameter applies only to the CSV format.

    null

    String

    • Default value for TEXT objects: \N

    • Default value for CSV objects: spaces that are not enclosed in double quotation marks(").

    Specifies the empty string for a file.

    encoding

    String

    Optional

    By default, the encoding format on the client is used.

    Specifies the encoding format of the data object.

    force_not_null

    Boolean

    Optional

    false

    If the parameter is set to true, the values of specified columns are not matched against the empty string.

    force_null

    Boolean

    Optional

    false

    • If this parameter is set to true, the column values that match the empty string are returned as NULL even if the values are quoted.

    • If this parameter is not specified, only unquoted column values that match the empty string are returned as NULL.

    Examples

    CREATE FOREIGN TABLE x(i int, j int)
    SERVER oss_serv OPTIONS (format 'jsonline')
    PARTITION BY LIST (j) ( 
        VALUES('20181218'), 
        VALUES('20190101')
    );
    Note

    After you create an OSS foreign table, you can use one of the following statements to check whether OSS objects that match the table meet expectations:

    • Statement 1: explain verbose select * from <OSS foreign table>;

    • Statement 2: select * from get_oss_table_meta('<OSS foreign table>');

  • Fault tolerance mechanism

    When you create an OSS foreign table, you can set the log_errors and segment_reject_limit parameters to prevent unexpected exits due to error lines in the raw data during a scan of the OSS foreign table. The following list describes these parameters:

    • log_errors: specifies whether to record the information of error lines.

    • segment_reject_limit: specifies the maximum allowable percentage of all parsed lines that can be error lines.

      Note

      Only OSS foreign tables in the CSV and TEXT formats support the fault tolerance mechanism.

    To implement the error tolerance mechanism, perform the following steps:

    1. Create an OSS foreign table based on the FDW feature for fault tolerance.

      CREATE FOREIGN TABLE oss_error_sales (id int, value float8, x text)
          server oss_serv
          options (log_errors 'true',         -- Record the information of error lines.
                   segment_reject_limit '10', -- The number of error lines cannot exceed 10. Otherwise, the system returns an error and exits. 
                   dir 'error_sales/',        -- Specify the OSS object directory that the foreign table matches.
                   format 'csv',              -- Parse objects in the CSV format.
                   encoding 'utf8');          -- Specify the encoding format.
    2. Scan the created foreign table.

      Note

      To show the fault tolerance effect, three error lines are added to the OSS objects.

      The following statement provides an example on how to execute the query:

      SELECT * FROM oss_error_sales ;

      A similar output is returned:

       id |     value     |     x
      ----+---------------+-----------
        1 |  0.1102213212 | abcdefg
        1 |  0.1102213212 | abcdefg
        2 | 0.92983182312 | mmsmda123
        3 |     0.1123123 | abbs
        1 |  0.1102213212 | abcdefg
        2 | 0.92983182312 | mmsmda123
        3 |     0.1123123 | abbs
        1 |  0.1102213212 | abcdefg
        1 |  0.1102213212 | abcdefg
        2 | 0.92983182312 | mmsmda123
        3 |     0.1123123 | abbs
        3 |     0.1123123 | abdsa
        1 |  0.1102213212 | abcdefg
        2 | 0.92983182312 | mmsmda123
        3 |     0.1123123 | abbs
        1 |  0.1102213212 | abcdefg
        2 | 0.92983182312 | mmsmda123
        3 |     0.1123123 | abbs
      (18 rows)
    3. View the log of error lines.

      The following statement provides an example on how to execute the query:

      SELECT * FROM gp_read_error_log('oss_error_sales');

      A similar output is returned:

                 cmdtime            |     relname     |        filename         | linenum | bytenum |                          errmsg                           | rawdata | rawbytes
      ------------------------------+-----------------+-------------------------+---------+---------+-----------------------------------------------------------+---------+----------
       2020-04-22 19:37:35.21125+08 | oss_error_sales | error_sales/sales.2.csv |       2 |         | invalid byte sequence for encoding "UTF8": 0xed 0xab 0xad |         | \x
       2020-04-22 19:37:35.21125+08 | oss_error_sales | error_sales/sales.2.csv |       3 |         | invalid byte sequence for encoding "UTF8": 0xed 0xab 0xad |         | \x
       2020-04-22 19:37:35.21125+08 | oss_error_sales | error_sales/sales.3.csv |       2 |         | invalid byte sequence for encoding "UTF8": 0xed 0xab 0xad |         | \x
      (3 rows)                         
    4. Delete the log of error lines.

      The following statement provides an example on how to execute the query:

      SELECT gp_truncate_error_log('oss_error_sales');

      A similar output is returned:

       gp_truncate_error_log
      -----------------------
       t
      (1 row)

    Use an OSS foreign table

    1. To import OSS data to an internal table, perform the following steps:

      1. Distribute data evenly to multiple objects in OSS.

        Note

        • All compute nodes of an AnalyticDB for PostgreSQL instance read data in parallel from the data objects stored in OSS based on the round-robin algorithm.

        • We recommend that you use the same data encoding format for data objects and databases to simplify the encoding process and improve efficiency. The default database encoding format is UTF-8.

        • Multiple CSV and TEXT objects can be read in parallel. By default, four objects can be read in parallel.

        • To maximize the read efficiency, we recommend that you set the number of objects that can be read in parallel to an integer multiple of the number of compute node cores. The number of compute node cores is the product of the number of compute nodes and the number of cores per compute node.

        • If the number of objects is small, we recommend that you split the source objects into multiple objects so that multiple nodes can scan the objects in parallel. For more information, see Split files.

      2. Create an OSS foreign table for your AnalyticDB for PostgreSQL instance.

      3. Execute one of the following statements to import data:

        • INSERT INTO statement

          INSERT INTO <Destination on-premises table> SELECT * FROM <OSS foreign table>;
        • CREATE TABLE AS statement

          CREATE TABLE <Destination on-premises table> AS SELECT * FROM <OSS foreign table>;

      Examples:

      • Example 1: Use the INSERT statement to import data from oss_lineitem_orc to an internal AOCS table

        Create an internal table named aocs_lineitem. Example:

        CREATE TABLE aocs_lineitem (
            l_orderkey bigint,
            l_partkey bigint,
            l_suppkey bigint,
            l_linenumber bigint,
            l_quantity double precision,
            l_extendedprice double precision,
            l_discount double precision,
            l_tax double precision,
            l_returnflag CHAR(1),
            l_linestatus CHAR(1),
            l_shipdate DATE,
            l_commitdate DATE,
            l_receiptdate DATE,
            l_shipinstruct CHAR(25),
            l_shipmode CHAR(10),
            l_comment VARCHAR(44)
        ) WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN, COMPRESSTYPE=ZSTD, COMPRESSLEVEL=5);

        Import data from oss_lineitem_orc to the aocs_lineitem internal table. Example:

        INSERT INTO aocs_lineitem SELECT * FROM oss_lineitem_orc;
      • Example 2: Use the CREATE TABLE AS statement to import data from oss_lineitem_orc to an internal heap table

        CREATE TABLE heap_lineitem AS SELECT * FROM oss_lineitem_orc DISTRIBUTED BY (l_orderkey);
    2. Query and analyze OSS data.

      You can query an OSS foreign table in the same way as you query an internal table. The following list describes the common query scenarios:

      • Data filtering based on key-value pairs

        SELECT * FROM oss_lineitem_orc WHERE l_orderkey = 14062498;
      • Data aggregation

        SELECT count(*) FROM oss_lineitem_orc WHERE l_orderkey = 14062498;
      • A combination of the filter, group, and limit features

        SELECT l_partkey, sum(l_suppkey)
          FROM oss_lineitem_orc
         GROUP BY l_partkey
         ORDER BY l_partkey
         limit 10;
    3. Join OSS foreign tables to internal tables for data analysis.

      Join internal AOCS table aocs_lineitem to OSS foreign tables for a TPC-H Q3 query. Example:

      SELECT l_orderkey,
             sum(l_extendedprice * (1 - l_discount)) as revenue,
             o_orderdate,
             o_shippriority 
        FROM oss_customer,                                    -- OSS foreign table
             oss_orders,                                      -- OSS foreign table
             aocs_lineitem                                    -- Internal AOCS table
       WHERE c_mktsegment = 'furniture'
         and c_custkey = o_custkey
         and l_orderkey = o_orderkey
         and o_orderdate < date '1995-03-29'
         and l_shipdate > date '1995-03-29'
       GROUP BY l_orderkey, o_orderdate, o_shippriority
       ORDER BY revenue desc, o_orderdate
       limit 10;

    Use a partitioned OSS foreign table

    In addition to OSS foreign tables, AnalyticDB for PostgreSQL supports partitioned OSS foreign tables. If you use partition key columns in a WHERE clause to query data, you can reduce the amount of data to be pulled from OSS.

    To use the partitioning feature of FDW-based OSS foreign tables in AnalyticDB for PostgreSQL, make sure that the following requirements are met: The data of a partition in a partitioned OSS foreign table must be stored in the oss://bucket/partcol1=partval1/partcol2=partval2/ directory of the OSS server. In this directory, partcol1 and partcol2 indicate partition key columns, and partval1 and partval2 indicate the partition key column values that define the partition.

    Syntax

    • Create a partitioned OSS foreign table

      FDW-based OSS foreign tables in AnalyticDB for PostgreSQL and standard partitioned tables use the same syntax to perform partitioning. For more information about the syntax of partitioning standard partitioned tables, see Table partitioning.

      Note

      FDW-based OSS foreign tables in AnalyticDB for PostgreSQL support only list partitioning.

    • Delete a partitioned OSS foreign table

      You can execute the DROP FOREIGN TABLE statement to delete a regular foreign table. You can also execute this statement to delete a partitioned OSS foreign table.

    • Adjust the structure of a partitioned foreign table

      You can execute the ALTER TABLE statement to adjust the structure of a partitioned foreign table. You can add and delete partitions. For more information about the syntax, see ALTER TABLE. The following content provides examples on how to create and delete a partition. Execute the following statement to create the ossfdw_parttable table:

      CREATE FOREIGN TABLE ossfdw_parttable(            
        key text,
        value bigint,
        pt text,                                        -- The partition key of the foreign table.
        region text                                     -- The subpartition key of the foreign table.
      ) 
      SERVER oss_serv
      OPTIONS (dir 'PartationDataDirInOss/', format 'jsonline')
      PARTITION BY LIST (pt)                            -- Use the "pt" column as the partition key
      SUBPARTITION BY LIST (region)                     -- Use the "region" column as the subpartition key
          SUBPARTITION TEMPLATE (                       -- The template of the subpartition
             SUBPARTITION hangzhou VALUES ('hangzhou'),
             SUBPARTITION shanghai VALUES ('shanghai')
          )
      ( PARTITION "20170601" VALUES ('20170601'), 
        PARTITION "20170602" VALUES ('20170602'));

      Execute the ALTER TABLE statement to add a partition to the table. When you perform this operation, a subpartition is automatically created for the new partition based on the subpartition template that is defined in the preceding code block.

      ALTER TABLE ossfdw_parttable ADD PARTITION VALUES ('20170603');

      You can also create a subpartition for an existing partition.

      ALTER TABLE ossfdw_parttable ALTER PARTITION FOR ('20170603') ADD PARTITION VALUES('nanjing');

      Delete a partition.

      ALTER TABLE ossfdw_parttable DROP PARTITION FOR ('20170601');

      Delete a subpartition.

      ALTER TABLE ossfdw_parttable ALTER PARTITION FOR ('20170602') DROP PARTITION FOR ('hangzhou');

      When you create or add a foreign table partition, you can customize the partition instead of using a template. The following code block provides an example:

      CREATE FOREIGN TABLE ossfdw_parttable(            
        key text,
        value bigint,
        pt text,                                        -- The partition key of the foreign table.
        region text                                     -- The subpartition key of the foreign table.
      ) 
      SERVER oss_serv
      OPTIONS (dir 'PartationDataDirInOss/', format 'jsonline')
      PARTITION BY LIST (pt)                            -- Use the "pt" column as the partition key
      SUBPARTITION BY LIST (region)                     -- Use the "region" column as the subpartition key
      (
          -- The following two partitions have different subpartitions:
          VALUES('20181218')
          (
              VALUES('hangzhou'),
              VALUES('shanghai') 
          ),
          VALUES('20181219')
          (
              VALUES('nantong'),
              VALUES('anhui') 
          )    
      );

      Add a partition to the table. You must specify subpartitions for the new partition because no subpartition templates are defined. Example:

      ALTER TABLE ossfdw_parttable ADD PARTITION VALUES ('20181220')
      (
          VALUES('hefei'),
          VALUES('guangzhou') 
      );

    Scenarios

    Partitioned foreign tables are commonly used to access logs shipped by Log Service. For more information, see Use FDW-based OSS foreign tables to access logs shipped by Log Service. You can use a similar method to organize directories of the OSS server when you write data from business applications to OSS, and then create a partitioned OSS foreign table.

    Examples

    The following SQL statements are used to create the userlogin foreign table that has two partitions. The paths of the two partitions in OSS are oss://bucket/userlogin/month=201812/ and oss://bucket/userlogin/month=201901/.

    CREATE FOREIGN TABLE userlogin (
            uid integer,
            name character varying,
            source integer,
            logindate timestamp without time zone,
            month int
    ) SERVER oss_serv OPTIONS (
        dir 'userlogin/',
        format 'text'
    )
    PARTITION BY LIST (month)
    (
            VALUES ('201812'), 
            VALUES ('201901')       
    )

    Use OSS foreign tables to export data

    AnalyticDB for PostgreSQL allows you to export data to OSS by using foreign tables. OSS foreign tables support export of data objects in multiple formats for different business scenarios.

    • OSS foreign tables allow you to export data to uncompressed objects in the CSV and TEXT formats.

    • OSS foreign tables allow you to export data to GZIP-compressed objects in the CSV and TEXT formats.

    • OSS foreign tables allow you to export data to binary objects in the ORC format. For more information, see the Data type mappings between ORC files and AnalyticDB for PostgreSQL files section in this topic.

    Note

    AnalyticDB for PostgreSQL does not allow data export to partitioned foreign tables.

    Naming conventions of exported objects

    During export, multiple segments are used to export data to the same directory in parallel. Therefore, the name of the object exported to OSS uses the following rule:

    {tablename | prefix } _{timestamp}_{random_key}_{seg}{segment_id}_{fileno}.{ext}[.gz]
    • {tablename | prefix }: the name of the object after export. If prefix is used, the specified prefix is used as the prefix. If the dir option is used, the name of the OSS foreign table is used as the prefix.

    • {timestamp}: the timestamp when the data is exported. Format: YYYYMMDDHH24MISS.

    • {random_key}: the random key value.

    • {seg}{segment_id}: {seg} specifies the segment. {segment_id} specifies the segment ID. For example, seg1 specifies that the object is exported from Segment 1.

    • {fileno}: the SN of the object segment, which starts from 0.

    • {ext}: the format of the object after export. For example, ".csv" corresponds to "csv" set for the format option, ".txt" to "text", and ".orc" to "orc".

    • [.gz]: specifies that the object after export is a GZIP-compressed object.

      • Create a foreign table. Set the format to CSV. Use GZIP for compression, and use dir to specify the path.

        CREATE FOREIGN TABLE fdw_t_out_1(a int)
        SERVER oss_serv
        OPTIONS (format 'csv', filetype 'gzip', dir 'test/');
      • Specify the name of the OSS object after export. Example:

        fdw_t_out_1_20200805110207_1718599661_seg-1_0.csv.g
      • Create a foreign table, set the format to orc, and then use prefix to specify the prefix of the object after export.

        CREATE FOREIGN TABLE fdw_t_out_2(a int)
        SERVER oss_serv
        OPTIONS (format 'orc', prefix 'test/my_orc_test');
      • The name of the OSS object after export. Example:

        my_orc_test_20200924153043_1737154096_seg0_0.orc

    Parameter options

    For more information about the options of the AnalyticDB for PostgreSQL foreign table when you use the foreign table to export data, see Create an OSS foreign table.

    In addition, when you use a foreign table to export data, you can use specific options.

    Common options

    Option

    Type

    Unit

    Required

    Default value

    Remarks

    fragment_size

    Numeric value

    MB

    No

    256

    The size of the fragment when data is exported to OSS. If the size of the data written to the object exceeds the specified fragment_size value, data in excess of this value is distributed to a new segment.

    Note
    • Each fragment is an object fragment during export. Each fragment is complete and independent and therefore can be parsed by the foreign table of AnalyticDB for PostgreSQL. A row is not stored across fragments.

    • The size of a fragment does not strictly follow the specified fragment_size value. In most cases, the actual size is slightly greater than the specified size.

    • If you do not have special requirements, you can retain the default value without changing the fragment_size value.

    An example of how to use fragment_size:

    1. Create a foreign table whose format is CSV. The directory of the object after export is test/lineitem/. Set framgment_size to 512 MB.

      CREATE FOREIGN TABLE oss_lineitem (
          l_orderkey bigint,
          l_partkey bigint,
          l_suppkey bigint,
          l_linenumber bigint,
          l_quantity double precision,
          l_extendedprice double precision,
          l_discount double precision,
          l_tax double precision,
          l_returnflag CHAR(1),
          l_linestatus CHAR(1),
          l_shipdate DATE,
          l_commitdate DATE,
          l_receiptdate DATE,
          l_shipinstruct CHAR(25),
          l_shipmode CHAR(10),
          l_comment VARCHAR(44)
      ) server oss_serv
          options (
              dir 'test/lineitem/',
              format 'csv',
              fragment_size '512' -- If the size of a fragment exceeds 512 MB, a new segment is generated.
          );
    2. Write data from the internal table to the foreign table.

      INSERT INTO oss_lineitem SELECT * FROM lineitem;
    3. After export, view objects in OSS. The sizes of most object fragments are slightly greater than 512 MB.

      Command:

      ossutil -e endpoint -i id -k key ls oss://bucket/test/lineitem

      A similar output is returned:

      LastModifiedTime                   Size(B)  StorageClass   ETAG                                  ObjectName
      2020-09-24 14:12:01 +0800 CST    536875660      Standard   ED6C68093E738D09B1386C5F00000000      oss://adbpg-tpch/test/lineitem/oss_lineitem2_20200924140843_1702924182_seg15_7.csv
      2020-09-24 14:12:27 +0800 CST    536875604      Standard   FD25FA7C7109ABCDCB386C5F00000000      oss://adbpg-tpch/test/lineitem/oss_lineitem2_20200924140843_1702924182_seg15_8.csv
      2020-09-24 14:12:53 +0800 CST    536875486      Standard   7C3EDE6AFE354190E5386C5F00000000      oss://adbpg-tpch/test/lineitem/oss_lineitem2_20200924140843_1702924182_seg15_9.csv
      2020-09-24 14:09:07 +0800 CST    536875626      Standard   48B38E65A5BB8B5B03386C5F00000000      oss://adbpg-tpch/test/lineitem/oss_lineitem2_20200924140843_1702924182_seg1_0.csv
      2020-09-24 14:09:32 +0800 CST    536875858      Standard   AF5525D81166F02D1C386C5F00000000      oss://adbpg-tpch/test/lineitem/oss_lineitem2_20200924140843_1702924182_seg1_1.csv
      2020-09-24 14:13:08 +0800 CST    235457368      Standard   BF1FC0B81376AE14F4386C5F00000000      oss://adbpg-tpch/test/lineitem/oss_lineitem2_20200924140843_1702924182_seg1_10.csv
      2020-09-24 14:09:56 +0800 CST    536875899      Standard   20C824EBCAE2C5DB34386C5F00000000      oss://adbpg-tpch/test/lineitem/oss_lineitem2_20200924140843_1702924182_seg1_2.csv
      ...

      Command:

      ossutil -e endpoint -i id -k key du oss://adbpg-tpch/test/lineitem/

      A similar output is returned:

      storage class   object count            sum size(byte)
      ----------------------------------------------------------
      Standard       176                  89686183118
      ----------------------------------------------------------
      total object count: 176                  total object sum size: 89686183118
      total part count:   0                           total part sum size:   0
      
      total du size(byte):89686183118
      
      0.051899(s) elapsed

    CSV and TEXT options

    Notice

    Unless otherwise specified, the options described in the following table apply only to the CSV and TEXT formats. The option configurations are invalid for the ORC format.

    Option

    Type

    Unit

    Required

    Default value

    Remarks

    gzip_level

    Numeric value

    None

    No

    1

    The compression level of GZIP used when you export CSV or TEXT data to OSS. The highest compression level is 9.

    force_quote_all

    Boolean

    None

    No

    false

    Specifies whether to force reference all fields when you export CSV data.

    force_quote_all is valid only for the CSV format.

    Note
    • gzip_level takes effect only when filetype is set to gzip.

    • gzip_level A higher GZIP compression level results in smaller sizes of generated data objects and longer export time.

    • However, according to test results, a greater gzip_level value does not result in smaller file sizes, but does increase the export time. If you do not have special requirements, we recommend that you use the default value of gzip_level, which is 1.

    The following procedure provides an example on how to use gzip_level:

    1. Create a foreign table whose format is CSV. The directory of the object after export is test/lineitem2/. Set gzip_level to 9.

      CREATE FOREIGN TABLE oss_lineitem3 (
          l_orderkey bigint,
          l_partkey bigint,
          l_suppkey bigint,
          l_linenumber bigint,
          l_quantity double precision,
          l_extendedprice double precision,
          l_discount double precision,
          l_tax double precision,
          l_returnflag CHAR(1),
          l_linestatus CHAR(1),
          l_shipdate DATE,
          l_commitdate DATE,
          l_receiptdate DATE,
          l_shipinstruct CHAR(25),
          l_shipmode CHAR(10),
          l_comment VARCHAR(44)
      ) server oss_serv
          options (
              dir 'test/lineitem2/',
              format 'csv', filetype 'gzip' ,gzip_level '9'
          );
    2. Write data from the internal table to the foreign table.

      INSERT INTO oss_lineitem SELECT * FROM lineitem;
    3. After export, you can view the objects in OSS. The total size of the objects is 23 GB (25,141,481,880 bytes), which is far less than the fragment_size value of 83 GB (89,686,183,118 bytes) specified in the example.

      Command:

      ossutil -e endpoint -i id -k key ls oss://bucket/test/lineitem2

      A similar output is returned:

      2020-09-24 15:07:49 +0800 CST    270338060      Standard   9B1F7D7CB0748391C5456C5F00000000      oss://adbpg-tpch/test/lineitem2/oss_lineitem3_20200924145900_1220405754_seg8_3.csv.gz
      2020-09-24 15:10:08 +0800 CST    270467652      Standard   17DE92CAE57F64F550466C5F00000000      oss://adbpg-tpch/test/lineitem2/oss_lineitem3_20200924145900_1220405754_seg8_4.csv.gz
      2020-09-24 15:12:00 +0800 CST    219497966      Standard   FCFE4E457C0F6942C0466C5F00000000      oss://adbpg-tpch/test/lineitem2/oss_lineitem3_20200924145900_1220405754_seg8_5.csv.gz
      2020-09-24 15:01:10 +0800 CST    270617343      Standard   13AD24EF528ECDF836446C5F00000000      oss://adbpg-tpch/test/lineitem2/oss_lineitem3_20200924145900_1220405754_seg9_0.csv.gz
      2020-09-24 15:03:18 +0800 CST    270377032      Standard   759DBF9B999F8609B6446C5F00000000      oss://adbpg-tpch/test/lineitem2/oss_lineitem3_20200924145900_1220405754_seg9_1.csv.gz
      2020-09-24 15:05:28 +0800 CST    270284091      Standard   F9896A5CFF554F2838456C5F00000000      oss://adbpg-tpch/test/lineitem2/oss_lineitem3_20200924145900_1220405754_seg9_2.csv.gz
      2020-09-24 15:07:43 +0800 CST    270350284      Standard   C120BE98B47DAD5EBF456C5F00000000      oss://adbpg-tpch/test/lineitem2/oss_lineitem3_20200924145900_1220405754_seg9_3.csv.gz
      2020-09-24 15:10:04 +0800 CST    270477777      Standard   69B9B1E854B626364C466C5F00000000      oss://adbpg-tpch/test/lineitem2/oss_lineitem3_20200924145900_1220405754_seg9_4.csv.gz
      2020-09-24 15:12:00 +0800 CST    219358236      Standard   A4EB5DFFBD67AF6BC0466C5F00000000      oss://adbpg-tpch/test/lineitem2/oss_lineitem3_20200924145900_1220405754_seg9_5.csv.gz
      ....
      
      Object Number is: 96

      Command:

      ossutil -e endpoint -i id -k key du oss://adbpg-tpch/test/lineitem/

      A similar output is returned:

      storage class   object count            sum size(byte)
      ----------------------------------------------------------
      Standard       96                   25141481880
      ----------------------------------------------------------
      total object count: 96                   total object sum size: 25141481880
      total part count:   0                           total part sum size:   0
      
      total du size(byte):25141481880
      
      0.037620(s) elapsed

    The following procedure provides an example on how to use force_quote_all:

    1. Create a foreign table whose format is CSV. The directory of the object after export is test/lineitem/. Set force_quote_all to true.

      CREATE FOREIGN TABLE foreign_x (
        a int, b text
      ) server oss_serv
          options (
              dir 'test/x/',
              format 'csv', force_quote_all 'true'
          );
    2. Write data from the internal table to the foreign table.

      INSERT INTO foreign_x values(1, 'a'), (2, 'b');
    3. After export, view the object content in OSS. The value of each column is referenced by double quotation marks (").

      Command:

      cat foreign_x_20200923173618_447789894_seg4_0.csv

      A similar output is returned:

      "1","a"

      Command:

      cat foreign_x_20200923173618_447789894_seg5_0.csv

      A similar output is returned:

      "2","b"

    ORC option

    Notice

    Unless otherwise specified, the options described in the following table apply only to ORC. The option configurations are invalid for the CSV or TEXT format.

    Option

    Type

    Unit

    Required

    Default value

    Remarks

    orc_stripe_size

    Numeric value

    MB

    No

    64

    The size of each stripe in a single ORC object. You can specify the size based on which to export data to an ORC object.

    Note
    • A smaller orc_stripe_size value results in larger ORC objects and longer export time when the amount of data is the same.

    • If you do not have special parameters, you can use the default value 64 MB without changing the orc_stripe_size value.

    The following procedure provides an example on how to use orc_stripe_size:

    1. Create a foreign table whose format is ORC. The directory of the object after export is test/lineitem/. Set orc_stripe_size to 128 MB.

      CREATE FOREIGN TABLE oss_lineitem (
          l_orderkey bigint,
          l_partkey bigint,
          l_suppkey bigint,
          l_linenumber bigint,
          l_quantity double precision,
          l_extendedprice double precision,
          l_discount double precision,
          l_tax double precision,
          l_returnflag CHAR(1),
          l_linestatus CHAR(1),
          l_shipdate DATE,
          l_commitdate DATE,
          l_receiptdate DATE,
          l_shipinstruct CHAR(25),
          l_shipmode CHAR(10),
          l_comment VARCHAR(44)
      ) server oss_serv
      options ( dir 'test/lineitem/', format 'orc', orc_stripe_size '128');
    2. Write data from the internal table to the foreign table.

      INSERT INTO oss_lineitem SELECT * FROM lineitem;

    3. After export, view the meta information of an exported ORC object. The size of the stripe in the ORC object is about 128 MB.

    Examples

    1. Create a foreign table

      Create a foreign table whose format is CSV. The directory of the object after export is tt_csv.

      CREATE FOREIGN TABLE foreign_x (i int, j int)
      SERVER oss_serv
      OPTIONS (format 'csv', dir 'tt_csv/');
    2. Export data

      You can write data to a foreign table in the same manner as you write data to a standard table. You can use the INSERT INTO statement to write data.

      • Batch export data to the OSS foreign table. We recommend that you use this method.

        INSERT INTO foreign_x SELECT * FROM local_x;
      • Insert the list of values. We recommend that you do not use this method.

        INSERT INTO foreign_x VALUES (1,1), (2,2), (3,3);x;

    Split objects

    FDW-based OSS foreign tables support multi-node parallel scans. If the number of objects is small, we recommend that you split the source objects into multiple objects so that multiple nodes can scan the objects in parallel.

    For example, in Linux, you can run the split command to split objects in the TEXT or CSV format into multiple objects.

    Note

    A row in an object cannot be split into different objects. Therefore, objects must be split by rows.

    • Obtain the number of rows in the current object.

      wc -l csv_file
    • Split the object into small objects based on the specified number of rows. N specifies the number of rows in each small object.

      split -l N csv_file

    Collect statistics for foreign tables

    The data of OSS foreign tables is stored in OSS. By default, the statistics for foreign tables are not collected. If up-to-date statistics are unavailable, the query optimizer may generate inefficient query plans for complex queries such as queries on joined tables. You can use the ANALYZE statement to update statistics. You can use the following execution process:

    • Execute the EXPLAIN statement to view the query plan before the ANALYZE statement is executed.

      EXPLAIN <Table name>;
    • Use the ANALYZE statement to collect statistics for an OSS foreign table.

      ANALYZE <Table name>;
    • Use the EXPLAIN statement to view the execution plan after the ANALYZE statement is executed.

      EXPLAIN <Table name>;

    View execution plans

    To view the execution plan of an OSS foreign table, execute the following statements:

    EXPLAIN SELECT COUNT(*) FROM oss_lineitem_orc WHERE l_orderkey > 14062498;
    Note

    You can execute the EXPLAIN VERBOSE statement to view more information.

    View the object information of a specified OSS foreign table

    You can execute the following statement to obtain the object information of a specified OSS foreign table:

    SELECT * FROM get_oss_table_meta('<OSS FOREIGN TABLE>');

    Data type mappings between ORC objects and AnalyticDB for PostgreSQL files

    The following table describes the data type mappings between ORC objects and AnalyticDB for PostgreSQL files.

    AnalyticDB for PostgreSQL data type

    ORC data type

    bool

    BOOLEAN

    int2

    SHORT

    int4

    INT

    int8

    LONG

    float4

    FLOAT

    float8

    DOUBLE

    numeric

    DECIMAL

    char

    CHAR

    text

    STRING

    bytea

    BINARY

    timestamp

    TIMESTAMP

    date

    DATE

    int2[]

    LIST(SHORT)

    int4[]

    LIST(INT)

    int8[]

    LIST(LONG)

    float4[]

    LIST(FLOAT)

    float8[]

    LIST(DOUBLE)

    Note

    ORC data of the LIST type can be converted to only one-dimensional arrays in AnalyticDB for PostgreSQL.

    Data type mappings between Parquet objects and AnalyticDB for PostgreSQL files

    Objects in the Parquet format support the following data types: primitive types and logical types. We recommend that you use the following data types supported by AnalyticDB for PostgreSQL to map the data types supported by Parquet objects when you create a table. Otherwise, the data types are converted.

    Note

    Two nested data types in Parquet are not supported: ARRAY and MAP.

    The following table describes the data type mappings that apply when Parquet objects do not contain data of logical types.

    AnalyticDB for PostgreSQL data type

    Parquet data type

    bool

    BOOLEAN

    integer

    INT32

    bigint

    INT64

    timestamp

    INT96

    float4

    FLOAT

    float8

    DOUBLE

    bytea

    BYTE_ARRAY

    bytea

    FIXED_LEN_BYTE_ARRAY

    The following table shows the data type mappings that apply when Parquet objects contain data of logical types.

    AnalyticDB for PostgreSQL data type

    Parquet data type

    text

    STRING

    date

    DATE

    timestamp

    TIMESTAMP

    time

    TIME

    interval

    INTERVAL

    numeric

    DECIMAL

    smallint

    INT(8)/INT(16)

    integer

    INT(32)

    bigint

    INT(64)

    bigint

    UINT(8/16/32/64)

    json

    JSON

    jsonb

    BSON

    uuid

    UUID

    text

    ENUM

    Use FDW-based OSS foreign tables to access logs shipped by Log Service

    Partitioned FDW-based OSS foreign tables in AnalyticDB for PostgreSQL are widely used to access logs shipped by Log Service. Log Service is an end-to-end logging service developed by Alibaba Cloud and is widely used in big data scenarios. For more information about Log Service, see What is Log Service?

    1. To create a partitioned OSS foreign table based on the data shipped from Log Service to OSS, set Shard Format in the OSS LogShipper dialog box. For more information about the semantic information of configuration items, see Ship log data from Log Service to OSS. Log objects generated within the same month are stored in the same OSS directory. The configuration generates the following directory structure in OSS:

      oss://oss-fdw-test/adbpgossfdw
      ├── date=202002
      │   ├── userlogin_1585617629106546791_647504382.csv
      │   └── userlogin_1585617849232201154_647507440.csv
      └── date=202003
          └── userlogin_1585617944247047796_647508762.csv
    2. Then, create the following partitioned OSS foreign table based on the columns contained in the files shipped by Log Service:

      CREATE FOREIGN TABLE userlogin (
              uid integer,
              name character varying,
              source integer,
              logindate timestamp without time zone,
              "date" int
      ) SERVER oss_serv OPTIONS (
          dir 'adbpgossfdw/',
          format 'text'
      )
      PARTITION BY LIST ("date")
      (
              VALUES ('202002'), 
              VALUES ('202003')       
      )
    3. Finally, implement the required analysis logic on the partitioned foreign table. For example, you can query the number of all user logons in February 2020.

      The following statement provides a sample request:

      EXPLAIN SELECT uid, count(uid) FROM userlogin WHERE "date" = 202002 GROUP BY uid;

      A similar output is returned:

                                                                                  QUERY PLAN                                                                             
      -------------------------------------------------------------------------------------------------------------------------------------------------------------------
       Gather Motion 3:1  (slice2; segments: 3)  (cost=5135.10..5145.10 rows=1000 width=12)
         ->  HashAggregate  (cost=5135.10..5145.10 rows=334 width=12)
               Group Key: userlogin_1_prt_1.uid
               ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=5100.10..5120.10 rows=334 width=12)
                     Hash Key: userlogin_1_prt_1.uid
                     -> HashAggregate  (cost=5100.10..5100.10 rows=334 width=12)
                           Group Key: userlogin_1_prt_1.uid
                           ->t;  Append  (cost=0.00..100.10 rows=333334 width=4)
                                 ->  Foreign Scan on userlogin_1_prt_1  (cost=0.00..100.10 rows=333334 width=4)
                                       Filter: (date = 202002)
                                       Oss Url: endpoint=oss-cn-hangzhou-zmf-internal.aliyuncs.com bucket=adbpg-regress dir=adbpgossfdw/date=202002/ filetype=plain|text
                                       Oss Parallel (Max 4) Get: total 0 file(s) with 0 bytes byte(s).
       Optimizer: Postgres query optimizer
      (13 rows)

      In the preceding statements, the FDW-based OSS foreign table reads only data generated in February 2020 to reduce the amount of data to read and maximize query efficiency.

    Common errors

    When the system scans an OSS foreign table, a similar error is returned:

    "oss server returned error: StatusCode=..., ErrorCode=..., ErrorMessage="...", RequestId=..."

    For more information about error types and how to handle errors, see Handle OSS errors.

    In the error message:

    • StatusCode: the HTTP status code.

    • ErrorCode: the error code returned by OSS.

    • ErrorMessage: the error message returned by OSS.

    • RequestId: the UUID that identifies the request. If the issue persists, contact technical support personnel and provide this ID.

    References