This topic describes how to use Object Storage Service (OSS) foreign tables to import OSS data for data analysis based on OSS foreign data wrapper (FDW).

Description

OSS FDW is developed based on PostgreSQL FDW. OSS FDW allows you to perform the following operations:

  • Import OSS data to row-oriented or column-oriented tables of AnalyticDB for PostgreSQL instances for accelerated data analysis.
  • Query and analyze large amounts of OSS data.
  • Join OSS foreign tables with AnalyticDB for PostgreSQL tables for data analysis.

OSS FDW supports a variety of object formats for different business scenarios. For more information, see the "Create an OSS foreign table" section of this topic.

In addition, OSS FDW allows you to partition tables by one or more fields. Then, you can filter data by partition.

Comparison between OSS foreign tables and OSS external tables

OSS foreign tables are superior to OSS external tables in terms of performance, functionality, and stability. The following comparison describes their differences:

  • Data import from OSS or data export to OSS: supported by both.
  • Analysis of large amounts of data: OSS foreign tables provide better performance, whereas OSS external tables may fail to meet the expected performance requirements.
  • Supported object formats:
    • OSS foreign tables: Avro, ORC, Parquet, JSON, JSON Lines, CSV (GZIP- and standard Snappy-compressed), and TEXT (GZIP- and standard Snappy-compressed).
    • OSS external tables: CSV (GZIP- and standard Snappy-compressed) and TEXT (GZIP- and standard Snappy-compressed).
  • Table partitions: supported only by OSS foreign tables.

Preparations

Prepare OSS data

Prepare a sample file example.csv.

Obtain the OSS bucket information

The following procedure demonstrates how to obtain the bucket name, object path, endpoint, and bucket domain name.

  1. Log on to the OSS console.
  2. In the left-side navigation pane, click Buckets.
  3. On the Buckets page, click the name of a bucket.
    You can obtain a bucket name on the Buckets page.
  4. On the Files page, obtain an object path.
  5. In the left-side navigation pane, click Overview.
  6. In the Domain Names section of the Overview page, obtain an endpoint and a bucket domain name.
    We recommend that you use the domain name of VPC Access from ECS (Internal Network) for data access.

Obtain the AccessKey ID and AccessKey secret

For more information about how to obtain the AccessKey ID and AccessKey secret, see Obtain an AccessKey pair.

Create an OSS server

Execute a CREATE SERVER statement to create an OSS server. You must specify a name for the OSS server that you want to access. For more information about CREATE SERVER, see CREATE SERVER.

Syntax

CREATE SERVER server_name
    FOREIGN DATA WRAPPER fdw_name
    [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter Data type Required Description
server_name String Yes The name of the OSS server.
fdw_name String Yes The name of the foreign data wrapper that manages the server. Set the value to oss_fdw.

The following table describes the parameters involved in OPTIONS.

Parameter Data type Required Description
endpoint String Yes The endpoint that is used to access the bucket. For information about how to obtain the endpoint, see the "Preparations" section of this topic.
bucket String No The name of the bucket that stores a data object. For information about how to obtain the bucket name, see the "Preparations" section of this topic.
Note
  • The bucket parameter must be specified for at least one of the OSS server and OSS foreign table. For more information about the bucket parameter specified for the OSS foreign table, see the "Create an OSS foreign table" section of this topic.
  • If the bucket parameter is specified for both the OSS server and OSS foreign table, the bucket parameter specified in the OSS foreign table takes effect.
speed_limit Numeric No The transmission threshold that triggers a timeout. Default value: 1024. Unit: bytes.

If this parameter is specified, the speed_time parameter is required.

Note By default, if less than 1,024 bytes of data is transmitted within 90 consecutive seconds, a timeout is triggered. For more information, see Error handling.
speed_time Numeric No The timeout period threshold. Default value: 90. Unit: seconds.

If this parameter is specified, the speed_limit parameter is required.

Note By default, if less than 1,024 bytes of data is transmitted within 90 consecutive seconds, a timeout is triggered. For more information, see Error handling.
connect_timeout Numeric No The timeout period for connections. Default value: 10. Unit: seconds.
dns_cache_timeout Numeric No The timeout period for DNS resolution. Default value: 60. Unit: seconds.
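
The speed_limit and speed_time behavior described above can be sketched as follows. This is an illustrative model, not the actual OSS FDW implementation: it treats the last speed_time seconds of the transfer as a window of per-second byte counts.

```python
def is_stalled(bytes_per_second, speed_limit=1024, speed_time=90):
    """Illustrative model of the speed_limit/speed_time timeout:
    the transfer times out if fewer than speed_limit bytes in total
    were transmitted over the last speed_time consecutive seconds."""
    if len(bytes_per_second) < speed_time:
        return False  # not enough history yet to trigger a timeout
    return sum(bytes_per_second[-speed_time:]) < speed_limit
```

With the default values, 90 seconds in which nothing is transmitted trigger a timeout, while a transfer that moves even a modest amount of data per second does not.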

Example

CREATE SERVER oss_serv
    FOREIGN DATA WRAPPER oss_fdw
    OPTIONS (
        endpoint 'oss-cn-********.aliyuncs.com',
        bucket 'adb-pg'
  );

You can also execute an ALTER SERVER statement to modify configurations of an OSS server. For more information, see ALTER SERVER.

Examples for modifying configurations of an OSS server:

  • Modify a parameter of the OSS server.
    ALTER SERVER oss_serv OPTIONS(SET endpoint 'oss-cn-********.aliyuncs.com');
  • Add a parameter to the OSS server.
    ALTER SERVER oss_serv OPTIONS(ADD connect_timeout '20');
  • Remove a parameter from the OSS server.
    ALTER SERVER oss_serv OPTIONS(DROP connect_timeout);

You can also execute a DROP SERVER statement to delete an OSS server. For more information, see DROP SERVER.

Create a user mapping to the OSS server

After you create an OSS server, you must create a user that accesses the OSS server. You can execute a CREATE USER MAPPING statement to create a user mapping between an AnalyticDB for PostgreSQL database user and the user that accesses the OSS server. For more information, see CREATE USER MAPPING.

Syntax

CREATE USER MAPPING FOR { username | USER | CURRENT_USER | PUBLIC }
    SERVER servername
    [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter Data type Required Description
username String Yes, specify one of the four The name of the AnalyticDB for PostgreSQL database user to map.
USER String The current database user of the AnalyticDB for PostgreSQL instance.
CURRENT_USER String The current database user of the AnalyticDB for PostgreSQL instance.
PUBLIC String Matches all database users of the AnalyticDB for PostgreSQL instance, including users that are created later.
servername String Yes The name of the OSS server.

The following table describes the parameters involved in OPTIONS.

Parameter Data type Required Description
id String Yes The AccessKey ID that is used to access the OSS bucket. For more information about how to obtain an AccessKey ID, see Obtain an AccessKey pair.
key String Yes The AccessKey secret that is used to access the OSS bucket. For more information about how to obtain an AccessKey secret, see Obtain an AccessKey pair.

Example

CREATE USER MAPPING FOR PUBLIC
    SERVER oss_serv
    OPTIONS (
        id 'LTAI5t7Ge***************',
        key 'FikziJd2La*******************'
    );

You can also execute a DROP USER MAPPING statement to delete a user. For more information, see DROP USER MAPPING.

Create an OSS foreign table

After you create an OSS server and a user to access the server, you can execute a CREATE FOREIGN TABLE statement to create an OSS foreign table. For more information, see CREATE FOREIGN TABLE.

Syntax

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
    column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ] [ COLLATE collation ] [ column_constraint [ ... ] ]
      [, ... ]
] )
    SERVER server_name
  [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter Data type Required Description
table_name String Yes The name of the OSS foreign table.
column_name String Yes The name of the column.
data_type String Yes The data type of the column.

The following table describes the parameters involved in OPTIONS.

Parameter Data type Required Description
filepath String Yes, specify one of the three (filepath, prefix, or dir) The name of the object, including its path in OSS.

If the filepath parameter is specified, only the specified object is selected.

prefix String The prefix of the object path. Only the specified prefix is matched and regular expressions are not supported.

If the prefix parameter is specified, all OSS objects whose paths contain this prefix are selected. Examples:

  • If you set prefix to test/filename, the following objects are imported:
    • test/filename
    • test/filenamexxx
    • test/filename/aa
    • test/filenameyyy/aa
    • test/filenameyyy/bb/aa
  • If you set prefix to test/filename/, only the following object out of the preceding objects is imported:
    • test/filename/aa
dir String The directory path in OSS. The directory path must end with a forward slash (/). Example: test/mydir/.

If the dir parameter is specified, all objects in the directory are selected, excluding its subdirectories and objects in the subdirectories.

bucket String No The name of the bucket that stores a data object. For information about how to obtain the bucket name, see the "Preparations" section of this topic.
Note
  • The bucket parameter must be specified for at least one of the OSS server and OSS foreign table.
  • If the bucket parameter is specified for both the OSS server and OSS foreign table, the bucket parameter specified in the OSS foreign table takes effect.
format String Yes The object format. Valid values:
  • csv
  • text
  • orc
  • avro
  • parquet
  • json

    For more information about JSON, see Introducing JSON.

  • jsonline

    In this format, JSON data is delimited by line feeds. All data that can be read by using JSON Lines can also be read by using JSON, but not vice versa. We recommend that you use JSON Lines if possible. For more information about JSON Lines, see JSON Lines.

filetype String No The object type. Default value: plain. Valid values:
  • plain: The object is read as-is without decompression.
  • gzip: The object is decompressed by using GZIP after the raw data is read.
  • snappy: The object is decompressed by using Snappy after the raw data is read.

    Only standard Snappy-compressed objects are supported. Hadoop Snappy-compressed objects are not supported.

Note
  • If the filetype parameter is specified, only CSV, TEXT, JSON, and JSON Lines objects are supported.
  • If the filetype parameter is set to snappy, JSON and JSON Lines objects are not supported.
log_errors Boolean No Specifies whether to record errors in log files. Default value: false. For more information, see Fault tolerance.
Note This parameter is available only for CSV and TEXT objects.
segment_reject_limit Numeric No The error row threshold beyond which the scan stops.

A value that contains a percent sign (%) indicates a percentage of error rows. A value without a percent sign (%) indicates an absolute number of error rows. Examples:

  • segment_reject_limit = '10' indicates that the task stops and reports an error if the number of error rows exceeds 10.
  • segment_reject_limit = '10%' indicates that the task stops and reports an error if the number of error rows exceeds 10% of the processed rows.
Note This parameter is available only for CSV and TEXT objects.
header Boolean No Specifies whether the first row of the source object is a header row that contains field names. Default value: false. Valid values:
  • true: The first row is a header row and is skipped during import.
  • false: The first row contains data.
Note This parameter is available only for CSV objects.
delimiter String No The delimiter between fields. It can be set only to a single-byte character.
  • Default value for CSV objects: comma (,)
  • Default value for TEXT objects: tab character (\t)
Note This parameter is available only for CSV and TEXT objects.
quote String No The quotation mark that encloses fields. It can be set only to a single-byte character. Default value: double quotation mark (").
Note This parameter is available only for CSV objects.
escape String No The character that is used to escape the character specified by the quote parameter. It can be set only to a single-byte character. Default value: double quotation mark (").
Note This parameter is available only for CSV objects.
null String No The string that represents a NULL value in objects.
  • Default value for CSV objects: \N
  • Default value for TEXT objects: an empty string that is not enclosed by quotation marks
Note This parameter is available only for CSV and TEXT objects.
encoding String No The encoding format of data objects. Default value: encoding format of the client.
Note This parameter is available only for CSV and TEXT objects.
force_not_null Boolean No Specifies whether field values are matched against the null string. Default value: false. Valid values:
  • true: Field values are not matched against the null string. Empty values are read as empty strings instead of NULL.
  • false: Field values are matched against the null string.
Note This parameter is available only for CSV and TEXT objects.
force_null Boolean No Specifies whether an empty string is returned as NULL even if it is enclosed by quotation marks. Default value: false. Valid values:
  • true: An empty string is returned as NULL no matter whether its field value is enclosed by quotation marks.
  • false: An empty string is returned as NULL only if its field value is not enclosed by quotation marks.
Note This parameter is available only for CSV and TEXT objects.
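
The filepath, prefix, and dir matching rules described above can be illustrated with a short sketch. The function below is a hypothetical model of the selection logic, not part of OSS FDW:

```python
def select_objects(object_keys, filepath=None, prefix=None, dir=None):
    """Illustrative model of how filepath, prefix, and dir select OSS objects."""
    if filepath is not None:
        # filepath selects exactly the specified object
        return [k for k in object_keys if k == filepath]
    if prefix is not None:
        # prefix matches every object whose key starts with the prefix
        return [k for k in object_keys if k.startswith(prefix)]
    if dir is not None:
        # dir selects objects directly in the directory,
        # excluding subdirectories and objects in them
        return [k for k in object_keys
                if k.startswith(dir) and k != dir and "/" not in k[len(dir):]]
    return []
```

Running this against the keys from the prefix examples above, prefix 'test/filename' selects all five listed objects, while prefix 'test/filename/' selects only test/filename/aa.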

Example

CREATE FOREIGN TABLE ossexample (
    date text,
    time text,
    open float,
    high float,
    low float,
    volume int
) SERVER oss_serv OPTIONS (dir 'dir_oss_adb/', format 'csv');
Note After you create an OSS foreign table, you can use one of the following methods to check whether the OSS objects matched by the foreign table meet your expectations:
  • Method 1
    EXPLAIN VERBOSE SELECT * FROM <Name of the OSS foreign table>;
  • Method 2
    SELECT * FROM get_oss_table_meta('<Name of the OSS foreign table>');

You can also execute a DROP FOREIGN TABLE statement to delete an OSS foreign table. For more information, see DROP FOREIGN TABLE.

Import OSS data to an AnalyticDB for PostgreSQL table

  1. Upload the prepared sample file example.csv to OSS. For more information, see Upload objects.
    Note
    • We recommend that you use the same encoding format for data objects and databases to avoid encoding conversion and improve efficiency. The default database encoding format is UTF-8.
    • All compute nodes of AnalyticDB for PostgreSQL use a polling mechanism to read data objects of OSS in parallel.
    • AnalyticDB for PostgreSQL can read multiple CSV and TEXT objects in parallel. By default, four objects can be read in parallel. To maximize the read efficiency, we recommend that you set the number of objects that can be read in parallel to an integer multiple of the number of compute node cores. The number of compute node cores is the product of the number of compute nodes and the number of cores per compute node. For information about how to split objects, see the "Split objects" section of the Use OSS foreign tables to access OSS data topic.
  2. Connect to an AnalyticDB for PostgreSQL database. For more information, see Use client tools to connect to an instance.
  3. Create an OSS server.
    CREATE SERVER oss_serv
        FOREIGN DATA WRAPPER oss_fdw
        OPTIONS (
            endpoint 'oss-cn-********.aliyuncs.com',
            bucket 'adb-pg'
      );
  4. Create a user to access the OSS server.
    CREATE USER MAPPING FOR PUBLIC
        SERVER oss_serv
        OPTIONS (
            id 'LTAI5t7Ge***************',
            key 'FikziJd2La*******************'
        );
  5. Create an OSS foreign table named ossexample.
    CREATE FOREIGN TABLE ossexample (
        date text,
        time text,
        open float,
        high float,
        low float,
        volume int
    ) SERVER oss_serv OPTIONS (dir 'oss_adb/', format 'csv');
  6. Execute an INSERT or CREATE TABLE AS statement to import OSS data.
    • INSERT
      1. To execute an INSERT statement to import data, create an AnalyticDB for PostgreSQL table named adbexample that has the same schema as the foreign table.
        CREATE TABLE adbexample (
            date text,
            time text,
            open float,
            high float,
            low float,
            volume int
        ) WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN, COMPRESSTYPE=ZSTD, COMPRESSLEVEL=5);
      2. Execute an INSERT statement to import data of the foreign table ossexample to the AnalyticDB for PostgreSQL table adbexample.
        INSERT INTO adbexample SELECT * FROM ossexample;
    • CREATE TABLE AS

      Execute a CREATE TABLE AS statement to create an AnalyticDB for PostgreSQL table named adbexample and import data from the foreign table ossexample.

      CREATE TABLE adbexample AS SELECT * FROM ossexample DISTRIBUTED BY (volume);
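
As noted in step 1, read parallelism works best when the number of CSV or TEXT objects is an integer multiple of the total number of compute node cores. A hypothetical helper that rounds a planned object count up to the nearest such multiple (not part of AnalyticDB for PostgreSQL):

```python
import math

def suggested_object_count(planned_objects, compute_nodes, cores_per_node):
    """Round the planned number of objects up to the nearest integer
    multiple of the total compute cores (nodes x cores per node)."""
    total_cores = compute_nodes * cores_per_node
    return math.ceil(planned_objects / total_cores) * total_cores
```

For example, splitting data into about 10 objects on an instance with 4 compute nodes of 4 cores each would suggest 16 objects, so that each core reads the same number of objects.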

Query and analyze OSS data

You can query data of OSS foreign tables in the same way as you query AnalyticDB for PostgreSQL tables. The following query methods can be used:

  • Query data by using a filter condition.
    SELECT * FROM ossexample WHERE volume = 5;
  • Query data by using an aggregate function.
    SELECT count(*) FROM ossexample WHERE volume = 5;
  • Query data by specifying columns and using GROUP BY and LIMIT clauses.
    SELECT low, sum(volume)
      FROM ossexample
     GROUP BY low
     ORDER BY low
     LIMIT 5;

Join the OSS foreign table with an AnalyticDB for PostgreSQL table for data analysis

  1. Create an AnalyticDB for PostgreSQL table named example for join analysis and insert data into the table.
    CREATE TABLE example (id int, volume int);
    INSERT INTO example VALUES(1,1), (2,3), (4,5);
  2. Join the OSS foreign table ossexample with the AnalyticDB for PostgreSQL table example for data query.
    SELECT example.volume, min(high), max(low)
      FROM ossexample, example
     WHERE ossexample.volume = example.volume
     GROUP BY example.volume
     ORDER BY example.volume;

Fault tolerance

OSS FDW uses the log_errors and segment_reject_limit parameters to provide the fault tolerance feature so that the scanning of OSS foreign tables is not interrupted by errors in raw data.

For more information about the log_errors and segment_reject_limit parameters, see Create an OSS foreign table.

  • Create an OSS foreign table that supports fault tolerance.
    CREATE FOREIGN TABLE oss_error_sales (id int, value float8, x text)
        SERVER oss_serv
        OPTIONS (log_errors 'true',         -- Record the information of error rows.
                 segment_reject_limit '10', -- Specify the threshold to stop scanning. If the number of error rows exceeds 10, the scanning stops. 
                 dir 'error_sales/',        -- Specify the OSS object directory that the foreign table matches.
                 format 'csv',              -- Specify CSV as the format to parse objects.
                 encoding 'utf8');          -- Specify the encoding format. 
  • Query the logs of error rows.
    SELECT * FROM gp_read_error_log('oss_error_sales');
  • Delete the logs of error rows.
    SELECT gp_truncate_error_log('oss_error_sales');
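
The segment_reject_limit semantics used above can be sketched as follows. This is an illustrative model of the threshold check, not the actual implementation:

```python
def exceeds_reject_limit(error_rows, processed_rows, segment_reject_limit):
    """Illustrative check of segment_reject_limit:
    a value that ends with '%' is a percentage of processed rows,
    a plain number is an absolute count of error rows."""
    if segment_reject_limit.endswith("%"):
        threshold = float(segment_reject_limit[:-1])
        if processed_rows == 0:
            return False
        return error_rows * 100.0 / processed_rows > threshold
    return error_rows > int(segment_reject_limit)
```

With segment_reject_limit '10', the 11th error row stops the scan; with '10%', the scan stops only after error rows exceed 10% of the processed rows.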

Data type mappings between ORC and AnalyticDB for PostgreSQL

Data type of ORC Data type of AnalyticDB for PostgreSQL
BOOLEAN bool
SHORT int2
INT int4
LONG int8
FLOAT float4
DOUBLE float8
DECIMAL numeric
CHAR char
STRING text
BINARY bytea
TIMESTAMP timestamp
DATE date
LIST(SHORT) int2[]
LIST(INT) int4[]
LIST(LONG) int8[]
LIST(FLOAT) float4[]
LIST(DOUBLE) float8[]
Note The LIST data type in ORC can be converted to only a one-dimensional array in AnalyticDB for PostgreSQL.

Data type mappings between Parquet and AnalyticDB for PostgreSQL

Note Nested arrays or maps are not supported in Parquet.

The following table describes the data type mappings that apply when Parquet objects do not contain data of logical types.

Data type of Parquet Data type of AnalyticDB for PostgreSQL
BOOLEAN bool
INT32 integer
INT64 bigint
INT96 timestamp
FLOAT float4
DOUBLE float8
BYTE_ARRAY bytea
FIXED_LEN_BYTE_ARRAY bytea

The following table describes the data type mappings that apply when Parquet objects contain data of logical types.

Data type of Parquet Data type of AnalyticDB for PostgreSQL
STRING text
DATE date
TIMESTAMP timestamp
TIME time
INTERVAL interval
DECIMAL numeric
INT(8) and INT(16) smallint
INT(32) integer
INT(64) bigint
UINT(8/16/32/64) bigint
JSON json
BSON jsonb
UUID uuid
ENUM text

Data type mappings between Avro and AnalyticDB for PostgreSQL

The following table describes the data type mappings between Avro and AnalyticDB for PostgreSQL. Multiple Avro data types can be combined into a RECORD data type.

Data type of Avro (including the main type, logical type, and element type) Data type of AnalyticDB for PostgreSQL
BOOLEAN bool
INT integer and smallint
INT and DATE date
LONG bigint
LONG and TIMESTAMP timestamp and timestamptz
LONG and TIME time
FLOAT float
DOUBLE float8
BYTES bytea
BYTES and DECIMAL (precision, scale) numeric (precision, scale)
STRING char, bpchar, varchar, text, and numeric (without specifying precision or scale)
ARRAY and INT int4[] and int2[]
ARRAY and LONG int8[]
ARRAY and FLOAT float[]
ARRAY and DOUBLE float8[]
Note
  • The ARRAY data type in Avro can be converted to only a one-dimensional array in AnalyticDB for PostgreSQL.
  • Complex Avro data types that are not included in the preceding table are not supported.