This topic describes how to use Object Storage Service (OSS) foreign tables to import OSS data for data analysis based on OSS foreign data wrapper (FDW).

Description

OSS FDW is developed based on PostgreSQL FDW. OSS FDW allows you to perform the following operations:

  • Import OSS data to row-oriented or column-oriented tables of AnalyticDB for PostgreSQL instances for accelerated data analysis.
  • Query and analyze large amounts of OSS data.
  • Join OSS foreign tables with AnalyticDB for PostgreSQL tables for data analysis.

OSS FDW supports a variety of object formats for different business scenarios. For more information, see the "Create an OSS foreign table" section of this topic.

In addition, OSS FDW allows you to partition tables by one or more fields. Then, you can filter data by partition.

Comparison between OSS foreign tables and OSS external tables

OSS foreign tables are superior to OSS external tables in terms of performance, functionality, and stability. The following comparison describes their differences:

  • Data import from OSS or data export to OSS: supported by both.
  • Analysis of large amounts of data: OSS foreign tables provide better performance, whereas OSS external tables may fail to meet the expected performance requirements.
  • Supported object formats:
    • OSS foreign tables: Avro, ORC, Parquet, JSON, JSON Lines, CSV (GZIP- and standard Snappy-compressed), and TEXT (GZIP- and standard Snappy-compressed).
    • OSS external tables: CSV (GZIP- and standard Snappy-compressed) and TEXT (GZIP- and standard Snappy-compressed).
  • Table partitions: supported only by OSS foreign tables.

Preparations

Prepare OSS data

Prepare a sample file example.csv.

Obtain the OSS bucket information

The following procedure demonstrates how to obtain the bucket name, object path, endpoint, and bucket domain name.

  1. Log on to the OSS console.
  2. In the left-side navigation pane, click Buckets.
  3. On the Buckets page, click the name of a bucket.
    You can obtain a bucket name on the Buckets page.
  4. On the Files page, obtain an object path.
  5. In the left-side navigation pane, click Overview.
  6. In the Domain Names section of the Overview page, obtain an endpoint and a bucket domain name.
    We recommend that you use the domain name of VPC Access from ECS (Internal Network) for data access.

Obtain the AccessKey ID and AccessKey secret

For more information about how to obtain the AccessKey ID and AccessKey secret, see Obtain an AccessKey pair.

Create an OSS server

Execute a CREATE SERVER statement to create an OSS server. You must specify a name for the OSS server that you want to access. For more information about CREATE SERVER, see CREATE SERVER.

Syntax

CREATE SERVER server_name
    FOREIGN DATA WRAPPER fdw_name
    [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter Data type Required Description
server_name String Yes The name of the OSS server.
fdw_name String Yes The name of the foreign data wrapper that manages the server. Set the value to oss_fdw.

The following table describes the parameters involved in OPTIONS.

Parameter Data type Required Description
endpoint String Yes The endpoint that is used to access the bucket. For information about how to obtain the endpoint, see the "Preparations" section of this topic.
bucket String No The name of the bucket that stores a data object. For information about how to obtain the bucket name, see the "Preparations" section of this topic.
Note
  • The bucket parameter must be specified for at least one of the OSS server and OSS foreign table. For more information about the bucket parameter specified for the OSS foreign table, see the "Create an OSS foreign table" section of this topic.
  • If the bucket parameter is specified for both the OSS server and OSS foreign table, the bucket parameter specified in the OSS foreign table takes effect.
speed_limit Numeric No The transmission threshold that triggers a timeout. Default value: 1024. Unit: bytes.

If this parameter is specified, the speed_time parameter is required.

Note By default, if less than 1,024 bytes of data is transmitted within 90 consecutive seconds, a timeout is triggered. For more information, see Error handling.
speed_time Numeric No The timeout period threshold. Default value: 90. Unit: seconds.

If this parameter is specified, the speed_limit parameter is required.

Note By default, if less than 1,024 bytes of data is transmitted within 90 consecutive seconds, a timeout is triggered. For more information, see Error handling.
connect_timeout Numeric No The timeout period for connections. Default value: 10. Unit: seconds.
dns_cache_timeout Numeric No The timeout period for DNS resolution. Default value: 60. Unit: seconds.
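
The speed_limit and speed_time behavior described above can be sketched as follows. This is an illustrative model, not the actual OSS FDW implementation: it treats the last speed_time seconds of the transfer as a window of per-second byte counts.

```python
def is_stalled(bytes_per_second, speed_limit=1024, speed_time=90):
    """Illustrative model of the speed_limit/speed_time timeout:
    the transfer times out if fewer than speed_limit bytes in total
    were transmitted over the last speed_time consecutive seconds."""
    if len(bytes_per_second) < speed_time:
        return False  # not enough history yet to trigger a timeout
    return sum(bytes_per_second[-speed_time:]) < speed_limit
```

With the default values, 90 seconds in which nothing is transmitted trigger a timeout, while a transfer that moves even a modest amount of data per second does not.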

Example

CREATE SERVER oss_serv
    FOREIGN DATA WRAPPER oss_fdw
    OPTIONS (
        endpoint 'oss-cn-********.aliyuncs.com',
        bucket 'adb-pg'
  );

You can also execute an ALTER SERVER statement to modify configurations of an OSS server. For more information, see ALTER SERVER.

Examples for modifying configurations of an OSS server:

  • Modify a parameter of the OSS server.
    ALTER SERVER oss_serv OPTIONS(SET endpoint 'oss-cn-********.aliyuncs.com');
  • Add a parameter to the OSS server.
    ALTER SERVER oss_serv OPTIONS(ADD connect_timeout '20');
  • Remove a parameter from the OSS server.
    ALTER SERVER oss_serv OPTIONS(DROP connect_timeout);

You can also execute a DROP SERVER statement to delete an OSS server. For more information, see DROP SERVER.

Create a user mapping to the OSS server

After you create an OSS server, you must create a user that accesses the OSS server. You can execute a CREATE USER MAPPING statement to create a user mapping between an AnalyticDB for PostgreSQL database user and the user that accesses the OSS server. For more information, see CREATE USER MAPPING.

Syntax

CREATE USER MAPPING FOR { username | USER | CURRENT_USER | PUBLIC }
    SERVER servername
    [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter Data type Required Description
username String Yes, specify one of the four The name of the AnalyticDB for PostgreSQL database user to map.
USER String The current database user of the AnalyticDB for PostgreSQL instance.
CURRENT_USER String The current database user of the AnalyticDB for PostgreSQL instance.
PUBLIC String Matches all database users of the AnalyticDB for PostgreSQL instance, including users that are created later.
servername String Yes The name of the OSS server.

The following table describes the parameters involved in OPTIONS.

Parameter Data type Required Description
id String Yes The AccessKey ID that is used to access the OSS bucket. For more information about how to obtain an AccessKey ID, see Obtain an AccessKey pair.
key String Yes The AccessKey secret that is used to access the OSS bucket. For more information about how to obtain an AccessKey secret, see Obtain an AccessKey pair.

Example

CREATE USER MAPPING FOR PUBLIC
    SERVER oss_serv
    OPTIONS (
        id 'LTAI5t7Ge***************',
        key 'FikziJd2La*******************'
    );

You can also execute a DROP USER MAPPING statement to delete a user. For more information, see DROP USER MAPPING.

Create an OSS foreign table

After you create an OSS server and a user to access the server, you can execute a CREATE FOREIGN TABLE statement to create an OSS foreign table. For more information, see CREATE FOREIGN TABLE.

Syntax

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
    column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ] [ COLLATE collation ] [ column_constraint [ ... ] ]
      [, ... ]
] )
    SERVER server_name
  [ OPTIONS ( option 'value' [, ... ] ) ]

Parameters

Parameter Data type Required Description
table_name String Yes The name of the OSS foreign table.
column_name String Yes The name of the column.
data_type String Yes The data type of the column.

The following table describes the parameters involved in OPTIONS.

Parameter Data type Required Description
filepath String Yes, specify one of the three (filepath, prefix, or dir) The name of the object, including its path in OSS.

If the filepath parameter is specified, only the specified object is selected.

prefix String The prefix of the object path. Only the specified prefix is matched and regular expressions are not supported.

If the prefix parameter is specified, all OSS objects whose paths contain this prefix are selected. Examples:

  • If you set prefix to test/filename, the following objects are imported:
    • test/filename
    • test/filenamexxx
    • test/filename/aa
    • test/filenameyyy/aa
    • test/filenameyyy/bb/aa
  • If you set prefix to test/filename/, only the following object out of the preceding objects is imported:
    • test/filename/aa
dir String The directory path in OSS. The directory path must end with a forward slash (/). Example: test/mydir/.

If the dir parameter is specified, all objects in the directory are selected, excluding its subdirectories and objects in the subdirectories.

bucket String No The name of the bucket that stores a data object. For information about how to obtain the bucket name, see the "Preparations" section of this topic.
Note
  • The bucket parameter must be specified for at least one of the OSS server and OSS foreign table.
  • If the bucket parameter is specified for both the OSS server and OSS foreign table, the bucket parameter specified in the OSS foreign table takes effect.
format String Yes The object format. Valid values:
  • csv
  • text
  • orc
  • avro
  • parquet
  • json

    For more information about JSON, see Introducing JSON.

  • jsonline

    In this format, JSON data is delimited by line feeds. All data that can be read by using JSON Lines can also be read by using JSON, but not vice versa. We recommend that you use JSON Lines if possible. For more information about JSON Lines, see JSON Lines.

filetype String No The object type. Default value: plain. Valid values:
  • plain: The object is read as-is without decompression.
  • gzip: The object is decompressed by using GZIP after the raw data is read.
  • snappy: The object is decompressed by using Snappy after the raw data is read.

    Only standard Snappy-compressed objects are supported. Hadoop Snappy-compressed objects are not supported.

Note
  • If the filetype parameter is specified, only CSV, TEXT, JSON, and JSON Lines objects are supported.
  • If the filetype parameter is set to snappy, JSON and JSON Lines objects are not supported.
log_errors Boolean No Specifies whether to record errors in log files. Default value: false. For more information, see Fault tolerance.
Note This parameter is available only for CSV and TEXT objects.
segment_reject_limit Numeric No The error row threshold beyond which the scan stops.

A value that contains a percent sign (%) indicates a percentage of error rows. A value without a percent sign (%) indicates an absolute number of error rows. Examples:

  • segment_reject_limit = '10' indicates that the task stops and reports an error if the number of error rows exceeds 10.
  • segment_reject_limit = '10%' indicates that the task stops and reports an error if the number of error rows exceeds 10% of the processed rows.
Note This parameter is available only for CSV and TEXT objects.
header Boolean No Specifies whether the first row of the source object is a header row that contains field names. Default value: false. Valid values:
  • true: The first row is a header row and is skipped during import.
  • false: The first row contains data.
Note This parameter is available only for CSV objects.
delimiter String No The delimiter between fields. It can be set only to a single-byte character.
  • Default value for CSV objects: comma (,)
  • Default value for TEXT objects: tab character (\t)
Note This parameter is available only for CSV and TEXT objects.
quote String No The quotation mark that encloses fields. It can be set only to a single-byte character. Default value: double quotation mark (").
Note This parameter is available only for CSV objects.
escape String No The character that is used to escape the character specified by the quote parameter. It can be set only to a single-byte character. Default value: double quotation mark (").
Note This parameter is available only for CSV objects.
null String No The string that represents a NULL value in objects.
  • Default value for CSV objects: \N
  • Default value for TEXT objects: an empty string that is not enclosed by quotation marks
Note This parameter is available only for CSV and TEXT objects.
encoding String No The encoding format of data objects. Default value: encoding format of the client.
Note This parameter is available only for CSV and TEXT objects.
force_not_null Boolean No Specifies whether field values are matched against the null string. Default value: false. Valid values:
  • true: Field values are not matched against the null string. Empty values are read as empty strings instead of NULL.
  • false: Field values are matched against the null string.
Note This parameter is available only for CSV and TEXT objects.
force_null Boolean No Specifies whether an empty string is returned as NULL even if it is enclosed by quotation marks. Default value: false. Valid values:
  • true: An empty string is returned as NULL no matter whether its field value is enclosed by quotation marks.
  • false: An empty string is returned as NULL only if its field value is not enclosed by quotation marks.
Note This parameter is available only for CSV and TEXT objects.
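
The filepath, prefix, and dir matching rules described above can be illustrated with a short sketch. The function below is a hypothetical model of the selection logic, not part of OSS FDW:

```python
def select_objects(object_keys, filepath=None, prefix=None, dir=None):
    """Illustrative model of how filepath, prefix, and dir select OSS objects."""
    if filepath is not None:
        # filepath selects exactly the specified object
        return [k for k in object_keys if k == filepath]
    if prefix is not None:
        # prefix matches every object whose key starts with the prefix
        return [k for k in object_keys if k.startswith(prefix)]
    if dir is not None:
        # dir selects objects directly in the directory,
        # excluding subdirectories and objects in them
        return [k for k in object_keys
                if k.startswith(dir) and k != dir and "/" not in k[len(dir):]]
    return []
```

Running this against the keys from the prefix examples above, prefix 'test/filename' selects all five listed objects, while prefix 'test/filename/' selects only test/filename/aa.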

Example

CREATE FOREIGN TABLE ossexample (
    date text,
    time text,
    open float,
    high float,
    low float,
    volume int
) SERVER oss_serv OPTIONS (dir 'dir_oss_adb/', format 'csv');
Note After you create an OSS foreign table, you can use one of the following methods to check whether the OSS objects matched by the foreign table meet your expectations:
  • Method 1
    EXPLAIN VERBOSE SELECT * FROM <Name of the OSS foreign table>;
  • Method 2
    SELECT * FROM get_oss_table_meta('<Name of the OSS foreign table>');

You can also execute a DROP FOREIGN TABLE statement to delete an OSS foreign table. For more information, see DROP FOREIGN TABLE.

Import OSS data to an AnalyticDB for PostgreSQL table

  1. Upload the prepared sample file example.csv to OSS. For more information, see Upload objects.
    Note
    • We recommend that you use the same encoding format for data objects and databases to avoid encoding conversion and improve efficiency. The default database encoding format is UTF-8.
    • All compute nodes of AnalyticDB for PostgreSQL use a polling mechanism to read data objects of OSS in parallel.
    • AnalyticDB for PostgreSQL can read multiple CSV and TEXT objects in parallel. By default, four objects can be read in parallel. To maximize the read efficiency, we recommend that you set the number of objects that can be read in parallel to an integer multiple of the number of compute node cores. The number of compute node cores is the product of the number of compute nodes and the number of cores per compute node. For information about how to split objects, see the "Split objects" section of the Use OSS foreign tables to access OSS data topic.
  2. Connect to an AnalyticDB for PostgreSQL database. For more information, see Use client tools to connect to an instance.
  3. Create an OSS server.
    CREATE SERVER oss_serv
        FOREIGN DATA WRAPPER oss_fdw
        OPTIONS (
            endpoint 'oss-cn-********.aliyuncs.com',
            bucket 'adb-pg'
      );
  4. Create a user to access the OSS server.
    CREATE USER MAPPING FOR PUBLIC
        SERVER oss_serv
        OPTIONS (
            id 'LTAI5t7Ge***************',
            key 'FikziJd2La*******************'
        );
  5. Create an OSS foreign table named ossexample.
    CREATE FOREIGN TABLE ossexample (
        date text,
        time text,
        open float,
        high float,
        low float,
        volume int
    ) SERVER oss_serv OPTIONS (dir 'oss_adb/', format 'csv');
  6. Execute an INSERT or CREATE TABLE AS statement to import OSS data.
    • INSERT
      1. To execute an INSERT statement to import data, create an AnalyticDB for PostgreSQL table named adbexample that has the same schema as the foreign table.
        CREATE TABLE adbexample (
            date text,
            time text,
            open float,
            high float,
            low float,
            volume int
        ) WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN, COMPRESSTYPE=ZSTD, COMPRESSLEVEL=5);
      2. Execute an INSERT statement to import data of the foreign table ossexample to the AnalyticDB for PostgreSQL table adbexample.
        INSERT INTO adbexample SELECT * FROM ossexample;
    • CREATE TABLE AS

      Execute a CREATE TABLE AS statement to create an AnalyticDB for PostgreSQL table named adbexample and import data from the foreign table ossexample.

      CREATE TABLE adbexample AS SELECT * FROM ossexample DISTRIBUTED BY (volume);
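
As noted in step 1, read parallelism works best when the number of CSV or TEXT objects is an integer multiple of the total number of compute node cores. A hypothetical helper that rounds a planned object count up to the nearest such multiple (not part of AnalyticDB for PostgreSQL):

```python
import math

def suggested_object_count(planned_objects, compute_nodes, cores_per_node):
    """Round the planned number of objects up to the nearest integer
    multiple of the total compute cores (nodes x cores per node)."""
    total_cores = compute_nodes * cores_per_node
    return math.ceil(planned_objects / total_cores) * total_cores
```

For example, splitting data into about 10 objects on an instance with 4 compute nodes of 4 cores each would suggest 16 objects, so that each core reads the same number of objects.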

Query and analyze OSS data

You can query data of OSS foreign tables in the same way as you query AnalyticDB for PostgreSQL tables. The following query methods can be used:

  • Query data by using a filter condition.
    SELECT * FROM ossexample WHERE volume = 5;
  • Query data by using an aggregate function.
    SELECT count(*) FROM ossexample WHERE volume = 5;
  • Query data by specifying columns and using GROUP BY and LIMIT clauses.
    SELECT low, sum(volume)
      FROM ossexample
     GROUP BY low
     ORDER BY low
     LIMIT 5;

Join the OSS foreign table with an AnalyticDB for PostgreSQL table for data analysis

  1. Create an AnalyticDB for PostgreSQL table named example for join analysis and insert data into the table.
    CREATE TABLE example (id int, volume int);
    INSERT INTO example VALUES(1,1), (2,3), (4,5);
  2. Join the OSS foreign table ossexample with the AnalyticDB for PostgreSQL table example for data query.
    SELECT example.volume, min(high), max(low)
      FROM ossexample, example
     WHERE ossexample.volume = example.volume
     GROUP BY example.volume
     ORDER BY example.volume;

Fault tolerance

OSS FDW uses the log_errors and segment_reject_limit parameters to provide the fault tolerance feature so that the scanning of OSS foreign tables is not interrupted by errors in raw data.

For more information about the log_errors and segment_reject_limit parameters, see Create an OSS foreign table.

  • Create an OSS foreign table that supports fault tolerance.
    CREATE FOREIGN TABLE oss_error_sales (id int, value float8, x text)
        SERVER oss_serv
        OPTIONS (log_errors 'true',         -- Record the information of error rows.
                 segment_reject_limit '10', -- Specify the threshold to stop scanning. If the number of error rows exceeds 10, the scanning stops. 
                 dir 'error_sales/',        -- Specify the OSS object directory that the foreign table matches.
                 format 'csv',              -- Specify CSV as the format to parse objects.
                 encoding 'utf8');          -- Specify the encoding format. 
  • Query the logs of error rows.
    SELECT * FROM gp_read_error_log('oss_error_sales');
  • Delete the logs of error rows.
    SELECT gp_truncate_error_log('oss_error_sales');
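
The segment_reject_limit semantics used above can be sketched as follows. This is an illustrative model of the threshold check, not the actual implementation:

```python
def exceeds_reject_limit(error_rows, processed_rows, segment_reject_limit):
    """Illustrative check of segment_reject_limit:
    a value that ends with '%' is a percentage of processed rows,
    a plain number is an absolute count of error rows."""
    if segment_reject_limit.endswith("%"):
        threshold = float(segment_reject_limit[:-1])
        if processed_rows == 0:
            return False
        return error_rows * 100.0 / processed_rows > threshold
    return error_rows > int(segment_reject_limit)
```

With segment_reject_limit '10', the 11th error row stops the scan; with '10%', the scan stops only after error rows exceed 10% of the processed rows.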

Data type mappings between ORC and AnalyticDB for PostgreSQL

Data type of ORC Data type of AnalyticDB for PostgreSQL
BOOLEAN bool
SHORT int2
INT int4
LONG int8
FLOAT float4
DOUBLE float8
DECIMAL numeric
CHAR char
STRING text
BINARY bytea
TIMESTAMP timestamp
DATE date
LIST(SHORT) int2[]
LIST(INT) int4[]
LIST(LONG) int8[]
LIST(FLOAT) float4[]
LIST(DOUBLE) float8[]
Note The LIST data type in ORC can be converted to only a one-dimensional array in AnalyticDB for PostgreSQL.

Data type mappings between Parquet and AnalyticDB for PostgreSQL

Note Nested arrays or maps are not supported in Parquet.

The following table describes the data type mappings that apply when Parquet objects do not contain data of logical types.

Data type of Parquet Data type of AnalyticDB for PostgreSQL
BOOLEAN bool
INT32 integer
INT64 bigint
INT96 timestamp
FLOAT float4
DOUBLE float8
BYTE_ARRAY bytea
FIXED_LEN_BYTE_ARRAY bytea

The following table describes the data type mappings that apply when Parquet objects contain data of logical types.

Data type of Parquet Data type of AnalyticDB for PostgreSQL
STRING text
DATE date
TIMESTAMP timestamp
TIME time
INTERVAL interval
DECIMAL numeric
INT(8) and INT(16) smallint
INT(32) integer
INT(64) bigint
UINT(8/16/32/64) bigint
JSON json
BSON jsonb
UUID uuid
ENUM text

Data type mappings between Avro and AnalyticDB for PostgreSQL

The following table describes the data type mappings between Avro and AnalyticDB for PostgreSQL. Multiple Avro data types can be combined into a RECORD data type.

Data type of Avro (including the main type, logical type, and element type) Data type of AnalyticDB for PostgreSQL
BOOLEAN bool
INT integer and smallint
INT and DATE date
LONG bigint
LONG and TIMESTAMP timestamp and timestamptz
LONG and TIME time
FLOAT float
DOUBLE float8
BYTES bytea
BYTES and DECIMAL (precision, scale) numeric (precision, scale)
STRING char, bpchar, varchar, text, and numeric (without specifying precision or scale)
ARRAY and INT int4[] and int2[]
ARRAY and LONG int8[]
ARRAY and FLOAT float[]
ARRAY and DOUBLE float8[]
Note
  • The ARRAY data type in Avro can be converted to only a one-dimensional array in AnalyticDB for PostgreSQL.
  • Complex Avro data types that are not included in the preceding table are not supported.