This topic describes how to use Object Storage Service (OSS) foreign tables to import OSS data for data analysis based on OSS foreign data wrapper (FDW).
Description
OSS FDW is developed based on PostgreSQL FDW. OSS FDW allows you to perform the following operations:
- Import OSS data to row-oriented or column-oriented tables of AnalyticDB for PostgreSQL instances for accelerated data analysis.
- Query and analyze large amounts of OSS data.
- Join OSS foreign tables with AnalyticDB for PostgreSQL tables for data analysis.
OSS FDW supports the following data objects in a variety of business scenarios:
- Uncompressed CSV, TEXT, JSON, and JSON Lines objects.
- GZIP- and standard Snappy-compressed CSV and TEXT objects.
- GZIP-compressed JSON and JSON Lines objects.
- ORC binary objects.
- Parquet binary objects.
- Arvo binary objects.
Preparations
Prepare OSS data
Prepare a sample file named example.csv.
Obtain the OSS bucket information
The following procedure demonstrates how to obtain the bucket name, object path, endpoint, and bucket domain name.
Obtain the AccessKey ID and AccessKey secret
For more information about how to obtain the AccessKey ID and AccessKey secret, see Obtain an AccessKey pair.
Create an OSS server
Execute a CREATE SERVER statement to create an OSS server. You must specify a name for the OSS server that you want to access. For more information about CREATE SERVER, see CREATE SERVER.
Syntax
CREATE SERVER server_name
FOREIGN DATA WRAPPER fdw_name
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameters
Parameter | Data type | Required | Description |
---|---|---|---|
server_name | String | Yes | The name of the OSS server. |
fdw_name | String | Yes | The name of the foreign data wrapper that manages the server. It is automatically set to oss_fdw. |
The following table describes the parameters involved in OPTIONS.
Parameter | Data type | Required | Description |
---|---|---|---|
endpoint | String | Yes | The domain name of the bucket. For information about how to obtain the domain name, see the "Preparations" section of this topic. |
bucket | String | No | The name of the bucket that stores a data object. For information about how to obtain
the bucket name, see the "Preparations" section of this topic.
Note
|
speed_limit | Numeric | No | The transmission threshold that triggers a timeout. Default value: 1024. Unit: bytes.
If this parameter is specified, the speed_time parameter is required. Note By default, if less than 1,024 bytes of data is transmitted within 90 consecutive
seconds, a timeout is triggered. For more information, see Error handling.
|
speed_time | Numeric | No | The timeout period threshold. Default value: 90. Unit: seconds.
If this parameter is specified, the speed_limit parameter is required. Note By default, if less than 1,024 bytes of data is transmitted within 90 consecutive
seconds, a timeout is triggered. For more information, see Error handling.
|
connect_timeout | Numeric | No | The timeout period for connections. Default value: 10. Unit: seconds. |
dns_cache_timeout | Numeric | No | The timeout period for DNS resolution. Default value: 60. Unit: seconds. |
Example
CREATE SERVER oss_serv
FOREIGN DATA WRAPPER oss_fdw
OPTIONS (
endpoint 'oss-cn-********.aliyuncs.com',
bucket 'adb-pg'
);
You can also execute an ALTER SERVER statement to modify configurations of an OSS server. For more information, see ALTER SERVER.
Examples for modifying configurations of an OSS server:
- Modify a parameter of the OSS server.
ALTER SERVER oss_serv OPTIONS(SET endpoint 'oss-cn-********.aliyuncs.com');
- Add a parameter to the OSS server.
ALTER SERVER oss_serv OPTIONS(ADD connect_timeout '20');
- Remove a parameter from the OSS server.
ALTER SERVER oss_serv OPTIONS(DROP connect_timeout);
You can also execute a DROP SERVER statement to delete an OSS server. For more information, see DROP SERVER.
Create a user mapping to the OSS server
After you create an OSS server, you must create a user that accesses the OSS server. You can execute a CREATE USER MAPPING statement to create a user mapping between an AnalyticDB for PostgreSQL database user and the user that accesses the OSS server. For more information, see CREATE USER MAPPING.
Syntax
CREATE USER MAPPING FOR { username | USER | CURRENT_USER | PUBLIC }
SERVER servername
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameters
Parameter | Data type | Required | Description |
---|---|---|---|
username | String | Yes, one of four parameters | The database username of the AnalyticDB for PostgreSQL instance to be specified to map. |
USER | String | The database username of the current AnalyticDB for PostgreSQL instance to be mapped. | |
CURRENT_USER | String | ||
PUBLIC | String | Matches database usernames of all AnalyticDB for PostgreSQL instances, including users to be created later. | |
servername | String | Yes | The name of the OSS server. |
The following table describes the parameters involved in OPTIONS.
Parameter | Data type | Required | Description |
---|---|---|---|
id | String | Yes | The AccessKey ID that is used to access the OSS bucket. For more information about how to obtain an AccessKey ID, see Obtain an AccessKey pair. |
key | String | Yes | The AccessKey secret that is used to access the OSS bucket. For more information about how to obtain an AccessKey secret, see Obtain an AccessKey pair. |
Example
CREATE USER MAPPING FOR PUBLIC
SERVER oss_serv
OPTIONS (
id 'LTAI5t7Ge***************',
key 'FikziJd2La*******************'
);
You can also execute a DROP USER MAPPING statement to delete a user. For more information, see DROP USER MAPPING.
Create an OSS foreign table
After you create an OSS server and a user to access the server, you can execute a CREATE FOREIGN TABLE statement to create an OSS foreign table. For more information, see CREATE FOREIGN TABLE.
Syntax
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [
column_name data_type [ OPTIONS ( option 'value' [, ... ] ) ] [ COLLATE collation ] [ column_constraint [ ... ] ]
[, ... ]
] )
SERVER server_name
[ OPTIONS ( option 'value' [, ... ] ) ]
Parameters
Parameter | Data type | Required | Description |
---|---|---|---|
table_name | String | Yes | The name of the OSS foreign table. |
column_name | String | Yes | The name of the column. |
data_type | String | Yes | The data type of the column. |
The following table describes the parameters involved in OPTIONS.
Parameter | Data type | Required | Description |
---|---|---|---|
filepath | String | Yes, one of three parameters | The object name that contains the OSS object path.
If the filepath parameter is specified, only the specified object is selected. |
prefix | String | The prefix of the object path. Regular expressions are not supported.
If the prefix parameter is specified, all OSS objects stored in the object path whose name starts with the prefix are selected. Examples:
|
|
dir | String | The path of the OSS folder that stores data objects. The folder path must end with
a forward slash (/). Example: test/mydir/.
If the dir parameter is specified, all objects stored in the path are selected, excluding its subdirectories and objects in the subdirectories. |
|
bucket | String | No | The name of the bucket that stores data objects. For information about how to obtain
the bucket name, see the "Preparations" section of this topic.
Note
|
format | String | Yes | The object format. Valid values:
|
filetype | String | No | The object type. Default value: plain. Valid values:
Note
|
log_errors | Boolean | No | Specifies whether to record errors in log files. Default value: false. For more information,
see Fault tolerance.
Note This parameter takes effect only for CSV and TEXT objects.
|
segment_reject_limit | Numeric | No | The number of abort errors.
A value that includes a percent sign (%) indicates the percentage of error rows. A value without percent signs (%) indicates the number of error rows. Examples:
Note This parameter takes effect only for CSV and TEXT objects.
|
header | Boolean | No | Specifies whether to include the header row for fields in the source object. Default
value: false. Valid values:
Note This parameter takes effect only for CSV objects.
|
delimiter | String | No | The delimiter between fields. It can be set only to a single-byte character.
Note This parameter takes effect only for CSV and TEXT objects.
|
quote | String | No | The quotation mark that encloses fields. It can be set only to a single-byte character.
Default value: double quotation mark (").
Note This parameter takes effect only for CSV objects.
|
escape | String | No | The string that matches the quote parameter. It can be set only to a single-byte character.
Default value: double quotation mark (").
Note This parameter takes effect only for CSV objects.
|
null | String | No | The NULL string in objects.
Note This parameter takes effect only for CSV and TEXT objects.
|
encoding | String | No | The encoding format of data objects. Default value: encoding format of the client.
Note This parameter takes effect only for CSV and TEXT objects.
|
force_not_null | Boolean | No | Specifies whether the field value cannot be an empty string. Default value: false.
Valid values:
Note This parameter takes effect only for CSV and TEXT objects.
|
force_null | Boolean | No | The method used to process an empty string. Default value: false. Valid values:
Note This parameter takes effect only for CSV and TEXT objects.
|
Example
CREATE FOREIGN TABLE ossexample (
date text,
time text,
open float,
high float,
low float,
volume int
) SERVER oss_serv OPTIONS (dir 'dir_oss_adb/', format 'csv');
- Method 1
EXPLAIN VERBOSE SELECT * FROM <Name of the OSS foreign table>;
- Method 2
SELECT * FROM get_oss_table_meta('<Name of the OSS foreign table>');
You can also execute a DROP FOREIGN TABLE statement to delete an OSS foreign table. For more information, see DROP FOREIGN TABLE.
Import OSS data to an AnalyticDB for PostgreSQL table
Query and analyze OSS data
You can query the data of OSS foreign tables in the same way as that for AnalyticDB for PostgreSQL tables. The following query methods can be used:
- Query data by specifying a key-value pair.
SELECT * FROM ossexample WHERE volume = 5;
- Query data by using an aggregate function.
SELECT count(*) FROM ossexample WHERE volume = 5;
- Query data by specifying columns and using GROUP BY and LIMIT clauses.
SELECT low, sum(volume) FROM ossexample GROUP BY low ORDER BY low limit 5;
Join the OSS foreign table with an AnalyticDB for PostgreSQL table for data analysis
Fault tolerance
OSS FDW uses the log_errors and segment_reject_limit parameters to provide the fault tolerance feature so that the scanning of OSS foreign tables is not interrupted by errors in raw data.
For more information about the log_errors and segment_reject_limit parameters, see Create an OSS foreign table.
- Create an OSS foreign table that supports fault tolerance.
CREATE FOREIGN TABLE oss_error_sales (id int, value float8, x text) SERVER oss_serv OPTIONS (log_errors 'true', -- Record the information of error rows. segment_reject_limit '10', -- Specify the threshold to stop scanning. If the number of error rows exceeds 10, the scanning stops. dir 'error_sales/', -- Specify the OSS object directory that the foreign table matches. format 'csv', -- Specify CSV as the format to parse objects. encoding 'utf8'); -- Specify the encoding format.
- Query the logs of error rows.
SELECT * FROM gp_read_error_log('oss_error_sales');
- Delete the logs of error rows.
SELECT gp_truncate_error_log('oss_error_sales');