Use an external table to import data from OSS at a high speed - AnalyticDB for PostgreSQL

AnalyticDB for PostgreSQL supports the gpossext feature, which allows you to use an Object Storage Service (OSS) external table to import data from OSS to AnalyticDB for PostgreSQL in parallel.

Features

gpossext can read from and write to TEXT and CSV files, even when they are compressed in GZIP packages.

The following figure shows the gpossext architecture.

TEXT and CSV format description

The following parameters specify the formats of files read from and written to OSS. You can specify the parameters in the external table DDL parameters.

\n: the string used as a line delimiter or line break for TEXT and CSV files.
DELIMITER: the string used to delimit columns.
- If you specify the DELIMITER parameter, you must also specify the QUOTE parameter.
- Recommended column delimiters include commas (,), vertical bars (|), and special characters such as \t.
QUOTE: encloses user data that contains special characters by column.
- The pair of characters specified by the QUOTE parameter is used to distinguish user data from control characters.
- To improve efficiency, we recommend that you do not enclose data such as integers in QUOTE characters.
- QUOTE characters cannot be the same as those specified by DELIMITER. The default value of QUOTE is a pair of double quotation marks ("").
- User data that contains QUOTE characters must also contain ESCAPE characters to differentiate user data from machine code.
ESCAPE: the escape character.
- Place an escape character before a special character that needs to be escaped to indicate that it is not a special character.
- The default value of ESCAPE is the same as that of QUOTE.
- You can also use other characters as escape characters, such as backslashes (\) used by MySQL by default.

Table 1. Default control characters for TEXT and CSV files
Control character	TEXT	CSV
DELIMITER	\t (Tab)	, (Comma)
QUOTE	" (Double quotation mark)	" (Double quotation mark)
ESCAPE	N/A	" (Double quotation mark)
NULL	\N (Backslash-N)	Empty string without quotation marks

Note

All control characters must be single-byte characters.

Precautions

The syntax to create and use external tables is the same as that of Greenplum Database, except for the syntax of location-related parameters.
The performance of data import and export depends on the OSS performance and resources of AnalyticDB for PostgreSQL instances, such as CPU, I/O, memory, and network resources. To maximize import and export performance, we recommend that you use column store and compression when you create a table. For example, you can specify the following clause: "WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, COMPRESSLEVEL=5, BLOCKSIZE=1048576)". For more information, see CREATE TABLE.
To maximize import and export performance, you must make sure that the OSS bucket and the AnalyticDB for PostgreSQL instance reside in the same region.

Procedure

Create an OSS external table extension.
To use an OSS external table, you must first create an OSS external table extension in AnalyticDB for PostgreSQL. You must create an extension for each database that you need to access. Execute the following statement to create the extension:
```
CREATE EXTENSION IF NOT EXISTS oss_ext;
```
Store the data that you want to import to AnalyticDB for PostgreSQL in multiple OSS files. For more information, see Split files.
Note
Each data partition or compute node of AnalyticDB for PostgreSQL uses the round-robin algorithm to read OSS files in parallel. We recommend that the number of files be an integer multiple of the number of compute nodes. This way, read efficiency is improved.

Create a READABLE external table in AnalyticDB for PostgreSQL.

Execute the following statement to create an OSS external table:

CREATE [READABLE] EXTERNAL TABLE tablename
( columnname datatype [, ...] | LIKE othertable )
LOCATION ('ossprotocol')
FORMAT 'TEXT'
            [( [HEADER]
               [DELIMITER [AS] 'delimiter' | 'OFF']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CSV'
            [( [HEADER]
               [QUOTE [AS] 'quote']
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [FORCE NOT NULL column [, ...]]
               [ESCAPE [AS] 'escape']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
[ ENCODING 'encoding' ]
[ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
       [ROWS | PERCENT] ]
ossprotocol:
   oss://oss_endpoint [prefix=prefix_name|dir=[folder/[folder/]...]/file_name|filepath=[folder/[folder/]...]/file_name]
    id=userossid key=userosskey bucket=ossbucket compressiontype=[none|gzip] async=[true|false]

The following table describes the parameters.

Parameter	Description
FORMAT	The file formats that are supported. Example: TEXT and CSV.
ENCODING	The format that is used to encode data in the files, for example, UTF-8.
LOG ERRORS	Specifies that improperly imported data can be ignored and written to error_table. You can also use the count parameter to specify the error reporting threshold. Note You can use `LOG ERRORS` to record information about the rows that fail to be imported in an internal file. `LOG ERRORS SEGMENT REJECT LIMIT 5;` You can use the `gp_read_error_log('external_table_name')` function to obtain information about the rows that fail to be imported. `SELECT * FROM gp_read_error_log('external_table_name');` After you delete the external table, the internal file is also deleted. You can also use the `gp_truncate_error_log('external_table_name')` function to delete the internal file. `SELECT gp_truncate_error_log('external_table_name');`
oss://oss_endpoint	The protocol and endpoint in the `<Protocol name>://oss_endpoint` format. The protocol name is oss, and the domain name of the OSS region is specified by oss_endpoint. Example: `oss://oss-cn-hangzhou.aliyuncs.com` Important If you access your AnalyticDB for PostgreSQL instance from an Alibaba Cloud server, use an internal domain name to avoid data transfer over the Internet. An internal domain name contains the keyword `internal`.
id	The AccessKey ID of your Alibaba Cloud account. For more information about how to obtain an AccessKey pair, see Create an AccessKey pair.
key	The AccessKey secret of your Alibaba Cloud account. For more information about how to obtain an AccessKey pair, see Create an AccessKey pair.
bucket	The bucket in which the data files are stored. You must create an OSS bucket before you export data.
prefix	The prefix of the path of a data file. Only the specified prefix is matched and regular expressions are not supported. Note You can configure only one of the following parameters: dir, filepath, or prefix. Data from all OSS files whose paths contain this prefix is imported to the READABLE external table. If you set prefix to test/filename, the following files are imported: test/filename test/filenamexxx test/filename/aa test/filenameyyy/aa test/filenameyyy/bb/aa If you set prefix to test/filename/, only the following file out of the preceding files is imported: test/filename/aa
dir	The path of the virtual directory in which the OSS files are stored. Note You can configure only one of the following parameters: prefix, filepath, or dir. The path of the directory must end with a `forward slash (/)`. Example: `test/mydir/`. If you specify this parameter when you create an external table to import data, all data files that are stored in the specified directory are imported. The data files that are stored in the subdirectories of the specified directory are not imported. The dir parameter is different from the filepath parameter and does not require you to specify the names of files in a directory.
filepath	The name of a file that contains the path in OSS. Note You can configure only one of the following parameter: prefix, dir, or filepath. You can specify this parameter only when you create a READABLE external table. This parameter is available only when you import data.
compressiontype	The compression format of the files that you want to import. Valid values: none: The files are not compressed. This is the default value. gzip: The files are compressed in the gzip format. Note Only the gzip format is supported.
compressionlevel	The compression level of the files that are written to OSS. Valid values: 1 to 9. Default value: 6. Example: `compressionlevel=6`
oss_connect_timeout	The connection timeout period. Default value: 10. Unit: seconds.
oss_dns_cache_timeout	The timeout period for DNS resolution. Default value: 60. Unit: seconds.
oss_speed_limit	The minimum amount of data that is transmitted per second. If the amount of data that is transmitted per second is less than the specified value for a specific period of time, a timeout is triggered. Unit: bytes. Default value: 1024. 1024 bytes is equal to 1 KB. If this parameter is specified, the oss_speed_time parameter is required. Note If you use the default values for the oss_speed_limit and oss_speed_time parameters and if the transmission rate is lower than 1 Kbit/s for 15 consecutive seconds, a timeout occurs. For more information, see Error handling.
oss_speed_time	The maximum period of time in which the minimum transmission rate can be tolerated. If the transmission rate is lower than the specified value for the specified period of time, a timeout is triggered. Default value: 15. Unit: seconds. If this parameter is specified, the oss_speed_limit parameter is required. Note If you use the default values for the oss_speed_limit and oss_speed_time parameters and if the transmission rate is lower than 1 Kbit/s for 15 consecutive seconds, a timeout occurs. For more information, see Error handling.
async	Specifies whether to enable asynchronous data import. You can enable an auxiliary thread to accelerate the import of data from OSS. By default, asynchronous data import is enabled. You can specify `async = false` or `async = f` to disable asynchronous data import. Asynchronous data import consumes more hardware resources than normal data import.

Import data in parallel
Execute the following statement in AnalyticDB for PostgreSQL to import data from OSS to AnalyticDB for PostgreSQL in parallel:
```
INSERT INTO <Destination table> SELECT * FROM <External table>
```

Examples

This topic uses a table named example to describe how to import data from OSS.

Create an extensiton for an OSS external table.
Execute the following statement to create the extension:
```
CREATE EXTENSION IF NOT EXISTS oss_ext;
```

Execute the following statement to create a destination table named example:

CREATE TABLE example
        (date text, time text, open float,
         high float, low float, volume int)
         DISTRIBUTED BY (date);

Create an OSS external table named ossexample to import data.

You can use the prefix parameter to specify the path of the data files that you want to import. Example:

CREATE READABLE EXTERNAL TABLE ossexample
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        prefix=osstest/example id=XXX
        key=XXX bucket=testbucket compressiontype=gzip')
        FORMAT 'csv' (QUOTE '''' DELIMITER E'\t')
        ENCODING 'utf8'
        LOG ERRORS SEGMENT REJECT LIMIT 5;

You can use the dir parameter to specify the path of the data files that you want to import. Example:

CREATE READABLE EXTERNAL TABLE ossexample
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        dir=osstest/ id=XXX
        key=XXX bucket=testbucket')
        FORMAT 'csv'
        LOG ERRORS SEGMENT REJECT LIMIT 5;

You can use the filepath parameter to specify the path of the data files that you want to import. Example:

CREATE READABLE EXTERNAL TABLE ossexample
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        filepath=osstest/example.csv id=XXX
        key=XXX bucket=testbucket')
        FORMAT 'csv'
        LOG ERRORS SEGMENT REJECT LIMIT 5;

Import data in parallel from the ossexample external table to the example table.
```
INSERT INTO example SELECT * FROM ossexample;
```

Execute the following query plan. The result shows that the compute nodes import data from OSS in parallel. The redistribution motion node hashes the data and distributes the data to corresponding compute nodes. The compute nodes that receive the data insert the data to a database.

EXPLAIN INSERT INTO example SELECT * FROM ossexample;
                                            QUERY PLAN
-----------------------------------------------------------------------------------------------
 Insert (slice0; segments: 4)  (rows=250000 width=92)
   ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.00..11000.00 rows=250000 width=92)
         Hash Key: ossexample.date
         ->  External Scan on ossexample  (cost=0.00..11000.00 rows=250000 width=92)
(4 rows)

SDK troubleshooting

When an error occurs during the import or export process, the error log contains the following information:

code: the HTTP status code of the request that has failed.
error_code: the error code that is returned by OSS.
error_msg: the error message that is returned by OSS.
req_id: the universally unique identifier (UUID) of the request that has failed. If you require assistance in solving a problem, you can submit a ticket that contains the req_id value of the failed request.

For more information, see Error responses. You can handle timeout-related errors by using parameters related to oss_ext.

AnalyticDB for PostgreSQL:Use an external table to import data from OSS at a high speed