All Products
Search
Document Center

AnalyticDB for PostgreSQL:Use OSS external tables to export data to OSS

Last Updated:May 30, 2023

AnalyticDB for PostgreSQL allows you to export data to Object Storage Service (OSS) tables in parallel by using the OSS external table feature called gpossext. AnalyticDB for PostgreSQL also supports GZIP compression for OSS external tables to reduce file size and storage costs.

Features

gpossext can read from and write to TEXT and CSV files, even when they are compressed in GZIP packages.

The following figure shows the gpossext architecture.

OSS

TEXT and CSV format description

The following parameters specify the formats of files read from and written to OSS. You can specify the parameters in the external table DDL parameters.

  • \n: the string used as a line delimiter or line break for TEXT and CSV files.

  • DELIMITER: the string used to delimit columns.

    • If you specify the DELIMITER parameter, you must also specify the QUOTE parameter.

    • Recommended column delimiters include commas (,), vertical bars (|), and special characters such as \t.

  • QUOTE: encloses user data that contains special characters by column.

    • The pair of characters specified by the QUOTE parameter is used to distinguish user data from control characters.

    • To improve efficiency, we recommend that you do not enclose data such as integers in QUOTE characters.

    • QUOTE characters cannot be the same as those specified by DELIMITER. The default value of QUOTE is a pair of double quotation marks ("").

    • User data that contains QUOTE characters must also contain ESCAPE characters to differentiate user data from machine code.

  • ESCAPE: the escape character.

    • Place an escape character before a special character that needs to be escaped to indicate that it is not a special character.

    • The default value of ESCAPE is the same as that of QUOTE.

    • You can also use other characters as escape characters, such as backslashes (\) used by MySQL by default.

Table 1. Default control characters for TEXT and CSV files

Control character

TEXT

CSV

DELIMITER

\t (Tab)

, (Comma)

QUOTE

" (Double quotation mark)

" (Double quotation mark)

ESCAPE

N/A

" (Double quotation mark)

NULL

\N (Backslash-N)

Empty string without quotation marks

Note

All control characters must be single-byte characters.

Precautions

  • The syntax to create and use external tables is the same as that of Greenplum Database, except for the syntax of location-related parameters.

  • The performance of data import and export depends on the OSS performance and resources of AnalyticDB for PostgreSQL instances, such as CPU, I/O, memory, and network resources. To maximize import and export performance, we recommend that you use column store and compression when you create a table. For example, you can specify the following clause: "WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, COMPRESSLEVEL=5, BLOCKSIZE=1048576)". For more information, see CREATE TABLE.

  • To maximize import and export performance, you must make sure that the OSS bucket and the AnalyticDB for PostgreSQL instance reside in the same region.

Procedure

  1. Create an OSS external table extension.

    To use an OSS external table, you must first create an OSS external table extension in AnalyticDB for PostgreSQL. You must create an extension for each database that you need to access. Execute the following statement to create the extension:

    CREATE EXTENSION IF NOT EXISTS oss_ext;
  2. Create a WRITABLE external table in AnalyticDB for PostgreSQL.

    Execute the following statement to create the WRITABLE external table:

    CREATE WRITABLE EXTERNAL TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
    LOCATION ('ossprotocol')
    FORMAT 'TEXT'
                   [( [DELIMITER [AS] 'delimiter']
                   [NULL [AS] 'null string']
                   [ESCAPE [AS] 'escape' | 'OFF'] )]
              | 'CSV'
                   [([QUOTE [AS] 'quote']
                   [DELIMITER [AS] 'delimiter']
                   [NULL [AS] 'null string']
                   [FORCE QUOTE column [, ...]] ]
                   [ESCAPE [AS] 'escape'] )]
    [ ENCODING 'encoding' ]
    [ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]
    ossprotocol:
       oss://oss_endpoint [prefix=prefix_name|dir=[folder/[folder/]...]/file_name]
        id=userossid key=userosskey bucket=ossbucket compressiontype=[none|gzip] async=[true|false]

    The following table describes the parameters.

    Parameter

    Description

    WRITABLE

    The keyword of the external table for data export. You must specify this keyword when you create an external table.

    FORMAT

    The supported file formats, such as TEXT and CSV.

    ENCODING

    The format that is used to encode data in the files, such as UTF-8.

    DISTRIBUTED BY

    The clause that writes data from compute nodes to OSS based on the distribution keys.

    oss://oss_endpoint

    The communication protocol and endpoint address in the <Protocol name>://oss_endpoint format. The protocol name is oss, and oss_endpoint indicates the domain name of the OSS region. Example:

    oss://oss-cn-hangzhou.aliyuncs.com
    Important

    If you access your AnalyticDB for PostgreSQL instance from an Alibaba Cloud server, use an internal endpoint to avoid incurring Internet traffic. An internal endpoint contains the keyword internal.

    id

    The AccessKey ID of your Alibaba Cloud account.

    key

    The AccessKey secret of your Alibaba Cloud account.

    bucket

    The bucket to which you want to export the data file. You must first create the OSS bucket before you export data.

    prefix

    The prefix of the path name corresponding to the data file. Prefixes are directly matched and cannot be controlled by regular expressions.

    Note

    You can specify only one of the prefix and dir parameters.

    When the WRITABLE external table is used for data export, unique names are generated for each exported file based on this parameter.

    For example, if you set this parameter to osstest/exp/outfromhdb, the files are exported to the osstest/exp/ path, and the names of all the exported files start with outfromhdb.

    dir

    The folder that is used to match objects stored in the OSS bucket.

    Note

    You can specify only one of the prefix and dir parameters.

    • A folder path must end with a forward slash (/). Example: test/mydir/.

    • If you use this parameter when you create an external table for data export, all data is exported as multiple files within the specified directory. The names of exported files are in the filename.x format, where x indicates a number. The values of x may not be consecutive.

    compressiontype

    The compression format of the exported files. Default value: none. Valid values:

    • none: The exported files are not compressed.

    • gzip: The exported files are compressed in the GZIP format.

    Note

    Only the GZIP format is supported.

    num_parallel_worker

    The number of parallel compression threads for data that is written to OSS. Valid values: 1 to 8. Default value: 3. Example:

    num_parallel_worker=3

    oss_flush_block_size

    The size of each data block that is written to OSS. Valid values: 1 to 128. Default value: 32. Unit: MB. Example:

    oss_flush_block_size=32

    oss_file_max_size

    The maximum size for each file that is written to OSS. If the limit is exceeded, subsequent data is written to another file. Valid values: 8 to 1000. Default value: 1024. Unit: MB. Example:

    oss_file_max_size=1024

    oss_connect_timeout

    The connection timeout period. Default value: 10. Unit: seconds.

    oss_dns_cache_timeout

    The timeout period for DNS resolution. Default value: 60. Unit: seconds.

    oss_speed_limit

    The minimum amount of data transmitted per second. If the amount of data transmitted per second is lower than the specified value for a specific period of time, a timeout is triggered. Unit: bytes. Default value: 1024. 1024 bytes equals 1 KB.

    If this parameter is specified, the oss_speed_time parameter is required.

    Note

    When the default values are used for the oss_speed_limit and oss_speed_time parameters, a timeout occurs if the transmission rate is lower than 1 Kbit/s for 15 consecutive seconds. For more information, see Error handling.

    oss_speed_time

    The maximum period of time that the minimum amount of data transmitted can be tolerated. If the amount of data transmitted per second is lower than a specific value for the specified period of time, a timeout is triggered. Default value: 15. Unit: seconds.

    If this parameter is specified, the oss_speed_limit parameter is required.

    Note

    When the default values are used for the oss_speed_limit and oss_speed_time parameters, a timeout occurs if the transmission rate is lower than 1 Kbit/s for 15 consecutive seconds. For more information, see Error handling

  3. Export data in parallel.

    Execute the following statement in the AnalyticDB for PostgreSQL database to export the data to OSS in parallel.

    INSERT INTO <External table> SELECT * FROM <Source table>

Example

In this example, data from a source table named example is exported to OSS.

  1. Execute the following statement to create an OSS external table extension:

    CREATE EXTENSION IF NOT EXISTS oss_ext;
  2. Execute the following statement to create a table named example that stores the data to be exported:

    CREATE TABLE example
            (date text, time text, open float,
             high float, low float, volume int)
             DISTRIBUTED BY (date);
  3. Create an OSS external table to which data is exported.

    • Execute the following statement if the prefix parameter is used to specify the path of the table:

      CREATE WRITABLE EXTERNAL TABLE ossexample_exp
              (date text, time text, open float, high float,
              low float, volume int)
              location('oss://oss-cn-hangzhou.aliyuncs.com
              prefix=osstest/exp/outfromhdb id=XXX
              key=XXX bucket=testbucket') FORMAT 'csv'
              DISTRIBUTED BY (date);
    • Execute the following statement if the dir parameter is used to specify the path of the table:

      CREATE WRITABLE EXTERNAL TABLE ossexample_exp
              (date text, time text, open float, high float,
              low float, volume int)
              location('oss://oss-cn-hangzhou.aliyuncs.com
              dir=osstest/exp/ id=XXX
              key=XXX bucket=testbucket') FORMAT 'csv'
              DISTRIBUTED BY (date);
  4. Execute the following statement to export data from the example table to OSS in parallel:

    INSERT INTO ossexample_exp SELECT * FROM example;

Execute the following query plan, which indicates that each compute node directly exports data to OSS without redistributing the data.

EXPLAIN INSERT INTO ossexample_exp SELECT * FROM example;

The following result is returned:

                          QUERY PLAN
---------------------------------------------------------------
 Insert (slice0; segments: 3)  (rows=1 width=92)
   ->  Seq Scan on example  (cost=0.00..0.00 rows=1 width=92)
(2 rows)

SDK troubleshooting

When an error occurs during the import or export process, the error log contains the following information:

  • code: the HTTP status code of the request that has failed.

  • error_code: the error code that is returned by OSS.

  • error_msg: the error message that is returned by OSS.

  • req_id: the universally unique identifier (UUID) of the request that has failed. If you require assistance in solving a problem, you can submit a ticket that contains the req_id value of the failed request.

For more information, see Error responses. You can handle timeout-related errors by using parameters related to oss_ext.

References