Object Storage Service (OSS) is a storage service that enables you to store any amount of data in Alibaba Cloud. AnalyticDB for PostgreSQL can import and export OSS data in parallel by using OSS external tables, a feature known as gpossext. AnalyticDB for PostgreSQL can also compress OSS external table files in the GZIP format to reduce storage space and costs.

Currently, gpossext can read data from and write data to both gzip-compressed and uncompressed TEXT and CSV files.

Operations

When you use OSS external tables in AnalyticDB for PostgreSQL, you may need to perform the following operations:

Create an OSS external table plug-in (oss_ext)

To use an OSS external table, you must first create the OSS external table plug-in in AnalyticDB for PostgreSQL. The plug-in must be created in each database that needs to access OSS.

  • To create the plug-in, execute the CREATE EXTENSION IF NOT EXISTS oss_ext; statement.
  • To delete the plug-in, execute the DROP EXTENSION IF EXISTS oss_ext; statement.
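
For example, the following minimal sketch creates the plug-in in the current database and then verifies the installation by querying the standard PostgreSQL pg_extension catalog. Run it in every database that needs to access OSS:

CREATE EXTENSION IF NOT EXISTS oss_ext;
-- Verify that the plug-in is installed in the current database.
SELECT extname, extversion FROM pg_extension WHERE extname = 'oss_ext';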

Import data in parallel

Perform the following steps to import data:

  1. Distribute data evenly among multiple files in OSS.

    Notice

    All compute nodes of an AnalyticDB for PostgreSQL instance read data in parallel from the data files stored in OSS by using a round-robin algorithm. For the best read efficiency, we recommend that the number of data files in OSS be an integer multiple of the number of compute nodes in your AnalyticDB for PostgreSQL instance. (A query to check the number of compute nodes is sketched after these steps.)

  2. Create a readable external table for each database in your AnalyticDB for PostgreSQL instance.

  3. Execute the following statement to import data in parallel:
    INSERT INTO <Destination table> SELECT * FROM <External table>
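
As noted in the preceding notice, read efficiency is best when the number of OSS data files is an integer multiple of the number of compute nodes. The following query sketch counts the primary compute nodes; it assumes that the standard Greenplum gp_segment_configuration catalog is accessible in your instance:

-- Count the primary segments (compute nodes). content >= 0 excludes the master node.
SELECT count(*) AS compute_nodes
FROM gp_segment_configuration
WHERE role = 'p' AND content >= 0;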

Export data in parallel

Perform the following steps to export data:

  1. Create a writable external table for each database in your AnalyticDB for PostgreSQL instance.

  2. Execute the following statement to export data to OSS in parallel:
    INSERT INTO <External table> SELECT * FROM <Source table>

Syntax for creating OSS external tables

Use the following syntax to create readable and writable OSS external tables:

CREATE [READABLE] EXTERNAL TABLE tablename
( columnname datatype [, ...] | LIKE othertable )
LOCATION ('ossprotocol')
FORMAT 'TEXT'
            [( [HEADER]
               [DELIMITER [AS] 'delimiter' | 'OFF']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CSV'
            [( [HEADER]
               [QUOTE [AS] 'quote']
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [FORCE NOT NULL column [, ...]]
               [ESCAPE [AS] 'escape']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
[ ENCODING 'encoding' ]
[ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
       [ROWS | PERCENT] ]
CREATE WRITABLE EXTERNAL TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION ('ossprotocol')
FORMAT 'TEXT'
               [( [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF'] )]
          | 'CSV'
               [([QUOTE [AS] 'quote']
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [FORCE QUOTE column [, ...]]
               [ESCAPE [AS] 'escape'] )]
[ ENCODING 'encoding' ]
[ DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY ]
ossprotocol:
   oss://oss_endpoint prefix=prefix_name
    id=userossid key=userosskey bucket=ossbucket compressiontype=[none|gzip] async=[true|false]
ossprotocol:
   oss://oss_endpoint dir=[folder/[folder/]...]/file_name
    id=userossid key=userosskey bucket=ossbucket compressiontype=[none|gzip] async=[true|false]
ossprotocol:
   oss://oss_endpoint filepath=[folder/[folder/]...]/file_name
    id=userossid key=userosskey bucket=ossbucket compressiontype=[none|gzip] async=[true|false]

Parameters

This section describes the parameters used in the preceding operations.

Common parameters

  • Protocol and endpoint: the communication protocol and endpoint address in the "Protocol name://oss_endpoint" format. The protocol name is oss and oss_endpoint indicates the domain name of the OSS region.

    Notice

    If you access your AnalyticDB for PostgreSQL instance from a host on Alibaba Cloud, we recommend that you use an internal endpoint (one that contains "internal" in its name) to avoid incurring Internet traffic.

  • id: the AccessKey ID of the OSS account.

  • key: the AccessKey secret of the OSS account.

  • bucket: the bucket where the data files you want to access are stored. It must be an existing bucket in OSS.

  • prefix: the prefix in the name of the directory used to store the data files you want to access. The prefix is matched literally; regular expressions are not supported. The prefix, filepath, and dir parameters are mutually exclusive. Only one of them can be specified at a time.

    • If you specify the prefix parameter when you create a readable external table for data import, all data files with the specified prefix in their names are imported from OSS.

      • If you set the prefix parameter to test/filename, all of the following data files will be imported:
        • test/filename
        • test/filenamexxx
        • test/filename/aa
        • test/filenameyyy/aa
        • test/filenameyyy/bb/aa
      • If you set the prefix parameter to test/filename/, only the following file out of the preceding files will be imported:
        • test/filename/aa
    • If you specify the prefix parameter when you create a writable external table for data export, a unique name with the specified prefix is generated for each data file exported to OSS.

  • dir: the virtual directory in OSS. The prefix, filepath, and dir parameters are mutually exclusive. Only one of them can be specified at a time.

    • The specified directory must end with a forward slash (/), such as test/mydir/.
    • If you specify this parameter when you create an external table for data import, all data files stored directly in the specified directory are imported, but data files in its subdirectories are not. Unlike the filepath parameter, the dir parameter does not require the data files to follow specific naming conventions.
    • If you specify this parameter when you create an external table for data export, all data is exported as multiple files to the specified directory. The exported data files are named in the filename.x format, where x indicates a number. The values of x may be out of order.
  • filepath: the file name used to filter the OSS data files you want to import. The file name contains the directory name. The prefix, filepath, and dir parameters are mutually exclusive. Only one of them can be specified at a time. In addition, you can only specify the filepath parameter when you create a readable external table for data import.

    • The file name specified by this parameter must contain the directory name but not the bucket name.
    • Only the files named as filename or in the filename.x format are imported. In addition, the values of x must be consecutive numbers starting from 1. Assume that the following data files are stored in OSS and the filepath parameter is set to filename:
      filename
      filename.1
      filename.2
      filename.4

      Then, the filename, filename.1, and filename.2 files are imported. The filename.4 file is not imported because there is no file named filename.3.

Import mode parameters

  • async: specifies whether to import data asynchronously.

    • You can enable an auxiliary thread to expedite the import of data from OSS.

    • The asynchronous mode is enabled by default. If you want to disable it, set the async parameter to false or f.

    • The asynchronous mode consumes more hardware resources than the normal data import mode.

  • compressiontype: the format used to compress imported data files.

    • If you retain the default value none, imported files are not compressed.

    • If you set this parameter to gzip, imported files are compressed in the GZIP format. Currently, only the GZIP format is supported.

  • compressionlevel: the level of compression for data files written to OSS. Valid values: 1 to 9. Default value: 6.
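
The following sketch combines the import mode parameters in a readable external table. The endpoint, AccessKey pair, and bucket are placeholder values reused from the examples in this topic:

create readable external table ossexample_gz
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        prefix=osstest/example id=XXX
        key=XXX bucket=testbucket
        compressiontype=gzip async=false')
        FORMAT 'csv'
        ENCODING 'utf8';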

Export mode parameters

  • oss_flush_block_size: the buffer size for a single data flush to OSS. Unit: MB. Valid values: 1 to 128. Default value: 32.

  • oss_file_max_size: the maximum size of a data file allowed to be written to OSS. If a data file reaches the maximum size, the data that remains is written to other data files. Unit: MB. Valid values: 8 to 4000. Default value: 1024.

  • num_parallel_worker: the maximum number of threads that are allowed to run in parallel to compress the data written to OSS. Valid values: 1 to 8. Default value: 3.

  • compressiontype: the compression format of exported data files.

    • If you retain the default value none, exported files are not compressed.

    • If you set this parameter to gzip, exported files are compressed in the GZIP format. Currently, only the GZIP format is supported.

Take note of the following points when you configure export mode parameters:

  • You must specify the WRITABLE keyword when creating an external table for data export.

  • Only the prefix and dir parameters are supported for data export. The filepath parameter is not supported.

  • You can use the DISTRIBUTED BY clause to write data from compute nodes to OSS based on the specified distribution key.
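
The following sketch creates a writable external table that compresses exported files in the GZIP format and caps each exported file at 512 MB. It assumes that the export mode parameters are appended to the LOCATION string alongside the other OSS options, in the same way as compressiontype:

create WRITABLE external table ossexample_gzexp
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        dir=osstest/exp/ id=XXX
        key=XXX bucket=testbucket
        compressiontype=gzip oss_file_max_size=512')
        FORMAT 'csv'
        DISTRIBUTED BY (date);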

Other parameters

The following timeout parameters can be used for data import and export:

  • oss_connect_timeout: the connection timeout period. Unit: seconds. Default value: 10.

  • oss_dns_cache_timeout: the DNS timeout period. Unit: seconds. Default value: 60.

  • oss_speed_limit: the minimum data transmission rate. Unit: byte/s. Default value: 1024.

  • oss_speed_time: the maximum wait period during which the data transmission rate can be lower than its minimum value. Unit: seconds. Default value: 15.

If you retain the default values of the preceding parameters, a timeout is triggered after the transmission rate stays below 1 KB/s for 15 consecutive seconds. For more information, see the "SDK troubleshooting" section of this topic.
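
For a slow network link, you can loosen these defaults. The following sketch assumes that the timeout parameters are also accepted in the LOCATION string of the external table DDL statement:

create readable external table ossexample_slow
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        prefix=osstest/example id=XXX
        key=XXX bucket=testbucket
        oss_connect_timeout=30 oss_speed_limit=512 oss_speed_time=30')
        FORMAT 'csv';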

AnalyticDB for PostgreSQL also supports parameters that are compatible with the Greenplum external table syntax. For more information, visit CREATE EXTERNAL TABLE in Greenplum Database Documentation. These parameters include:

  • FORMAT: the supported file format, such as TEXT or CSV.

  • ENCODING: the data encoding format of a file, such as UTF-8.

  • LOG ERRORS: specifies that erroneous rows encountered during import are ignored and written to error_table. You can use the count parameter of SEGMENT REJECT LIMIT to specify an error threshold.

    Note
    • You can use LOG ERRORS to record information about the rows that fail to be imported in an internal error log file that is associated with the external table.
      create readable external table ossexample
              (date text, time text, open float, high float,
              low float, volume int)
              location('oss://oss-cn-hangzhou.aliyuncs.com
              prefix=osstest/example id=XXX
              key=XXX bucket=testbucket compressiontype=gzip')
              FORMAT 'csv' (QUOTE '''' DELIMITER E'\t')
              ENCODING 'utf8'
              LOG ERRORS SEGMENT REJECT LIMIT 5;
    • You can use the gp_read_error_log('external_table_name') function to obtain information about the rows that fail to be imported.
      select * from gp_read_error_log('external_table_name');
    • After the external table is deleted, the internal file is also deleted. You can also use the gp_truncate_error_log('external_table_name') function to delete the internal file.
      select gp_truncate_error_log('external_table_name');
    • Only AnalyticDB for PostgreSQL V4.3 allows you to use LOG ERRORS INTO error_table to specify the error table.
      create readable external table ossexample
              (date text, time text, open float, high float,
              low float, volume int)
              location('oss://oss-cn-hangzhou.aliyuncs.com
              prefix=osstest/example id=XXX
              key=XXX bucket=testbucket compressiontype=gzip')
              FORMAT 'csv' (QUOTE '''' DELIMITER E'\t')
              ENCODING 'utf8'
              LOG ERRORS INTO my_error_rows SEGMENT REJECT LIMIT 5;

Examples

-- Create an OSS external table for data import.
create readable external table ossexample
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        prefix=osstest/example id=XXX
        key=XXX bucket=testbucket compressiontype=gzip')
        FORMAT 'csv' (QUOTE '''' DELIMITER E'\t')
        ENCODING 'utf8'
        LOG ERRORS SEGMENT REJECT LIMIT 5;
create readable external table ossexample
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        dir=osstest/ id=XXX
        key=XXX bucket=testbucket')
        FORMAT 'csv'
        LOG ERRORS SEGMENT REJECT LIMIT 5;
create readable external table ossexample
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        filepath=osstest/example.csv id=XXX
        key=XXX bucket=testbucket')
        FORMAT 'csv'
        LOG ERRORS SEGMENT REJECT LIMIT 5;
-- Create an OSS external table for data export.
create WRITABLE external table ossexample_exp
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        prefix=osstest/exp/outfromhdb id=XXX
        key=XXX bucket=testbucket') FORMAT 'csv'
        DISTRIBUTED BY (date);
create WRITABLE external table ossexample_exp
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        dir=osstest/exp/ id=XXX
        key=XXX bucket=testbucket') FORMAT 'csv'
        DISTRIBUTED BY (date);
-- Create a heap table for data loading.
create table example
        (date text, time text, open float,
         high float, low float, volume int)
         DISTRIBUTED BY (date);
-- Import data to the example heap table from the ossexample table in parallel.
insert into example select * from ossexample;
-- Export data from the example heap table to OSS in parallel.
insert into ossexample_exp select * from example;
-- In the following execution plan, all compute nodes participate in the data import.
-- All compute nodes read data from OSS in parallel. A redistribution motion hashes the
-- data on the distribution key and sends it to the appropriate compute nodes, where
-- each node inserts the rows it receives.
explain insert into example select * from ossexample;
                                            QUERY PLAN
-----------------------------------------------------------------------------------------------
 Insert (slice0; segments: 4)  (rows=250000 width=92)
   ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.00..11000.00 rows=250000 width=92)
         Hash Key: ossexample.date
         ->  External Scan on ossexample  (cost=0.00..11000.00 rows=250000 width=92)
(4 rows)
-- In the following query plan, each compute node directly exports local data to OSS without redistributing the data.
explain insert into ossexample_exp select * from example;
                          QUERY PLAN
---------------------------------------------------------------
 Insert (slice0; segments: 3)  (rows=1 width=92)
   ->  Seq Scan on example  (cost=0.00..0.00 rows=1 width=92)
(2 rows)

Precautions

  • AnalyticDB for PostgreSQL and Greenplum use similar syntax to create and use external tables, but differ in regard to location-related parameters.

  • Data import performance depends on the OSS performance as well as available resources of your AnalyticDB for PostgreSQL instance, such as CPU, I/O, memory, and network resources. To maximize import performance, we recommend that you enable column-oriented storage and compression when you create a table. For example, you can specify the following clause: WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib, COMPRESSLEVEL=5, BLOCKSIZE=1048576), as shown in the sketch after this list. For more information, see CREATE TABLE in Greenplum Database Documentation.

  • We recommend that OSS and AnalyticDB for PostgreSQL instances be in the same region for the best import performance.
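
The following sketch applies the recommended storage options to the example table used in this topic; the table and column names are illustrative:

create table example_col
        (date text, time text, open float,
         high float, low float, volume int)
        WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib,
              COMPRESSLEVEL=5, BLOCKSIZE=1048576)
        DISTRIBUTED BY (date);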

TEXT and CSV formats

The following parameters specify the formats of files read from and written to OSS. They can be specified in the FORMAT clause of the external table DDL statement.

  • NEWLINE: the line break used in TEXT and CSV files. Valid values: LF (\n), CR (\r), and CRLF (\r\n).
  • DELIMITER: the string used to delimit columns.
    • If you specify the DELIMITER parameter, you must also specify the QUOTE parameter.
    • Recommended column delimiters include commas (,), vertical bars (|), tabs (\t), and other special characters.
  • QUOTE: a pair of characters used to enclose user data.
    • The pair of characters specified by the QUOTE parameter is used to distinguish user data from control characters.
    • For efficiency, data that does not require quoting, such as integers, is not enclosed in QUOTE characters.
    • QUOTE cannot be the same string as the one specified in DELIMITER. The default value of QUOTE is a double quotation mark (").
    • If user data contains QUOTE characters, they must be escaped so that the parser can distinguish them from the quotes that enclose field values.
  • ESCAPE: the escape character used to distinguish data.
    • Escape characters are placed before characters that otherwise have a special meaning.
    • If ESCAPE is not specified, its default value is the same as QUOTE.
    • You can also use other characters as escape characters, such as the '\' used by MySQL by default.
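
The following sketch sets these format options explicitly on a readable CSV external table. The pipe delimiter and backslash escape are arbitrary choices; the QUOTE and NULL values match the CSV defaults listed below:

create readable external table ossexample_fmt
        (date text, time text, open float, high float,
        low float, volume int)
        location('oss://oss-cn-hangzhou.aliyuncs.com
        dir=osstest/ id=XXX
        key=XXX bucket=testbucket')
        FORMAT 'csv' (DELIMITER '|' QUOTE '"' ESCAPE E'\\' NULL '')
        ENCODING 'utf8';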

Default control characters for TEXT and CSV files

Control character   TEXT                        CSV
DELIMITER           Tab (\t)                    Comma (,)
QUOTE               Double quotation mark (")   Double quotation mark (")
ESCAPE              N/A                         Same as QUOTE
NULL                Backslash plus N (\N)       Empty string without quotation marks
Note

All control characters must be single-byte characters.

SDK troubleshooting

When errors occur during the import or export process, the error log contains the following information:

  • code: the HTTP status code of the error request.

  • error_code: the error code returned by OSS.

  • error_msg: the error message returned by OSS.

  • req_id: the UUID that identifies the request. You can provide the req_id to OSS development engineers for further help.

For more information, see OSS error response. Timeout errors can be handled by adjusting the oss_ext timeout parameters described in the "Other parameters" section.

Instructions

If the data import process takes an abnormally long time, see the description of import performance in the "Precautions" section.
