All Products
Search
Document Center

ApsaraDB RDS:Use the oss_fdw extension to read and write foreign data text files

Last Updated:Oct 31, 2023

This topic describes how to import data from an Object Storage Service (OSS) bucket into an ApsaraDB RDS for PostgreSQL instance by using the oss_fdw extension. This topic also describes how to export data from an ApsaraDB RDS for PostgreSQL instance to an OSS bucket by using the oss_fdw extension.

Prerequisites

Your RDS instance runs PostgreSQL 10 or later.

Note

If your RDS instance runs PostgreSQL 14, the minor engine version of your RDS instance must be 20220830 or later. For more information, see Update the minor engine version.

Examples

# Create an oss_fdw extension for your RDS instance.
create extension oss_fdw;  
-- Create an OSS server. 
CREATE SERVER ossserver FOREIGN DATA WRAPPER oss_fdw OPTIONS 
     (host 'oss-cn-hangzhou.aliyuncs.com' , id 'xxx', key 'xxx',bucket 'mybucket');
-- Create a foreign OSS table named ossexample.
CREATE FOREIGN TABLE ossexample 
    (date text, time text, open float,
     high float, low float, volume int) 
     SERVER ossserver 
     OPTIONS ( dir 'osstest/', delimiter ',' ,
         format 'csv', encoding 'utf8', PARSE_ERRORS '100');
-- Create a table named example on your RDS instance. The table is used to store the data that is imported into your RDS instance. 
create table example
        (date text, time text, open float,
         high float, low float, volume int);
-- Import data from the ossexample table into the example table. 
insert into example select * from ossexample;

-- Use the oss_fdw extension to estimate the size of the ossexample table and formulate a query plan. 
explain insert into example select * from ossexample;
                             QUERY PLAN                              
---------------------------------------------------------------------
 Insert on example  (cost=0.00..1.60 rows=6 width=92)
   ->  Foreign Scan on ossexample  (cost=0.00..1.60 rows=6 width=92)
         Foreign OssFile: osstest/example.csv.0
         Foreign OssFile Size: 728
(4 rows)
-- Export data from the example table to the ossexample table. 
insert into ossexample select * from example;
explain insert into ossexample select * from example;
                           QUERY PLAN
-----------------------------------------------------------------
 Insert on ossexample  (cost=0.00..16.60 rows=660 width=92)
   ->  Seq Scan on example  (cost=0.00..16.60 rows=660 width=92)
(2 rows)

For more information about the parameters in the preceding examples, see the following sections.

Parameters supported by the oss_fdw extension

Similar to other foreign data wrappers (FDW), the oss_fdw extension encapsulates foreign data that is stored in OSS buckets. You can use the oss_fdw extension to read data from OSS buckets. This process is similar to the process of reading data tables. The oss_fdw extension provides some unique parameters that are used to access a specified OSS bucket and parse the OSS objects in the OSS bucket.

Note
  • The oss_fdw extension can read and write data to OSS objects in the CSV format. The OSS objects include the CSV objects that are compressed by using gzip.

  • The value of each parameter that is used by the oss_fdw extension must be enclosed in a pair of double quotation marks (""). In addition, the value of each parameter cannot contain unnecessary spaces.

Parameters in the CREATE SERVER statement

Parameter

Description

host

The internal endpoint of the OSS bucket.

id

The AccessKey ID of the account that is used to access OSS.

key

The AccessKey secret of the account that is used to access OSS.

bucket

The OSS bucket of the objects whose data you want to read or write. Before you configure this parameter, you must create an account that is used to access OSS.

The following table describes the fault tolerance parameters that are provided by OSS. If network connectivity is poor, you can adjust the values of these parameters to ensure successful import and export.

Parameter

Description

oss_connect_timeout

The timeout period of connections. Unit: seconds. Default value: 10.

oss_dns_cache_timeout

The timeout period of cached Domain Name System (DNS) records. Unit: seconds. Default value: 60.

oss_speed_limit

The minimum transmission rate that can be tolerated. Unit: bit/s. Default value: 1024. The default value is equal to 1 Kbit/s.

oss_speed_time

The maximum period of time for which the minimum transmission rate can be tolerated. Unit: seconds. Default value: 15.

Note

You can retain the default values of the oss_speed_limit and oss_speed_time parameters. In this case, if the transmission rate remains less than 1 Kbit/s for 15 consecutive seconds, a time-out error occurs.

Parameters in the CREATE FOREIGN TABLE statement

Parameter

Description

filepath

The object name that is used to match objects stored in the OSS bucket. The object name must contain an OSS path. You can configure one of the filepath and dir parameters. If you configure the filepath parameter, you can only import data from an OSS bucket into your RDS instance.

  • The object name does not contain an OSS bucket name.

  • The object name matches multiple objects that are stored in the OSS path. This allows you to import data from multiple objects into your RDS instance.

  • Only the data from the objects that are named in the following formats can be imported into your RDS instance: filepath and filepath.x. The values of the x variable must be consecutive integers that start from 1.

    For example, the OSS path stores five objects: filepath, filepath.1, filepath.2, filepath.3, and filepath.5. In this case, filepath, filepath.1, filepath.2, and filepath.3 are matched and imported, but filepath.5 cannot be matched or imported.

dir

The folder that is used to match objects stored in the OSS bucket. You can configure one of the filepath and dir parameters. If you configure the dir parameter, you can import and export data between OSS buckets and your RDS instance.

  • The folder must end with a forward slash (/).

  • The data of all the objects in the folder are matched and imported into your RDS instance. However, these objects do not include the subfolders and the objects stored in the subfolders.

prefix

The prefix of the OSS path in the OSS bucket. This parameter does not support regular expressions. You can configure only one of the prefix, filepath, and dir parameters.

format

The format that is supported for the objects stored in the OSS bucket. Only the CSV format is supported.

encoding

The format that is used to encode data in the objects stored in the OSS bucket. Common encoding formats in PostgreSQL are supported. These supported formats include UTF-8.

parse_errors

The mode that is used to tolerate faults during the parsing process. If an error occurs during the parsing process, the entire row that encounters the error is ignored.

delimiter

The delimiter that is used to separate columns in the objects stored in the OSS bucket.

quote

The quote character that is supported for the objects stored in the OSS bucket.

escape

The escape character that is supported for the objects stored in the OSS bucket.

null

Populates an empty column by using null values. For example, you specify the null 'test' setting. In this case, if the test column is empty, it is populated by using null values.

force_not_null

Populates an empty column by using empty strings rather than null values. For example, you specify the force_not_null 'id' setting. In this case, if the ID column is empty, it is populated by using empty strings rather than null values.

compressiontype

The compression format that is used to read and write data to the objects stored in the OSS bucket.

  • none: The data is not compressed. This is the default value.

  • gzip: The data is compressed in the GZIP format.

compressionlevel

The compression level that is used to write data to the objects stored in the OSS bucket. Valid values: 1 to 9. Default value: 6.

Note
  • The filepath and dir parameters are specified in the OPTIONS parameter.

  • You must specify the filepath parameter or the dir parameter. Do not specify both parameters.

  • If you export data from your RDS instance to the OSS bucket, you can specify only the dir parameter. You cannot specify the filepath parameter.

Parameters in the CREATE FOREIGN TABLE statement

  • oss_flush_block_size: the buffer size of the data that can be written to the OSS bucket at a time. Valid values: 1 to 128. Unit: MB. Default value: 32.

  • oss_file_max_size: the maximum amount of data that can be written to an object in the OSS bucket. If the amount of data that needs to be written reaches the maximum value, the data that remains is written to a new object. Valid values: 8 to 4000. Unit: MB. Default value: 1024.

  • num_parallel_worker: the maximum number of threads that can run in parallel to compress the data written to the OSS bucket. Valid values: 1 to 8. Default value: 3.

Auxiliary functions

FUNCTION oss_fdw_list_file (relname text, schema text DEFAULT 'public')

  • This function is used to obtain the name and size of the OSS object that the specified foreign table matches.

  • The size is measured in bytes.

select * from oss_fdw_list_file('t_oss');
              name              |   size    
--------------------------------+-----------
 oss_test/test.gz.1  | 739698350
 oss_test/test.gz.2  | 739413041
 oss_test/test.gz.3  | 739562048
(3 rows)

Auxiliary parameters

oss_fdw.rds_read_one_file: specifies the OSS object that a foreign table matches. This parameter is supported only when you import data from the OSS bucket into your RDS instance. If you specify this parameter, only the OSS object that the specified foreign table matches is imported.

Example: set oss_fdw.rds_read_one_file = 'oss_test/example16.csv.1';

set oss_fdw.rds_read_one_file = 'oss_test/test.gz.2';
select * from oss_fdw_list_file('t_oss');
              name              |   size    
--------------------------------+-----------
  oss_test/test.gz.2  | 739413041
(1 rows)

Usage notes

  • The oss_fdw extension is developed based on the PostgreSQL FOREIGN TABLE framework to manage foreign tables.

  • Data import performance varies based on the available PostgreSQL and OSS resources. The PostgreSQL resources are CPU, I/O, and memory.

  • To import data at high performance, make sure that your RDS instance and the OSS bucket reside in the same region. For more information, see OSS domain names.

  • If the error "ERROR: oss endpoint userendpoint not in aliyun white list" is reported when SQL statements are read from the foreign table, we recommend that you use the public OSS endpoint that is provided for the specified region. For more information, see Regions and endpoints.

Troubleshooting

If an import or export error occurs, the following error information is logged:

  • code: the HTTP status code of the request that has failed.

  • error_code: the error code that is returned by OSS.

  • error_msg: the error message that is returned by OSS.

  • req_id: the UUID of the request that has failed. If you require assistance in solving a problem, you can submit a ticket that contains the req_id value of the failed request.

For more information about various errors, see the following documentation. You can handle time-out errors by reconfiguring the parameters related to the oss_ext extension.

Encryption of AccessKey ID and AccessKey secret

If you do not encrypt the values of the id and key parameters in the CREATE SERVER statement, other users can obtain your AccessKey pair in plaintext by executing the select * from pg_foreign_server statement. You can use symmetric encryption to encrypt the values of the id and key parameters. Use different keys for different RDS instances. This further protects your AccessKey pair. However, you cannot add data types as you can in Greenplum. This prevents incompatibility with earlier versions.

The following snippet provides the encrypted values of the id and key parameters:

postgres=# select * from pg_foreign_server ;
  srvname  | srvowner | srvfdw | srvtype | srvversion | srvacl |                                                                              srvoptions
-----------+----------+--------+---------+------------+--------+------------------------------------------------------------------------------------------------------------------------------------
----------------------------------
 ossserver |       10 |  16390 |         |            |        | {host=oss-cn-hangzhou-zmf.aliyuncs.com,id=MD5xxxxxxxx,key=MD5xxxxxxxx,bucket=067862}

Each encrypted value starts with an MD5 string. The total length divided by 8 is 3. After these encrypted values are exported, they will not be encrypted again. Take note that you cannot create an AccessKey ID or AccessKey secret that starts with an MD5 string.