You can use the external data source access feature of AnalyticDB for PostgreSQL to access external data sources, such as Hadoop Distributed File System (HDFS), Hive, and Java Database Connectivity (JDBC) data sources. This feature allows you to query and analyze data in these external data sources.
Overview
External data source management provides high-performance parallel transmission for structured and unstructured data. The external data source access feature allows you to query and analyze external system data from AnalyticDB for PostgreSQL by using SQL syntax. The external data source access feature provides the following benefits:
Supports access from AnalyticDB for PostgreSQL to multiple data storage systems, such as HDFS, Hive, MySQL, PostgreSQL, and PolarDB for MySQL.
Supports various data formats, such as CSV, ORC, Parquet, JSON, and Avro.
Supports efficient data reads and writes on top of the parallel data processing capability of AnalyticDB for PostgreSQL.
Uses SQL syntax to access external data sources regardless of the data connection details.
Supported versions
AnalyticDB for PostgreSQL V6.0: V6.6.0.0 or later.
AnalyticDB for PostgreSQL V7.0: V7.0.2.3 or later.
AnalyticDB for PostgreSQL in Serverless mode: V2.1.1.3 or later.
Procedure
Step 1: Enable the external data service
- Log on to the AnalyticDB for PostgreSQL console.
- In the upper-left corner of the console, select a region.
- Find the instance that you want to manage and click the instance ID.
In the left-side navigation pane, click External Data Source Management.
On the External Data Source Access tab, click Enable External Data Service.
In the panel that appears, configure the parameters that are described in the following table and click OK.
Parameter
Description
Name
The name of the external data service. All types of characters are supported.
Service Description
The description of the external data service.
Step 2: Add a data source
The external data source access feature supports HDFS, Hive, and JDBC data sources.
Add an HDFS data source
Prerequisites
A self-managed Hadoop cluster or an E-MapReduce (EMR) cluster that supports Hadoop Common, HDFS, and YARN is created in the same virtual private cloud (VPC) as an AnalyticDB for PostgreSQL instance.
The IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance is added to a security group of the Hadoop cluster. All default port numbers of the Hadoop services that may be used by the external data source access feature are added to the security group. The following table describes the Hadoop services. For more information, see Add a security group rule.
Service | Port number | Description |
NameNode IPC | Hadoop 2.x: 8020. Hadoop 3.x: 9820. EMR: 9000. | For more information about the port number, see the dfs.namenode.rpc-address parameter in the hdfs-site.xml file. |
DataNode IPC | Hadoop 2.x: 50020. Hadoop 3.x: 9867. | For more information about the port number, see the dfs.datanode.ipc.address parameter in the hdfs-site.xml file. |
DataNode | Hadoop 2.x: 50010. Hadoop 3.x: 9866. | For more information about the port number, see the dfs.datanode.address parameter in the hdfs-site.xml file. |
Hive metastore | 9083 | For more information about the port number, see the hive.metastore.uris parameter in the hive-site.xml file. |
HiveServer2 | 10000 | For more information about the port number, see the hive.server2.thrift.port parameter in the hive-site.xml file. |
ZooKeeper | 2181 | For more information about the port number, see the clientPort parameter in the zoo.cfg file. |
Kerberos Server | 88 (transmission encryption and authentication) and 749 (kadmin) | For more information about the port numbers, see the kdc_ports and kadmind_port parameters in the kdc.conf file. |
JDBC (MySQL or PostgreSQL) | 3306, 1521, and 5432 | None. |
Internet Control Message Protocol (ICMP) | N/A | For network connectivity checks, add a security group rule that allows inbound and outbound ICMP traffic. |
Note: To obtain the IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance, go to the External Data Source Management page.
Prepare a test dataset
Prepare a test dataset and run the hadoop fs -put <File name> <Hadoop cluster path> command to store the dataset in the Hadoop cluster path.
For example, to upload the /home/user/file.txt file to the /tmp directory of HDFS, run the hadoop fs -put /home/user/file.txt /tmp/file.txt command. Sample dataset:
1,test1,1,100
2,test2,2,90
Procedure
On the External Data Service Access tab, select Hadoop Data Service from the Add Data Source drop-down list.
In the Add Hadoop Data Source panel, configure the parameters in the Network & Security Group, Configuration Files, and Initialization steps.
Configure the parameters in the Network & Security Group step. The following table describes the parameters.
Parameter
Description
Data Source Name
The name can contain letters, digits, hyphens (-), and underscores (_).
The name must start and end with a letter or digit.
The name can be up to 50 characters in length.
Data Source Description
The description of the data source.
Data Source Type
If the data source is an EMR cluster, select E-MapReduce (EMR).
If the data source is a self-managed Hadoop cluster, select Self-managed Hadoop Cluster.
Database Engine
The database engine. Select HDFS.
Data Source
The ID of the EMR cluster.
Note: This parameter is not supported for self-managed Hadoop clusters.
Network Check
Click Network Check. If the check succeeds, a message indicating that the network is connected appears.
Click Next and configure the parameters in the Configuration Files step. The following table describes the parameters.
Parameter | Description | Method to obtain |
Parameter Configuration | None | If the data source is an EMR cluster, click Obtain Automatic Configuration. If the data source is a self-managed Hadoop cluster, paste the content of the configuration files. |
hdfs-site.xml | The HDFS configuration parameters, such as the replication factor and block size. | Obtain the .xml files of a self-managed Hadoop cluster from one of the following directories: the conf directory, which is used by earlier versions of Hadoop, or the etc/hadoop directory in the main Hadoop installation directory, which is used by later versions of Hadoop. |
core-site.xml | The Hadoop core configuration parameters, such as the I/O settings and file system configuration. | Same as hdfs-site.xml. |
yarn-site.xml | The YARN configuration parameters that are used for cluster resource management and job scheduling. | Same as hdfs-site.xml. |
mapred-site.xml | The MapReduce job parameters, such as the job executor and optimization parameters. | Same as hdfs-site.xml. |
/etc/hosts | Resolves the hostnames of all Hadoop cluster nodes to IP addresses. | Use Method 1 or Method 2, as described after this table. |
Method 1: Log on to each host of the Hadoop cluster and obtain the mapping between the IP address and the hostname of the node from the /etc/hosts file.
Method 2: Log on to the host where the NameNode of the Hadoop cluster is located and run the following script:
#!/bin/bash
echo $(hostname -i) $(hostname -f)
hdfs dfsadmin -report | awk -F':' '/^Name:/ {printf $2} /^Hostname:/ {print $2}'
Click Save and Finish.
Wait 3 to 5 minutes. When the status of the data source changes to Running, the HDFS data source is available.
Find the data source and click Check in the Actions column.
In the Check Service Configuration dialog box, enter the Hadoop cluster path of the dataset, such as /tmp/file.txt, and click Check.
If the check is successful, you can read and write HDFS external tables.
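For a quick verification by using the test dataset, you can also create a readable external table over the /tmp/file.txt file. The following statements are a minimal sketch: the table and column names are hypothetical placeholders for the four comma-separated fields in the sample dataset, and <data_source_name> must be replaced with the name of the data source that you added. For more information about the syntax, see Step 3 of this topic.
-- Hypothetical readable external table over the sample dataset (for example, 1,test1,1,100).
CREATE READABLE EXTERNAL TABLE pxf_hdfs_check_table ( id int, name text, type int, score int )
LOCATION ('pxf://tmp/file.txt?PROFILE=hdfs:text&SERVER=<data_source_name>')
FORMAT 'TEXT' (delimiter=',');
SELECT * FROM pxf_hdfs_check_table;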
Add a Hive data source
Prerequisites
A self-managed Hive cluster or an EMR cluster that supports Hadoop Common, HDFS, Hive, and YARN is created in the same VPC as an AnalyticDB for PostgreSQL instance.
The IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance is added to a security group of the Hive cluster. For more information, see Add a security group rule.
Note: To obtain the IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance, go to the External Data Source Management page.
Prepare a test dataset
Prepare a test dataset and run the hadoop fs -put <File name> <Hadoop cluster path> command to store the dataset in the Hadoop cluster path.
For example, to upload the /home/user/file.txt file to the /tmp directory of HDFS in the Hive cluster, run the hadoop fs -put /home/user/file.txt /tmp/file.txt command. Sample dataset:
1,test1,1,100
2,test2,2,90
Procedure
On the External Data Service Access tab, select Hadoop Data Service from the Add Data Source drop-down list.
In the Add Hadoop Data Source panel, configure the parameters in the Network & Security Group, Configuration Files, and Initialization steps.
Configure the parameters in the Network & Security Group step. The following table describes the parameters.
Parameter
Description
Data Source Name
The name can contain letters, digits, hyphens (-), and underscores (_).
The name must start and end with a letter or digit.
The name can be up to 50 characters in length.
Data Source Description
The description of the data source.
Data Source Type
If the data source is an EMR cluster, select E-MapReduce (EMR).
If the data source is a self-managed Hive cluster, select Self-managed Hadoop Cluster.
Database Engine
The database engine. Select HIVE.
Data Source
The ID of the EMR cluster.
Note: This parameter is not supported for self-managed Hive clusters.
Network Check
Click Network Check. If the check succeeds, a message indicating that the network is connected appears.
Click Next and configure the parameters in the Configuration Files step. The following table describes the parameters.
Parameter | Description | Method to obtain |
Parameter Configuration | None | If the data source is an EMR cluster, click Obtain Automatic Configuration. If the data source is a self-managed Hive cluster, paste the content of the configuration files. |
hdfs-site.xml | The HDFS configuration parameters, such as the replication factor and block size. | Obtain the .xml files of a self-managed Hive cluster from one of the following directories: the conf directory for a standard installation, or the etc/hadoop directory for an installation that integrates Hadoop and Hive. |
core-site.xml | The Hadoop core configuration parameters, such as the I/O settings and file system configuration. | Same as hdfs-site.xml. |
yarn-site.xml | The YARN configuration parameters that are used for cluster resource management and job scheduling. | Same as hdfs-site.xml. |
mapred-site.xml | The MapReduce job parameters, such as the job executor and optimization parameters. | Same as hdfs-site.xml. |
hive-site.xml | The Hive configuration parameters, such as the metadata connection information, Hive memory settings, and Hive performance settings. | Same as hdfs-site.xml. |
/etc/hosts | Resolves the hostnames of all Hive cluster nodes to IP addresses. | Use Method 1 or Method 2, as described after this table. |
Method 1: Log on to each host of the Hive cluster and obtain the mapping between the IP address and the hostname of the node from the /etc/hosts file.
Method 2: Log on to the host where the NameNode of the Hive cluster is located and run the following script:
#!/bin/bash
echo $(hostname -i) $(hostname -f)
hdfs dfsadmin -report | awk -F':' '/^Name:/ {printf $2} /^Hostname:/ {print $2}'
Click Save and Finish.
Wait 3 to 5 minutes. When the status of the data source changes to Running, the Hive data source is available.
Find the data source and click Check in the Actions column.
In the Check Service Configuration dialog box, enter the Hadoop cluster path of the dataset, such as /tmp/file.txt, and click Check.
If the check is successful, you can read and write Hive external tables.
Add a JDBC data source
A JDBC data source is a data source that is compatible with MySQL and PostgreSQL, such as ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, PolarDB for MySQL, and PolarDB for PostgreSQL.
In this example, an ApsaraDB RDS for MySQL data source is used.
Prerequisites
An ApsaraDB RDS for MySQL instance is created in the same VPC as an AnalyticDB for PostgreSQL instance.
Databases and accounts are created in the ApsaraDB RDS for MySQL instance. For more information, see Create accounts and databases.
The IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance is added to a whitelist of the ApsaraDB RDS for MySQL instance. For more information, see Configure an IP address whitelist.
Note: To obtain the IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance, go to the External Data Source Management page.
Procedure
On the External Data Service Access tab, select JDBC Data Service from the Add Data Source drop-down list.
In the Add JDBC Data Source panel, configure the parameters that are described in the following table.
Parameter
Description
Data Source Name
The name can contain letters, digits, hyphens (-), and underscores (_).
The name must start and end with a letter or digit.
The name can be up to 50 characters in length.
Database Engine
The database engine. Select MySQL.
Data Source Description
The description of the data source.
JDBC Connection String
The JDBC connection string that is used to connect to the ApsaraDB RDS for MySQL instance. Format: jdbc:mysql://<servername>:<port>/<dbName>
<servername>:<port>: the endpoint and port number of the ApsaraDB RDS for MySQL instance.
If you use an internal connection, enter the internal endpoint and internal port number of the ApsaraDB RDS for MySQL instance.
If you use a public connection, enter the public endpoint and public port number of the ApsaraDB RDS for MySQL instance.
To obtain the endpoint and port number, go to the Database Connection page of the ApsaraDB RDS for MySQL instance. For more information, see View and change the endpoints and port numbers.
<dbName>: the name of the database in the ApsaraDB RDS for MySQL instance.
A sample connection string is provided after this table.
Username
The database account of the ApsaraDB RDS for MySQL instance.
Logon Password
The password of the database account.
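For example, a connection string for a hypothetical internal endpoint and a database named testdb might look like the following. The endpoint is a placeholder, not a real instance:
jdbc:mysql://rm-****.mysql.rds.aliyuncs.com:3306/testdb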
Click Test Connection. If the network connection is successful, click OK.
If the network connection fails, check whether the JDBC connection string is correct.
Wait 3 to 5 minutes. When the status of the data source changes to Running, the JDBC data source is available. You can read and write JDBC external tables.
Step 3: Read and write external table data
Select SQL syntax based on your data source type to read and write external table data.
HDFS external tables
Read and write text data
Syntax
CREATE [WRITABLE | READABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
?PROFILE=hdfs:text[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
Parameters
Parameter | Description |
[WRITABLE | READABLE] | Specifies whether the external table is writable or readable. WRITABLE: the writable external table, which is used to write data to HDFS. READABLE: the readable external table, which is used to read data from HDFS. This parameter is optional. If you do not specify it, a readable external table is created. |
path-to-hdfs-dir | The absolute path of the HDFS data. |
PROFILE | The data format of the HDFS external table. To use the text data format, set this parameter to hdfs:text. For more information, see the "Appendix 1: Data formats supported by HDFS external tables" section of this topic. |
SERVER | The name of the external data source. To obtain the data source name, log on to the AnalyticDB for PostgreSQL console, click External Data Source Management in the left-side navigation pane, and then click the External Data Source Access tab. If you do not specify this parameter, the server named Default is used. |
FORMAT | The text format. Valid values: TEXT and CSV. |
delimiter | The data delimiter, such as a comma (,) or a vertical bar (|). |
DISTRIBUTED BY | The distribution policy. If you want to load data from an existing AnalyticDB for PostgreSQL table to an external table, we recommend that you use the distribution policy or field names of the AnalyticDB for PostgreSQL table in the external table to prevent additional data migration between compute nodes when the data is loaded. |
Examples
Example 1: Create a writable external table. In the CREATE TABLE statement, specify the HDFS protocol, data format, and path. Insert data into the external table.
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writable_table ( a int, b int, c int )
LOCATION ('pxf://data/demo/pxf_hdfs_demo_table?PROFILE=hdfs:text&SERVER=test2')
FORMAT 'TEXT' (delimiter='|');
INSERT INTO pxf_hdfs_writable_table SELECT generate_series(1, 9000), generate_series(1, 9000), generate_series(1, 9000);
Example 2: Create a readable external table. In the CREATE TABLE statement, specify the HDFS protocol, data format, and path. Query data of the external table.
CREATE READABLE EXTERNAL TABLE pxf_hdfs_readable_table ( a int, b int, c int )
LOCATION ('pxf://data/demo/pxf_hdfs_demo_table?PROFILE=hdfs:text&SERVER=test2')
FORMAT 'TEXT' (delimiter='|');
SELECT * FROM pxf_hdfs_readable_table ORDER BY 1;
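To materialize the external data in AnalyticDB for PostgreSQL, you can load it from the readable external table into a local table. The following statements are a minimal sketch that reuses pxf_hdfs_readable_table from Example 2; the local table name and the distribution column are hypothetical.
-- Local table with an explicit distribution key, in line with the DISTRIBUTED BY recommendation above.
CREATE TABLE local_hdfs_copy ( a int, b int, c int ) DISTRIBUTED BY (a);
INSERT INTO local_hdfs_copy SELECT * FROM pxf_hdfs_readable_table;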
Read and write Parquet data
To read and write Parquet primitive data types in AnalyticDB for PostgreSQL, you must map the Parquet data values to AnalyticDB for PostgreSQL table columns of the same type. For information about the mappings, see the "Appendix 3: Mappings between Parquet data types and AnalyticDB for PostgreSQL data types" section of this topic.
Syntax
CREATE [WRITABLE |READABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
?PROFILE=hdfs:parquet[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
Parameters
Parameter | Description |
[WRITABLE | READABLE] | Specifies whether the external table is writable or readable. WRITABLE: the writable external table, which is used to write data to HDFS. READABLE: the readable external table, which is used to read data from HDFS. This parameter is optional. If you do not specify it, a readable external table is created. |
path-to-hdfs-dir | The absolute path of the HDFS data. |
PROFILE | The data format of the HDFS external table. To use the Parquet data format, set this parameter to hdfs:parquet. For more information, see the "Appendix 1: Data formats supported by HDFS external tables" section of this topic. |
SERVER | The name of the external data source. To obtain the data source name, log on to the AnalyticDB for PostgreSQL console, click External Data Source Management in the left-side navigation pane, and then click the External Data Source Access tab. If you do not specify this parameter, the server named Default is used. |
FORMAT | The non-text format. To use the Parquet data format, specify FORMAT 'CUSTOM'. |
FORMATTER | Specifies whether the external table is readable or writable. Valid values: pxfwritable_import for readable external tables and pxfwritable_export for writable external tables. |
DISTRIBUTED BY | The distribution policy. If you want to load data from an existing AnalyticDB for PostgreSQL table to an external table, we recommend that you use the distribution policy or field names of the AnalyticDB for PostgreSQL table in the external table to prevent additional data migration between compute nodes when the data is loaded. |
The hdfs:parquet profile supports encoding- and compression-related write options. You can specify the write options in the CREATE WRITABLE EXTERNAL TABLE LOCATION clause. The following table describes the custom write options.
Write option | Description |
COMPRESSION_CODEC | The compression codec alias. Valid values for writing Parquet data:
|
ROWGROUP_SIZE | The size of the row group. Unit: bytes. Default value: 8 × 1024 × 1024. A Parquet file consists of one or more row groups. This way, data is logically divided into rows. |
PAGE_SIZE | The size of the page. Unit: bytes. Default value: 1024 × 1024. A row group consists of column chunks that are divided into pages. |
DICTIONARY_PAGE_SIZE | The size of the dictionary page. Unit: bytes. Default value: 512 × 1024. When the external data source access feature writes Parquet files, dictionary encoding is enabled. Only one dictionary page exists per column, per row group. |
PARQUET_VERSION | The Parquet version. Valid values:
|
Examples
Example 1: Create a writable external table. In the CREATE TABLE statement, specify the HDFS protocol, Parquet data format, and path. Insert data into the external table.
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writable_parquet_table ( a int, b int, c int )
LOCATION ('pxf://data/demo/pxf_hdfs_writable_parquet_table?PROFILE=hdfs:parquet&SERVER=test2')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
INSERT INTO pxf_hdfs_writable_parquet_table SELECT generate_series(1, 9000), generate_series(1, 9000), generate_series(1, 9000);
Example 2: Create a readable external table. In the CREATE TABLE statement, specify the HDFS protocol, Parquet data format, and path. Query data of the external table.
CREATE READABLE EXTERNAL TABLE pxf_hdfs_readable_parquet_table ( a int, b int, c int )
LOCATION ('pxf://data/demo/pxf_hdfs_writable_parquet_table?PROFILE=hdfs:parquet&SERVER=test2')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
SELECT * FROM pxf_hdfs_readable_parquet_table;
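The write options described above can be appended to the LOCATION clause of a writable Parquet external table. The following statement is a minimal sketch: the table name is hypothetical, SERVER=test2 is reused from the preceding examples, and snappy is assumed to be an accepted COMPRESSION_CODEC alias.
-- Writable Parquet external table with custom write options in the LOCATION clause (hypothetical values).
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_parquet_options_table ( a int, b int, c int )
LOCATION ('pxf://data/demo/pxf_hdfs_parquet_options_table?PROFILE=hdfs:parquet&SERVER=test2&COMPRESSION_CODEC=snappy&ROWGROUP_SIZE=16777216')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');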
Hive external tables
To display Hive data in AnalyticDB for PostgreSQL, you must map the Hive data values to AnalyticDB for PostgreSQL table columns of the same type. For information about the mappings, see the "Appendix 4: Mappings between Hive data types and AnalyticDB for PostgreSQL data types" section of this topic.
Syntax
Writable and readable Hive external tables support different features. Select the appropriate SQL syntax based on your business requirements.
Writable external tables
The underlying logic of the Hive data write interface changes significantly across Hive versions. Errors may occur when you create or write to Hive external tables in later Hive versions. We recommend that you create and write external tables in HDFS mode instead. For more information about the syntax, see the "HDFS external tables" section of this topic.
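For example, instead of a writable Hive external table, you can write delimited text directly to the HDFS directory that backs the Hive table. The following statements are a minimal sketch that assumes the Hive table in the Examples section, which is stored as text delimited by vertical bars (|) in the /data/demo/pxf_hdfs_demo_table directory; the external table name is hypothetical.
-- Write to the HDFS location of the Hive table by using the hdfs:text profile.
CREATE WRITABLE EXTERNAL TABLE pxf_hive_hdfs_write_table ( a int, b int, c int )
LOCATION ('pxf://data/demo/pxf_hdfs_demo_table?PROFILE=hdfs:text&SERVER=test2')
FORMAT 'TEXT' (delimiter='|');
INSERT INTO pxf_hive_hdfs_write_table SELECT generate_series(1, 100), generate_series(1, 100), generate_series(1, 100);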
Readable external tables
CREATE [READABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<hive-db-name>.<hive-table-name>
?PROFILE=Hive|HiveText|HiveRC|HiveORC|HiveVectorizedORC[&SERVER=<server_name>]')
FORMAT 'CUSTOM|TEXT' (FORMATTER='pxfwritable_import' | delimiter='<delim>');
Parameters
Parameter | Description |
READABLE | To create a readable external table, you can specify or omit the READABLE parameter. |
hive_dbName | The name of the Hive database. If you do not specify this parameter, the Hive database named default is used. |
hive_tableName | The name of the Hive table. |
PROFILE | The data format of the Hive external table. Set this parameter to Hive, HiveText, HiveRC, HiveORC, or HiveVectorizedORC. For more information, see the "Appendix 2: Data formats supported by Hive external tables" section of this topic. |
SERVER | The name of the external data source. To obtain the data source name, log on to the AnalyticDB for PostgreSQL console, click External Data Source Management in the left-side navigation pane, and then click the External Data Source Access tab. If you do not specify this parameter, the server named Default is used. |
FORMAT | The data format. Specify FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import') or FORMAT 'TEXT' (delimiter='<delim>') based on the PROFILE value and the underlying file format. |
Examples
Connect to the Hive cluster and create a Hive table.
hive> CREATE EXTERNAL TABLE IF NOT EXISTS pxf_hive_demo_table ( a int, b int, c int )
row format delimited fields terminated by '|'
stored as textfile
location '/data/demo/pxf_hdfs_demo_table';
Connect to the AnalyticDB for PostgreSQL database, create a writable Hive external table, and then insert data into the external table.
postgres> CREATE WRITABLE EXTERNAL TABLE pxf_hive_writable_table ( a int, b int, c int )
LOCATION ('pxf://default.pxf_hive_demo_table?PROFILE=Hive&SERVER=test2')
FORMAT 'TEXT' (delimiter='|');
INSERT INTO pxf_hive_writable_table SELECT generate_series(1, 9000), generate_series(1, 9000), generate_series(1, 9000);
Create a readable Hive external table and query data of the external table.
postgres> CREATE EXTERNAL TABLE pxf_hive_readable_table ( a int, b int, c int )
LOCATION ('pxf://default.pxf_hive_demo_table?PROFILE=Hive&SERVER=test2')
FORMAT 'TEXT' (delimiter='|');
SELECT * FROM pxf_hive_readable_table;
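If a Hive table is stored in the ORC format, you can use the HiveORC profile together with FORMAT 'CUSTOM', as listed in Appendix 2. The following statements are a minimal sketch that assumes a hypothetical ORC-stored Hive table named default.pxf_hive_orc_demo_table and reuses the test2 server name.
-- Readable external table over an ORC-stored Hive table (hypothetical table name).
CREATE EXTERNAL TABLE pxf_hive_orc_readable_table ( a int, b int, c int )
LOCATION ('pxf://default.pxf_hive_orc_demo_table?PROFILE=HiveORC&SERVER=test2')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
SELECT * FROM pxf_hive_orc_readable_table;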
JDBC external tables
Supported data types
The JDBC connector supports the following data types:
INTEGER, BIGINT, and SMALLINT
REAL and FLOAT8
NUMERIC
BOOLEAN
VARCHAR, BPCHAR, and TEXT
DATE
TIMESTAMP
BYTEA
The JDBC connector cannot read or write Hive data of the Byte[] type.
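The column types of the external table must line up with the columns of the source table. The following statements are a minimal sketch that uses a hypothetical MySQL table named orders and the test01 server name from the Examples section; the external table syntax is described below.
-- In the MySQL database (hypothetical source table):
-- CREATE TABLE orders (order_id BIGINT, amount DECIMAL(10,2), created_at TIMESTAMP, memo VARCHAR(255));
-- Matching AnalyticDB for PostgreSQL external table:
CREATE EXTERNAL TABLE pxf_jdbc_orders ( order_id bigint, amount numeric, created_at timestamp, memo text )
LOCATION ('pxf://orders?PROFILE=Jdbc&SERVER=test01')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');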
Syntax
You can use the following syntax to create an external table and use the JDBC connector to read or write data. To access data in a remote SQL database, you can create a readable or writable AnalyticDB for PostgreSQL external table that references the remote database table. The AnalyticDB for PostgreSQL external table and the remote database table or query result tuple must have the same column names and types.
CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<external_tableName>?PROFILE=Jdbc[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export');
Parameters
Parameter | Description |
external_tableName | The full name of the external table. This parameter is determined by the external SQL database and may include a schema name and a table name. |
PROFILE | The data format of the JDBC external table. Set this parameter to Jdbc. |
SERVER | The name of the external data source. To obtain the data source name, log on to the AnalyticDB for PostgreSQL console, click External Data Source Management in the left-side navigation pane, and then click the External Data Source Access tab. If you do not specify this parameter, the server named Default is used. |
custom-option | The custom options. For more information, see the "JDBC custom options" section of this topic. |
FORMAT | The data format. Specify FORMAT 'CUSTOM' with the pxfwritable_import formatter for readable external tables or the pxfwritable_export formatter for writable external tables. |
JDBC custom options
You can include JDBC connector custom options in the LOCATION URI and add an ampersand (&) before each option. The following table describes the supported JDBC custom options.
Option name | Operation | Description |
BATCH_SIZE | Write | The number of INSERT operations that are performed on the external SQL database in batch processing mode. By default, batch writing is enabled. Default value: 100. The value must be an integer. The external data source access feature always checks the BATCH_SIZE option.
|
FETCH_SIZE | Read | The number of rows that are buffered when the JDBC connector reads data from the external SQL database. By default, batch reading is enabled. Default value: 1000. The value must be an integer. To change the default fetch size, specify the FETCH_SIZE option when you create the external table. Example: If the JDBC driver of the external database does not support batch reading, you must specify |
QUERY_TIMEOUT | Read or write | The amount of time that the JDBC driver waits for a statement to execute. Unit: seconds. Default value: infinite. The value must be an integer. |
POOL_SIZE | Write | Enables thread pooling on INSERT operations and identifies the number of threads in the pool. By default, thread pooling is disabled. We recommend that you use the BATCH_SIZE and POOL_SIZE options together. In this case, each thread receives and processes one complete batch of data. If you use the POOL_SIZE option but do not use the BATCH_SIZE option, each thread in the pool receives exactly one tuple. When a thread in the thread pool fails, the JDBC connector returns an error. If an INSERT operation fails, specific data may be written to the external database table. To disable or enable thread pooling and configure the thread pool size, create an external table by using the following settings for the POOL_SIZE option:
|
PARTITION_BY | Read | Enables read partitioning. You can specify only one partition column in the Sample PARTITION_BY option when you create a JDBC external table: When you enable partitioning, the JDBC connector splits a SELECT query into multiple subqueries that retrieve a subset of the data. Each subset is called a fragment. The JDBC connector automatically adds extra query constraints (WHERE expressions) to each fragment to ensure that every tuple of data is retrieved from the external database exactly once. For example, when you query an external table created by using a LOCATION clause that specifies
The external data source management feature distributes the fragments among AnalyticDB for PostgreSQL compute nodes. An external data source instance generates a thread for each compute node on the host that serves a fragment. If the number of fragments is less than or equal to the number of AnalyticDB for PostgreSQL compute nodes configured on a compute node host, a single external data source instance can provide services to all fragments. Each external data source instance sends results back to AnalyticDB for PostgreSQL, where they are collected and returned to you. When you specify the |
RANGE | Read | If you specify the PARTITION_BY option, the RANGE option is required. The query range, which is used as a hint to help create partitions. The format of the RANGE option is determined by the data type of the partition column.
|
INTERVAL | Read | If you specify the PARTITION_BY option and the option is of the The interval of a fragment, which is in the
|
QUOTE_COLUMNS | Read | Specifies whether the JDBC connector references column names when the connector constructs an SQL query for the external database. Valid values:
By default, the QUOTE_COLUMNS option is not specified. In this case, the JDBC connector automatically references all column names in the query if a column name meets one of the following requirements:
|
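You can combine multiple custom options in a single LOCATION clause. The following statement is a minimal sketch that reuses the test table and the test01 server from the Examples section. Because the value formats for PARTITION_BY, RANGE, and INTERVAL are partially truncated in the preceding table, the <column>:<type>, <start>:<end>, and numeric interval formats shown here follow the common PXF convention and are an assumption.
-- Readable JDBC external table that enables partitioned, batched reads (hypothetical option values).
CREATE EXTERNAL TABLE pxf_jdbc_partitioned ( id int, name varchar )
LOCATION ('pxf://test?PROFILE=Jdbc&SERVER=test01&PARTITION_BY=id:int&RANGE=1:100&INTERVAL=10&FETCH_SIZE=500')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
SELECT * FROM pxf_jdbc_partitioned;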
Examples
Create a table in a MySQL or PostgreSQL database and insert data into the table.
CREATE TABLE test(id int, name VARCHAR(255));
INSERT INTO test(id,name) VALUES(1,'qingzhi');
INSERT INTO test(id,name) VALUES(2,'jianli');
Connect to an AnalyticDB for PostgreSQL database, create a JDBC external table, and then query data of the external table.
CREATE EXTERNAL TABLE pxf_jdbc_textsimple(id int, name varchar)
LOCATION ('pxf://test?PROFILE=Jdbc&SERVER=test01')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import')
ENCODING 'UTF8';
SELECT * FROM pxf_jdbc_textsimple;
Appendix 1: Data formats supported by HDFS external tables
The following table describes the mappings between the data formats supported by HDFS external tables and the PROFILE parameter configured for creating HDFS external tables.
Data format | PROFILE parameter |
Delimited single line text | hdfs:text |
Delimited text with quoted line feeds | hdfs:text:multi |
Avro | hdfs:avro |
JSON | hdfs:json |
Parquet | hdfs:parquet |
AvroSequenceFile | hdfs:AvroSequenceFile |
SequenceFile | hdfs:SequenceFile |
Appendix 2: Data formats supported by Hive external tables
The following table describes the mappings between the data formats supported by Hive external tables and the PROFILE parameter configured for creating Hive external tables.
Data format | PROFILE parameter |
TextFile | Hive or HiveText |
SequenceFile | Hive |
RCFile | Hive or HiveRC |
ORC | Hive, HiveORC, or HiveVectorizedORC |
Parquet | Hive |
Appendix 3: Mappings between Parquet data types and AnalyticDB for PostgreSQL data types
Parquet data type | AnalyticDB for PostgreSQL data type |
Boolean | Boolean |
Byte_array | Bytea or Text |
Double | Float8 |
Fixed_len_byte_array | Numeric |
Float | Real |
Int_8, Int_16 | Smallint or Integer |
Int64 | Bigint |
Int96 | Timestamp or Timestamptz |
When the external data source access feature writes Parquet data, the feature localizes a timestamp to the current system time zone, converts the timestamp into UTC, and then converts the timestamp into Int96. During the conversion process, the external data source access feature loses the time zone information.
Appendix 4: Mappings between Hive data types and AnalyticDB for PostgreSQL data types
Hive data type | AnalyticDB for PostgreSQL data type |
Boolean | Bool |
Int | Int4 |
Smallint | Int2 |
Tinyint | Int2 |
Float | Float4 |
Bigint | Int8 |
Double | Float8 |
String | Text |
Binary | Bytea |
Timestamp | Timestamp |