
External data source access

Last Updated: Nov 21, 2024

You can use the external data source access feature of AnalyticDB for PostgreSQL to access external data sources such as Hadoop Distributed File System (HDFS), Hive, and Java Database Connectivity (JDBC) data sources, and to query and analyze the external data in optimized data formats.

Overview

External data source management provides high-performance parallel transmission for structured and unstructured data. The external data source access feature allows you to query and analyze external system data from AnalyticDB for PostgreSQL by using SQL syntax. The external data source access feature provides the following benefits:

  • Supports access from AnalyticDB for PostgreSQL to multiple data storage systems, such as HDFS, Hive, MySQL, PostgreSQL, and PolarDB for MySQL.

  • Supports various data formats, such as CSV, ORC, Parquet, JSON, and Avro.

  • Supports efficient data reads and writes on top of the parallel data processing capability of AnalyticDB for PostgreSQL.

  • Uses SQL syntax to access external data sources regardless of the data connection details.

Supported versions

  • AnalyticDB for PostgreSQL V6.0: V6.6.0.0 or later.

  • AnalyticDB for PostgreSQL V7.0: V7.0.2.3 or later.

  • AnalyticDB for PostgreSQL in Serverless mode: V2.1.1.3 or later.

Procedure

  1. Step 1: Enable the external data service.

  2. Step 2: Add a data source.

  3. Step 3: Read and write external table data.

Step 1: Enable the external data service

  1. Log on to the AnalyticDB for PostgreSQL console.
  2. In the upper-left corner of the console, select a region.
  3. Find the instance that you want to manage and click the instance ID.
  4. In the left-side navigation pane, click External Data Source Management.

  5. On the External Data Source Access tab, click Enable External Data Service.

  6. In the panel that appears, configure the parameters that are described in the following table and click OK.

    Parameter

    Description

    Name

    The name of the external data service. All types of characters are supported.

    Service Description

    The description of the external data service.

Step 2: Add a data source

The external data source access feature supports HDFS, Hive, and JDBC data sources.

Add an HDFS data source

Prerequisites

  • A self-managed Hadoop cluster or an E-MapReduce (EMR) cluster that supports Hadoop Common, HDFS, and YARN is created in the same virtual private cloud (VPC) as an AnalyticDB for PostgreSQL instance.

  • The IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance is added to a security group of the Hadoop cluster. All default port numbers of the Hadoop services that may be used by the external data source access feature are added to the security group. The following table describes the Hadoop services. For more information, see Add a security group rule.

    Service

    Port number

    NameNode IPC

    • Hadoop 2.x: 8020

    • Hadoop 3.x: 9820

    • EMR: 9000

    For more information about the port numbers, see the dfs.namenode.rpc-address parameter in the hdfs-site.xml file.

    DataNode IPC

    • Hadoop 2.x: 50020

    • Hadoop 3.x: 9867

    For more information about the port numbers, see the dfs.datanode.ipc.address parameter in the hdfs-site.xml file.

    DataNode

    • Hadoop 2.x: 50010

    • Hadoop 3.x: 9866

    For more information about the port numbers, see the dfs.datanode.address parameter in the hdfs-site.xml file.

    Hive metastore

    9083

    For more information about the port numbers, see the hive.metastore.uris parameter in the hive-site.xml file.

    HiveServer2

    10000

    For more information about the port numbers, see the hive.server2.thrift.port parameter in the hive-site.xml file.

    ZooKeeper

    2181

    For more information about the port numbers, see the clientPort parameter in the zoo.cfg file.

    Kerberos Server

    88 (transmission encryption and authentication) and 749 (kadmin)

    For more information about the port numbers, see the kdc_ports and kadmind_port parameters in the kdc.conf file.

    JDBC (MySQL or PostgreSQL)

    3306, 1521, and 5432

    Internet Control Message Protocol (ICMP)

    For network connectivity purposes, a security group rule is added to allow inbound and outbound ICMP traffic.

    Note

    To obtain the IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance, go to the External Data Source Management page.

Prepare a test dataset

Prepare a test dataset and run the hadoop fs -put <File name> <Hadoop cluster path> command to store the dataset in the Hadoop cluster path.

For example, to upload the /home/user/file.txt file to the /tmp directory of HDFS, run the hadoop fs -put /home/user/file.txt /tmp/file.txt command. Sample dataset:

1,test1,1,100
2,test2,2,90

Procedure

  1. On the External Data Source Access tab, select Hadoop Data Service from the Add Data Source drop-down list.

  2. In the Add Hadoop Data Source panel, configure the parameters in the Network & Security Group, Configuration Files, and Initialization steps.

    1. Configure the parameters in the Network & Security Group step. The following table describes the parameters.

      Parameter

      Description

      Data Source Name

      • The name can contain letters, digits, hyphens (-), and underscores (_).

      • The name must start and end with a letter or digit.

      • The name can be up to 50 characters in length.

      Data Source Description

      The description of the data source.

      Data Source Type

      • If the data source is an EMR cluster, select E-MapReduce (EMR).

      • If the data source is a self-managed Hadoop cluster, select Self-managed Hadoop Cluster.

      Database Engine

      The database engine. Select HDFS.

      Data Source

      The ID of the EMR cluster.

      Note

      This parameter is not supported for self-managed Hadoop clusters.

      Network Check

      Click Network Check. If the network is connected, a success message appears.

    2. Click Next and configure the parameters in the Configuration Files step. The following table describes the parameters.

      Parameter

      Description

      Method to obtain

      Parameter Configuration

      None

      • If the data source is an EMR cluster, click Obtain Automatic Configuration.

      • If the data source is a self-managed Hadoop cluster, you must paste the configuration file information.

      hdfs-site.xml

      The HDFS configuration parameters, such as the replication factor and block size.

      Obtain the .xml files of self-managed Hadoop clusters from one of the following directories:

      • conf directory: suitable for earlier versions of Hadoop.

      • etc/hadoop directory: suitable for later versions of Hadoop. This directory is located in the main installation directory of Hadoop.

      core-site.xml

      The core Hadoop configuration parameters, such as I/O settings and the file system configuration.

      yarn-site.xml

      The YARN configuration parameters that are used for cluster resource management and job scheduling.

      mapred-site.xml

      The MapReduce job parameters, such as the job executor and optimization parameters.

      /etc/hosts

      Resolves the hostnames of all Hadoop cluster nodes to IP addresses.

      Use one of the following methods to obtain the file content:

      • Method 1: Log on to each host of the Hadoop cluster to obtain the mapping between the IP address and the hostname of the current node from the /etc/hosts file.

      • Method 2: Log on to the host where the Hadoop NameNode is located and run the following script, which prints the IP address and hostname mappings of the NameNode and the DataNodes:

        #!/bin/bash
        echo $(hostname -i) $(hostname -f)
        hdfs dfsadmin -report | awk -F':' '/^Name:/ {printf $2} /^Hostname:/ {print $2}'
    3. Click Save and Finish.

      Wait 3 to 5 minutes. When the status of the data source changes to Running, the HDFS data source is available.

  3. Find the data source and click Check in the Actions column.

  4. In the Check Service Configuration dialog box, enter the Hadoop cluster path of the dataset, such as /tmp/file.txt, and click Check.

    If the check is successful, you can read and write HDFS external tables.

Add a Hive data source

Prerequisites

  • A self-managed Hive cluster or an EMR cluster that supports Hadoop Common, HDFS, Hive, and YARN is created in the same VPC as an AnalyticDB for PostgreSQL instance.

  • The IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance is added to a security group of the Hive cluster. For more information, see Add a security group rule.

    Note

    To obtain the IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance, go to the External Data Source Management page.

Prepare a test dataset

Prepare a test dataset and run the hadoop fs -put <File name> <Hadoop cluster path> command to store the dataset in the Hadoop cluster path.

For example, to upload the /home/user/file.txt file to the /tmp directory of the Hive cluster's HDFS, run the hadoop fs -put /home/user/file.txt /tmp/file.txt command. Sample dataset:

1,test1,1,100
2,test2,2,90

Procedure

  1. On the External Data Source Access tab, select Hadoop Data Service from the Add Data Source drop-down list.

  2. In the Add Hadoop Data Source panel, configure the parameters in the Network & Security Group, Configuration Files, and Initialization steps.

    1. Configure the parameters in the Network & Security Group step. The following table describes the parameters.

      Parameter

      Description

      Data Source Name

      • The name can contain letters, digits, hyphens (-), and underscores (_).

      • The name must start and end with a letter or digit.

      • The name can be up to 50 characters in length.

      Data Source Description

      The description of the data source.

      Data Source Type

      • If the data source is an EMR cluster, select E-MapReduce (EMR).

      • If the data source is a self-managed Hive cluster, select Self-managed Hadoop Cluster.

      Database Engine

      The database engine. Select HIVE.

      Data Source

      The ID of the EMR cluster.

      Note

      This parameter is not supported for self-managed Hive clusters.

      Network Check

      Click Network Check. If the network is connected, a success message appears.

    2. Click Next and configure the parameters in the Configuration Files step. The following table describes the parameters.

      Parameter

      Description

      Method to obtain

      Parameter Configuration

      None

      • If the data source is an EMR cluster, click Obtain Automatic Configuration.

      • If the data source is a self-managed Hive cluster, you must paste the configuration file information.

      hdfs-site.xml

      The HDFS configuration parameters, such as the replication factor and block size.

      Obtain the .xml files of self-managed Hive clusters from one of the following directories:

      • conf directory: suitable for the standard installation.

      • etc/hadoop directory: suitable for the installation that integrates Hadoop and Hive.

      core-site.xml

      The core Hadoop configuration parameters, such as I/O settings and the file system configuration.

      yarn-site.xml

      The YARN configuration parameters that are used for cluster resource management and job scheduling.

      mapred-site.xml

      The MapReduce job parameters, such as the job executor and optimization parameters.

      hive-site.xml

      The Hive configuration parameters, such as the metadata connection information, Hive memory, and Hive performance.

      /etc/hosts

      Resolves the hostnames of all Hive cluster nodes to IP addresses.

      Use one of the following methods to obtain the file content:

      • Method 1: Log on to each host of the Hive cluster to obtain the mapping between the IP address and the hostname of the current node from the /etc/hosts file.

      • Method 2: Log on to the host where the NameNode of the Hive cluster is located and run the following script, which prints the IP address and hostname mappings of the NameNode and the DataNodes:

        #!/bin/bash
        echo $(hostname -i) $(hostname -f)
        hdfs dfsadmin -report | awk -F':' '/^Name:/ {printf $2} /^Hostname:/ {print $2}'
    3. Click Save and Finish.

      Wait 3 to 5 minutes. When the status of the data source changes to Running, the Hive data source is available.

  3. Find the data source and click Check in the Actions column.

  4. In the Check Service Configuration dialog box, enter the Hadoop cluster path of the dataset, such as /tmp/file.txt, and click Check.

    If the check is successful, you can read and write Hive external tables.

Add a JDBC data source

A JDBC data source is a data source that is compatible with MySQL and PostgreSQL, such as ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, PolarDB for MySQL, and PolarDB for PostgreSQL.

In this example, an ApsaraDB RDS for MySQL data source is used.

Prerequisites

  • An ApsaraDB RDS for MySQL instance is created in the same VPC as an AnalyticDB for PostgreSQL instance.

  • Databases and accounts are created in the ApsaraDB RDS for MySQL instance. For more information, see Create accounts and databases.

  • The IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance is added to a whitelist of the ApsaraDB RDS for MySQL instance. For more information, see Configure an IP address whitelist.

    Note

    To obtain the IP address of the vSwitch that is connected to the AnalyticDB for PostgreSQL instance, go to the External Data Source Management page.

Procedure

  1. On the External Data Source Access tab, select JDBC Data Service from the Add Data Source drop-down list.

  2. In the Add JDBC Data Source panel, configure the parameters that are described in the following table.

    Parameter

    Description

    Data Source Name

    • The name can contain letters, digits, hyphens (-), and underscores (_).

    • The name must start and end with a letter or digit.

    • The name can be up to 50 characters in length.

    Database Engine

    The database engine. Select MySQL.

    Data Source Description

    The description of the data source.

    JDBC Connection String

    The JDBC connection string that is used to connect to the ApsaraDB RDS for MySQL instance. Format:

    jdbc:mysql://<servername>:<port>/<dbName>
    • <servername>:<port>: the endpoint and port number of the ApsaraDB RDS for MySQL instance.

      • If you use an internal connection, enter the internal endpoint and internal port number of the ApsaraDB RDS for MySQL instance.

      • If you use a public connection, enter the public endpoint and public port number of the ApsaraDB RDS for MySQL instance.

      To obtain the endpoint and port number, go to the Database Connection page of the ApsaraDB RDS for MySQL instance.

      For more information, see View and change the endpoints and port numbers.

    • <dbName>: the name of the database in the ApsaraDB RDS for MySQL instance.

    Username

    The database account of the ApsaraDB RDS for MySQL instance.

    Logon Password

    The password of the database account.

  3. Click Test Connection. If the network connection is successful, click OK.

    If the network connection fails, check whether the JDBC connection string is correct.

    Wait 3 to 5 minutes. When the status of the data source changes to Running, the JDBC data source is available. You can read and write JDBC external tables.

Step 3: Read and write external table data

Select SQL syntax based on your data source type to read and write external table data.

HDFS external tables

Read and write text data

Syntax

CREATE [WRITABLE | READABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
    ?PROFILE=hdfs:text[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];

Parameters

Parameter

Description

[WRITABLE | READABLE]

  • To create a writable external table, you must specify the WRITABLE parameter.

  • To create a readable external table, you can specify or omit the READABLE parameter.

path-to-hdfs-dir

The absolute path of the HDFS data.

PROFILE

The data format of the HDFS external table. To use the text data format, set this parameter to hdfs:text.

For more information, see the "Appendix 1: Data formats supported by HDFS external tables" section of this topic.

SERVER

The name of the external data source. To obtain the data source name, log on to the AnalyticDB for PostgreSQL console, click External Data Source Management in the left-side navigation pane, and then click the External Data Source Access tab.

If you do not specify this parameter, the server named Default is used.

FORMAT

The text format.

  • If the path-to-hdfs-dir path contains delimited text data, specify FORMAT 'TEXT'.

  • If the path-to-hdfs-dir path contains comma-separated data, specify FORMAT 'CSV'.

delimiter

The data delimiter.

  • If you use the FORMAT 'CSV' clause, the default delim_value field is a comma (,).

  • If you use the FORMAT 'TEXT' clause, the default delim_value field is a vertical bar (|).

  • If the delimiter is an escape sequence, add E before the delim_value field. Example: (delimiter=E'\t'), (delimiter ':').

DISTRIBUTED BY

The distribution policy.

If you want to load data from an existing AnalyticDB for PostgreSQL table to an external table, we recommend that you use the distribution policy or field names of the AnalyticDB for PostgreSQL table in the external table to prevent additional data migration between compute nodes when the data is loaded.
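
For illustration, the following sketch applies this recommendation. The local table, the HDFS path, and the data source name test2 are hypothetical; adjust them to your environment.

-- Local table distributed by column a.
CREATE TABLE local_demo_table (a int, b int, c int) DISTRIBUTED BY (a);

-- Writable external table that reuses the same column list and distribution key,
-- so that INSERT INTO ... SELECT does not redistribute rows across compute nodes.
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_dist_table (LIKE local_demo_table)
LOCATION ('pxf://data/demo/pxf_hdfs_dist_table?PROFILE=hdfs:text&SERVER=test2')
FORMAT 'TEXT' (delimiter='|')
DISTRIBUTED BY (a);

INSERT INTO pxf_hdfs_dist_table SELECT * FROM local_demo_table;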

Examples

  • Example 1: Create a writable external table. In the CREATE TABLE statement, specify the HDFS protocol, data format, and path. Insert data into the external table.

    CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writable_table
    ( 
    a int,
    b int,
    c int
    )
    LOCATION ('pxf://data/demo/pxf_hdfs_demo_table?PROFILE=hdfs:text&SERVER=test2')
    FORMAT 'TEXT' (delimiter='|');
    
    INSERT INTO pxf_hdfs_writable_table 
    SELECT generate_series(1, 9000), generate_series(1, 9000), generate_series(1, 9000);
  • Example 2: Create a readable external table. In the CREATE TABLE statement, specify the HDFS protocol, data format, and path. Query data of the external table.

    CREATE READABLE EXTERNAL TABLE pxf_hdfs_readable_table
    ( 
    a int,
    b int,
    c int
    )
    LOCATION ('pxf://data/demo/pxf_hdfs_demo_table?PROFILE=hdfs:text&SERVER=test2')
    FORMAT 'TEXT' (delimiter='|');
    
    SELECT * FROM pxf_hdfs_readable_table order by 1;

Read and write Parquet data

Important

To read and write Parquet primitive data types in AnalyticDB for PostgreSQL, you must map the Parquet data values to AnalyticDB for PostgreSQL table columns of the same type. For information about the mappings, see the "Appendix 3: Mappings between Parquet data types and AnalyticDB for PostgreSQL data types" section of this topic.

Syntax

CREATE [WRITABLE | READABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
    ?PROFILE=hdfs:parquet[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];

Parameters

Parameter

Description

[WRITABLE | READABLE]

  • To create a writable external table, you must specify the WRITABLE parameter.

  • To create a readable external table, you can specify or omit the READABLE parameter.

path-to-hdfs-dir

The absolute path of the HDFS data.

PROFILE

The data format of the HDFS external table. To use the Parquet data format, set this parameter to hdfs:parquet.

For more information, see the "Appendix 1: Data formats supported by HDFS external tables" section of this topic.

SERVER

The name of the external data source. To obtain the data source name, log on to the AnalyticDB for PostgreSQL console, click External Data Source Management in the left-side navigation pane, and then click the External Data Source Access tab.

If you do not specify this parameter, the server named Default is used.

FORMAT

The non-text format. To use the Parquet data format, specify FORMAT 'CUSTOM'.

FORMATTER

Specifies whether the external table is readable or writable. Valid values:

  • pxfwritable_export: writable.

  • pxfwritable_import: readable.

DISTRIBUTED BY

The distribution policy.

If you want to load data from an existing AnalyticDB for PostgreSQL table to an external table, we recommend that you use the distribution policy or field names of the AnalyticDB for PostgreSQL table in the external table to prevent additional data migration between compute nodes when the data is loaded.

The hdfs:parquet profile supports encoding- and compression-related write options. You can specify the write options in the CREATE WRITABLE EXTERNAL TABLE LOCATION clause. The following table describes the custom write options.

Write option

Description

COMPRESSION_CODEC

The compression codec alias. Valid values for writing Parquet data:

  • snappy (default)

  • gzip

  • lzo

  • uncompressed

ROWGROUP_SIZE

The size of the row group. Unit: bytes. Default value: 8 × 1024 × 1024.

A Parquet file consists of one or more row groups, which logically partition the data into groups of rows.

PAGE_SIZE

The size of the page. Unit: bytes. Default value: 1024 × 1024.

A row group consists of column chunks that are divided into pages.

DICTIONARY_PAGE_SIZE

The size of the dictionary page. Unit: bytes. Default value: 512 × 1024.

When the external data source access feature writes Parquet files, dictionary encoding is enabled. Only one dictionary page exists per column, per row group.

PARQUET_VERSION

The Parquet version. Valid values:

  • v1 (default)

  • v2
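
For reference, the following is a minimal sketch of a writable Parquet external table that sets write options in the LOCATION clause. The HDFS path and the data source name test2 are placeholders, and the option values are illustrative (gzip compression and a 16 MB row group).

CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_parquet_gzip_table
(
a int,
b int,
c int
)
LOCATION ('pxf://data/demo/pxf_hdfs_parquet_gzip_table?PROFILE=hdfs:parquet&SERVER=test2&COMPRESSION_CODEC=gzip&ROWGROUP_SIZE=16777216')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');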

Examples

  • Example 1: Create a writable external table. In the CREATE TABLE statement, specify the HDFS protocol, Parquet data format, and path. Insert data into the external table.

    CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writable_parquet_table
    ( 
    a int,
    b int,
    c int
    )
    LOCATION ('pxf://data/demo/pxf_hdfs_writable_parquet_table?PROFILE=hdfs:parquet&SERVER=test2')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
    
    INSERT INTO pxf_hdfs_writable_parquet_table 
    SELECT generate_series(1, 9000), generate_series(1, 9000), generate_series(1, 9000);
  • Example 2: Create a readable external table. In the CREATE TABLE statement, specify the HDFS protocol, Parquet data format, and path. Query data of the external table.

    CREATE READABLE EXTERNAL TABLE pxf_hdfs_readable_parquet_table
    ( 
    a int,
    b int,
    c int
    )
    LOCATION ('pxf://data/demo/pxf_hdfs_writable_parquet_table?PROFILE=hdfs:parquet&SERVER=test2')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    
    SELECT * FROM pxf_hdfs_readable_parquet_table;

Hive external tables

To display Hive data in AnalyticDB for PostgreSQL, you must map the Hive data values to AnalyticDB for PostgreSQL table columns of the same type. For information about the mappings, see the "Appendix 4: Mappings between Hive data types and AnalyticDB for PostgreSQL data types" section of this topic.

Syntax

Writable and readable Hive external tables support different features. Select the appropriate SQL syntax based on your business requirements.

  • Writable external tables

    The underlying logic of the Hive data write interface changes significantly across Hive versions. Errors may occur when you create or write to Hive external tables in later Hive versions. We recommend that you create and write external tables in HDFS mode instead. For more information about the syntax, see the "HDFS external tables" section of this topic.

  • Readable external tables

    CREATE [READABLE] EXTERNAL TABLE <table_name>
        ( <column_name> <data_type> [, ...] | LIKE <other_table> )
    LOCATION ('pxf://<hive-db-name>.<hive-table-name>
        ?PROFILE=Hive|HiveText|HiveRC|HiveORC|HiveVectorizedORC[&SERVER=<server_name>]')
    FORMAT 'CUSTOM|TEXT' (FORMATTER='pxfwritable_import' | delimiter='<delim>');

Parameters

Parameter

Description

READABLE

To create a readable external table, you can specify or omit the READABLE parameter.

hive-db-name

The name of the Hive database. If you do not specify this parameter, the default Hive database (default) is used.

hive-table-name

The name of the Hive table.

PROFILE

The data format of the Hive external table. Set this parameter to Hive, HiveText, HiveRC, HiveORC, or HiveVectorizedORC.

For more information, see the "Appendix 2: Data formats supported by Hive external tables" section of this topic.

Important

The HiveVectorizedORC profile does not support the Timestamp data type.

SERVER

The name of the external data source. To obtain the data source name, log on to the AnalyticDB for PostgreSQL console, click External Data Source Management in the left-side navigation pane, and then click the External Data Source Access tab.

If you do not specify this parameter, the server named Default is used.

FORMAT

  • If you set the PROFILE parameter to Hive, HiveORC, or HiveVectorizedORC, you must specify FORMAT 'CUSTOM' and set the FORMATTER parameter to pxfwritable_import.

  • If you set the PROFILE parameter to HiveText or HiveRC, you must specify FORMAT 'TEXT' and set the delimiter parameter to a single ASCII delimiter.
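
For example, the following sketch creates a readable external table for an ORC-stored Hive table. The Hive table default.pxf_hive_orc_table and the data source name test2 are assumptions used only for illustration; because the profile is HiveORC, the FORMAT clause must be 'CUSTOM' with the pxfwritable_import formatter.

-- Readable external table over a hypothetical ORC-stored Hive table.
CREATE EXTERNAL TABLE pxf_hive_orc_readable_table
(
a int,
b int,
c int
)
LOCATION ('pxf://default.pxf_hive_orc_table?PROFILE=HiveORC&SERVER=test2')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT * FROM pxf_hive_orc_readable_table;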

Examples

  1. Connect to the Hive cluster and create a Hive table.

    hive> CREATE EXTERNAL TABLE IF NOT EXISTS pxf_hive_demo_table
    (
    a int,
    b int,
    c int
    )
    row format delimited fields terminated by '|'
    stored as textfile
    location '/data/demo/pxf_hdfs_demo_table';
  2. Connect to the AnalyticDB for PostgreSQL database, create a writable Hive external table, and then insert data into the external table.

    postgres> CREATE WRITABLE EXTERNAL TABLE pxf_hive_writable_table
    ( 
    a int,
    b int,
    c int
    )
    LOCATION ('pxf://default.pxf_hive_demo_table?PROFILE=Hive&SERVER=test2')
    FORMAT 'TEXT' (delimiter='|');
    
    INSERT INTO pxf_hive_writable_table 
    SELECT generate_series(1, 9000), generate_series(1, 9000), generate_series(1, 9000);
  3. Create a readable Hive external table and query data of the external table.

    postgres> CREATE EXTERNAL TABLE pxf_hive_readable_table
    ( 
    a int,
    b int,
    c int
    )
    LOCATION ('pxf://default.pxf_hive_demo_table?PROFILE=Hive&SERVER=test2')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    
    SELECT * FROM pxf_hive_readable_table;

JDBC external tables

Supported data types

The JDBC connector supports the following data types:

  • INTEGER, BIGINT, and SMALLINT

  • REAL and FLOAT8

  • NUMERIC

  • BOOLEAN

  • VARCHAR, BPCHAR, and TEXT

  • DATE

  • TIMESTAMP

  • BYTEA

Important

The JDBC connector cannot read or write Hive data of the Byte[] type.

Syntax

You can use the following syntax to create an external table and use the JDBC connector to read or write data. To access data in a remote SQL database, you can create a readable or writable AnalyticDB for PostgreSQL external table that references the remote database table. The AnalyticDB for PostgreSQL external table and the remote database table or query result tuple must have the same column names and types.

CREATE [READABLE | WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<external_tableName>?PROFILE=Jdbc[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export');

Parameters

Parameter

Description

external_tableName

The full name of the external table. This parameter is determined by the external SQL database and may include a schema name and a table name.

PROFILE

The data format of the JDBC external table. Set this parameter to Jdbc.

SERVER

The name of the external data source. To obtain the data source name, log on to the AnalyticDB for PostgreSQL console, click External Data Source Management in the left-side navigation pane, and then click the External Data Source Access tab.

If you do not specify this parameter, the server named Default is used.

custom-option

The custom options. For more information, see the "JDBC custom options" section of this topic.

FORMAT

The data format. Specify FORMAT 'CUSTOM'. The following FORMATTER functions are supported:

  • pxfwritable_import: used for read operations.

  • pxfwritable_export: used for write operations.

JDBC custom options

You can include JDBC connector custom options in the LOCATION URI and add an ampersand (&) before each option. The following table describes the supported JDBC custom options.

Option name

Operation

Description

BATCH_SIZE

Write

The number of INSERT operations that are performed on the external SQL database in batch processing mode. By default, batch writing is enabled. Default value: 100. The value must be an integer. The external data source access feature always checks the BATCH_SIZE option.

  • If the JDBC driver of the external SQL database supports batch writing, batch INSERT operations may significantly increase performance. To disable batch writing or modify the default batch size value, create an external table by using the following BATCH_SIZE settings:

    • BATCH_SIZE=0 or BATCH_SIZE=1: disables batch writing.

    • BATCH_SIZE=n (n>1): sets the BATCH_SIZE option to n.

  • If the JDBC driver of the external SQL database does not support batch writing, the behavior of the JDBC connector is determined by the following BATCH_SIZE settings:

    • BATCH_SIZE omitted: The JDBC connector inserts data without using batch writing.

    • BATCH_SIZE=n (n>1): The INSERT operation fails and the JDBC connector returns an error.

FETCH_SIZE

Read

The number of rows that are buffered when the JDBC connector reads data from the external SQL database. By default, batch reading is enabled. Default value: 1000. The value must be an integer.

To change the default fetch size, specify the FETCH_SIZE option when you create the external table. Example: FETCH_SIZE=5000.

If the JDBC driver of the external database does not support batch reading, you must specify FETCH_SIZE=0 to disable batch reading.

QUERY_TIMEOUT

Read or write

The amount of time that the JDBC driver waits for a statement to execute. Unit: seconds. Default value: infinite. The value must be an integer.

POOL_SIZE

Write

Enables thread pooling on INSERT operations and identifies the number of threads in the pool. By default, thread pooling is disabled.

We recommend that you use the BATCH_SIZE and POOL_SIZE options together. In this case, each thread receives and processes one complete batch of data. If you use the POOL_SIZE option but do not use the BATCH_SIZE option, each thread in the pool receives exactly one tuple.

When a thread in the thread pool fails, the JDBC connector returns an error. If an INSERT operation fails, specific data may be written to the external database table.

To disable or enable thread pooling and configure the thread pool size, create an external table by using the following settings for the POOL_SIZE option:

  • POOL_SIZE=n (n<1): The thread pool size is the number of CPUs in the system.

  • POOL_SIZE=1: disables thread pooling.

  • POOL_SIZE=n (n>1): enables thread pooling and sets the POOL_SIZE option to n.

PARTITION_BY

Read

Enables read partitioning. You can specify only one partition column in the <column-name>:<column-type> format. The JDBC connector supports Date, Int, and Enum <column-type> values. If you do not specify a PARTITION_BY column, a single JDBC connector provides services for the read request.

Sample PARTITION_BY option when you create a JDBC external table:

&PARTITION_BY=id:int&RANGE=1:100&INTERVAL=5
&PARTITION_BY=year:int&RANGE=2011:2013&INTERVAL=1
&PARTITION_BY=createdate:date&RANGE=2013-01-01:2016-01-01&INTERVAL=1:month
&PARTITION_BY=color:enum&RANGE=red:yellow:blue

When you enable partitioning, the JDBC connector splits a SELECT query into multiple subqueries that retrieve a subset of the data. Each subset is called a fragment. The JDBC connector automatically adds extra query constraints (WHERE expressions) to each fragment to ensure that every tuple of data is retrieved from the external database exactly once.

For example, when you query an external table created by using a LOCATION clause that specifies &PARTITION_BY=id:int&RANGE=1:5&INTERVAL=2, the JDBC connector generates five fragments: two fragments based on partition settings and up to three implicitly generated fragments. The following constraints are associated with each fragment:

  • Fragment 1: WHERE (id < 1). This fragment is implicitly generated for the start value of the RANGE field.

  • Fragment 2: WHERE (id >= 1) AND (id < 3). This fragment is specified based on partition settings.

  • Fragment 3: WHERE (id >= 3) AND (id < 5). This fragment is specified based on partition settings.

  • Fragment 4: WHERE (id >= 5). This fragment is implicitly generated for the end value of the RANGE field.

  • Fragment 5: WHERE (id IS NULL). This fragment is implicitly generated.

The external data source management feature distributes the fragments among AnalyticDB for PostgreSQL compute nodes. An external data source instance generates a thread for each compute node on the host that serves a fragment. If the number of fragments is less than or equal to the number of AnalyticDB for PostgreSQL compute nodes configured on a compute node host, a single external data source instance can provide services to all fragments. Each external data source instance sends results back to AnalyticDB for PostgreSQL, where they are collected and returned to you.

When you specify the PARTITION_BY option, modify the value and unit of the INTERVAL option based on the optimal number of JDBC connections to the destination database and the optimal distribution of external data across AnalyticDB for PostgreSQL compute nodes. The minimum value of the INTERVAL option varies based on the number of AnalyticDB for PostgreSQL compute nodes, whereas the maximum value varies based on the supported number of JDBC connections to the destination database. The setting of the INTERVAL option influences the number of fragments. We recommend that you do not set the INTERVAL option to an excessively large or small value. You can perform tests by using multiple values to select the optimal settings.

RANGE

Read

If you specify the PARTITION_BY option, the RANGE option is required.

The query range, which is used as a hint to help create partitions. The format of the RANGE option is determined by the data type of the partition column.

  • If the partition column is of the Enum type, you must set the RANGE option to a list of values in the <value>:<value>[:<value>[…]] format. Each value pair forms a fragment.

  • If the partition column is of the Int or Date type, you must specify the RANGE option in the <start-value>:<end-value> format. The value specifies an interval from <start-value> to <end-value> (inclusive).

    Note

    If the partition column is of the Date type, use the yyyy-MM-dd date format.

INTERVAL

Read

If you specify the PARTITION_BY option and the partition column is of the Int or Date type, the INTERVAL option is required.

The interval of a fragment, which is in the <interval-value>[:<interval-unit>] format. The INTERVAL option is used together with the RANGE option as a hint to help create partitions. The <interval-value> field specifies the fragment size.

  • If the PARTITION_BY option is of the Date type, use the <interval-unit> field to specify year, month, or day.

  • If the PARTITION_BY option is of the Enum type, the JDBC connector ignores the INTERVAL option.

QUOTE_COLUMNS

Read

Specifies whether the JDBC connector quotes column names when the connector constructs an SQL query for the external database. Valid values:

  • true: The JDBC connector quotes all column names.

  • A value other than true: The JDBC connector does not quote column names.

By default, the QUOTE_COLUMNS option is not specified. In this case, the JDBC connector automatically quotes all column names in the query if a column name meets one of the following requirements:

  • A column name contains special characters.

  • A column name contains uppercase and lowercase letters and the external database does not support unquoted mixed-case identifiers.
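
The following sketch combines several of these options in one readable external table. It assumes the remote table test and the data source test01 that appear in the examples below; the FETCH_SIZE and partitioning values are illustrative only.

-- Readable JDBC external table with batch reading and read partitioning enabled.
CREATE EXTERNAL TABLE pxf_jdbc_partitioned_table (id int, name varchar)
LOCATION ('pxf://test?PROFILE=Jdbc&SERVER=test01&FETCH_SIZE=5000&PARTITION_BY=id:int&RANGE=1:100&INTERVAL=5')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT * FROM pxf_jdbc_partitioned_table;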

Examples

  1. Create a table in a MySQL or PostgreSQL database and insert data into the table.

CREATE TABLE test(id int, name VARCHAR(255));
INSERT INTO test(id,name) VALUES(1,'qingzhi');
INSERT INTO test(id,name) VALUES(2,'jianli');
  2. Connect to an AnalyticDB for PostgreSQL database, create a JDBC external table, and then query data of the external table.

CREATE EXTERNAL TABLE pxf_jdbc_textsimple(id int, name varchar)
LOCATION ('pxf://test?PROFILE=Jdbc&SERVER=test01')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import')
ENCODING 'UTF8';
SELECT * FROM pxf_jdbc_textsimple;
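
The preceding example only reads from the remote table. The following is a hedged sketch of the corresponding writable external table, assuming the same remote table test and data source test01; the BATCH_SIZE value is illustrative.

-- Writable JDBC external table that batches INSERT operations to the remote database.
CREATE WRITABLE EXTERNAL TABLE pxf_jdbc_writable_table (id int, name varchar)
LOCATION ('pxf://test?PROFILE=Jdbc&SERVER=test01&BATCH_SIZE=100')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');

INSERT INTO pxf_jdbc_writable_table VALUES (3, 'test3');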

Appendix 1: Data formats supported by HDFS external tables

The following table describes the mappings between the data formats supported by HDFS external tables and the PROFILE parameter configured for creating HDFS external tables.

  • Delimited single line text: hdfs:text

  • Delimited text with quoted line feeds: hdfs:text:multi

  • Avro: hdfs:avro

  • JSON: hdfs:json

  • Parquet: hdfs:parquet

  • AvroSequenceFile: hdfs:AvroSequenceFile

  • SequenceFile: hdfs:SequenceFile

Appendix 2: Data formats supported by Hive external tables

The following table describes the mappings between the data formats supported by Hive external tables and the PROFILE parameter configured for creating Hive external tables.

  • TextFile: Hive or HiveText

  • SequenceFile: Hive

  • RCFile: Hive or HiveRC

  • ORC: Hive, HiveORC, or HiveVectorizedORC

  • Parquet: Hive

Appendix 3: Mappings between Parquet data types and AnalyticDB for PostgreSQL data types

The following list maps Parquet data types to AnalyticDB for PostgreSQL data types:

  • Boolean: Boolean

  • Byte_array: Bytea or Text

  • Double: Float8

  • Fixed_len_byte_array: Numeric

  • Float: Real

  • Int_8, Int_16: Smallint or Integer

  • Int64: Bigint

  • Int96: Timestamp or Timestamptz

    When the external data source access feature writes Parquet data, the feature localizes a timestamp to the current system time zone, converts the timestamp into UTC, and then converts the timestamp into Int96. During the conversion process, the external data source access feature loses the time zone information.

Appendix 4: Mappings between Hive data types and AnalyticDB for PostgreSQL data types

The following list maps Hive data types to AnalyticDB for PostgreSQL data types:

  • Boolean: Bool

  • Int: Int4

  • Smallint: Int2

  • Tinyint: Int2

  • Float: Float4

  • Bigint: Int8

  • Double: Float8

  • String: Text

  • Binary: Bytea

  • Timestamp: Timestamp
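
To illustrate how these mappings are applied, the following sketch declares a readable external table for a hypothetical Hive table named default.typed_demo_table whose columns use the Hive types above; the data source name test2 is also an assumption.

CREATE EXTERNAL TABLE pxf_hive_typed_demo_table
(
flag bool,         -- Hive Boolean
qty int4,          -- Hive Int
amount float8,     -- Hive Double
label text,        -- Hive String
created timestamp  -- Hive Timestamp
)
LOCATION ('pxf://default.typed_demo_table?PROFILE=Hive&SERVER=test2')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');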