AnalyticDB for MySQL supports using external tables to import and export data. This topic describes how to use an external table to query data from the Hadoop Distributed File System (HDFS) and import the data into AnalyticDB for MySQL.
Prerequisites
Your AnalyticDB for MySQL cluster must run kernel version 3.1.4 or later.
Note: To view and update the minor version of an AnalyticDB for MySQL cluster, log on to the AnalyticDB for MySQL console and go to the Configuration Information section of the Cluster Information page.
The HDFS data file is in CSV, Parquet, or ORC format.
An HDFS cluster is created and the data to be imported is prepared in an HDFS folder. This topic uses the hdfs_import_test_data.csv file as an example.
The following service access ports are configured in the HDFS cluster for the AnalyticDB for MySQL cluster:
- namenode: Reads and writes file system metadata. You can configure the port number by using the fs.defaultFS parameter. The default port is 8020. For more information about the configuration, see core-default.xml.
- datanode: Reads and writes data. You can configure the port number by using the dfs.datanode.address parameter. The default port is 50010. For more information about the configuration, see hdfs-default.xml.
Elastic Network Interface (ENI) access is enabled. AnalyticDB for MySQL Data Warehouse Edition in elastic mode supports ENI access.
Important: To enable ENI access, log on to the AnalyticDB for MySQL console. On the Cluster Information page, in the Network Information section, turn on the Elastic Network Interface (ENI) switch.
Enabling or disabling the ENI network interrupts database connections for about 2 minutes. During this time, read and write operations are unavailable. Evaluate the impact before you enable or disable the ENI network.
Procedure
1. Create a destination database. In this example, the destination database in the AnalyticDB for MySQL cluster is named adb_demo.

   ```sql
   CREATE DATABASE IF NOT EXISTS adb_demo;
   ```

2. In the adb_demo destination database, use the CREATE TABLE statement to create an external table in CSV, Parquet, or ORC format. For the syntax, see the "Create an HDFS external table" section of this topic.

3. Create a destination table. Use one of the following statements to create a destination table in the adb_demo database to store data imported from HDFS:

   - Create a destination table for a standard external table. In this example, the destination table is named adb_hdfs_import_test. The syntax is as follows:

     ```sql
     CREATE TABLE IF NOT EXISTS adb_hdfs_import_test
     (
         uid string,
         other string
     )
     DISTRIBUTED BY HASH(uid);
     ```

   - Create a destination table for a partitioned external table. You must define both standard columns, such as uid and other, and partition key columns, such as p1, p2, and p3. In this example, the destination table is named adb_hdfs_import_parquet_partition. The syntax is as follows:

     ```sql
     CREATE TABLE IF NOT EXISTS adb_hdfs_import_parquet_partition
     (
         uid string,
         other string,
         p1 date,
         p2 int,
         p3 varchar
     )
     DISTRIBUTED BY HASH(uid);
     ```
4. Import data from HDFS into the destination AnalyticDB for MySQL cluster. Select a method to import data as needed. The syntax for importing data into a partitioned table is the same as for a standard table. The following examples use a standard table:

   - Method 1 (Recommended): Use INSERT OVERWRITE to import data. This method supports batch import and delivers high performance. Data is visible after a successful import. If the import fails, the data is rolled back. The following code provides an example:

     ```sql
     INSERT OVERWRITE adb_hdfs_import_test
     SELECT * FROM hdfs_import_test_external_table;
     ```

   - Method 2: Use INSERT INTO to import data. Inserted data is available for real-time queries. Use this method for small amounts of data. The following code provides an example:

     ```sql
     INSERT INTO adb_hdfs_import_test
     SELECT * FROM hdfs_import_test_external_table;
     ```

   - Method 3: Run an asynchronous task to import data. The following code provides an example:

     ```sql
     SUBMIT JOB INSERT OVERWRITE adb_hdfs_import_test
     SELECT * FROM hdfs_import_test_external_table;
     ```

     The following result is returned:

     ```
     +---------------------------------------+
     | job_id                                |
     +---------------------------------------+
     | 2020112122202917203100908203303****** |
     +---------------------------------------+
     ```

     You can also check the status of the asynchronous task based on the job_id. For more information, see Asynchronously submit an import task.
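     The exact status-check statement is described in the Asynchronously submit an import task topic. The following sketch assumes the INFORMATION_SCHEMA view documented there; verify the view name and syntax for your kernel version before you use it:

     ```sql
     -- Assumed status query for the asynchronous import job; the view name is taken
     -- from the asynchronous import topic and may differ across kernel versions.
     SELECT * FROM INFORMATION_SCHEMA.kepler_meta_async_jobs
     WHERE job_id = '2020112122202917203100908203303******';
     ```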
What to do next
After the import is complete, log on to the adb_demo destination database in your AnalyticDB for MySQL cluster. Run the following statement to verify that the data was imported from the source table into the adb_hdfs_import_test destination table:
```sql
SELECT * FROM adb_hdfs_import_test LIMIT 100;
```

Create an HDFS external table
Create an external table for a CSV file
The statement is as follows:
```sql
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
}';
```

| Parameter | Required | Description |
| --- | --- | --- |
| ENGINE='HDFS' | Required | The storage engine for the external table. This example uses HDFS. |
| TABLE_PROPERTIES | Required | The method that AnalyticDB for MySQL uses to access HDFS data. |
| format | Required | The format of the data file. To create an external table for a CSV file, set this parameter to csv. |
| delimiter | Required | The column delimiter of the CSV data file. This example uses a comma (,). |
| hdfs_url | Required | The absolute address of the destination data file or folder in the HDFS cluster. The address must start with hdfs://. Example: hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv |
| partition_column | Optional | The partition key columns of the external table. Separate multiple columns with commas (,). For information about how to define partition key columns, see the "Create a partitioned HDFS external table" section of this topic. |
| compress_type | Optional | The compression type of the data file. CSV files support only the Gzip compression type. |
| skip_header_line_count | Optional | The number of header rows to skip at the beginning of the file during data import. The first row of a CSV file is the table header. If you set this parameter to 1, the first row is automatically skipped during data import. The default value is 0, which means no rows are skipped. |
| hdfs_ha_host_port | Optional | If the high availability (HA) feature is configured for the HDFS cluster, specify this parameter when you create the external table. Format: ip1:port1,ip2:port2, where the IP addresses and ports are those of the primary and secondary namenode instances. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021 |
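For reference, the following sketch combines the optional parameters from the preceding table to read a Gzip-compressed CSV file whose first row is a header from an HA-enabled HDFS cluster. The table name, file path, and the exact value literals for compress_type and skip_header_line_count are illustrative assumptions; only the parameter names are taken from the table above:

```sql
-- Hypothetical example: Gzip-compressed CSV with a header row on an HA-enabled HDFS cluster.
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table_gzip
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv.gz",
    "compress_type":"gzip",
    "skip_header_line_count":"1",
    "hdfs_ha_host_port":"192.168.xx.xx:8020,192.168.xx.xx:8021"
}';
```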
Create an external table for a Parquet or ORC file
The following statement shows how to create an external table for a Parquet file:
```sql
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"parquet",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
}';
```

| Parameter | Required | Description |
| --- | --- | --- |
| ENGINE='HDFS' | Required | The storage engine for the external table. This example uses HDFS. |
| TABLE_PROPERTIES | Required | The method that AnalyticDB for MySQL uses to access HDFS data. |
| format | Required | The format of the data file. To create an external table for a Parquet file, set this parameter to parquet. To create an external table for an ORC file, set this parameter to orc. |
| hdfs_url | Required | The absolute address of the destination data file or folder in the HDFS cluster. The address must start with hdfs://. |
| partition_column | Optional | The partition key columns of the table. Separate multiple columns with commas (,). For information about how to define partition key columns, see the "Create a partitioned HDFS external table" section of this topic. |
| hdfs_ha_host_port | Optional | If the HA feature is configured for the HDFS cluster, specify this parameter when you create the external table. Format: ip1:port1,ip2:port2, where the IP addresses and ports are those of the primary and secondary namenode instances. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021 |
Note:
- The column names in the `CREATE TABLE` statement for the external table must be identical to the column names in the Parquet or ORC file. The column names are not case-sensitive. The order of the columns must also be the same as in the file.
- When you create an external table, you can select only some of the columns in the Parquet or ORC file as columns of the external table. Columns that are not selected are not imported, as shown in the sketch after this note.
- If the `CREATE TABLE` statement for the external table includes a column that does not exist in the Parquet or ORC file, queries for that column return NULL.
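For example, if the ORC files in a folder contain more columns than you need, you can map only a subset of them. The following sketch assumes a hypothetical folder path and that the files contain at least the uid and other columns; the format value orc comes from the parameter table above:

```sql
-- Hypothetical ORC external table that maps only the uid and other columns.
CREATE TABLE IF NOT EXISTS hdfs_import_test_orc_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"orc",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_orc_data/"
}';
```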
Data type mappings between Parquet and AnalyticDB for MySQL
| Basic type in Parquet | Logical type in Parquet | Data type in AnalyticDB for MySQL |
| --- | --- | --- |
| BOOLEAN | N/A | BOOLEAN |
| INT32 | INT_8 | TINYINT |
| INT32 | INT_16 | SMALLINT |
| INT32 | N/A | INT or INTEGER |
| INT64 | N/A | BIGINT |
| FLOAT | N/A | FLOAT |
| DOUBLE | N/A | DOUBLE |
| FIXED_LEN_BYTE_ARRAY, BINARY, INT64, or INT32 | DECIMAL | DECIMAL |
| BINARY | UTF-8 | VARCHAR, STRING, or JSON (JSON is available if the Parquet object contains a column of the JSON type) |
| INT32 | DATE | DATE |
| INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME |
| INT96 | N/A | TIMESTAMP or DATETIME |
Important: Parquet external tables that use columns of the STRUCT type cannot be created.
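To make the mappings concrete, the following sketch maps a hypothetical Parquet folder whose files contain an INT64 column, a DOUBLE column, a BINARY column with the UTF-8 logical type, and an INT96 timestamp column. The table name, folder path, and column names are assumptions; the type choices follow the mapping table above:

```sql
CREATE TABLE IF NOT EXISTS hdfs_parquet_typed_external_table
(
    id bigint,            -- Parquet INT64
    amount double,        -- Parquet DOUBLE
    name varchar,         -- Parquet BINARY with the UTF-8 logical type
    created_at timestamp  -- Parquet INT96
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"parquet",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_parquet_typed_data/"
}';
```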
Data type mappings between ORC and AnalyticDB for MySQL

| Data type in ORC | Data type in AnalyticDB for MySQL |
| --- | --- |
| BOOLEAN | BOOLEAN |
| BYTE | TINYINT |
| SHORT | SMALLINT |
| INT | INT or INTEGER |
| LONG | BIGINT |
| DECIMAL | DECIMAL |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| BINARY, STRING, or VARCHAR | VARCHAR, STRING, or JSON (JSON is available if the ORC object contains a column of the JSON type) |
| TIMESTAMP | TIMESTAMP or DATETIME |
| DATE | DATE |
Important: ORC external tables that use the LIST, STRUCT, or UNION type cannot be created. ORC external tables that use the MAP type can be created but cannot be queried.
Create a partitioned HDFS external table
HDFS supports partitioning data in Parquet, CSV, and ORC file formats. Partitioned data forms a hierarchical directory on HDFS. In the following example, p1 is the level-1 partition, p2 is the level-2 partition, and p3 is the level-3 partition:
```
parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0
```

The following statement shows how to create an external table with specified columns for a Parquet file:
```sql
CREATE TABLE IF NOT EXISTS hdfs_parquet_partition_table
(
uid varchar,
other varchar,
p1 date,
p2 int,
p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
"hdfs_url":"hdfs://172.17.***.**:9000/adb/parquet_partition_classic/",
"format":"parquet", //To create an external table for a CSV or ORC file, change the value of format to csv or orc.
"partition_column":"p1, p2, p3" //For partitioned HDFS data, if you want to query data by partition, you must specify the partition_column parameter in the CREATE EXTERNAL TABLE statement when you import data to AnalyticDB for MySQL.
}';
```

- The partition_column property in the TABLE_PROPERTIES parameter specifies the partition key columns (p1, p2, and p3 in this example). The partition key columns specified by the partition_column property must conform to the partition levels of the sample data.
- When you define columns in the statement, you must include the partition key columns (p1, p2, and p3 in this example) and their data types. The partition key columns must be placed at the end of the column definitions.
- The partition key columns defined in the statement must be in the same order as the partition key columns specified by the partition_column property.
- Partition key columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.
- When you query data, partition key columns can be displayed and used in the same manner as other columns, as shown in the example after these notes.
- If you leave the format parameter empty, the CSV format is used.
- For information about other parameters, see the parameter tables in the "Create an external table for a CSV file" and "Create an external table for a Parquet or ORC file" sections of this topic.
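For example, after you create the partitioned external table above and the adb_hdfs_import_parquet_partition destination table from the Procedure section, you can import and query the data as follows. This is a sketch; the filter values correspond to the sample directory hierarchy shown earlier:

```sql
-- Import the partitioned HDFS data by using the recommended INSERT OVERWRITE method.
INSERT OVERWRITE adb_hdfs_import_parquet_partition
SELECT * FROM hdfs_parquet_partition_table;

-- Partition key columns can be selected and filtered in the same way as other columns.
SELECT uid, other, p1, p2, p3
FROM adb_hdfs_import_parquet_partition
WHERE p1 = '2020-01-01' AND p3 = 'SHANGHAI'
LIMIT 100;
```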