How to import HDFS data to AnalyticDB for MySQL - AnalyticDB

AnalyticDB for MySQL supports using external tables to import and export data. This topic describes how to use an external table to query data from the Hadoop Distributed File System (HDFS) and import the data into AnalyticDB for MySQL.

Prerequisites

Your AnalyticDB for MySQL cluster must run kernel version 3.1.4 or later.
Note
To view and update the minor version of an AnalyticDB for MySQL cluster, log on to the AnalyticDB for MySQL console and go to the Configuration Information section of the Cluster Information page.
The HDFS data file is in CSV, Parquet, or ORC format.
An HDFS cluster is created and the data to be imported is prepared in an HDFS folder. This topic uses the hdfs_import_test_data.csv folder as an example.
The following service access ports are configured in the HDFS cluster for the AnalyticDB for MySQL cluster:
- namenode: Reads and writes file system metadata. You can configure the port number in the fs.defaultFS parameter. The default port is 8020.
  For more information about the configuration, see core-default.xml.
- datanode: Reads and writes data. You can configure the port number in the dfs.datanode.address parameter. The default port is 50010.
  For more information about the configuration, see hdfs-default.xml.
AnalyticDB for MySQL Data Warehouse Edition in elastic mode supports Elastic Network Interface (ENI) access.
Important
- Log on to the AnalyticDB for MySQL console. On the Cluster Information page, in the Network Information section, turn on the Elastic Network Interface (ENI) switch.
- Enabling or disabling the ENI network interrupts database connections for about 2 minutes. During this time, read and write operations are unavailable. Evaluate the impact before you enable or disable the ENI network.

Procedure

Create a destination database. In this example, the destination database in the AnalyticDB for MySQL cluster is named adb_demo.
```
CREATE DATABASE IF NOT EXISTS adb_demo;    
```
In the adb_demo destination database, use the CREATE TABLE statement to create an external table in CSV, Parquet, or ORC format.
- Create an HDFS external table.
- Create a partitioned HDFS external table.
Create a destination table.
Use one of the following statements to create a destination table in the adb_demo destination database to store data imported from HDFS:
- Create a destination table for a standard external table. In this example, the destination table is named adb_hdfs_import_test. The syntax is as follows:
```
CREATE TABLE IF NOT EXISTS adb_hdfs_import_test
(
    uid string,
    other string
)
DISTRIBUTED BY HASH(uid);
```
- When you create a destination table for a partitioned external table, you must define both standard columns, such as uid and other, and partition key columns, such as p1, p2, and p3. In this example, the destination table is named adb_hdfs_import_parquet_partition. The syntax is as follows:
```
CREATE TABLE IF NOT EXISTS adb_hdfs_import_parquet_partition
(
    uid string,
    other string,
    p1 date,
    p2 int,
    p3 varchar
)
DISTRIBUTED BY HASH(uid);
```
Import data from HDFS into the destination AnalyticDB for MySQL cluster.
Select a method to import data as needed. The syntax for importing data into a partitioned table is the same as for a standard table. The following examples use a standard table:
- Method 1 (Recommended): Use INSERT OVERWRITE to import data. This method supports batch import and delivers high performance. Data is visible after a successful import. If the import fails, the data is rolled back. The following code provides an example:
```
INSERT OVERWRITE adb_hdfs_import_test
SELECT * FROM hdfs_import_test_external_table;
```
- Method 2: Use INSERT INTO to import data. Inserted data is available for real-time queries. Use this method for small amounts of data. The following code provides an example:
```
INSERT INTO adb_hdfs_import_test
SELECT * FROM hdfs_import_test_external_table;
```
- Method 3: Run an asynchronous task to import data. The following code provides an example:
```
SUBMIT JOB INSERT OVERWRITE adb_hdfs_import_test
SELECT * FROM hdfs_import_test_external_table;
```
  The following result is returned:
```
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+
```
  You can also check the status of the asynchronous task based on the job_id. For more information, see Asynchronously submit an import task.

What to do next

After the import is complete, log on to the adb_demo destination database in your AnalyticDB for MySQL cluster. Run the following statement to verify that the data was imported from the source table into the adb_hdfs_import_test destination table:

SELECT * FROM adb_hdfs_import_test LIMIT 100;

Create an HDFS external table

Create an external table for a CSV file

The statement is as follows:

CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
}';

Parameter	Required	Description
`ENGINE='HDFS'`	Required	The storage engine for the external table. This example uses HDFS.
`TABLE_PROPERTIES`		The method that AnalyticDB for MySQL uses to access HDFS data.
`format`		The format of the data file. To create an external table for a CSV file, set this parameter to `csv`.
`delimiter`		The column delimiter for the CSV data file. This example uses a comma (,).
`hdfs_url`		The absolute address of the destination data file or folder in the HDFS cluster. The address must start with `hdfs://`. Example: `hdfs://172.17.*.*:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv`
`partition_column`	Optional	The partition key columns of the external table. Separate multiple columns with commas (,). For information about how to define partition key columns, see Create a partitioned HDFS external table.
`compress_type`		The compression type of the data file. CSV files support only the Gzip compression type.
`skip_header_line_count`		The number of header rows to skip at the beginning of the file during data import. The first row of a CSV file is the table header. If you set this parameter to 1, the first row is automatically skipped during data import. The default value is 0, which means no rows are skipped.
`hdfs_ha_host_port`		If the High Availability (HA) feature is configured for the HDFS cluster, configure the `hdfs_ha_host_port` parameter when you create an external table. The format is `ip1:port1,ip2:port2`. The IP addresses and ports are for the primary and secondary `namenode` instances. Example: `192.168.xx.xx:8020,192.168.xx.xx:8021`

Create an external table for a Parquet or ORC file

The following statement shows how to create an external table for a Parquet file:

CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"parquet",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
}';

Parameter	Required	Description
`ENGINE='HDFS'`	Required	The storage engine for the external table. This example uses HDFS.
`TABLE_PROPERTIES`		The method that AnalyticDB for MySQL uses to access HDFS data.
`format`		The format of the data file. To create an external table for a Parquet file, set this parameter to `parquet`. To create an external table for an ORC file, set this parameter to `orc`.
`hdfs_url`		The absolute address of the destination data file or folder in the HDFS cluster. The address must start with `hdfs://`.
`partition_column`	Optional	The partition key columns of the table. Separate multiple columns with commas (,). For information about how to define partition key columns, see Create a partitioned HDFS external table.
`hdfs_ha_host_port`	Optional	If the HA feature is configured for the HDFS cluster, configure the `hdfs_ha_host_port` parameter when you create an external table. The format is `ip1:port1,ip2:port2`. The IP addresses and ports are for the primary and secondary `namenode` instances. Example: `192.168.xx.xx:8020,192.168.xx.xx:8021`

Note

The column names in the `CREATE TABLE` statement for the external table must be identical to the column names in the Parquet or ORC file but are case-insensitive. The order of the columns must also be the same.
When you create an external table, you can select only some columns from the Parquet or ORC file to be columns in the external table. Unselected columns are not imported.
If the `CREATE TABLE` statement for the external table includes a column that does not exist in the Parquet or ORC file, queries for that column return NULL.

Data type mappings between Parquet and AnalyticDB for MySQL

Basic type in Parquet	Logical type in Parquet	Data type in AnalyticDB for MySQL
BOOLEAN	N/A	BOOLEAN
INT32	INT_8	TINYINT
INT32	INT_16	SMALLINT
INT32	N/A	INT or INTEGER
INT64	N/A	BIGINT
FLOAT	N/A	FLOAT
DOUBLE	N/A	DOUBLE
FIXED_LEN_BYTE_ARRAY BINARY INT64 INT32	DECIMAL	DECIMAL
BINARY	UTF-8	VARCHAR STRING JSON (available if the Parquet object contains a column of the JSON type)
INT32	DATE	DATE
INT64	TIMESTAMP_MILLIS	TIMESTAMP or DATETIME
INT96	N/A	TIMESTAMP or DATETIME

Important

Parquet external tables that use columns of the STRUCT type cannot be created.

Data type mappings between ORC and AnalyticDB for MySQL

Data type in ORC	Data type in AnalyticDB for MySQL
BOOLEAN	BOOLEAN
BYTE	TINYINT
SHORT	SMALLINT
INT	INT or INTEGER
LONG	BIGINT
DECIMAL	DECIMAL
FLOAT	FLOAT
DOUBLE	DOUBLE
BINARY STRING VARCHAR	VARCHAR STRING JSON (available if the ORC object contains a column of the JSON type)
TIMESTAMP	TIMESTAMP or DATETIME
DATE	DATE

Important

ORC external tables that use the LIST, STRUCT, or UNION type cannot be created. ORC external tables that use the MAP type cab be created but cannot be queried.

Create a partitioned HDFS external table

HDFS supports partitioning data in Parquet, CSV, and ORC file formats. Partitioned data forms a hierarchical directory on HDFS. In the following example, p1 is the level-1 partition, p2 is the level-2 partition, and p3 is the level-3 partition:

parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p2=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0

The following statement shows how to create an external table with specified columns for a Parquet file:

CREATE TABLE IF NOT EXISTS hdfs_parquet_partition_table
(
  uid varchar,
  other varchar,
  p1 date,
  p2 int,
  p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
  "hdfs_url":"hdfs://172.17.***.**:9000/adb/parquet_partition_classic/",
  "format":"parquet",  //To create an external table for a CSV or ORC file, change the value of format to csv or orc.
  "partition_column":"p1, p2, p3"  //For partitioned HDFS data, if you want to query data by partition, you must specify the partition_column parameter in the CREATE EXTERNAL TABLE statement when you import data to AnalyticDB for MySQL.
}';

Note

The partition_column property in the TABLE_PROPERTIES parameter specifies the partition key columns (p1, p2, and p3 in the example). The partition key columns specified by the partition_column property must conform to the partition levels of the sample data.
When you define columns in the statement, you must include the partition key columns (p1, p2, and p3 in the example) and their data types. The partition key columns must be placed at the end of the column definition.
The partition key columns defined in the statement must be in the same order as the partition key columns specified by the partition_column property.
Partition key columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.
When you query data, partition key columns can be displayed and used in the same manner as other columns.
If you leave the format parameter empty, the CSV format is used.
For information about other parameters, see the parameter table in the "Non-partitioned OSS external tables" section of this topic.