External tables in AnalyticDB for MySQL act as a bridge to your Hadoop Distributed File System (HDFS). Create an external table that maps to your HDFS files, then run an INSERT SELECT statement to load the data into a native AnalyticDB for MySQL table.
Prerequisites
Before you begin, confirm the following:
Cluster requirements:
Your AnalyticDB for MySQL cluster runs kernel version 3.1.4 or later. To check and update the version, log on to the AnalyticDB for MySQL console and go to the Configuration Information section on the Cluster Information page. For instructions, see Update the minor version of a cluster.
Your cluster is a Data Warehouse Edition cluster running in elastic mode. Elastic mode is required for Elastic Network Interface (ENI) access.
HDFS requirements:
Your HDFS data files are in CSV, Parquet, or ORC format.
Your HDFS cluster is running and the data to import is available in an HDFS directory. This topic uses
hdfs_import_test_data.csvas an example.The following service ports are open in the HDFS cluster for AnalyticDB for MySQL:
NameNode: Manages file system metadata. Default port: 8020, configured via
fs.defaultFS. See core-default.xml.DataNode: Handles data reads and writes. Default port: 50010, configured via
dfs.datanode.address. See hdfs-default.xml.
Network configuration:
Enable ENI access on your cluster before importing data.
Log on to the AnalyticDB for MySQL console.
On the Cluster Information page, go to the Network Information section.
Turn on the Elastic Network Interface (ENI) switch.
Enabling or disabling ENI interrupts database connections for about 2 minutes. Read and write operations are unavailable during this time. Plan the change during a maintenance window.
Import HDFS data
Step 1: Create a destination database
CREATE DATABASE IF NOT EXISTS adb_demo;Step 2: Create an HDFS external table
In the adb_demo database, create an external table that maps to your HDFS data. Select the tab that matches your file format.
CSV:
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
uid string,
other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
"format":"csv",
"delimiter":",",
"hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
}';Parquet:
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
uid string,
other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
"format":"parquet",
"hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
}';ORC: Use the same syntax as Parquet, but set "format":"orc".
For all parameters available in TABLE_PROPERTIES, see External table parameters.
Step 3: Create a destination table
Create a native AnalyticDB for MySQL table to receive the imported data.
Standard table:
CREATE TABLE IF NOT EXISTS adb_hdfs_import_test
(
uid string,
other string
)
DISTRIBUTED BY HASH(uid);Partitioned table — include both data columns and partition key columns. Partition key columns must appear at the end of the column list:
CREATE TABLE IF NOT EXISTS adb_hdfs_import_parquet_partition
(
uid string,
other string,
p1 date,
p2 int,
p3 varchar
)
DISTRIBUTED BY HASH(uid);Step 4: Import data
Choose one of the following methods. The syntax is the same for both standard and partitioned destination tables.
| Method | Command | When to use |
|---|---|---|
| INSERT OVERWRITE (recommended) | INSERT OVERWRITE <dest_table> SELECT * FROM <external_table> | Large datasets. Supports batch import with high throughput. Data becomes visible only after a successful import; rolled back on failure. |
| INSERT INTO | INSERT INTO <dest_table> SELECT * FROM <external_table> | Small datasets. Inserted rows are immediately available for queries. |
| Asynchronous task | SUBMIT JOB INSERT OVERWRITE <dest_table> SELECT * FROM <external_table> | Long-running imports. Returns a job_id immediately so you can monitor progress. |
Example — INSERT OVERWRITE:
INSERT OVERWRITE adb_hdfs_import_test
SELECT * FROM hdfs_import_test_external_table;Example — asynchronous task:
SUBMIT JOB INSERT OVERWRITE adb_hdfs_import_test
SELECT * FROM hdfs_import_test_external_table;The command returns a job_id:
+---------------------------------------+
| job_id |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+Use the job_id to check import status. For details, see Asynchronously submit an import task.
Verify the import
After the import completes, run the following query against adb_demo to confirm the data loaded correctly:
SELECT * FROM adb_hdfs_import_test LIMIT 100;Create a partitioned HDFS external table
Use partitioned external tables when your HDFS data is organized into a hierarchical directory structure. HDFS supports partitioned data in Parquet, CSV, and ORC formats. Partitions form a directory hierarchy like this:
parquet_partition_classic/
├── p1=2020-01-01
│ ├── p2=4
│ │ ├── p3=SHANGHAI
│ │ │ ├── 000000_0
│ │ │ └── 000000_1
│ │ └── p3=SHENZHEN
│ │ └── 000000_0
│ └── p2=6
│ └── p3=SHENZHEN
│ └── 000000_0
├── p1=2020-01-02
│ └── p2=8
│ ├── p3=SHANGHAI
│ │ └── 000000_0
│ └── p3=SHENZHEN
│ └── 000000_0
└── p1=2020-01-03
└── p2=6
├── p2=HANGZHOU
└── p3=SHENZHEN
└── 000000_0The following statement creates an external table for this partitioned Parquet dataset:
CREATE TABLE IF NOT EXISTS hdfs_parquet_partition_table
(
uid varchar,
other varchar,
p1 date,
p2 int,
p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
"hdfs_url":"hdfs://172.17.***.**:9000/adb/parquet_partition_classic/",
"format":"parquet",
"partition_column":"p1, p2, p3"
}';To use CSV or ORC data instead, change "format" to "csv" or "orc". If format is omitted, CSV is used by default.
Rules for partitioned external tables:
The
partition_columnproperty lists the partition key columns in the order they appear in the directory hierarchy. The order must match exactly.Define partition key columns in the
CREATE TABLEstatement and place them at the end of the column list, after all data columns.Partition key columns support these data types:
BOOLEAN,TINYINT,SMALLINT,INT,INTEGER,BIGINT,FLOAT,DOUBLE,DECIMAL,VARCHAR,STRING,DATE, andTIMESTAMP.Partition key columns behave like regular columns in queries — you can SELECT and filter on them.
For other
TABLE_PROPERTIESparameters, see the CSV parameters table or Parquet/ORC parameters table.
External table parameters
CSV parameters
| Parameter | Required | Description |
|---|---|---|
ENGINE='HDFS' | Yes | Sets the storage engine to HDFS. |
TABLE_PROPERTIES | Yes | JSON object containing HDFS connection and format settings. |
format | No | File format. Set to csv. If omitted, CSV is used by default. |
delimiter | No | Column delimiter character. Default: ,. |
hdfs_url | Yes | Absolute path to the HDFS file or directory. Must start with hdfs://. Example: hdfs://172.17.*.*:9000/adb/data/file.csv |
partition_column | No | Comma-separated list of partition key columns. Required for partitioned data. |
compress_type | No | Compression format. CSV supports Gzip only. |
skip_header_line_count | No | Number of header rows to skip. Set to 1 to skip the column header row. Default: 0. |
hdfs_ha_host_port | No | NameNode IP addresses and ports for high availability (HA) clusters. Format: ip1:port1,ip2:port2. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021 |
Parquet and ORC parameters
| Parameter | Required | Description |
|---|---|---|
ENGINE='HDFS' | Yes | Sets the storage engine to HDFS. |
TABLE_PROPERTIES | Yes | JSON object containing HDFS connection and format settings. |
format | Yes | File format. Set to parquet or orc. |
hdfs_url | Yes | Absolute path to the HDFS file or directory. Must start with hdfs://. |
partition_column | No | Comma-separated list of partition key columns. Required for partitioned data. |
hdfs_ha_host_port | No | NameNode IP addresses and ports for HA clusters. Format: ip1:port1,ip2:port2. |
Column mapping rules for Parquet and ORC files:
Column names in the
CREATE TABLEstatement must match column names in the file (case-insensitive). The column order must also match.You can include only a subset of columns from the file. Excluded columns are not imported.
If the
CREATE TABLEstatement references a column that does not exist in the file, queries for that column return NULL.
Limitations
CSV external tables support Gzip compression only.
Parquet external tables cannot use columns of the
STRUCTtype.ORC external tables cannot use the
LIST,STRUCT, orUNIONtype. TheMAPtype can be used in table creation but cannot be queried.
Data type mappings
Parquet to AnalyticDB for MySQL
| Basic type in Parquet | Logical type in Parquet | Data type in AnalyticDB for MySQL |
|---|---|---|
| BOOLEAN | N/A | BOOLEAN |
| INT32 | INT_8 | TINYINT |
| INT32 | INT_16 | SMALLINT |
| INT32 | N/A | INT or INTEGER |
| INT64 | N/A | BIGINT |
| FLOAT | N/A | FLOAT |
| DOUBLE | N/A | DOUBLE |
| FIXED_LEN_BYTE_ARRAY, BINARY, INT64, or INT32 | DECIMAL | DECIMAL |
| BINARY | UTF-8 | VARCHAR, STRING, or JSON (available if the Parquet object contains a column of the JSON type) |
| INT32 | DATE | DATE |
| INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME |
| INT96 | N/A | TIMESTAMP or DATETIME |
ORC to AnalyticDB for MySQL
| Data type in ORC | Data type in AnalyticDB for MySQL |
|---|---|
| BOOLEAN | BOOLEAN |
| BYTE | TINYINT |
| SHORT | SMALLINT |
| INT | INT or INTEGER |
| LONG | BIGINT |
| DECIMAL | DECIMAL |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| BINARY, STRING, or VARCHAR | VARCHAR, STRING, or JSON (available if the ORC object contains a column of the JSON type) |
| TIMESTAMP | TIMESTAMP or DATETIME |
| DATE | DATE |
What's next
Asynchronously submit an import task — monitor long-running imports using a
job_idUpdate the minor version of a cluster — keep your cluster on a supported kernel version