All Products
Search
Document Center

AnalyticDB:Use external tables to import HDFS data

Last Updated:Mar 28, 2026

External tables in AnalyticDB for MySQL act as a bridge to your Hadoop Distributed File System (HDFS). Create an external table that maps to your HDFS files, then run an INSERT SELECT statement to load the data into a native AnalyticDB for MySQL table.

Prerequisites

Before you begin, confirm the following:

Cluster requirements:

  • Your AnalyticDB for MySQL cluster runs kernel version 3.1.4 or later. To check and update the version, log on to the AnalyticDB for MySQL console and go to the Configuration Information section on the Cluster Information page. For instructions, see Update the minor version of a cluster.

  • Your cluster is a Data Warehouse Edition cluster running in elastic mode. Elastic mode is required for Elastic Network Interface (ENI) access.

HDFS requirements:

  • Your HDFS data files are in CSV, Parquet, or ORC format.

  • Your HDFS cluster is running and the data to import is available in an HDFS directory. This topic uses hdfs_import_test_data.csv as an example.

  • The following service ports are open in the HDFS cluster for AnalyticDB for MySQL:

    • NameNode: Manages file system metadata. Default port: 8020, configured via fs.defaultFS. See core-default.xml.

    • DataNode: Handles data reads and writes. Default port: 50010, configured via dfs.datanode.address. See hdfs-default.xml.

Network configuration:

Enable ENI access on your cluster before importing data.

  1. Log on to the AnalyticDB for MySQL console.

  2. On the Cluster Information page, go to the Network Information section.

  3. Turn on the Elastic Network Interface (ENI) switch.

Important

Enabling or disabling ENI interrupts database connections for about 2 minutes. Read and write operations are unavailable during this time. Plan the change during a maintenance window.

Import HDFS data

Step 1: Create a destination database

CREATE DATABASE IF NOT EXISTS adb_demo;

Step 2: Create an HDFS external table

In the adb_demo database, create an external table that maps to your HDFS data. Select the tab that matches your file format.

CSV:

CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
}';

Parquet:

CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"parquet",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
}';

ORC: Use the same syntax as Parquet, but set "format":"orc".

For all parameters available in TABLE_PROPERTIES, see External table parameters.

Step 3: Create a destination table

Create a native AnalyticDB for MySQL table to receive the imported data.

Standard table:

CREATE TABLE IF NOT EXISTS adb_hdfs_import_test
(
    uid string,
    other string
)
DISTRIBUTED BY HASH(uid);

Partitioned table — include both data columns and partition key columns. Partition key columns must appear at the end of the column list:

CREATE TABLE IF NOT EXISTS adb_hdfs_import_parquet_partition
(
    uid string,
    other string,
    p1 date,
    p2 int,
    p3 varchar
)
DISTRIBUTED BY HASH(uid);

Step 4: Import data

Choose one of the following methods. The syntax is the same for both standard and partitioned destination tables.

MethodCommandWhen to use
INSERT OVERWRITE (recommended)INSERT OVERWRITE <dest_table> SELECT * FROM <external_table>Large datasets. Supports batch import with high throughput. Data becomes visible only after a successful import; rolled back on failure.
INSERT INTOINSERT INTO <dest_table> SELECT * FROM <external_table>Small datasets. Inserted rows are immediately available for queries.
Asynchronous taskSUBMIT JOB INSERT OVERWRITE <dest_table> SELECT * FROM <external_table>Long-running imports. Returns a job_id immediately so you can monitor progress.

Example — INSERT OVERWRITE:

INSERT OVERWRITE adb_hdfs_import_test
SELECT * FROM hdfs_import_test_external_table;

Example — asynchronous task:

SUBMIT JOB INSERT OVERWRITE adb_hdfs_import_test
SELECT * FROM hdfs_import_test_external_table;

The command returns a job_id:

+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+

Use the job_id to check import status. For details, see Asynchronously submit an import task.

Verify the import

After the import completes, run the following query against adb_demo to confirm the data loaded correctly:

SELECT * FROM adb_hdfs_import_test LIMIT 100;

Create a partitioned HDFS external table

Use partitioned external tables when your HDFS data is organized into a hierarchical directory structure. HDFS supports partitioned data in Parquet, CSV, and ORC formats. Partitions form a directory hierarchy like this:

parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p2=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0

The following statement creates an external table for this partitioned Parquet dataset:

CREATE TABLE IF NOT EXISTS hdfs_parquet_partition_table
(
  uid varchar,
  other varchar,
  p1 date,
  p2 int,
  p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
  "hdfs_url":"hdfs://172.17.***.**:9000/adb/parquet_partition_classic/",
  "format":"parquet",
  "partition_column":"p1, p2, p3"
}';

To use CSV or ORC data instead, change "format" to "csv" or "orc". If format is omitted, CSV is used by default.

Rules for partitioned external tables:

  • The partition_column property lists the partition key columns in the order they appear in the directory hierarchy. The order must match exactly.

  • Define partition key columns in the CREATE TABLE statement and place them at the end of the column list, after all data columns.

  • Partition key columns support these data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.

  • Partition key columns behave like regular columns in queries — you can SELECT and filter on them.

  • For other TABLE_PROPERTIES parameters, see the CSV parameters table or Parquet/ORC parameters table.

External table parameters

CSV parameters

ParameterRequiredDescription
ENGINE='HDFS'YesSets the storage engine to HDFS.
TABLE_PROPERTIESYesJSON object containing HDFS connection and format settings.
formatNoFile format. Set to csv. If omitted, CSV is used by default.
delimiterNoColumn delimiter character. Default: ,.
hdfs_urlYesAbsolute path to the HDFS file or directory. Must start with hdfs://. Example: hdfs://172.17.*.*:9000/adb/data/file.csv
partition_columnNoComma-separated list of partition key columns. Required for partitioned data.
compress_typeNoCompression format. CSV supports Gzip only.
skip_header_line_countNoNumber of header rows to skip. Set to 1 to skip the column header row. Default: 0.
hdfs_ha_host_portNoNameNode IP addresses and ports for high availability (HA) clusters. Format: ip1:port1,ip2:port2. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021

Parquet and ORC parameters

ParameterRequiredDescription
ENGINE='HDFS'YesSets the storage engine to HDFS.
TABLE_PROPERTIESYesJSON object containing HDFS connection and format settings.
formatYesFile format. Set to parquet or orc.
hdfs_urlYesAbsolute path to the HDFS file or directory. Must start with hdfs://.
partition_columnNoComma-separated list of partition key columns. Required for partitioned data.
hdfs_ha_host_portNoNameNode IP addresses and ports for HA clusters. Format: ip1:port1,ip2:port2.

Column mapping rules for Parquet and ORC files:

  • Column names in the CREATE TABLE statement must match column names in the file (case-insensitive). The column order must also match.

  • You can include only a subset of columns from the file. Excluded columns are not imported.

  • If the CREATE TABLE statement references a column that does not exist in the file, queries for that column return NULL.

Limitations

  • CSV external tables support Gzip compression only.

  • Parquet external tables cannot use columns of the STRUCT type.

  • ORC external tables cannot use the LIST, STRUCT, or UNION type. The MAP type can be used in table creation but cannot be queried.

Data type mappings

Parquet to AnalyticDB for MySQL

Basic type in ParquetLogical type in ParquetData type in AnalyticDB for MySQL
BOOLEANN/ABOOLEAN
INT32INT_8TINYINT
INT32INT_16SMALLINT
INT32N/AINT or INTEGER
INT64N/ABIGINT
FLOATN/AFLOAT
DOUBLEN/ADOUBLE
FIXED_LEN_BYTE_ARRAY, BINARY, INT64, or INT32DECIMALDECIMAL
BINARYUTF-8VARCHAR, STRING, or JSON (available if the Parquet object contains a column of the JSON type)
INT32DATEDATE
INT64TIMESTAMP_MILLISTIMESTAMP or DATETIME
INT96N/ATIMESTAMP or DATETIME

ORC to AnalyticDB for MySQL

Data type in ORCData type in AnalyticDB for MySQL
BOOLEANBOOLEAN
BYTETINYINT
SHORTSMALLINT
INTINT or INTEGER
LONGBIGINT
DECIMALDECIMAL
FLOATFLOAT
DOUBLEDOUBLE
BINARY, STRING, or VARCHARVARCHAR, STRING, or JSON (available if the ORC object contains a column of the JSON type)
TIMESTAMPTIMESTAMP or DATETIME
DATEDATE

What's next