AnalyticDB:Use external tables to import HDFS data

Last Updated: Dec 01, 2025

AnalyticDB for MySQL supports using external tables to import and export data. This topic describes how to use an external table to query data from the Hadoop Distributed File System (HDFS) and import the data into AnalyticDB for MySQL.

Prerequisites

  • Your AnalyticDB for MySQL cluster must run kernel version 3.1.4 or later.

    Note

    To view and update the minor version of an AnalyticDB for MySQL cluster, log on to the AnalyticDB for MySQL console and go to the Configuration Information section of the Cluster Information page.

  • The HDFS data file is in CSV, Parquet, or ORC format.

  • An HDFS cluster is created, and the data to be imported is saved to a folder in the HDFS cluster. In this example, the data file hdfs_import_test_data.csv is stored in the hdfs_import_test_csv_data folder.

  • The following service access ports are configured in the HDFS cluster for the AnalyticDB for MySQL cluster:

    • namenode: Reads and writes file system metadata. You can configure the port number in the fs.defaultFS parameter. The default port is 8020.

      For more information about the configuration, see core-default.xml.

    • datanode: Reads and writes data. You can configure the port number in the dfs.datanode.address parameter. The default port is 50010.

      For more information about the configuration, see hdfs-default.xml.

  • Elastic Network Interface (ENI) access is enabled if your cluster is an AnalyticDB for MySQL Data Warehouse Edition cluster in elastic mode.

    Important
    • Log on to the AnalyticDB for MySQL console. On the Cluster Information page, in the Network Information section, turn on the Elastic Network Interface (ENI) switch.

    • Enabling or disabling the ENI network interrupts database connections for about 2 minutes. During this time, read and write operations are unavailable. Evaluate the impact before you enable or disable the ENI network.

Procedure

  1. Create a destination database. In this example, the destination database in the AnalyticDB for MySQL cluster is named adb_demo.

    CREATE DATABASE IF NOT EXISTS adb_demo;    
  2. In the adb_demo destination database, use a CREATE TABLE statement to create an external table for the CSV, Parquet, or ORC file that you want to import. For more information, see the "Create an HDFS external table" section of this topic.

  3. Create a destination table.

    Use one of the following statements to create a destination table in the adb_demo destination database to store data imported from HDFS:

    • Create a destination table for a standard external table. In this example, the destination table is named adb_hdfs_import_test. The syntax is as follows:

      CREATE TABLE IF NOT EXISTS adb_hdfs_import_test
      (
          uid string,
          other string
      )
      DISTRIBUTED BY HASH(uid);
    • When you create a destination table for a partitioned external table, you must define both standard columns, such as uid and other, and partition key columns, such as p1, p2, and p3. In this example, the destination table is named adb_hdfs_import_parquet_partition. The syntax is as follows:

      CREATE TABLE IF NOT EXISTS adb_hdfs_import_parquet_partition
      (
          uid string,
          other string,
          p1 date,
          p2 int,
          p3 varchar
      )
      DISTRIBUTED BY HASH(uid);
  4. Import data from HDFS into the destination AnalyticDB for MySQL cluster.

    Select a method to import data based on your business requirements. The syntax for importing data into a partitioned table is the same as the syntax for a standard table. The following examples use a standard table; a sketch for the partitioned tables in this example follows the list:

    • Method 1 (Recommended): Use INSERT OVERWRITE to import data. This method supports batch import and delivers high performance. Data is visible after a successful import. If the import fails, the data is rolled back. The following code provides an example:

      INSERT OVERWRITE adb_hdfs_import_test
      SELECT * FROM hdfs_import_test_external_table;
    • Method 2: Use INSERT INTO to import data. Inserted data is available for real-time queries. Use this method for small amounts of data. The following code provides an example:

      INSERT INTO adb_hdfs_import_test
      SELECT * FROM hdfs_import_test_external_table;
    • Method 3: Run an asynchronous task to import data. The following code provides an example:

      SUBMIT JOB INSERT OVERWRITE adb_hdfs_import_test
      SELECT * FROM hdfs_import_test_external_table;

      The following result is returned:

      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2020112122202917203100908203303****** |
      +---------------------------------------+

      You can also check the status of the asynchronous task based on the job_id. For more information, see Asynchronously submit an import task.
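
    The same syntax applies to the partitioned tables in this example. The following minimal sketch imports data from the partitioned external table that is defined in the "Create a partitioned HDFS external table" section of this topic into the adb_hdfs_import_parquet_partition destination table:

    -- Hedged sketch: a partitioned import uses the same INSERT OVERWRITE syntax as a standard import.
    INSERT OVERWRITE adb_hdfs_import_parquet_partition
    SELECT * FROM hdfs_parquet_partition_table;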

What to do next

After the import is complete, log on to the adb_demo destination database in your AnalyticDB for MySQL cluster. Run the following statement to verify that the data was imported from the source table into the adb_hdfs_import_test destination table:

SELECT * FROM adb_hdfs_import_test LIMIT 100;
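
You can also compare the row counts of the external table and the destination table to check whether all rows were imported. The following hedged sketch uses standard COUNT queries on the tables from this example:

-- Hedged sketch: compare row counts of the source external table and the destination table.
SELECT COUNT(*) FROM hdfs_import_test_external_table;
SELECT COUNT(*) FROM adb_hdfs_import_test;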

Create an HDFS external table

  • Create an external table for a CSV file

    The statement is as follows:

    CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"csv",
        "delimiter":",",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
    }';

    The parameters are described as follows:

    • ENGINE='HDFS' (required): The storage engine of the external table. In this example, HDFS is used.

    • TABLE_PROPERTIES (required): The method that AnalyticDB for MySQL uses to access HDFS data.

    • format (required): The format of the data file. To create an external table for a CSV file, set this parameter to csv.

    • delimiter (required): The column delimiter of the CSV data file. In this example, a comma (,) is used.

    • hdfs_url (required): The absolute address of the destination data file or folder in the HDFS cluster. The address must start with hdfs://. Example: hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv

    • partition_column (optional): The partition key columns of the external table. Separate multiple columns with commas (,). For information about how to define partition key columns, see the "Create a partitioned HDFS external table" section of this topic.

    • compress_type (optional): The compression type of the data file. CSV files support only the Gzip compression type.

    • skip_header_line_count (optional): The number of header rows to skip at the beginning of the file during data import. For example, if the first row of a CSV file is the table header, set this parameter to 1 to skip that row. The default value is 0, which specifies that no rows are skipped.

    • hdfs_ha_host_port (optional): If the high availability (HA) feature is configured for the HDFS cluster, specify this parameter in the ip1:port1,ip2:port2 format, where the IP addresses and ports are those of the primary and secondary namenodes. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021
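
    For example, the following hedged sketch combines the optional parameters described above for a Gzip-compressed CSV file whose first row is a table header. The table name, file path, and parameter values are for illustration only and are not part of the original example:

    -- Hedged sketch: the table name and file path are hypothetical; parameter values are illustrative.
    CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table_gz
    (
        uid string,
        other string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"csv",
        "delimiter":",",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv.gz",
        "compress_type":"gzip",
        "skip_header_line_count":1
    }';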

  • Create an external table for a Parquet or ORC file

    The following statement shows how to create an external table for a Parquet file:

    CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"parquet",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
    }';

    The parameters are described as follows:

    • ENGINE='HDFS' (required): The storage engine of the external table. In this example, HDFS is used.

    • TABLE_PROPERTIES (required): The method that AnalyticDB for MySQL uses to access HDFS data.

    • format (required): The format of the data file. To create an external table for a Parquet file, set this parameter to parquet. To create an external table for an ORC file, set this parameter to orc.

    • hdfs_url (required): The absolute address of the destination data file or folder in the HDFS cluster. The address must start with hdfs://.

    • partition_column (optional): The partition key columns of the external table. Separate multiple columns with commas (,). For information about how to define partition key columns, see the "Create a partitioned HDFS external table" section of this topic.

    • hdfs_ha_host_port (optional): If the high availability (HA) feature is configured for the HDFS cluster, specify this parameter in the ip1:port1,ip2:port2 format, where the IP addresses and ports are those of the primary and secondary namenodes. Example: 192.168.xx.xx:8020,192.168.xx.xx:8021

    Note
    • The column names defined in the CREATE TABLE statement of the external table must be the same as the column names in the Parquet or ORC file. The names are not case-sensitive. The columns must also be defined in the same order.

    • When you create an external table, you can select only some of the columns in the Parquet or ORC file as the columns of the external table. Columns that are not selected are not imported (see the sketch after this note).

    • If the CREATE TABLE statement of the external table includes a column that does not exist in the Parquet or ORC file, queries against that column return NULL.
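
    For example, the following hedged sketch creates an ORC external table that maps only the uid column of the sample data. The table name and the hdfs_url folder path are hypothetical; the other column in the ORC file is simply not imported:

    -- Hedged sketch: the table name and hdfs_url are hypothetical.
    CREATE TABLE IF NOT EXISTS hdfs_import_test_orc_external_table
    (
        uid string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"orc",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_orc_data/"
    }';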

    Data type mappings between Parquet and AnalyticDB for MySQL

    | Basic type in Parquet | Logical type in Parquet | Data type in AnalyticDB for MySQL |
    | --- | --- | --- |
    | BOOLEAN | N/A | BOOLEAN |
    | INT32 | INT_8 | TINYINT |
    | INT32 | INT_16 | SMALLINT |
    | INT32 | N/A | INT or INTEGER |
    | INT64 | N/A | BIGINT |
    | FLOAT | N/A | FLOAT |
    | DOUBLE | N/A | DOUBLE |
    | FIXED_LEN_BYTE_ARRAY, BINARY, INT64, or INT32 | DECIMAL | DECIMAL |
    | BINARY | UTF-8 | VARCHAR, STRING, or JSON (JSON is available if the Parquet object contains a column of the JSON type) |
    | INT32 | DATE | DATE |
    | INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME |
    | INT96 | N/A | TIMESTAMP or DATETIME |

    Important

    Parquet external tables that use columns of the STRUCT type cannot be created.

    Data type mappings between ORC and AnalyticDB for MySQL

    | Data type in ORC | Data type in AnalyticDB for MySQL |
    | --- | --- |
    | BOOLEAN | BOOLEAN |
    | BYTE | TINYINT |
    | SHORT | SMALLINT |
    | INT | INT or INTEGER |
    | LONG | BIGINT |
    | DECIMAL | DECIMAL |
    | FLOAT | FLOAT |
    | DOUBLE | DOUBLE |
    | BINARY, STRING, or VARCHAR | VARCHAR, STRING, or JSON (JSON is available if the ORC object contains a column of the JSON type) |
    | TIMESTAMP | TIMESTAMP or DATETIME |
    | DATE | DATE |

    Important

    ORC external tables that use the LIST, STRUCT, or UNION type cannot be created. ORC external tables that use the MAP type can be created but cannot be queried.

Create a partitioned HDFS external table

Data in the CSV, Parquet, or ORC format that is stored in HDFS can be partitioned. Partitioned data forms a hierarchical directory structure on HDFS. In the following example, p1 is the level-1 partition, p2 is the level-2 partition, and p3 is the level-3 partition:

parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0

The following statement shows how to create a partitioned external table for the preceding Parquet data and specify its columns:

CREATE TABLE IF NOT EXISTS hdfs_parquet_partition_table
(
  uid varchar,
  other varchar,
  p1 date,
  p2 int,
  p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
  "hdfs_url":"hdfs://172.17.***.**:9000/adb/parquet_partition_classic/",
  "format":"parquet",  //To create an external table for a CSV or ORC file, change the value of format to csv or orc.
  "partition_column":"p1, p2, p3"  //For partitioned HDFS data, if you want to query data by partition, you must specify the partition_column parameter in the CREATE EXTERNAL TABLE statement when you import data to AnalyticDB for MySQL.
}';
Note
  • The partition_column property in the TABLE_PROPERTIES parameter specifies the partition key columns (p1, p2, and p3 in the example). The partition key columns specified by the partition_column property must conform to the partition levels of the sample data.

  • When you define columns in the statement, you must include the partition key columns (p1, p2, and p3 in the example) and their data types. The partition key columns must be placed at the end of the column definition.

  • The partition key columns defined in the statement must be in the same order as the partition key columns specified by the partition_column property.

  • Partition key columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.

  • When you query data, partition key columns can be displayed and used in the same manner as other columns (see the sketch after this note).

  • If you leave the format parameter empty, the CSV format is used.

  • For information about other parameters, see the parameter descriptions in the "Create an HDFS external table" section of this topic.
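
For example, after the partitioned external table is created, you can filter on the partition key columns in the same way as on other columns. The following hedged sketch queries the sample external table by partition; the filter values match the sample directory structure and are for illustration only:

-- Hedged sketch: filter on the partition key columns of the sample external table.
SELECT uid, other, p1, p2, p3
FROM hdfs_parquet_partition_table
WHERE p1 = '2020-01-01' AND p2 = 4
LIMIT 10;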