AnalyticDB for MySQL allows you to use external tables to import and export data. This topic describes how to use an AnalyticDB for MySQL external mapping table to query Apsara File Storage for HDFS data and import the data into AnalyticDB for MySQL.

Prerequisites

  • The version of the AnalyticDB for MySQL cluster is V3.1.4.4 or later. For more information, see Release notes.
  • The format of the Apsara File Storage for HDFS file is Parquet, ORC, CSV, or TEXT.
  • An Apsara File Storage for HDFS cluster is created, and the data to be imported is stored in an Apsara File Storage for HDFS folder. In this topic, the hdfs_import_test_data.csv file is used.
  • The following service access ports for the AnalyticDB for MySQL cluster are configured in the Apsara File Storage for HDFS cluster:
    • namenode: used to read and write metadata of a file system. You can configure the port number by using the fs.defaultFS parameter. The default port number is 8020.

      For detailed configurations, see core-default.xml.

    • datanode: used to read and write data. You can configure the port number by using the dfs.datanode.address parameter. The default port number is 50010.

      For detailed configurations, see hdfs-default.xml.

  • If the AnalyticDB for MySQL cluster is in elastic mode, you must turn on ENI in the Network Information section of the Cluster Information page.

Procedure

  1. Connect to the AnalyticDB for MySQL cluster. For more information, see Connect to an AnalyticDB for MySQL cluster.
  2. Create a destination database. For more information, see Create a database.

    In this example, the database named adb_demo is used as the destination database in the AnalyticDB for MySQL cluster.

  3. Create an external mapping table. For more information about the syntax for different file formats, see Appendix 1: Create an Apsara File Storage for HDFS external table. An example for the CSV file used in this topic is shown below.
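
    The following statement, which is the same as the CSV example in Appendix 1, maps the hdfs_import_test_data.csv file in the Apsara File Storage for HDFS cluster to an external table named hdfs_import_test_external_table in the adb_demo database. Replace the IP address and port in hdfs_url with those of your own cluster.
      CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
      (
          uid string,
          other string
      )
      ENGINE='HDFS'
      TABLE_PROPERTIES='{
          "format":"csv",
          "delimiter":",",
          "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
      }';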
  4. Create a destination table.

    You can use one of the following statements to create a destination table in the adb_demo destination database to store the data imported from Apsara File Storage for HDFS.

    • The following statement shows how to create a destination table corresponding to the standard external mapping table. In this example, the destination table is named adb_hdfs_import_test.
      CREATE TABLE IF NOT EXISTS adb_hdfs_import_test
      (
          uid string,
          other string
      )
      DISTRIBUTED BY HASH(uid);
    • The following statement shows how to create a destination table corresponding to the partitioned external mapping table. In this example, the destination table is named adb_hdfs_import_parquet_partition. To create the table, you must define ordinary columns (such as uid and other) and partition columns (such as p1, p2, and p3) in the statement.
      CREATE TABLE IF NOT EXISTS adb_hdfs_import_parquet_partition
      (
          uid string,
          other string,
          p1 date,
          p2 int,
          p3 varchar
      )
      DISTRIBUTED BY HASH(uid);
  5. Import data from Apsara File Storage for HDFS to the destination AnalyticDB for MySQL cluster.

    You can use one of the following methods to import data. The syntax to import data by using a partitioned table is the same as the syntax to import data by using a standard table.

    • Method 1 (recommended): Use the INSERT OVERWRITE INTO statement to import data. This method allows you to batch import data and provides good performance. If the import succeeds, the data is available for query. Otherwise, the data is rolled back to its previous state. The following statements show how to import data by using this method:
      INSERT OVERWRITE INTO adb_hdfs_import_test
      SELECT * FROM hdfs_import_test_external_table;
    • Method 2: Use the INSERT INTO statement to import data. This method allows you to query imported data in real time and is suitable for scenarios that involve small amounts of data. The following statements show how to import data by using this method:
      INSERT INTO adb_hdfs_import_test
      SELECT * FROM hdfs_import_test_external_table;
    • Method 3: Asynchronously submit a task to import data. The following statements show how to import data by using this method:
      SUBMIT JOB INSERT OVERWRITE INTO adb_hdfs_import_test
      SELECT * FROM hdfs_import_test_external_table;
      The following result is returned:
      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2020112122202917203100908203303****** |
      +---------------------------------------+

      You can check the state of the asynchronous task based on the job_id value. For more information, see Asynchronously submit an import or export task.

What to do next

After data is imported, you can log on to the adb_demo destination database in the AnalyticDB for MySQL cluster. Then, execute the following statement to check whether data is imported from the source table to the adb_hdfs_import_test destination table:
SELECT * FROM adb_hdfs_import_test LIMIT 100;
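
You can also count the imported rows and compare the result with the number of rows in the source file:
SELECT COUNT(*) FROM adb_hdfs_import_test;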

Appendix 1: Create an Apsara File Storage for HDFS external table

You can create an Apsara File Storage for HDFS external table in the CSV, Parquet, ORC, or TEXT format in the AnalyticDB for MySQL cluster. The following section provides the syntax used to create external tables in different formats:

  • Create an external table in the CSV format.
    Sample statement:
    CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"csv",
        "delimiter":",",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
    }';
    The following list describes the parameters:
    • ENGINE='HDFS' (required): The storage engine of the external table. In this example, HDFS is used as the storage engine.
    • TABLE_PROPERTIES (required): The method used by AnalyticDB for MySQL to access Apsara File Storage for HDFS data.
    • format (required): The format of the data file. When you create an external table in the CSV format, you must set this parameter to csv.
    • delimiter (required): The column delimiter of the CSV data file. In this example, a comma (,) is used as the delimiter.
    • hdfs_url (required): The absolute address of the destination data file or folder in the Apsara File Storage for HDFS cluster. The address must start with hdfs://. Example: hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv.
    • partition_column (optional): The partition column of the table. Separate multiple columns with commas (,). For more information about how to define partition columns, see Appendix 2: Create a partitioned external table for Apsara File Storage for HDFS.
    • compress_type (optional): The compression type of the data file. CSV files support only the gzip compression type.
    • skip_header_line_count (optional): The number of header lines to skip when you import data. The first line of a CSV file is the table header. If you set this parameter to 1, the first line of the file is skipped during the import. The default value is 0, which indicates that no lines are skipped. For a sketch that combines the optional parameters, see the example after this list.

  • Create an external table in the Parquet format.
    Note
    • The column names used in the statement to create an external table must be the same as those in a Parquet file. The names are case-insensitive. The sequence of the columns in the statement must also be the same as that in the Parquet file.
    • When you create an external table, you can choose only specific columns in a Parquet file as the columns of the external table. The columns that are not selected in the Parquet file are not imported.
    • If the statement used to create an external table contains a column that is not in the Parquet file, NULL is returned in the query results of this column.

    Sample statement:

    CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"parquet",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
    }';
    The following list describes the parameters:
    • ENGINE='HDFS' (required): The storage engine of the external table. In this example, HDFS is used as the storage engine.
    • TABLE_PROPERTIES (required): The method used by AnalyticDB for MySQL to access Apsara File Storage for HDFS data.
    • format (required): The format of the data file. When you create an external table in the Parquet format, you must set this parameter to parquet.
    • hdfs_url (required): The absolute address of the destination data file or folder in the Apsara File Storage for HDFS cluster. The address must start with hdfs://.
    • partition_column (optional): The partition column of the table. Separate multiple columns with commas (,). For more information about how to define partition columns, see Appendix 2: Create a partitioned external table for Apsara File Storage for HDFS.
    • compress_type (optional): The compression type of the data file. Parquet files support the Snappy, gzip, LZO, Brotli, LZ4, ZSTD, and uncompressed types. Example: "compress_type":"snappy".

    When you create an external table in the Parquet format, take note of the mappings between data types in Parquet files and data types in AnalyticDB for MySQL.

  • Create an external table in the ORC format.
    Sample statement:
    CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"orc",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_orc_data/"
    }';
    The following list describes the parameters:
    • ENGINE='HDFS' (required): The storage engine of the external table. In this example, HDFS is used as the storage engine.
    • TABLE_PROPERTIES (required): The method used by AnalyticDB for MySQL to access Apsara File Storage for HDFS data.
    • format (required): The format of the data file. When you create an external table in the ORC format, you must set this parameter to orc.
    • hdfs_url (required): The absolute address of the destination data file or folder in the Apsara File Storage for HDFS cluster. The address must start with hdfs://.
    • partition_column (optional): The partition column of the table. Separate multiple columns with commas (,). For more information about how to define partition columns, see Appendix 2: Create a partitioned external table for Apsara File Storage for HDFS.
  • Create an external table in the TEXT format.
    Sample statement:
    CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='HDFS'
    TABLE_PROPERTIES='{
        "format":"text",
        "row_delimiter":"\n",
        "field_delimiter":",",
        "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_text_data/"
    }';
    The following list describes the parameters:
    • ENGINE='HDFS' (required): The storage engine of the external table. In this example, HDFS is used as the storage engine.
    • TABLE_PROPERTIES (required): The method used by AnalyticDB for MySQL to access Apsara File Storage for HDFS data.
    • format (required): The format of the data file. When you create an external table in the TEXT format, you must set this parameter to text.
    • row_delimiter (required): The row delimiter of the TEXT file. Set the value to \n.
    • field_delimiter (required): The column delimiter of the TEXT file. The delimiter must be a single character, such as a comma (,). If you set the value to \n, each entire row is treated as a single field.
    • hdfs_url (required): The absolute address of the destination data file or folder in the Apsara File Storage for HDFS cluster. The address must start with hdfs://.
    • compress_type (optional): The compression type of the data file. TEXT files support only the gzip compression type.
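
The following sketch shows how the optional CSV parameters compress_type and skip_header_line_count described above can be combined, and how an external table can be queried directly to read the Apsara File Storage for HDFS data. It is only an illustration: the table name and file path are placeholders, and writing the skip_header_line_count value as a quoted string is an assumption that follows the string form used by the other properties.
-- Map a gzip-compressed CSV file whose first line is a header (hypothetical path and table name).
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table_gzip
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv.gz",
    "compress_type":"gzip",
    "skip_header_line_count":"1"
}';

-- Query the Apsara File Storage for HDFS data directly through the external table.
SELECT * FROM hdfs_import_test_external_table_gzip LIMIT 100;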

Appendix 2: Create a partitioned external table for Apsara File Storage for HDFS

Data in the Parquet, CSV, or ORC format in Apsara File Storage for HDFS can be partitioned. Partitioned data is stored in a hierarchical directory structure. In the following example, p1 indicates the first-level partition, p2 indicates the second-level partition, and p3 indicates the third-level partition.
parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0

If you want to query the preceding Apsara File Storage for HDFS data by partition, you must specify partition columns (partition_column) in the statement used to create an external table when you import data into AnalyticDB for MySQL.

The following statement shows how to create an external table and specify partition columns in the external table. In this example, a Parquet file is used. To create an external table for a CSV or ORC file, you need only change the value of format to csv or orc.
CREATE TABLE IF NOT EXISTS hdfs_parquet_partition_table
(
  uid varchar,
  other varchar,
  p1 date,
  p2 int,
  p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
  "hdfs_url":"hdfs://172.17.***.**:9000/adb/parquet_partition_classic/",
  "format":"parquet",
  "partition_column":"p1, p2, p3"
}';
Note
  • When you define partition columns in an external table, you must specify the columns that contain partitions and their data types (such as p1 date, p2 int, p3 varchar in the example). You must also specify the columns as partition columns in the partition_column property of TABLE_PROPERTIES.
  • The sequence of partition columns defined in the statement used to create an external table and the sequence of partition columns specified by the partition_column property must both be the same as the partition sequence in the source Apsara File Storage for HDFS file. For example, the partition sequence in the source file is p1, p2, and p3. The sequence of partition columns defined in the table creation statement and the sequence of partition columns specified by the partition_column property must also be p1, p2, and p3.
  • Partition columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT or INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR or STRING, DATE, and TIMESTAMP.
  • The syntax used to query partition columns and the way the query results are displayed are the same as those of other columns.
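
After the partitioned external table is created, you can import and query the data in the same way as you would a standard table. The following statements are a sketch based on the tables used in this topic: adb_hdfs_import_parquet_partition is the partitioned destination table created in the Procedure section, and the partition values in the WHERE clause are taken from the sample directory structure above.
-- Import all partitions from the external table into the destination table.
INSERT OVERWRITE INTO adb_hdfs_import_parquet_partition
SELECT * FROM hdfs_parquet_partition_table;

-- Query partition columns in the same way as other columns, for example to read a single partition.
SELECT * FROM adb_hdfs_import_parquet_partition
WHERE p1 = '2020-01-01' AND p2 = 4 AND p3 = 'SHANGHAI'
LIMIT 100;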