AnalyticDB for MySQL allows you to use external tables to import and export data. This topic describes how to query Apsara File Storage for HDFS data by using an AnalyticDB for MySQL external table and how to import that data into AnalyticDB for MySQL.
Prerequisites
- The version of the AnalyticDB for MySQL cluster is V3.1.4.4 or later.
Note
- For information about how to view the version of an AnalyticDB for MySQL cluster, see How can I view the version of an AnalyticDB for MySQL cluster?
- To upgrade the version of an AnalyticDB for MySQL cluster, submit a ticket.
- The Apsara File Storage for HDFS data file is in the CSV, Parquet, or ORC format.
- An Apsara File Storage for HDFS cluster is created, and the data that you want to import is stored in an Apsara File Storage for HDFS folder. In this topic, the hdfs_import_test_data.csv data file is stored in the hdfs_import_test_csv_data folder.
- The following service access ports for the AnalyticDB for MySQL cluster are configured in the Apsara File Storage for HDFS cluster:
  - namenode: used to read and write the metadata of a file system. You can configure the port number by using the fs.defaultFS parameter. The default port number is 8020. For information about the detailed configurations, see core-default.xml.
  - datanode: used to read and write data. You can configure the port number by using the dfs.datanode.address parameter. The default port number is 50010. For information about the detailed configurations, see hdfs-default.xml.
- If the AnalyticDB for MySQL cluster is in elastic mode, you must turn on ENI in the Network Information section of the Cluster Information page.
Procedure
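At a high level, the import flow is: create an external table that maps the Apsara File Storage for HDFS data, create a destination table in AnalyticDB for MySQL, and import the data from the external table into the destination table. The following statements are a minimal sketch of the last two steps, assuming the adb_demo database and the adb_hdfs_import_test destination table referenced in the What to do next section; the destination table schema and distribution key are illustrative assumptions that must match your external table.

-- Sketch: create a destination table whose columns match the external table.
-- The schema (uid, other) and the distribution key are assumptions.
CREATE TABLE IF NOT EXISTS adb_demo.adb_hdfs_import_test
(
    uid varchar,
    other varchar
)
DISTRIBUTED BY HASH(uid);

-- Sketch: import data from the external table into the destination table.
INSERT OVERWRITE adb_demo.adb_hdfs_import_test
SELECT * FROM hdfs_import_test_external_table;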
What to do next
Connect to the adb_demo destination database in the AnalyticDB for MySQL cluster and execute the following statement to check whether the data is imported from the source table to the adb_hdfs_import_test destination table:
SELECT * FROM adb_hdfs_import_test LIMIT 100;
Create an Apsara File Storage for HDFS external table
- Create an external table in the CSV format
The following statement shows how to create an external table in the CSV format:
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv"
}';
| Parameter | Required | Description |
| --- | --- | --- |
| ENGINE='HDFS' | Yes | The storage engine used for the external table. In this example, HDFS is used as the storage engine. |
| TABLE_PROPERTIES | Yes | The connection information used by AnalyticDB for MySQL to access Apsara File Storage for HDFS data. |
| format | Yes | The format of the data file. When you create an external table in the CSV format, you must set this parameter to csv. |
| delimiter | Yes | The column delimiter of the CSV data file. In this example, a comma (,) is used as the delimiter. |
| hdfs_url | Yes | The absolute URL of the destination data file or folder in the Apsara File Storage for HDFS cluster. The URL must start with hdfs://. Example: hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv |
| partition_column | No | The partition column of the table. Separate multiple columns with commas (,). For more information about the methods to define a partition column, see Create a partitioned external table for Apsara File Storage for HDFS. |
| compress_type | No | The compression type of the data file. CSV files support only the gzip compression type. |
| skip_header_line_count | No | The number of header rows to skip when you import data. For example, if the first row of a CSV file is the table header and you set this parameter to 1, the first row is skipped during import. The default value is 0, which indicates that no rows are skipped. |
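The optional parameters can be combined. The following statement is a sketch rather than part of the original example: it assumes a hypothetical gzip-compressed CSV file whose first row is a table header, so compress_type is set to gzip and skip_header_line_count to 1.

CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table_gzip
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"csv",
    "delimiter":",",
    "compress_type":"gzip",
    "skip_header_line_count":1,
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_csv_data/hdfs_import_test_data.csv.gz"
}';

After the table is created, you can execute a statement such as SELECT * FROM hdfs_import_test_external_table_gzip LIMIT 10; to verify that AnalyticDB for MySQL can read the file.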
- Create an external table in the Parquet or ORC format
The following statement shows how to create an external table in the Parquet format:
CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"parquet",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
}';
| Parameter | Required | Description |
| --- | --- | --- |
| ENGINE='HDFS' | Yes | The storage engine of the external table. In this example, HDFS is used as the storage engine. |
| TABLE_PROPERTIES | Yes | The connection information used by AnalyticDB for MySQL to access Apsara File Storage for HDFS data. |
| format | Yes | The format of the data file. When you create an external table in the Parquet format, you must set this parameter to parquet. When you create an external table in the ORC format, you must set this parameter to orc. |
| hdfs_url | Yes | The absolute URL of the destination data file or folder in the Apsara File Storage for HDFS cluster. The URL must start with hdfs://. |
| partition_column | No | The partition column of the table. Separate multiple columns with commas (,). For more information about the methods for defining a partition column, see Create a partitioned external table for Apsara File Storage for HDFS. |

Note
- The column names used in the statement to create an external table must be the same as those in the Parquet or ORC file. Column names are not case-sensitive. The sequence of the columns in the statement must be the same as that in the Parquet file.
- When you create an external table, you can choose only specific columns in a Parquet or ORC file as the columns of the external table. Columns that are not selected in the Parquet or ORC file are not imported.
- If the statement used to create an external table contains a column that is not in the Parquet or ORC file, NULL is returned in the query results of this column.
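The Parquet statement above can be adapted for ORC by changing the format value. The following sketch assumes a hypothetical folder named hdfs_import_test_orc_data that contains ORC files:

CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table_orc
(
    uid string,
    other string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"orc",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_orc_data/"
}';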
The following table describes the mappings between data types in Parquet and AnalyticDB for MySQL V3.0.

| Basic type in Parquet | Logical type in Parquet | Data type in AnalyticDB for MySQL V3.0 |
| --- | --- | --- |
| BOOLEAN | None | BOOLEAN |
| INT32 | INT_8 | TINYINT |
| INT32 | INT_16 | SMALLINT |
| INT32 | None | INT or INTEGER |
| INT64 | None | BIGINT |
| FLOAT | None | FLOAT |
| DOUBLE | None | DOUBLE |
| FIXED_LEN_BYTE_ARRAY, BINARY, INT64, or INT32 | DECIMAL | DECIMAL |
| BINARY | UTF-8 | VARCHAR, STRING, or JSON (JSON is available if a Parquet file contains a column in the JSON format.) |
| INT32 | DATE | DATE |
| INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME |
| INT96 | None | TIMESTAMP or DATETIME |

Notice: External tables in the Parquet format do not support the STRUCT data type. Tables cannot be created for this data type.

The following table describes the mappings between data types in ORC and AnalyticDB for MySQL V3.0.

| Data type in ORC | Data type in AnalyticDB for MySQL V3.0 |
| --- | --- |
| BOOLEAN | BOOLEAN |
| BYTE | TINYINT |
| SHORT | SMALLINT |
| INT | INT or INTEGER |
| LONG | BIGINT |
| DECIMAL | DECIMAL |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| BINARY, STRING, or VARCHAR | VARCHAR, STRING, or JSON (JSON is available if an ORC file contains a column in the JSON format.) |
| TIMESTAMP | TIMESTAMP or DATETIME |
| DATE | DATE |

Notice: External tables in the ORC format do not support the LIST, STRUCT, or UNION data type. Tables that contain columns of the MAP data type can be created, but queries cannot be performed on them.
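As noted above, an external table does not have to map every column in a Parquet or ORC file. The following sketch maps only the uid column of the Parquet data from the earlier example; the other column is simply not imported:

CREATE TABLE IF NOT EXISTS hdfs_import_test_external_table_uid
(
    uid string
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
    "format":"parquet",
    "hdfs_url":"hdfs://172.17.***.***:9000/adb/hdfs_import_test_parquet_data/"
}';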
Create a partitioned external table for Apsara File Storage for HDFS
In the following example, a Parquet file is stored in Apsara File Storage for HDFS in a three-level partitioned directory structure. p1 indicates the first-level partition, p2 indicates the second-level partition, and p3 indicates the third-level partition.

parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0

The following statement shows how to create a partitioned external table for this directory structure:
CREATE TABLE IF NOT EXISTS hdfs_parquet_partition_table
(
uid varchar,
other varchar,
p1 date,
p2 int,
p3 varchar
)
ENGINE='HDFS'
TABLE_PROPERTIES='{
"hdfs_url":"hdfs://172.17.***.**:9000/adb/parquet_partition_classic/",
"format":"parquet", //Specify the format of the file. To create a CSV or ORC file, change the format
value to csv
or orc
.
"partition_column":"p1, p2, p3" //If you want to query the Apsara File Storage for HDFS data by partition, you must specify partition columns (partition_column
) in the statement used to create an external table when you import data into AnalyticDB MySQL.
}';
- The partition_column property of TABLE_PROPERTIES must declare the partition columns (p1, p2, and p3 in this example).
- When you define partition columns in an external table, you must specify the columns that contain partitions and their data types (such as p1 date, p2 int, and p3 varchar in the example). You must specify the partition columns at the end of the column definitions.
- The sequence of partition columns defined in the statement used to create the external table must be the same as the sequence of partition columns specified by the partition_column property.
- Partition columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.
- The syntax used to query partition columns and the way in which the query results are displayed are the same as those for other columns.
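Because partition columns are queried in the same way as other columns, you can filter on them directly. The following sketch queries the example table for a single partition; the predicate values match the sample directory structure above:

SELECT * FROM hdfs_parquet_partition_table
WHERE p1 = '2020-01-01' AND p2 = 4 AND p3 = 'SHANGHAI'
LIMIT 100;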