AnalyticDB for MySQL lets you import external data using external tables. This topic describes how to use external tables to import data from OSS into an AnalyticDB for MySQL cluster.
Prerequisites
The AnalyticDB for MySQL cluster and the OSS bucket are in the same region. For more information, see Activate OSS.
Elastic Network Interface (ENI) access is enabled for your AnalyticDB for MySQL Data Lakehouse Edition cluster.
Important
To enable ENI, log on to the AnalyticDB for MySQL console. On the Cluster Information page, in the Network Information section, turn on Elastic Network Interface (ENI).
Enabling or disabling ENI interrupts database connections for about 2 minutes. During this period, you cannot read or write data. Carefully evaluate the impact before you enable or disable ENI.
Data preparation
In this example, the data file person.csv is uploaded to the testBucketName/adb/dt=2023-06-15 directory in OSS. Line feeds are used as row delimiters and commas (,) are used as column delimiters. The following sample data is in the person.csv file:
1,james,10,2023-06-15
2,bond,20,2023-06-15
3,jack,30,2023-06-15
4,lucy,40,2023-06-15

Procedure
Enterprise Edition, Basic Edition, and Data Lakehouse Edition
Go to the SQL editor.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
Import data.
You can import data by using the regular import method (default) or the elastic import method. Regular import reads source data on compute nodes and builds indexes on storage nodes, which consumes computing and storage resources. Elastic import reads source data and builds indexes in a serverless Spark job, which consumes resources from a Job-type resource group. Elastic import is supported only for Enterprise Edition, Basic Edition, and Data Lakehouse Edition clusters of V3.1.10.0 or later that have a Job-type resource group. For more information, see Data import methods.
Regular import
Create an external database.
```sql
CREATE EXTERNAL DATABASE adb_external_db;
```
Create an external table. Use the CREATE EXTERNAL TABLE statement to create an OSS external table in the adb_external_db database. In this example, the external table is named adb_external_db.person.
Note
An AnalyticDB for MySQL external table must use the same field names, number of fields, field order, and field types as the source data in the OSS file.
For more information about the syntax for creating OSS external tables, see CREATE EXTERNAL TABLE.
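As a minimal sketch, the statement for this example might look as follows, assuming the person.csv layout from the Data preparation section; the endpoint, bucket name, and AccessKey values are placeholders, and the statement is modeled on the elastic import example later in this topic. See CREATE EXTERNAL TABLE for the authoritative syntax.
```sql
-- Hedged sketch: OSS CSV external table for person.csv.
-- Endpoint, bucket, and AccessKey values are placeholders.
CREATE EXTERNAL TABLE adb_external_db.person (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/dt=2023-06-15/person.csv",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "delimiter":","
}';
```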
Query data.
After the table is created, you can query the data in OSS by running a SELECT statement in AnalyticDB for MySQL.
```sql
SELECT * FROM adb_external_db.person;
```
The following result is returned:
```
+------+-------+------+------------+
| id   | name  | age  | dt         |
+------+-------+------+------------+
|    1 | james |   10 | 2023-06-15 |
|    2 | bond  |   20 | 2023-06-15 |
|    3 | jack  |   30 | 2023-06-15 |
|    4 | lucy  |   40 | 2023-06-15 |
+------+-------+------+------------+
```
Create a database in AnalyticDB for MySQL. If a database already exists, skip this step. The following statement provides an example:
```sql
CREATE DATABASE adb_demo;
```
Create a table in AnalyticDB for MySQL to store the data imported from OSS. The following statement provides an example:
Note
The internal table must use the same field names, number of fields, field order, and field types as the external table.
```sql
CREATE TABLE IF NOT EXISTS adb_demo.adb_import_test (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) DISTRIBUTED BY HASH(id);
```
Import data into the table.
Method 1: Use the INSERT INTO statement to import data. If a primary key value is duplicated, the new data is ignored, which is equivalent to INSERT IGNORE INTO. For more information, see INSERT INTO.
```sql
INSERT INTO adb_demo.adb_import_test
SELECT * FROM adb_external_db.person;
```
Method 2: Use the INSERT OVERWRITE INTO statement to synchronously import data. This overwrites the existing data in the table.
```sql
INSERT OVERWRITE INTO adb_demo.adb_import_test
SELECT * FROM adb_external_db.person;
```
Method 3: Use the INSERT OVERWRITE INTO statement to asynchronously import data. For more information, see Asynchronous write.
```sql
SUBMIT JOB INSERT OVERWRITE adb_demo.adb_import_test
SELECT * FROM adb_external_db.person;
```
Elastic import
Create a database. If a database already exists, skip this step. The following statement provides an example:
```sql
CREATE DATABASE adb_demo;
```
Create an external table.
Note
The field names, number of fields, field order, and field types of the AnalyticDB for MySQL external table must match the source data in the OSS file.
Elastic import supports creating external tables only by using the CREATE TABLE statement.
```sql
CREATE TABLE oss_import_test_external_table (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/dt=2023-06-15/person.csv",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "delimiter":","
}';
```
Important
When you create an external table, the supported TABLE_PROPERTIES parameters vary based on the file format, such as CSV, Parquet, or ORC:
- CSV format: only the `endpoint`, `url`, `accessid`, `accesskey`, `format`, `delimiter`, `null_value`, and `partition_column` parameters are supported.
- Parquet format: only the `endpoint`, `url`, `accessid`, `accesskey`, `format`, and `partition_column` parameters are supported.
- ORC format: only the `endpoint`, `url`, `accessid`, `accesskey`, `format`, and `partition_column` parameters are supported.
For more information about the parameters that you can set for external tables and their descriptions, see OSS non-partitioned external tables and OSS partitioned external tables.
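For example, a Parquet external table for elastic import uses only the parameters from the preceding list. The following minimal sketch assumes a hypothetical Parquet directory; the bucket path and AccessKey values are placeholders:
```sql
-- Hedged sketch: OSS Parquet external table for elastic import.
-- The path oss://testBucketName/adb/parquet/ is hypothetical.
CREATE TABLE oss_parquet_external_table (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/parquet/",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "format":"parquet"
}';
```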
Query data.
After the table is created, you can query data from OSS by running a SELECT statement in AnalyticDB for MySQL.
```sql
SELECT * FROM oss_import_test_external_table;
```
The following result is returned:
```
+------+-------+------+------------+
| id   | name  | age  | dt         |
+------+-------+------+------------+
|    1 | james |   10 | 2023-06-15 |
|    2 | bond  |   20 | 2023-06-15 |
|    3 | jack  |   30 | 2023-06-15 |
|    4 | lucy  |   40 | 2023-06-15 |
+------+-------+------+------------+
4 rows in set (0.35 sec)
```
Create a table in AnalyticDB for MySQL to store the data imported from OSS. The following statement provides an example:
Note
The internal table must use the same field names, number of fields, field order, and field types as the external table.
```sql
CREATE TABLE adb_import_test (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) DISTRIBUTED BY HASH(id);
```
Import data.
Important
Elastic import supports importing data only by using the INSERT OVERWRITE INTO statement.
Method 1: Run the INSERT OVERWRITE INTO statement to elastically import data. This overwrites the existing data in the table. The following statement provides an example:
```sql
/*+elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
INSERT OVERWRITE INTO adb_demo.adb_import_test
SELECT * FROM adb_demo.oss_import_test_external_table;
```
Method 2: Run the INSERT OVERWRITE INTO statement to elastically import data in an asynchronous manner. You can use the SUBMIT JOB statement to submit an asynchronous job that is scheduled in the background.
```sql
/*+elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test
SELECT * FROM adb_demo.oss_import_test_external_table;
```
Important
You cannot set a priority queue when you submit an elastic import job in an asynchronous manner.
The following result is returned:
```
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 202308151719510210170190**********    |
+---------------------------------------+
```
After you submit an asynchronous job by using SUBMIT JOB, the returned result indicates only that the job was submitted. You can use the job_id value to terminate the asynchronous job or query its status, as shown in the sketch below. For more information, see Submit an asynchronous import job.
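A hedged sketch of both operations, assuming the job_id value returned above. The kepler_meta_async_jobs view is also used in the optional check at the end of this procedure, and the CANCEL JOB statement is described in Submit an asynchronous import job; treat the exact column set of the view as an assumption.
```sql
-- Query the status of the asynchronous job by its job_id:
SELECT * FROM INFORMATION_SCHEMA.kepler_meta_async_jobs
WHERE job_name = '202308151719510210170190**********';

-- Terminate the asynchronous job if it is no longer needed:
CANCEL JOB '202308151719510210170190**********';
```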
Hint parameters:
- elastic_load: specifies whether to use elastic import. Valid values: true and false. Default value: false.
- elastic_load_configs: the configuration parameters of the elastic import feature. You must enclose the parameters within brackets ([ ]) and separate multiple parameters with vertical bars (|). The following table describes the parameters; a combined example follows the table.
| Parameter | Required | Description |
| --- | --- | --- |
| adb.load.resource.group.name | Yes | The name of the Job-type resource group that runs the elastic import job. |
| adb.load.job.max.acu | No | The maximum amount of resources for an elastic import job. Unit: AnalyticDB compute units (ACUs). Minimum value: 5 ACUs. Default value: number of shards plus 1. To query the number of shards in the cluster, execute `SELECT count(1) FROM information_schema.kepler_meta_shards;`. |
| spark.driver.resourceSpec | No | The resource type of the Spark driver. Default value: small. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic. |
| spark.executor.resourceSpec | No | The resource type of the Spark executor. Default value: large. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic. |
| spark.adb.executorDiskSize | No | The disk capacity of the Spark executor. Valid values: (0,100]. Unit: GiB. Default value: 10 GiB. For more information, see the "Specify driver and executor resources" section of the Conf configuration parameters topic. |
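For example, the following hedged sketch combines several of the preceding parameters in one hint. The resource group name resource_group and the 16-ACU cap are hypothetical values:
```sql
-- Hedged sketch: multiple elastic_load_configs parameters separated by vertical bars.
/*+elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|adb.load.job.max.acu=16|spark.executor.resourceSpec=large]*/
INSERT OVERWRITE INTO adb_demo.adb_import_test
SELECT * FROM adb_demo.oss_import_test_external_table;
```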
(Optional) Check whether the submitted import task is an elastic import task.
```sql
SELECT job_name, (job_type = 3) AS is_elastic_load
FROM INFORMATION_SCHEMA.kepler_meta_async_jobs
WHERE job_name = "2023081818010602101701907303151******";
```
The following result is returned:
```
+---------------------------------------+-----------------+
| job_name                              | is_elastic_load |
+---------------------------------------+-----------------+
| 20230815171951021017019072*********** |               1 |
+---------------------------------------+-----------------+
```
If the value of is_elastic_load is 1, the submitted import job is an elastic import job. If the value is 0, the job is a regular import job.
Data Lakehouse Edition
Connect to the cluster and create a database.
```sql
CREATE DATABASE adb_demo;
```
Create an external table. Use the CREATE TABLE syntax to create an OSS external table in the CSV, Parquet, or ORC format. For more information about the syntax, see OSS external table syntax.
This example uses a non-partitioned external table in the CSV format.
```sql
CREATE TABLE IF NOT EXISTS oss_import_test_external_table (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketname/adb/dt=2023-06-15/person.csv",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "delimiter":",",
  "skip_header_line_count":0,
  "charset":"utf-8"
}';
```
Query data from the oss_import_test_external_table external table.
Note
Querying CSV, Parquet, or ORC external tables that contain large amounts of data can cause high performance overhead. To improve query efficiency, import the data from the OSS external table into AnalyticDB for MySQL, as described in the following steps.
```sql
SELECT * FROM oss_import_test_external_table;
```
Create a table in AnalyticDB for MySQL to store the data imported from the OSS external table.
```sql
CREATE TABLE IF NOT EXISTS adb_oss_import_test (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) DISTRIBUTED BY HASH(id);
```
Run an INSERT statement to import data from the OSS external table into AnalyticDB for MySQL.
Important
The INSERT INTO and INSERT OVERWRITE SELECT statements import data synchronously by default. This requires the connection between the client and the AnalyticDB for MySQL server to stay open for the duration of the import. If you import a large amount of data, such as several hundred gigabytes, network issues may interrupt the connection and cause the import to fail. Therefore, if you have a large data volume, we recommend that you use SUBMIT JOB INSERT OVERWRITE SELECT to import the data asynchronously.
Method 1: Run the INSERT INTO statement to import data. If a primary key value is duplicated, the current write is ignored and the data is not updated, which is equivalent to INSERT IGNORE INTO. For more information, see INSERT INTO. The following statement provides an example:
```sql
INSERT INTO adb_oss_import_test
SELECT * FROM oss_import_test_external_table;
```
Method 2: Run the INSERT OVERWRITE statement to import data. This overwrites the existing data in the table. The following statement provides an example:
```sql
INSERT OVERWRITE adb_oss_import_test
SELECT * FROM oss_import_test_external_table;
```
Method 3: Run the INSERT OVERWRITE statement to import data in an asynchronous manner. You can use SUBMIT JOB to submit an asynchronous job that is scheduled in the background. You can add the /*+ direct_batch_load=true*/ hint to the write job to accelerate it. For more information, see Asynchronous writes. The following statement provides an example:
```sql
SUBMIT JOB INSERT OVERWRITE adb_oss_import_test
SELECT * FROM oss_import_test_external_table;
```
The following result is returned:
```
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+
```
For more information about how to submit jobs asynchronously, see Submit an asynchronous import job.
OSS external table syntax
Enterprise Edition, Basic Edition, and Data Lakehouse Edition
For more information about the syntax and description of OSS external tables in Enterprise Edition, Basic Edition, and Data Lakehouse Edition, see OSS external tables.
Data Lakehouse Edition
OSS non-partitioned external tables
```sql
CREATE TABLE [IF NOT EXISTS] table_name
(column_name column_type[, …])
ENGINE='OSS'
TABLE_PROPERTIES='{
  "endpoint":"endpoint",
  "url":"OSS_LOCATION",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "format":"csv|orc|parquet|text",
  "delimiter|field_delimiter":";",
  "skip_header_line_count":1,
  "charset":"utf-8"
}';
```
| File format | Parameter | Required | Description |
| --- | --- | --- | --- |
| External tables in the CSV, Parquet, and ORC formats | ENGINE='OSS' | Yes | The table engine. Set the value to OSS. |
| | endpoint | Yes | The endpoint of the OSS bucket. Currently, AnalyticDB for MySQL can access OSS only over a virtual private cloud (VPC). Note: You can log on to the OSS console, click the target bucket, and view the endpoint on the Overview page of the bucket. |
| | url | Yes | The path of the OSS file or directory. |
| | accessid | Yes | The AccessKey ID of an Alibaba Cloud account or a Resource Access Management (RAM) user with OSS management permissions. For information about how to obtain an AccessKey ID, see Accounts and permissions. |
| | accesskey | Yes | The AccessKey secret of an Alibaba Cloud account or a RAM user with OSS management permissions. To obtain an AccessKey secret, see Accounts and permissions. |
| | format | Conditionally required | The format of the data file, such as csv, orc, parquet, or text. |
| External tables in the CSV and Text formats | delimiter\|field_delimiter | Yes | The column delimiter of the data file. |
| External tables in the CSV format | null_value | No | Defines the `NULL` value of the data file. Important: This parameter is supported only for clusters of V3.1.4.2 or later. |
| | ossnull | No | The rule that is used to define `NULL` values in the OSS file. |
| | skip_header_line_count | No | The number of header rows to skip at the beginning of the data file during data import. If the first row of a CSV file is the table header, you can set this parameter to 1 to skip the header row during data import. Default value: 0, which indicates that no rows are skipped. |
| | oss_ignore_quote_and_escape | No | Specifies whether to ignore quotation marks and escape characters in field values. Default value: false, which indicates that quotation marks and escape characters are not ignored. Important: This parameter is supported only for clusters of V3.1.4.2 or later. |
| | charset | No | The character set of the OSS external table. Important: This parameter is supported only for clusters of V3.1.10.4 or later. |
The following rules apply to external tables in the Parquet or ORC format:
- The column names in the `CREATE TABLE` statement of the external table must match the column names in the Parquet or ORC file. The match is case-insensitive. The column order must also be the same.
- When you create an external table, you can select only specific columns from the Parquet or ORC file as the columns of the external table, as shown in the sketch below. Columns that you do not select are not imported.
- If the `CREATE TABLE` statement of the external table includes a column that does not exist in the Parquet or ORC file, queries against that column return NULL.
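A minimal sketch of column pruning, assuming a hypothetical ORC file that contains the columns id, name, age, and dt; only id and name are exposed, and the bucket path and AccessKey values are placeholders:
```sql
CREATE TABLE person_orc_subset (
  id INT,
  name VARCHAR(1023)
  -- age and dt exist in the ORC file but are intentionally omitted;
  -- unselected columns are not imported.
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/orc/",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "format":"orc"
}';
```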
AnalyticDB for MySQL supports reading from and writing to Hive TEXT files using an external table in CSV format on OSS. The following statement can be used to create an external table:
```sql
CREATE TABLE adb_csv_hive_format_oss (
  a TINYINT,
  b SMALLINT,
  c INT,
  d BIGINT,
  e BOOLEAN,
  f FLOAT,
  g DOUBLE,
  h VARCHAR,
  i VARCHAR, -- binary
  j TIMESTAMP,
  k DECIMAL(10, 4),
  l VARCHAR, -- char(10)
  m VARCHAR, -- varchar(100)
  n DATE
) ENGINE='OSS' TABLE_PROPERTIES='{
  "format": "csv",
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "url":"oss://testBucketname/adb_data/",
  "delimiter": "\\1",
  "null_value": "\\\\N",
  "oss_ignore_quote_and_escape": "true",
  "ossnull": 2
}';
```
Note the following points when you create an OSS external table in the CSV format to read Hive TEXT files:
- The default column delimiter for Hive TEXT files is `\1`. To read and write Hive TEXT files by using an OSS external table in the CSV format, you must escape the delimiter as `\\1` when you configure the `delimiter` parameter.
- The default `NULL` value for Hive TEXT files is `\N`. To read and write Hive TEXT files by using an OSS external table in the CSV format, you must set the `null_value` parameter and escape the value as `\\\\N`.
- Other basic Hive data types, such as `BOOLEAN`, map directly to AnalyticDB for MySQL data types. However, the `BINARY`, `CHAR(n)`, and `VARCHAR(n)` types all map to the AnalyticDB for MySQL `VARCHAR` type.
Appendix: Data type mappings
- The data types that you specify when you create a table must be consistent with the mappings in the following tables. For the `DECIMAL` type, the precision must also be consistent; see the sketch after this list.
- External tables in the Parquet format do not support the `STRUCT` type. If you use this type, the table creation fails.
- External tables in the ORC format do not support complex types such as `LIST`, `STRUCT`, and `UNION`. If you use these types, the table creation fails. You can create an ORC external table that contains a column of the `MAP` type, but queries on the table fail.
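As an illustration of the DECIMAL rule in the first point above, the following hedged sketch assumes a hypothetical Parquet file whose amount column is stored as DECIMAL(10, 4); the table, column, and path names are made up:
```sql
CREATE TABLE orders_parquet_external (
  order_id BIGINT,        -- maps from Parquet INT64
  amount DECIMAL(10, 4)   -- precision and scale must match the Parquet file exactly
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/orders/",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "format":"parquet"
}';
```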
Parquet files and AnalyticDB for MySQL data type mapping
| Parquet primitive data type | Parquet logical type | AnalyticDB for MySQL data type |
| --- | --- | --- |
| BOOLEAN | None | BOOLEAN |
| INT32 | INT_8 | TINYINT |
| INT32 | INT_16 | SMALLINT |
| INT32 | None | INT or INTEGER |
| INT64 | None | BIGINT |
| FLOAT | None | FLOAT |
| DOUBLE | None | DOUBLE |
| FIXED_LEN_BYTE_ARRAY | DECIMAL | DECIMAL |
| BINARY | UTF8 | VARCHAR |
| INT32 | DATE | DATE |
| INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME |
| INT96 | None | TIMESTAMP or DATETIME |
ORC files and AnalyticDB for MySQL data type mapping
| ORC file data type | AnalyticDB for MySQL data type |
| --- | --- |
| BOOLEAN | BOOLEAN |
| BYTE | TINYINT |
| SHORT | SMALLINT |
| INT | INT or INTEGER |
| LONG | BIGINT |
| DECIMAL | DECIMAL |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| TIMESTAMP | TIMESTAMP or DATETIME |
| DATE | DATE |
Paimon files and AnalyticDB for MySQL data type mapping
| Paimon file data type | AnalyticDB for MySQL data type |
| --- | --- |
CHAR | VARCHAR |
VARCHAR | VARCHAR |
BOOLEAN | BOOLEAN |
BINARY | VARBINARY |
VARBINARY | VARBINARY |
DECIMAL | DECIMAL |
TINYINT | TINYINT |
SMALLINT | SMALLINT |
INT | INTEGER |
BIGINT | BIGINT |
FLOAT | REAL |
DOUBLE | DOUBLE |
DATE | DATE |
TIME | Not supported |
TIMESTAMP | TIMESTAMP |
LocalZonedTIMESTAMP | TIMESTAMP (local time zone information is ignored) |
ARRAY | ARRAY |
MAP | MAP |
ROW | ROW |