AnalyticDB for MySQL lets you import external data using external tables. This topic describes how to use external tables to import data from OSS into an AnalyticDB for MySQL cluster.
Prerequisites
The AnalyticDB for MySQL cluster and the OSS bucket are in the same region. For more information, see Activate OSS.
Elastic Network Interface (ENI) access is enabled for your AnalyticDB for MySQL Data Lakehouse Edition cluster.
Important
To enable ENI, log on to the AnalyticDB for MySQL console. On the Cluster Information page, in the Network Information section, turn on Elastic Network Interface (ENI).
Enabling or disabling ENI interrupts database connections for about 2 minutes. During this period, you cannot read or write data. Carefully evaluate the impact before you enable or disable ENI.
Data preparation
In this example, the data file person.csv is uploaded to the testBucketName/adb/dt=2023-06-15 directory in OSS. Line feeds are used as row delimiters and commas (,) are used as column delimiters. The following sample data is in the person.csv file:
1,james,10,2023-06-15
2,bond,20,2023-06-15
3,jack,30,2023-06-15
4,lucy,40,2023-06-15

Procedure
Enterprise Edition, Basic Edition, and Data Lakehouse Edition
Go to the SQL editor.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
Import data.
You can import data by using the regular import method (default) or the elastic import method. Regular import reads source data on compute nodes and builds indexes on storage nodes, which consumes computing and storage resources. Elastic import reads source data and builds indexes in a serverless Spark job, which consumes resources from a Job-type resource group. Elastic import is supported only for Enterprise Edition, Basic Edition, and Data Lakehouse Edition clusters of V3.1.10.0 or later that have a Job-type resource group. For more information, see Data import methods.
Regular import
Create an external database.
```sql
CREATE EXTERNAL DATABASE adb_external_db;
```
Create an external table. Use the CREATE EXTERNAL TABLE statement to create an OSS external table in the adb_external_db database. In this example, the external table is named adb_external_db.person.
Note
An AnalyticDB for MySQL external table must use the same field names, number of fields, field order, and field types as the source data in the OSS file.
For more information about the syntax for creating OSS external tables, see CREATE EXTERNAL TABLE.
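As a minimal sketch, the statement for this example might look as follows, assuming the person.csv layout from the Data preparation section; the endpoint, bucket name, and AccessKey values are placeholders, and the statement is modeled on the elastic import example later in this topic. See CREATE EXTERNAL TABLE for the authoritative syntax.
```sql
-- Hedged sketch: OSS CSV external table for person.csv.
-- Endpoint, bucket, and AccessKey values are placeholders.
CREATE EXTERNAL TABLE adb_external_db.person (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/dt=2023-06-15/person.csv",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "delimiter":","
}';
```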
Query data.
After the table is created, you can query the data in OSS by running a SELECT statement in AnalyticDB for MySQL.
```sql
SELECT * FROM adb_external_db.person;
```
The following result is returned:
```
+------+-------+------+------------+
| id   | name  | age  | dt         |
+------+-------+------+------------+
|    1 | james |   10 | 2023-06-15 |
|    2 | bond  |   20 | 2023-06-15 |
|    3 | jack  |   30 | 2023-06-15 |
|    4 | lucy  |   40 | 2023-06-15 |
+------+-------+------+------------+
```
Create a database in AnalyticDB for MySQL. If a database already exists, skip this step. The following statement provides an example:
```sql
CREATE DATABASE adb_demo;
```
Create a table in AnalyticDB for MySQL to store the data imported from OSS. The following statement provides an example:
Note
The internal table must use the same field names, number of fields, field order, and field types as the external table.
```sql
CREATE TABLE IF NOT EXISTS adb_demo.adb_import_test (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) DISTRIBUTED BY HASH(id);
```
Import data into the table.
Method 1: Use the INSERT INTO statement to import data. If a primary key value is duplicated, the new data is ignored, which is equivalent to INSERT IGNORE INTO. For more information, see INSERT INTO.
```sql
INSERT INTO adb_demo.adb_import_test
SELECT * FROM adb_external_db.person;
```
Method 2: Use the INSERT OVERWRITE INTO statement to synchronously import data. This overwrites the existing data in the table.
```sql
INSERT OVERWRITE INTO adb_demo.adb_import_test
SELECT * FROM adb_external_db.person;
```
Method 3: Use the INSERT OVERWRITE INTO statement to asynchronously import data. For more information, see Asynchronous write.
```sql
SUBMIT JOB INSERT OVERWRITE adb_demo.adb_import_test
SELECT * FROM adb_external_db.person;
```
Elastic import
Create a database. If a database already exists, skip this step. The following statement provides an example:
```sql
CREATE DATABASE adb_demo;
```
Create an external table.
Note
The field names, number of fields, field order, and field types of the AnalyticDB for MySQL external table must match the source data in the OSS file.
Elastic import supports creating external tables only by using the CREATE TABLE statement.
```sql
CREATE TABLE oss_import_test_external_table (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/dt=2023-06-15/person.csv",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "delimiter":","
}';
```
Important
When you create an external table, the supported TABLE_PROPERTIES parameters vary based on the file format, such as CSV, Parquet, or ORC:
- CSV format: only the `endpoint`, `url`, `accessid`, `accesskey`, `format`, `delimiter`, `null_value`, and `partition_column` parameters are supported.
- Parquet format: only the `endpoint`, `url`, `accessid`, `accesskey`, `format`, and `partition_column` parameters are supported.
- ORC format: only the `endpoint`, `url`, `accessid`, `accesskey`, `format`, and `partition_column` parameters are supported.
For more information about the parameters that you can set for external tables and their descriptions, see OSS non-partitioned external tables and OSS partitioned external tables.
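For example, a Parquet external table for elastic import uses only the parameters from the preceding list. The following minimal sketch assumes a hypothetical Parquet directory; the bucket path and AccessKey values are placeholders:
```sql
-- Hedged sketch: OSS Parquet external table for elastic import.
-- The path oss://testBucketName/adb/parquet/ is hypothetical.
CREATE TABLE oss_parquet_external_table (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/parquet/",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "format":"parquet"
}';
```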
Query data.
After the table is created, you can query data from OSS by running a SELECT statement in AnalyticDB for MySQL.
```sql
SELECT * FROM oss_import_test_external_table;
```
The following result is returned:
```
+------+-------+------+------------+
| id   | name  | age  | dt         |
+------+-------+------+------------+
|    1 | james |   10 | 2023-06-15 |
|    2 | bond  |   20 | 2023-06-15 |
|    3 | jack  |   30 | 2023-06-15 |
|    4 | lucy  |   40 | 2023-06-15 |
+------+-------+------+------------+
4 rows in set (0.35 sec)
```
Create a table in AnalyticDB for MySQL to store the data imported from OSS. The following statement provides an example:
Note
The internal table must use the same field names, number of fields, field order, and field types as the external table.
```sql
CREATE TABLE adb_import_test (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) DISTRIBUTED BY HASH(id);
```
Import data.
Important
Elastic import supports importing data only by using the INSERT OVERWRITE INTO statement.
Method 1: Run the INSERT OVERWRITE INTO statement to elastically import data. This overwrites the existing data in the table. The following statement provides an example:
```sql
/*+elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
INSERT OVERWRITE INTO adb_demo.adb_import_test
SELECT * FROM adb_demo.oss_import_test_external_table;
```
Method 2: Run the INSERT OVERWRITE INTO statement to elastically import data in an asynchronous manner. You can use the SUBMIT JOB statement to submit an asynchronous job that is scheduled in the background.
```sql
/*+elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test
SELECT * FROM adb_demo.oss_import_test_external_table;
```
Important
You cannot set a priority queue when you submit an elastic import job in an asynchronous manner.
The following result is returned:
```
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 202308151719510210170190**********    |
+---------------------------------------+
```
After you submit an asynchronous job by using SUBMIT JOB, the returned result indicates only that the job was submitted. You can use the job_id value to terminate the asynchronous job or query its status, as shown in the sketch below. For more information, see Submit an asynchronous import job.
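A hedged sketch of both operations, assuming the job_id value returned above. The kepler_meta_async_jobs view is also used in the optional check at the end of this procedure, and the CANCEL JOB statement is described in Submit an asynchronous import job; treat the exact column set of the view as an assumption.
```sql
-- Query the status of the asynchronous job by its job_id:
SELECT * FROM INFORMATION_SCHEMA.kepler_meta_async_jobs
WHERE job_name = '202308151719510210170190**********';

-- Terminate the asynchronous job if it is no longer needed:
CANCEL JOB '202308151719510210170190**********';
```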
Hint parameters:
- elastic_load: specifies whether to use elastic import. Valid values: true and false. Default value: false.
- elastic_load_configs: the configuration parameters of the elastic import feature. You must enclose the parameters within brackets ([ ]) and separate multiple parameters with vertical bars (|). The following table describes the parameters; a combined example follows the table.
| Parameter | Required | Description |
| --- | --- | --- |
| adb.load.resource.group.name | Yes | The name of the Job-type resource group that runs the elastic import job. |
| adb.load.job.max.acu | No | The maximum amount of resources for an elastic import job. Unit: AnalyticDB compute units (ACUs). Minimum value: 5 ACUs. Default value: number of shards plus 1. To query the number of shards in the cluster, execute `SELECT count(1) FROM information_schema.kepler_meta_shards;`. |
| spark.driver.resourceSpec | No | The resource type of the Spark driver. Default value: small. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic. |
| spark.executor.resourceSpec | No | The resource type of the Spark executor. Default value: large. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic. |
| spark.adb.executorDiskSize | No | The disk capacity of the Spark executor. Valid values: (0,100]. Unit: GiB. Default value: 10 GiB. For more information, see the "Specify driver and executor resources" section of the Conf configuration parameters topic. |
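For example, the following hedged sketch combines several of the preceding parameters in one hint. The resource group name resource_group and the 16-ACU cap are hypothetical values:
```sql
-- Hedged sketch: multiple elastic_load_configs parameters separated by vertical bars.
/*+elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|adb.load.job.max.acu=16|spark.executor.resourceSpec=large]*/
INSERT OVERWRITE INTO adb_demo.adb_import_test
SELECT * FROM adb_demo.oss_import_test_external_table;
```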
(Optional) Check whether the submitted import task is an elastic import task.
```sql
SELECT job_name, (job_type = 3) AS is_elastic_load
FROM INFORMATION_SCHEMA.kepler_meta_async_jobs
WHERE job_name = "2023081818010602101701907303151******";
```
The following result is returned:
```
+---------------------------------------+-----------------+
| job_name                              | is_elastic_load |
+---------------------------------------+-----------------+
| 20230815171951021017019072*********** |               1 |
+---------------------------------------+-----------------+
```
If the value of is_elastic_load is 1, the submitted import job is an elastic import job. If the value is 0, the job is a regular import job.
Data Lakehouse Edition
Connect to the cluster and create a database.
```sql
CREATE DATABASE adb_demo;
```
Create an external table. Use the CREATE TABLE syntax to create an OSS external table in the CSV, Parquet, or ORC format. For more information about the syntax, see OSS external table syntax.
This example uses a non-partitioned external table in the CSV format.
```sql
CREATE TABLE IF NOT EXISTS oss_import_test_external_table (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketname/adb/dt=2023-06-15/person.csv",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "delimiter":",",
  "skip_header_line_count":0,
  "charset":"utf-8"
}';
```
Query data from the oss_import_test_external_table external table.
Note
Querying CSV, Parquet, or ORC external tables that contain large amounts of data can cause high performance overhead. To improve query efficiency, import the data from the OSS external table into AnalyticDB for MySQL, as described in the following steps.
```sql
SELECT * FROM oss_import_test_external_table;
```
Create a table in AnalyticDB for MySQL to store the data imported from the OSS external table.
```sql
CREATE TABLE IF NOT EXISTS adb_oss_import_test (
  id INT,
  name VARCHAR(1023),
  age INT,
  dt VARCHAR(1023)
) DISTRIBUTED BY HASH(id);
```
Run an INSERT statement to import data from the OSS external table into AnalyticDB for MySQL.
Important
The INSERT INTO and INSERT OVERWRITE SELECT statements import data synchronously by default. This requires the connection between the client and the AnalyticDB for MySQL server to stay open for the duration of the import. If you import a large amount of data, such as several hundred gigabytes, network issues may interrupt the connection and cause the import to fail. Therefore, if you have a large data volume, we recommend that you use SUBMIT JOB INSERT OVERWRITE SELECT to import the data asynchronously.
Method 1: Run the INSERT INTO statement to import data. If a primary key value is duplicated, the current write is ignored and the data is not updated, which is equivalent to INSERT IGNORE INTO. For more information, see INSERT INTO. The following statement provides an example:
```sql
INSERT INTO adb_oss_import_test
SELECT * FROM oss_import_test_external_table;
```
Method 2: Run the INSERT OVERWRITE statement to import data. This overwrites the existing data in the table. The following statement provides an example:
```sql
INSERT OVERWRITE adb_oss_import_test
SELECT * FROM oss_import_test_external_table;
```
Method 3: Run the INSERT OVERWRITE statement to import data in an asynchronous manner. You can use SUBMIT JOB to submit an asynchronous job that is scheduled in the background. You can add the /*+ direct_batch_load=true*/ hint to the write job to accelerate it. For more information, see Asynchronous writes. The following statement provides an example:
```sql
SUBMIT JOB INSERT OVERWRITE adb_oss_import_test
SELECT * FROM oss_import_test_external_table;
```
The following result is returned:
```
+---------------------------------------+
| job_id                                |
+---------------------------------------+
| 2020112122202917203100908203303****** |
+---------------------------------------+
```
For more information about how to submit jobs asynchronously, see Submit an asynchronous import job.
OSS external table syntax
Enterprise Edition, Basic Edition, and Data Lakehouse Edition
For more information about the syntax and description of OSS external tables in Enterprise Edition, Basic Edition, and Data Lakehouse Edition, see OSS external tables.
Data Lakehouse Edition
OSS non-partitioned external tables
```sql
CREATE TABLE [IF NOT EXISTS] table_name
(column_name column_type[, …])
ENGINE='OSS'
TABLE_PROPERTIES='{
  "endpoint":"endpoint",
  "url":"OSS_LOCATION",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "format":"csv|orc|parquet|text",
  "delimiter|field_delimiter":";",
  "skip_header_line_count":1,
  "charset":"utf-8"
}';
```
| File format | Parameter | Required | Description |
| --- | --- | --- | --- |
| External tables in the CSV, Parquet, and ORC formats | ENGINE='OSS' | Yes | The table engine. Set the value to OSS. |
| | endpoint | Yes | The endpoint of the OSS bucket. Currently, AnalyticDB for MySQL can access OSS only over a virtual private cloud (VPC). Note: You can log on to the OSS console, click the target bucket, and view the endpoint on the Overview page of the bucket. |
| | url | Yes | The path of the OSS file or directory. |
| | accessid | Yes | The AccessKey ID of an Alibaba Cloud account or a Resource Access Management (RAM) user with OSS management permissions. For information about how to obtain an AccessKey ID, see Accounts and permissions. |
| | accesskey | Yes | The AccessKey secret of an Alibaba Cloud account or a RAM user with OSS management permissions. To obtain an AccessKey secret, see Accounts and permissions. |
| | format | Conditionally required | The format of the data file, such as csv, orc, parquet, or text. |
| External tables in the CSV and Text formats | delimiter\|field_delimiter | Yes | The column delimiter of the data file. |
| External tables in the CSV format | null_value | No | Defines the `NULL` value of the data file. Important: This parameter is supported only for clusters of V3.1.4.2 or later. |
| | ossnull | No | The rule that is used to define `NULL` values in the OSS file. |
| | skip_header_line_count | No | The number of header rows to skip at the beginning of the data file during data import. If the first row of a CSV file is the table header, you can set this parameter to 1 to skip the header row during data import. Default value: 0, which indicates that no rows are skipped. |
| | oss_ignore_quote_and_escape | No | Specifies whether to ignore quotation marks and escape characters in field values. Default value: false, which indicates that quotation marks and escape characters are not ignored. Important: This parameter is supported only for clusters of V3.1.4.2 or later. |
| | charset | No | The character set of the OSS external table. Important: This parameter is supported only for clusters of V3.1.10.4 or later. |
The following rules apply to external tables in the Parquet or ORC format:
- The column names in the `CREATE TABLE` statement of the external table must match the column names in the Parquet or ORC file. The match is case-insensitive. The column order must also be the same.
- When you create an external table, you can select only specific columns from the Parquet or ORC file as the columns of the external table, as shown in the sketch below. Columns that you do not select are not imported.
- If the `CREATE TABLE` statement of the external table includes a column that does not exist in the Parquet or ORC file, queries against that column return NULL.
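A minimal sketch of column pruning, assuming a hypothetical ORC file that contains the columns id, name, age, and dt; only id and name are exposed, and the bucket path and AccessKey values are placeholders:
```sql
CREATE TABLE person_orc_subset (
  id INT,
  name VARCHAR(1023)
  -- age and dt exist in the ORC file but are intentionally omitted;
  -- unselected columns are not imported.
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/orc/",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "format":"orc"
}';
```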
AnalyticDB for MySQL supports reading from and writing to Hive TEXT files using an external table in CSV format on OSS. The following statement can be used to create an external table:
```sql
CREATE TABLE adb_csv_hive_format_oss (
  a TINYINT,
  b SMALLINT,
  c INT,
  d BIGINT,
  e BOOLEAN,
  f FLOAT,
  g DOUBLE,
  h VARCHAR,
  i VARCHAR, -- binary
  j TIMESTAMP,
  k DECIMAL(10, 4),
  l VARCHAR, -- char(10)
  m VARCHAR, -- varchar(100)
  n DATE
) ENGINE='OSS' TABLE_PROPERTIES='{
  "format": "csv",
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "url":"oss://testBucketname/adb_data/",
  "delimiter": "\\1",
  "null_value": "\\\\N",
  "oss_ignore_quote_and_escape": "true",
  "ossnull": 2
}';
```
Note the following points when you create an OSS external table in the CSV format to read Hive TEXT files:
- The default column delimiter for Hive TEXT files is `\1`. To read and write Hive TEXT files by using an OSS external table in the CSV format, you must escape the delimiter as `\\1` when you configure the `delimiter` parameter.
- The default `NULL` value for Hive TEXT files is `\N`. To read and write Hive TEXT files by using an OSS external table in the CSV format, you must set the `null_value` parameter and escape the value as `\\\\N`.
- Other basic Hive data types, such as `BOOLEAN`, map directly to AnalyticDB for MySQL data types. However, the `BINARY`, `CHAR(n)`, and `VARCHAR(n)` types all map to the AnalyticDB for MySQL `VARCHAR` type.
Appendix: Data type mappings
- The data types that you specify when you create a table must be consistent with the mappings in the following tables. For the `DECIMAL` type, the precision must also be consistent; see the sketch after this list.
- External tables in the Parquet format do not support the `STRUCT` type. If you use this type, the table creation fails.
- External tables in the ORC format do not support complex types such as `LIST`, `STRUCT`, and `UNION`. If you use these types, the table creation fails. You can create an ORC external table that contains a column of the `MAP` type, but queries on the table fail.
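As an illustration of the DECIMAL rule in the first point above, the following hedged sketch assumes a hypothetical Parquet file whose amount column is stored as DECIMAL(10, 4); the table, column, and path names are made up:
```sql
CREATE TABLE orders_parquet_external (
  order_id BIGINT,        -- maps from Parquet INT64
  amount DECIMAL(10, 4)   -- precision and scale must match the Parquet file exactly
) ENGINE='OSS' TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://testBucketName/adb/orders/",
  "accessid":"accesskey_id",
  "accesskey":"accesskey_secret",
  "format":"parquet"
}';
```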
Parquet files and AnalyticDB for MySQL data type mapping
| Parquet primitive data type | Parquet logical type | AnalyticDB for MySQL data type |
| --- | --- | --- |
| BOOLEAN | None | BOOLEAN |
| INT32 | INT_8 | TINYINT |
| INT32 | INT_16 | SMALLINT |
| INT32 | None | INT or INTEGER |
| INT64 | None | BIGINT |
| FLOAT | None | FLOAT |
| DOUBLE | None | DOUBLE |
| FIXED_LEN_BYTE_ARRAY | DECIMAL | DECIMAL |
| BINARY | UTF8 | VARCHAR |
| INT32 | DATE | DATE |
| INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME |
| INT96 | None | TIMESTAMP or DATETIME |
ORC files and AnalyticDB for MySQL data type mapping
| ORC file data type | AnalyticDB for MySQL data type |
| --- | --- |
| BOOLEAN | BOOLEAN |
| BYTE | TINYINT |
| SHORT | SMALLINT |
| INT | INT or INTEGER |
| LONG | BIGINT |
| DECIMAL | DECIMAL |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| TIMESTAMP | TIMESTAMP or DATETIME |
| DATE | DATE |
Paimon files and AnalyticDB for MySQL data type mapping
| Paimon file data type | AnalyticDB for MySQL data type |
| --- | --- |
CHAR | VARCHAR |
VARCHAR | VARCHAR |
BOOLEAN | BOOLEAN |
BINARY | VARBINARY |
VARBINARY | VARBINARY |
DECIMAL | DECIMAL |
TINYINT | TINYINT |
SMALLINT | SMALLINT |
INT | INTEGER |
BIGINT | BIGINT |
FLOAT | REAL |
DOUBLE | DOUBLE |
DATE | DATE |
TIME | Not supported |
TIMESTAMP | TIMESTAMP |
LocalZonedTIMESTAMP | TIMESTAMP (local time zone information is ignored) |
ARRAY | ARRAY |
MAP | MAP |
ROW | ROW |