
Use external tables to import data to Data Lakehouse Edition

Last Updated: Oct 18, 2023

This topic describes how to use external tables to query Object Storage Service (OSS) data and import data from OSS to AnalyticDB for MySQL Data Lakehouse Edition (V3.0).

Prerequisites

  • OSS is activated. A bucket and a directory are created in OSS. For more information, see Activate OSS, Create buckets, and Create directories.

  • Data is uploaded to the OSS directory. For more information, see Upload objects.

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created in the same region as the OSS bucket.

Sample data

In this example, the person object is uploaded to the testBucketName/adb/dt=2023-06-15 directory in OSS. Line feeds are used as row delimiters, and commas (,) are used as column delimiters. Sample data in the person object:

1,james,10,2023-06-15
2,bond,20,2023-06-15
3,jack,30,2023-06-15
4,lucy,40,2023-06-15       

Procedure

  1. Go to the SQL editor.
    1. Log on to the AnalyticDB for MySQL console.
    2. In the upper-left corner of the page, select the region where the cluster resides.
    3. In the left-side navigation pane, click Clusters.
    4. On the Data Lakehouse Edition (V3.0) tab, find the cluster and click the Cluster ID.
    5. In the left-side navigation pane, choose Job Development > SQL Development.
  2. Import data.

    You can import data by using regular import or elastic import. The default method is regular import. Regular import reads data on compute nodes and creates indexes on storage nodes, which consumes computing and storage resources. Elastic import uses Serverless Spark jobs to read data and create indexes, which consumes the resources of job resource groups. Only AnalyticDB for MySQL Data Lakehouse Edition (V3.0) clusters of V3.1.10.0 or later that have job resource groups support elastic import. For more information, see Data import methods.

    If you use elastic import, take note of the following items:

    • To perform elastic import, you must execute the following statement to enable the elastic import feature:

      SET adb_config RC_ELASTIC_JOB_SCHEDULER_ENABLE=true;
    • Elastic import allows you to import only CSV, Parquet, or ORC data from OSS.

    • Elastic import reads data and creates indexes based on job resource groups. This consumes resources of job resource groups, and you are charged for the resources. For more information, see View monitoring information of resource groups and Billable items of Data Lakehouse Edition (V3.0).

    • Make sure that job resource groups have sufficient resources to prevent long waiting time for jobs, long execution durations, and job failures.

    • An elastic import job takes at least 2 to 3 minutes to complete and is therefore not suitable for small amounts of data. If an import job must be completed within 3 minutes, we recommend that you use regular import.

    • An elastic import job requires a longer period of time to complete than a regular import job that uses the same resources. If you want an import job to be completed within a short period of time, we recommend that you increase the maximum amount of resources for an elastic import job to accelerate the job.

    Regular import

    1. Create an external database.

      CREATE EXTERNAL DATABASE adb_external_db;
    2. Use the CREATE EXTERNAL TABLE statement to create an OSS external table in the adb_external_db database. In this example, the external table is named adb_external_db.person.

      Note

      The fields of the AnalyticDB for MySQL OSS external table must have the same names, number, order, and data types as the fields in the OSS object.

      Create a non-partitioned OSS external table

      CREATE EXTERNAL TABLE adb_external_db.person
      (
       id VARCHAR(1023),
       name VARCHAR(1023),
       age INT,
       dt VARCHAR(1023)
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY  ','
      STORED AS TEXTFILE
      LOCATION  'oss://testBucketName/adb/dt=2023-06-15/';

      Create a partitioned OSS external table

      Create a partitioned OSS external table and add partitions to the table.

      1. Create a partitioned OSS external table.

        CREATE EXTERNAL TABLE adb_external_db.person
        (
         id INT,
         name VARCHAR(1023) ,
         age INT
        )
        PARTITIONED BY (dt STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY  ','
        STORED AS TEXTFILE
        LOCATION  'oss://testBucketName/adb/';
      2. Add partitions to the OSS external table. You can use the ALTER TABLE ADD PARTITION statement to manually add partitions or use the MSCK REPAIR TABLE statement to automatically identify and add partitions.

        ALTER TABLE adb_external_db.person ADD PARTITION (dt='2023-06-15') LOCATION 'oss://testBucketName/adb/dt=2023-06-15/';
        Note
        • For information about how to add partitions to or remove partitions from an OSS external table, see ALTER TABLE (external tables).

        • For information about how to synchronize partitions of an OSS external table, see MSCK REPAIR TABLE.
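        As an alternative to the preceding ALTER TABLE statement, the MSCK REPAIR TABLE statement mentioned in this step can scan the LOCATION directory and add all partitions that follow the dt=2023-06-15 naming convention in one call. A minimal example:

        MSCK REPAIR TABLE adb_external_db.person;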

      For information about the syntax, see CREATE EXTERNAL TABLE.

    3. Query data.

      After you create an external table, you can execute the SELECT statement in AnalyticDB for MySQL to query data of the external table.

      SELECT * FROM adb_external_db.person;

      Sample result:

      +------+-------+------+-----------+
      | id   | name  | age  | dt        |
      +------+-------+------+-----------+
      |    1 | james |   10 |2023-06-15 |
      |    2 | bond  |   20 |2023-06-15 |
      |    3 | jack  |   30 |2023-06-15 |
      |    4 | lucy  |   40 |2023-06-15 |
      +------+-------+------+-----------+
      4 rows in set (0.35 sec)
    4. Create a database in the AnalyticDB for MySQL cluster. If you have already created a database, skip this step.

      CREATE DATABASE adb_demo; 
    5. Create a table in the AnalyticDB for MySQL cluster to store the data that is imported from OSS.

      Note

      The created internal table must have fields with the same names, number, order, and data types as the external table that is created in Step 2.

      CREATE TABLE IF NOT EXISTS adb_demo.adb_import_test(
          id INT,
          name VARCHAR(1023),
          age INT,
          dt VARCHAR(1023)
      )
      DISTRIBUTE BY HASH(id);
    6. Import data to the table.

      • Method 1: Execute the INSERT INTO statement to import data. If the primary key has duplicate values, data is not repeatedly inserted, and the INSERT INTO statement is equivalent to INSERT IGNORE INTO. For more information, see INSERT INTO.
        INSERT INTO adb_demo.adb_import_test SELECT * FROM adb_external_db.person;
      • Method 2: Execute the INSERT OVERWRITE INTO statement to synchronously import data. If the primary key has duplicate values, the original value is overwritten by the new value.
        INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_external_db.person;
      • Method 3: Execute the INSERT OVERWRITE INTO statement to asynchronously import data. For more information, see Asynchronous writing.
        SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_external_db.person;

    Elastic import

    1. Create a database. If you have already created a database, skip this step.

      CREATE DATABASE adb_demo; 
    2. Create an OSS external table.

      Note
      • The fields of the AnalyticDB for MySQL OSS external table must have the same names, number, order, and data types as the fields in the OSS object.

      • Elastic import allows you to create external tables by using only the CREATE TABLE statement.

      CREATE TABLE adb_demo.oss_import_test_external_table
      (
          id INT,
          name VARCHAR(1023),
          age INT,
          dt VARCHAR(1023)
      )
      ENGINE='OSS'
      TABLE_PROPERTIES='{
          "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
          "url":"oss://<bucket-name>/adb/oss_import_test_data.csv",
          "accessid":"LTAI5t8sqJn5GhpBVtN8****",
          "accesskey":"HlClegbiV5mJjBYBJHEZQOnRF7****",
          "delimiter":","
      }';
      Important

      When you create an external table, you can configure the following TABLE_PROPERTIES parameters for the CSV, Parquet, or ORC external table:

      • CSV: endpoint, url, accessid, accesskey, format, delimiter, null_value, and partition_column.

      • Parquet: endpoint, url, accessid, accesskey, format, and partition_column.

      • ORC: endpoint, url, accessid, accesskey, format, and partition_column.

      For more information about external table parameters, see the "Create an OSS external table for a non-partitioned object" and "Create an OSS external table for a partitioned object" sections of the Use external tables to import data to Data Warehouse Edition topic.

    3. Query data.

      After you create an external table, you can execute the SELECT statement in AnalyticDB for MySQL to query data of the external table.

      SELECT * FROM oss_import_test_external_table;

      Sample result:

      +------+-------+------+-----------+
      | id   | name  | age  | dt        |
      +------+-------+------+-----------+
      |    1 | james |   10 |2023-06-15 |
      |    2 | bond  |   20 |2023-06-15 |
      |    3 | jack  |   30 |2023-06-15 |
      |    4 | lucy  |   40 |2023-06-15 |
      +------+-------+------+-----------+
      4 rows in set (0.35 sec)
    4. Create a table in the AnalyticDB for MySQL cluster to store the data that is imported from OSS.

      Note

      The created internal table must have fields with the same names, number, order, and data types as the external table that is created in Step 2.

      CREATE TABLE adb_demo.adb_import_test
      (
          id INT,
          name VARCHAR(1023),
          age INT,
          dt VARCHAR(1023),
          PRIMARY KEY (id)
      )
      DISTRIBUTE BY HASH(id);
    5. Import data.

      Important

      Elastic import allows you to import data by using only the INSERT OVERWRITE INTO statement.

      • Method 1: Execute the INSERT OVERWRITE INTO statement to elastically import data. If the primary key has duplicate values, the original value is overwritten by the new value.

        /*elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
        INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.oss_import_test_external_table;
      • Method 2: Asynchronously execute the INSERT OVERWRITE INTO statement to elastically import data. In most cases, the SUBMIT JOB statement is used to submit an asynchronous job.

        /*elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
        SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.oss_import_test_external_table;
        Important

        When you asynchronously submit an elastic import job, you cannot configure priority queues.

        Sample result:

        +---------------------------------------+
        | job_id                                |
        +---------------------------------------+
        | 2023081517195102101701907203151****** |
        +---------------------------------------+

      After you use the SUBMIT JOB statement to submit an asynchronous job, only a job ID is returned, which indicates that the asynchronous job is successfully submitted. You can use the returned job ID to terminate the asynchronous job or query the state of the asynchronous job. For more information, see Asynchronously submit an import job.
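      For example, you can look up the submitted job by its ID in the INFORMATION_SCHEMA.kepler_meta_async_jobs table that this topic uses to inspect asynchronous jobs. The job ID below is the sample ID from the preceding result, and the exact columns that are returned may vary by cluster version:

      SELECT * FROM INFORMATION_SCHEMA.kepler_meta_async_jobs WHERE job_name = '2023081517195102101701907203151******';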

      Hint parameters:

      • elastic_load: specifies whether to use elastic import. Valid values: true and false. Default value: false.

      • elastic_load_configs: the configuration parameters of the elastic import feature. You must enclose the parameters within brackets ([ ]) and separate multiple parameters with vertical bars (|). The following parameters are supported:

        • adb.load.resource.group.name (required): the name of the job resource group that runs the elastic import job.

        • adb.load.job.max.acu (optional): the maximum amount of resources for an elastic import job. Unit: AnalyticDB compute units (ACUs). Minimum value: 5 ACUs. Default value: the number of shards plus 1. Execute the following statement to query the number of shards in the cluster:

          SELECT count(1) FROM information_schema.kepler_meta_shards;

        • spark.driver.resourceSpec (optional): the resource type of the Spark driver. Default value: small. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.

        • spark.executor.resourceSpec (optional): the resource type of the Spark executor. Default value: large. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.

        • spark.adb.executorDiskSize (optional): the disk capacity of the Spark executor. Valid values: (0,100]. Unit: GiB. Default value: 10 GiB. For more information, see the "Specify driver and executor resources" section of the Conf configuration parameters topic.
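      To illustrate the vertical-bar syntax described above, the following statement combines several of these parameters in one hint. The resource group name test_group and the value 16 are placeholders for this sketch, not required values:

        /*elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=test_group|adb.load.job.max.acu=16|spark.executor.resourceSpec=large]*/
        SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.oss_import_test_external_table;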

    6. (Optional) Check whether the submitted job is an elastic import job.

      SELECT job_name, (job_type = 3) AS is_elastic_load FROM INFORMATION_SCHEMA.kepler_meta_async_jobs WHERE job_name = '2023081818010602101701907303151******';

      Sample result:

      +---------------------------------------+------------------+
      | job_name                              | is_elastic_load  |
      +---------------------------------------+------------------+
      | 2023081517195102101701907203151****** |       1          |
      +---------------------------------------+------------------+

      If the submitted job is an elastic import job, 1 is returned in the is_elastic_load column. If a regular import job is submitted, 0 is returned.