AnalyticDB: Import MaxCompute data by using external tables

Last Updated: Oct 17, 2025

AnalyticDB for MySQL lets you use external tables to read and import data from MaxCompute. This method maximizes the utilization of cluster resources and improves data import performance. This topic describes how to import MaxCompute data into AnalyticDB for MySQL using external tables.

Features

The following list describes the access methods, the required kernel versions, and the data access efficiency for each AnalyticDB for MySQL product series:

• Enterprise Edition, Basic Edition, and Data Lakehouse Edition

  • Tunnel Record API (kernel version: no limit): suitable for small-scale data access. Data access and import are slow.

  • Tunnel Arrow API (kernel version: 3.2.2.3 or later): reads data in columns to reduce data access and import time, providing faster data transmission.

• Data Warehouse Edition

  • Tunnel Record API (kernel version: no limit): uses a public Data Transmission Service resource group that is shared by all projects in the region, resulting in slow data access and import.

Prerequisites

  • The MaxCompute project and the AnalyticDB for MySQL cluster reside in the same region. For more information, see Create a cluster.

  • The VPC CIDR block of the AnalyticDB for MySQL cluster is added to the whitelist of the MaxCompute project.

    Note

    Log on to the AnalyticDB for MySQL console and find the VPC ID on the Cluster Information page. Log on to the VPC console and use the VPC ID to find the CIDR block on the VPCs page. For more information about configuring the MaxCompute whitelist, see Manage IP whitelists.

  • For AnalyticDB for MySQL Enterprise Edition, Basic Edition, and Data Lakehouse Edition clusters:

    • Elastic Network Interface (ENI) access is enabled.

      Important
      • Log on to the AnalyticDB for MySQL console. In the navigation pane on the left, choose Cluster Management > Cluster Information. In the Network Information section, turn on the switch for ENI access.

      • Enabling or disabling ENI access interrupts database connections for approximately 2 minutes. During this period, you cannot read or write data. We recommend that you carefully evaluate the potential impact before you perform this operation.

    • To access and import MaxCompute data using the Tunnel Arrow API, your cluster must be V3.2.2.3 or later.

      Note

      To view and update the minor version of an AnalyticDB for MySQL cluster, log on to the AnalyticDB for MySQL console and go to the Configuration Information section of the Cluster Information page.

Data preparation

This section uses the MaxCompute project odps_project and the sample table odps_nopart_import_test as an example:

CREATE TABLE IF NOT EXISTS odps_nopart_import_test (
    id int,
    name string,
    age int)
PARTITIONED BY (dt string);

Run the following statement to add a partition to the odps_nopart_import_test table:

ALTER TABLE odps_nopart_import_test 
ADD 
PARTITION (dt='202207');

Run the following sample statement to add data to the partition:

INSERT INTO odps_project.odps_nopart_import_test 
PARTITION (dt='202207') 
VALUES (1,'james',10),(2,'bond',20),(3,'jack',30),(4,'lucy',40);
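
To verify the prepared data before you configure the external table, you can query the partition directly in MaxCompute:

SELECT * FROM odps_project.odps_nopart_import_test WHERE dt = '202207';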

Procedure

Enterprise Edition, Basic Edition, and Data Lakehouse Edition

By default, an AnalyticDB for MySQL cluster uses the Tunnel Record API to access and import data from MaxCompute. To use the Tunnel Arrow API instead, you must first enable the Arrow API feature, as described in Step 1 of the Tunnel Arrow API section below. After the feature is enabled, the cluster uses the Tunnel Arrow API to import data.

Tunnel Record API

Two data import methods are available: regular import (default) and elastic import. The regular import method reads source data on compute nodes and builds indexes on storage nodes, which consumes both computing and storage resources. The elastic import method reads source data and builds indexes in a Serverless Spark job, which consumes resources of a job resource group. The elastic import method is supported only on clusters of version 3.1.10.0 or later that have a job resource group. Compared with the regular import method, the elastic import method consumes fewer resources, reduces the impact on real-time data reads and writes, and improves resource isolation and data import efficiency. For more information, see Data import methods.

Regular import

  1. Navigate to the SQL editor.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.

    2. In the navigation pane on the left, choose Job Development > SQL Development.

  2. Create an external database. The following sample statement is used:

    CREATE EXTERNAL DATABASE adb_external_db;
  3. Create an external table. This topic uses test_adb as an example.

    CREATE EXTERNAL TABLE IF NOT EXISTS adb_external_db.test_adb (
        id int,
        name varchar(1023),
        age int,
        dt string
        ) ENGINE='ODPS'
    TABLE_PROPERTIES='{
    "accessid":"yourAccessKeyID",
    "endpoint":"http://service.cn-hangzhou.maxcompute.aliyun.com/api",
    "accesskey":"yourAccessKeySecret",
    "partition_column":"dt",
    "project_name":"odps_project",
    "table_name":"odps_nopart_import_test"
    }';
    Note
    • The external table in AnalyticDB for MySQL and the table in MaxCompute must have the same field names, number of fields, and field order. The data types of the fields must be compatible.

    • For more information about the parameters for external tables, see CREATE EXTERNAL TABLE.

  4. Query the data.

    SELECT * FROM adb_external_db.test_adb;

    The following result is returned:

    +------+-------+------+---------+
    | id   | name  | age  |   dt    |
    +------+-------+------+---------+
    |    1 | james |   10 |  202207 |
    |    2 | bond  |   20 |  202207 |
    |    3 | jack  |   30 |  202207 |
    |    4 | lucy  |   40 |  202207 |
    +------+-------+------+---------+
  5. Perform the following steps to import data from MaxCompute to AnalyticDB for MySQL.

    1. Create a database in AnalyticDB for MySQL. The following statement is an example:

      CREATE DATABASE adb_demo; 
    2. Create a table in AnalyticDB for MySQL to store the data imported from MaxCompute. The following statement is an example:

      Note

      The new table and the external table that you created in Step 3 must have the same field order and number of fields. The data types of the fields must be compatible.

      CREATE TABLE IF NOT EXISTS adb_demo.adb_import_test(
          id int,
          name string,
          age int,
          dt string,
          PRIMARY KEY(id,dt)
      )
      DISTRIBUTED BY HASH(id)  
      PARTITION BY VALUE('dt'); 
    3. Write data to the table. The following sample statements are used:

      • Method 1: Execute an `INSERT INTO` statement to import data. If a duplicate primary key exists, the data is automatically ignored and not updated. This operation is equivalent to INSERT IGNORE INTO. For more information, see INSERT INTO. The following statement is an example:

        INSERT INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;

        To import data from a specific partition into adb_demo.adb_import_test, you can run the following statement:

        INSERT INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb 
        WHERE dt = '202207'; 
      • Method 2: Run an `INSERT OVERWRITE INTO` statement to import data. This statement overwrites the existing data in the table. The following statement is an example:

        INSERT OVERWRITE INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;
      • Method 3: Asynchronously execute an `INSERT OVERWRITE INTO` statement to import data. You can use the SUBMIT JOB command to submit an asynchronous task. The task is scheduled by the backend. You can add a hint (/*+ direct_batch_load=true*/) before the write task to accelerate the task. For more information, see Asynchronous writes. The following statement is an example:

        SUBMIT JOB 
        INSERT OVERWRITE INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;

        The following result is returned:

        +---------------------------------------+
        | job_id                                |
        +---------------------------------------+
        | 2020112122202917203100908203303****** |
        +---------------------------------------+

        For more information, see Submitting import tasks asynchronously.
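
        For reference, the accelerated variant mentioned above might look like the following sketch. The placement of the direct_batch_load hint directly after SUBMIT JOB is an assumption based on the hint syntax used elsewhere in this topic; see Asynchronous writes for the authoritative form:

        SUBMIT JOB /*+ direct_batch_load=true*/
        INSERT OVERWRITE INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;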

Elastic import

  1. Navigate to the SQL editor.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.

    2. In the navigation pane on the left, choose Job Development > SQL Development.

  2. Create a database. You can skip this step if a database already exists. The following statement is an example:

    CREATE DATABASE adb_demo; 
  3. Create an external table.

    Note
    • The name of the external table in AnalyticDB for MySQL must be the same as the name of the MaxCompute project. Otherwise, the external table fails to be created.

    • The external table in AnalyticDB for MySQL and the table in MaxCompute must have the same field names, number of fields, and field order. The data types of the fields must be compatible.

    • Elastic import supports creating external tables only using the CREATE TABLE statement.

    CREATE TABLE IF NOT EXISTS test_adb
    (
        id int,
        name string,
        age int,
        dt string
    )
     ENGINE='ODPS'
     TABLE_PROPERTIES='{
     "endpoint":"http://service.cn-hangzhou.maxcompute.aliyun-inc.com/api",
     "accessid":"yourAccessKeyID",
     "accesskey":"yourAccessKeySecret",
     "partition_column":"dt",
     "project_name":"odps_project",
     "table_name":"odps_nopart_import_test"
     }';                 

    For more information about the parameters that you can set for external tables, see Parameter descriptions.

  4. Query the data.

    SELECT * FROM adb_demo.test_adb;

    The following result is returned:

    +------+-------+------+---------+
    | id   | name  | age  |   dt    |
    +------+-------+------+---------+
    |    1 | james |   10 |  202207 |
    |    2 | bond  |   20 |  202207 |
    |    3 | jack  |   30 |  202207 |
    |    4 | lucy  |   40 |  202207 |
    +------+-------+------+---------+
  5. Create a table in AnalyticDB for MySQL to store the data imported from MaxCompute. The following statement is an example:

    Note

    The internal table and the external table that you created in Step 3 must have the same field names, number of fields, field order, and data types.

    CREATE TABLE IF NOT EXISTS adb_import_test
    (   id int,
        name string,
        age int,
        dt string,
        PRIMARY KEY(id,dt)
    )
    DISTRIBUTED BY HASH(id)
    PARTITION BY VALUE('dt') LIFECYCLE 30;  
  6. Import data.

    Important

    Elastic import supports importing data only using the INSERT OVERWRITE INTO statement.

    • Method 1: Run the INSERT OVERWRITE INTO statement to elastically import data. This statement overwrites the existing data in the table. The following statement is an example:

      /*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
      INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.test_adb;
    • Method 2: Asynchronously execute an INSERT OVERWRITE INTO statement to elastically import data. You can use the SUBMIT JOB command to submit an asynchronous task. The task is scheduled by the backend.

      /*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group]*/
      SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.test_adb;
      Important

      You cannot set priority queues when you asynchronously submit an elastic import task.

      The following result is returned:

      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2023081517192220291720310090151****** |
      +---------------------------------------+

    After you use the SUBMIT JOB command to submit an asynchronous job, the returned result indicates only that the job is successfully submitted. You can use the returned job_id to terminate the asynchronous job or query its status to check whether the job is successfully executed. For more information, see Submit an import task asynchronously.
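
    For example, you can look up the job record by the returned job_id in the INFORMATION_SCHEMA.kepler_meta_async_jobs table, which is also used in the optional check below. Because this topic does not list the columns of that table, the following sketch selects all columns instead of assuming specific status fields:

      SELECT * FROM INFORMATION_SCHEMA.kepler_meta_async_jobs
      WHERE job_name = "2023081517192220291720310090151******";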

    Hint parameters:

    • elastic_load: specifies whether to use elastic import. Valid values: true and false. Default value: false.

    • elastic_load_configs: the configuration parameters of the elastic import feature. You must enclose the parameters within brackets ([ ]) and separate multiple parameters with vertical bars (|). The following parameters are supported:

      • adb.load.resource.group.name (required): the name of the job resource group that runs the elastic import job.

      • adb.load.job.max.acu (optional): the maximum amount of resources for an elastic import job. Unit: AnalyticDB compute units (ACUs). Minimum value: 5 ACUs. Default value: the number of shards plus 1. Execute the following statement to query the number of shards in the cluster:

        SELECT count(1) FROM information_schema.kepler_meta_shards;

      • spark.driver.resourceSpec (optional): the resource type of the Spark driver. Default value: small. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.

      • spark.executor.resourceSpec (optional): the resource type of the Spark executor. Default value: large. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.

      • spark.adb.executorDiskSize (optional): the disk capacity of the Spark executor. Valid values: (0,100]. Unit: GiB. Default value: 10 GiB. For more information, see the "Specify driver and executor resources" section of the Conf configuration parameters topic.
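
    For example, the following sketch combines several of these parameters in one hint, using the documented bracket-and-vertical-bar syntax. The resource group name and the parameter values are illustrative placeholders:

      /*+ elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|adb.load.job.max.acu=16|spark.adb.executorDiskSize=20]*/
      INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.test_adb;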

  7. (Optional) Check whether the submitted import task is an elastic import task.

    SELECT job_name, (job_type = 3) AS is_elastic_load FROM INFORMATION_SCHEMA.kepler_meta_async_jobs WHERE job_name = "2023081818010602101701907303151******";

    The result is as follows:

    +---------------------------------------+------------------+
    | job_name                              | is_elastic_load  |
    +---------------------------------------+------------------+
    | 2023081517195203101701907203151****** |       1          |
    +---------------------------------------+------------------+

    The return value of is_elastic_load is 1 for an elastic import task and 0 for a regular import task.

Tunnel Arrow API

Step 1: Enable the Arrow API

Methods

You can enable the Arrow API at the cluster level using a `SET` command or at the query level using a hint:

  • Enable the Arrow API at the cluster level:

    SET ADB_CONFIG <config_name>=<value>;
  • Enable the Arrow API at the query level:

    /*<config_name>=<value>*/ SELECT * FROM table;

Arrow API configuration parameters

• ODPS_TUNNEL_ARROW_ENABLED: specifies whether to enable the Arrow API. Valid values: true and false. Default value: false.

• ODPS_TUNNEL_SPLIT_BY_SIZE_ENABLED: specifies whether to enable dynamic split. Valid values: true and false. Default value: false.
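
For example, the following statements enable the Arrow API and dynamic split at the cluster level by substituting these parameters into the SET ADB_CONFIG syntax shown above:

    SET ADB_CONFIG ODPS_TUNNEL_ARROW_ENABLED=true;
    SET ADB_CONFIG ODPS_TUNNEL_SPLIT_BY_SIZE_ENABLED=true;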

Step 2: Access and import MaxCompute data

  1. Navigate to the SQL editor.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.

    2. In the navigation pane on the left, choose Job Development > SQL Development.

  2. Create an external database. The following statement is an example:

    CREATE EXTERNAL DATABASE adb_external_db;
  3. Create an external table. This topic uses test_adb as an example.

    CREATE EXTERNAL TABLE IF NOT EXISTS adb_external_db.test_adb (
        id int,
        name varchar(1023),
        age int,
        dt string
        ) ENGINE='ODPS'
    TABLE_PROPERTIES='{
    "accessid":"yourAccessKeyID",
    "endpoint":"http://service.cn-hangzhou.maxcompute.aliyun.com/api",
    "accesskey":"yourAccessKeySecret",
    "partition_column":"dt",
    "project_name":"odps_project",
    "table_name":"odps_nopart_import_test"
    }';
    Note
    • The external table in AnalyticDB for MySQL and the table in MaxCompute must have the same field names, number of fields, and field order. The data types of the fields must be compatible.

    • For more information about the parameters for external tables, see CREATE EXTERNAL TABLE.

  4. Query the data.

    SELECT * FROM adb_external_db.test_adb;

    The following result is returned:

    +------+-------+------+---------+
    | id   | name  | age  |   dt    |
    +------+-------+------+---------+
    |    1 | james |   10 |  202207 |
    |    2 | bond  |   20 |  202207 |
    |    3 | jack  |   30 |  202207 |
    |    4 | lucy  |   40 |  202207 |
    +------+-------+------+---------+
  5. Perform the following steps to import data from MaxCompute to AnalyticDB for MySQL.

    1. Create a database in AnalyticDB for MySQL. The following statement is an example:

      CREATE DATABASE adb_demo; 
    2. Create a table in AnalyticDB for MySQL to store the data imported from MaxCompute. The following statement is an example:

      Note

      The new table and the external table that you created in Step 3 must have the same field order and number of fields. The data types of the fields must be compatible.

      CREATE TABLE IF NOT EXISTS adb_demo.adb_import_test(
          id int,
          name string,
          age int,
          dt string,
          PRIMARY KEY(id,dt)
      )
      DISTRIBUTED BY HASH(id)  
      PARTITION BY VALUE('dt'); 
    3. Write data to the table. The following sample statements are used:

      • Method 1: Execute an `INSERT INTO` statement to import data. This operation is equivalent to INSERT IGNORE INTO. If a primary key is duplicated, the new data is ignored and the existing record is not updated. For more information, see INSERT INTO. The following statement is an example:

        INSERT INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;

        To import data from a specific partition into adb_demo.adb_import_test, you can run the following statement:

        INSERT INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb 
        WHERE dt = '202207'; 
      • Method 2: Run an `INSERT OVERWRITE INTO` statement to import data. This statement overwrites the existing data in the table. The following statement is an example:

        INSERT OVERWRITE INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;
      • Method 3: Asynchronously import data using an `INSERT OVERWRITE INTO` statement. You can use the SUBMIT JOB command to submit an asynchronous task. The task is scheduled by the backend. You can add a hint (/*+ direct_batch_load=true*/) before the write task to accelerate the task. For more information, see Asynchronous writes. The following statement is an example:

        SUBMIT JOB 
        INSERT OVERWRITE INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;

        The following result is returned:

        +---------------------------------------+
        | job_id                                |
        +---------------------------------------+
        | 2020112122202917203100908203303****** |
        +---------------------------------------+

        For more information, see Asynchronously submitting import tasks.

Data Warehouse Edition

  1. Connect to the AnalyticDB for MySQL cluster. For more information, see Connect to an AnalyticDB for MySQL cluster.

  2. Create the destination database.

    CREATE DATABASE test_adb;
  3. Create a MaxCompute external table. This topic uses odps_nopart_import_test_external_table as an example.

    CREATE TABLE IF NOT EXISTS odps_nopart_import_test_external_table
    (
        id int,
        name string,
        age int,
        dt string
    )
     ENGINE='ODPS'
     TABLE_PROPERTIES='{
     "endpoint":"http://service.cn.maxcompute.aliyun-inc.com/api",
     "accessid":"yourAccessKeyID",
     "accesskey":"yourAccessKeySecret",
     "partition_column":"dt",
     "project_name":"odps_project1",
     "table_name":"odps_nopart_import_test"
     }';                 

    The following list describes the parameters in the statement:

    • ENGINE='ODPS': the storage engine of the external table. Set this parameter to ODPS to read data from or write data to MaxCompute.

    • endpoint: the endpoint (domain name) of MaxCompute.

      Note

      You can access MaxCompute from AnalyticDB for MySQL only through a MaxCompute VPC endpoint. To query the VPC endpoints for different regions, see VPC Endpoints.

    • accessid: the AccessKey ID of an Alibaba Cloud account or a Resource Access Management (RAM) user that has permissions to access MaxCompute. For more information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and Permissions.

    • accesskey: the AccessKey secret of an Alibaba Cloud account or a RAM user that has permissions to access MaxCompute. For more information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and Permissions.

    • partition_column: the partition column. This example creates a partitioned table, so the partition_column parameter is required. If your source table in MaxCompute is not partitioned, create a non-partitioned external table in AnalyticDB for MySQL and omit this parameter.

    • project_name: the name of the MaxCompute project (workspace).

    • table_name: the name of the source table in MaxCompute.
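
    If the MaxCompute source table had no partitions, the external table definition would omit both the dt column and the partition_column property, as in the following sketch. The source table name here is a hypothetical placeholder:

    -- Sketch: external table for a hypothetical non-partitioned MaxCompute table.
    CREATE TABLE IF NOT EXISTS odps_nopart_external_table
    (
        id int,
        name string,
        age int
    )
     ENGINE='ODPS'
     TABLE_PROPERTIES='{
     "endpoint":"http://service.cn.maxcompute.aliyun-inc.com/api",
     "accessid":"yourAccessKeyID",
     "accesskey":"yourAccessKeySecret",
     "project_name":"odps_project",
     "table_name":"your_nonpartitioned_table"
     }';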

  4. In the test_adb database, create the adb_nopart_import_test table to store the data imported from MaxCompute.

    CREATE TABLE IF NOT EXISTS adb_nopart_import_test
    (   id int,
        name string,
        age int,
        dt string,
        PRIMARY KEY(id,dt)
    )
    DISTRIBUTED BY HASH(id)
    PARTITION BY VALUE('dt') LIFECYCLE 30;
  5. Import the data.

    • Method 1: Run an INSERT INTO statement to import data. If a primary key conflict occurs, the system ignores the current data and does not perform an update. This operation is equivalent to INSERT IGNORE INTO. For more information, see INSERT INTO. The following statement is an example:

      INSERT INTO adb_nopart_import_test
      SELECT * FROM odps_nopart_import_test_external_table; 

      The following example shows how to use a SELECT statement to query the data written to the table:

      SELECT * FROM adb_nopart_import_test;

      The following result is returned:

      +------+-------+------+---------+
      | id   | name  | age  |   dt    |
      +------+-------+------+---------+
      |    1 | james |   10 |  202207 |
      |    2 | bond  |   20 |  202207 |
      |    3 | jack  |   30 |  202207 |
      |    4 | lucy  |   40 |  202207 |
      +------+-------+------+---------+

      To import data from a specific partition into adb_nopart_import_test, run the following command:

      INSERT INTO adb_nopart_import_test
      SELECT * FROM odps_nopart_import_test_external_table
      WHERE dt = '202207';
    • Method 2: Run an INSERT OVERWRITE INTO statement to import data. This operation overwrites the existing data in the table. The following statement is an example:

      INSERT OVERWRITE INTO adb_nopart_import_test
      SELECT * FROM odps_nopart_import_test_external_table;
    • Method 3: Asynchronously run an INSERT OVERWRITE INTO statement to import data. You can use the SUBMIT JOB command to submit an asynchronous task that is scheduled in the background. You can add a hint (/*+ direct_batch_load=true*/) before the write task to accelerate it. For more information, see Asynchronous writes. The following statement is an example:

      SUBMIT JOB 
      INSERT OVERWRITE INTO adb_nopart_import_test 
      SELECT * FROM odps_nopart_import_test_external_table;  

      The following result is returned:

      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2020112122202917203100908203303****** |
      +---------------------------------------+

      For more information about how to submit asynchronous import tasks, see Submit an asynchronous import task.