AnalyticDB for MySQL:Use external tables to import data to Data Lakehouse Edition

Last Updated: Mar 11, 2024

AnalyticDB for MySQL allows you to access and import MaxCompute data by using external tables. This maximizes the utilization of cluster resources and improves data import performance. This topic describes how to use external tables to import data from MaxCompute to AnalyticDB for MySQL Data Lakehouse Edition (V3.0).

Prerequisites

  • A MaxCompute project and an AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster are created in the same region. For more information, see Create a cluster.

  • The Elastic Network Interface (ENI) feature is enabled for the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

  • The CIDR blocks of the virtual private cloud (VPC) in which the AnalyticDB for MySQL cluster resides are added to a whitelist of the MaxCompute project.

    You can log on to the AnalyticDB for MySQL console to view the VPC ID on the Cluster Information page. Then, you can log on to the VPC console and find the VPC ID on the VPCs page to view the CIDR blocks. For information about how to configure a MaxCompute whitelist, see Manage IP address whitelists.
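    For example, after you obtain the CIDR blocks of the VPC, you can add them to the IP address whitelist of the MaxCompute project on the MaxCompute client. The following statement is a minimal sketch that uses a placeholder CIDR block; replace it with the actual CIDR blocks of your VPC as described in Manage IP address whitelists.

      -- Placeholder CIDR block. Replace it with the CIDR blocks of the VPC in which the AnalyticDB for MySQL cluster resides.
      setproject odps.security.ip.whitelist=192.168.0.0/16;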

Sample data

In this example, a MaxCompute project named test_adb and a MaxCompute table named person are used. Execute the following statement to create a table:

CREATE TABLE IF NOT EXISTS person (
    id int,
    name varchar(1023),
    age int)
partitioned by (dt string);

Execute the following statement to create a partition in the person table:

ALTER TABLE person 
ADD 
PARTITION (dt='202207');

Execute the following statement to insert data into the partition:

INSERT INTO test_adb.person 
partition (dt='202207') 
VALUES (1,'james',10),(2,'bond',20),(3,'jack',30),(4,'lucy',40);
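
To verify that the sample data is written to the partition, you can query the MaxCompute table, for example:

SELECT * FROM test_adb.person WHERE dt = '202207';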

Access and import MaxCompute data in tunnel mode

You can import data by using regular import or elastic import. Regular import is the default method. Regular import reads data by using compute nodes and creates indexes on storage nodes, which consumes computing and storage resources. Elastic import reads data and creates indexes by using Serverless Spark jobs, which consumes resources of job resource groups. Only AnalyticDB for MySQL Data Lakehouse Edition (V3.0) clusters of V3.1.10.0 or later that have job resource groups support elastic import. Compared with regular import, elastic import consumes fewer resources, which reduces the impact on real-time data reads and writes and improves resource isolation and data import efficiency. For more information, see Data import methods.
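
If you plan to use elastic import, you can first check whether the cluster meets the version requirement. A minimal check, assuming that the adb_version() function is available in your cluster:

-- Returns the kernel version of the AnalyticDB for MySQL cluster. Elastic import requires V3.1.10.0 or later.
SELECT adb_version();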

Regular import

  1. Go to the SQL editor.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

    2. In the left-side navigation pane, choose Job Development > SQL Development.
  2. Create an external database.

    CREATE EXTERNAL DATABASE adb_external_db;
  3. Create an external table. In this example, an external table named test_adb is used.

    CREATE EXTERNAL TABLE IF NOT EXISTS adb_external_db.test_adb (
        id int,
        name varchar(1023),
        age int,
        dt string
        ) ENGINE='ODPS'
    TABLE_PROPERTIES='{
    "accessid":"LTAILd4****",
    "endpoint":"http://service.cn-hangzhou.maxcompute.aliyun.com/api",
    "accesskey":"4A5Q7ZVzcYnWMQPysX****",
    "partition_column":"dt",
    "project_name":"test_adb",
    "table_name":"person"
    }';
    Note
    • The AnalyticDB for MySQL external table must use the same names, quantity, and order of fields as the MaxCompute table. The data types of fields must be compatible between the two tables.

    • For information about the parameters that are used to create an external table, see CREATE EXTERNAL TABLE.

  4. Query data.

    SELECT * FROM adb_external_db.test_adb;

    Sample result:

    +------+-------+------+---------+
    | id   | name  | age  |   dt    |
    +------+-------+------+---------+
    |    1 | james |   10 |  202207 |
    |    2 | bond  |   20 |  202207 |
    |    3 | jack  |   30 |  202207 |
    |    4 | lucy  |   40 |  202207 |
    +------+-------+------+---------+
    4 rows in set (0.35 sec)
  5. Perform the following steps to import data from MaxCompute to AnalyticDB for MySQL:

    1. Create a database in the AnalyticDB for MySQL cluster.

      CREATE DATABASE adb_demo;
    2. Create a table in the AnalyticDB for MySQL database to store data that is imported from MaxCompute.

      Note

      The created table must use the same quantity and order of fields as the external table that is created in Step 3. The data types of fields must be compatible between the two tables.

      CREATE TABLE IF NOT EXISTS adb_demo.adb_import_test(
          id int,
          name string,
          age int,
          dt string,
          PRIMARY KEY(id,dt)
      )
      DISTRIBUTE BY HASH(id)
      PARTITION BY VALUE('dt');
    3. Write data to the table.

      • Method 1: Execute the INSERT INTO statement to import data. If the primary key has duplicate values, data is not repeatedly inserted and the INSERT INTO statement is equivalent to the INSERT IGNORE INTO statement. For more information, see INSERT INTO.

        INSERT INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;

        If you want to import data from a specific partition to the adb_demo.adb_import_test table, you can execute the following statement:

        INSERT INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb 
        WHERE dt = '202207'; 
      • Method 2: Execute the INSERT OVERWRITE INTO statement to import data. If the primary key has duplicate values, the original value is overwritten by the new value.

        INSERT OVERWRITE INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;
      • Method 3: Execute the INSERT OVERWRITE INTO statement to asynchronously import data. Typically, the SUBMIT JOB statement is used to submit an asynchronous job. You can add a hint (/* direct_batch_load=true*/) before the data import statement to accelerate the job. For more information, see the "Asynchronous writing" section of the INSERT OVERWRITE SELECT topic. An example that includes the hint is provided after this list.

        SUBMIT JOB
        INSERT OVERWRITE INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;

        Sample result:

        +---------------------------------------+
        | job_id                                |
        +---------------------------------------+
        | 2020112122202917203100908203303****** |
        +---------------------------------------+
      • For information about how to asynchronously submit a job, see Asynchronously submit an import job.
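
      The following statement is a sketch of Method 3 with the acceleration hint added. It assumes that the hint is placed at the beginning of the statement, in the same position as the elastic import hints shown later in this topic:

        /* direct_batch_load=true*/
        SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test
        SELECT * FROM adb_external_db.test_adb;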

Elastic import

  1. Go to the SQL editor.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

    2. In the left-side navigation pane, choose Job Development > SQL Development.
  2. Create a database. If you have already created a database, skip this step.

    CREATE DATABASE adb_demo; 
  3. Create an external table.

    Note
    • The name of the AnalyticDB for MySQL external table must be the same as that of the MaxCompute project. Otherwise, the external table fails to be created.

    • The AnalyticDB for MySQL external table must use the same names, quantity, and order of fields as the MaxCompute table. The data types of fields must be compatible between the two tables.

    • Elastic import allows you to create external tables by using only the CREATE TABLE statement.

    CREATE TABLE IF NOT EXISTS test_adb
    (
        id int,
        name string,
        age int,
        dt string
    )
     ENGINE='ODPS'
     TABLE_PROPERTIES='{
     "endpoint":"http://service.cn-hangzhou.maxcompute.aliyun-inc.com/api",
     "accessid":"LTAILd4****",
     "accesskey":"4A5Q7ZVzcYnWMQPysX****",
     "partition_column":"dt",
     "project_name":"test_adb",
     "table_name":"person"
     }';                 

    For information about the external table parameters, see the parameter table in the Use external tables to import data to Data Warehouse Edition topic.

  4. Query data.

    SELECT * FROM adb_demo.test_adb;

    Sample result:

    +------+-------+------+---------+
    | id   | name  | age  |   dt    |
    +------+-------+------+---------+
    |    1 | james |   10 |  202207 |
    |    2 | bond  |   20 |  202207 |
    |    3 | jack  |   30 |  202207 |
    |    4 | lucy  |   40 |  202207 |
    +------+-------+------+---------+
    4 rows in set (0.35 sec)
  5. Create a table in the AnalyticDB for MySQL database to store data that is imported from MaxCompute.

    Note

    The created internal table must use the same names, quantity, order, and data types of fields as the external table that is created in Step 3.

    CREATE TABLE IF NOT EXISTS adb_import_test
    (   id int,
        name string,
        age int,
        dt string,
        PRIMARY KEY(id,dt)
    )
    DISTRIBUTE BY HASH(id)
    PARTITION BY VALUE('dt') LIFECYCLE 30;  
  6. Import data.

    Important

    Elastic import allows you to import data by using only the INSERT OVERWRITE INTO statement.

    • Method 1: Execute the INSERT OVERWRITE INTO statement to elastically import data. If the primary key has duplicate values, the original value is overwritten by the new value.

      /*elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|spark.adb.eni.vswitchId=vsw-bp12ldm83z4zu9k4d****]*/
      INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.test_adb;
    • Method 2: Asynchronously execute the INSERT OVERWRITE INTO statement to elastically import data. Typically, the SUBMIT JOB statement is used to submit an asynchronous job.

      /*elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|spark.adb.eni.vswitchId=vsw-bp12ldm83z4zu9k4d****]*/
      SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.test_adb;
      Important

      When you asynchronously submit an elastic import job, you cannot configure priority queues.

      Sample result:

      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2023081517192220291720310090151****** |
      +---------------------------------------+

    After you use the SUBMIT JOB statement to submit an asynchronous job, only a job ID is returned, which indicates that the asynchronous job is successfully submitted. You can use the returned job ID to terminate the asynchronous job or query the status of the asynchronous job. For more information, see Asynchronously submit an import job.
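
    For example, you can terminate a submitted job by using the returned job ID. The following statement is a sketch that reuses the job ID from the preceding sample result and assumes the CANCEL JOB statement described in Asynchronously submit an import job:

      CANCEL JOB "2023081517192220291720310090151******";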

    Hint parameters:

    • elastic_load: specifies whether to use elastic import. Valid values: true and false. Default value: false.

    • elastic_load_configs: the configuration parameters of the elastic import feature. You must enclose the parameters within brackets ([ ]) and separate multiple parameters with vertical bars (|). The following parameters are supported:

      • adb.load.resource.group.name (required): the name of the job resource group that runs the elastic import job.

      • adb.load.job.max.acu (optional): the maximum amount of resources for an elastic import job. Unit: AnalyticDB compute units (ACUs). Minimum value: 5 ACUs. Default value: the number of shards plus 1. Execute the following statement to query the number of shards in the cluster:

        SELECT count(1) FROM information_schema.kepler_meta_shards;

      • spark.driver.resourceSpec (optional): the resource type of the Spark driver. Default value: small. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.

      • spark.executor.resourceSpec (optional): the resource type of the Spark executor. Default value: large. For information about the valid values, see the Type column in the "Spark resource specifications" table of the Conf configuration parameters topic.

      • spark.adb.executorDiskSize (optional): the disk capacity of the Spark executor. Valid values: (0,100]. Unit: GiB. Default value: 10 GiB. For more information, see the "Specify driver and executor resources" section of the Conf configuration parameters topic.
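
    For example, the following sketch combines several of these parameters in a single hint. It assumes a job resource group named resource_group and reuses the vSwitch ID from the preceding examples:

      /*elastic_load=true, elastic_load_configs=[adb.load.resource.group.name=resource_group|spark.adb.eni.vswitchId=vsw-bp12ldm83z4zu9k4d****|spark.executor.resourceSpec=large]*/
      SUBMIT JOB INSERT OVERWRITE INTO adb_demo.adb_import_test SELECT * FROM adb_demo.test_adb;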

  7. (Optional) Check whether the submitted job is an elastic import job.

    SELECT job_name, (job_type = 3) AS is_elastic_load FROM INFORMATION_SCHEMA.kepler_meta_async_jobs WHERE job_name = "2023081818010602101701907303151******";

    Sample result:

    +---------------------------------------+------------------+
    | job_name                              | is_elastic_load  |
    +---------------------------------------+------------------+
    | 2023081517195203101701907203151****** |       1          |
    +---------------------------------------+------------------+

    If an elastic import job is submitted, 1 is returned for the is_elastic_load parameter. If a regular import job is submitted, 0 is returned.
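
    After an import job succeeds, you can query the internal table to verify the imported data, for example:

      SELECT COUNT(*) FROM adb_demo.adb_import_test WHERE dt = '202207';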