All Products
Search
Document Center

:Use external tables to import data to Data Warehouse Edition

Last Updated:Jan 08, 2024

This topic describes how to query Object Storage Service (OSS) objects by using external tables and then import the objects to AnalyticDB for MySQL Data Warehouse Edition (V3.0). OSS objects in the Parquet, CSV, or TEXT format can be imported by using external tables.

Prerequisites

  • OSS is activated. A bucket and a directory are created in OSS. For more information, see Activate OSS and Create a bucket.

  • An AnalyticDB for MySQL Data Warehouse Edition (V3.0) cluster is created in the same virtual private cloud (VPC) as the OSS bucket. A whitelist is configured and a database account is created for the cluster. For more information, see Use AnalyticDB for MySQL Data Warehouse Edition (V3.0).

  • If the AnalyticDB for MySQL Data Warehouse Edition (V3.0) cluster is in elastic mode, ENI is turned on in the Network Information section of the Cluster Information page.启用ENI网络

Sample data

In this example, the oss_import_test_data.csv object is uploaded to the <bucket-name>.oss-cn-hangzhou.aliyuncs.com/adb/ directory in OSS. The row delimiter is a line feed, and the column delimiter is a semicolon (;). The following code shows part of the data contained in this object:

uid;other
12;hello_world_1
27;hello_world_2
28;hello_world_3
33;hello_world_4
37;hello_world_5
40;hello_world_6
...             

Procedure

In this example, data in the oss_import_test_data.txt object is imported to an AnalyticDB for MySQL database named adb_demo.

  1. Connect to the AnalyticDB for MySQL cluster. For more information, see Connect to an AnalyticDB for MySQL cluster.

  2. Create a database. For more information, see Create a database.

    In this example, a database named adb_demo is created in the AnalyticDB for MySQL cluster.

  3. Create an external table. You can execute the CREATE TABLE statement to create an OSS external table in the CSV, Parquet, or ORC format in the adb_demo database. For more information about the syntax, see the "Create an OSS external table for a non-partitioned object" or "Create an OSS external table for a partitioned object" section of this topic.

  4. Query OSS data. In this example, the external table oss_import_test_external_table is used.

    The syntax for querying external tables is the same as that for querying AnalyticDB for MySQL tables. You can use the following statement to query OSS external tables:

    SELECT uid, other FROM oss_import_test_external_table WHERE uid < 100 LIMIT 10;
    Note

    For queries on CSV, Parquet, and ORC objects, a larger amount of data causes greater performance loss. To improve query efficiency, we recommend that you perform the following steps to import the object to AnalyticDB for MySQL before you query data.

  5. Create a destination table. You can create a destination table named adb_oss_import_test in the adb_demo database to store the data imported from OSS. The following statement can be used to create a destination table:

    CREATE TABLE IF NOT EXISTS adb_oss_import_test
    (
        uid string,
        other string
    )
    DISTRIBUTED BY HASH(uid);
  6. Execute an INSERT statement to import data from OSS to AnalyticDB for MySQL.

    Important

    By default, INSERT INTO and INSERT OVERWRITE SELECT statements import data synchronously. If hundreds of gigabytes of data is imported, the import may fail due to a disconnection from the client to the AnalyticDB for MySQL server. Therefore, if you want to import a large amount of data, we recommend that you execute the SUBMIT JOB INSERT OVERWRITE SELECT statement to import data asynchronously.

    • Method 1: Execute the INSERT INTO statement to import data. If the primary key has duplicate values, data is not repeatedly inserted and the INSERT INTO statement is equivalent to the INSERT IGNORE INTO statement. For more information, see INSERT INTO. Sample statement:

      INSERT INTO adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;
    • Method 2: Execute the INSERT OVERWRITE statement to import data. If the primary key has duplicate values, the original value is overwritten by the new value. Sample statement:

      INSERT OVERWRITE adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;
    • Method 3: Execute the INSERT OVERWRITE statement to import data asynchronously. Typically, SUBMIT JOB is used to submit asynchronous jobs. You can add a hint (/* direct_batch_load=true*/) before the data import statement to accelerate the job. For more information, see the "Asynchronous writing" section of the INSERT OVERWRITE INTO SELECT topic. Sample statement:

      SUBMIT JOB INSERT OVERWRITE adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;

      The following information is returned:

      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2020112122202917203100908203303****** |

      For information about how to submit asynchronous jobs, see Asynchronously submit an import job.

Create an OSS external table for a non-partitioned object

  • Create an OSS external table in the CSV format

    The following statement shows how to create an external table in the CSV format:

    CREATE TABLE IF NOT EXISTS oss_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='OSS'
    TABLE_PROPERTIES='{
        "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
        "url":"oss://<bucket-name>/adb_data/",
        "accessid":"LTAIF****5FsE",
        "accesskey":"Ccw****iWjv",
        "delimiter":";",
        "skip_header_line_count":1,
    }'; 

    Parameter

    Required

    Description

    ENGINE='OSS'

    Yes

    The storage engine that is used for the external table, which is OSS.

    TABLE_PROPERTIES

    The connection information that is used by AnalyticDB for MySQL to access OSS.

    endpoint

    The endpoint of the OSS bucket.

    Note

    AnalyticDB for MySQL can access OSS only from Elastic Compute Service (ECS) instances over VPCs.

    You can log on to the OSS console, find the bucket, and then view the endpoint on the Overview page.

    url

    The absolute path of the source object or directory in OSS. We recommend that you use a forward slash (/) at the end of the absolute path of the directory.

    Examples:

    • Object: oss://<bucket-name>/adb/oss_import_test_data.csv.

    • Directory: oss://<bucket-name>/adb_data/.

      Note

      If you set this parameter to the absolute path of the source directory, the created external table contains all data in the directory.

    accessid

    The AccessKey ID that is used to access the object or directory in OSS.

    For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

    accesskey

    The AccessKey secret that is used to access the object or directory in OSS.

    delimiter

    The column delimiter of the object in the CSV format. For example, you can set the column delimiter to a comma (,).

    null_value

    No

    The NULL value of the object in the CSV format. By default, an empty string is defined as NULL, which is "null_value": "".

    Note

    This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later. For more information, see Release notes.

    ossnull

    The rule for defining the NULL value of the object in the CSV format. Valid values:

    • 1 (default): EMPTY_SEPARATORS. Only empty strings are defined as NULL.

      For example, a,"",,c is expressed as "a","",NULL,"c" based on this rule.

    • 2: EMPTY_QUOTES. Only quotation marks ("") are defined as NULL.

      For example, a,"",,c is expressed as "a",NULL,"","c" based on this rule.

    • 3: BOTH. Both empty strings and quotation marks ("") are defined as NULL.

      For example, a,"",,c is expressed as "a",NULL,NULL,"c" based on this rule.

    • 4: NEITHER. Empty strings and quotation marks ("") are not defined as NULL.

      For example, a,"",,c is expressed as "a","","","c" based on this rule.

    Note

    The preceding examples are provided on the premise of "null_value": "".

    skip_header_line_count

    The number of header rows to skip when you import data. The first row of a CSV object is the table header. If you set this parameter to 1, the first row of the object is skipped when you import data.

    The default value of this parameter is 0, which specifies that no rows are skipped.

    oss_ignore_quote_and_escape

    Specifies whether to ignore quotation marks (") and escape characters. The default value of this parameter is false, which specifies that quotation marks (") and escape characters are not ignored.

    Note

    This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later. For more information, see Release notes.

    AnalyticDB for MySQL allows you to read and write data of TEXT files in Hive by using OSS external tables in the CSV format. The following statement can be used to create an external table:

    CREATE TABLE adb_csv_hive_format_oss (
      a tinyint,
      b smallint,
      c int,
      d bigint,
      e boolean,
      f float,
      g double,
      h varchar,
      i varchar, -- binary
      j timestamp,
      k DECIMAL(10, 4),
      l varchar, -- char(10)
      m varchar, -- varchar(100)
      n date
    ) ENGINE = 'OSS' TABLE_PROPERTIES='{
        "format": "csv",
        "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
        "accessid":"LTAIF****5FsE",
        "accesskey":"Ccw****iWjv",
        "url":"oss://<bucket-name>/adb_data/",
        "delimiter": "\\1",
        "null_value": "\\\\N",
        "oss_ignore_quote_and_escape": "true",
        "ossnull": 2,
    }';
    Note

    When you create an OSS external table in the CSV format to read and write data of a TEXT file in Hive, take note of the following items:

    • The default column delimiter for the TEXT file in Hive is \1. If you want to use the OSS external table to read and write data of the TEXT file in Hive, \1 must be escaped to \\1 for the delimiter parameter.

    • By default, the NULL value of the TEXT file in Hive is \N. If you want to use the OSS external table to read and write data of the TEXT file in Hive, \N must be escaped to \\\\N for the null_value parameter.

    • The BINARY, CHAR(N), and VARCHAR(N) data types in Hive all correspond to the VARCHAR data type in AnalyticDB for MySQL. Other basic data types in Hive such as BOOLEAN are the same as those in AnalyticDB for MySQL.

  • Create an OSS external table in the Parquet or ORC format

    The following statement shows how to create an external table in the Parquet format:

    CREATE TABLE IF NOT EXISTS oss_import_test_external_table
    (
        uid string,
        other string
    )
    ENGINE='OSS'
    TABLE_PROPERTIES='{
        "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
        "url":"oss://<bucket-name>/adb_data/",
        "accessid":"LTAIF****5FsE",
        "accesskey":"Ccw****iWjv",
        "format":"parquet"
    }';

    Parameter

    Description

    ENGINE= 'OSS'

    The storage engine that is used for the external table, which is OSS.

    TABLE_PROPERTIES

    The connection information that is used by AnalyticDB for MySQL to access OSS.

    endpoint

    The endpoint of the OSS bucket.

    Note

    AnalyticDB for MySQL can access OSS only from ECS instances over VPCs.

    You can log on to the OSS console, find the bucket, and then view the endpoint on the Overview page.

    url

    The absolute path of the source object or directory in OSS. We recommend that you use a forward slash (/) at the end of the absolute path of the directory.

    Examples:

    • Object: oss://<bucket-name>/adb/oss_import_test_data.parquet.

    • Directory: oss://<bucket-name>/adb_data/.

    Note

    When you create an external table, specify an actual absolute path as url.

    accessid

    The AccessKey ID that is used to access the object or directory in OSS.

    For information about how to obtain an AccessKey ID and an AccessKey secret, see Accounts and permissions.

    accesskey

    The AccessKey secret that is used to access the object or directory in OSS.

    format

    The format of the object.

    • When you create an external table in the Parquet format, you must set this parameter to parquet.

    • When you create an external table in the ORC format, you must set this parameter to orc.

    Note

    If the format parameter is not specified, the CSV format is used.

    Note
    • The column names used in the statement to create an external table must be the same as those in the Parquet or ORC file. Column names are case-insensitive. The sequence of the columns in the statement must be the same as that in the Parquet or ORC file.

    • When you create an external table, you can choose only specific columns in a Parquet or ORC file as the columns of the external table. Columns that are not selected in the Parquet or ORC file are not imported.

    • If the statement used to create an external table contains a column that is not in the Parquet or ORC file, NULL is returned for this column.

    The following table describes the mappings between data types in Parquet and AnalyticDB for MySQL V3.0.

    Basic type in Parquet

    Logical type in Parquet

    Data type in AnalyticDB for MySQL V3.0

    BOOLEAN

    N/A

    BOOLEAN

    INT32

    INT_8

    TINYINT

    INT32

    INT_16

    SMALLINT

    INT32

    N/A

    INT or INTEGER

    INT64

    N/A

    BIGINT

    FLOAT

    N/A

    FLOAT

    DOUBLE

    N/A

    DOUBLE

    • FIXED_LEN_BYTE_ARRAY

    • BINARY

    • INT64

    • INT32

    DECIMAL

    DECIMAL

    BINARY

    UTF-8

    • VARCHAR

    • STRING

    • JSON (JSON is available if an object in the Parquet format contains a column in the JSON format.)

    INT32

    DATE

    DATE

    INT64

    TIMESTAMP_MILLIS

    TIMESTAMP or DATETIME

    INT96

    N/A

    TIMESTAMP or DATETIME

    Important

    Parquet external tables that use columns of the STRUCT data type cannot be created.

    The following table describes the mappings between data types in ORC and AnalyticDB for MySQL V3.0.

    Data type in ORC

    Data type in AnalyticDB for MySQL V3.0

    BOOLEAN

    BOOLEAN

    BYTE

    TINYINT

    SHORT

    SMALLINT

    INT

    INT or INTEGER

    LONG

    BIGINT

    DECIMAL

    DECIMAL

    FLOAT

    FLOAT

    DOUBLE

    DOUBLE

    • BINARY

    • STRING

    • VARCHAR

    • VARCHAR

    • STRING

    • JSON (JSON is available if an object in the ORC format contains a column in the JSON format.)

    TIMESTAMP

    TIMESTAMP or DATETIME

    DATE

    DATE

    Important

    ORC external tables that use columns of the LIST, STRUCT, or UNION data type cannot be created. ORC external tables that use columns of the MAP data type can be created, but you cannot perform queries on the tables.

Create an OSS external table for a partitioned object

A hierarchical directory is generated for OSS data that contains partitions. Example:

parquet_partition_classic/
├── p1=2020-01-01
│ ├── p2=4
│ │ ├── p3=SHANGHAI
│ │ │ ├── 000000_0
│ │ │ └── 000000_1
│ │ └── p3=SHENZHEN
│ │     └── 000000_0
│ └── p2=6
│     └── p3=SHENZHEN
│         └── 000000_0
├── p1=2020-01-02
│ └── p2=8
│     ├── p3=SHANGHAI
│     │ └── 000000_0
│     └── p3=SHENZHEN
│         └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p2=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0

In the preceding example, p1 indicates the level-1 partition, p2 indicates the level-2 partition, and p3 indicates the level-3 partition. If you want to query the data by partition, you must specify partition columns in the statement used to create an OSS external table. The following statement shows how to create an OSS external table and specify partition columns in the external table. In this example, a Parquet object is used.

CREATE TABLE IF NOT EXISTS oss_parquet_partition_table
(
  uid varchar,
  other varchar,
  p1 date,
  p2 int,
  p3 varchar
)
ENGINE='OSS'
TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://<bucket-name>/adb/oss_parquet_data_dir",
  "accessid":"LTAIF****5FsE",
  "accesskey":"Ccw****iWjv",
  "format":"parquet",
  "partition_column":"p1, p2, p3"
}';
Note
  • The partition_column property of TABLE_PROPERTIES must declare the partition columns (such as p1, p2, and p3 in the example). The sequence of partition columns specified by partition_column must be the same as the sequence of partition columns defined in the statement used to create the external table.

  • When you define partition columns in an external table, you must specify the columns that contain partitions (such as p1, p2, and p3 in the example) and their data types. You must specify the partition columns at the end of the statement.

  • The sequence of partition columns defined in the statement used to create an external table must be the same as the sequence of partition columns specified by the partition_column property.

  • Partition columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.

  • The syntax used to query partition columns and the way in which the query results are displayed are the same as those for other columns.

  • If the format parameter is not specified, the CSV format is used.