AnalyticDB for MySQL provides a variety of solutions for you to synchronize data in different scenarios. This topic describes how to import objects from Object Storage Service (OSS) to the AnalyticDB for MySQL database named adb_demo and query data in AnalyticDB for MySQL.

Prerequisites

  • A directory is created in an OSS bucket to store the data that you want to import into an AnalyticDB for MySQL cluster. To prepare the directory and data, perform the following operations:
    1. Activate OSS. For more information, see Activate OSS.
    2. Create an OSS bucket. For more information, see Create buckets.
      Important Make sure that the OSS bucket resides in the same region as the AnalyticDB for MySQL cluster.
    3. Create a directory. For more information, see Create a directory.
    4. Upload an object. For more information, see Upload objects.
      In this example, the oss_import_test_data.txt object is uploaded to the <bucket-name>.oss-cn-hangzhou.aliyuncs.com/adb/ directory in OSS. The row delimiter is a line feed, and the column delimiter is a semicolon (;). The following code shows some of the data contained in this object:
      uid;other
      12;hello_world_1
      27;hello_world_2
      28;hello_world_3
      33;hello_world_4
      37;hello_world_5
      40;hello_world_6
      ...     
  • An AnalyticDB for MySQL cluster is created. A whitelist is configured for the cluster. An account and a database are created for the cluster. For more information, see Use AnalyticDB for MySQL Data Warehouse Edition (V3.0).
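    If the database does not exist yet, the following statement is a minimal sketch of creating it, assuming that your account has the required permissions:

    CREATE DATABASE adb_demo;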

Procedure

  1. Execute a CREATE TABLE statement to create an external table in the adb_demo database. For more information about how to create an OSS external table for an object that is in the CSV, Parquet, or TEXT format, see the "Syntax for creating an OSS external table" section of this topic.
  2. Query OSS data.

    You can query external table data in the same manner as you query AnalyticDB for MySQL table data. Example:

    SELECT uid, other FROM oss_import_test_external_table WHERE uid < 100 LIMIT 10;
    • If an object is in the CSV or TEXT format and contains a large amount of data, we recommend that you import the object to AnalyticDB for MySQL before you query the data. Otherwise, the query performance may be compromised.
    • If an object is in the Parquet format, you can determine whether to directly query the data or import the object to AnalyticDB for MySQL before you query the data.
  3. Execute a CREATE TABLE statement to create a destination table named adb_oss_import_test in the adb_demo database to store the data imported from OSS.
    CREATE TABLE IF NOT EXISTS adb_oss_import_test
    (
        uid string,
        other string
    )
    DISTRIBUTED BY HASH(uid);                  
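    The preceding table does not define a primary key, so the duplicate-key behavior described in step 4 does not apply to it. If you want that behavior, the following sketch adds a primary key (the table name adb_oss_import_test_pk is an example):

    CREATE TABLE IF NOT EXISTS adb_oss_import_test_pk
    (
        uid string,
        other string,
        PRIMARY KEY (uid)
    )
    DISTRIBUTED BY HASH(uid);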
  4. Execute an INSERT statement to import data from OSS to AnalyticDB for MySQL.
    Important By default, INSERT INTO and INSERT OVERWRITE INTO SELECT statements import data synchronously. If hundreds of gigabytes of data is imported, the import may fail because the connection between the client and the AnalyticDB for MySQL server is interrupted. Therefore, if you want to import a large amount of data, we recommend that you execute the SUBMIT JOB INSERT OVERWRITE INTO SELECT statement to import data asynchronously.
    • Method 1: Execute the INSERT INTO statement to import data. If the primary key has duplicate values, data is not repeatedly inserted and the INSERT INTO statement is equivalent to the INSERT IGNORE INTO statement. For more information, see INSERT INTO. Sample statement:
      INSERT INTO adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;
    • Method 2: Execute the INSERT OVERWRITE INTO statement to import data. If the primary key has duplicate values, the original value is overwritten by the new value. Sample statement:
      INSERT OVERWRITE INTO adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;
    • Method 3: Asynchronously execute the INSERT OVERWRITE INTO statement to import data. Typically, SUBMIT JOB is used to submit an asynchronous task. You can add a hint (/* direct_batch_load=true */) before the data import statement to accelerate the task, as shown in the sketch at the end of this method. For more information, see Asynchronous writing. Sample statement:
      SUBMIT JOB INSERT OVERWRITE INTO adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;
      The following information is returned:
      +---------------------------------------+
      | job_id                                |
      +---------------------------------------+
      | 2020112122202917203100908203303****** |
      +---------------------------------------+

      For more information about how to submit asynchronous tasks, see Asynchronously submit an import task.
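      The following sketch shows the accelerated variant; the hint placement (before the data import statement) follows the description above, and the tables are the ones created in this topic:

      SUBMIT JOB /* direct_batch_load=true */ INSERT OVERWRITE INTO adb_oss_import_test
      SELECT * FROM oss_import_test_external_table;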

  5. Execute the following statement to query the data of the adb_oss_import_test table:
    SELECT * FROM adb_oss_import_test;
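    As a quick sanity check after the import, you can compare the row count of the destination table against that of the external table. This minimal sketch uses only the tables created in this topic:

    SELECT COUNT(*) FROM oss_import_test_external_table;
    SELECT COUNT(*) FROM adb_oss_import_test;

    The two counts should match if the import completed successfully.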

Syntax for creating an OSS external table

Create an OSS external table for an object that is in the CSV format

The oss_import_test_data.txt object is in the CSV format. The following statement shows how to create an OSS external table for an object in the CSV format:

CREATE TABLE IF NOT EXISTS oss_import_test_external_table
(
    uid string,
    other string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://<bucket-name>/adb/oss_import_test_data.csv",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "delimiter":";",
    "skip_header_line_count":1
}'; 
The following parameters are supported:

  • ENGINE='OSS' (required): The storage engine that is used for the external table, which is OSS.
  • TABLE_PROPERTIES (required): The connection information that is used by AnalyticDB for MySQL to access OSS.
  • endpoint (required): The endpoint of the OSS bucket.
    Note AnalyticDB for MySQL can access OSS only from Elastic Compute Service (ECS) instances over VPCs. You can log on to the OSS console, find the bucket, and then view the endpoint on the Overview page.
  • url (required): The absolute path of the source object or directory in OSS. We recommend that you end the absolute path of a directory with a forward slash (/). Examples:
      • Object: oss://<bucket-name>/adb/oss_import_test_data.csv
      • Directory: oss://<bucket-name>/adb_data/
  • accessid (required): The AccessKey ID that is used to access the object or directory in OSS. For more information about how to obtain an AccessKey ID and an AccessKey secret, see Obtain an AccessKey pair.
  • accesskey (required): The AccessKey secret that is used to access the object or directory in OSS.
  • delimiter (required): The column delimiter of the object in the CSV format. For example, you can set the column delimiter to a comma (,).
  • null_value (optional): The NULL value of the object in the CSV format. By default, an empty string is defined as NULL, which is "null_value": "".
    Note This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later. For more information, see Release notes.
  • ossnull (optional): The rule for defining the NULL value of the object in the CSV format. Default value: 1. Valid values:
      • 1: EMPTY_SEPARATORS. Only empty strings are defined as NULL. For example, a,"",,c is expressed as "a","",NULL,"c" based on this rule.
      • 2: EMPTY_QUOTES. Only quotation marks ("") are defined as NULL. For example, a,"",,c is expressed as "a",NULL,"","c" based on this rule.
      • 3: BOTH. Both empty strings and quotation marks ("") are defined as NULL. For example, a,"",,c is expressed as "a",NULL,NULL,"c" based on this rule.
      • 4: NEITHER. Neither empty strings nor quotation marks ("") are defined as NULL. For example, a,"",,c is expressed as "a","","","c" based on this rule.
    Note The preceding examples are provided on the premise of "null_value": "".
  • skip_header_line_count (optional): The number of header rows to skip when you import data. For example, if the first row of a CSV object is the table header, you can set this parameter to 1 to skip the first row. The default value is 0, which indicates that no rows are skipped.
  • oss_ignore_quote_and_escape (optional): Specifies whether to ignore quotation marks (") and escape characters. The default value is false, which indicates that quotation marks (") and escape characters are not ignored.
    Note This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later. For more information, see Release notes.
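
The NULL-handling parameters above can be combined in a single statement. The following is a minimal sketch, assuming a comma-delimited object and a cluster of V3.1.4.2 or later (the table name oss_import_test_external_table_null and the parameter values shown are examples):

CREATE TABLE IF NOT EXISTS oss_import_test_external_table_null
(
    uid string,
    other string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://<bucket-name>/adb/oss_import_test_data.csv",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "delimiter":",",
    "null_value":"",
    "ossnull":2,
    "skip_header_line_count":1
}';

With "ossnull":2, only quotation marks ("") in the object are imported as NULL, as described in the EMPTY_QUOTES rule above.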
Create an OSS external table for an object that is in the Parquet or ORC format

The following statement shows how to create an external table for an object in the Parquet format:
CREATE TABLE IF NOT EXISTS oss_import_test_external_table
(
    uid string,
    other string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://<bucket-name>/adb_data/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"parquet"
}';
The following parameters are supported:

  • ENGINE='OSS': The storage engine that is used for the external table, which is OSS.
  • TABLE_PROPERTIES: The connection information that is used by AnalyticDB for MySQL to access OSS.
  • endpoint: The endpoint of the OSS bucket.
    Note AnalyticDB for MySQL can access OSS only from ECS instances over VPCs. You can log on to the OSS console, find the bucket, and then view the endpoint on the Overview page.
  • url: The absolute path of the source object or directory in OSS. We recommend that you end the absolute path of a directory with a forward slash (/). Examples:
      • Object: oss://<bucket-name>/adb/oss_import_test_data.parquet
      • Directory: oss://<bucket-name>/adb_data/
    Note When you create an external table, specify the actual absolute path as the url value.
  • accessid: The AccessKey ID that is used to access the object or directory in OSS. For more information about how to obtain an AccessKey ID and an AccessKey secret, see Obtain an AccessKey pair.
  • accesskey: The AccessKey secret that is used to access the object or directory in OSS.
  • format: The format of the object. When you create an external table for an object in the Parquet format, you must set this parameter to parquet. When you create an external table for an object in the ORC format, you must set this parameter to orc.
    Note If the format parameter is not specified, the CSV format is used.
Note
  • The column names used in the statement to create an external table must be the same as those in the Parquet or ORC file. Column names are case-insensitive. The sequence of the columns in the statement must be the same as that in the Parquet or ORC file.
  • When you create an external table, you can choose only specific columns in a Parquet or ORC file as the columns of the external table. Columns that are not selected in the Parquet or ORC file are not imported.
  • If the statement used to create an external table contains a column that is not in the Parquet or ORC file, NULL is returned for this column.
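An ORC external table differs from the Parquet example only in the format value. The following is a minimal sketch (the table name oss_orc_external_table and the directory path are examples):

CREATE TABLE IF NOT EXISTS oss_orc_external_table
(
    uid string,
    other string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://<bucket-name>/adb_data_orc/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"orc"
}';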

When you create an external table for an object in the Parquet format, take note of the following mappings between data types in Parquet and AnalyticDB for MySQL V3.0:

Basic type in Parquet                         | Logical type in Parquet | Data type in AnalyticDB for MySQL V3.0
BOOLEAN                                       | None                    | BOOLEAN
INT32                                         | INT_8                   | TINYINT
INT32                                         | INT_16                  | SMALLINT
INT32                                         | None                    | INT or INTEGER
INT64                                         | None                    | BIGINT
FLOAT                                         | None                    | FLOAT
DOUBLE                                        | None                    | DOUBLE
FIXED_LEN_BYTE_ARRAY, BINARY, INT64, or INT32 | DECIMAL                 | DECIMAL
BINARY                                        | UTF-8                   | VARCHAR, STRING, or JSON (JSON is available if the object contains a column in the JSON format.)
INT32                                         | DATE                    | DATE
INT64                                         | TIMESTAMP_MILLIS        | TIMESTAMP or DATETIME
INT96                                         | None                    | TIMESTAMP or DATETIME
Important External tables in the Parquet format do not support the STRUCT data type. Tables cannot be created for this data type.
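To illustrate the preceding mappings, the following sketch creates an external table for a hypothetical Parquet object whose columns use the INT64, DECIMAL, DATE, and TIMESTAMP_MILLIS types (all column names and the directory path are examples):

CREATE TABLE IF NOT EXISTS oss_parquet_typed_external_table
(
    id bigint,
    price decimal(15, 2),
    created date,
    updated timestamp
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://<bucket-name>/adb_typed_data/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"parquet"
}';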

Create an OSS external table for a partitioned object that is in the Parquet or CSV format

A hierarchical directory structure is generated for OSS data that contains partitions. Example:
parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0
In the preceding example, p1 indicates the level-1 partition, p2 indicates the level-2 partition, and p3 indicates the level-3 partition. If you want to query the data by partition, you must specify partition columns in the statement used to create an OSS external table. The following statement shows how to create an OSS external table and specify partition columns in the external table. In this example, a Parquet object is used.
CREATE TABLE IF NOT EXISTS oss_parquet_partition_table
(
  uid varchar,
  other varchar,
  p1 date,
  p2 int,
  p3 varchar
)
ENGINE='OSS'
TABLE_PROPERTIES='{
  "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
  "url":"oss://<bucket-name>/adb/oss_parquet_data_dir/",
  "accessid":"LTAIF****5FsE",
  "accesskey":"Ccw****iWjv",
  "format":"parquet",
  "partition_column":"p1, p2, p3"
}';
Note
  • When you define partition columns in an external table, you must specify the columns that contain partitions (such as p1, p2, and p3 in the example) and their data types, and you must place the partition columns at the end of the column definitions.
  • The partition_column property of TABLE_PROPERTIES must declare the partition columns in the same sequence as they are defined in the statement used to create the external table (such as "partition_column":"p1, p2, p3" in the example).
  • Partition columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.
  • The syntax used to query partition columns and the way in which the query results are displayed are the same as those for other columns.
  • If the format parameter is not specified, the CSV format is used.
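
Because partition columns are queried in the same way as other columns, the following sketch filters the example external table by the partitions shown in the directory structure above:

SELECT uid, other, p1, p2, p3
FROM oss_parquet_partition_table
WHERE p1 = '2020-01-01' AND p3 = 'SHANGHAI';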

References

For more information about how to import data to AnalyticDB for MySQL, see Supported data sources.