This topic describes how to use AnalyticDB for MySQL external tables to query Object Storage Service (OSS) objects and how to import objects from OSS into AnalyticDB for MySQL Data Warehouse Edition (V3.0). OSS objects in the Parquet, CSV, or TEXT format can be imported by using external tables.
Prerequisites
- OSS is activated. A bucket and a directory are created in OSS. For more information, see Activate OSS, Create buckets, and Create directories.
- An AnalyticDB for MySQL Data Warehouse Edition (V3.0) cluster is created in the same virtual private cloud (VPC) as the OSS bucket. A whitelist is configured and a database account is created for the cluster. For more information, see Use AnalyticDB for MySQL.
- If the AnalyticDB for MySQL Data Warehouse Edition (V3.0) cluster is in elastic mode, ENI is turned on in the Network Information section of the Cluster Information page.
Description of sample data
The oss_import_test_data.csv object is uploaded to the <bucket-name>.oss-cn-hangzhou.aliyuncs.com/adb/ directory in OSS. The row delimiter is a line feed, and the column delimiter is a semicolon (;). The following code shows some of the data contained in this object:
uid;other
12;hello_world_1
27;hello_world_2
28;hello_world_3
33;hello_world_4
37;hello_world_5
40;hello_world_6
...
Procedure
In this example, data in the oss_import_test_data.csv object is imported to an AnalyticDB for MySQL database named adb_demo.
Create an OSS external table for a non-partitioned object
- Create an OSS external table in the CSV format
The following statement shows how to create an external table in the CSV format:
CREATE TABLE IF NOT EXISTS oss_import_test_external_table
(
    uid string,
    other string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://<bucket-name>/adb/oss_import_test_data.csv",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "delimiter":";",
    "skip_header_line_count":1
}';
The following parameters are used in the statement:
- ENGINE='OSS' (required): The storage engine used for the external table, which is OSS.
- TABLE_PROPERTIES (required): The connection information used by AnalyticDB for MySQL to access OSS.
- endpoint (required): The endpoint of the OSS bucket. Note: AnalyticDB for MySQL can access OSS only from Elastic Compute Service (ECS) instances over VPCs. You can log on to the OSS console, find the bucket, and then view the endpoint on the Overview page.
- url (required): The absolute path of the source object or directory in OSS. We recommend that you use a forward slash (/) at the end of the absolute path of a directory. Examples:
  - Object: oss://<bucket-name>/adb/oss_import_test_data.csv
  - Directory: oss://<bucket-name>/adb_data/
- accessid (required): The AccessKey ID used to access the object or directory in OSS. For more information about how to obtain an AccessKey ID and an AccessKey secret, see Obtain an AccessKey pair.
- accesskey (required): The AccessKey secret used to access the object or directory in OSS.
- delimiter (required): The column delimiter of the object in the CSV format. For example, you can set the column delimiter to a comma (,).
- null_value (optional): The NULL value of the object in the CSV format. By default, an empty string is defined as NULL, which is "null_value": "". Note: This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later. For more information, see Release notes.
- ossnull (optional): The rule for defining the NULL value of the object in the CSV format. Default value: 1. Valid values:
  - 1: EMPTY_SEPARATORS. Only empty strings are defined as NULL. For example, a,"",,c is expressed as "a","",NULL,"c" based on this rule.
  - 2: EMPTY_QUOTES. Only quotation marks ("") are defined as NULL. For example, a,"",,c is expressed as "a",NULL,"","c" based on this rule.
  - 3: BOTH. Both empty strings and quotation marks ("") are defined as NULL. For example, a,"",,c is expressed as "a",NULL,NULL,"c" based on this rule.
  - 4: NEITHER. Neither empty strings nor quotation marks ("") are defined as NULL. For example, a,"",,c is expressed as "a","","","c" based on this rule.
  Note: The preceding examples are provided on the premise of "null_value": "".
- skip_header_line_count (optional): The number of header rows to skip when you import data. The first row of a CSV object is the table header. If you set this parameter to 1, the first row of the object is skipped when you import data. Default value: 0, which indicates that no rows are skipped.
- oss_ignore_quote_and_escape (optional): Specifies whether to ignore quotation marks (") and escape characters. Default value: false, which indicates that quotation marks (") and escape characters are not ignored. Note: This parameter is supported only for AnalyticDB for MySQL clusters of V3.1.4.2 or later. For more information, see Release notes.
AnalyticDB for MySQL allows you to read and write data of TEXT files in Hive by using OSS external tables in the CSV format. The following statement can be used to create such an external table:
CREATE TABLE adb_csv_hive_format_oss
(
    a tinyint,
    b smallint,
    c int,
    d bigint,
    e boolean,
    f float,
    g double,
    h varchar,
    i varchar, -- binary
    j timestamp,
    k DECIMAL(10, 4),
    l varchar, -- char(10)
    m varchar, -- varchar(100)
    n date
)
ENGINE = 'OSS'
TABLE_PROPERTIES='{
    "format": "csv",
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "url":"oss://<bucket-name>/adb/adb_csv_hive_format_oss.txt",
    "delimiter": "\\1",
    "null_value": "\\\\N",
    "oss_ignore_quote_and_escape": "true",
    "ossnull": 2
}';
Note: When you create an OSS external table in the CSV format to read and write data of a TEXT file in Hive, take note of the following items:
- The default column delimiter for the TEXT file in Hive is \1. If you want to use the OSS external table to read and write data of the TEXT file in Hive, \1 must be escaped to \\1 for the delimiter parameter.
- By default, the NULL value of the TEXT file in Hive is \N. If you want to use the OSS external table to read and write data of the TEXT file in Hive, \N must be escaped to \\\\N for the null_value parameter.
- The BINARY, CHAR(N), and VARCHAR(N) data types in Hive all correspond to the VARCHAR data type in AnalyticDB for MySQL. Other basic data types in Hive, such as BOOLEAN, are the same as those in AnalyticDB for MySQL.
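After the external table is created, you can query it in place or import its data into a local AnalyticDB for MySQL table. The following statements are a minimal sketch: the local table name adb_oss_import_test and its distribution clause are assumptions for illustration and do not appear in the original example.
```sql
-- Query the external table directly; the data stays in OSS.
SELECT uid, other FROM oss_import_test_external_table LIMIT 10;

-- Hypothetical local table used as the import target.
CREATE TABLE IF NOT EXISTS adb_oss_import_test
(
    uid string,
    other string,
    PRIMARY KEY (uid)
)
DISTRIBUTED BY HASH(uid);

-- Import the OSS data into the local table.
INSERT INTO adb_oss_import_test
SELECT uid, other FROM oss_import_test_external_table;
```
For repeatable batch loads, the INSERT can be rerun after the OSS object is refreshed; the external table always reflects the current contents of the object at the configured url.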
- Create an OSS external table in the Parquet or ORC format
The following statement shows how to create an external table in the Parquet format:
CREATE TABLE IF NOT EXISTS oss_import_test_external_table
(
    uid string,
    other string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://<bucket-name>/adb_data/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"parquet"
}';
The following parameters are used in the statement:
- ENGINE='OSS': The storage engine used for the external table, which is OSS.
- TABLE_PROPERTIES: The connection information used by AnalyticDB for MySQL to access OSS.
- endpoint: The endpoint of the OSS bucket. Note: AnalyticDB for MySQL can access OSS only from ECS instances over VPCs. You can log on to the OSS console, find the bucket, and then view the endpoint on the Overview page.
- url: The absolute path of the source object or directory in OSS. We recommend that you use a forward slash (/) at the end of the absolute path of a directory. Examples:
  - Object: oss://<bucket-name>/adb/oss_import_test_data.parquet
  - Directory: oss://<bucket-name>/adb_data/
- accessid: The AccessKey ID used to access the object or directory in OSS. For more information about how to obtain an AccessKey ID and an AccessKey secret, see Obtain an AccessKey pair.
- accesskey: The AccessKey secret used to access the object or directory in OSS.
- format: The format of the object. Set this parameter to parquet when you create an external table in the Parquet format, or to orc when you create an external table in the ORC format. Note: If the format parameter is not specified, the CSV format is used.
Note: Take note of the following items:
- The column names used in the statement to create an external table must be the same as those in the Parquet or ORC file. Column names are case-insensitive. The sequence of the columns in the statement must be the same as that in the Parquet or ORC file.
- When you create an external table, you can choose only specific columns in a Parquet or ORC file as the columns of the external table. Columns that are not selected in the Parquet or ORC file are not imported.
- If the statement used to create an external table contains a column that is not in the Parquet or ORC file, NULL is returned for this column.
The following table describes the mappings between data types in Parquet and AnalyticDB for MySQL V3.0.

| Basic type in Parquet | Logical type in Parquet | Data type in AnalyticDB for MySQL V3.0 |
| --- | --- | --- |
| BOOLEAN | N/A | BOOLEAN |
| INT32 | INT_8 | TINYINT |
| INT32 | INT_16 | SMALLINT |
| INT32 | N/A | INT or INTEGER |
| INT64 | N/A | BIGINT |
| FLOAT | N/A | FLOAT |
| DOUBLE | N/A | DOUBLE |
| FIXED_LEN_BYTE_ARRAY, BINARY, INT64, or INT32 | DECIMAL | DECIMAL |
| BINARY | UTF-8 | VARCHAR, STRING, or JSON (JSON is available if an object in the Parquet format contains a column in the JSON format.) |
| INT32 | DATE | DATE |
| INT64 | TIMESTAMP_MILLIS | TIMESTAMP or DATETIME |
| INT96 | N/A | TIMESTAMP or DATETIME |

Important: External tables in the Parquet format do not support the STRUCT data type. Tables cannot be created for this data type.

The following table describes the mappings between data types in ORC and AnalyticDB for MySQL V3.0.

| Data type in ORC | Data type in AnalyticDB for MySQL V3.0 |
| --- | --- |
| BOOLEAN | BOOLEAN |
| BYTE | TINYINT |
| SHORT | SMALLINT |
| INT | INT or INTEGER |
| LONG | BIGINT |
| DECIMAL | DECIMAL |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| BINARY, STRING, or VARCHAR | VARCHAR, STRING, or JSON (JSON is available if an object in the ORC format contains a column in the JSON format.) |
| TIMESTAMP | TIMESTAMP or DATETIME |
| DATE | DATE |

Important: ORC external tables that use the LIST, STRUCT, or UNION data type cannot be created. You can create ORC external tables that use the MAP data type, but cannot perform queries on the tables.
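Because only the columns declared in the CREATE TABLE statement are read from the Parquet or ORC file, an external table can deliberately expose a subset of the file's columns. The following sketch reuses the endpoint, credentials, and directory from the Parquet example above; the table name oss_parquet_uid_only is an assumption for illustration.
```sql
-- Hypothetical: the Parquet files also contain an "other" column,
-- but this external table declares only "uid"; undeclared columns
-- are simply not imported or read.
CREATE TABLE IF NOT EXISTS oss_parquet_uid_only
(
    uid string
)
ENGINE='OSS'
TABLE_PROPERTIES='{
    "endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
    "url":"oss://<bucket-name>/adb_data/",
    "accessid":"LTAIF****5FsE",
    "accesskey":"Ccw****iWjv",
    "format":"parquet"
}';

-- Query only the declared column.
SELECT uid FROM oss_parquet_uid_only LIMIT 10;
```
Conversely, if the statement declares a column that is not present in the file, that column returns NULL, as noted above.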
Create an OSS external table for a partitioned object
A Hive-style partitioned object stores each partition in a separate subdirectory, as shown in the following sample directory structure in OSS:
parquet_partition_classic/
├── p1=2020-01-01
│   ├── p2=4
│   │   ├── p3=SHANGHAI
│   │   │   ├── 000000_0
│   │   │   └── 000000_1
│   │   └── p3=SHENZHEN
│   │       └── 000000_0
│   └── p2=6
│       └── p3=SHENZHEN
│           └── 000000_0
├── p1=2020-01-02
│   └── p2=8
│       ├── p3=SHANGHAI
│       │   └── 000000_0
│       └── p3=SHENZHEN
│           └── 000000_0
└── p1=2020-01-03
    └── p2=6
        ├── p3=HANGZHOU
        └── p3=SHENZHEN
            └── 000000_0
The following statement shows how to create an external table for the partitioned object:
CREATE TABLE IF NOT EXISTS oss_parquet_partition_table
(
uid varchar,
other varchar,
p1 date,
p2 int,
p3 varchar
)
ENGINE='OSS'
TABLE_PROPERTIES='{
"endpoint":"oss-cn-hangzhou-internal.aliyuncs.com",
"url":"oss://<bucket-name>/adb/oss_parquet_data_dir",
"accessid":"LTAIF****5FsE",
"accesskey":"Ccw****iWjv",
"format":"parquet",
"partition_column":"p1, p2, p3"
}';
- The partition_column property of TABLE_PROPERTIES must declare the partition columns (such as p1, p2, and p3 in the example).
- When you define partition columns in an external table, you must specify the columns that contain partitions (such as p1, p2, and p3 in the example) and their data types. You must specify the partition columns at the end of the column definitions in the statement.
- The sequence of partition columns defined in the statement used to create an external table must be the same as the sequence of partition columns specified by the partition_column property.
- Partition columns support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, STRING, DATE, and TIMESTAMP.
- The syntax used to query partition columns and the way in which the query results are displayed are the same as those for other columns.
- If the format parameter is not specified, the CSV format is used.
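Since partition columns are queried like ordinary columns, a query against the example table can filter on them directly. The following sketch uses partition values taken from the sample directory structure; whether the engine prunes non-matching OSS subdirectories is an implementation detail and is not asserted here.
```sql
-- Filter on the partition columns p1 and p2; only rows from the
-- matching partitions (p1=2020-01-01/p2=4/...) are returned.
SELECT uid, other, p1, p2, p3
FROM oss_parquet_partition_table
WHERE p1 = '2020-01-01' AND p2 = 4;
```
The partition values are parsed from the directory names (for example, p1=2020-01-01), so they appear in query results even though they are not stored inside the Parquet files themselves.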