Use metadata discovery to import data from OSS to Data Lakehouse Edition - AnalyticDB for MySQL

Data files that are generated from enterprise services, such as standard forms and log files, are uploaded to Object Storage Service (OSS) on a regular basis. However, metadata management of OSS objects is missing. As a result, OSS data is difficult to compute and analyze. AnalyticDB for MySQL provides the metadata discovery feature to allow the system to automatically map OSS paths, group OSS objects, explore OSS object fields, and identify field changes. This way, data lake metadata is automatically created and updated. You can create an OSS metadata discovery job to discover OSS buckets and data objects in the same region as an AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster and then import data from OSS to the cluster.

Prerequisites

An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created in the same region as an Object Storage Service (OSS) bucket.
Data files are uploaded to the OSS bucket. The OSS path meets the following conditions:
- The OSS path is in the <BucketName>/directory/.../directory/table/object or <BucketName>/directory/.../directory/table/partition/.../partition/object format.
- The OSS bucket has at least one level of child directories.
- The objects in the same table or partition in the OSS path are in the same format.
- The objects in the same table or partition in the OSS path contain the same types and quantity of fields.

Limits

You can create only one metadata discovery job for each cluster in the same OSS path.

Procedure

Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Data Ingestion > Metadata Discovery.
In the OSS Data Source section of the Metadata Discovery page, click Start Wizard.

On the OSS Data Source tab, configure the parameters that are described in the following table.

Section	Parameter	Description
Data Source Settings	Data Warehouse Mode	Automatic metadata discovery is built for OSS-based standard data warehouses and provides high identification precision. Note The free mode is not supported.
	OSS Directory	The OSS directory in which the objects are stored. The directory must end with a forward slash (/). The mapping results vary based on the value of the OSS Directory parameter. For more information, see the "Examples of mapping data from OSS paths to AnalyticDB for MySQL" section of this topic. Important When you configure the OSS Directory parameter, make sure that at least one level of child directories exists after the OSS directory.
	Path Filter Rules (Optional)	The rules for filtering the objects that you want to map to AnalyticDB for MySQL. Specify one or more path prefixes for the objects. Include: All objects that are stored in the OSS path are mapped to AnalyticDB for MySQL. Important If multiple objects are stored in the OSS path, the objects must be in the same format and contain the same types and quantity of fields. Exclude: The objects that are stored in the OSS path are not mapped to AnalyticDB for MySQL.
	Format Parser	The format parser that reads data from objects to determine the object format. By default, Automatic is selected. This value specifies that all built-in format parsers are called in sequence. You can select a format parser that can parse the format of the objects stored in the OSS path. Important If the selected format parser parses a format that is different from the object format, the mapping fails.
	Configuration Options (Optional)	The advanced custom settings. If you set the Format Parser parameter to csv, you can configure the following parameters: Field Delimiter: the column delimiter of CSV objects. Reference Identifier: the reference identifier of parsed AnalyticDB for MySQL tables. Table Header Mode: the method that is used to identify table headers of CSV tables. The table headers are used as column names. Allow Single Field: specifies whether to allow only one column to exist for a row of data in a CSV object. Valid values: false true Other Settings: You can configure parameters to specify whether to use the first row of a CSV object as column names. If you require this setting, contact technical support.
Scheduling Settings	Scheduling Frequency	The scheduling frequency for metadata discovery jobs to obtain the changes to object columns or data on a regular basis. If the OSS object data changes, the table data that is re-mapped to AnalyticDB for MySQL is updated based on the rules that are specified in the Destination Metadata Settings section.
Destination Metadata Settings	Schema Name	The name of the schema, which is the name of the mapped AnalyticDB for MySQL database. By default, a schema is created for each metadata discovery job. When you specify a name for the database, take note of the following items: The database name cannot be the same as the name of an existing database in the AnalyticDB for MySQL cluster. The database name cannot be the same as the name of an existing database in other metadata discovery jobs.
Destination Metadata Settings	Configuration Options (Optional)	The rules for updating the table data that is re-mapped to AnalyticDB for MySQL if an OSS object field changes or an OSS object is deleted. Object Field Change Rules: Add Only Columns: If you add a column to an OSS object, the column is also added to the mapped AnalyticDB for MySQL table in the next metadata discovery job. Ignore Table Updates: If you change the column or partition data of an OSS object, only the partition change is synchronized to the mapped AnalyticDB for MySQL table in the next metadata discovery job. Table schema changes, such as added or removed columns, are not synchronized. Object Delete Change Rules: Only the Ignore Delete Updates option is supported. If you delete an OSS object, the mapped AnalyticDB for MySQL table still exists in the next metadata discovery job. Important You can perform DDL and DML operations on the mapped AnalyticDB for MySQL tables. If you add a column to an AnalyticDB for MySQL table, the column still exists in the next metadata discovery job. If you remove a column from an AnalyticDB for MySQL table, the column is still mapped to the AnalyticDB for MySQL table in the next metadata discovery job.

After you configure the preceding parameters, click Create to create the metadata discovery job.
Note
After you create the metadata discovery job, the system automatically runs the job based on the specified scheduling frequency. If you want to immediately synchronize data, find the job on the Jobs tab and click Execute in the Actions column.
After the job starts, manage the job on the Jobs tab. You can view the status of the job and modify the configurations of the job.
Note
After the job is complete, you can choose Job Development > SQL Development to view the databases, tables, and partitions that are mapped to AnalyticDB for MySQL.

Examples of mapping data from OSS paths to AnalyticDB for MySQL

The results of mapping data from OSS paths to AnalyticDB for MySQL vary based on the following elements:

The OSS object and its path.
The OSS Directory parameter that is specified for the metadata discovery job.

The system maps the name of the first directory level after the OSS directory to the table name and the names of the other directory levels after the OSS directory to partition names. The following table describes the mapping examples.

OSS path	OSS directory	Mapped AnalyticDB for MySQL table
oss://adb/Table1/file1.csv oss://adb/Table1/file2.json	oss://adb/	No data mapping is performed. Causes: The objects that are stored in the Table1 directory are in different formats: CSV and JSON. Important If the objects are in the same format but the fields use different data types, no data mapping is performed.
oss://adb/Table2/year/month/day/file3.json oss://adb/Table2/year/month/day/file4.json	oss://adb/	The mapped AnalyticDB for MySQL table is a partitioned table named `Table2`. The partition fields are `partition_0`, `partition_1`, and `partition_2`. Note No partition keys exist in the partitions. In this case, partition_num is used.
	oss://adb/Table2/	The mapped AnalyticDB for MySQL table is a partitioned table named `year`. The partition fields are `partition_0` and `partition_1`.
	oss://adb/Table2/year/month/	The mapped AnalyticDB for MySQL table is a non-partitioned table named `day`.
	oss://adb/Table2/year/month/day/	No data mapping is performed. Causes: No directory levels exist after the specified OSS directory.
oss://adb/Table3/year=2022/month=03/day=01/file5.csv oss://adb/Table3/year=2022/month=03/day=01/file6.csv	oss://adb/	The mapped AnalyticDB for MySQL table is a partitioned table named `Table3`. The partition fields are `year`, `month`, and `day`.
	oss://adb/Table3/ oss://adb/Table3/year=2022/	No data mapping is performed. Causes: The year=2022 or month=03 directory does not conform to the naming conventions for AnalyticDB for MySQL tables. For more information, see Limits.
oss://adb/Table4/2020/03/30/file7.csv oss://adb/Table3/2020/03/30/file8.csv	oss://adb/	The mapped AnalyticDB for MySQL table is a partitioned table named `Table4`. The partition fields are `partition_0`, `partition_1`, and `partition_2`. Note No partition keys exist in the partitions. In this case, partition_num is used.
	oss://adb/Table3/ oss://adb/Table3/2020/	No data mapping is performed. Causes: The 2020 or 03 directory does not conform to the naming conventions for AnalyticDB for MySQL tables. For more information, see Limits.