Data Lake Formation:Metadata discovery

Last Updated:Feb 20, 2024

As a business runs, large amounts of data accumulate in data lakes. Unlike the strictly managed data in data warehouses, this data may be stored in a data lake without being managed or regulated. Metadata discovery analyzes data of a specific format in a data lake and automatically generates metadata. Metadata extraction can be run periodically or manually to achieve schema-on-read for data lake analysis and computing.

Limits

  1. The extracted data can be stored only in an Object Storage Service (OSS) bucket of the Standard storage class.

  2. Currently, metadata discovery supports only the JSON, CSV, Parquet, ORC, Hudi, Delta, and AVRO formats.

  3. The metadata extraction process consumes computing resources but does not incur fees.

Procedure

Create an extraction task

  1. Log on to the Data Lake Formation (DLF) console.

  2. In the left-side navigation pane, choose Metadata > Metadata Discovery.

  3. On the Metadata Discovery page, click Create Extraction Task.

  4. On the Create Extraction Task page, set the parameters that are described in the following table.

Parameter

Description

Extraction Task Name

The name of the metadata extraction task. The name can contain letters, digits, and underscores (_).

Select OSS Path

The path of the OSS bucket from which you want to extract data.

Exclusion Mode

The file paths that you want to exclude from the specified OSS path. You can use regular expressions to match the file paths to be excluded.
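To illustrate how regular-expression exclusion patterns filter object paths, the following sketch applies two made-up patterns to a list of hypothetical OSS paths. This is only an illustration of regex path matching, not DLF's internal implementation:

```python
import re

# Hypothetical exclusion patterns: skip temporary files and a logs/ prefix.
exclusion_patterns = [r".*\.tmp$", r"oss://my-bucket/logs/.*"]

def is_excluded(path, patterns):
    """Return True if the path fully matches any exclusion regex."""
    return any(re.fullmatch(p, path) for p in patterns)

paths = [
    "oss://my-bucket/data/orders.csv",
    "oss://my-bucket/data/orders.csv.tmp",
    "oss://my-bucket/logs/2024-02-20.json",
]
kept = [p for p in paths if not is_excluded(p, exclusion_patterns)]
print(kept)  # only the orders.csv path survives
```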

Parse Format

The format of data from which you extract metadata, such as JSON, CSV, Parquet, ORC, Hudi, Delta, or AVRO. If you set this parameter to Automatic identification, data files are automatically parsed.

Destination Database

The metadatabase in which you want to store the extracted metadata.

Destination Table Prefix

The prefix that is used to generate a name for the destination metadata table. The name of the destination metadata table consists of this prefix and the name of the source file.
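The naming rule above (prefix plus source file name) can be sketched as follows. The separator and the sanitization of special characters are assumptions for illustration, not documented DLF behavior:

```python
import re

def destination_table_name(prefix, source_path):
    """Build a table name from a prefix and a source file name (assumed rule)."""
    stem = source_path.rsplit("/", 1)[-1].rsplit(".", 1)[0]  # file name, no extension
    stem = re.sub(r"[^A-Za-z0-9_]", "_", stem)               # keep letters, digits, underscores
    return f"{prefix}_{stem}"

print(destination_table_name("ods", "oss://my-bucket/data/orders-2024.csv"))
# ods_orders_2024
```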

Method to Handle Table Field Update

The method that is used to process field updates when the fields of the source table differ from those of the destination metadata table. The following methods are available:

  • Adds the updated columns, but does not delete the existing columns in the destination metadata table.

  • Updates the schema of the destination metadata table, and generates a new schema for the destination metadata table based on the latest schema of the source table.

  • Ignores the updates, and does not modify the existing destination metadata table.
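The three strategies can be illustrated with a toy schema merge over column-to-type dictionaries. The mode names and schemas below are hypothetical; this is not DLF's internal logic:

```python
def merge_schema(existing, incoming, mode):
    """Toy illustration of the three field-update strategies."""
    if mode == "add_only":
        merged = dict(existing)
        for col, typ in incoming.items():
            merged.setdefault(col, typ)  # add new columns, never delete existing ones
        return merged
    if mode == "overwrite":
        return dict(incoming)            # regenerate from the latest source schema
    return dict(existing)                # "ignore": leave the table unchanged

existing = {"id": "bigint", "name": "string"}
incoming = {"id": "bigint", "email": "string"}  # name dropped, email added

print(merge_schema(existing, incoming, "add_only"))
# {'id': 'bigint', 'name': 'string', 'email': 'string'}
```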

Method to Process Deleted OSS Objects

The method that is used to process metadata whose source data has been deleted from the OSS bucket during metadata extraction. The following methods are available:

  • Deletes the corresponding metadata.

  • Ignores the updates and does not delete metadata from the destination metadata table.

RAM Role

The role that is used to execute the metadata extraction task. The default value is AliyunDLFWorkFlowDefaultRole, which is granted the permission to execute DLF extraction tasks.

Execution Policy

  • Manual execution: manually runs metadata extraction tasks.

  • Scheduling execution: periodically runs metadata extraction tasks at the specified time.

Extraction Policy

Partial Data Extraction: DLF scans only part of the metadata in each file. This method is fast, but its results are less accurate than those of Extract All. You can adjust the metadata on the metadata editing page.

Extract All: When DLF extracts metadata, it scans all metadata in each file. If the amount of data is large, this extraction method is time-consuming. The results of Extract All are more accurate.
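The trade-off between the two policies resembles sampling-based versus full-scan schema inference. In the toy sketch below (made-up JSON records, not DLF's algorithm), sampling only the first records misses a field that appears late in the file:

```python
import json

def infer_schema(records):
    """Map each field name to the type of its first observed value."""
    schema = {}
    for rec in records:
        for key, value in rec.items():
            schema.setdefault(key, type(value).__name__)
    return schema

lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": "b"}',
    '{"id": 3, "name": "c", "email": "c@example.com"}',
]
records = [json.loads(line) for line in lines]

partial = infer_schema(records[:2])  # fast, may miss late-appearing fields
full = infer_schema(records)         # slower, sees every field
print(partial)  # {'id': 'int', 'name': 'str'}
print(full)     # {'id': 'int', 'name': 'str', 'email': 'str'}
```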

  5. Confirm that the parameters are correctly set. Then, click Save and Execute.