Metadata discovery scans data stored in your Object Storage Service (OSS) data lake and automatically generates metadata. Run it on demand or on a schedule to achieve schema-on-read for data lake analysis and computing.
## How it works
When you create an extraction task, DLF scans the files at the specified OSS path, infers the schema from the file content, and registers the resulting tables and partitions in the metadata catalog. DLF derives table names from names specified in the path and partition names from key-value segments such as dt=1.
For example, specifying oss://my-bucket/my-path/my-table/dt=1/data.csv creates a table named my-table with a partition dt=1, and infers the schema from data.csv.
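The path-to-metadata mapping described above can be sketched in a few lines. This is an illustrative sketch only, not the DLF implementation: the function name `infer_table_and_partitions` is hypothetical, and it simply shows how the last non-partition directory becomes the table name while `key=value` segments become partitions.

```python
# Illustrative sketch only: infer_table_and_partitions is a hypothetical
# helper, not part of the DLF API.
from urllib.parse import urlparse

def infer_table_and_partitions(oss_path: str):
    """Split an oss:// path into (table name, {partition key: value})."""
    parts = urlparse(oss_path).path.strip("/").split("/")
    segments = parts[:-1]  # the last segment is the data file itself
    partitions = {}
    table = None
    for seg in segments:
        if "=" in seg:
            key, _, value = seg.partition("=")
            partitions[key] = value    # key=value segment -> partition
        else:
            table = seg                # last plain directory -> table name
    return table, partitions

print(infer_table_and_partitions("oss://my-bucket/my-path/my-table/dt=1/data.csv"))
# → ('my-table', {'dt': '1'})
```

Nested partition directories such as `year=2024/month=01` would map to a multi-key partition spec in the same way.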
## Limitations
| Constraint | Details |
|---|---|
| Storage type | OSS buckets must use standard storage |
| Supported formats | JSON, CSV, Parquet, ORC, Hudi, Delta, and Avro |
| Billing | Metadata extraction consumes computing resources but is free of charge |
## Prerequisites
Before you begin, ensure that you have:
- An OSS bucket that uses standard storage
- The RAM role AliyunDLFWorkFlowDefaultRole, assigned with permission to run DLF extraction tasks
## Create an extraction task
1. Log on to the Data Lake Formation console.
2. In the left-side navigation pane, choose Metadata > Metadata Discovery.
3. On the Metadata Discovery page, click Create Extraction Task.
4. Configure the extraction task parameters.
Source and destination

| Parameter | Description |
|---|---|
| Extraction Task Name | A name for the metadata extraction task. |
| Select OSS Path | The OSS path to scan, in the format oss://<bucket>/<directory path>/<table (optional)>/<partition (optional)>/<file>. DLF creates tables and partitions based on the names specified in the path. |
| Exclusion Mode | File paths to exclude from the scan. Use regular expressions to match the paths to exclude. |
| Destination Database | The database where the extracted metadata is stored. |
| Destination Table Prefix | A prefix for destination table names. The full table name combines this prefix with the source file name. |

Note: Remove .DS_Store files from the OSS directory to prevent parsing errors.

Schema handling

| Parameter | Description |
|---|---|
| Parse Format | The file format to parse. Choose a specific format (JSON, CSV, Parquet, ORC, Hudi, Delta, or Avro) or use automatic detection to let DLF identify the format. |
| Method of Handle Table Field Update | How DLF handles schema changes when the source file has different fields from the destination table: Add Columns and Retain Existing Columns, Update Table Schema and Generate Table Results Based on the Last Detected Table Schema, or Ignore Updates and Not Modify Table. |
| Method to Process Deleted OSS Objects | How DLF handles metadata when source files are deleted from OSS: Delete Metadata or Ignore Updates and Not Delete Tables. |

Note: ORC files do not support the detection of new columns.
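Exclusion Mode patterns are regular expressions matched against file paths. As a hedged illustration only (the exact matching semantics are DLF's; the patterns and paths below are hypothetical examples), regex-based exclusion works like this:

```python
import re

# Hypothetical exclusion patterns: skip temporary directories and .DS_Store files.
exclusions = [r".*/_temporary/.*", r".*\.DS_Store$"]

paths = [
    "oss://my-bucket/my-path/my-table/dt=1/data.csv",
    "oss://my-bucket/my-path/my-table/dt=1/.DS_Store",
    "oss://my-bucket/my-path/_temporary/part-0000.csv",
]

# Keep only paths that match none of the exclusion patterns.
kept = [p for p in paths if not any(re.fullmatch(x, p) for x in exclusions)]
print(kept)
# → ['oss://my-bucket/my-path/my-table/dt=1/data.csv']
```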
Execution settings

| Parameter | Description |
|---|---|
| RAM Role | The RAM role used to run the extraction task. Defaults to AliyunDLFWorkFlowDefaultRole. |
| Execution Policy | Manual: run the extraction task on demand. Scheduling: run the extraction task automatically at the specified time. |
| Extraction Policy | Partial Data Extraction: scans only part of the metadata in each file. Faster, but less accurate; adjust the schema on the metadata editing page if needed. Extract All: scans all metadata in each file. More accurate, but slower for large datasets. |

5. Confirm the parameters for task execution, and click Save and Execute.
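The table field update options in the schema handling settings can be sketched as follows. This is not the DLF implementation: the helper functions are hypothetical and only model the difference between appending newly detected columns and ignoring updates.

```python
# Illustrative sketch only: these helpers are hypothetical and do not call
# any DLF API; they model two of the field-update policies.

def add_columns_retain_existing(table: dict, detected: dict) -> dict:
    """Append newly detected columns; never change or drop existing ones."""
    merged = dict(table)
    for name, dtype in detected.items():
        merged.setdefault(name, dtype)  # existing columns keep their types
    return merged

def ignore_updates(table: dict, detected: dict) -> dict:
    """Leave the destination table schema untouched."""
    return dict(table)

existing = {"id": "bigint", "name": "string"}
scanned = {"id": "string", "name": "string", "dt": "string"}

print(add_columns_retain_existing(existing, scanned))
# → {'id': 'bigint', 'name': 'string', 'dt': 'string'}
print(ignore_updates(existing, scanned))
# → {'id': 'bigint', 'name': 'string'}
```

Note how the conflicting type detected for `id` is discarded under the first policy: existing columns are retained as-is, and only the new `dt` column is added.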