Metadata discovery scans data stored in your Object Storage Service (OSS) data lake and automatically generates metadata. Run it on demand or on a schedule to achieve schema-on-read for data lake analysis and computing.
## How it works
When you create an extraction task, DLF scans the files at the specified OSS path, infers the schema from the file content, and registers the resulting tables and partitions in the metadata catalog. DLF derives table names from names specified in the path and partition names from key-value segments such as dt=1.
For example, specifying oss://my-bucket/my-path/my-table/dt=1/data.csv creates a table named my-table with a partition dt=1, and infers the schema from data.csv.
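The path-to-metadata mapping described above can be sketched in a few lines. This is an illustrative sketch only, not the DLF implementation: the function name `infer_table_and_partitions` is hypothetical, and it simply shows how the last non-partition directory becomes the table name while `key=value` segments become partitions.

```python
# Illustrative sketch only: infer_table_and_partitions is a hypothetical
# helper, not part of the DLF API.
from urllib.parse import urlparse

def infer_table_and_partitions(oss_path: str):
    """Split an oss:// path into (table name, {partition key: value})."""
    parts = urlparse(oss_path).path.strip("/").split("/")
    segments = parts[:-1]  # the last segment is the data file itself
    partitions = {}
    table = None
    for seg in segments:
        if "=" in seg:
            key, _, value = seg.partition("=")
            partitions[key] = value    # key=value segment -> partition
        else:
            table = seg                # last plain directory -> table name
    return table, partitions

print(infer_table_and_partitions("oss://my-bucket/my-path/my-table/dt=1/data.csv"))
# → ('my-table', {'dt': '1'})
```

Nested partition directories such as `year=2024/month=01` would map to a multi-key partition spec in the same way.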
## Limitations
| Constraint | Details |
|---|---|
| Storage type | OSS buckets must use standard storage |
| Supported formats | JSON, CSV, Parquet, ORC, Hudi, Delta, and Avro |
| Billing | Metadata extraction consumes computing resources but is free of charge |
## Prerequisites
Before you begin, ensure that you have:
- An OSS bucket that uses standard storage
- The RAM role AliyunDLFWorkFlowDefaultRole, assigned with permission to run DLF extraction tasks
## Create an extraction task
1. Log on to the Data Lake Formation console.
2. In the left-side navigation pane, choose Metadata > Metadata Discovery.
3. On the Metadata Discovery page, click Create Extraction Task.
4. Configure the extraction task parameters.
Source and destination

| Parameter | Description |
|---|---|
| Extraction Task Name | A name for the metadata extraction task. |
| Select OSS Path | The OSS path to scan, in the format oss://<bucket>/<directory path>/<table (optional)>/<partition (optional)>/<file>. DLF creates tables and partitions based on the names specified in the path. |
| Exclusion Mode | File paths to exclude from the scan. Use regular expressions to match the paths to exclude. |
| Destination Database | The database where the extracted metadata is stored. |
| Destination Table Prefix | A prefix for destination table names. The full table name combines this prefix with the source file name. |

Note: Remove .DS_Store files from the OSS directory to prevent parsing errors.

Schema handling

| Parameter | Description |
|---|---|
| Parse Format | The file format to parse. Choose a specific format (JSON, CSV, Parquet, ORC, Hudi, Delta, or Avro) or use automatic detection to let DLF identify the format. |
| Method of Handle Table Field Update | How DLF handles schema changes when the source file has different fields from the destination table: Add Columns and Retain Existing Columns, Update Table Schema and Generate Table Results Based on the Last Detected Table Schema, or Ignore Updates and Not Modify Table. |
| Method to Process Deleted OSS Objects | How DLF handles metadata when source files are deleted from OSS: Delete Metadata or Ignore Updates and Not Delete Tables. |

Note: ORC files do not support the detection of new columns.
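Exclusion Mode patterns are regular expressions matched against file paths. As a hedged illustration only (the exact matching semantics are DLF's; the patterns and paths below are hypothetical examples), regex-based exclusion works like this:

```python
import re

# Hypothetical exclusion patterns: skip temporary directories and .DS_Store files.
exclusions = [r".*/_temporary/.*", r".*\.DS_Store$"]

paths = [
    "oss://my-bucket/my-path/my-table/dt=1/data.csv",
    "oss://my-bucket/my-path/my-table/dt=1/.DS_Store",
    "oss://my-bucket/my-path/_temporary/part-0000.csv",
]

# Keep only paths that match none of the exclusion patterns.
kept = [p for p in paths if not any(re.fullmatch(x, p) for x in exclusions)]
print(kept)
# → ['oss://my-bucket/my-path/my-table/dt=1/data.csv']
```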
Execution settings

| Parameter | Description |
|---|---|
| RAM Role | The RAM role used to run the extraction task. Defaults to AliyunDLFWorkFlowDefaultRole. |
| Execution Policy | Manual: run the extraction task on demand. Scheduling: run the extraction task automatically at the specified time. |
| Extraction Policy | Partial Data Extraction: scans only part of the metadata in each file. Faster, but less accurate; adjust the schema on the metadata editing page if needed. Extract All: scans all metadata in each file. More accurate, but slower for large datasets. |

5. Confirm the parameters for task execution, and click Save and Execute.
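The table field update options in the schema handling settings can be sketched as follows. This is not the DLF implementation: the helper functions are hypothetical and only model the difference between appending newly detected columns and ignoring updates.

```python
# Illustrative sketch only: these helpers are hypothetical and do not call
# any DLF API; they model two of the field-update policies.

def add_columns_retain_existing(table: dict, detected: dict) -> dict:
    """Append newly detected columns; never change or drop existing ones."""
    merged = dict(table)
    for name, dtype in detected.items():
        merged.setdefault(name, dtype)  # existing columns keep their types
    return merged

def ignore_updates(table: dict, detected: dict) -> dict:
    """Leave the destination table schema untouched."""
    return dict(table)

existing = {"id": "bigint", "name": "string"}
scanned = {"id": "string", "name": "string", "dt": "string"}

print(add_columns_retain_existing(existing, scanned))
# → {'id': 'bigint', 'name': 'string', 'dt': 'string'}
print(ignore_updates(existing, scanned))
# → {'id': 'bigint', 'name': 'string'}
```

Note how the conflicting type detected for `id` is discarded under the first policy: existing columns are retained as-is, and only the new `dt` column is added.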