DataWorks: Metadata collection

Last Updated: Mar 03, 2026

Metadata collection in DataWorks Data Map extracts technical metadata -- databases, tables, columns, partitions, and data lineage -- from data sources across workspaces in the same region and aggregates it into a unified catalog. Collected metadata becomes searchable and browsable in Data Map.

Two types of crawlers are available:

  • Built-in crawlers run automatically with zero configuration. They collect metadata from MaxCompute and Data Lake Formation (DLF) in near real-time.

  • Custom crawlers connect to additional data sources such as Hologres, StarRocks, MySQL, Oracle, and CDH Hive. Create and schedule them to fit your collection needs.

How it works

Built-in crawlers automatically synchronize metadata from integrated data sources (MaxCompute and DLF) using an internal mechanism that requires no user configuration. Custom crawlers connect to an external data source through a resource group, read the catalog structure (databases, schemas, tables, columns), and write the extracted metadata to Data Map. Depending on the data source, a crawler also collects partition information and data lineage.
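
The custom crawler flow described above can be sketched roughly as follows. This is a conceptual illustration only: SQLite stands in for the real data source, and the record layout is invented for the example rather than being the actual Data Map schema.

```python
# Conceptual sketch of a custom crawler: connect to a source, walk its
# catalog (tables, then columns), and emit one metadata record per table.
import sqlite3

def crawl(conn: sqlite3.Connection) -> list[dict]:
    records = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [
            {"name": col[1], "type": col[2]}
            for col in conn.execute(f"PRAGMA table_info({table})")
        ]
        records.append({"table": table, "columns": columns})
    return records

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, dt TEXT)")
print(crawl(conn))
```

A real crawler additionally authenticates through the configured resource group and, for supported sources, collects partition and lineage information.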

Collected metadata enables the following workflows in Data Map:

  • Data discovery -- search for tables across data sources by name, field, or owner.

  • Lineage analysis -- trace data origins and downstream dependencies for impact analysis and troubleshooting.

  • Data governance -- classify data assets, manage access control, monitor data quality, and enforce lifecycle policies.

Open the metadata collection page

  1. Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Governance > Data Map. On the page that appears, click Go to Data Map.

  2. In the left navigation pane of Data Map, click the Metadata Collection icon to open the metadata collection page.

Built-in crawlers

Built-in crawlers are preconfigured by DataWorks and run automatically in near real-time. They collect core metadata from integrated data sources with no manual setup -- you only need to manage the collection scope.

Important

If a table is missing from Data Map, go to My Data > My Tools > Refresh Table Metadata to manually sync it.

MaxCompute default crawler

The MaxCompute Default Crawler collects metadata from MaxCompute projects in your account.

View crawler details

  1. In the Built-in section of the metadata collection page, find the MaxCompute Default Crawler card and click Details.

  2. The details page has two tabs:

    • Basic Information -- displays crawler attributes such as collection type and mode. This tab is read-only.

    • Data Scope -- lists the MaxCompute projects included in the collection.

Modify the collection scope

  1. Click the Data Scope tab, then click Modify Data Scope.

  2. In the dialog box, select or clear the MaxCompute projects to include.

  3. Click Confirm.

Important

By default, the scope includes all MaxCompute projects bound to workspaces in the current region under the current tenant. If you remove a project from the scope, its metadata is no longer visible in Data Map.

Configure metadata visibility

  1. In the Data Scope list, find the target project and click Permission Configurations in the Actions column.

  2. Select a visibility policy:

    • Public Within Tenant -- all tenant members can search for and view metadata from this project.

    • Only members in the associated workspace can search and view -- restricts metadata access to members of specific workspaces, providing data isolation.

DLF default crawler

Important

To enable real-time collection of DLF metadata, grant the Data Reader permission to the service-linked role AliyunServiceRoleForDataworksOnEmr in the DLF console.

The DLF Default Crawler collects metadata from Data Lake Formation (DLF) within your account. By default, all accessible catalogs (including DLF and DLF-Legacy versions) are collected.

  1. In the Built-in section of the metadata collection page, find the DLF Default Crawler card and click Details to view basic information.

  2. Click the Data Scope tab to view the list of DLF catalogs and their table counts.

Custom crawlers

Custom crawlers extend metadata collection to data sources not covered by built-in crawlers. Two categories are supported:

  • Conventional data sources -- Hologres, StarRocks, MySQL, Oracle, CDH Hive, and others. The system parses database table structures to automatically extract field attributes, indexes, and partitions.

  • Metadata-type data sources (Catalog) -- catalog metadata for data lake formats not managed by DLF, such as Paimon Catalog.

Create a custom crawler

  1. In the custom crawler list section of the metadata collection page, click Create Metadata Collection.

  2. Select the collection type. Choose the target data source type (for example, Hologres or StarRocks).

  3. Configure basic information and resource group.

    Important

    - If the data source has whitelist restrictions, see Overview of network connectivity solutions and Configure a whitelist.
    - If the data source does not have whitelist restrictions, see Network connectivity and operations on resource groups.
    - If the connectivity test fails with the error backend service call failed: test connectivity failed.not support data type, contact technical support to upgrade the resource group.

    Basic Configurations:

    Field | Description
    Select Workspace | The workspace that contains the data source.
    Select Data Source | A data source created in the selected workspace. The system displays data source details after selection.
    Name | A name for the crawler. Defaults to the data source name.

    Resource Group Configuration:

    Field | Description
    Resource Group | The resource group that runs the collection task.
    Test Network Connectivity | Verifies that the resource group can access the data source. Run this test before proceeding.
  4. Define the collection scope. Select the databases (Database/Schema) to collect. For data sources defined at database granularity, the bound database is selected by default; you can also select other databases beyond the one configured in the data source.

    Important

    - A database can be configured in only one crawler. If a database is grayed out, it is already being collected by another crawler.
    - Narrowing the collection scope makes metadata outside the scope unsearchable in Data Map.

  5. Configure intelligent enhancement settings and the collection plan.

    Important

    Periodic collection is supported only for production environment data sources.

    Tip: To achieve a 5-minute collection interval, select hourly collection and check all of the minute options.

    Intelligent Enhancement Settings (Beta):

    Setting | Description
    AI-Enhanced Description | Uses large language models to automatically generate business descriptions for tables and fields after metadata collection. View AI-generated descriptions (such as table remarks and field descriptions) on the table details page in Data Map.

    Collection Plan:

    Setting | Description
    Trigger Mode | Manual: the crawler runs only when manually triggered; use this for one-time or on-demand collection. Cycle: the crawler runs on a schedule (monthly, weekly, daily, or hourly), and the system updates metadata automatically at the configured interval.
  6. Click Save or Save and Run.
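
As a sanity check on the 5-minute tip in step 5: if the hourly schedule exposes one checkbox per 5-minute mark (a hypothetical 00, 05, ..., 55 option list; the actual console checkboxes may differ), selecting all of them yields 12 runs per hour, one every 5 minutes.

```python
# Hypothetical minute checkboxes for an hourly schedule: 00, 05, ..., 55.
minute_options = list(range(0, 60, 5))

# Selecting every option triggers 12 runs per hour -> 288 runs per day.
runs_per_day = 24 * len(minute_options)

# The gap between consecutive selected minutes is a constant 5 minutes.
gaps = {b - a for a, b in zip(minute_options, minute_options[1:])}
print(len(minute_options), runs_per_day, gaps)  # 12 288 {5}
```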

Manage custom crawlers

After creation, crawlers appear in the custom list. The following operations are available.

List operations:

  • Run, Stop, or Delete a crawler directly from the list. The available actions depend on the crawler status (for example, a running crawler shows Stop instead of Run).

  • Use the Filter and Search features to locate a specific crawler.

Important

Deleting a crawler removes its collected metadata objects from Data Map. This action cannot be undone.

View details and logs:

Click the crawler name to open its details page:

Tab | Description
Basic Information | All configuration items of the crawler.
Data Scope | The current collection scope. Click Modify Data Scope to update it.
Run Logs | Execution history for each collection task, including start time, duration, status, and data volume. Click View Logs to troubleshoot failed tasks.

Note: Before the first collection run, the table count and latest update time in the Data Scope tab are empty.

Note: The following data sources do not support scope modification: EMR Hive, CDH Hive, Lindorm, ElasticSearch, Tablestore (OTS), MongoDB, and AnalyticDB for Spark within AnalyticDB for MySQL.

Trigger a manual collection:

Click Collect Metadata in the upper-right corner to immediately run a collection task. Use this to quickly reflect a newly created table in Data Map.

Collection scope and timeliness

Data tables

The following table lists the collection granularity and update timeliness for each supported data source.

Data source type | Collection mode | Table/field timeliness | Partition timeliness | Lineage timeliness
MaxCompute | System default auto-collection (Instance) | Standard project: Real-time; External project: T+1 | Chinese mainland regions: Real-time; Overseas regions: T+1 | Real-time
DLF | System default auto-collection (Instance) | Real-time | Real-time | Supported for Serverless Spark, Serverless StarRocks, and Serverless Flink engines only
Hologres | Manually create crawler (Database) | Depends on schedule | Not supported | Real-time
EMR Hive | Manually create crawler (Instance) | Depends on schedule | Depends on schedule | Real-time
CDH Hive | Manually create crawler (Instance) | Depends on schedule | Real-time | Real-time
StarRocks | Manually create crawler (Database) | Instance mode: Real-time; Connection string mode: Depends on schedule | Not supported | Real-time (Instance mode only)
AnalyticDB for MySQL | Manually create crawler (Database) | Depends on schedule | Not supported | Real-time (requires a support ticket)
AnalyticDB for Spark | Manually create crawler (Instance) | Real-time | Not supported | Real-time
AnalyticDB for PostgreSQL | Manually create crawler (Database) | Depends on schedule | Not supported | Real-time
Lindorm | Manually create crawler (Instance) | Depends on schedule | Not supported | Real-time
Tablestore (OTS) | Manually create crawler (Instance) | Depends on schedule | Not supported | Not supported
MongoDB | Manually create crawler (Instance) | Depends on schedule | Not supported | Not supported
ElasticSearch | Manually create crawler (Instance) | Depends on schedule | Not supported | T+1
Paimon Catalog | Manually create crawler (Catalog) | Depends on schedule | Depends on schedule | Not supported
Other sources (MySQL, PostgreSQL, SQL Server, Oracle, ClickHouse, SelectDB, etc.) | Manually create crawler (Database) | Depends on schedule | Not supported | Not supported
Important

For DLF metadata lineage and EMR Hive lineage, enable EMR_HOOK on the EMR cluster.

AnalyticDB for Spark and AnalyticDB for MySQL share the same metadata collection entry point.

Task code

Data Map supports code search and quick location for the following sources. All are auto-collected.

Code source | Collection scope
Data Studio | Nodes and code
Data Studio (Legacy) | Nodes and code
Data Analysis | SQL queries and code
DataService Studio | API data push services

API assets

Data Map supports viewing DataService Studio API metadata. All are auto-collected.

API type | Collection scope
Generated API (Codeless UI) | APIs created via the codeless UI
Generated API (Code editor) | APIs created via the code editor
Registered API | Registered APIs
Service Orchestration | Service orchestration workflows

AI assets

Data Map supports viewing and managing AI assets, including lineage tracking for data and model origins. All are auto-collected.

Asset type | Collection scope
Dataset | PAI: Create or register dataset; DataWorks: Create dataset
AI Model | PAI: Model training, register model, deploy model service
Algorithm Task | PAI: Training task, workflow task, distributed training task
Model Service | PAI: Deploy model service (EAS deployment)

Workspace

Workspace metadata is auto-collected when a workspace is created in DataWorks.

Billing

Each collection task consumes compute resources at a rate of 0.25 CUs for the duration of the task run (0.25 CU × runtime). See Resource group fees.

Each successful collection generates a scheduling instance, billed separately. See Scheduling instance fees.
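
A back-of-envelope sketch of the 0.25 CU rule above. The runtime values are made-up inputs for illustration; see the linked fee pages for actual prices and the exact billing formula.

```python
# Sketch: CU-hours consumed by one collection task under the stated
# "0.25 CU x runtime" rule. Inputs are illustrative, not real pricing.
def cu_hours(runtime_minutes: float, cu_rate: float = 0.25) -> float:
    """CU-hours consumed by a task that runs for runtime_minutes."""
    return cu_rate * (runtime_minutes / 60)

print(cu_hours(30))  # a 30-minute task -> 0.125 CU-hours
print(cu_hours(60))  # a 60-minute task -> 0.25 CU-hours
```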

Limitations

  • If the data source uses whitelist access control, add the resource group IP addresses to the database whitelist. See Metadata collection whitelist.

  • Cross-region collection is not recommended. Keep DataWorks and the data source in the same region. To collect metadata across regions, use a public IP address when creating the data source. See Data source management.

  • The MySQL metadata crawler does not support OceanBase data sources.

  • Metadata collection is not supported for AnalyticDB for MySQL data sources with SSL enabled.

Next steps

After metadata is collected, use Data Map to:

  • Search for tables and view their details, field information, partitions, and data previews. See Metadata details.

  • Analyze upstream and downstream lineage to understand the full data processing pipeline. See View lineages.

  • Organize assets into data albums for business-oriented data management. See Data albums.

FAQ

Collection times out or fails for MySQL and other database sources

This is typically a whitelist issue. Add the vSwitch CIDR block of the resource group to the database whitelist.

If the whitelist is already configured, verify that the resource group can reach the data source by running Test Network Connectivity on the crawler details page.
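
Before contacting support, you can also check raw TCP reachability of the database endpoint from any host on the same network path. A minimal sketch using only the standard library; the host and port are placeholders for your own endpoint:

```python
import socket

def reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a throwaway local listener (stand-in for a database endpoint).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
listener.listen(1)
port = listener.getsockname()[1]
print(reachable("127.0.0.1", port))  # True while the listener is up
listener.close()
```

A timeout usually points to a firewall or whitelist silently dropping packets, whereas an immediate refusal suggests the host is reachable but nothing is listening on that port.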