Metadata Collection - DataWorks

Metadata collection in DataWorks Data Map allows you to centrally manage metadata from various data sources. Collected metadata is visible in Data Map. This topic describes how to create a crawler to collect metadata.

Overview

Metadata collection is essential for building an enterprise-level data map and managing data assets. It uses crawlers to automatically extract technical metadata (databases, tables, columns), data lineage, and partition information from DataWorks data sources (such as MaxCompute, Hologres, MySQL, and CDH Hive) across workspaces in the same region. This metadata is aggregated into DataWorks Data Map to provide a unified data view.

Metadata collection allows you to:

Build a unified data view: Break down data silos and centrally manage multi-source heterogeneous metadata.
Enable data discovery and search: Allow data consumers to quickly and accurately find the data they need.
Analyze end-to-end lineage: Trace where data comes from and where it goes to support impact analysis and troubleshooting.
Empower data governance: Perform data classification, grading, access control, quality monitoring, and lifecycle management based on complete metadata.

Billing

By default, each collection task consumes 0.25 CUs × task runtime. For more information, see Resource group fees. Each successful collection generates a scheduling instance. For more information, see Scheduling instance fees.
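As a rough illustration of the billing formula above, the following sketch computes the CUs consumed by one collection task. It assumes the runtime is measured in hours, which this topic does not state; Resource group fees remains the authoritative source. The function name is hypothetical.

```python
DEFAULT_CU_PER_TASK = 0.25  # default CUs per collection task, from the Billing section


def estimate_cu_usage(runtime_hours: float, cu: float = DEFAULT_CU_PER_TASK) -> float:
    """Return CUs x runtime for one collection task (assumption: runtime in hours)."""
    if runtime_hours < 0:
        raise ValueError("runtime_hours must be non-negative")
    return cu * runtime_hours


# A task that runs for 30 minutes (0.5 hours) at the default 0.25 CUs.
print(estimate_cu_usage(0.5))  # 0.125
```

Scheduling instance fees are charged separately and are not covered by this sketch.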

Limitations

If the data source uses whitelist access control, you must configure the database whitelist. For more information, see Metadata Collection Whitelist.
Cross-region metadata collection is not recommended. Ensure DataWorks and the data source are in the same region. To collect metadata across regions, use a public IP address when creating the data source. For more information, see Data Source Management.
The MySQL metadata crawler does not support OceanBase data sources.
Metadata collection is not supported for AnalyticDB for MySQL data sources with SSL enabled.

Entry point

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Governance > Data Map. On the page that appears, click Go to Data Map.
In the left navigation pane, click the metadata collection icon to go to the metadata collection page.

Built-in crawlers

Built-in crawlers are preconfigured and run automatically by DataWorks in near real-time. They collect core metadata integrated with DataWorks. You do not need to create them; you only need to manage their scope.

Important

If you cannot find the target table in Data Map, go to My Data > My Tools > Refresh Table Metadata to manually sync the table.

MaxCompute default crawler

This crawler collects metadata from MaxCompute projects in your account. On the details page, use Modify Data Scope to select projects and Permission Configurations to set metadata visibility within the tenant.

In the Built-in section of the metadata collection page, find the MaxCompute Default Crawler card and click Details.
The MaxCompute Default Crawler details page contains the Basic Information and Data Scope tabs.
- Basic Information: Displays basic attributes of the crawler, such as collection type and mode. This information is read-only.
- Data Scope: Manages which MaxCompute projects to collect.
Modify collection scope:
1. Click Data Scope and click Modify Data Scope.
2. In the dialog box, select or clear the MaxCompute projects to collect.
  Important
  The default scope includes all MaxCompute projects bound to workspaces in the current region under the current tenant. After you modify the scope, only metadata objects within the scope are visible in Data Map. Metadata from projects that are not selected is no longer visible.
3. Click OK to save the changes.
Configure metadata visibility:
- In the Data Scope list, find the target project and click Permission Configurations in the Actions column.
- Select a visibility policy based on your data governance requirements:
  - Public Within Tenant: All members in the tenant can search for and view metadata of this project.
  - Only members in the associated workspace can search and view: Only members of specific workspaces can access metadata of this project, ensuring data isolation.
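
The scope and visibility rules above can be summarized in a small model. This is a minimal sketch, not DataWorks code: the class, field, and policy names are hypothetical, and it assumes the user is already a member of the tenant.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers standing in for the two visibility options in the console.
PUBLIC_WITHIN_TENANT = "public_within_tenant"
ASSOCIATED_WORKSPACE_ONLY = "associated_workspace_only"


@dataclass
class ProjectScopeEntry:
    """Hypothetical model of one MaxCompute project in the Data Scope list."""
    project: str
    in_scope: bool  # whether the project is selected in Modify Data Scope
    policy: str     # one of the two policy identifiers above
    associated_workspaces: set[str] = field(default_factory=set)


def can_view_metadata(entry: ProjectScopeEntry, user_workspaces: set[str]) -> bool:
    """Return True if a tenant member may search for and view the project's metadata."""
    if not entry.in_scope:
        # Projects outside the collection scope are not visible in Data Map.
        return False
    if entry.policy == PUBLIC_WITHIN_TENANT:
        return True
    # Otherwise the user must belong to a workspace associated with the project.
    return bool(entry.associated_workspaces & user_workspaces)


entry = ProjectScopeEntry("sales_prj", True, ASSOCIATED_WORKSPACE_ONLY, {"ws_sales"})
print(can_view_metadata(entry, {"ws_sales"}))    # True
print(can_view_metadata(entry, {"ws_finance"}))  # False
```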

DLF Default Crawler

Important

To enable real-time collection of DLF metadata, you must grant the Data Reader permission to the service-linked role AliyunServiceRoleForDataworksOnEmr in the DLF console.

The DLF Default Crawler collects metadata from Data Lake Formation (DLF) within your account.

In the Built-in section of the metadata collection page, find the DLF Default Crawler card and click Details to view basic information.
Click the Data Scope tab to view the list of DLF Catalogs included in the collection scope and their table counts.
By default, all accessible Catalogs (including DLF and DLF-Legacy versions) are collected.

Custom crawlers

Custom crawlers provide unified metadata management across environments and engines.

For conventional data sources
Custom crawlers are supported for traditional structured or semi-structured data sources such as Hologres, StarRocks, MySQL, Oracle, and CDH Hive. The system parses the physical database and table structure to automatically extract and synchronize metadata such as field attributes, indexes, and partitions.
For metadata-type data sources (Catalog)
Custom crawlers can collect directly from metadata-type data sources, that is, native lake-format metadata that is not managed by DLF, such as Paimon Catalog.

Create custom crawler

In the custom crawler list section of the metadata collection page, click Create Metadata Collection.
Select collection type: On the type selection page, select the target data source type to collect, such as Hologres or StarRocks.
Configure basic information and resource group:
- Basic Configurations:
  - Select Workspace: Select the workspace containing the data source.
  - Select Data Source: Select a created target data source from the drop-down list. After selection, the system automatically displays details of the data source.
  - Name: Enter a name for the crawler for future identification. The default name is the same as the data source name.
- Resource Group Configuration:
  - Resource Group: Select a resource group to run the collection task.
  - Test Network Connectivity: This step is critical. Click Test Network Connectivity to ensure the resource group can successfully access the data source.
    Important
    Check whether the data source has whitelist restrictions. If whitelist access control is enabled on the data source, configure whitelist permissions as described in Overview of network connectivity solutions and Configure a whitelist.
    If the data source does not have whitelist restrictions, see Network connectivity and operations on resource groups for network connectivity configuration.
    If the connectivity test fails with the error "backend service call failed: test connectivity failed.not support data type", contact technical support to upgrade the resource group.
Configure metadata collection:
- Collection Scope: Define the databases (Database/Schema) to collect. If the data source is database-granular, the corresponding database is selected by default. You can select additional databases outside the data source.
  Important
  - A database can be configured in only one crawler. If a database cannot be selected, it is already being collected by another crawler.
  - If you narrow the collection scope, metadata outside the scope becomes unsearchable in Data Map.
Configure Intelligent Enhancement Settings and Collection Plan:
- Intelligent Enhancement Settings (Beta):
  - AI Collection Description: When enabled, the system uses LLMs to automatically generate business descriptions for your tables and fields after metadata collection, greatly improving metadata readability and usability. After collection is complete, you can view the AI-generated information (such as table remarks and field descriptions) on the details page of the table object in Data Map.
- Collection Plan:
  - Trigger Mode: Select Manual or Periodic.
    - Manual: The crawler runs only when manually triggered. This applies to one-time or on-demand collection.
    - Periodic: Configure a scheduled task (such as monthly, daily, weekly, or hourly). The system will automatically update metadata periodically.
      To configure a minute-level scheduled task, select hourly collection and select all minute options. The task then runs at 5-minute intervals.
      Important
      Periodic collection is supported only for production environment data sources.
Save configuration: Click Save or Save and Run to complete the creation of the crawler.
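
The configuration rules in the steps above can be sketched as a small validation routine. This is an illustrative model only, not a DataWorks API: the class, field, and function names are hypothetical, and the minute marks 0, 5, ..., 55 are an assumption about how the hourly minute options are laid out.

```python
from dataclasses import dataclass


@dataclass
class CrawlerConfig:
    """Hypothetical in-memory model of a custom crawler configuration."""
    name: str
    databases: set[str]
    trigger_mode: str = "manual"     # "manual" or "periodic"
    environment: str = "production"  # "production" or "development"


def validate(new: CrawlerConfig, existing: list[CrawlerConfig]) -> list[str]:
    """Return the rule violations for a new crawler, based on the notes above."""
    problems = []
    # Rule: a database can be configured in only one crawler.
    for other in existing:
        overlap = new.databases & other.databases
        if overlap:
            problems.append(
                f"{sorted(overlap)} already collected by crawler '{other.name}'"
            )
    # Rule: periodic collection is supported only for production data sources.
    if new.trigger_mode == "periodic" and new.environment != "production":
        problems.append("periodic collection requires a production data source")
    return problems


def five_minute_options() -> list[int]:
    """Minute marks for a 5-minute interval (assumed layout: 0, 5, ..., 55)."""
    return list(range(0, 60, 5))


existing = [CrawlerConfig("orders_crawler", {"orders_db"})]
candidate = CrawlerConfig(
    "sales_crawler", {"orders_db", "sales_db"}, "periodic", "development"
)
for problem in validate(candidate, existing):
    print(problem)
print(len(five_minute_options()), "runs per hour")
```

In this example both rules are violated: orders_db already belongs to another crawler, and the candidate requests periodic collection on a non-production data source.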

Manage custom crawlers

After a crawler is created, it appears in the custom list. You can perform the following management operations:

List operations: In the list, you can directly Run, Stop, or Delete the crawler. Use the Filter and Search features at the top to quickly locate the target crawler.
Important
Deleting a metadata crawler removes its collected metadata objects from Data Map. Users cannot search for or view these objects. Caution: This action cannot be undone.
View details and logs: Click the crawler name to view its details.
- Basic Information: View all configuration items of the crawler.
- Data Scope: View or Modify Data Scope.
  If viewed before collection, the table count and latest update time will be empty.
  The following data sources do not support scope modification: EMR Hive, CDH Hive, Lindorm, ElasticSearch, Tablestore (OTS), MongoDB, and AnalyticDB for Spark within AnalyticDB for MySQL.
- Run Logs: Track the execution history of each collection task. You can view the start time, duration, status, and volume of collected data. If a task fails, click View Logs to locate and resolve the issue.
Manually execute collection: In the upper-right corner, click Collect Metadata to immediately trigger a collection task. Use this to immediately view a newly created table in Data Map.

Next steps

After metadata is collected, you can use Data Map to:

Search for your collected tables in Data Map and view their details, field information, partitions, and data preview. For more information, see Metadata details.
Analyze the upstream and downstream lineage of tables to understand the end-to-end data processing flow. For more information, see View lineages.
Add assets to data albums to organize and manage your data from a business perspective. For more information, see Data albums.

FAQ

Q: Why does metadata collection time out or fail for database sources such as MySQL?
A: Ensure that the vSwitch CIDR block of the resource group is added to the whitelist of the database.
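
To double-check the answer above before rerunning a collection task, you can compare the vSwitch CIDR block with the whitelist entries offline. The following sketch uses Python's standard ipaddress module; the function name and the sample addresses are made up for illustration, and it does not query DataWorks or the database.

```python
import ipaddress


def cidr_is_whitelisted(vswitch_cidr: str, whitelist: list[str]) -> bool:
    """Return True if some whitelist entry fully covers the vSwitch CIDR block."""
    target = ipaddress.ip_network(vswitch_cidr, strict=False)
    for entry in whitelist:
        allowed = ipaddress.ip_network(entry, strict=False)
        # subnet_of() requires both networks to use the same IP version.
        if target.version == allowed.version and target.subnet_of(allowed):
            return True
    return False


# Sample values only; replace them with your own CIDR block and whitelist.
print(cidr_is_whitelisted("192.168.1.0/24", ["10.0.0.0/8", "192.168.0.0/16"]))  # True
print(cidr_is_whitelisted("172.16.5.0/24", ["10.0.0.0/8", "192.168.0.0/16"]))   # False
```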

Collection scope and timeliness

Data tables

Data Source Type	Collection Mode	Collection granularity	Table/field update timeliness	Partition update timeliness	Lineage update timeliness
MaxCompute	System default auto-collection	Instance	Standard project: Real-time; External project: T+1	Chinese mainland regions: Real-time; Overseas regions: T+1	Real-time
Data Lake Formation (DLF)	System default auto-collection	Instance	Real-time	Real-time	Lineage is supported for DLF metadata of Serverless Spark, Serverless StarRocks, and Serverless Flink engines; other engines are not supported. Important: For EMR clusters, you must enable EMR_HOOK.
Hologres	Manually create crawler	Database	Depends on schedule	Not supported	Real-time
EMR Hive	Manually create crawler	Instance	Depends on schedule	Depends on schedule	Real-time. Important: You must enable EMR_HOOK for the cluster.
CDH Hive	Manually create crawler	Instance	Depends on schedule	Real-time	Real-time
StarRocks	Manually create crawler	Database	Instance Mode: Real-time; Connection String Mode: Depends on schedule	Not supported	Real-time. Important: Lineage collection is supported only in Instance Mode. Lineage cannot be collected in Connection String Mode.
AnalyticDB for MySQL	Manually create crawler	Database	Depends on schedule	Not supported	Real-time. Note: You need to submit a ticket to enable the data lineage feature for AnalyticDB for MySQL instances.
AnalyticDB for Spark	Manually create crawler	Instance	Real-time	Not supported	Real-time
AnalyticDB for PostgreSQL	Manually create crawler	Database	Depends on schedule	Not supported	Real-time
Lindorm	Manually create crawler	Instance	Depends on schedule	Not supported	Real-time
Tablestore (OTS)	Manually create crawler	Instance	Depends on schedule	Not supported	Not supported
MongoDB	Manually create crawler	Instance	Depends on schedule	Not supported	Not supported
ElasticSearch	Manually create crawler	Instance	Depends on schedule	Not supported	T+1
Paimon Catalog	Manually create crawler	Catalog	Depends on schedule	Depends on schedule	Not supported
Other data source types (MySQL, PostgreSQL, SQL Server, Oracle, ClickHouse, SelectDB, etc.)	Manually create crawler	Database	Depends on schedule	Not supported	Not supported

Note

AnalyticDB for Spark and AnalyticDB for MySQL use the same metadata collection entry point.

Task code

Data Map supports code search and quick location. The following table describes the supported scope.

Code source	Collection scope	Trigger method
Data Studio	Data Studio - Create node and edit code	Auto collection
Data Studio (Legacy)	Data Studio (Legacy) - Create node and edit code	Auto collection
Data Analysis	Data Analysis - Create SQL query and edit code	Auto collection
DataService Studio	DataService Studio - Create API data push service	Auto collection

API assets

Data Map supports viewing DataService Studio API metadata:

API Type	Collection scope	Trigger method
Generated API (Codeless UI)	DataService Studio - Create API via codeless UI	Auto collection
Generated API (Code editor)	DataService Studio - Create API via code editor	Auto collection
Registered API	DataService Studio - Register API	Auto collection
Service Orchestration	DataService Studio - Create Service Orchestration	Auto collection

AI assets

Data Map supports viewing and managing AI assets, and provides AI asset lineage to trace the origin, usage, and evolution of data and models. The following table describes the support for AI assets.

Type	Collection scope	Trigger method
Dataset	PAI - Create dataset/Register dataset; DataWorks - Create dataset	Auto collection
AI Model	PAI - Model training task/Register model/Deploy model service	Auto collection
Algorithm Task	PAI - Training task/Workflow task/Distributed training task	Auto collection
Model Service	PAI - Deploy model service (EAS deployment)	Auto collection

Workspace

Data Map supports viewing workspace metadata:

Project	Collection Mode	Trigger method
Workspace	DataWorks - Create workspace	Auto collection