DataWorks: Collect Metadata

Last Updated: Nov 20, 2025

DataWorks Data Map provides the Collect Metadata feature, which helps you centrally manage metadata from various DataWorks data sources. You can view all collected metadata in Data Map. This topic describes how to create a crawler to collect metadata from your data sources into DataWorks.

Overview

Collect Metadata is a core feature for building an enterprise-level data map and achieving unified data asset management. It runs crawlers to automatically extract technical metadata, such as databases, tables, and fields, along with data lineage and partition information. The crawlers extract this information from various DataWorks data sources, such as MaxCompute, Hologres, MySQL, and CDH Hive, that are distributed across different workspaces in the same region. The collected metadata is then consolidated in DataWorks Data Map to provide a unified data view.
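
Conceptually, the consolidated result is one record per table object across all sources. The following sketch illustrates the kind of information such a record carries; the field names are hypothetical and do not reflect the actual Data Map schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TableMetadata:
    """Illustrative shape of one collected metadata record (hypothetical fields)."""
    source_type: str                    # e.g. "MaxCompute", "Hologres", "MySQL"
    database: str                       # database, schema, or project name
    table: str
    columns: dict[str, str] = field(default_factory=dict)  # column name -> data type
    partitions: list[str] = field(default_factory=list)    # partition specs, if any
    upstream: list[str] = field(default_factory=list)      # lineage: producing tables
    description: Optional[str] = None   # business description

# One such record per table is consolidated into the region-wide data view.
record = TableMetadata(
    source_type="MaxCompute",
    database="my_project",
    table="dwd_orders",
    columns={"order_id": "BIGINT", "gmt_create": "DATETIME"},
    partitions=["ds=20251120"],
    upstream=["my_project.ods_orders"],
)
```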

Collect Metadata lets you:

  • Build a unified data view: Break down data silos by centrally managing heterogeneous metadata from multiple sources.

  • Support data discovery and search: Enable data consumers to quickly and accurately find the data they need.

  • Enable end-to-end lineage analysis: Clearly trace the origin and flow of data for impact analysis and troubleshooting.

  • Empower data governance: Perform data classification, permission control, quality monitoring, and lifecycle management based on complete metadata.

Billing

By default, each collection task consumes 0.25 CU multiplied by the task runtime, which incurs resource group fees. Each successful collection run also generates a scheduling instance, which incurs task scheduling fees.
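
For example, a collection task that runs for 30 minutes consumes 0.25 CU × 0.5 h = 0.125 CU-hours on the resource group. Below is a back-of-the-envelope estimate of the resource group fee, assuming a placeholder unit price (check the actual rate for your region and resource group type):

```python
# Cost model from this topic: each collection task consumes 0.25 CU x runtime.
CU_PER_TASK = 0.25
PRICE_PER_CU_HOUR = 0.10  # placeholder unit price, not an actual published rate

def collection_fee(runtime_minutes: float) -> float:
    """Estimate the resource group fee for a single collection run."""
    cu_hours = CU_PER_TASK * (runtime_minutes / 60)
    return cu_hours * PRICE_PER_CU_HOUR

# A 30-minute run consumes 0.125 CU-hours -> 0.0125 at the placeholder rate.
print(f"{collection_fee(30):.4f}")
```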

Limits

  • When you collect metadata from a data source that uses a whitelist for access control, you must configure the database whitelist in advance. For more information, see Metadata collection whitelist.

  • Cross-region metadata collection is not recommended; the DataWorks region should be the same as the data source region. If you must collect metadata across regions, use a public endpoint when you create the data source. For more information, see Data Source Management.

  • You cannot use a MySQL Database Collector to collect metadata from an OceanBase data source.

Go to the feature page

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Governance > Data Map. On the page that appears, click Go to Data Map.

  2. In the left-side navigation pane, click the Collect Metadata icon to go to the Collect Metadata page.

Built-in crawlers

Built-in crawlers are pre-configured by the DataWorks platform and run automatically in near real time. They primarily collect core metadata that is deeply integrated with DataWorks. You do not need to create these crawlers; you only need to manage their collection scope.

Important

If you cannot find the target table in Data Map, go to My Data > My Tools > Refresh Table Metadata to manually sync the table.

MaxCompute Default Crawler

This crawler collects metadata from the MaxCompute projects under your account. You can go to the details page to select the projects for collection using the Modify Data Scope option and set metadata visibility within the tenant using the Permission Configurations option.

  1. On the collect metadata page, in the Built-in section, find the MaxCompute Default Crawler card and click Details.

  2. The MaxCompute Default Crawler details page contains the Basic Information and Data Scope tabs.

    • Basic Information: Displays the basic properties of the crawler, such as the collection type and method. This information is read-only.

    • Data Scope: Manage the MaxCompute projects from which the crawler collects metadata.

  3. Modify collection scope:

    1. Switch to the Data Scope tab and click the Modify Data Scope button.

    2. In the dialog box that appears, select or clear the checkboxes for the MaxCompute projects to include in the collection.

      Important

      The default scope includes all MaxCompute projects in the current region that are attached to a workspace under the current tenant. After you modify the data scope, the metadata objects in Data Map are updated to match the new scope. This means that metadata for unselected projects will not be visible.

    3. Click Confirm to save the changes.

  4. Configure metadata visibility:

    • In the Data Scope list, find the target project and click Permission Configurations in the Actions column.

    • Select a visibility policy based on your data governance requirements:

      • Public within Tenant: All members within the tenant can search for and view the metadata of this project.

      • Only members in the associated workspace can search and view: Only members of the workspaces associated with this project can access its metadata. This ensures data isolation.

DLF Default Crawler

Important

To support real-time collection of DLF metadata, you must grant the Data Reader permission to the service-linked role AliyunServiceRoleForDataworksOnEmr in the DLF console.

The DLF Default Crawler collects metadata from Data Lake Formation (DLF) under your account.

  1. On the collect metadata page, in the Built-in section, find the DLF Default Crawler card and click Details to view its basic information.

  2. Switch to the Data Scope tab to view the list of DLF Catalogs that are within the collection scope and the number of tables they contain.

    By default, all accessible Catalogs are collected, including DLF and DLF-Legacy versions.

Custom crawlers

You need to create a custom crawler to collect metadata from data sources such as Hologres, StarRocks, MySQL, Oracle, and CDH Hive.

Create a custom crawler

  1. On the collect metadata page, in the custom crawler section, click Create Metadata Collection.

  2. Select Collection Type: On the type selection page, select the type of the target data source from which to collect metadata, such as Hologres or StarRocks.

  3. Configure basic information and the resource group:

    • Basic configuration:

      • Select Workspace: Select the workspace where the data source is located.

      • Select Data Source: Select an existing target data source from the drop-down list. After you select a data source, the system automatically displays its details.

      • Name: Enter a name for the crawler. By default, the crawler name is the same as the data source name.

    • Resource group configuration:

      • Resource Group: Select a resource group to run the collection task.

      • Test Network Connectivity: This step is crucial. Click Test Network Connectivity to ensure that the resource group can successfully access the data source. For a conceptual equivalent of this check, see the sketch after this procedure.

  4. Metadata Collection Configurations:

    • Collection scope: Define the databases (Database/Schema) from which to collect metadata. If the data source has database-level granularity, the database associated with the data source is selected by default. You can also select other databases.

      Important
      • A database can only be configured in one crawler. If a database cannot be selected, it is already being collected by another crawler.

      • If you narrow the collection scope, metadata outside the new scope is no longer searchable in Data Map.

  5. Intelligent Enhancement Settings and Collection Plan:

    • Intelligent Enhancement Settings (Beta):

      • AI-Enhanced Description: If you enable this feature, the system uses large models to automatically generate business descriptions for your tables and fields after collecting metadata. This greatly improves metadata readability and usability. After the collection is complete, you can go to the details page of a table object in Data Map to view the AI-generated information, such as table descriptions and field descriptions.

    • Collection Plan:

      • Trigger Mode: Select Manual or Cycle.

        • Manual: The crawler runs only when you manually trigger it. This is suitable for one-time or on-demand collection scenarios.

        • Cycle: Configure a scheduled task (such as monthly, daily, weekly, or hourly). The system will automatically update the metadata periodically.

          To configure a task with minute-level granularity, set the schedule to hourly and then select the desired minutes. For example, you can configure a task to run every 5 minutes, as the schedule sketch after this procedure illustrates.

          Important

          Only data sources in the production environment support periodic collection.

  6. Save configuration: Click Save or Save and Run to create the crawler.
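
The connectivity test in step 3 verifies that the resource group can reach the data source over the network. Conceptually, it amounts to a TCP reachability check such as the following sketch; the host and port are placeholders for your data source's endpoint.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoint: substitute your data source's host and port.
print(can_reach("mysql.example.internal", 3306))
```

Similarly, the minute-level schedule in step 5 (an hourly trigger plus selected minutes) expands to a fixed set of trigger times per hour. The following is a small sketch of that semantics, not DataWorks scheduler code:

```python
from datetime import datetime, timedelta

def runs_in_hour(hour_start: datetime, interval_minutes: int = 5) -> list[datetime]:
    """Enumerate the trigger times an hourly schedule with an N-minute step produces."""
    return [hour_start + timedelta(minutes=m) for m in range(0, 60, interval_minutes)]

# An every-5-minutes plan fires at 09:00, 09:05, ..., 09:55 within the hour.
for t in runs_in_hour(datetime(2025, 11, 20, 9)):
    print(t.strftime("%H:%M"))
```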

Manage custom crawlers

After a crawler is created, it appears in the custom crawler list. You can perform the following management operations:

  • List operations: In the list, you can Run, Stop, or Delete a crawler. Use the Filter and Search functions at the top to quickly locate the target crawler.

    Important

    After a crawler is deleted, the metadata objects it collected are also removed from Data Map. These objects and their details will no longer be searchable or viewable. Proceed with caution.

  • View details and logs: Click the name of the target crawler to go to its details page.

    • Basic Information: View all configuration items of the crawler.

    • Data Scope: View or Modify Data Scope.

      • If you view this tab before a collection task runs, the table count and last update time are empty.

      • Modifying the data scope is not supported for the following data source types: EMR Hive, CDH Hive, Lindorm, Elasticsearch, OTS, and AnalyticDB for Spark in AnalyticDB for MySQL.

    • Run Logs: Track the execution history of each collection task. You can view the task's start time, duration, status, and the volume of data collected. If a task fails, click View Logs to find information for troubleshooting and resolving the issue.

  • Manually run a collection task: In the upper-right corner of the details page, click the Collect Metadata button to immediately trigger a collection task. This is useful if you want to immediately view a newly created table in Data Map.

What to do next

After metadata is successfully collected, you can take full advantage of the features in Data Map:

  • Search for your collected tables in Data Map to view their details, field information, partitions, and data previews. For more information, see Metadata details.

  • Analyze the upstream and downstream lineage of tables to understand the entire data processing flow. For more information, see Data lineage analysis.

  • Add assets to a data collection to organize and manage your data from a business perspective. For more information, see Data collections.

FAQ

  • Q: Why do collection tasks for databases such as MySQL time out or fail?

    A: Check whether you have added the VSwitch CIDR block of the resource group to the database's IP address whitelist; the sketch below shows a quick way to verify coverage.
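
A quick way to sanity-check the whitelist is to verify that the resource group's egress address actually falls inside a CIDR block you added. Below is a minimal sketch using the Python standard library; the addresses are placeholders:

```python
import ipaddress

# Placeholders: the VSwitch CIDR block added to the database whitelist,
# and the source IP address the connection arrives from.
whitelist_cidrs = ["192.168.0.0/24"]
source_ip = ipaddress.ip_address("192.168.0.42")

if any(source_ip in ipaddress.ip_network(cidr) for cidr in whitelist_cidrs):
    print("Source IP is covered by the whitelist.")
else:
    print("Source IP is NOT covered. Add its VSwitch CIDR block to the whitelist.")
```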

Appendix: Collection scope and timeliness

Data tables

| Data source type | Collection method | Collection granularity | Table/Field update timeliness | Partition update timeliness | Lineage update timeliness |
| --- | --- | --- | --- | --- | --- |
| MaxCompute | System default automatic collection | Instance | Standard project: Real-time. External project: T+1 | Regions in the Chinese mainland: Real-time. Regions outside China: T+1 | Real-time |
| Data Lake Formation (DLF) | System default automatic collection | Instance | Real-time | Real-time | Supported for DLF metadata from the Serverless Spark, Serverless StarRocks, and Serverless Flink engines; not supported for other engines. Important: For EMR clusters, you must enable EMR_HOOK. |
| Hologres | Create a crawler manually | Database | Depends on the collection schedule | Not supported | Real-time |
| EMR Hive | Create a crawler manually | Instance | Depends on the collection schedule | Depends on the collection schedule | Real-time. Important: You must enable EMR_HOOK for the cluster. |
| CDH Hive | Create a crawler manually | Instance | Depends on the collection schedule | Real-time | Real-time |
| StarRocks | Create a crawler manually | Database | Instance mode: Real-time. Connection string mode: Depends on the collection schedule | Not supported | Real-time. Important: Only instance mode supports lineage collection; connection string mode cannot collect lineage. |
| AnalyticDB for MySQL | Create a crawler manually | Database | Depends on the collection schedule | Not supported | Real-time. Note: You must submit a ticket to enable the data lineage feature for your instance. |
| AnalyticDB for Spark | Create a crawler manually | Instance | Real-time | Not supported | Real-time |
| AnalyticDB for PostgreSQL | Create a crawler manually | Database | Depends on the collection schedule | Not supported | Real-time |
| Lindorm | Create a crawler manually | Instance | Depends on the collection schedule | Not supported | Real-time |
| OTS | Create a crawler manually | Instance | Depends on the collection schedule | Not supported | Not supported |
| Other data source types (MySQL, PostgreSQL, SQL Server, Oracle, ClickHouse, and so on) | Create a crawler manually | Database | Depends on the collection schedule | Not supported | Not supported |

Note

AnalyticDB for Spark and AnalyticDB for MySQL use the same entry point for metadata collection.

Task code

Data Map supports searching for and quickly locating task code. The following table describes the supported scope for code searches.

| Code source | Collection scope | Trigger method |
| --- | --- | --- |
| DataStudio | Data Studio - Create a node and edit the code | Automatic collection |
| DataStudio (Legacy) | DataStudio (legacy version) - Create a node and edit the code | Automatic collection |
| DataAnalysis | DataAnalysis - Create an SQL query and edit the code | Automatic collection |
| DataService Studio | DataService Studio - Create an API data push service | Automatic collection |

API assets

Data Map supports viewing the metadata of DataService Studio APIs, as detailed below:

| API type | Collection scope | Trigger method |
| --- | --- | --- |
| Generate API (Codeless UI) | DataService Studio - Create an API using the codeless UI | Automatic collection |
| Generate API (Code editor) | DataService Studio - Create an API using the code editor | Automatic collection |
| Registered API | DataService Studio - Register an API | Automatic collection |
| Service Orchestration | DataService Studio - Create a service orchestration | Automatic collection |

AI assets

Data Map supports viewing and managing AI assets. It also provides an AI asset lineage feature to track the source, usage, and evolution of data and models. The following table describes the support for various AI assets.

| Asset type | Collection scope | Trigger method |
| --- | --- | --- |
| Dataset | PAI - Create/Register a dataset; DataWorks - Create a dataset | Automatic collection |
| AI model | PAI - Model training task/Register model/Deploy model service | Automatic collection |
| Algorithm Task | PAI - Training task/Flow task/Distributed training task | Automatic collection |
| Model Service | PAI - Deploy model service (EAS deployment) | Automatic collection |

Workspace

Data Map supports viewing workspace metadata, as detailed below:

| Item | Collection scope | Trigger method |
| --- | --- | --- |
| Workspace | DataWorks - Create a workspace | Automatic collection |