DataWorks Data Map provides the Collect Metadata feature, which helps you centrally manage metadata from various DataWorks data sources. You can view all collected metadata in Data Map. This topic describes how to create a crawler to collect metadata from your data sources into DataWorks.
Overview
Collect Metadata is a core feature for building an enterprise-level data map and achieving unified data asset management. It runs crawlers to automatically extract technical metadata, such as databases, tables, and fields, along with data lineage and partition information. The crawlers extract this information from various DataWorks data sources, such as MaxCompute, Hologres, MySQL, and CDH Hive, that are distributed across different workspaces in the same region. The collected metadata is then consolidated in DataWorks Data Map to provide a unified data view.
Collect Metadata lets you:
Build a unified data view: Break down data silos by centrally managing heterogeneous metadata from multiple sources.
Support data discovery and search: Enable data consumers to quickly and accurately find the data they need.
Enable end-to-end lineage analysis: Clearly trace the origin and flow of data for impact analysis and troubleshooting.
Empower data governance: Perform data classification, permission control, quality monitoring, and lifecycle management based on complete metadata.
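The overview above mentions databases, tables, fields, partitions, and lineage. The following sketch is illustrative only — it is not DataWorks's actual metadata model — and shows the kinds of technical metadata objects a crawler typically extracts and consolidates:

```python
from dataclasses import dataclass, field

# Illustrative only: not DataWorks's actual schema, just the kinds of
# technical metadata objects a crawler extracts from a data source.

@dataclass
class Column:
    name: str
    data_type: str
    description: str = ""

@dataclass
class Table:
    database: str
    name: str
    columns: list[Column] = field(default_factory=list)
    partition_keys: list[str] = field(default_factory=list)

@dataclass
class LineageEdge:
    upstream: str    # e.g. "ods_db.raw_orders"
    downstream: str  # e.g. "dwd_db.fact_orders"

# Data Map consolidates objects like these from many sources into one view.
catalog: list[Table] = [
    Table("ods_db", "raw_orders",
          [Column("order_id", "BIGINT"), Column("amount", "DECIMAL(18,2)")],
          partition_keys=["ds"]),
]
```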
Billing
By default, each collection task consumes 0.25 CU multiplied by the task runtime, which incurs resource group fees. Each successful collection run also generates a scheduling instance, which incurs task scheduling fees.
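The formula is simple enough to estimate by hand. The following sketch is illustrative only; actual fees depend on your resource group's billing plan and unit prices:

```python
# Illustrative estimate of the compute consumed by one collection task.
# Assumption: the task holds a constant 0.25 CU while it runs; unit prices
# and billing rules depend on your resource group's plan.
CU_PER_TASK = 0.25

def cu_hours(runtime_minutes: float) -> float:
    """CU-hours consumed by a task that runs for runtime_minutes."""
    return CU_PER_TASK * (runtime_minutes / 60)

# A crawler that runs for 12 minutes consumes 0.25 CU x 0.2 h = 0.05 CU-hours.
print(cu_hours(12))  # 0.05
```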
Limits
When you collect metadata from a data source that uses a whitelist for access control, you must configure the database whitelist in advance. For more information, see Metadata collection whitelist.
Cross-region metadata collection is not recommended; the DataWorks region should be the same as the data source region. If you must collect metadata across regions, use a public endpoint when you create the data source. For more information, see Data Source Management.
You cannot use a MySQL Database Collector to collect metadata from an OceanBase data source.
Go to the feature page
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, click Go to Data Map.
In the navigation pane on the left, click to go to the Collect Metadata page.
Built-in crawlers
Built-in crawlers are preconfigured by the DataWorks platform and run automatically in near real time. They primarily collect core metadata that is deeply integrated with DataWorks. You do not need to create these crawlers; you only need to manage their scope.
If you cannot find the target table in Data Map, manually sync the table.
MaxCompute Default Crawler
This crawler collects metadata from the MaxCompute projects under your account. On the crawler's details page, use the Modify Data Scope option to select the projects to collect from, and use the Permission Configurations option to set metadata visibility within the tenant.
On the Collect Metadata page, in the Built-in section, find the MaxCompute Default Crawler card and click Details.
The MaxCompute Default Crawler details page contains the Basic Information and Data Scope tabs.
Basic Information: Displays the basic properties of the crawler, such as the collection type and method. This information is read-only.
Data Scope: Manage the MaxCompute projects from which the crawler collects metadata.
Modify collection scope:
Switch to the Data Scope tab and click the Modify Data Scope button.
In the dialog box that appears, select or clear the checkboxes for the MaxCompute projects to include in the collection.
Important: The default scope includes all MaxCompute projects in the current region that are attached to a workspace under the current tenant. After you modify the data scope, the metadata objects in Data Map are updated to match the new scope. This means that metadata for unselected projects will not be visible.
Click Confirm to save the changes.
Configure metadata visibility:
In the Data Scope list, find the target project and click Permission Configurations in the Actions column.
Select a visibility policy based on your data governance requirements:
Public within Tenant: All members within the tenant can search for and view the metadata of this project.
Only members in the associated workspace can search and view: Only members of the workspaces associated with this project can access its metadata. This ensures data isolation.
DLF Default Crawler
To support real-time collection of DLF metadata, you must grant the Data Reader permission to the service-linked role AliyunServiceRoleForDataworksOnEmr in the DLF console.
The DLF Default Crawler collects metadata from Data Lake Formation (DLF) under your account.
On the Collect Metadata page, in the Built-in section, find the DLF Default Crawler card and click Details to view its basic information.
Switch to the Data Scope tab to view the list of DLF Catalogs that are within the collection scope and the number of tables they contain.
By default, all accessible Catalogs are collected, including DLF and DLF-Legacy versions.
Custom crawlers
You need to create a custom crawler to collect metadata from data sources such as Hologres, StarRocks, MySQL, Oracle, and CDH Hive.
Create a custom crawler
On the Collect Metadata page, in the custom crawler section, click Create Metadata Collection.
Select Collection Type: On the type selection page, select the type of the target data source from which to collect metadata, such as Hologres or StarRocks.
Configure basic information and the resource group:
Basic configuration:
Select Workspace: Select the workspace where the data source is located.
Select Data Source: Select an existing target data source from the drop-down list. After you select a data source, the system automatically displays its details.
Name: Enter a name for the crawler. By default, the crawler name is the same as the data source name.
Resource group configuration:
Resource Group: Select a resource group to run the collection task.
Test Network Connectivity: This step is crucial. Click Test Network Connectivity to ensure that the resource group can successfully access the data source. (A conceptual sketch of what this test verifies follows this step.)
Important: If the data source has whitelist-based access control enabled, you must configure the whitelist permissions. For more information, see Network connectivity solutions and General configuration: Add a whitelist.
If the data source does not use a whitelist, you must still establish a network connection for it. For more information, see Resource group operations and network connectivity.
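Conceptually, the connectivity test checks that the resource group can open a TCP connection to the data source endpoint. The following minimal sketch approximates that check with Python's standard library; the host and port are hypothetical placeholders, and a successful handshake does not prove that credentials or whitelist rules are correct:

```python
import socket

# Hypothetical endpoint; replace with your data source's host and port.
HOST, PORT = "rm-example.mysql.rds.aliyuncs.com", 3306

try:
    # A successful TCP handshake roughly corresponds to a passing
    # connectivity test; it does not validate credentials.
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"Reachable: {HOST}:{PORT}")
except OSError as exc:
    print(f"Unreachable: {HOST}:{PORT} ({exc})")
```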
Metadata Collection Configurations:
Collection scope: Define the databases (Database/Schema) from which to collect metadata. If the data source has database-level granularity, the database associated with the data source is selected by default. You can also select other databases.
Important: A database can be configured in only one crawler. If a database cannot be selected, it is already being collected by another crawler. If you narrow the collection scope, metadata outside the new scope is no longer searchable in Data Map.
Intelligent Enhancement Settings and Collection Plan:
Intelligent Enhancement Settings (Beta):
AI-Enhanced Description: If you enable this feature, the system uses large models to automatically generate business descriptions for your tables and fields after collecting metadata. This greatly improves metadata readability and usability. After the collection is complete, you can go to the details page of a table object in Data Map to view the AI-generated information, such as table descriptions and field descriptions.
Collection Plan:
Trigger Mode: Select Manual or Cycle.
Manual: The crawler runs only when you manually trigger it. This is suitable for one-time or on-demand collection scenarios.
Cycle: Configure a scheduled task (such as monthly, daily, weekly, or hourly). The system will automatically update the metadata periodically.
To configure a task with minute-level granularity, set the schedule to hourly and then select the desired minutes. For example, you can configure a task to run every 5 minutes, as illustrated in the sketch after this list.
Important: Only data sources in the production environment support periodic collection.
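To make the minute-level option concrete, the following sketch expands an hourly schedule with a 5-minute step into the run times it produces within one hour. This is an illustration only; the actual schedule is configured in the crawler UI, not in code:

```python
from datetime import datetime, timedelta

def expand_schedule(start: datetime, step_minutes: int, hours: int = 1):
    """Yield the run times of an hourly schedule with a minute-level step."""
    end = start + timedelta(hours=hours)
    run = start
    while run < end:
        yield run
        run += timedelta(minutes=step_minutes)

# An hourly schedule with a 5-minute step triggers 12 runs per hour.
runs = list(expand_schedule(datetime(2025, 1, 1, 0, 0), step_minutes=5))
print(len(runs))                                              # 12
print(runs[0].strftime("%H:%M"), runs[-1].strftime("%H:%M"))  # 00:00 00:55
```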
Save configuration: Click Save or Save and Run to create the crawler.
Manage custom crawlers
After a crawler is created, it appears in the custom crawler list. You can perform the following management operations:
List operations: In the list, you can Run, Stop, or Delete a crawler. Use the Filter and Search functions at the top to quickly locate the target crawler.
Important: After a crawler is deleted, the metadata objects it collected are also removed from Data Map. These objects and their details will no longer be searchable or viewable. Proceed with caution.
View details and logs: Click the name of the target crawler to go to its details page.
Basic Information: View all configuration items of the crawler.
Data Scope: View or Modify Data Scope.
If viewed before a collection runs, the table count and last update time are empty.
Modifying the scope is not supported for the following data sources: EMR Hive, CDH Hive, Lindorm, Elasticsearch, OTS, and AnalyticDB for Spark in AnalyticDB for MySQL.
Run Logs: Track the execution history of each collection task. You can view the task's start time, duration, status, and the volume of data collected. If a task fails, click View Logs to find information for troubleshooting and resolving the issue.
Manually run a collection task: In the upper-right corner of the details page, click the Collect Metadata button to immediately trigger a collection task. This is useful if you want to immediately view a newly created table in Data Map.
What to do next
After metadata is successfully collected, you can take full advantage of the features in Data Map:
Search for your collected tables in Data Map to view their details, field information, partitions, and data previews. For more information, see Metadata details.
Analyze the upstream and downstream lineage of tables to understand the entire data processing flow. For more information, see Data lineage analysis.
Add assets to a data collection to organize and manage your data from a business perspective. For more information, see Data collections.
FAQ
Q: Why do collection tasks for databases such as MySQL time out or fail?
A: Check whether you have added the vSwitch CIDR block of the resource group to the database whitelist.
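A quick way to verify coverage is to confirm that the resource group's vSwitch CIDR block is contained in one of the whitelisted CIDR blocks. A minimal sketch with Python's standard ipaddress module, using hypothetical CIDR values:

```python
import ipaddress

# Hypothetical values: entries on the database whitelist and the vSwitch
# CIDR block of the resource group that runs the collection task.
whitelist = [ipaddress.ip_network(c) for c in ("10.0.0.0/24", "172.16.0.0/16")]
vswitch = ipaddress.ip_network("192.168.1.0/24")

# The whitelist covers the resource group only if some entry contains
# the entire vSwitch CIDR block.
covered = any(
    vswitch.subnet_of(net)
    for net in whitelist
    if net.version == vswitch.version  # avoid IPv4/IPv6 comparison errors
)
print(covered)  # False -> add 192.168.1.0/24 to the whitelist
```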
Appendix: Collection scope and timeliness
Data tables
| Data source type | Collection method | Collection granularity | Table/field timeliness | Partition timeliness | Lineage timeliness |
| --- | --- | --- | --- | --- | --- |
| MaxCompute | System default automatic collection | Instance | Standard project: Real-time. External project: T+1 | Regions in the Chinese mainland: Real-time. Regions outside China: T+1 | Real-time |
| Data Lake Formation (DLF) | System default automatic collection | Instance | Real-time | Real-time | Supported for DLF metadata from Serverless Spark, Serverless StarRocks, and Serverless Flink engines; not supported for others. Important: For EMR clusters, you must enable EMR_HOOK. |
| Hologres | Create a crawler manually | Database | Depends on the collection schedule |  | Real-time |
| EMR Hive | Create a crawler manually | Instance | Depends on the collection schedule | Depends on the collection schedule | Real-time. Important: You must enable EMR_HOOK for the cluster. |
| CDH Hive | Create a crawler manually | Instance | Depends on the collection schedule | Real-time | Real-time |
| StarRocks | Create a crawler manually | Database | Depends on the collection schedule |  | Real-time. Important: Only instance mode supports data lineage collection. Connection string mode cannot collect data lineage. |
| AnalyticDB for MySQL | Create a crawler manually | Database | Depends on the collection schedule |  | Real-time. Note: You must submit a ticket to enable the data lineage feature for your AnalyticDB for MySQL instance. |
| AnalyticDB for Spark | Create a crawler manually | Instance | Real-time |  | Real-time |
| AnalyticDB for PostgreSQL | Create a crawler manually | Database | Depends on the collection schedule |  | Real-time |
| Lindorm | Create a crawler manually | Instance | Depends on the collection schedule |  | Real-time |
| OTS | Create a crawler manually | Instance | Depends on the collection schedule |  |  |
| Other data source types (MySQL, PostgreSQL, SQL Server, Oracle, ClickHouse, etc.) | Create a crawler manually | Database | Depends on the collection schedule |  |  |
AnalyticDB for Spark and AnalyticDB for MySQL use the same entry point for metadata collection.
Task code
Data Map supports searching for and quickly locating task code. The following table describes the supported scope for code searches.
| Code source | Collection scope | Trigger method |
| --- | --- | --- |
| DataStudio | Data Studio - Create a node and edit the code | Automatic collection |
| DataStudio (Legacy) | DataStudio (legacy version) - Create a node and edit the code | Automatic collection |
| DataAnalysis | DataAnalysis - Create an SQL query and edit the code | Automatic collection |
| DataService Studio | DataService Studio - Create an API data push service | Automatic collection |
API assets
Data Map supports viewing the metadata of DataService Studio APIs, as detailed below:
| API type | Collection scope | Trigger method |
| --- | --- | --- |
| Generate API (Codeless UI) | DataService Studio - Create an API using the codeless UI | Automatic collection |
| Generate API (Code editor) | DataService Studio - Create an API using the code editor | Automatic collection |
| Registered API | DataService Studio - Register an API | Automatic collection |
| Service Orchestration | DataService Studio - Create a service orchestration | Automatic collection |
AI assets
Data Map supports viewing and managing AI assets. It also provides an AI asset lineage feature to track the source, usage, and evolution of data and models. The following table describes the support for various AI assets.
| Asset type | Collection scope | Trigger method |
| --- | --- | --- |
| Dataset |  | Automatic collection |
| AI model | PAI - Model training task/Register model/Deploy model service | Automatic collection |
| Algorithm Task | PAI - Training task/Flow task/Distributed training task | Automatic collection |
| Model Service | PAI - Deploy model service (EAS deployment) | Automatic collection |
Workspace
Data Map supports viewing workspace metadata, as detailed below:
| Item | Collection scope | Trigger method |
| --- | --- | --- |
| Workspace | DataWorks - Create a workspace | Automatic collection |