How to create a crawler to gather metadata from various data sources into DataWorks

DataWorks Data Map provides the Metadata Collection feature, which allows you to centrally consolidate and manage metadata from various DataWorks data sources. In Data Map, you can view the metadata aggregated from all your sources. This topic describes how to create a crawler that collects and consolidates metadata from various data sources into DataWorks.

Overview

Metadata collection is a core feature for building an enterprise-grade Data Map and achieving unified data asset management. A crawler automatically extracts technical metadata (such as databases, tables, and columns), data lineage, and partition information from DataWorks data sources (such as MaxCompute, Hologres, MySQL, and CDH Hive) that are scattered across different workspaces within the same region, and consolidates this information into DataWorks Data Map to create a unified data view.

Metadata collection allows you to:

Build a unified data view: Break down data silos by centrally managing metadata from multiple, heterogeneous sources.
Enable data discovery and search: Allow data consumers to quickly and accurately find the data they need.
Achieve end-to-end data lineage analysis: Clearly trace the origin and flow of data to facilitate impact analysis and troubleshooting.
Empower data governance: Implement data classification and grading, access control, quality monitoring, and lifecycle management based on comprehensive metadata.

Billing

By default, each collection task consumes 0.25 CU multiplied by the task run time, incurring Resource group fees. Each successful collection generates a scheduling instance, incurring Scheduling instance fees.
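As a rough illustration of the resource consumption described above, the following sketch multiplies the default 0.25 CU by a task's run time. It assumes the run time is measured in hours, which this topic does not state, and `estimate_cu_usage` is a hypothetical helper rather than a DataWorks API. Actual fees depend on resource group and scheduling instance pricing, which is not modeled here.

```python
# Hypothetical helper illustrating "0.25 CU multiplied by the task run time".
# Assumption: run time is expressed in hours, so the result is in CU-hours.
DEFAULT_CU_PER_TASK = 0.25


def estimate_cu_usage(run_time_hours: float, cu: float = DEFAULT_CU_PER_TASK) -> float:
    """Return the estimated CU-hours consumed by one collection task."""
    if run_time_hours < 0:
        raise ValueError("run_time_hours must be non-negative")
    return cu * run_time_hours


if __name__ == "__main__":
    # Example: a task that runs for 2 hours at the default 0.25 CU.
    print(estimate_cu_usage(2.0))
```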

Limitations

If a data source uses an allowlist for access control, you must configure it beforehand. For more information, see Metadata collection allowlist.
The DataWorks deployment and the data source must be in the same region. If you must collect metadata across regions, use a public endpoint when you create the data source. For more information, see Data source management.
Metadata collection is not supported for AnalyticDB for MySQL data sources that have SSL enabled.

Entry point

Log on to the DataWorks console. In the target region, click Data Governance > Data Map in the left-side navigation pane. On the page that appears, click Go to Data Map.
In the left-side navigation pane, click the Metadata Collection icon to open the Metadata Collection page.

Built-in crawlers

Built-in crawlers are preconfigured and run automatically (in near real time) by the DataWorks platform. They are primarily used to collect core metadata from sources that are deeply integrated with DataWorks. You do not need to create them. You only need to manage their collection scope.

Important

If you cannot find the target table in Data Map, go to My Data > My Tools > Refresh Table Metadata to manually synchronize the relevant tables.

MaxCompute default crawler

This crawler collects metadata from MaxCompute projects under your account. You can go to the details page and click Modify Data Scope to select the projects to collect, and click Permission Configurations to configure the visibility of metadata within the tenant.

In the Built-in section on the Metadata Collection page, find the MaxCompute Default Crawler card and click Details.
The MaxCompute Default Crawler details page contains two tabs: Basic Information and Data Scope.
- Basic Information: Displays the basic properties of the crawler, such as the collection type and method. This information is read-only.
- Data Scope: Manages which MaxCompute projects this crawler collects.
Modify the collection scope:
1. Switch to the Data Scope tab and click Modify Data Scope.
2. In the dialog that appears, select or clear the MaxCompute projects that you want to collect.
  
  Important
  By default, the scope includes all MaxCompute projects associated with workspaces in the current region under the current tenant. After you modify the data scope, the metadata objects collected in Data Map are consistent with the current data scope. Metadata of unselected projects is not visible.
3. Click OK to save the changes.
Configure metadata visibility:
- In the Data Scope list, find the target project and click Permission Configurations in the Actions column.
- Select a visibility policy based on your data governance requirements:
  - Public Within Tenant: All members within the tenant can search for and view the metadata of this project.
  - Only members in the associated workspace can search and view: Access to the metadata of this project is restricted to members of the associated workspace, ensuring data isolation.
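
The two visibility policies above can be read as a simple access rule. The following sketch models that rule under stated assumptions: the `User`, `Project`, and `Visibility` types and the `can_view_metadata` function are hypothetical names introduced for illustration and do not correspond to any DataWorks API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Visibility(Enum):
    # Mirrors the two policies described above (names are illustrative).
    PUBLIC_WITHIN_TENANT = "public_within_tenant"
    WORKSPACE_MEMBERS_ONLY = "workspace_members_only"


@dataclass(frozen=True)
class User:
    tenant_id: str
    workspace_ids: FrozenSet[str]  # workspaces the user is a member of


@dataclass(frozen=True)
class Project:
    tenant_id: str
    workspace_id: str  # workspace associated with the MaxCompute project
    visibility: Visibility


def can_view_metadata(user: User, project: Project) -> bool:
    """Return True if the user may search for and view the project's metadata."""
    # Both policies are scoped to the tenant, so other tenants never qualify.
    if user.tenant_id != project.tenant_id:
        return False
    if project.visibility is Visibility.PUBLIC_WITHIN_TENANT:
        return True
    # Otherwise only members of the associated workspace qualify.
    return project.workspace_id in user.workspace_ids
```

For example, a tenant member who does not belong to the associated workspace can view a project set to Public Within Tenant, but not one restricted to workspace members.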

DLF default crawler

Important

To enable real-time collection of DLF metadata, you must grant Data Reader permissions to the service-linked role AliyunServiceRoleForDataworksOnEmr in the DLF console.

The DLF Default Crawler collects metadata from Data Lake Formation (DLF) under your account.

In the Built-in section on the Metadata Collection page, find the DLF Default Crawler card and click Details to view the basic information.
Switch to the Data Scope tab to view the list of DLF catalogs included in the collection scope and the number of tables in each catalog.

By default, all accessible catalogs (including DLF and DLF-Legacy versions) are collected.

Custom crawlers

Custom crawlers are designed to provide you with unified metadata management across environments and engines.

For standard data sources

You can create custom crawlers for traditional structured or semi-structured data sources such as Hologres, StarRocks, MySQL, Oracle, and CDH Hive. When you configure a collection task, the system parses the physical schema of the source and automatically extracts and synchronizes metadata such as column attributes, indexes, and partitions.

For metadata-type data sources (Catalog)

For metadata-type data sources that maintain their own native lake-format metadata and are not managed by DLF, such as Paimon Catalog, you can also create crawlers to collect the metadata directly.

Create Custom Crawler

In the custom crawler list section on the Metadata Collection page, click Create Metadata Collection.
Select the collection type: On the type selection page, select the target data source type to collect, such as Hologres or StarRocks.
Configure basic settings and resource group:
- Basic Configurations:
  - Select a workspace: Select the workspace where the data source to collect resides.
  - Select Data Source: Select a target data source from the drop-down list. After you make a selection, the system automatically displays the detailed information of the data source.
  - Name: Enter a name for the crawler for future identification. By default, this is the same as the data source name.
- Resource Group Configuration:
  - Resource Group: Select a resource group to run the collection task.
  - Test Network Connectivity: This step is critical. Click Test Network Connectivity to make sure that the resource group can successfully access the data source.
    Important
    
    Check whether the data source has an allowlist enabled. If you need to collect metadata from a data source with allowlist-based access control, see Overview and Configure a whitelist to configure allowlist permissions.
    
    If the data source does not have an allowlist enabled, see Resource group operations and network connectivity to establish network connectivity for the data source.
    
    If the connectivity test returns the error Backend service call failed: test connectivity failed.not support data type, contact technical support to upgrade the resource group.
Configure metadata collection:
- Collection Scope: Define the databases (database/schema) to collect. If the data source is at the database granularity, the database corresponding to the data source is selected by default. You can select additional databases beyond the data source.
  Important
  - Each database can be configured in only one crawler. If a database cannot be selected, it has already been collected by another crawler.
  - After you narrow the collection scope, metadata outside the scope cannot be searched in Data Map.
Configure intelligent enhancement and collection plan:
- Intelligent enhancement configuration (Beta):
  - AI-generated descriptions: When enabled, the system uses large language model capabilities to automatically generate business descriptions for your tables and columns after metadata is collected, significantly improving metadata readability and usability. After collection is complete, you can view the AI-generated information (such as table descriptions and column descriptions) on the details page of the table object in Data Map.
- Collection Plan:
  - Trigger Mode: Select Manual or Periodic.
    - Manual: The crawler runs only when you manually trigger it. This mode is suitable for one-time or on-demand collection scenarios.
    - Periodic: Configure a scheduled task (such as monthly, daily, weekly, or hourly). The system automatically updates metadata on a periodic basis.
      
      To run a task at minute-level intervals, set the collection period to hourly and select every minute-level option. The task then runs every 5 minutes.
      
      Important
      Only data sources in the production environment support periodic collection.
Save the configuration: Click Save or Save and Run to complete the crawler creation.
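
Two constraints from the steps above can be checked before a crawler is saved: each database can belong to only one crawler, and periodic collection requires a production data source. The following sketch expresses those checks under stated assumptions: `CrawlerConfig`, its fields (including `environment`), and `validate_new_crawler` are hypothetical names for illustration. The console enforces these rules itself, and this code does not call any DataWorks API.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class CrawlerConfig:
    # Hypothetical representation of the settings chosen in the console.
    name: str
    databases: Set[str]               # collection scope (database/schema names)
    trigger_mode: str = "manual"      # "manual" or "periodic"
    environment: str = "production"   # assumed field: environment of the data source


def validate_new_crawler(new: CrawlerConfig, existing: List[CrawlerConfig]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no conflict."""
    errors = []
    if new.trigger_mode not in ("manual", "periodic"):
        errors.append(f"unknown trigger mode '{new.trigger_mode}'")
    # Only data sources in the production environment support periodic collection.
    if new.trigger_mode == "periodic" and new.environment != "production":
        errors.append("periodic collection requires a production data source")
    # Each database can be configured in only one crawler.
    for other in existing:
        for db in sorted(new.databases & other.databases):
            errors.append(f"database '{db}' is already collected by crawler '{other.name}'")
    return errors
```

A non-empty result corresponds to the cases described above, such as a database that cannot be selected because another crawler already collects it.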

Manage custom crawlers

After a crawler is created, it appears in the custom crawler list. You can perform the following management operations:

List operations: In the list, you can directly perform operations such as Run, Stop, and Delete on the crawler. Use the Filter and Search features at the top to quickly locate the target crawler.

Important
After you delete a metadata crawler, the metadata objects collected by that crawler in Data Map become invalid. You can no longer search for or view the objects and their details from that crawler. Proceed with caution.
Batch operations: Select multiple crawlers in the list, then use Run or Stop at the bottom of the list to trigger or terminate multiple collection tasks in one go, improving management efficiency.
Crawler status: The crawler list shows the current status of each crawler. Common statuses include Not Running, Running, Succeeded, and Failed.

Note
If the data source bound to a crawler is unbound or becomes invalid, the crawler enters the Frozen state. A frozen crawler cannot be run or edited; only deletion is allowed.
View details and logs: Click the name of the target crawler to go to its details page.
- Basic Information: View all configuration items of the crawler.
- Data Scope: View the data scope, or click Modify Data Scope to change it.
  
  If no collection has been performed, the table count and last update time are empty.
  
  The following data sources do not support scope modification: EMR Hive, CDH Hive, Lindorm, Elasticsearch, OTS, MongoDB, and AnalyticDB for Spark within AnalyticDB for MySQL.
- Run Logs: Track the execution history of each collection task. You can view the start time, duration, status, and volume of data collected for each task. When a task fails, clicking View Logs is the key entry point for locating and resolving issues.
Manually run a collection: In the upper-right corner of the details page, click Collect Metadata to immediately trigger a collection task. This is useful when you want a newly created table to appear in Data Map right away.
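
The status rules above, in particular the Frozen state, can be summarized as a small lookup. This is a minimal sketch under stated assumptions: the action names and the `allowed_actions` function are hypothetical, and the sketch assumes that every non-frozen status permits all list operations, which this topic does not spell out.

```python
from typing import FrozenSet

# Statuses named in this topic.
KNOWN_STATUSES = frozenset({"Not Running", "Running", "Succeeded", "Failed", "Frozen"})

# Illustrative action names for the operations described above.
ALL_ACTIONS = frozenset({"run", "stop", "edit", "delete"})


def allowed_actions(status: str) -> FrozenSet[str]:
    """Return the operations permitted for a crawler in the given status."""
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unknown crawler status: {status!r}")
    # A frozen crawler cannot be run or edited; only deletion is allowed.
    if status == "Frozen":
        return frozenset({"delete"})
    # Assumption: other statuses allow every operation.
    return ALL_ACTIONS
```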

Next steps

After metadata is collected successfully, you can take full advantage of Data Map capabilities:

Search for collected tables in Data Map and view their details, column information, partitions, and data preview. For more information, see Metadata details.
Analyze the upstream and downstream lineage of tables to understand the full data processing pipeline. For more information, see Data lineage.
Add assets to a data album to organize and manage your data from a business perspective. For more information, see Data Albums.

FAQ

Q: Why does a MySQL or other database-type collection time out or fail?

A: Check whether the vSwitch CIDR Block of the resource group has been added to the allowlist.
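
The check described in this answer can be done offline with Python's standard `ipaddress` module. The sketch below assumes that the allowlist is available as a list of CIDR blocks or single IP addresses, and the `cidr_is_allowlisted` function and the example ranges are illustrative only. It does not read anything from DataWorks or the database.

```python
import ipaddress
from typing import List


def cidr_is_allowlisted(vswitch_cidr: str, allowlist: List[str]) -> bool:
    """Return True if the vSwitch CIDR block is fully covered by one allowlist entry."""
    target = ipaddress.ip_network(vswitch_cidr, strict=False)
    for entry in allowlist:
        network = ipaddress.ip_network(entry, strict=False)
        # subnet_of() requires both networks to use the same IP version.
        if target.version == network.version and target.subnet_of(network):
            return True
    return False


if __name__ == "__main__":
    # Example private ranges, chosen only for illustration.
    print(cidr_is_allowlisted("192.168.1.0/24", ["10.0.0.0/8", "192.168.0.0/16"]))
```

An entry that covers only part of the vSwitch CIDR block returns False here, because some addresses in the resource group could still be rejected by the data source.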

Appendix: Collection scope and timeliness

Data tables

Data Source Type	Collection Mode	Collection granularity	Table/Column update timeliness	Partition update timeliness	Lineage update timeliness
MaxCompute	Automatically collected by the system by default	Instance	Standard projects: Real-time. External projects: T+1.	China mainland regions: Real-time. International regions: T+1.	Real-time
Data Lake Formation (DLF)	Automatically collected by the system by default	Instance	Real-time	Real-time	Supported for DLF metadata of the Serverless Spark, Serverless StarRocks, Serverless Flink, and EMR Impala engines. Other engines are not supported. Important: For EMR clusters, you must enable EMR_HOOK. To display the lineage of EMR Impala tasks, you must enable lineage logging in the Impala configuration of the EMR cluster. Only EMR DataLake clusters are supported. This feature is currently in canary release. Contact Alibaba Cloud technical support to enable it before use. For configuration details, see Data lineage.
Hologres	Manually create a crawler	Database	Depends on the collection period	Not supported	Real-time
EMR Hive	Manually create a crawler	Instance	Depends on the collection period	Depends on the collection period	Real-time. Important: You must enable EMR_HOOK for the cluster. To display the lineage of EMR Impala tasks, you must enable lineage logging in the Impala configuration of the EMR cluster. Only EMR DataLake clusters are supported. This feature is currently in canary release. Contact Alibaba Cloud technical support to enable it before use. For configuration details, see Data lineage.
CDH Hive	Manually create a crawler	Instance	Depends on the collection period	Real-time	Real-time
StarRocks	Manually create a crawler	Database	Instance mode: Real-time. Connection string mode: Depends on the collection period.	Not supported	Real-time. Important: Only instance mode supports lineage collection. Connection string mode does not support lineage collection.
AnalyticDB for MySQL	Manually create a crawler	Database	Depends on the collection period	Not supported	Real-time. Note: You must submit a ticket to enable the data lineage feature for the AnalyticDB for MySQL instance.
AnalyticDB for Spark	Manually create a crawler	Instance	Real-time	Not supported	Real-time
AnalyticDB for PostgreSQL	Manually create a crawler	Database	Depends on the collection period	Not supported	Real-time
Lindorm	Manually create a crawler	Instance	Depends on the collection period	Not supported	Real-time
OTS	Manually create a crawler	Instance	Depends on the collection period	Not supported	Not supported
MongoDB	Manually create a crawler	Instance	Depends on the collection period	Not supported	Not supported
Elasticsearch	Manually create a crawler	Instance	Depends on the collection period	Not supported	T+1
Paimon Catalog	Manually create a crawler	Catalog	Depends on the collection period	Depends on the collection period	Not supported
Other data source types (MySQL, PostgreSQL, SQL Server, Oracle, ClickHouse, SelectDB, OceanBase, etc.)	Manually create a crawler	Database	Depends on the collection period	Not supported	Not supported

Note

AnalyticDB for Spark and AnalyticDB for MySQL share the same metadata collection entry point.

Task code

Data Map supports task code search and quick navigation. The following describes the scope of searchable code.

Code source	Collection scope	Trigger mode
Data Studio	Data Studio - Create a node and edit code	Automatic collection
Legacy Data Studio	Legacy Data Studio - Create a node and edit code	Automatic collection
Data Analysis	Data Analysis - Create an SQL query and edit code	Automatic collection
Data Service	Data Service - Create an API Data Push service	Automatic collection

API assets

Data Map supports viewing metadata of Data Service APIs, as described below:

API Type	Collection scope	Trigger mode
API generation (wizard mode)	Data Service - Create an API in wizard mode	Automatic collection
API generation (script mode)	Data Service - Create an API in script mode	Automatic collection
Register API	Data Service - Register an API	Automatic collection
API orchestration	Data Service - Create an API orchestration	Automatic collection

AI assets

Data Map supports viewing and managing AI assets, and provides AI asset lineage to trace the origin, usage, and evolution of data and models. The following describes the support status for each type of AI asset.

Asset type	Collection scope	Trigger mode
Dataset	PAI - Create a dataset / Register a dataset; DataWorks - Create a dataset	Automatic collection
AI model	PAI - Model training task / Register a model / Deploy a model service	Automatic collection
Algorithm task	PAI - Training task / Workflow task / Distributed training task	Automatic collection
Model service	PAI - Deploy a model service (EAS deployment)	Automatic collection

Workspace

Data Map supports viewing workspace metadata, as described below:

Item	Collection scope	Trigger mode
Workspace	DataWorks - Create a workspace	Automatic collection