Metadata collection in DataWorks Data Map extracts technical metadata -- databases, tables, columns, partitions, and data lineage -- from data sources across workspaces in the same region and aggregates it into a unified catalog. Collected metadata becomes searchable and browsable in Data Map.
Two types of crawlers are available:
Built-in crawlers run automatically with zero configuration. They collect metadata from MaxCompute and Data Lake Formation (DLF) in near real-time.
Custom crawlers connect to additional data sources such as Hologres, StarRocks, MySQL, Oracle, and CDH Hive. Create and schedule them to fit your collection needs.
How it works
Built-in crawlers automatically synchronize metadata from integrated data sources (MaxCompute and DLF) using an internal mechanism that requires no user configuration. Custom crawlers connect to an external data source through a resource group, read the catalog structure (databases, schemas, tables, columns), and write the extracted metadata to Data Map. Depending on the data source, a crawler also collects partition information and data lineage.
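The custom-crawler flow described above can be sketched as a simple catalog walk. This is an illustrative model only, not the DataWorks implementation: the nested dict stands in for a live source catalog, and `ColumnMeta` is a hypothetical record type of the kind Data Map aggregates.

```python
# Hypothetical sketch of a custom crawler pass: walk a source catalog
# (databases -> tables -> columns) and emit flat metadata records.
from dataclasses import dataclass

@dataclass
class ColumnMeta:
    database: str
    table: str
    column: str
    dtype: str

def extract_metadata(catalog: dict) -> list[ColumnMeta]:
    """Flatten a {db: {table: {column: type}}} structure into records."""
    records = []
    for db, tables in catalog.items():
        for table, columns in tables.items():
            for column, dtype in columns.items():
                records.append(ColumnMeta(db, table, column, dtype))
    return records

# A stand-in for the catalog structure a crawler would read from a source.
source = {"sales": {"orders": {"order_id": "BIGINT", "amount": "DECIMAL(10,2)"}}}
for rec in extract_metadata(source):
    print(rec)
```

A real crawler would additionally collect partition information and lineage where the source supports it, as noted above.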
Collected metadata enables the following workflows in Data Map:
Data discovery -- search for tables across data sources by name, field, or owner.
Lineage analysis -- trace data origins and downstream dependencies for impact analysis and troubleshooting.
Data governance -- classify data assets, manage access control, monitor data quality, and enforce lifecycle policies.
Open the metadata collection page
Log on to the DataWorks console. In the top navigation bar, select the target region. In the left-side navigation pane, choose Data Governance > Data Map. On the page that appears, click Go to Data Map.
In the left navigation pane of Data Map, click the metadata collection icon to open the metadata collection page.
Built-in crawlers
Built-in crawlers are preconfigured by DataWorks and run automatically in near real-time. They collect core metadata from integrated data sources with no manual setup; the only setting you manage is the collection scope.
If a table is missing from Data Map, go to My Data > My Tools > Refresh Table Metadata to manually sync it.
MaxCompute default crawler
The MaxCompute Default Crawler collects metadata from MaxCompute projects in your account.
View crawler details
In the Built-in section of the metadata collection page, find the MaxCompute Default Crawler card and click Details.
The details page has two tabs:
Basic Information -- displays crawler attributes such as collection type and mode. This tab is read-only.
Data Scope -- lists the MaxCompute projects included in the collection.
Modify the collection scope
Click the Data Scope tab, then click Modify Data Scope.
In the dialog box, select or clear the MaxCompute projects to include.
Click Confirm.
By default, the scope includes all MaxCompute projects bound to workspaces in the current region under the current tenant. Metadata from unselected projects becomes invisible in Data Map.
Configure metadata visibility
In the Data Scope list, find the target project and click Permission Configurations in the Actions column.
Select a visibility policy:
Public Within Tenant -- all tenant members can search for and view metadata from this project.
Only Members in the Associated Workspace Can Search and View -- restricts metadata access to members of specific workspaces, providing data isolation.
DLF default crawler
To enable real-time collection of DLF metadata, grant the Data Reader permission to the service linked role AliyunServiceRoleForDataworksOnEmr in the DLF console.
The DLF Default Crawler collects metadata from Data Lake Formation (DLF) within your account. By default, all accessible catalogs (including DLF and DLF-Legacy versions) are collected.
In the Built-in section of the metadata collection page, find the DLF Default Crawler card and click Details to view basic information.
Click the Data Scope tab to view the list of DLF catalogs and their table counts.
Custom crawlers
Custom crawlers extend metadata collection to data sources not covered by built-in crawlers. Two categories are supported:
Conventional data sources -- Hologres, StarRocks, MySQL, Oracle, CDH Hive, and others. The system parses database table structures to automatically extract field attributes, indexes, and partitions.
Metadata-type data sources (Catalog) -- non-DLF native lake format metadata such as Paimon Catalog.
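To illustrate how a crawler parses database table structures into field attributes, the sketch below reads column metadata from a local SQLite database. SQLite and its `PRAGMA table_info` statement are stand-ins for assumption's sake; against MySQL or Hologres a crawler would instead query system views such as `information_schema.columns`.

```python
# Illustrative only: reading a table's structure the way a crawler might.
# SQLite stands in for a conventional data source such as MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, dt TEXT)")

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
cols = conn.execute("PRAGMA table_info(orders)").fetchall()
schema = [{"name": c[1], "type": c[2], "pk": bool(c[5])} for c in cols]
print(schema)
```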
Create a custom crawler
In the custom crawler list section of the metadata collection page, click Create Metadata Collection.
Select the collection type. Choose the target data source type (for example, Hologres or StarRocks).
Configure basic information and resource group.
Important:
- If the data source has whitelist restrictions, see Overview of network connectivity solutions and Configure a whitelist.
- If the data source does not have whitelist restrictions, see Network connectivity and operations on resource groups.
- If the connectivity test fails with `backend service call failed: test connectivity failed.not support data type`, contact technical support to upgrade the resource group.

Basic Configurations:

| Field | Description |
|---|---|
| Select Workspace | The workspace that contains the data source. |
| Select Data Source | A data source created in the selected workspace. The system displays data source details after selection. |
| Name | A name for the crawler. Defaults to the data source name. |

Resource Group Configuration:

| Field | Description |
|---|---|
| Resource Group | The resource group that runs the collection task. |
| Test Network Connectivity | Verifies that the resource group can access the data source. Run this test before proceeding. |

Define the collection scope. Select the databases (Database/Schema) to collect. For database-granular data sources, the database bound to the data source is selected by default. You can also select additional databases beyond the one bound to the data source.
Important:
- A database can be configured in only one crawler. If a database is grayed out, it is already being collected by another crawler.
- Narrowing the collection scope makes metadata outside the scope unsearchable in Data Map.
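For whitelist-restricted sources, you can sanity-check locally that the resource group's address falls inside the CIDR block you added to the database whitelist. Both the CIDR and the IP below are hypothetical placeholders; substitute your own values.

```python
# Quick local check: does the whitelist CIDR cover the resource group's IP?
import ipaddress

whitelist_cidr = ipaddress.ip_network("192.168.0.0/24")   # hypothetical vSwitch CIDR
resource_group_ip = ipaddress.ip_address("192.168.0.17")  # hypothetical egress IP

print(resource_group_ip in whitelist_cidr)  # True if the whitelist covers it
```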
Configure intelligent enhancement settings and the collection plan.
To achieve a 5-minute collection interval, select hourly collection and check all of the minute options.
ImportantPeriodic collection is supported only for production environment data sources.
Intelligent Enhancement Settings (Beta):
| Setting | Description |
|---|---|
| AI-Enhanced Description | Uses large language models to automatically generate business descriptions for tables and fields after metadata collection. View AI-generated descriptions (such as table remarks and field descriptions) on the table details page in Data Map. |

Collection Plan:

| Setting | Description |
|---|---|
| Trigger Mode | Manual -- the crawler runs only when manually triggered. Use this for one-time or on-demand collection. Cycle -- runs on a schedule (monthly, weekly, daily, or hourly). The system updates metadata automatically at the configured interval. |

Click Save or Save and Run.
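The effect of an hourly cycle with every 5-minute option checked can be sketched by enumerating trigger times. This is an illustrative model of the schedule semantics, not DataWorks scheduler code; the `next_runs` helper and times are assumptions.

```python
# Sketch: an hourly schedule with all twelve 5-minute checkboxes selected
# fires every 5 minutes.
from datetime import datetime, timedelta

def next_runs(start: datetime, minutes: list[int], count: int) -> list[datetime]:
    """Return the next `count` trigger times after `start` for an hourly schedule."""
    runs, t = [], start
    while len(runs) < count:
        if t.minute in minutes and t > start:
            runs.append(t)
        t += timedelta(minutes=1)
    return runs

minute_options = list(range(0, 60, 5))  # all twelve 5-minute options
runs = next_runs(datetime(2025, 1, 1, 8, 0), minute_options, 3)
print([r.strftime("%H:%M") for r in runs])  # ['08:05', '08:10', '08:15']
```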
Manage custom crawlers
After creation, crawlers appear in the custom list. The following operations are available.
List operations:
Run, Stop, or Delete a crawler directly from the list. The available actions depend on the crawler status (for example, a running crawler shows Stop instead of Run).
Use the Filter and Search features to locate a specific crawler.
Deleting a crawler removes its collected metadata objects from Data Map. This action cannot be undone.
View details and logs:
Click the crawler name to open its details page:
| Tab | Description |
|---|---|
| Basic Information | All configuration items of the crawler |
| Data Scope | Current collection scope. Click Modify Data Scope to update it. |
| Run Logs | Execution history for each collection task, including start time, duration, status, and data volume. Click View Logs to troubleshoot failed tasks. |
Before the first collection run, the table count and latest update time in the Data Scope tab are empty.
The following data sources do not support scope modification: EMR Hive, CDH Hive, Lindorm, ElasticSearch, Tablestore (OTS), MongoDB, and AnalyticDB for Spark within AnalyticDB for MySQL.
Trigger a manual collection:
Click Collect Metadata in the upper-right corner to immediately run a collection task. Use this to quickly reflect a newly created table in Data Map.
Collection scope and timeliness
Data tables
The following table lists the collection granularity and update timeliness for each supported data source.
| Data source type | Collection mode | Table/field timeliness | Partition timeliness | Lineage timeliness |
|---|---|---|---|---|
| MaxCompute | System default auto-collection (Instance) | Standard project: Real-time; External project: T+1 | Chinese mainland regions: Real-time; Overseas regions: T+1 | Real-time |
| DLF | System default auto-collection (Instance) | Real-time | Real-time | Supported for Serverless Spark, Serverless StarRocks, and Serverless Flink engines only |
| Hologres | Manually create crawler (Database) | Depends on schedule | Not supported | Real-time |
| EMR Hive | Manually create crawler (Instance) | Depends on schedule | Depends on schedule | Real-time |
| CDH Hive | Manually create crawler (Instance) | Depends on schedule | Real-time | Real-time |
| StarRocks | Manually create crawler (Database) | Instance mode: Real-time; Connection string mode: Depends on schedule | Not supported | Real-time (Instance mode only) |
| AnalyticDB for MySQL | Manually create crawler (Database) | Depends on schedule | Not supported | Real-time (requires a support ticket) |
| AnalyticDB for Spark | Manually create crawler (Instance) | Real-time | Not supported | Real-time |
| AnalyticDB for PostgreSQL | Manually create crawler (Database) | Depends on schedule | Not supported | Real-time |
| Lindorm | Manually create crawler (Instance) | Depends on schedule | Not supported | Real-time |
| Tablestore (OTS) | Manually create crawler (Instance) | Depends on schedule | Not supported | Not supported |
| MongoDB | Manually create crawler (Instance) | Depends on schedule | Not supported | Not supported |
| ElasticSearch | Manually create crawler (Instance) | Depends on schedule | Not supported | T+1 |
| Paimon Catalog | Manually create crawler (Catalog) | Depends on schedule | Depends on schedule | Not supported |
| Other sources (MySQL, PostgreSQL, SQL Server, Oracle, ClickHouse, SelectDB, etc.) | Manually create crawler (Database) | Depends on schedule | Not supported | Not supported |
For DLF metadata lineage and EMR Hive lineage, enable EMR_HOOK on the EMR cluster.
AnalyticDB for Spark and AnalyticDB for MySQL share the same metadata collection entry point.
Task code
Data Map supports searching task code and quickly locating it for the following sources. All are auto-collected.
| Code source | Collection scope |
|---|---|
| Data Studio | Nodes and code |
| Data Studio (Legacy) | Nodes and code |
| Data Analysis | SQL queries and code |
| DataService Studio | API data push services |
API assets
Data Map supports viewing DataService Studio API metadata. All are auto-collected.
| API type | Collection scope |
|---|---|
| Generated API (Codeless UI) | APIs created via codeless UI |
| Generated API (Code editor) | APIs created via code editor |
| Registered API | Registered APIs |
| Service Orchestration | Service orchestration workflows |
AI assets
Data Map supports viewing and managing AI assets, including lineage tracking for data and model origins. All are auto-collected.
| Asset type | Collection scope |
|---|---|
| Dataset | PAI: Create or register dataset; DataWorks: Create dataset |
| AI Model | PAI: Model training, register model, deploy model service |
| Algorithm Task | PAI: Training task, workflow task, distributed training task |
| Model Service | PAI: Deploy model service (EAS deployment) |
Workspace
Workspace metadata is auto-collected when a workspace is created in DataWorks.
Billing
Each collection task is billed at 0.25 CUs multiplied by the task runtime. See Resource group fees.
Each successful collection generates a scheduling instance, billed separately. See Scheduling instance fees.
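A worked example of the fee formula above, under a hypothetical per-CU-hour price (the real unit price varies by region; see Resource group fees):

```python
# Collection-task resource usage: 0.25 CUs * task runtime.
CU_RATE = 0.25           # CUs consumed per hour of task runtime
runtime_hours = 0.5      # e.g. a 30-minute collection run
price_per_cu_hour = 0.1  # hypothetical unit price; check actual pricing

cu_usage = CU_RATE * runtime_hours       # 0.125 CU-hours
cost = cu_usage * price_per_cu_hour
print(f"{cu_usage} CU-hours -> cost {cost}")
```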
Limitations
If the data source uses whitelist access control, add the resource group IP addresses to the database whitelist. See Metadata collection whitelist.
Cross-region collection is not recommended. Keep DataWorks and the data source in the same region. To collect metadata across regions, use a public IP address when creating the data source. See Data source management.
The MySQL metadata crawler does not support OceanBase data sources.
Metadata collection is not supported for AnalyticDB for MySQL data sources with SSL enabled.
Next steps
After metadata is collected, use Data Map to:
Search for tables and view their details, field information, partitions, and data previews. See Metadata details.
Analyze upstream and downstream lineage to understand the full data processing pipeline. See View lineages.
Organize assets into data albums for business-oriented data management. See Data albums.
FAQ
Collection times out or fails for MySQL and other database sources
This is typically a whitelist issue. Add the vSwitch CIDR block of the resource group to the database whitelist.
If the whitelist is already configured, verify that the resource group can reach the data source by running Test Network Connectivity on the crawler details page.
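As a supplementary check, you can verify TCP reachability of the database endpoint from any host inside the same network. This mirrors what the connectivity test does at the network layer; the host and port below are placeholders for your data source endpoint.

```python
# Minimal TCP reachability probe for a database endpoint.
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_reach("127.0.0.1", 3306))  # replace with your database host and port
```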