DataWorks: Metadata collection

Last Updated: Apr 10, 2024

The metadata collection feature of DataWorks Data Map collects metadata from various data sources into Data Map, where you can manage the collected metadata in a centralized manner and view it by data source. This topic describes how to create a crawler to collect metadata from a data source to DataWorks.

Prerequisites

A data source is added to a workspace. For information about how to add a data source, see the topics in the Data source management directory.

Overview

After you add a data source to a workspace, DataWorks can collect the metadata of the data source. After you enable the metadata collection feature in Data Map, DataWorks first collects all existing metadata in a single full run, then collects incremental metadata every day, and aggregates the full and incremental metadata in Data Map. You can then perform various operations on the metadata in Data Map. For example, you can check the overview of data, manage tables by category and group, and view data lineage.

Note
  • If the default collection plan does not meet your business requirements, you can modify the collection plan. For more information, see Manage metadata crawlers.

  • After you associate a MaxCompute data source or an E-MapReduce (EMR) data source that uses Data Lake Formation (DLF) for metadata storage with DataStudio, the system automatically performs O&M operations on the crawler that is used to collect metadata from the MaxCompute or EMR data source. You do not need to manually manage the crawler.
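The full-then-incremental collection described in this overview can be pictured with a small sketch. This is an illustration only and does not reflect how DataWorks stores metadata internally: the dictionary layout, the `apply_increment` function, and the convention that `None` marks a dropped table are all assumptions made for the example.

```python
def apply_increment(catalog, increment):
    """Return a new catalog with one day's incremental changes applied.

    catalog:   dict mapping table name -> metadata (hypothetical layout)
    increment: dict mapping table name -> new metadata, or None if the
               table was dropped (an assumed convention for this sketch)
    """
    merged = dict(catalog)  # copy so the full snapshot is left untouched
    for table, meta in increment.items():
        if meta is None:
            merged.pop(table, None)  # table no longer exists in the source
        else:
            merged[table] = meta  # new table or changed definition
    return merged


# One full collection, followed by one daily incremental collection.
full_snapshot = {
    "orders": {"columns": ["id"]},
    "users": {"columns": ["id", "name"]},
}
daily_increment = {
    "orders": {"columns": ["id", "amount"]},  # column added
    "users": None,                            # table dropped
    "events": {"columns": ["ts"]},            # table created
}
aggregated = apply_increment(full_snapshot, daily_increment)
print(sorted(aggregated))
```

The aggregated view contains only the tables that still exist, with their latest definitions, which is the kind of result that you browse in Data Map.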

Supported data source types and metadata collection methods

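The following list describes, for each data source type, the metadata collection method, whether the crawler is available in Data Map, and the metadata update timeliness for tables and fields, partitions, and data lineage.

MaxCompute

  • Metadata collection method: associate a data source with DataStudio; automatic metadata collection

  • Crawler available in Data Map: No

  • Table/Field update timeliness: real-time for regular projects; T+1 for external projects

  • Partition update timeliness: real-time for regions in the Chinese mainland; T+1 for regions outside China

  • Data lineage update timeliness: T+1

EMR (Metadata storage method: DLF)

  Note

  Make sure that EMR_HOOK is enabled for a cluster.

  • Metadata collection method: register an open source cluster in SettingCenter; automatic metadata collection

  • Crawler available in Data Map: No

  • Table/Field update timeliness: Real-time

  • Partition update timeliness: Real-time

  • Data lineage update timeliness: Real-time

EMR (Metadata storage method: HMS or RDS)

  Note

  Make sure that EMR_HOOK is enabled for a cluster.

  • Metadata collection method: register an open source cluster in SettingCenter; automatic metadata collection

  • Crawler available in Data Map: Yes

  • Table/Field update timeliness: Real-time

  • Partition update timeliness: Real-time

  • Data lineage update timeliness: Real-time

Hologres

  • Metadata collection method: associate a data source with DataStudio; manual metadata collection

  • Crawler available in Data Map: Yes

  • Table/Field update timeliness: Depends on the custom collection plan

  • Partition update timeliness: Not supported

  • Data lineage update timeliness: Real-time

AnalyticDB for PostgreSQL

  • Metadata collection method: associate a data source with DataStudio; manual metadata collection

  • Crawler available in Data Map: Yes

  • Table/Field update timeliness: Depends on the custom collection plan

  • Partition update timeliness: Not supported

  • Data lineage update timeliness: Real-time

AnalyticDB for MySQL

  • Metadata collection method: associate a data source with DataStudio; manual metadata collection

  • Crawler available in Data Map: Yes

  • Table/Field update timeliness: Depends on the custom collection plan

  • Partition update timeliness: Not supported

  • Data lineage update timeliness: Real-time

  Note

  You must submit a ticket to enable the data lineage feature for an AnalyticDB for MySQL instance.

CDH Hive

  • Metadata collection method: register an open source cluster in SettingCenter; automatic metadata collection

  • Crawler available in Data Map: Yes

  • Table/Field update timeliness: Depends on the custom collection plan

  • Partition update timeliness: Real-time

  • Data lineage update timeliness: Real-time

DLF

  • Metadata collection method: automatic metadata collection

  • Crawler available in Data Map: No

  • Table/Field update timeliness: Real-time

  • Partition update timeliness: Real-time

  • Data lineage update timeliness: Not applicable

Other data source types, such as MySQL, PostgreSQL, SQL Server, Oracle, Tablestore, StarRocks, and ClickHouse

  • Metadata collection method: add a data source in SettingCenter; manual metadata collection

  • Crawler available in Data Map: Yes

  • Table/Field update timeliness: Depends on the custom collection plan

  • Partition update timeliness: Not supported

  • Data lineage update timeliness: Not supported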

Limits

  • You can collect only the metadata of data sources that you configured in the workspaces to which the current logon account belongs. If you want to collect metadata of data sources in another workspace, you can contact the workspace administrator to add your account to the workspace as a member. For more information, see Add workspace members and assign roles to them.

  • If you want to collect metadata of a data source for which whitelist-based access control is enabled, you must add the CIDR blocks or IP addresses of DataWorks in the region where the related workspace resides to the IP address whitelist of the data source. For more information, see Configure IP address whitelists for metadata collection.

  • We recommend that you do not collect metadata of a data source that resides in a different region from your workspace. If you want to collect metadata across regions, configure a public network address when you create a data source. For more information, see Add and manage data sources.

  • You cannot use a MySQL metadata crawler to collect the metadata of an OceanBase data source.
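The whitelist limit above comes down to a membership check: the data source accepts a connection only if the caller's address falls inside one of the whitelisted CIDR blocks or matches a whitelisted IP address. The following minimal sketch shows that check with the Python standard library. The `is_whitelisted` helper is hypothetical, and the addresses are reserved documentation ranges used as placeholders, not the actual DataWorks CIDR blocks for any region.

```python
import ipaddress


def is_whitelisted(ip, whitelist):
    """Return True if `ip` matches any CIDR block or single IP in `whitelist`."""
    addr = ipaddress.ip_address(ip)
    return any(
        # strict=False also accepts entries such as "203.0.113.7/24"
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in whitelist
    )


# Placeholder whitelist: one CIDR block and one single address.
whitelist = ["203.0.113.0/24", "198.51.100.10"]

print(is_whitelisted("203.0.113.42", whitelist))   # inside the /24 block
print(is_whitelisted("198.51.100.11", whitelist))  # not the whitelisted address
```

If a crawler run fails against a data source that has whitelist-based access control enabled, a check of this kind against the addresses published for your region is a reasonable first diagnostic step.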

Procedure

  1. Go to the DataMap page in the DataWorks console.

  2. In the left-side navigation pane of the DataMap page, click Collect Metadata.

    On the page that appears, you can switch between Data Source Perspective and Workspace Perspective to view or manage metadata collection from the selected perspective.

    • If you select Data Source Perspective, you can view the types of data sources that you configured in the workspaces to which the current logon account belongs and manage crawlers by data source type.

    • If you select Workspace Perspective, you can view the workspaces to which the current logon account belongs and manage the crawlers for data sources by workspace. If no data source is available in a workspace, you can click Create Data Source to go to the Data Source page and create a data source in SettingCenter.

View metadata crawlers

  • Overall statistics on metadata collection

    On the Collect Metadata page, you can switch between Data Source Perspective and Workspace Perspective to view the overall information about metadata collection. You can view the number of data sources for which a crawler is created in the selected perspective.

  • Metadata collection details

    To view the details of metadata collection, you can click the desired data source type or workspace, or click Manage in the upper-right corner of the desired data source type or workspace on the Data Source Perspective or Workspace Perspective tab. On the Data Sources for Which Crawler Is Created tab, you can view the following information about a crawler: Status, Execution Plan, Last Run At, Last Running Time/s, Average Running Time/s, and Tables Found During Last Run.

Manage metadata crawlers

Click Manage in the upper-right corner of the desired data source type or workspace. The Data Sources for Which Crawler Is Created tab appears. On this tab, you can view the list of data sources of the selected data source type or the list of data sources for which a crawler is created in the selected workspace. You can perform the following operations on an existing crawler.

Run a metadata crawler

To run a metadata crawler manually, find the desired data source on the Data Sources for Which Crawler Is Created tab and click Run in the Actions column of the data source. The crawler collects the metadata of the data source once.

Modify the collection plan of a metadata crawler

Go to the Data Sources for Which Crawler Is Created tab, find the desired data source, and then click Edit in the Actions column of the data source to modify the collection plan of the metadata crawler. The collection plans include manual metadata collection and periodic metadata collection.

  • Manual metadata collection: After you configure a metadata crawler for the desired data source and configure this collection plan for the crawler, you must manually trigger the crawler to collect metadata of the data source to Data Map and update the collected metadata based on your business requirements.

  • Periodic metadata collection: After you configure a metadata crawler for the desired data source and configure this collection plan for the crawler, you do not need to manually trigger the crawler to run. The system periodically collects metadata of the data source to Data Map and updates the collected metadata based on the collection plan.

Delete a metadata crawler

Go to the Data Sources for Which Crawler Is Created tab, find the desired data source, and then click Remove in the Actions column of the data source to delete the metadata crawler of the data source. After you delete the metadata crawler of the data source, the data source is moved to the Data Sources for Which No Crawler Is Created tab and the metadata of the data source is no longer collected.

Create a metadata crawler

After you add a data source or register a cluster to a workspace, you can go to Data Map to enable the metadata collection feature. You can view information about metadata collection for the data source or cluster on the Data Sources for Which Crawler Is Created tab.

If you want to recollect the metadata of a data source after you delete the metadata crawler of the data source, you can create a metadata crawler for the data source on the Data Sources for Which No Crawler Is Created tab.

  1. Click Data Sources for Which No Crawler Is Created.

  2. Find the desired data source and click Create Crawler in the Actions column of the data source. In the Configure Collection Plan dialog box, configure the parameters.

    Note

    Parameters that you need to configure in the Configure Collection Plan dialog box vary based on the data source type.

    Parameter

    Description

    Resource Group Name

    Select the resource group that is connected to the data source whose metadata you want to collect. You can select one of the following resource groups in Data Map based on your business requirements:

    • Default resource group named default

    • Your exclusive resource group for scheduling

    • Your exclusive resource group for Data Integration

    Test Network Connectivity

    After you select a resource group, if you want to re-test the network connectivity between the resource group and the data source whose metadata you want to collect, you can click Test Network Connectivity. If the message The connectivity test failed. is displayed, you can refer to the following instructions to locate the cause:

    Collection Plan

    The metadata collection plan. Valid values: Manual Crawling, Monthly, Weekly, Daily, and Hourly. The collection plan that is generated varies based on the collection cycle. The system collects metadata from the data source based on the collection cycle that you specify.

    • Manual Crawling: You can manually trigger the crawler to collect metadata of the data source to Data Map and update the collected metadata based on your business requirements.

    • Monthly: The system automatically collects metadata of the data source once at the specified time on each day of the month that you select.

      Important

      Some months do not have a 29th, 30th, or 31st day. We recommend that you do not select the last few days of a month.

    • Weekly: The system automatically collects metadata of the data source once at the specified time on each day of the week that you select.

      If you do not configure the Time parameter, the system automatically collects metadata of the data source once at 00:00:00 on the selected days of each week.

    • Daily: The system automatically collects metadata of the data source once at a specified point in time of each day.

    • Hourly: The system automatically collects metadata of the data source once at the Nth minute of each hour.

  3. Verify that the configurations of the crawler are correct and click Confirmation.

    The system collects metadata of the data source based on the configured collection plan. If you select Manual Crawling, you can find the desired data source on the Data Sources for Which Crawler Is Created tab and click Run in the Actions column of the data source to manually collect the metadata of the data source based on your business requirements.
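The periodic collection plans above can be reasoned about with a short scheduling sketch. It is not the DataWorks scheduler: the function names are hypothetical, weekdays are numbered with Monday as 0 (a Python convention, not a console setting), and how DataWorks treats a monthly plan in a month that lacks the selected day is not stated in this topic. The sketch only shows which months lack a given day, the next run of an Hourly plan, and the next run of a Weekly plan that falls back to 00:00:00 when no time is configured.

```python
import calendar
from datetime import datetime, time, timedelta


def months_missing_day(year, day):
    """Return the month numbers of `year` that have no such day of the month."""
    return [m for m in range(1, 13) if calendar.monthrange(year, m)[1] < day]


def next_hourly_run(now, minute):
    """Return the first time at or after `now` that falls on `minute` of an hour."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate < now:
        candidate += timedelta(hours=1)
    return candidate


def next_weekly_run(now, weekdays, at=time(0, 0, 0)):
    """Return the first run at or after `now` on one of `weekdays` (Monday=0).

    `at` defaults to 00:00:00, mirroring the default of the Weekly plan.
    """
    for offset in range(8):  # today plus the next seven days
        day = (now + timedelta(days=offset)).date()
        candidate = datetime.combine(day, at)
        if day.weekday() in weekdays and candidate >= now:
            return candidate
    raise ValueError("weekdays must contain at least one value from 0 to 6")


now = datetime(2024, 4, 10, 9, 45)  # a Wednesday
print(months_missing_day(2024, 31))
print(next_hourly_run(now, 30))
print(next_weekly_run(now, {0}))
```

The first helper illustrates why selecting the last few days of a month for a Monthly plan is discouraged: a plan set to the 31st has no matching day in five months of the year.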

What to do next

After the metadata is collected, you can perform various operations on the metadata in Data Map. For example, you can check the overview of data, manage tables by category and group, and view data lineage. For more information, see View resource information, Search for tables, and Table management from the business perspective: Data albums.