DataWorks DataMap provides metadata crawlers that collect metadata of all or specific E-MapReduce (EMR) databases. To collect metadata of a single table more efficiently, you can use the manual table synchronization feature instead of a crawler. After the metadata is collected, you can view it in DataMap. This topic describes how to collect metadata of EMR tables to DataWorks.

Prerequisites

An EMR cluster is associated with your workspace as a compute engine instance. For information about how to associate an EMR cluster with a DataWorks workspace as an EMR compute engine instance, see Create and manage workspaces.

Background information

After you create a metadata crawler to collect full metadata of EMR tables, the system enables automated incremental metadata collection. This way, the metadata crawler can automatically synchronize incremental metadata of the EMR tables to DataWorks.

Limits

  • Only one metadata crawler can be created for each EMR cluster. For each crawler, you can select one or more databases whose metadata you want to collect.
  • The metadata collection capability varies based on the type of the EMR cluster and the metadata storage type. The following table provides the details.
    EMR cluster type | Metadata storage type | Collect metadata of a single table (use the manual table synchronization feature on the All Data page) | Collect metadata of a database (create a metadata crawler on the Data Discovery page)
    DataLake cluster in the new data lake scenario | DLF Unified Metadata | Supported | No configuration is required. The system automatically updates metadata.
    DataLake cluster in the new data lake scenario | Self-managed RDS or Built-in MySQL | Supported | The related configurations are required. You must manually update metadata based on your business requirements.
    Hadoop cluster in the old data lake scenario | DLF Unified Metadata | Supported | No configuration is required. The system automatically updates metadata.
    Hadoop cluster in the old data lake scenario | Self-managed RDS or Built-in MySQL | Not supported | The related configurations are required. You must manually update metadata based on your business requirements.
  • Only an Alibaba Cloud account, a RAM user to which the AliyunDataWorksFullAccess policy is attached, or a RAM user to which the metadata collection administrator role is assigned can collect metadata.
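
The support matrix in the limits above can be encoded as a small lookup, which may help when you script a pre-check before you ask someone to synchronize a table. The following is a minimal sketch, not a DataWorks API: the names `Support`, `SUPPORT_MATRIX`, and `can_sync_single_table`, and the shortened keys such as "DataLake" and "RDS/MySQL", are hypothetical labels introduced for illustration.

```python
from typing import NamedTuple


class Support(NamedTuple):
    single_table_sync: bool  # manual table synchronization on the All Data page
    database_collection: str  # how database-level metadata is kept up to date


# Keys are (EMR cluster type, metadata storage type), transcribed from the table.
# "RDS/MySQL" is a hypothetical shorthand for "Self-managed RDS or Built-in MySQL".
SUPPORT_MATRIX = {
    ("DataLake", "DLF Unified Metadata"): Support(True, "automatic"),
    ("DataLake", "RDS/MySQL"): Support(True, "manual"),
    ("Hadoop", "DLF Unified Metadata"): Support(True, "automatic"),
    ("Hadoop", "RDS/MySQL"): Support(False, "manual"),
}


def can_sync_single_table(cluster_type: str, storage_type: str) -> bool:
    """Return whether the table lists single-table synchronization as supported."""
    return SUPPORT_MATRIX[(cluster_type, storage_type)].single_table_sync
```

For example, `can_sync_single_table("Hadoop", "RDS/MySQL")` returns `False`, which matches the only "Not supported" cell in the table.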

Use the manual table synchronization feature to collect metadata of a single table

  1. Log on to the DataWorks console and go to the DataMap page. For more information, see Go to the homepage of DataMap.
  2. In the top navigation bar of the DataMap page, click All Data.
  3. In the upper-right corner of the page that appears, click Manually Synchronize Table. In the Manually Synchronize Table dialog box, select E-MapReduce for Data Source Type and configure the following parameters for the desired EMR table: Cluster ID, Database, and Table Name.
  4. After the configuration is complete, click Start Synchronize to synchronize metadata of the desired EMR table.

Create a metadata crawler to collect metadata of a database

  1. Log on to the DataWorks console and go to the DataMap page. For more information, see Go to the homepage of DataMap.
  2. In the top navigation bar, click Data Discovery.
  3. Open the Create Crawler dialog box.
    1. In the left-side navigation pane, choose Metadata collection > E-MapReduce.
    2. On the E-MapReduce Metadata Crawler page, click Create Crawler.
  4. Configure the metadata crawler.
    1. In the Create Crawler dialog box, use the Select a cluster drop-down list to select the cluster whose metadata you want to collect.
    2. Optional: From the Database drop-down list, select one or more databases whose metadata you want to collect. If you do not select a database, the crawler automatically collects metadata of all databases in the cluster.
    3. Click Authorize. On the Metadata tab of the page that appears, click Enable.
      Note
      • By default, after an EMR cluster is associated with a workspace as a compute engine instance, the workspace is authorized to collect metadata of the EMR cluster.
      • If an EMR cluster is associated with a DataWorks workspace but no metadata of the cluster has been collected, you must manually grant the workspace access permissions on the cluster.
    4. In the Confirm Operation message, click OK.
    5. Return to the Create Crawler dialog box on the Data Discovery page and click Refresh.
    6. After the value of the Authorization Status parameter changes to Authorized, click OK to create the crawler.
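
The two rules that shape a crawler's scope (one crawler per cluster, and "no database selected" meaning all databases) can be sketched as follows. This is an illustrative model only and does not call DataWorks. `CrawlerRegistry`, `databases_to_collect`, and the cluster ID `c-example` are hypothetical names, and the check for unknown database names is an added sanity check that the console's drop-down list makes unnecessary.

```python
from typing import Dict, Iterable, List, Optional


def databases_to_collect(selected: Optional[Iterable[str]],
                         all_databases: Iterable[str]) -> List[str]:
    """Return the databases a crawler would cover.

    An empty or missing selection means all databases in the cluster.
    """
    chosen = list(selected or [])
    available = list(all_databases)
    if not chosen:
        return available
    # Sanity check added for this sketch; not described in the procedure.
    unknown = [db for db in chosen if db not in available]
    if unknown:
        raise ValueError(f"Databases not found in cluster: {unknown}")
    return chosen


class CrawlerRegistry:
    """Tracks at most one crawler per EMR cluster."""

    def __init__(self) -> None:
        self._scope_by_cluster: Dict[str, List[str]] = {}

    def create(self, cluster_id: str, selected: Optional[Iterable[str]],
               all_databases: Iterable[str]) -> List[str]:
        if cluster_id in self._scope_by_cluster:
            raise ValueError(f"Cluster {cluster_id} already has a crawler")
        scope = databases_to_collect(selected, all_databases)
        self._scope_by_cluster[cluster_id] = scope
        return scope

    def delete(self, cluster_id: str) -> None:
        self._scope_by_cluster.pop(cluster_id, None)
```

In this model, changing the databases for a cluster means calling `delete` and then `create` again, which mirrors the delete-and-recreate flow that the console requires.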

Manage crawlers

On the E-MapReduce Metadata Crawler page, you can manage the created crawlers. For example, you can delete or run a crawler. The page contains the following areas:

Area 1: In this area, you can enter the name of a crawler to search for the crawler.
Note: Fuzzy match is supported. If you enter a keyword in the search box, crawlers whose names contain the keyword are displayed.

Area 2: In this area, you can view information about a created crawler, such as the status of the crawler, the databases from which the crawler collects metadata, and the time when the crawler was last run.
  • Running Status: the status of the crawler. A crawler can be in one of the following states:
    • Collected: The crawler finishes collecting metadata.
    • Never Synchronized: The crawler has not collected metadata.
    • Failed: The crawler fails to collect metadata. In this case, you can rerun the crawler.
  • Object: the databases from which the crawler collects metadata.
  • Last Run At: the time when the crawler was last run.
You can also perform the following operations on the crawler:
  • Run: Run the crawler to collect metadata based on the configurations of the crawler.
    • If the crawler has not been run, you can click Run in the Actions column. When the crawler enters the Collected state, metadata is collected.
    • If the crawler has been run, the Run button in the Actions column is dimmed. In this case, if you want to collect metadata from databases that are different from the previously selected ones, you must click Delete in the Actions column to delete the existing crawler. Then, create another crawler.
  • Delete: If you want to delete the crawler, click Delete in the Actions column. In the Delete Instance message, click OK.

Area 3: In this area, you can perform the following operations:
  • Manually Synchronize: If the metadata of a table is collected but the table cannot be found in DataMap, or the changes in a table are not synchronized to DataMap, click Manually Synchronize. In the Manually Synchronize Table dialog box, configure the Cluster ID, Database, and Table Name parameters, and click Start Synchronization. This way, you can manually synchronize data in the specified table to DataMap.
  • Refresh: Click Refresh to refresh the status of created crawlers.
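
The crawler states and the follow-up actions described above can be summarized as a small mapping. This is a sketch for runbooks or internal tooling, not part of DataWorks. `RunningStatus` and `suggested_action` are hypothetical names, and the returned strings paraphrase the descriptions on this page.

```python
from enum import Enum


class RunningStatus(Enum):
    COLLECTED = "Collected"
    NEVER_SYNCHRONIZED = "Never Synchronized"
    FAILED = "Failed"


def suggested_action(status: RunningStatus) -> str:
    """Map a crawler's Running Status to the follow-up this page describes."""
    if status is RunningStatus.NEVER_SYNCHRONIZED:
        return "Click Run in the Actions column to start collecting metadata."
    if status is RunningStatus.FAILED:
        return "Rerun the crawler."
    # Collected: nothing to do unless the database selection must change.
    return ("No action needed. To collect from different databases, "
            "delete the crawler and create another one.")
```

Because the enum values match the status labels shown on the page, `RunningStatus("Failed")` converts a label copied from the console into the corresponding member.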

What to do next

After metadata is collected, you can view the details of the collected metadata on the All Data tab in DataMap. For more information, see Search for tables.