This topic describes how to create a crawler to collect metadata from an E-MapReduce (EMR) data source. You can view the collected metadata on the Data Map page.

Prerequisites

An EMR cluster is associated with your workspace as a compute engine instance. For more information, see Associate an EMR compute engine instance with a workspace.

Background information

After full metadata from an EMR data source is collected, the system automatically synchronizes new metadata from the data source.

Limits

  • Only one crawler can be created for each cluster. You can select one or more databases from which metadata is to be collected for each crawler.
  • Metadata can be collected only by an Alibaba Cloud account, a RAM user to which the AliyunDataWorksFullAccess policy is attached, or a RAM user that is assigned the metadata collection administrator role.

Create a crawler

  1. Log on to the DataWorks console and go to the Data Map page. For more information, see Go to the homepage.
  2. In the top navigation bar, click Data Discovery.
  3. Open the Create Crawler dialog box.
    1. In the left-side navigation pane, choose Metadata collection > E-MapReduce.
    2. On the E-MapReduceMetadata Crawler page, click Create Crawler.
  4. Configure the crawler.
    1. In the Create Crawler dialog box, select the cluster whose metadata you want to collect from the Select a cluster drop-down list.
    2. Optional: From the Database drop-down list, select one or more databases whose metadata you want to collect. If you do not select a database, the crawler collects metadata from all databases in the cluster.
    3. Click Authorize. On the Metadata tab of the page that appears, click Enable.
      Note
      • By default, after an EMR cluster is associated with a workspace as a compute engine instance, the workspace is authorized to collect metadata from the EMR cluster.
      • You must manually grant permissions for EMR clusters that are associated with DataWorks and from which metadata has not been collected.
    4. In the Confirm Operation message, click OK.
    5. Return to the Create Crawler dialog box on the Data Discovery page and click Refresh.
    6. After the value of the Authorization Status parameter changes to Authorized, click OK to create the crawler.
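
The rules that apply when you create a crawler can be summarized in a minimal Python sketch. It is an illustrative in-memory model, not a DataWorks API: the CrawlerRegistry class, its create method, and the cluster IDs and database names are hypothetical. The sketch covers three rules from this topic: only one crawler can be created for each cluster, an empty database selection means all databases, and the crawler can be created only after the authorization status is Authorized.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlerRegistry:
    """Hypothetical in-memory model of the creation rules; not a DataWorks API."""

    # Maps a cluster ID to the databases its crawler collects metadata from.
    crawlers: dict = field(default_factory=dict)

    def create(self, cluster_id, all_databases, selected=None, authorized=False):
        # Only one crawler can be created for each cluster.
        if cluster_id in self.crawlers:
            raise ValueError("only one crawler can be created for each cluster")
        # The crawler can be created only after authorization is granted.
        if not authorized:
            raise PermissionError("Authorization Status must be Authorized")
        # No selection means metadata is collected from all databases.
        targets = list(selected) if selected else list(all_databases)
        self.crawlers[cluster_id] = targets
        return targets
```

For example, in this model, create("C-1", ["db1", "db2"], authorized=True) registers a crawler that covers both databases, and a second call for "C-1" is rejected until the existing crawler is deleted.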

Manage crawlers

On the E-MapReduceMetadata Crawler page, you can manage the created crawlers. For example, you can delete or run a crawler.
No. Description
1 In this section, you can enter the name of a crawler to search for the crawler.
Note The fuzzy match mode is supported. If you enter a keyword in the search box, crawlers whose names or data source names contain the keyword are displayed.
2 In this section, you can view detailed information about a created crawler, such as the status of the crawler, the databases from which the crawler collects metadata, and the last time when the crawler was run.
  • Running Status: the status of the crawler. Valid values:
    • Collected: Metadata is collected.
    • Never Synchronized: Metadata has not been collected.
    • Failed: The crawler failed to collect metadata. In this case, rerun the crawler. If the collection fails again, submit a ticket.
  • Object: the databases from which the crawler collects metadata.
  • Last Run At: the last time when the crawler was run.
You can also perform the following operations on the crawler:
  • Run: Run the crawler to collect metadata based on the configurations of the crawler.
    • If the crawler has not been run, you can click Run in the Actions column. When the crawler enters the Collected state, metadata is collected.
    • If the crawler has been run, the Run button in the Actions column is dimmed. In this case, if you want to collect metadata from databases that are different from the previously selected ones, you must click Delete in the Actions column to delete the existing crawler. Then, create another crawler.
  • Delete: If you want to delete the crawler, click Delete in the Actions column. In the Delete Instance message, click OK.
3 In this section, you can perform the following operations:
  • Manually Synchronize: If the metadata of a table is collected but the table cannot be found in Data Map, or changes to a table are not synchronized to Data Map, click Manually Synchronize. In the Manually Synchronize Table dialog box, set the Cluster ID, Database, and Table Name parameters, and click Start Synchronization to synchronize the specified table to Data Map.
  • Refresh: Click Refresh to refresh the status of created crawlers.
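
The search and Run behavior described in the preceding table can be modeled with a short Python sketch. It is an illustration only: the Crawler class and the search and can_run functions are hypothetical names, not part of DataWorks. Two assumptions go beyond what this topic states: the fuzzy match is treated as a case-insensitive substring match, and a crawler in the Failed state is assumed to be rerun by using the same Run button.

```python
from dataclasses import dataclass
from enum import Enum


class RunningStatus(Enum):
    COLLECTED = "Collected"
    NEVER_SYNCHRONIZED = "Never Synchronized"
    FAILED = "Failed"


@dataclass
class Crawler:
    name: str
    data_source: str
    status: RunningStatus = RunningStatus.NEVER_SYNCHRONIZED


def search(crawlers, keyword):
    """Return crawlers whose name or data source name contains the keyword.

    Assumption: matching ignores case; this topic does not specify it.
    """
    kw = keyword.lower()
    return [c for c in crawlers
            if kw in c.name.lower() or kw in c.data_source.lower()]


def can_run(crawler):
    """Return True if Run is available for the crawler.

    Assumption: a Failed crawler can be rerun; a Collected crawler must be
    deleted and recreated to collect metadata from different databases.
    """
    return crawler.status in (RunningStatus.NEVER_SYNCHRONIZED,
                              RunningStatus.FAILED)
```

In this model, searching for a keyword returns every crawler whose name or data source name contains it, and can_run returns False only for a crawler that is already in the Collected state.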

What to do next

After metadata is collected, you can view the details of the collected metadata on the All Data page in Data Map. For more information, see Search for and filter tables.