DataWorks lets you collect metadata, such as table schemas and table lineage information, into Data Map, where you can view the schema of a table and the relationships between tables. This topic describes how to create a crawler that collects metadata from a CDH Hive data source. After the metadata is collected, you can view it in Data Map.

Prerequisites

A CDH cluster is associated with the current DataWorks workspace. For more information, see Associate a CDH compute engine instance with a workspace.

Background information

After the full metadata of a CDH Hive data source is collected, the system automatically synchronizes new metadata from the data source.

Limits

  • You cannot collect metadata across regions. You must create a crawler in the region in which the metadata to be collected resides.
  • You must collect metadata over the Internet.

Create a crawler

  1. Go to the Data Discovery page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which the metadata to be collected resides. Find the workspace and click Data Development in the Actions column.
    4. On the DataStudio page, click the icon in the upper-left corner and choose All Products > Data governance > DataMap.
    5. In the top navigation bar, click Data Discovery to go to the Data Discovery page.
  2. Create a crawler.
    1. In the left-side navigation pane, choose Metadata collection > CDH Hive.
    2. On the CDH Hive Metadata Crawler page, click Create Crawler.
  3. Configure the crawler.
    1. Select a CDH cluster.
      In the Create Crawler dialog box, use the Cluster drop-down list to select the CDH cluster from which you want to collect metadata.
    2. Configure an execution plan.
      In the Create Crawler dialog box, select an execution plan from the Execution Plan drop-down list.
      The Execution Plan drop-down list provides the following options: On-demand Execution, Monthly, Weekly, Daily, Hourly, and Customize. The system collects metadata from the CDH Hive data source based on the execution cycle that you specify, and the generated execution plan varies with that cycle. The options are described as follows:
      • On-demand Execution: You must manually run the crawler. The system collects metadata from the CDH Hive data source based on your business requirements.
      • Monthly: The system automatically collects metadata from the CDH Hive data source once at a specific time on several specific days of each month.
        Notice Some months do not have a 29th, 30th, or 31st day. In those months, the system does not collect metadata from the CDH Hive data source on the missing days. We recommend that you do not select the last few days of a month.
        For example, you can configure the system to automatically collect metadata from the CDH Hive data source once at 09:00 on the 1st, 11th, and 21st days of each month. An expression is automatically generated for the Cron Expression parameter based on the values of the Date and Time parameters.
      • Weekly: The system automatically collects metadata from the CDH Hive data source once at a specific time on several specific days of each week.
        For example, you can configure the system to automatically collect metadata from the CDH Hive data source once at 03:00 on Sunday and Monday of each week. An expression is automatically generated for the Cron Expression parameter based on the values of the Date and Time parameters.
        If the Time parameter is not set, the system automatically collects metadata from the CDH Hive data source once at 00:00:00 on the specified days of each week.
      • Daily: The system automatically collects metadata from the CDH Hive data source once at a specific time of each day.
        For example, you can configure the system to automatically collect metadata from the CDH Hive data source once at 01:00 each day. An expression is automatically generated for the Cron Expression parameter based on the values of the Date and Time parameters.
      • Hourly: The system automatically collects metadata from the CDH Hive data source once at each specified minute of each hour. Each specified minute must be a multiple of 5.
        Note For a CDH Hive metadata collection task that is run each hour, you can set the time only to a multiple of 5 minutes.
        For example, you can configure the system to automatically collect metadata from the CDH Hive data source at the 5th and 10th minutes of each hour. An expression is automatically generated for the Cron Expression parameter based on the values of the Date and Time parameters.
      • Customize: You can enter a cron expression in the Cron Expression field. The system automatically collects metadata at the times that match the cron expression.
    3. Click OK.
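
The execution plans described in step 3 can be illustrated with a minimal Python sketch. It builds a cron expression for each example schedule, enforces the multiple-of-5 rule for hourly plans, and lists the months that lack a given day. This topic does not show the exact expression that DataWorks generates, so the six-field Quartz-style layout used here (second, minute, hour, day of month, month, day of week) is an assumption, and all function names are hypothetical. Compare the result with the value shown for the Cron Expression parameter before you rely on it.

```python
import calendar


def monthly_cron(days, hour, minute=0):
    """Run once at hour:minute on the given days of each month (assumed Quartz-style layout)."""
    day_list = ",".join(str(d) for d in sorted(days))
    return f"0 {minute} {hour} {day_list} * ?"


def weekly_cron(weekdays, hour=0, minute=0):
    """Run once on the given weekdays; the time defaults to 00:00:00 when it is not set."""
    return f"0 {minute} {hour} ? * {','.join(weekdays)}"


def daily_cron(hour, minute=0):
    """Run once at hour:minute each day."""
    return f"0 {minute} {hour} * * ?"


def hourly_cron(minutes):
    """Run at the given minutes of each hour; each minute must be a multiple of 5."""
    invalid = [m for m in minutes if not 0 <= m < 60 or m % 5 != 0]
    if invalid:
        raise ValueError(f"minutes must be multiples of 5 between 0 and 55: {invalid}")
    return f"0 {','.join(str(m) for m in sorted(minutes))} * * * ?"


def months_without_day(day, year):
    """Return the month numbers of `year` that have no such day of month."""
    return [m for m in range(1, 13) if calendar.monthrange(year, m)[1] < day]


print(monthly_cron([1, 11, 21], 9))      # 0 0 9 1,11,21 * ?
print(weekly_cron(["SUN", "MON"], 3))    # 0 0 3 ? * SUN,MON
print(daily_cron(1))                     # 0 0 1 * * ?
print(hourly_cron([5, 10]))              # 0 5,10 * * * ?
print(months_without_day(31, 2023))      # [2, 4, 6, 9, 11]
```

The last line shows why selecting the last few days of a month is discouraged: a plan that runs on the 31st day would skip five months of 2023.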

Manage the crawler

On the CDH Hive Metadata Crawler page, you can view, edit, and delete the created crawler.
The page contains the following areas:
  • Area 1: In this area, you can enter the name of a crawler to search for the crawler.
    Note The fuzzy match mode is supported. If you enter a keyword in the search box, crawlers whose names contain the keyword are displayed.
  • Area 2: In this area, you can view the details of the crawler in the Status, Execution Plan, Last Run At, Last Consumed Time, and Average Running Time columns.
    You can also perform the following operations on the crawler:
      • Details: View the CDH cluster and the execution plan that are configured for the crawler.
      • Edit: Modify the CDH cluster and the execution plan that are configured for the crawler.
      • Delete: Delete the crawler.
      • Run: Run the crawler to collect metadata based on the configurations.
      • Stop: Stop the crawler.
    Note Run and Stop are dimmed in the Actions column unless you select On-demand Execution from the Execution Plan drop-down list.
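
The fuzzy match behavior of the search box can be pictured as a substring filter. The following Python sketch is illustrative only: the function name and the crawler names are hypothetical, and this topic does not state how the search box handles letter case, so the sketch uses a plain case-sensitive containment check.

```python
def search_crawlers(names, keyword):
    """Return the crawler names that contain the keyword as a substring."""
    return [name for name in names if keyword in name]


# Hypothetical crawler names, used only for illustration.
crawlers = ["cdh_hive_daily", "cdh_hive_weekly", "sales_on_demand"]
print(search_crawlers(crawlers, "hive"))  # ['cdh_hive_daily', 'cdh_hive_weekly']
```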

What to do next

After the metadata is collected, you can view the details of the collected metadata on the All Data page of Data Map.