This topic describes how to create a crawler that collects metadata from an Object Storage Service (OSS) data source into DataWorks. You can then view the collected metadata on the Data Map page.

Prerequisites

An EMR cluster is associated with your workspace as a compute engine instance. For more information, see Associate an EMR compute engine instance with a workspace.

Limits

  • You cannot collect metadata across regions. Create the crawler in the region where the source metadata resides.
  • You must collect metadata over the Internet.
  • Metadata collection from OSS data sources is in invitational preview and is supported only in the China (Shanghai) region.
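The region limits above can be expressed as a small pre-flight check before you create a crawler. The following is a minimal sketch, not part of DataWorks: the function name check_limits is hypothetical, and the region ID cn-shanghai is assumed to correspond to China (Shanghai).

```python
from typing import List

# Assumption: "cn-shanghai" is the region ID for China (Shanghai),
# the only region the limits list as supported.
SUPPORTED_REGIONS = {"cn-shanghai"}


def check_limits(crawler_region: str, source_region: str) -> List[str]:
    """Return a list of limit violations; an empty list means no violation found."""
    problems = []
    # Limit: metadata cannot be collected across regions.
    if crawler_region != source_region:
        problems.append(
            f"crawler region {crawler_region!r} differs from source region {source_region!r}"
        )
    # Limit: OSS metadata collection is available only in China (Shanghai).
    if crawler_region not in SUPPORTED_REGIONS:
        problems.append(f"region {crawler_region!r} is not supported")
    return problems


if __name__ == "__main__":
    for problem in check_limits("cn-shanghai", "cn-hangzhou"):
        print(problem)
```

The check only mirrors the documented limits. It does not verify network access or preview eligibility.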

Procedure

  1. In the left-side navigation pane, click OSS.
  2. On the OSS Metadata Crawler page, click Create Crawler.
  3. In the Create Crawler dialog box, complete the wizard.
    1. In the Basic Information step, set the parameters.
      Create Crawler
      Parameter | Description
      Crawler Name | Required. The name of the crawler. You must set a unique name.
      Crawler Description | The description of the crawler.
      Data Source Type | The type of the data source from which you want to collect metadata. The default value is OSS and cannot be changed.
    2. Click Next.
    3. In the Select Collection Object step, set the parameters to specify the data source.
      Select Collection Object
      Parameter | Description
      Workspace | The workspace of the OSS data source from which you want to collect metadata.
      Data Source | The OSS data source from which you want to collect metadata. If no data source is available, go to the Data Source page and create an OSS data source. For more information, see Configure an OSS connection.
      Object Path | The path of the OSS object from which you want to collect metadata.
      Path Traversal | Specifies whether to traverse sub-paths in the specified path.
      Prefix | The prefix of the names of tables that the crawler automatically generates. By default, a generated table is named after the corresponding OSS object.
    4. Click Next.
    5. In the Configure Execution Plan step, configure an execution plan.
      Configure Execution Plan
      Parameter | Description
      Execution Plan | The execution plan of the crawler. Valid values: On-demand Execution, Monthly, Weekly, Daily, Hourly, and Customize.
      Update Options | The policy used to update the tables that store the collected metadata.
      Delete Options | The policy used to delete the tables that store the collected metadata.
    6. Click Next.
    7. In the Confirm Information step, check the information that you specified and click Confirm.
  4. On the OSS Metadata Crawler page, find the created crawler and click Run in the Actions column.
    After the crawler finishes running, click the number in the Updated Tables in Last Run or Added Tables in Last Run column to view the details of the updated or created tables.
    You can also perform the following operations on the OSS Metadata Crawler page:
    • Click Details in the Actions column of a crawler. In the Crawler Details dialog box, view the detailed information about the crawler.
    • Click Edit in the Actions column of a crawler. In the Edit Crawler dialog box, modify the configurations of the crawler.
    • Click Delete in the Actions column of a crawler. In the Confirm message, click OK to delete the crawler.
    • Click Stop in the Actions column of a crawler that is running to stop the crawler.
  5. View the metadata collected from the OSS data source.
    1. In the top navigation bar, click All Data.
    2. Select OSS from the drop-down list in the upper part of the page.
    3. Click the name of a table that stores the collected metadata and view the table details.
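
The wizard parameters in the procedure above can be modeled in code to clarify how Object Path, Path Traversal, and Prefix interact. The following is a minimal sketch under stated assumptions, not a DataWorks API: CrawlerConfig, objects_in_scope, and table_name are hypothetical names, and the naming rule (prefix plus the object file name without its extension) is an assumption, because this topic states only that a generated table is named after the corresponding OSS object.

```python
from dataclasses import dataclass
from posixpath import basename

# Valid values of the Execution Plan parameter, as listed in the wizard.
EXECUTION_PLANS = {
    "On-demand Execution", "Monthly", "Weekly", "Daily", "Hourly", "Customize",
}


@dataclass
class CrawlerConfig:
    """Hypothetical container for the wizard parameters (not a DataWorks type)."""
    crawler_name: str
    workspace: str
    data_source: str
    object_path: str
    path_traversal: bool = False
    prefix: str = ""
    execution_plan: str = "On-demand Execution"
    description: str = ""
    data_source_type: str = "OSS"  # fixed value in the wizard

    def validate(self) -> None:
        if not self.crawler_name.strip():
            raise ValueError("Crawler Name is required")
        if self.data_source_type != "OSS":
            raise ValueError("Data Source Type must be OSS")
        if self.execution_plan not in EXECUTION_PLANS:
            raise ValueError(f"unknown Execution Plan: {self.execution_plan!r}")


def objects_in_scope(config, keys):
    """Select object keys under object_path.

    Keys in sub-paths are included only when path_traversal is enabled.
    Keys that end with "/" (directory placeholders) are skipped.
    """
    root = config.object_path.rstrip("/") + "/"
    selected = []
    for key in keys:
        if not key.startswith(root) or key.endswith("/"):
            continue
        relative = key[len(root):]
        if config.path_traversal or "/" not in relative:
            selected.append(key)
    return selected


def table_name(config, key):
    """Assumed rule: prefix + object file name without its extension."""
    stem = basename(key).rsplit(".", 1)[0]
    return config.prefix + stem


if __name__ == "__main__":
    config = CrawlerConfig(
        crawler_name="demo_crawler",   # illustrative values only
        workspace="demo_workspace",
        data_source="demo_oss",
        object_path="logs/",
        prefix="ods_",
    )
    config.validate()
    keys = ["logs/a.csv", "logs/2023/b.csv", "other/c.csv"]
    print([table_name(config, k) for k in objects_in_scope(config, keys)])
    # prints ['ods_a'] because Path Traversal is off
```

With path_traversal set to True, the same key list also yields a table for logs/2023/b.csv. The actual crawler may sanitize or deduplicate names differently.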