You can use the sampling crawler of DataWorks to sample a CDH Hive table for sensitive data detection in Data Security Guard. If you configure de-identification rules in Data Security Guard, data of the sensitive fields that match the rules is de-identified when you preview data on the details page of a table in Data Map. This topic describes how to create a CDH Hive sampling crawler.

Prerequisites

Limits

  • You can use sampling crawlers only in the China (Shanghai) and China (Chengdu) regions.
  • Data sampling for databases by cluster is supported. You can create only one sampling crawler for each cluster, but you can select one or more databases from which sample data is to be collected for each sampling crawler.
  • By default, if you do not specify a database for a cluster, data in all databases in the cluster is sampled.
  • You can collect sample data by using an Alibaba Cloud account or a RAM user to which the AliyunDataWorksFullAccess policy is attached.
  • After you create, modify, or delete CDH Hive tables, you must recollect the sample data.
  • The system collects sample data only on demand, that is, only when you run a sampling crawler.

Create a sampling crawler

  1. Log on to the DataWorks console and go to the Data Map page. For more information, see Go to the homepage.
  2. In the top navigation bar, click Data Discovery.
  3. Create a sampling crawler.
    1. In the left-side navigation pane, choose Data sampling collector > CDH Hive.
    2. On the CDH Hive Data sampling collector page, click New collector. The New Data Sampling Collector dialog box appears.
  4. Configure the sampling crawler.
    In the New Data Sampling Collector dialog box, configure the following parameters:
      • Cluster: The CDH cluster for which you want to collect sample data. You can select one of the CDH clusters that are associated with DataWorks workspaces in the current region. For more information, see Integrate and use CDH.
      • Database: The database from which sample data is collected. By default, if you do not set this parameter, data in all databases in the cluster is sampled.
      • Exclusive Resource Group: The exclusive resource group for scheduling that is used to connect to the CDH cluster.
      • Sampling collection service: The CDH component that is used to collect sample data. For more information, see Integrate and use CDH.
      • Collection account number: The account that is used to collect sample data. This parameter is automatically set based on the mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts that you configured when you associated the CDH cluster with the workspace on the Workspace Management page. For more information, see Associate a CDH compute engine instance with a workspace.
      • Execution Plan: The frequency at which sample data is collected. This parameter can be set only to On-demand Execution.
  5. Click OK.
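
The constraints described above can be pictured with a small model. The following Python sketch is illustrative only and is not a DataWorks API: the class names, field names, and sample values are hypothetical. It assumes that an empty database selection means "all databases in the cluster", that On-demand Execution is the only valid execution plan, and that a cluster can have at most one sampling crawler, as stated in this topic.

```python
from dataclasses import dataclass, field

ON_DEMAND = "On-demand Execution"  # the only execution plan this topic lists


@dataclass
class SamplingCrawlerConfig:
    """Hypothetical mirror of the dialog box parameters (not a DataWorks API)."""
    cluster: str
    resource_group: str
    databases: list[str] = field(default_factory=list)  # empty = all databases
    execution_plan: str = ON_DEMAND

    def validate(self) -> None:
        # Assumption: a crawler cannot exist without a target cluster.
        if not self.cluster:
            raise ValueError("a cluster must be selected")
        if self.execution_plan != ON_DEMAND:
            raise ValueError(f"Execution Plan must be {ON_DEMAND!r}")

    def databases_to_sample(self, all_databases: list[str]) -> list[str]:
        """Return the selected databases, or every database if none is selected."""
        return list(self.databases) if self.databases else list(all_databases)


class CrawlerRegistry:
    """Enforces the limit of one sampling crawler per cluster."""

    def __init__(self) -> None:
        self._by_cluster: dict[str, SamplingCrawlerConfig] = {}

    def create(self, config: SamplingCrawlerConfig) -> None:
        config.validate()
        if config.cluster in self._by_cluster:
            raise ValueError(
                f"cluster {config.cluster!r} already has a sampling crawler"
            )
        self._by_cluster[config.cluster] = config


# Hypothetical usage: no database selected, so all databases are sampled.
registry = CrawlerRegistry()
config = SamplingCrawlerConfig(cluster="cdh_demo", resource_group="rg_demo")
registry.create(config)
print(config.databases_to_sample(["db_a", "db_b"]))
```

Creating a second crawler for `cdh_demo` in this model raises an error, which corresponds to the one-crawler-per-cluster limit. To change the sampled databases, you edit the existing crawler instead.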

Manage sampling crawlers

On the CDH Hive Data sampling collector page, you can view, modify, and delete the sampling crawlers that you created.
Area 1: Search for a sampling crawler by entering its name in the search box.
Note: Fuzzy match is supported. If you enter a keyword in the search box, the sampling crawlers whose names match the keyword are displayed.
Area 2: View the details of a sampling crawler in the Status, Execution Plan, Last Run At, Last Consumed Time, and Average Running Time columns.
You can also perform the following operations on the sampling crawler:
  • Details: View the configurations of the sampling crawler.
  • Edit: Modify the configurations of the sampling crawler, including the Cluster and Exclusive Resource Group parameters.
  • Delete: Delete the sampling crawler.
  • Run: Run the sampling crawler to collect sample data based on the configurations. After you run the sampling crawler, detected sensitive fields are displayed in Data Security Guard. If you configure de-identification rules, data of the sensitive fields that match the rules is de-identified when you preview data on the details page of a table in Data Map.
  • Stop: Stop the sampling crawler.
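
This topic does not specify how the fuzzy match in the search box works. As a rough mental model only, the following Python sketch assumes that it behaves like a case-insensitive substring match on crawler names. The function name and the sample names are hypothetical.

```python
def search_crawlers(names: list[str], keyword: str) -> list[str]:
    """Return the names that contain the keyword, ignoring case (assumed semantics)."""
    needle = keyword.strip().lower()
    if not needle:
        return list(names)  # assumption: an empty keyword lists every crawler
    return [name for name in names if needle in name.lower()]


# Hypothetical crawler names.
crawlers = ["cdh_prod_sampler", "cdh_test_sampler", "finance_crawler"]
print(search_crawlers(crawlers, "CDH"))  # ['cdh_prod_sampler', 'cdh_test_sampler']
```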

What to do next

After sample data is collected for CDH Hive tables, the detected sensitive fields are displayed in Data Security Guard. If you configure de-identification rules in Data Security Guard, data of the sensitive fields that match the rules is de-identified when you preview data on the details page of a table in Data Map. For more information, see Overview and View the details of a table.
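
To illustrate what de-identified preview data can look like, the following Python sketch masks the values of fields that are flagged as sensitive. It is a conceptual example only. The masking strategy (keep two characters at each end), the function names, and the sample row are assumptions, and the de-identification algorithms that Data Security Guard actually applies depend on the rules you configure.

```python
def mask_keep_edges(value: str, keep: int = 2) -> str:
    """Replace the middle of a value with asterisks, keeping `keep` characters at each end."""
    if len(value) <= 2 * keep:
        return "*" * len(value)  # too short to keep the edges: mask everything
    return value[:keep] + "*" * (len(value) - 2 * keep) + value[-keep:]


def preview_rows(rows: list[dict], sensitive_fields: set[str]) -> list[dict]:
    """Return copies of the rows with the values of sensitive fields masked."""
    return [
        {
            col: mask_keep_edges(str(val)) if col in sensitive_fields else val
            for col, val in row.items()
        }
        for row in rows
    ]


# Hypothetical sample row; "phone" stands in for a field detected as sensitive.
rows = [{"id": 1, "phone": "13800001234"}]
print(preview_rows(rows, {"phone"}))  # [{'id': 1, 'phone': '13*******34'}]
```

The sketch returns masked copies and leaves the input rows unchanged, which mirrors the behavior described in this topic: de-identification affects the data that you preview in Data Map.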