You can use the sampling crawler of DataWorks to sample a CDH Hive table for sensitive data detection in Data Security Guard. If you configure de-identification rules in Data Security Guard, data of the sensitive fields that match the rules is de-identified when you preview data on the details page of a table in Data Map. This topic describes how to create a CDH Hive sampling crawler.
Prerequisites
- A DataWorks exclusive resource group for scheduling is created. For more information, see Create and use an exclusive resource group for scheduling.
- A CDH cluster is associated with the current DataWorks workspace. For more information, see Associate a CDH compute engine instance with a workspace.
- The Data Security Guard service is activated and sensitive data detection rules are configured. For more information, see Activate Data Security Guard and Manage sensitive field types.
Limits
- You can use sampling crawlers only in the China (Shanghai) and China (Chengdu) regions.
- Data sampling is configured per cluster. You can create only one sampling crawler for each cluster, but each sampling crawler can collect sample data from one or more databases that you select.
- By default, if you do not specify a database for a cluster, data in all databases of the cluster is sampled.
- Sample data can be collected by using an Alibaba Cloud account or as a RAM user to which the AliyunDataWorksFullAccess policy is attached.
- After you create, modify, or delete CDH Hive tables, you must recollect the sample data.
- The system collects sample data only on demand. Periodic collection is not supported.
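The database-selection and one-crawler-per-cluster rules above can be illustrated with a small sketch. This is not a DataWorks API: `databases_to_sample` and `CrawlerRegistry` are hypothetical names that only model the rules stated in this topic, and the cluster and database names are placeholders.

```python
from typing import Dict, Iterable, List, Optional


def databases_to_sample(
    selected: Optional[Iterable[str]], cluster_databases: Iterable[str]
) -> List[str]:
    """Model the default rule: an empty selection means all databases."""
    all_dbs = list(cluster_databases)
    chosen = list(selected or [])
    if not chosen:
        # No database specified: every database in the cluster is sampled.
        return all_dbs
    unknown = [db for db in chosen if db not in all_dbs]
    if unknown:
        raise ValueError(f"unknown databases: {unknown}")
    return chosen


class CrawlerRegistry:
    """Hypothetical in-memory model of 'one sampling crawler per cluster'."""

    def __init__(self) -> None:
        self._by_cluster: Dict[str, List[str]] = {}

    def create(self, cluster: str, databases: Optional[Iterable[str]] = None) -> None:
        if cluster in self._by_cluster:
            # A second crawler for the same cluster is rejected.
            raise ValueError(f"cluster {cluster!r} already has a sampling crawler")
        self._by_cluster[cluster] = list(databases or [])

    def databases_for(self, cluster: str) -> List[str]:
        return list(self._by_cluster[cluster])


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    print(databases_to_sample([], ["sales", "hr"]))         # ['sales', 'hr']
    print(databases_to_sample(["hr"], ["sales", "hr"]))     # ['hr']
```

To sample a different set of databases in a cluster that already has a crawler, this model would require editing the existing crawler rather than creating a second one.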
Create a sampling crawler
- Log on to the DataWorks console and go to the Data Map page. For more information, see Go to the homepage.
- In the top navigation bar, click Data Discovery.
- Create a sampling crawler.
- In the left-side navigation pane, choose .
- On the CDH HiveData sampling collector page, click New collector. The New Data Sampling Collector dialog box appears.
- Configure the sampling crawler.
| Parameter | Description |
| --- | --- |
| Cluster | The CDH cluster for which you want to collect sample data. You can select one from the CDH clusters that are associated with DataWorks workspaces in the current region. For more information, see Integrate and use CDH. |
| Database | The database for which data sampling is performed. By default, if you do not set this parameter, data in all databases of the cluster is sampled. |
| Exclusive Resource Group | The exclusive resource group for scheduling that is used to connect to the CDH cluster. |
| Sampling collection service | The CDH component that is used to collect sample data. For more information, see Integrate and use CDH. |
| Collection account number | The account that is used to collect sample data. This parameter is automatically set based on the mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts that you configured when you associated the CDH cluster on the Workspace Management page. For more information, see Associate a CDH compute engine instance with a workspace. |
| Execution Plan | The frequency at which sample data is collected. This parameter can be set only to On-demand Execution. |
- Click OK.
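As a way to keep track of what the dialog box asks for, the parameters above can be written down as a small record. The following is a minimal sketch only: `SamplingCrawlerConfig` and its field names are hypothetical and do not correspond to any DataWorks API or SDK, and all values in the usage lines are placeholders. The only behavior taken from this topic is that Execution Plan accepts On-demand Execution alone and that an empty database list means all databases.

```python
from dataclasses import dataclass, field
from typing import List

# The only execution plan this topic says is supported.
ON_DEMAND = "On-demand Execution"


@dataclass(frozen=True)
class SamplingCrawlerConfig:
    """Hypothetical record mirroring the dialog box parameters."""

    cluster: str
    exclusive_resource_group: str
    sampling_service: str
    collection_account: str
    databases: List[str] = field(default_factory=list)
    execution_plan: str = ON_DEMAND

    def __post_init__(self) -> None:
        if not self.cluster:
            raise ValueError("cluster is required")
        if self.execution_plan != ON_DEMAND:
            raise ValueError(f"execution plan must be {ON_DEMAND!r}")

    def scope(self) -> str:
        """Describe which databases would be sampled."""
        if not self.databases:
            return "all databases"
        return ", ".join(self.databases)


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    cfg = SamplingCrawlerConfig(
        cluster="example_cdh_cluster",
        exclusive_resource_group="example_resource_group",
        sampling_service="example_component",
        collection_account="example_account",
    )
    print(cfg.scope())
```

In the console, the collection account is filled in automatically from the account mappings, so the `collection_account` field here is only a stand-in for that derived value.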
Manage sampling crawlers
|1||In this area, you can search for a sampling crawler by entering its name in the search box.
Note Fuzzy match is supported. If you enter a keyword in the search box, sampling crawlers whose names match the keyword are displayed.
|2||In this area, you can view the details of a sampling crawler in the Status, Execution Plan, Last Run At, Last Consumed Time, and Average Running Time columns.
You can also perform the following operations on the sampling crawler: