You can use the sampling crawler of DataWorks to sample a CDH Hive table for sensitive data detection in Data Security Guard. If you configure de-identification rules in Data Security Guard, data of the sensitive fields that match the rules is de-identified when you preview data on the details page of a table in Data Map. This topic describes how to create a CDH Hive sampling crawler.
Prerequisites
- A DataWorks exclusive resource group for scheduling is created. For more information, see Create and use an exclusive resource group for scheduling.
- A CDH cluster is associated with the current DataWorks workspace. For more information, see Associate a CDH compute engine instance with a workspace.
- The Data Security Guard service is activated and sensitive data detection rules are configured. For more information, see Activate Data Security Guard and Manage sensitive field types.
Limits
- You can use sampling crawlers only in the China (Shanghai) and China (Chengdu) regions.
- Databases are sampled on a per-cluster basis. You can create only one sampling crawler for each cluster, but each sampling crawler can collect sample data from one or more databases that you select.
- By default, if you do not specify a database for a cluster, data in all databases of the cluster is sampled.
- Sample data can be collected by using an Alibaba Cloud account or as a RAM user to which the AliyunDataWorksFullAccess policy is attached.
- After you create, modify, or delete CDH Hive tables, you must recollect the sample data.
- The system collects sample data only based on your requirements.
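The cluster and database limits above can be pictured with a small model. The following is a minimal sketch for illustration only: `SamplingCrawler`, `CrawlerRegistry`, and `databases_to_sample` are hypothetical names and are not part of any DataWorks API. It assumes that an empty database list stands for "no database specified".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SamplingCrawler:
    """Hypothetical model of a sampling crawler; not a DataWorks API object."""
    cluster: str
    # Assumption: an empty list means no database was specified.
    databases: list[str] = field(default_factory=list)


class CrawlerRegistry:
    """Enforces the limit of one sampling crawler per cluster."""

    def __init__(self) -> None:
        self._by_cluster: dict[str, SamplingCrawler] = {}

    def create(self, crawler: SamplingCrawler) -> None:
        if crawler.cluster in self._by_cluster:
            raise ValueError(
                f"cluster {crawler.cluster!r} already has a sampling crawler"
            )
        self._by_cluster[crawler.cluster] = crawler


def databases_to_sample(
    crawler: SamplingCrawler, cluster_databases: list[str]
) -> list[str]:
    """Return the databases to sample; all of them if none were specified."""
    if not crawler.databases:
        return list(cluster_databases)
    return [db for db in cluster_databases if db in crawler.databases]
```

In this sketch, creating a second crawler for the same cluster raises an error, and a crawler without a database selection samples every database in the cluster.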
Create a sampling crawler
Manage sampling crawlers
On the CDH Hive Data sampling collector page, you can view, modify, and delete the sampling crawlers that you created.

No. | Description |
---|---|
1 | In this area, you can search for a sampling crawler by entering its name in the search box. Note: Fuzzy match is supported. If you enter a keyword in the search box, sampling crawlers whose names match the keyword are displayed. |
2 | In this area, you can view the details of a sampling crawler in the Status, Execution Plan, Last Run At, Last Consumed Time, and Average Running Time columns. You can also perform the following operations on the sampling crawler: |
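
The fuzzy match in the search box can be pictured as a keyword filter over crawler names. The following is a minimal sketch that assumes a case-insensitive substring match. The actual matching rule used by the console is not documented here, and `search_crawlers` is a hypothetical helper.

```python
def search_crawlers(names: list[str], keyword: str) -> list[str]:
    """Return crawler names that contain the keyword.

    Assumption: matching is a case-insensitive substring test.
    An empty keyword returns every name.
    """
    needle = keyword.strip().lower()
    if not needle:
        return list(names)
    return [name for name in names if needle in name.lower()]
```

For example, searching a list of crawler names for `cdh` would keep only the names that contain that keyword in any letter case.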