How to configure sensitive data detection rules and run sensitive data detection tasks - DataWorks

Data Security Guard lets you configure sensitive data detection rules based on sensitive field types. After you configure a rule, you can use it to detect the corresponding type of sensitive data in your tenant. DataWorks provides various built-in sensitive field types and detection rules. If the built-in rules do not meet your business requirements, you can create custom sensitive field types and detection rules. This topic describes how to create a sensitive field type and configure a data detection rule.

Background information

DataWorks lets you define data detection rules based on the sensitivity level and category of data. This helps you detect sensitive data in your organization. If detection results are inaccurate, you can view and manually correct sensitive data detection results. The Sensitive data overview module displays the distribution of all sensitive fields that have recently matched detection rules, categorized by project. The following figure shows how data detection rules are used.

Go to the Data Detection Rules page

Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Click the icon in the upper-left corner. Then, choose All Products > Data Governance > Data Security Guard. On the page that appears, click Try Now to go to the Data Security Guard page.
Note
- If your Alibaba Cloud account is granted the required permissions, you can directly access the homepage of Data Security Guard.
- If your Alibaba Cloud account is not granted the required permissions, you are redirected to the authorization page of Data Security Guard. You can use the features of Data Security Guard only after your Alibaba Cloud account is granted the required permissions.

In the navigation pane on the left, click Rule Configuration > Sensitive Data Detection to go to the Data Detection Rules page.

Step 1: Configure classification and grading for sensitive fields

A sensitive field type must belong to a data category and have a defined sensitivity level. Therefore, you must configure data classification and grading before you create a sensitive field type and configure a detection rule.

Data Security Guard provides a built-in classification and grading template. The template includes four sensitivity levels and four major categories that you can use directly. DataWorks lets you edit the classifications and grades in the built-in template or create custom ones. You can define up to 10 sensitivity levels. For categories, you can define multilayer categories, subcategories, and the sensitive field types they contain.
You can configure sensitivity grading for fields on the Rule Configuration > Data Classification and Grading page.
- The Data Classification and Grading page displays the default built-in template. Click the icon next to the template to edit the template name, description, and number of grades.
You can configure classification for sensitive fields on the Rule Configuration > Sensitive Data Detection page.
- If you are new to Data Security Guard, the default categories from the Built-in classification and grading template are displayed on the left side of the Data Detection Rules page. You can search for a category by name. You can also click the icon next to a category name to add a same-level category, add a subcategory, rename the category, or delete the category.
- If you are an existing Data Security Guard user, you can create up to four data categories on the left side of the Data Detection Rules page.

Note

A category name must be unique. It must be 1 to 30 characters in length and can contain only letters and digits.
Before you delete a category, check if it contains any published sensitive data detection rules. If it does, you must deactivate all rules in the category before you delete it. For more information, see Manage data detection rules.
For more information about how to configure sensitive data grading, see Configure sensitive data classification and grading.

Step 2: Configure a sensitive data detection rule

Sensitive data detection rules are configured based on sensitive field types. This section uses a custom sensitive field type as an example to describe how to create the field type and configure its detection rule. You can also configure a data detection rule based on a built-in sensitive field type.

On the Data Detection Rules page, click + Sensitive field type in the upper-right corner to add a sensitive field type.

Configure the basic information of the sensitive field type.

On the Basic Information tab, configure parameters for the sensitive field, such as its type, classification, and grading.

The following table describes the main parameters.

Parameter	Description
Sensitive Field Type	The custom name of the sensitive field type, such as name, ID card number, or phone number. The name must be unique.
Category	The category to which the sensitive field type belongs. If the existing categories do not meet your needs, go to the Data Classification and Grading page to configure a category. For more information, see Configure sensitive data classification and grading.
Sensitivity Level	The sensitivity level to which the sensitive field type belongs. A larger number indicates a higher sensitivity level. If the existing grades do not meet your needs, go to the Data Classification and Grading page to configure a grade. For more information, see Configure sensitive data classification and grading.

Click Next.

Configure the detection rule for the sensitive field type.

On the Rule Configuration tab, configure the sensitive data detection rule and its match conditions, and then test the accuracy of the rule.

Parameter	Description
Rule Hits	Select a hit condition for the detection rule from the drop-down list on the right: Satisfy any rule: The detection rule is hit if either the `Data content detection` or `Field name detection` condition is met. Meet all rules: The detection rule is hit only if both the `Data content detection` and `Field name detection` conditions are met. Note The Rule Hits parameter takes effect only for `Data content detection` and `Field name detection` rules.
Data Content Detection	Detects sensitive data based on the data content of a field, which is the field's value. For example, if the value of the `name` field is Zhang San, the rule detects Zhang San. Note The content scan feature is available only in DataWorks Professional Edition and higher. If you use a lower edition of DataWorks, upgrade to the Professional Edition or higher. For more information about how to upgrade, see Select and pay for a software version. Define the content of the sensitive data detection rule based on the rule type to match sensitive text. Four rule types are available: Regular Expression: Enter a regular expression for the detection rule and enter test data to test the accuracy of the rule. Built-in Detection Rule: Select a built-in detection rule and enter test data to test the accuracy of the rule. Note You can select Built-in Detection Rule only in DataWorks Enterprise Edition. Sample Library: Select a configured rule sample and enter test data to test the accuracy of the rule. For more information about how to configure samples, see Detection using a sample library. Custom Model: Select a custom rule model and enter test data to test the accuracy of the rule. For more information about how to configure a custom model, see Detection using a custom model. Note You can select the Custom Model rule only for the MaxCompute DPI engine. You can use Custom Model only in DataWorks Enterprise Edition.
Field Name Detection	Detects sensitive data based on the name of a field. For example, if the value of the `name` field is Zhang San, the rule detects `name`. Enter the fields to be detected as sensitive data. You can specify multiple fields. The logical relationship between the fields is `OR`. The input formats for different data sources are as follows: EMR, CDH, and MaxCompute: `project.table.column` Hologres: `instance_id.project.table.column` You can use an asterisk (`*`) as a wildcard character in any segment of the input format. For example: `a.b.*`: All fields in table b of project a are detected as sensitive data. `ab*.c*.salary`: All salary fields in tables whose names start with c in projects whose names start with ab are detected as sensitive data. `*cd.ef*.sa*ry`: All fields whose names start with sa and end with ry in tables whose names start with ef in projects whose names end with cd are detected as sensitive data.
Field Comment Detection	Detects sensitive data based on the comment of a field. For example, you can configure the comments for a phone number sensitive field type as "phone number" and "contact method". When the system detects that a data comment contains "contact method", the data is detected as a phone number. Enter the field comments in the input box. The comment can be 0 to 100 characters in length. All character types are supported. You can add up to 10 input boxes.
Field Exclusion	Enter the fields to exclude. Fields that match the exclusion rules are not hit by this detection rule. You can specify multiple fields. The logical relationship between the fields is `OR`. The input formats for different data sources are as follows: EMR, CDH, and MaxCompute: `project.table.column` Hologres: `instance_id.project.table.column` You can use an asterisk (`*`) as a wildcard character in any segment of the input format. For example: `a.b.*`: All fields in table b of project a are excluded. `ab*.c*.salary`: All salary fields in tables whose names start with c in projects whose names start with ab are excluded. `*cd.ef*.sa*ry`: All fields whose names start with sa and end with ry in tables whose names start with ef in projects whose names end with cd are excluded.
Hit Rate Configuration	Defines a custom hit rate for the rule. This specifies the percentage of non-empty data in a column that must match the `Data content detection` condition for the detection rule to be hit. For example, 50%. The default value is 50%. The hit rate is calculated using the formula: `100% × Number of data records in the column that hit the detection rule / Total number of data records in the column`. Note The hit rate takes effect only for `Data content detection` rules.
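
The way the conditions in the table above combine can be sketched in a few lines of Python. This is only an illustration and not how Data Security Guard is implemented. All function names are hypothetical. The sketch assumes that `*` matches any run of characters inside one dot-separated segment, that the hit rate is computed over non-empty values, and that a rate equal to the threshold counts as a hit.

```python
import re


def pattern_to_regex(pattern):
    """Translate a dotted pattern such as 'ab*.c*.salary' into a regex.

    Assumption: '*' matches any characters within a single segment.
    """
    escaped = re.escape(pattern)  # '.' becomes '\.', '*' becomes '\*'
    return re.compile(escaped.replace(r"\*", "[^.]*"))


def field_matches(field, patterns):
    """Patterns are OR-ed: one matching pattern is enough."""
    return any(pattern_to_regex(p).fullmatch(field) for p in patterns)


def hit_rate(values, content_regex):
    """100% x matching records / total records, over non-empty values."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if re.fullmatch(content_regex, str(v)))
    return 100.0 * hits / len(non_empty)


def rule_hit(field, values, name_patterns, exclude_patterns,
             content_regex, threshold=50.0, mode="any"):
    """Combine field exclusion, field name detection, and content detection.

    mode="any" mirrors 'Satisfy any rule'; mode="all" mirrors 'Meet all rules'.
    """
    if field_matches(field, exclude_patterns):
        return False  # excluded fields are never hit
    name_ok = field_matches(field, name_patterns)
    content_ok = hit_rate(values, content_regex) >= threshold
    return (name_ok or content_ok) if mode == "any" else (name_ok and content_ok)


if __name__ == "__main__":
    sample = ["13800000000", "13900000000", "n/a", "", None]
    print(field_matches("abc.cust.salary", ["ab*.c*.salary"]))  # True
    print(rule_hit("proj.t.phone", sample, ["*.*.phone"], [],
                   r"1\d{10}", mode="all"))                     # True
```

In the sample column, two of the three non-empty values match the expression, so the hit rate is about 66.7%, which is above the default 50% threshold.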

Publish the data detection rule.
Click Publish to publish the current data detection rule. After the rule is published, you can use it in a detection task to detect the corresponding sensitive data.

Note

If you do not need to use the rule immediately, click Save as Draft to save the data detection rule.
If data in a column matches the detection rules of multiple sensitive field types, the rules take effect in the following order:
- If the number of match conditions is the same for these sensitive field types, the detection order is Field Name Detection > Data Content Detection > Field Comment Detection.
- If the number and types of match conditions are the same, the detection rule for the sensitive field type with the higher sensitivity level takes precedence.
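
The tie-breaking order described in this note can be sketched as a sort key. This is a hedged illustration only: the `Candidate` class and `pick_field_type` function are hypothetical, and the sketch covers only candidates that matched the same number of conditions, because the text does not state how candidates with different counts are ranked. Comparing multi-condition candidates by their ranked condition types is also an assumption.

```python
from dataclasses import dataclass

# Order stated in the note: field name > data content > field comment.
CONDITION_PRIORITY = {"field_name": 0, "data_content": 1, "field_comment": 2}


@dataclass(frozen=True)
class Candidate:
    field_type: str      # name of the sensitive field type
    level: int           # larger number = higher sensitivity level
    matched: frozenset   # names of the conditions that matched


def pick_field_type(candidates):
    """Pick the winning candidate among equally sized matches."""
    counts = {len(c.matched) for c in candidates}
    if len(counts) != 1:
        raise ValueError("sketch only covers candidates with the same "
                         "number of match conditions")

    def key(c):
        ranks = sorted(CONDITION_PRIORITY[m] for m in c.matched)
        # Lower rank wins first; on equal condition types, higher level wins.
        return (ranks, -c.level)

    return min(candidates, key=key)


if __name__ == "__main__":
    by_content = Candidate("phone", 3, frozenset({"data_content"}))
    by_name = Candidate("contact", 2, frozenset({"field_name"}))
    print(pick_field_type([by_content, by_name]).field_type)  # contact
```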

Step 3: Authorize and start a sensitive data detection task

After you configure the sensitive data detection rules, you must authorize and start a sensitive data detection task. After the task starts, the platform detects sensitive data in the tenant based on the detection rules.

Authorize the sensitive data detection task.
The first time you start a sensitive data detection task, click Enable and Authorize in the upper-left corner of the Sensitive Data Detection page and grant permissions as prompted.
Note
After the sensitive data detection task starts, you can click Authorization Records in the upper-right corner of the Sensitive Data Detection page to view the authorization details.

Start the sensitive data detection task.

Configure the sensitive data detection task.

When you configure a sensitive data detection task, you must configure its type, scan method, and scan scope. You can configure a real-time task, a scheduled task, or a one-time task.

Configure a real-time task.

The following table describes the parameters.

Parameter	Description
Account for Detection	Specify an Alibaba Cloud account or a RAM user to sample and scan data. The data is sampled and scanned using the selected account. The range of data that can be sampled varies based on the permissions of the account. Note If you use a RAM user for detection, the RAM user must have permissions on the MaxCompute project.
Real-time Detection	Only ODPS supports real-time detection. When ODPS metadata changes (such as adding a table or field, or changing a field), Data Security Guard automatically starts a sensitive data detection task for the changed metadata. Data Security Guard obtains metadata changes in real time. If the change is caused by a new table or field, the new table or field may not have content yet. In this case, only metadata is used for sensitive data detection.

Configure a scheduled task. The following table describes the parameters.

Parameter	Description
Task Execution	You must manually start the task.
Scan and update policy for subsequent detection tasks	Two options are available: Rescan and update results only for changed rules, the data affected by them, and data with no results. Rescan all data and overwrite all previous results. You can select Do not overwrite manually corrected results.
Account for Detection	Specify an Alibaba Cloud account or a RAM user to sample and scan data. The data is sampled and scanned using the selected account. The range of data that can be sampled and scanned varies based on the permissions of the account. Note If you use a RAM user to sample and scan data, the RAM user must have permissions on the MaxCompute project.
Content Detection	Specifies whether the Content detection and Metadata detection rules take effect. The corresponding rules take effect only after you select them. Note If you do not select Content detection, Data Security Guard does not sample or scan data. The content detection rules do not take effect. However, the field name and field comment detection rules still take effect.
Sampling Quantity	The number of data records to sample for content detection. We recommend a value greater than 100. You must configure this parameter if you select Content Detection.
Scan Frequency and Scan Time	Define the scan schedule for the scheduled task. Configure this parameter only if you set Task Type to Scheduled task. Set the scan frequency to Once a week or Once a day. For weekly scans, select a day from Monday to Friday. The time range is 0:00 to 23:59.
Scan Scope	Configure the range of data for the sensitive data detection task to scan. All: Scans all data under the authorized account of the current tenant. Partial data: Scans the data of tables in a specified project. Note By default, the project scope includes all projects of all data processing engines. You can scan specified tables in ODPS, EMR, and Hologres projects. A table name can be 0 to 100 characters in length. All character types are supported. If you leave this blank, all tables are scanned. The `.*` wildcard is supported. For example, `.*name` indicates a name with the suffix `name`, and `private.*` indicates a name with the prefix `private`. Separate multiple table or field names with commas (,). Select Partial data to add multiple project or database scan scopes. The final scan scope is the union of all added scopes. Manually select a project in the pane on the left. After you select a project, the data tables within that project or database are displayed on the right. You can manually select tables or select all tables at once. By default, all data tables in the database are selected. You can search for project or database scopes and data tables by keyword. To search for a data table by keyword, first select a project to search within.

Configure a one-time task. The following table describes the parameters.

Parameter	Description
Scan and update policy for detection tasks	Two options are available: Rescan and update results only for changed rules, the data affected by them, and data with no results. Rescan all data and overwrite all previous results. You can select Do not overwrite manually corrected results.
Account for Detection	Specify an Alibaba Cloud account or a RAM user to sample and scan data. The data is sampled and scanned using the selected account. The range of data that can be sampled and scanned varies based on the permissions of the account. Note If you use a RAM user to sample and scan data, the RAM user must have permissions on the MaxCompute project.
Content Detection	Specifies whether the Content detection and Metadata detection rules take effect. The corresponding rules take effect only after you select them. Note If you do not select Content detection, Data Security Guard does not sample or scan data. The content detection rules do not take effect. However, the field name and field comment detection rules still take effect.
Sampling Quantity	The number of data records to sample for content detection. We recommend a value greater than 100. You must configure this parameter if you select Content Detection.
Scan Scope	Configure the range of data for the sensitive data detection task to scan. All: Scans all data under the authorized account of the current tenant. Partial data: Scans the data of tables in a specified project. Note By default, the project scope includes all projects of all data processing engines. You can scan specified tables in ODPS, EMR, and Hologres projects. A table name can be 0 to 100 characters in length. All character types are supported. If you leave this blank, all tables are scanned. The `.*` wildcard is supported. For example, `.*name` indicates a name with the suffix `name`, and `private.*` indicates a name with the prefix `private`. Separate multiple table or field names with commas (,). Select Partial data to add multiple project or database scan scopes. The final scan scope is the union of all added scopes. Manually select a project in the pane on the left. After you select a project, the data tables within that project or database are displayed on the right. You can manually select tables or select all tables at once. By default, all data tables in the database are selected. You can search for project or database scopes and data tables by keyword. To search for a data table by keyword, first select a project to search within.
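
The table-name filter in the Scan Scope rows above can be approximated as follows. This is a minimal sketch, not the product's matching logic. It assumes that the wildcard is written `.*` (as in `.*name` and `private.*`), that every other character is matched literally, and that a pattern must match the whole table name. The function names are hypothetical.

```python
import re


def filter_to_regex(pattern):
    """Turn one filter entry into a regex; only '.*' acts as a wildcard."""
    parts = pattern.split(".*")
    return re.compile(".*".join(re.escape(part) for part in parts))


def parse_table_filter(text):
    """Split a comma-separated filter; blank entries are ignored."""
    return [p.strip() for p in text.split(",") if p.strip()]


def table_in_scope(table, filter_text):
    """Return True if the table is scanned under the given filter."""
    patterns = parse_table_filter(filter_text)
    if not patterns:
        return True  # a blank filter means all tables are scanned
    return any(filter_to_regex(p).fullmatch(table) for p in patterns)


if __name__ == "__main__":
    scope = ".*name, private.*"
    print(table_in_scope("user_name", scope))       # True
    print(table_in_scope("private_orders", scope))  # True
    print(table_in_scope("orders", scope))          # False
```

Entries are combined with OR, which matches the statement that the final scan scope is the union of all added scopes.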

Click Run to start the scan task.
After the task starts, the Task Status changes as follows:
1. Real-time task: The status changes to Running.
2. Scheduled task: The status changes to Running. When the configured scan time is reached, the platform detects sensitive data based on the configuration.
3. One-time task: A progress bar chart is displayed. When the progress reaches 100%, the scan is complete. Progress is calculated using the formula: (Number of tables scanned in the current task / Total number of tables to scan in the current task) × 100%.
Note
1. If you modify a detection rule, the new rule takes effect in the next scheduled task, not in real time. To trigger a new task immediately, you must manually create a one-time detection task.
2. After the scan task is complete, the Task Status is updated to No Task.

Manage data detection rules

Copy rule: To quickly copy an existing rule, click the icon. The new rule name has the suffix -copy by default, and its status is Draft. You can configure it as needed.
Edit rule: To modify rule information, click the icon.
Note
- You cannot modify the basic information of rules that are configured based on built-in sensitive field types.
- If you modify a rule, detection results for fields that matched the previous version of the rule are cleared.
Delete rule: If a rule is no longer needed, click the icon to delete it.
Important
Deleting a detection rule for a sensitive field type has the following impacts. Read them carefully before you confirm the deletion.
- Records of this sensitive field type are deleted from the detection results. For more information, see View and manually correct sensitive data detection results.
- Statistics for this sensitive field type are no longer included in the sensitive data distribution information in the Data Discovery module. For more information, see Sensitive data overview.
- If a Fraud Detection rule references this sensitive field type, the reference is removed. For more information, see Fraud Detection management.
Publish rules in a batch: After a rule is published, the platform uses it to detect the corresponding sensitive data. If you have many rules, you can publish them in a batch.
1. On the Data Detection Rules page, click Publish in Batch and select the rules to publish.
  Note
  You can select only rules in the Draft state.
2. Click Publish. After the rules are published, their status changes to Published.
  Note
  To cancel the publication, click Cancel. The rules revert to their original Draft state.
Deactivate rules in a batch: When a rule is deactivated, the platform no longer uses it to detect the corresponding type of sensitive data. Records of this sensitive field type are deleted from modules such as Data Discovery and Manual Data Correction. Before you deactivate a rule, check if it is referenced by data masking rules or Fraud Detection rules. If it is, you must first deactivate the data masking rule and remove the reference from the Fraud Detection rule. For more information, see Create a data masking rule and Fraud Detection management.
1. On the Data Detection Rules page, click Deactivate in Batch and select the rules to deactivate.
  Note
  You can select only rules in the Published state.
2. Click Batch Deactivate. After the rules are deactivated, their status changes to Draft.
  Note
  To cancel the deactivation, click Cancel. The rules revert to their original Published state.

What to do next: View task execution records

The Sensitive Data Detection > Detection Tasks > Task Execution Records page displays the records of completed tasks from the past week. This page does not include records of tasks that are currently running. You can view details such as Start Time, End Time, Duration, Task Type, Owner, and Data Range.