Configure Sensitive Data Identification Rules and Scan Schedules

Prerequisites

An identification rule has been created. For instructions, see Create and manage identification rules.

Limitations

By default, automatic scans that use identification rules do not scan views. You can enable view scanning in the rule running configuration. You can also add or import identification results for views manually.

Permissions

A security administrator can create and manage identification rules, modify the rule running configuration, and enable the automatic inheritance configuration.

Rule running configuration

On the Dataphin homepage, choose Governance > Data Security from the top navigation bar.
In the left-side navigation pane, choose Data Discovery > Identification rules. On the Identification rules page, click the drop-down arrow next to Create identification rule and select Rule running configuration.

In the Rule running configuration dialog box, configure the parameters.

Parameter		Description
Scan configuration	Scheduling cycle	By default, identification rules are scheduled to run once per day. You can adjust the scheduling cycle as needed. A longer cycle reduces resource consumption but may delay the discovery of sensitive data. You can select Day, Week, or Month. If the system time zone (the time zone in your user center) is different from the scheduling time zone (configured in Management Center > System Settings > Basic Settings), the rule runs based on the system time zone.
	Real-time scan of compute source tables	This feature is disabled by default. When enabled, the system automatically triggers a scan on a table if it is newly created, its structure changes (field added, field renamed, or table renamed), or its data is modified through `insert`, `delete`, or `update` operations in Dataphin. The scan identifies and tags sensitive fields. Note: Enabling real-time scanning speeds up discovery and protection of sensitive data but may increase compute resource consumption. Real-time scanning is not supported for data source tables.
	Scan scope	Select the scan scope for the identification rule. By default, Exclude views is selected. You can switch to Include views. Note: This setting does not affect identification results that are manually added or imported in batches. You can still add results for views directly. If you select Include views, both rule-based automatic scans and data lineage-based automatic inheritance apply data categorization and classification to views. Views include physical views, logical views, data source views, data source materialized views, and materialized views.
	Concurrency	Controls the maximum number of identification tasks that can run simultaneously. This includes tasks from the Data Standard module for intelligent mapping and tagging, as well as scheduled, manual, real-time, and inheritance scans based on data lineage from the Data Security module. The default value is 16. Valid values are integers from 1 to 100. Note: This setting takes effect only when automatically triggered sampling queries are disabled. Increasing the concurrency can speed up scanning but consumes more cluster compute resources. Configure this value carefully to ensure system stability.
Sampling configuration (Note: These settings apply to both automatic sampling and the temporary sampling queries that are triggered for content-based identification when automatic sampling is disabled.)	Automatic sampling	This option is enabled if data sampling is configured under Governance > Metadata > Sampling configuration and the trigger scenario is set to identification rule execution or standard tagging rule execution. Otherwise, it is disabled. When enabled, the system performs automatic data sampling according to the settings in Metadata - Sampling configuration. When an identification rule runs, the system first checks for existing sample data. If none is available, it samples data based on the configured automatic sampling update policy. Note: We recommend enabling this option when identification rules involve content-based recognition or when the Data Standard module uses intelligent mapping based on recognized features. This helps prevent data staleness and avoids extra resource consumption from temporary data queries. When automatic sampling is enabled, data sampling tasks are automatically triggered for data source tables.
	Query space for compute source tables	If sampled data is unavailable for content-based identification, a temporary data query is required. You must select a compute resource to run this query. You can modify this configuration in Governance > Metadata > Sampling configuration > Compute Source. Note: Temporary data query tasks consume compute resources. Typically, select the project where the data table is located. To reduce resource pressure and query costs on your primary project, you can assign a dedicated project or resource queue for temporary data queries. This helps avoid interference with regular business tasks. Ensure that the account configured for the compute source in the selected project has read permissions for the relevant data tables. Temporary query tasks for data source tables can run only within their respective data sources. When scanning a lakehouse table, the project's associated compute source must have Spark tasks enabled if the compute engine is one of the following: E-MapReduce 3.x, E-MapReduce 5.x, CDH 5.x, CDH 6.x, FusionInsight 8.x, Asiainfo-Data DP 5.3, Cloudera Data Platform 7.x, Lindorm (compute engine), Amazon EMR, or Transwarp TDH. For tables in the Kudu storage format, the project's associated compute source must have Impala tasks enabled to scan data.
	Scan blackout period	During this period, the system blocks new automatically triggered data sampling queries, causing them to fail immediately. This prevents these tasks from consuming excessive compute resources that could affect production tasks. You can modify this configuration in Governance > Metadata > Sampling configuration > Compute Source.
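The scan configuration values in the table above can be summarized as a small validation sketch. This is only an illustration: the class and field names (`RuleRunningConfig`, `scheduling_cycle`, and so on) are hypothetical and do not correspond to a Dataphin API. Only the defaults and the allowed values are taken from the table.

```python
from dataclasses import dataclass

# Allowed values as listed in the table above.
VALID_CYCLES = {"Day", "Week", "Month"}
VALID_SCOPES = {"Exclude views", "Include views"}


@dataclass
class RuleRunningConfig:
    """Hypothetical container for the scan configuration (not a Dataphin API)."""

    scheduling_cycle: str = "Day"       # default: once per day
    realtime_scan: bool = False         # disabled by default
    scan_scope: str = "Exclude views"   # default scope
    concurrency: int = 16               # default; valid range is 1 to 100

    def validate(self) -> None:
        """Raise ValueError if a value falls outside the documented options."""
        if self.scheduling_cycle not in VALID_CYCLES:
            raise ValueError(f"unsupported scheduling cycle: {self.scheduling_cycle!r}")
        if self.scan_scope not in VALID_SCOPES:
            raise ValueError(f"unsupported scan scope: {self.scan_scope!r}")
        is_int = isinstance(self.concurrency, int) and not isinstance(self.concurrency, bool)
        if not is_int or not 1 <= self.concurrency <= 100:
            raise ValueError("concurrency must be an integer from 1 to 100")


# Usage: the defaults are valid; an out-of-range concurrency is rejected.
RuleRunningConfig().validate()
```

Checking values this way before you enter them in the dialog box can be useful if the settings are tracked in a change-review process outside the console.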

Note

The Concurrency, Scan blackout period, Sampling configuration, and resource configurations defined here are shared with the feature scanning configurations in the Data Standard module. Changes to these settings in one module also apply to the other.
Global feature recognition tasks include those from both the Data Standard and Data Security modules.
- Data Standard: Includes tasks for mapping rules that intelligently match and apply tags based on recognized features (both manual and scheduled rules).
- Data Security: Includes scheduled, manual, and real-time scans, as well as identification tasks based on data lineage inheritance.
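The sampling behavior described in the Sampling configuration rows can be sketched as a decision function. This is a simplified model under stated assumptions, not the actual implementation: the function names and return labels are hypothetical, the blackout period is assumed to be a daily start and end time that may cross midnight, and the window is treated as half-open (start included, end excluded).

```python
from datetime import time
from typing import Optional, Tuple

Window = Tuple[time, time]


def in_blackout(now: time, window: Optional[Window]) -> bool:
    """Return True if `now` falls inside the assumed daily window [start, end)."""
    if window is None:
        return False
    start, end = window
    if start <= end:
        return start <= now < end
    # Assumption: a window such as 22:00-06:00 crosses midnight.
    return now >= start or now < end


def choose_data_source(has_sample: bool, auto_sampling: bool,
                       now: time, blackout: Optional[Window] = None) -> str:
    """Pick how content-based identification obtains data (hypothetical labels)."""
    if has_sample:
        # Existing sample data is checked first.
        return "use_existing_sample"
    if in_blackout(now, blackout):
        # New automatically triggered sampling queries fail immediately.
        return "fail_blocked_by_blackout"
    if auto_sampling:
        return "run_automatic_sampling"
    # No sample and automatic sampling disabled: temporary data query.
    return "run_temporary_query"


# Usage: a temporary query triggered at 23:00 inside a 22:00-06:00 blackout is blocked.
result = choose_data_source(False, False, time(23, 0), (time(22, 0), time(6, 0)))
```

The sketch makes one point from the table explicit: the blackout period applies to both automatic sampling and temporary sampling queries, so neither path runs while the window is active.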

Click OK to save the configuration.

Automatic inheritance configuration

On the Identification rules page, click Automatic inheritance configuration.

In the Data lineage-based automatic inheritance configuration dialog box, configure the parameters.

Parameter	Description
Automatic inheritance	Disabled by default. When enabled, you can configure the scenarios and rules for automatic inheritance based on data lineage. Note: When automatic inheritance is enabled, it applies only to direct data lineage. Downstream fields automatically inherit the sensitivity level of their direct upstream fields. Combined with the default data masking policy, this automatically protects new data, reduces manual configuration, and ensures consistency across related data assets.
Inheritance rule	When there is only one inheritance result, you can select Inherit category and sensitivity level or Inherit only the sensitivity level, not the category. Inherit category and sensitivity level: Enables more precise application of data masking policies to the field. Inherit only the sensitivity level, not the category: Inherits the sensitivity level from the direct upstream field. You can manually specify the data category later in the identification records. When there are multiple inheritance results, you can select Inherit only the highest sensitivity level, not the category or Inherit the highest sensitivity level and the category of its source field. Inherit only the highest sensitivity level, not the category: Inherits the highest sensitivity level from all direct upstream fields. You can manually specify the data category later in the identification records. Inherit the highest sensitivity level and the category of its source field: If multiple fields have the same highest sensitivity level but different categories, the final category is determined based on the following priority: category priority > update time of the identification record > modification time of the category.
Trigger scenarios	You can select Identification rule execution or Data lineage update. You must select at least one trigger scenario. Identification rule execution: For the direct downstream targets of a scanned object, the system calculates inheritance results based on the scan results of that object. Note: Each time an identification rule runs, it queries the downstream fields of the objects in the rule's scope and generates automatic inheritance results according to the configured inheritance rule. If the upstream fields are different but the resulting category and level are the same, the source field for the inheritance result is updated. If a new category and level are inherited, a new record is created. Data lineage update: For each output field with updated data lineage, the system calculates inheritance results based on its input fields. Note: Each time a task is submitted to the development environment or deployed to the production environment, the system queries the input tables of the output table to get the data lineage of the input fields and generates automatic inheritance results. If the upstream fields are different but the resulting category and level are the same, the source field for the inheritance result is updated. If a new category and level are inherited, a new record is created.
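The inheritance rule in the table above can be illustrated with a short sketch. It is a model built on assumptions rather than Dataphin's implementation: the `UpstreamResult` type and its fields are hypothetical, a larger `level` is assumed to mean higher sensitivity, a larger `category_priority` is assumed to win, and more recent timestamps are assumed to win the later tie-breaks.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class UpstreamResult:
    """Identification result of one direct upstream field (hypothetical type)."""

    field: str
    category: str
    level: int                   # assumption: larger means more sensitive
    category_priority: int       # assumption: larger wins
    record_updated: datetime     # update time of the identification record
    category_modified: datetime  # modification time of the category


def resolve_inheritance(upstream: List[UpstreamResult],
                        single_with_category: bool,
                        multi_with_category: bool) -> Tuple[int, Optional[str]]:
    """Return (sensitivity level, category or None) for a downstream field."""
    if not upstream:
        raise ValueError("no upstream identification results")
    # One setting covers a single result, another covers multiple results.
    with_category = single_with_category if len(upstream) == 1 else multi_with_category
    top_level = max(u.level for u in upstream)
    if not with_category:
        # Level only; the category can be assigned manually later.
        return top_level, None
    candidates = [u for u in upstream if u.level == top_level]
    # Tie-break: category priority > record update time > category modification time.
    winner = max(candidates, key=lambda u: (u.category_priority,
                                            u.record_updated,
                                            u.category_modified))
    return top_level, winner.category
```

For example, if two upstream fields share the highest level but one has a higher category priority, that field's category is inherited regardless of which identification record was updated more recently.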

Note

For identification results without a specified category, you can manually assign a suitable category based on the inheritance source. We recommend configuring a default data masking policy to ensure that data with an inherited sensitivity level is properly masked, thereby enhancing data security.
The priority for the final identification result is, from highest to lowest: manual override > automatic identification > automatic data lineage inheritance.
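The priority order above can be expressed as a simple ranking. The following sketch assumes each candidate result is tagged with its source; the source labels and the tuple layout are hypothetical and only illustrate the ordering.

```python
from typing import List, Tuple

# Lower rank wins: manual override > automatic identification > lineage inheritance.
SOURCE_RANK = {
    "manual_override": 0,
    "automatic_identification": 1,
    "lineage_inheritance": 2,
}

# (source, category, level); a hypothetical representation of one result.
Result = Tuple[str, str, int]


def final_result(results: List[Result]) -> Result:
    """Return the candidate whose source has the highest priority."""
    if not results:
        raise ValueError("no identification results")
    return min(results, key=lambda r: SOURCE_RANK[r[0]])


# Usage: an automatic identification result takes precedence over an inherited one.
chosen = final_result([
    ("lineage_inheritance", "phone", 3),
    ("automatic_identification", "email", 2),
])
```

Note that the source decides the outcome here, not the sensitivity level: in the usage example the level 2 result is chosen over the inherited level 3 result.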

Click OK to save the configuration.