After you create identification rules, you can customize how they scan your data based on your business requirements. You can run scheduled, manual, or real-time scans. You can also automatically inherit data categories and sensitivity levels from upstream sources based on data lineage and generate identification results through inheritance tasks. This topic describes how to configure identification rules and how identification results are generated.
Prerequisites
An identification rule has been created. For instructions, see Create and manage identification rules.
Limitations
By default, automatic scans that use identification rules do not scan views. You can enable view scanning in the rule running configuration. You can also add or import identification results for views manually.
Permissions
A security administrator can create and manage identification rules, modify the rule running configuration, and enable the automatic inheritance configuration.
Rule running configuration
On the Dataphin homepage, choose Governance > Data Security from the top navigation bar.
In the left-side navigation pane, choose Data Discovery > Identification rules. On the Identification rules page, click the drop-down arrow next to Create identification rule and select Rule running configuration.
In the Rule running configuration dialog box, configure the parameters.
Parameter
Description
Scan configuration
Scheduling cycle
By default, identification rules are scheduled to run once per day. You can adjust the scheduling cycle based on your business needs. A longer cycle can reduce resource consumption but may delay the discovery of sensitive data. You can select Day, Week, or Month.
If the system time zone (the time zone in your user center) is different from the scheduling time zone (configured in Management Center > System Settings > Basic Settings), the rule runs based on the system time zone.
Real-time scan of compute source tables
This feature is disabled by default. When enabled, the system automatically triggers a scan on a table if it is newly created, its structure changes (field added, field renamed, or table renamed), or its data is modified through insert, delete, or update operations in Dataphin. The scan identifies and tags sensitive fields.
Note: Real-time scanning allows faster discovery and protection of sensitive data but may increase the consumption of compute resources. Evaluate this setting based on your needs.
Real-time scanning is not supported for data source tables.
Scan scope
Select the scan scope for the identification rule. By default, Exclude views is selected. You can switch to Include views.
Note: This setting does not affect identification results that are manually added or imported in batches. You can still add results for views directly.
If you select Include views, both rule-based automatic scans and data lineage-based automatic inheritance apply data categorization and classification to views.
Views include physical views, logical views, data source views, data source materialized views, and materialized views.
Concurrency
Controls the maximum number of identification tasks that can run simultaneously. This includes tasks from the Data Standard module for intelligent mapping and tagging, as well as scheduled, manual, real-time, and inheritance scans based on data lineage from the Data Security module. The default value is 16. Valid values are integers from 1 to 100.
Note: This setting takes effect only when automatically triggered sampling queries are disabled.
Increasing the concurrency can speed up the scanning process but consumes more cluster compute resources. To ensure system stability, configure this value carefully based on your business requirements.
Sampling configuration
Note: These settings apply to both automatic sampling and temporary sampling queries that are triggered for content-based identification when automatic sampling is disabled.
Automatic sampling
This option is enabled if data sampling is configured under Governance > Metadata > Sampling configuration and the trigger scenario is set to identification rule execution or standard tagging rule execution. Otherwise, it is disabled.
When enabled, the system performs automatic data sampling according to the settings in Governance > Metadata > Sampling configuration. When an identification rule runs, the system first checks for existing sample data. If no sample data is available, it performs data sampling based on the configured automatic sampling update policy.
Note: We recommend enabling this option when identification rules involve content-based recognition or when the Data Standard module is configured for intelligent mapping based on recognized features. This helps prevent data staleness and avoids the extra resource consumption of temporary data queries.
When automatic sampling is enabled, data sampling tasks are automatically triggered for data source tables.
Query space for compute source tables
If sampled data is unavailable for content-based identification, a temporary data query is required. You must select a compute resource to run this query. You can modify this configuration in Governance > Metadata > Sampling configuration > Compute Source.
Note: Temporary data query tasks consume compute resources. Typically, you can select the project where the data table is located.
To reduce resource pressure and query costs on your primary project, you can assign a dedicated project or resource queue for temporary data queries. This helps avoid interference with regular business tasks.
Ensure that the account configured for the compute source in the selected project has read permissions for the relevant data tables.
Temporary query tasks for data source tables can run only within their respective data sources.
When scanning a lakehouse table with one of the following compute engines, the project's associated compute source must have Spark tasks enabled: E-MapReduce 3.x, E-MapReduce 5.x, CDH 5.x, CDH 6.x, FusionInsight 8.x, Asiainfo-Data DP 5.3, Cloudera Data Platform 7.x, Lindorm (compute engine), Amazon EMR, or Transwarp TDH. To scan tables in the Kudu storage format, the project's associated compute source must have Impala tasks enabled.
Scan blackout period
During this period, the system blocks new automatically triggered data sampling queries, causing them to fail immediately. This prevents these tasks from consuming excessive compute resources that could affect production tasks, ensuring the stability of your online data services. You can modify this configuration in Governance > Metadata > Sampling configuration > Compute Source.
Note: The Concurrency, Scan blackout period, Sampling configuration, and resource configurations defined here are shared with the feature scanning configurations in the Data Standard module. Changes to these settings in one module also apply to the other.
Global feature recognition tasks include those from both the Data Standard and Data Security modules.
Data Standard: Includes tasks for mapping rules that intelligently match and apply tags based on recognized features (both manual and scheduled rules).
Data Security: Includes scheduled, manual, and real-time scans, as well as identification tasks based on data lineage inheritance.
Click OK to save the configuration.
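The defaults and valid ranges described in the table above can be summarized in a short Python sketch. This is only an illustration of the documented constraints: the class name `RuleRunConfig` and its field names are hypothetical and do not correspond to any Dataphin API, because these settings are configured in the dialog box.

```python
from dataclasses import dataclass

# Scheduling cycles offered in the dialog box.
VALID_CYCLES = {"Day", "Week", "Month"}


@dataclass
class RuleRunConfig:
    """Hypothetical model of the rule running configuration (not a Dataphin API)."""

    scheduling_cycle: str = "Day"   # rules run once per day by default
    realtime_scan: bool = False     # real-time scan is disabled by default
    include_views: bool = False     # default scan scope is "Exclude views"
    concurrency: int = 16           # default 16, valid integers 1 to 100

    def __post_init__(self) -> None:
        if self.scheduling_cycle not in VALID_CYCLES:
            raise ValueError("scheduling_cycle must be Day, Week, or Month")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.concurrency, bool) or not isinstance(self.concurrency, int):
            raise ValueError("concurrency must be an integer")
        if not 1 <= self.concurrency <= 100:
            raise ValueError("concurrency must be between 1 and 100")
```

For example, `RuleRunConfig()` yields the documented defaults, while `RuleRunConfig(concurrency=0)` raises `ValueError`.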
Automatic inheritance configuration
On the Identification rules page, click Automatic inheritance configuration.
In the Data lineage-based automatic inheritance configuration dialog box, configure the parameters.
Parameter
Description
Automatic inheritance
Disabled by default. When enabled, you can configure the scenarios and rules for automatic inheritance based on data lineage.
Note: When automatic inheritance is enabled, it applies only to direct data lineage. Downstream fields automatically inherit the sensitivity level of their direct upstream fields. This feature works with the default data masking policy to automatically protect new data, reducing manual configuration effort and ensuring consistency across related data assets.
Inheritance rule
When there is only one inheritance result, you can select Inherit category and sensitivity level or Inherit only the sensitivity level, not the category.
Inherit category and level: Enables more precise application of data masking policies to the field.
Inherit level only, not category: Inherits the data level from the direct upstream field. You can manually specify the data category later in the identification records.
When there are multiple inheritance results, you can select Inherit only the highest sensitivity level, not the category or Inherit the highest sensitivity level and the category of its source field.
Inherit highest level only, not category: Inherits the highest data level from all direct upstream fields. You can manually specify the data category later in the identification records.
Inherit highest level + category of the field with the highest level: If multiple fields have the same highest sensitivity level but different categories, the final category is determined based on the following priority: category priority > update time of the identification record > modification time of the category.
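The handling of multiple inheritance results can be sketched in Python as follows. This is a minimal illustration, not Dataphin's implementation. It assumes that a larger `level` number means higher sensitivity, that a smaller `category_priority` number means higher priority, and that the more recent timestamp wins in the two time-based tie-breaks. The type `UpstreamResult` and the function `resolve_multiple` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class UpstreamResult:
    """Hypothetical identification result of one direct upstream field."""

    field: str
    level: int                   # assumption: larger number = more sensitive
    category: str
    category_priority: int       # assumption: smaller number = higher priority
    record_updated: datetime     # update time of the identification record
    category_modified: datetime  # modification time of the category


def resolve_multiple(
    results: List[UpstreamResult], inherit_category: bool
) -> Tuple[int, Optional[str]]:
    """Return (level, category) inherited from several direct upstream fields."""
    if not results:
        raise ValueError("at least one upstream result is required")
    top_level = max(r.level for r in results)
    if not inherit_category:
        # "Inherit highest level only, not category"
        return top_level, None
    # "Inherit highest level + category of the field with the highest level"
    candidates = [r for r in results if r.level == top_level]
    # Tie-break order from the text: category priority, then record update
    # time, then category modification time (recency is an assumption).
    winner = max(
        candidates,
        key=lambda r: (-r.category_priority, r.record_updated, r.category_modified),
    )
    return top_level, winner.category
```

When the category is not inherited, the function returns `None` for it, which corresponds to specifying the category manually later in the identification records.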
Trigger scenarios
You can select Identification rule execution or Data lineage update.
Identification rule execution: For the direct downstream targets of a scanned object, the system calculates inheritance results based on the scan results of that object.
Note: Each time an identification rule runs, it queries the downstream fields of the objects in the rule's scope and generates auto-inheritance results according to the configured inheritance rule.
If the upstream fields are different but the resulting category and level are the same, the source field for the inheritance result is updated. If a new category and level are inherited, a new record is created.
Data lineage update: For each output field with updated data lineage, the system calculates inheritance results based on its input fields.
Note: Each time a task is submitted to the development environment or deployed to the production environment, the system queries the input tables of the output table to get the data lineage of the input fields and generates automatic inheritance results.
If the upstream fields are different but the resulting category and level are the same, the source field for the inheritance result is updated. If a new category and level are inherited, a new record is created.
You must select at least one trigger scenario.
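Both trigger scenarios share the same update-or-create behavior for inheritance records. The following Python sketch illustrates that behavior under simplifying assumptions: records are plain dictionaries held in a list, and the function name `apply_inheritance_result` and the keys `target`, `category`, `level`, and `source` are hypothetical. How Dataphin actually stores these records is not described in this topic.

```python
from typing import Dict, List

Record = Dict[str, str]


def apply_inheritance_result(
    records: List[Record], target: str, category: str, level: str, source: str
) -> str:
    """Update or create an inheritance record for a downstream field.

    Returns "updated" if a record with the same category and level already
    exists for the target field, otherwise "created".
    """
    for rec in records:
        same_outcome = (
            rec["target"] == target
            and rec["category"] == category
            and rec["level"] == level
        )
        if same_outcome:
            # Same category and level: only the source field changes.
            rec["source"] = source
            return "updated"
    # A new category and level combination: add a new record.
    records.append(
        {"target": target, "category": category, "level": level, "source": source}
    )
    return "created"
```

Calling the function twice with the same category and level but different source fields leaves one record whose source points to the later upstream field.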
Note: For identification results without a specified category, you can manually assign a suitable category based on the inheritance source. We recommend configuring a default data masking policy to ensure that data with an inherited sensitivity level is properly masked, thereby enhancing data security.
The priority for the final identification result is, from highest to lowest: Manual Override > Automatic Identification > Automatic data lineage inheritance.
Click OK to save the configuration.
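The priority order for the final identification result (Manual Override > Automatic Identification > Automatic data lineage inheritance) can be expressed as a small Python sketch. The enum `ResultSource` and the function `final_result` are hypothetical names used for illustration only. The numeric values are an assumption chosen so that a larger value means higher priority.

```python
from enum import IntEnum
from typing import List, Optional, Tuple


class ResultSource(IntEnum):
    """Hypothetical origin of an identification result; larger = higher priority."""

    LINEAGE_INHERITANCE = 1
    AUTOMATIC_IDENTIFICATION = 2
    MANUAL_OVERRIDE = 3


def final_result(candidates: List[Tuple[ResultSource, str]]) -> Optional[str]:
    """Pick the result whose source has the highest priority, or None if empty."""
    if not candidates:
        return None
    # max() compares the IntEnum values, so MANUAL_OVERRIDE always wins.
    return max(candidates, key=lambda c: c[0])[1]
```

For a field that has both an inherited result and a rule-based result, this sketch returns the rule-based one. A manual override, if present, takes precedence over both.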