Dataphin: Identification rules and identification methods

Last Updated: Jan 13, 2026

After establishing identification rules, you can customize the scanning method to meet your business needs. The system supports scheduled, manual, and real-time scans. It also enables automatic inheritance of upstream classifications and gradings based on data lineage, allowing identification results to be generated through inheritance tasks. This topic describes how to configure identification rules and the process for producing identification results.

Prerequisites

Identification rules have been established. For instructions on how to create them, see Create and manage identification rules.

Limits

By default, identification rules do not scan view objects automatically. However, you can enable view scanning in the rule's runtime configuration. Additionally, you have the option to manually add or batch import identification results for views.

Permission description

Security administrators have the authority to create and manage identification rules, modify rule runtime configurations, and activate automatic inheritance configurations.

Identification rule runtime configuration

  1. On the Dataphin home page, select Administration > Data Security from the top menu bar.

  2. In the navigation pane on the left, select Data Identification > Classification Rule. On the Classification Rule page, click the drop-down arrow next to New Identification Rule and select Rule Runtime Configuration.

  3. In the Rule Runtime Configuration dialog box, configure the parameters.

    Parameter

    Description

    Scan Configuration

    Scheduling Epoch

    By default, identification rules are scheduled once a day. You can adjust the scheduling period to suit your business requirements. A longer period reduces resource consumption but may delay the identification of sensitive data. You can select Day, Week, or Month as the scheduling period.

    When the system time zone (the time zone in User Center) is different from the scheduling time zone (the time zone configured in Management Hub > System Settings > Basic Settings), the rules will be executed according to the system time zone.

    Real-time Scan

    Disabled by default. When enabled, a table is scanned once and its sensitive fields are tagged whenever the table is created, its structure changes (fields are added, fields are renamed, or the table is renamed), or its data changes (an insert, delete, or update executed through Dataphin).

    Note

    With real-time scanning enabled, sensitive data is detected and protected sooner, but more computing resources may be consumed. Weigh this trade-off before you enable the feature.

    Scan Range

    Select the scan range of the identification rules. The default is Filter Views, and you can switch to Include Views.

    Note
    • Batch import and manual addition of identification results are not affected by this configuration. You can directly add identification results of view objects.

    • When the scan range includes views, both rule-based automatic scanning and lineage-based automatic inheritance will classify and grade view objects.

    • View objects include physical views, logical views, data source views, data source materialized views, and materialized views.

    Concurrent Runs

    Used to control the number of identification tasks running simultaneously globally, including standard module tasks for intelligent mapping of identification features, scheduled scans, manual scans, real-time scans, and automatic inheritance scan tasks triggered by lineage updates. The default is 16, and you can configure a positive integer between 1 and 100.

    Note
    • This parameter takes effect only when automatically triggered sampling queries are disabled.

    • Increasing the degree of parallelism speeds up scans but uses more cluster computing resources. To ensure system stability, configure this parameter based on your business needs.

    Sampling Configuration

    Note

    This applies to automatic sampling and temporary sampling queries that are triggered for content-based detection when automatic sampling is disabled.

    Automatic Sampling

    This is enabled if data sampling is turned on in Administration > Metadata > Sampling Configuration and the trigger scenario is set to `Security identification rule execution` or `Standard mapping rule execution`. Otherwise, it is disabled.

    When enabled, automatic data sampling is performed based on the settings in Metadata > Sampling Configuration. When an identification rule runs, the system first checks if sample values exist within the data range to determine if data sampling is needed. It then performs automatic sampling based on the automatic sampling update policy.

    Note

    Enable this feature when security identification rules involve content-based detection or when standard mapping rules are configured for intelligent mapping based on identification features. This helps prevent data from becoming outdated and avoids extra resource consumption from temporary data queries.

    Execution Space

    When no sample data is available and a temporary data query is needed for content-based detection, select the computing resources for the temporary data query node. You can modify the configuration in Administration > Metadata > Sampling Configuration > Compute Source.

    Note
    • Temporary data query nodes use some computing resources. In most cases, select the project where the data resides.

    • If you want to reduce the resource load and query costs on the data's source project (for example, by choosing a separate subscription project) and avoid interfering with regular business projects, you can also assign dedicated project resources or queues for temporary data queries.

    • Ensure the account configured for the compute engine in the selected project has read permissions for the relevant data tables.

    • If the compute engine is E-MapReduce 3.x, E-MapReduce 5.x, CDH 5.x, CDH 6.x, FusionInsight 8.x, Asiainfo DP 5.3, Cloudera Data Platform 7.x, Lindorm (compute engine), Amazon EMR, or Transwarp TDH and the data table is a lake table, you must enable the Spark node for the project's compute engine to scan data. If the data table's storage format is Kudu, you must enable the Impala node for the project's compute engine to scan data.

    Scan Disable Period

    During the specified period, automatically triggered data sampling query tasks are not run; any such task fails immediately. This prevents sampling from consuming computing resources that production tasks need and keeps online data tasks stable. You can modify the configuration in Administration > Metadata > Sampling Configuration > Compute Source.

    Note
    • The concurrent runs, scan disable period, sampling configuration, and resource configuration in this rule runtime configuration are shared with the feature scan configuration of the data standard module. A change made in one place is applied to the other.

    • Global feature identification tasks include both standard and security module feature identification tasks.

      • Data Standard: Tasks for intelligent mapping of identification features based on mapping rules, including both manual and scheduled execution rules.

      • Asset Security: Encompasses scheduled scans, manual scans, real-time scans, and identification tasks based on lineage inheritance.

  4. Click OK to save the identification rule runtime configuration.
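
The defaults and limits described in the parameter table can be summarized in a short Python sketch. This is only an illustration and not a Dataphin API: the class and field names (`RuleRuntimeConfig`, `sampling_query_allowed`, and so on) are hypothetical, and representing the scan disable period as a start and end time of day is an assumption.

```python
from dataclasses import dataclass
from datetime import time
from enum import Enum
from typing import Optional, Tuple


class SchedulingPeriod(Enum):
    DAY = "Day"
    WEEK = "Week"
    MONTH = "Month"


class ScanRange(Enum):
    FILTER_VIEWS = "Filter Views"
    INCLUDE_VIEWS = "Include Views"


@dataclass
class RuleRuntimeConfig:
    """Hypothetical model of the Rule Runtime Configuration dialog box."""
    # Defaults mirror the ones stated in the parameter descriptions.
    scheduling_period: SchedulingPeriod = SchedulingPeriod.DAY
    real_time_scan: bool = False
    scan_range: ScanRange = ScanRange.FILTER_VIEWS
    concurrent_runs: int = 16
    # Assumption: the disable period is a (start, end) pair of times of day.
    scan_disable_period: Optional[Tuple[time, time]] = None

    def __post_init__(self) -> None:
        # Concurrent Runs must be a positive integer between 1 and 100.
        if isinstance(self.concurrent_runs, bool) or not isinstance(self.concurrent_runs, int):
            raise ValueError("concurrent_runs must be an integer")
        if not 1 <= self.concurrent_runs <= 100:
            raise ValueError("concurrent_runs must be between 1 and 100")


def sampling_query_allowed(config: RuleRuntimeConfig, now: time) -> bool:
    """Return False when `now` falls inside the scan disable period."""
    if config.scan_disable_period is None:
        return True
    start, end = config.scan_disable_period
    if start <= end:
        return not (start <= now < end)
    # The period wraps past midnight, for example 22:00 to 06:00.
    return not (now >= start or now < end)
```

For example, `RuleRuntimeConfig(concurrent_runs=0)` raises `ValueError`, and with a disable period of 22:00 to 06:00 the helper rejects a sampling query at 23:30 but allows one at 12:00.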

Automatic inheritance configuration

  1. On the Classification Rule page, click the Automatic Inheritance Configuration button.

  2. In the Lineage-based Automatic Inheritance Configuration dialog box, set the parameters.

    Parameter

    Description

    Inherited

    This feature is disabled by default. When enabled, you can configure the scenarios and rules for automatic inheritance based on field lineage.

    Note

    When this feature is enabled, automatic inheritance is based only on direct lineage. Downstream fields automatically inherit the sensitivity level from parent table fields. The system also applies the default desensitization rules to protect new data. This process reduces manual configuration costs and improves the consistency and relevance of detection results for associated data.

    Inheritance Rules

    • When there is only one inheritance result, you can choose Inherit Classification + Grading or Inherit Grading Only, Do Not Inherit Classification.

      • Inherit Classification + Grading: Allows more precise application of desensitization rules to the field.

      • Inherit Grading Only, Do Not Inherit Classification: Inherits the data grading of the direct ancestor table field. You can manually specify data classification in the identification record later.

    • When there are multiple inheritance results, you can choose Inherit Highest Grading Only, Do Not Inherit Classification or Inherit Highest Grading + Classification Corresponding To The Highest Grading Source Field.

      • Inherit Highest Grading Only, Do Not Inherit Classification: Inherits the highest data grading of the direct ancestor table field. You can manually specify data classification in the identification record later.

      • Inherit Highest Grading + Classification Corresponding To The Highest Grading Source Field: If multiple fields have the same sensitivity level but different classifications, the classification result is determined by classification priority > identification record update time > classification modification time.

    Trigger Scenarios

    Select Identification Rule Execution, Lineage Update, or both.

    • Identification Rule Execution: The system calculates the inheritance result for the direct downstream fields of a scanned object from that object's identification result.

      Note
      • Each time an identification rule is executed, the system queries the downstream fields of the objects selected by the rule according to field lineage and generates automatic inheritance results based on the rule configuration.

      • If the ancestor table fields are different but the classification and grading corresponding to the inheritance result are the same, the source field of the inheritance result will be updated. If a new classification and grading inheritance result is generated, a corresponding record will be added.

    • Lineage Update: For each output field whose lineage is updated, the system calculates the inheritance result from the input fields.

      Note
      • Each time a task is submitted to the development environment or published to the production environment, the system looks up the input tables of the output table, obtains the lineage of the input fields, and generates automatic inheritance results based on the rule configuration.

      • If the ancestor table fields are different but the classification and grading corresponding to the inheritance result are the same, the source field of the inheritance result will be updated. If a new classification and grading inheritance result is generated, a corresponding record will be added.

    At least one inheritance scenario must be selected.

    Note
    • For identification results that lack a specified classification, you can manually assign appropriate classifications based on the source of inheritance. It is advisable to set up the Default Desensitization Policy to ensure compatibility between the automatically inherited grading result data and the desensitization algorithm, thereby enhancing data security.

    • The priority of the final effective identification result is, from highest to lowest: manual execution, automatic identification, and automatic lineage inheritance.

  3. Click OK to complete the configuration of lineage-based automatic inheritance.
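
The inheritance rules and result priority described above can be illustrated with a small Python sketch. It is not Dataphin's implementation. The names (`UpstreamResult`, `resolve_inheritance`, `effective_source`) are hypothetical, and several details are assumptions that the text does not specify: a larger `grading` number means more sensitive, a larger `classification_priority` number means higher priority, and the newer timestamp wins a tie.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class UpstreamResult:
    """Identification result of one direct upstream field (hypothetical shape)."""
    field: str
    grading: int                       # assumption: larger = more sensitive
    classification: str
    classification_priority: int       # assumption: larger = higher priority
    record_updated: datetime           # identification record update time
    classification_modified: datetime  # classification modification time


def resolve_inheritance(
    upstream: List[UpstreamResult], inherit_classification: bool
) -> Optional[Tuple[int, Optional[str], str]]:
    """Return (grading, classification or None, source field) for a downstream field."""
    if not upstream:
        return None
    top_grading = max(r.grading for r in upstream)
    tied = [r for r in upstream if r.grading == top_grading]
    # Tie-break order from the text: classification priority, then record
    # update time, then classification modification time. "Newer wins" for
    # both timestamps is an assumption.
    winner = max(
        tied,
        key=lambda r: (r.classification_priority, r.record_updated, r.classification_modified),
    )
    classification = winner.classification if inherit_classification else None
    return top_grading, classification, winner.field


# Final effective result, from highest to lowest priority, as listed in the note.
SOURCE_RANK = {"manual": 0, "automatic_identification": 1, "lineage_inheritance": 2}


def effective_source(available: List[str]) -> str:
    """Pick the result source that takes effect among those present."""
    return min(available, key=SOURCE_RANK.__getitem__)
```

With a single upstream result the function simply returns that result, with or without its classification. Passing `inherit_classification=False` corresponds to the two "Do Not Inherit Classification" options, which leave the classification empty so that it can be assigned manually later.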