DataWorks: Configure sensitive data identification rules

Last Updated: Oct 09, 2023

DataWorks helps you identify sensitive data in your workspace by using sensitive data identification rules configured for built-in and custom sensitive field types. This topic describes how to create a sensitive field type and configure a sensitive data identification rule for this type.

Background information

DataWorks allows you to configure a sensitive data identification rule for a sensitive field type that is at a specific sensitivity level and belongs to a specific category. This helps you identify sensitive data in your workspace. You can manually correct inaccurate identification results on the Manual Check tab. The Data Recognition page displays statistics on the sensitive fields that recently hit the sensitive data identification rule in each workspace. For more information, see Manually correct sensitive data identification results and Identify sensitive data. The following figure shows the logic of how to configure a sensitive data identification rule.

[Figure: logic diagram]

Note

Before you can identify or mask sensitive data in Cloudera's Distribution Including Apache Hadoop (CDH), you must sample data from CDH Hive tables by using the data crawler feature of DataWorks. Then, Data Security Guard identifies sensitive data from the sampled data. The sampled data is not stored in DataWorks. This helps prevent data leaks. For more information, see CDH Hive sampling crawlers.

Go to the Data Recognition Rules tab

  1. Log on to the DataWorks console and go to the Data Security Guard page. For more information, see Overview.

  2. Click Try now. The Data Security Guard homepage appears.

  3. In the left-side navigation pane, choose Rule Change > Sensitive data identification. On the Data Recognition Rules tab, you can create a sensitive field type and configure a sensitive data identification rule for this type.

Create a category

  • The first time that you use Data Security Guard, the default categories that are provided by the built-in data category and data sensitivity level template are displayed in the left-side section on the Data Recognition Rules tab. You can search for a category by entering its name in the search box. You can also click the Add icon to the right of a category name and select an option to create a category or a subcategory, rename the category, or delete the category.

  • If you have used Data Security Guard before, you can create categories in the left-side section on the Data Recognition Rules tab based on your business requirements. Click the Add icon to the right of a default category to create another category.

Note
  • The name of the category must be unique. The name must be 1 to 30 characters in length, and can contain letters and digits.

  • If you want to delete a category, make sure that all sensitive field types in this category are not published. Otherwise, the category cannot be deleted. For more information, see Unpublish multiple sensitive field types at a time.
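The naming constraints in the note above can be checked before a name is submitted. The following is a minimal sketch, not part of DataWorks: the function name `is_valid_category_name` and the `existing_names` argument are hypothetical, and treating non-ASCII letters as valid is an assumption, because the note does not say which alphabets are accepted.

```python
from typing import Iterable


def is_valid_category_name(name: str, existing_names: Iterable[str]) -> bool:
    """Check a proposed category name against the documented constraints.

    Assumption: "letters and digits" is interpreted with str.isalnum(),
    which also accepts non-ASCII letters and digits.
    """
    if not 1 <= len(name) <= 30:
        return False  # must be 1 to 30 characters in length
    if not name.isalnum():
        return False  # only letters and digits are allowed
    return name not in set(existing_names)  # must be unique


print(is_valid_category_name("Finance01", ["HR", "Sales"]))
```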

Create a sensitive field type

  1. Specify a category for the sensitive field type that you want to create.

    In the left-side section, click the category in which you want to create a sensitive field type.

  2. Create a sensitive field type and configure a sensitive data identification rule for this type.

    Click Sensitive field type in the upper-right corner.

    1. In the Set Basic Info step, configure the parameters and click Next.

      • Sensitive field type: The name of the sensitive field type, such as name, ID number, or phone number.

        Note

        The name must be unique. If the specified name already exists, the Duplicate sensitive field type message appears below the Sensitive field type field.

      • Category: The category in which you want to create the sensitive field type. The default value is the category that you click in Step 1. If you want to change the category, select a different category from the drop-down list.

      • Level: The sensitivity level that you specify for the sensitive field type. If the existing levels do not meet your business requirements, go to the Data classification and level page and configure sensitivity levels based on your business requirements. For more information, see Manage data sensitivity levels.

      • Description: The description of the sensitive field type. The description must be 0 to 100 characters in length and cannot contain special characters.

    2. In the Rule Change step, configure the sensitive data identification rule and use sample data to test the accuracy of the rule. After the sensitive data identification rule is configured and the sensitive field type is published, you can use the rule to identify sensitive data in an identification task.

      Note

      If you modify the rule, the identification results obtained based on the original sensitive data identification rule are cleared.

      Parameter

      Description

      Identify the rule matching conditions

      The mode that determines when the sensitive data identification rule is hit. Select a mode from the drop-down list. Valid values:

      • Any one of the following conditions is met to hit the rule: The sensitive data identification rule is hit when the condition for sensitive content identification or sensitive field identification is met.

      • The following conditions are met at the same time to hit the rule: The sensitive data identification rule is hit when both the conditions for sensitive content identification and sensitive field identification are met.

      Note

      The mode takes effect only for sensitive content identification and sensitive field identification.

      Data content identification

      The identification method used to match sensitive texts. Sensitive content identification identifies values of a sensitive field. For example, if the values of the name field contain the specified sensitive text, such as Tom and Bob, Tom and Bob are identified as sensitive data. Valid values:

      • Regular expression: Enter a regular expression and sample data. Then, click Test accuracy to test the accuracy of the sensitive data identification rule.

      • Built-in recognition rules: Select a built-in identification method from the drop-down list and enter sample data. Then, click Test accuracy to test the accuracy of the sensitive data identification rule.

        Note

        The Built-in recognition rules option is available only in DataWorks Enterprise Edition or a more advanced edition.

      • Sample library: Select a sample library and enter sample data. Then, click Test accuracy to test the accuracy of the sensitive data identification rule. For more information about how to create and manage sample libraries, see Identify sensitive data by using sample libraries.

      • Self-generating model: Select a custom data identification model and enter sample data. Then, click Test accuracy to test the accuracy of the sensitive data identification rule. For more information about how to generate a custom data identification model, see Generate a custom data identification model.

        Note

        You can select Self-generating model only if your workspace is associated with a MaxCompute compute engine instance. The Self-generating model option is available only in DataWorks Enterprise Edition or a more advanced edition.

      Note

      The sensitive content identification condition is available only in DataWorks Professional Edition or a more advanced edition. If you are using DataWorks Basic Edition or DataWorks Standard Edition, you must upgrade your DataWorks service to the Professional Edition or a more advanced edition before you can use the sensitive content identification condition. For more information about DataWorks editions, see Billing of DataWorks advanced editions.

      Field name recognition

      The fields used to match sensitive data. You can specify multiple fields. The logical relationship between fields is OR. If data contains one of the specified fields, the data is identified as sensitive data. The value must be in the project.table.column format. You can use an asterisk (*) to represent any number of characters in each section of the value. Examples:

      • The field value abcd.efg.* indicates that all data in the efg table in the abcd workspace is identified as sensitive data.

      • The field value ab*.*.salary indicates that data in the column named salary in all tables in the workspaces whose names are prefixed with ab is identified as sensitive data.

      • The field value *cd.ef*.sa*ry indicates that data in the columns whose names are prefixed with sa and suffixed with ry is identified as sensitive data. These columns are contained in the tables whose names are prefixed with ef in the workspaces whose names are suffixed with cd.

      Note

      Sensitive field identification identifies sensitive field names. For example, if the data contains a sensitive field name, the data is identified as sensitive data.

      Field comment recognition

      The field comment used to match sensitive data. For example, a field whose sensitive field type is phone number may have the comment phone number or contact method. If a field comment contains phone number or contact method, the field is identified as a sensitive field whose sensitive field type is phone number. You can specify up to 10 field comments. Each field comment can be 0 to 100 characters in length. The types of characters in comments are not limited.

      Field to rule out

      The fields that you want to ignore. The fields are not used to match sensitive data and do not hit the configured sensitive data identification rule. The value must be in the project.table.column format. You can use an asterisk (*) to represent any number of characters in each section of the value. Examples:

      • The field value abcd.efg.* indicates that all data in the efg table in the abcd workspace is ignored by sensitive field identification.

      • The field value ab*.*.salary indicates that data in the column named salary in all tables in the workspaces whose names are prefixed with ab is ignored by sensitive field identification.

      • The field value *cd.ef*.sa*ry indicates that data in the columns whose names are prefixed with sa and suffixed with ry is ignored by sensitive field identification. These columns are contained in the tables whose names are prefixed with ef in the workspaces whose names are suffixed with cd.

      Hit ratio configuration

      The hit ratio threshold of sensitive content identification for a column. If the ratio of identified sensitive values to non-empty values of a column exceeds the threshold, the sensitive data identification rule is hit. By default, the hit ratio threshold is set to 50%. You can also customize the hit ratio threshold based on your business requirements. The following formula is used to calculate the hit ratio of sensitive content identification for a column: Hit ratio = 100% × (Number of data records that match the condition of sensitive content identification) / (Total number of data records in the column).

      Note

      The hit ratio threshold takes effect only for sensitive content identification.

    3. After you check the configurations of the sensitive data identification rule, click Save drafts to save a draft. You can also click Release to use to publish the sensitive field type. After you publish a sensitive field type, the status of the sensitive field type becomes The published and an identification task is generated for the sensitive field type.

    Note

    Data in a column can hit the conditions specified in multiple sensitive data identification rules that are configured for different sensitive field types. If the rules specify the same number of conditions, the column is identified based on the conditions in the following order of priority: sensitive field identification, sensitive content identification, and sensitive comment identification. If the rules specify the same number and types of conditions, the column is identified based on the rule whose sensitive field type has the highest sensitivity level.
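The matching behavior described in the Rule Change step can be sketched in a few lines. This is an illustrative approximation, not the DataWorks implementation: the function names are hypothetical, matching is assumed to be case-sensitive, an excluded field is assumed never to hit the rule, and the hit ratio uses non-empty values as the denominator. Field comment recognition is left out.

```python
import re
from typing import Callable, Iterable, Optional, Sequence


def field_matches(field: str, pattern: str) -> bool:
    """Match a project.table.column name against a pattern, section by section."""
    field_parts, pattern_parts = field.split("."), pattern.split(".")
    if len(field_parts) != 3 or len(pattern_parts) != 3:
        return False
    for section, name in zip(pattern_parts, field_parts):
        # Escape the section, then let each asterisk stand for any run of characters.
        regex = re.escape(section).replace(r"\*", ".*")
        if re.fullmatch(regex, name) is None:
            return False
    return True


def hit_ratio(values: Iterable[Optional[str]],
              is_sensitive: Callable[[str], bool]) -> float:
    """Percentage of non-empty values that the content check flags as sensitive."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if is_sensitive(v))
    return 100.0 * hits / len(non_empty)


def rule_hit(field: str,
             values: Sequence[Optional[str]],
             include: Sequence[str],
             exclude: Sequence[str],
             is_sensitive: Callable[[str], bool],
             threshold: float = 50.0,
             require_all: bool = False) -> bool:
    """Combine field name matching and content matching for one column."""
    # Assumption: a field listed under "Field to rule out" never hits the rule.
    if any(field_matches(field, p) for p in exclude):
        return False
    # Multiple field patterns are combined with OR.
    field_ok = any(field_matches(field, p) for p in include)
    # The rule is hit only if the ratio exceeds the threshold.
    content_ok = hit_ratio(values, is_sensitive) > threshold
    if require_all:
        return field_ok and content_ok  # both conditions must be met
    return field_ok or content_ok       # any one condition is enough


# Hypothetical content check: an 11-digit number that starts with 1.
phone = re.compile(r"1\d{10}")
sample = ["13800000000", "13900000000", "n/a", ""]
print(rule_hit("crm.users.phone", sample, ["*.*.phone"], [],
               lambda v: phone.fullmatch(v) is not None, require_all=True))
```

Matching each section separately keeps an asterisk from spanning the dots between project, table, and column, which follows the statement that the asterisk applies within each section of the value.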

You can search for the published sensitive field types on the Data Recognition, Data Activities, and Data Risks pages by name and sensitivity level.

Manually enable a sensitive data identification task

You can manually trigger or terminate a sensitive data identification task, and view the status and execution logs of a sensitive data identification task.

  1. In the upper-left corner of the Sensitive data identification page, click Open task. In the Open sensitive data identification task panel, you can set the Scanning range parameter to Total quantity or Custom range.

    You can view the task execution progress on the Data Recognition Rules tab and manually terminate the task.

  2. After the sensitive data identification task is complete, you can go to the Task execution record tab to view the status and execution logs of the sensitive data identification task.

Manage sensitive field types

  • Copy a sensitive field type

    Find the sensitive field type that you want to copy and click the Copy icon in the Actions column. A new sensitive field type with the same settings is created. The name of the generated sensitive field type is suffixed with -Replica. By default, it is saved as a draft.

  • Modify a sensitive field type

    Find the sensitive field type that you want to modify and click the Edit icon in the Actions column. In the dialog box that appears, you can modify the sensitive data identification rule for the sensitive field type in the Rule Change step. You can modify the configurations of a custom sensitive field type. However, you cannot change the name, category, or sensitivity level of a built-in sensitive field type.

  • Delete a sensitive field type

    Find the sensitive field type that you want to delete and click the Delete icon in the Actions column. In the message that appears, click OK.

    Important

    The following situations occur if a sensitive field type is deleted. Take note of the information before you delete a sensitive field type.

Publish multiple sensitive field types at a time

After you publish sensitive field types, the system identifies sensitive data based on the sensitive data identification rules for these types. For more information about the identification results, see Identify sensitive data.

  1. Click Batch release and select the sensitive field types that you want to publish.

    Note

    Sensitive field types whose status is The published cannot be selected.

  2. Click release. Then, the status of each selected sensitive field type is changed to The published.

  3. You can also click Cancel to cancel the operation.

Sensitive data identification tasks

The system runs tasks at 09:00 every day to automatically identify sensitive data. You can also manually trigger a sensitive data identification task after you publish multiple sensitive field types at a time.

  1. In the upper-left corner of the Sensitive data identification page, click Open task.

  2. In the Open sensitive data identification task panel, set the Scanning range parameter to Total quantity or Custom range.

    Parameter value

    Description

    Total quantity

    The system scans all available data of RAM users to which the current tenant grants permissions.

    Custom range

    • The system scans data in all workspaces with which compute engine instances are associated. You can select only ODPS from the compute engine drop-down list on the left. You can select a workspace from the drop-down list on the right. The workspace drop-down list displays all workspaces whose metadata is obtained for the selected compute engine.

    • You can enter a table name in the Table name field. The name of the table must be 0 to 100 characters in length. Character types are not limited. If you leave this field empty, all tables are scanned. A table name supports the wildcard (.*). For example, .*name indicates that tables whose names are suffixed with name are scanned, and private.* indicates that tables whose names are prefixed with private are scanned. If you specify multiple table names, separate them with commas (,).

    You can click Add custom range to add multiple custom ranges. A maximum of 10 custom ranges can be added. The data scan range is determined by the union of all added custom ranges.

  3. After you configure the Scanning range parameter, click Open to enable the task. A progress bar is displayed to the right of Task status. The following formula is used to calculate the task progress: Task progress = 100% × (Number of identified tables in the task) / (Total number of tables that you want to identify in the task). To terminate the task, click Terminate task to the right of the progress bar. In the dialog box that appears, click Confirm.

    Note

    After you modify a sensitive data identification rule, the new rule takes effect when the task is automatically run next time. If you want the new rule to take effect in real time, you must manually trigger the task to run.

  4. Click View log. You can view the latest 50 execution log entries.

  5. After the scanning is complete, No task is displayed to the right of Task status.
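The Custom range settings and the progress formula in the steps above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the DataWorks implementation: all function names are hypothetical, only the `.*` sequence is treated as a wildcard while every other character is matched literally, and a custom range is modeled as a (workspace, table name field) pair.

```python
import re
from typing import Sequence, Tuple

MAX_CUSTOM_RANGES = 10  # a maximum of 10 custom ranges can be added


def table_in_scope(table: str, table_name_field: str) -> bool:
    """Check a table name against the comma-separated Table name field."""
    patterns = [p.strip() for p in table_name_field.split(",") if p.strip()]
    if not patterns:
        return True  # an empty field means that all tables are scanned
    for pattern in patterns:
        # Keep ".*" as the wildcard and escape every other character.
        regex = ".*".join(re.escape(part) for part in pattern.split(".*"))
        if re.fullmatch(regex, table) is not None:
            return True
    return False


def in_scan_range(workspace: str, table: str,
                  ranges: Sequence[Tuple[str, str]]) -> bool:
    """A table is scanned if it falls into the union of all custom ranges."""
    if len(ranges) > MAX_CUSTOM_RANGES:
        raise ValueError("at most 10 custom ranges are allowed")
    return any(ws == workspace and table_in_scope(table, field)
               for ws, field in ranges)


def task_progress(identified_tables: int, total_tables: int) -> float:
    """Progress in percent: 100 x identified tables / total tables."""
    if total_tables == 0:
        return 0.0
    return 100.0 * identified_tables / total_tables


custom_ranges = [("ws_a", ".*name, private.*"), ("ws_b", "")]
print(in_scan_range("ws_a", "private_orders", custom_ranges))
print(task_progress(25, 100))
```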

Unpublish multiple sensitive field types at a time

After you unpublish sensitive field types, the system stops identifying sensitive data for these types. Records of the unpublished sensitive field types on the Data Recognition page and the Manual Check tab are also deleted. Before you unpublish a sensitive field type, check whether it is referenced by data masking rules and risk identification rules. If so, change the status of the data masking rule to Inactive and remove the sensitive field type from the configurations of the risk identification rule. For more information, see Create a data masking rule and Risk identification rule management (old version).

  1. Click The batch from the shelves and select the sensitive field types that you want to unpublish.

  2. Click The shelves. Then, the status of each selected sensitive field type is changed to The draft.

  3. You can also click Cancel to cancel the operation.

Task execution records

The Task execution record tab displays execution records of completed tasks in the last week. The execution records of tasks that are being executed are not displayed on this tab. Each record contains the following information: Start time, End time, Time consuming, Task type, Person liable, and Data range.