DataWorks allows you to identify sensitive data in your workspace by using the data identification rules configured for built-in and custom sensitive field types. This topic describes how to create a sensitive field type and configure the data identification rule for this type.

Background information

DataWorks lets you configure a data identification rule for a sensitive field type that has a specific sensitivity level and belongs to a specific category. This helps you identify sensitive data in your workspace. You can manually correct inaccurate identification results on the Manual Check tab. The Data Recognition page displays statistics on the fields that hit the data identification rules at each sensitivity level and in each project over the last seven days. For more information, see Manually correct data and Identify sensitive data. The following figure shows the logic for configuring a data identification rule.
[Figure: Logic of how to configure a data identification rule]
Note Before you can identify or de-identify sensitive data in Cloudera's Distribution Including Apache Hadoop (CDH), you must sample data from CDH Hive tables by using the data crawler feature of DataWorks. Then, Data Security Guard identifies sensitive data from the sampled data. The sampled data is not stored in DataWorks. This prevents data leaks. For more information, see CDH Hive sampling crawlers.

Go to the Data Recognition Rules tab

  1. Log on to the DataWorks console and go to the Data Security Guard page. For more information, see Overview.
  2. Click Try now. The Data Security Guard homepage appears.
  3. In the left-side navigation pane, choose Rule Change > Data Recognition Rules. On this tab, you can create a sensitive field type and configure the data identification rule for this type.

Create a category

  • If you use Data Security Guard for the first time, the default category is displayed in the left-side section. You can search for a category by entering its name in the search box. You can also click the Create icon next to the category name and select an option to create a category or a subcategory, rename the category, or delete the category.
  • If you have used Data Security Guard before, you can create categories in the left-side section as needed on the Data Recognition Rules page. Click the Create icon next to the default category to create a category.
Note
  • The name of the category must be unique and can contain letters and digits. It must be 1 to 30 characters in length.
  • If you want to delete a category, make sure that you have unpublished all sensitive field types in this category. Otherwise, the category cannot be deleted. For more information, see Unpublish multiple sensitive field types at a time.
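The naming rule in the preceding note can be checked before you submit a category name. The following is a minimal sketch, not DataWorks code. It assumes that "letters and digits" means ASCII letters and digits only, and the function name is_valid_category_name is hypothetical.

```python
import re
from typing import Iterable

# Assumption: only ASCII letters and digits are allowed, 1 to 30 characters.
_CATEGORY_NAME_RE = re.compile(r"[A-Za-z0-9]{1,30}")


def is_valid_category_name(name: str, existing_names: Iterable[str]) -> bool:
    """Return True if the name is 1 to 30 letters or digits and is not already used."""
    if _CATEGORY_NAME_RE.fullmatch(name) is None:
        return False
    return name not in set(existing_names)
```

The uniqueness check here is case-sensitive. Whether DataWorks treats names that differ only in case as duplicates is not stated in this topic.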

Create a sensitive field type

  1. Specify a category for the sensitive field type that you want to create.

    In the left-side section, click the category in which you want to create a sensitive field type.

  2. Create a sensitive field type and configure the data identification rule for this type.
    Click Sensitive field type in the upper-right corner.
    1. In the Set Basic Info step, set the parameters as required and click Next.
      Parameter: Description
      Sensitive field type: The name of the sensitive field type, such as name, ID number, or phone number.
      Note The name must be unique. If the specified name already exists, the Duplicate sensitive field type message appears below the Sensitive field type field.
      Category: The category in which you want to create the sensitive field type. The default value is the category that you clicked in Step 1. To change the category, select another category from the drop-down list.
      Level: The sensitivity level of the sensitive field type. If the existing levels do not meet your needs, go to the Data classification and level page and configure sensitivity levels as needed. For more information, see Manage data sensitivity levels.
      Description: The description of the sensitive field type. The description can be 0 to 100 characters in length. Special characters are not supported.
    2. In the Rule Change step, configure the data identification rule and use sample data to test its accuracy. After the rule is configured and the sensitive field type is published, you can use the rule to identify sensitive data in an identification task.
      Parameter: Description
      Identify the rule matching conditions: From the drop-down list, select the mode in which the data identification rule is hit. Valid values:
      • Any one of the following conditions is met to hit the rule: The data identification rule is hit when the condition for sensitive content identification or sensitive field identification is met.
      • The following conditions are met at the same time to hit the rule: The data identification rule is hit when both the conditions for sensitive content identification and sensitive field identification are met.
      Note The mode takes effect only for sensitive content identification and sensitive field identification.
      Data content identification: Specifies whether to enable sensitive content identification and which identification method is used to match sensitive text.
      Note Sensitive content identification identifies the sensitive values of fields. For example, if the values of the name field in the data contain the specified sensitive text such as Tom and Bob, the data is identified as sensitive data.
      Valid values:
      • Regular expression: Enter a regular expression and sample data. Then, click Test accuracy to test the accuracy of the data identification rule.
      • Built-in recognition rules: Select a built-in identification method and enter the sample data. Then, click Test accuracy to test the accuracy of the data identification rule.
      • Sample library: Select a sample library and enter the sample data. Then, click Test accuracy to test the accuracy of the data identification rule. For more information about how to create and manage sample libraries, see Create and manage sample libraries.
      • Self-generating model: Select a custom data identification model and enter the sample data. Then, click Test accuracy to test the accuracy of the data identification rule. For more information about how to generate a custom data identification model, see Generate a custom data identification model.
        Note You can select Self-generating model only if the workspace is bound to a MaxCompute project.
      Note The sensitive content identification feature is available only in DataWorks Professional Edition or a more advanced edition. If you are using DataWorks Basic Edition or DataWorks Standard Edition, upgrade your DataWorks service to the Professional Edition or a more advanced edition before you use this feature. For more information about DataWorks editions, see DataWorks advanced editions.
      Field name recognition: Specifies whether to enable sensitive field identification. You can specify multiple fields to match. If the data contains any of the specified fields, the data is identified as sensitive data. Each value must be in the project.table.column format. You can use an asterisk (*) to represent any number of characters in each section of the value. Examples:
      • A value of abcd.efg.* indicates that all data in the efg table in the abcd workspace is identified as sensitive data.
      • A value of ab*.*.salary indicates that data of the salary fields in all tables in the workspaces whose names are prefixed with ab is identified as sensitive data.
      • A value of *cd.ef*.sa*ry indicates that data of the fields whose names are prefixed with sa and suffixed with ry is identified as sensitive data. These fields are contained in the tables whose names are prefixed with ef in the workspaces whose names are suffixed with cd.
      Note Sensitive field identification identifies sensitive field names. For example, if the data contains the sensitive field name, the data is identified as sensitive data.
      Column Exception Rule: Specifies whether to ignore specified fields. Ignored fields are not identified as sensitive data. Enter the fields that sensitive field identification must ignore. Each value must be in the project.table.column format. You can use an asterisk (*) to represent any number of characters in each section of the value. Examples:
      • A value of abcd.efg.* indicates that all data in the efg table in the abcd workspace is ignored by sensitive field identification.
      • A value of ab*.*.salary indicates that data of the salary fields in all tables in the workspaces whose names are prefixed with ab is ignored by sensitive field identification.
      • A value of *cd.ef*.sa*ry indicates that data of the fields whose names are prefixed with sa and suffixed with ry is ignored by sensitive field identification. These fields are contained in the tables whose names are prefixed with ef in the workspaces whose names are suffixed with cd.
      Hit ratio configuration: The hit ratio threshold of sensitive content identification for a column. If the ratio of identified sensitive values to non-empty values in a column exceeds the threshold, the data identification rule is hit. The default threshold is 50%, and you can customize it as needed. The hit ratio of sensitive content identification for a column is calculated by using the following formula:
      Hit ratio = 100% × Number of data records that match the condition of sensitive content identification / Total number of data records in the column
      Note The hit ratio threshold takes effect only for sensitive content identification.
    3. After you verify the configurations of the data identification rule, click Save drafts to save the sensitive field type as a draft, or publish it by clicking Release to use. After you publish a sensitive field type, its status changes to The published and an identification task is generated for it.
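
The wildcard examples given for Field name recognition and Column Exception Rule in the preceding step can be reproduced with a short sketch. This is an illustration of the described behavior, not the DataWorks implementation. The function names are hypothetical. The sketch assumes that matching is case-sensitive and that an exception pattern always takes precedence over a recognition pattern.

```python
import re
from typing import Iterable


def _section_to_regex(section: str) -> str:
    # Escape literal text and let each "*" match any run of characters
    # inside this one section (a section never contains ".").
    return ".*".join(re.escape(piece) for piece in section.split("*"))


def matches_field_pattern(pattern: str, field: str) -> bool:
    """Compare a project.table.column name with a pattern that may use "*"."""
    pattern_parts = pattern.split(".")
    field_parts = field.split(".")
    if len(pattern_parts) != 3 or len(field_parts) != 3:
        return False
    return all(
        re.fullmatch(_section_to_regex(p), f) is not None
        for p, f in zip(pattern_parts, field_parts)
    )


def field_name_hit(
    field: str,
    include_patterns: Iterable[str],
    exception_patterns: Iterable[str] = (),
) -> bool:
    """Apply recognition patterns, then let exception patterns veto the hit."""
    # Assumption: a field covered by an exception pattern is never a hit.
    if any(matches_field_pattern(p, field) for p in exception_patterns):
        return False
    return any(matches_field_pattern(p, field) for p in include_patterns)
```

With these definitions, matches_field_pattern("ab*.*.salary", "abc.t1.salary") returns True, and matches_field_pattern("ab*.*.salary", "xab.t1.salary") returns False because the project name is not prefixed with ab.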

You can search for the published sensitive field types on the Data Recognition, Data Activities, and Data Risks pages by name and sensitivity level.
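
To make the hit ratio and the rule matching modes from the Rule Change step concrete, the following sketch computes the hit ratio of a column for a regular expression and then combines the content result with the field name result. It is an illustration under stated assumptions, not the DataWorks implementation. The function names and the mode values "any" and "all" are hypothetical. The whole value must match the regular expression, and the denominator follows the formula in this topic (all records in the column).

```python
import re
from typing import Iterable, Optional


def content_hit_ratio(values: Iterable[Optional[str]], pattern: str) -> float:
    """Return 100 x matching records / total records for one column."""
    records = list(values)
    if not records:
        return 0.0
    regex = re.compile(pattern)
    # Assumption: a record matches only if the whole value matches the pattern.
    hits = sum(1 for v in records if v is not None and regex.fullmatch(v))
    return 100.0 * hits / len(records)


def content_hit(values: Iterable[Optional[str]], pattern: str,
                threshold: float = 50.0) -> bool:
    """The content condition is met only if the ratio exceeds the threshold."""
    return content_hit_ratio(values, pattern) > threshold


def rule_is_hit(content_met: bool, field_met: bool, mode: str) -> bool:
    """Combine the two conditions: "any" needs one of them, "all" needs both."""
    if mode == "any":
        return content_met or field_met
    if mode == "all":
        return content_met and field_met
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, a column in which three of four values match the pattern has a hit ratio of 75%, which exceeds the default threshold of 50%. A column with a hit ratio of exactly 50% does not exceed the threshold in this sketch.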

Manage sensitive field types

  • Copy a sensitive field type

    Find the sensitive field type that you want to copy and click the Copy icon in the Actions column. A new sensitive field type with the same settings is created. The name of the generated sensitive field type is suffixed with -Replica. By default, it is saved as a draft.

  • Modify a sensitive field type

    Find the sensitive field type that you want to modify and click the Edit icon in the Actions column. In the dialog box that appears, modify the data identification rule for the sensitive field type in the Rule Change step. You can modify the configurations of a custom sensitive field type. For a built-in sensitive field type, you cannot change the name, category, or sensitivity level.

  • Delete a sensitive field type
    Find the sensitive field type that you want to delete and click the Delete icon in the Actions column to delete the sensitive field type. In the message that appears, click OK.
    Notice Deleting a sensitive field type has the following effects. Review them before you delete the sensitive field type.
    • The identification results generated based on the data identification rule for this type are deleted. For more information, see Manually correct data.
    • The statistics of this sensitive field type are no longer displayed on the Data Recognition page. For more information, see Identify sensitive data.
    • If this sensitive field type is referenced by a risk identification rule, it will be removed from the configurations of the risk identification rule. For more information, see Manage risk identification rules.

Publish multiple sensitive field types at a time

After you publish sensitive field types, the system starts to identify sensitive data based on the data identification rules for these types. For more information about the identification results, see Identify sensitive data.

  1. Click Batch release and select the sensitive field types that you want to publish.
    Note Sensitive field types whose status is The published cannot be selected.
  2. Click release. Then, the status of each selected sensitive field type is changed to The published.
  3. You can also click Cancel to cancel the operation.

Unpublish multiple sensitive field types at a time

After you unpublish sensitive field types, the system stops identifying sensitive data for these types. Records of the unpublished sensitive field types on the Data Recognition page and the Manual Check tab are also deleted. Before you unpublish a sensitive field type, check whether it is referenced by de-identification rules and risk identification rules. If so, change the status of the de-identification rule to Inactive and remove the sensitive field type from the configurations of the risk identification rule. For more information, see Customize de-identification rules and Manage risk identification rules.

  1. Click The batch from the shelves and select the sensitive field types that you want to unpublish.
  2. Click The shelves. Then, the status of each selected sensitive field type is changed to The draft.
  3. You can also click Cancel to cancel the operation.
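
The status changes described in the two batch procedures can be summarized in a small sketch. It is a simplified model for illustration only, not DataWorks code. The class, the status constants, and the referenced_by field are hypothetical. Raising an error for a sensitive field type that is still referenced stands in for the manual check that this topic asks you to perform before you unpublish.

```python
from dataclasses import dataclass, field
from typing import Iterable, List

DRAFT = "draft"
PUBLISHED = "published"


@dataclass
class SensitiveFieldType:
    name: str
    status: str = DRAFT
    # Hypothetical bookkeeping: rules that still reference this type.
    referenced_by: List[str] = field(default_factory=list)


def batch_publish(types: Iterable[SensitiveFieldType]) -> List[str]:
    """Publish every draft; types that are already published are skipped."""
    changed = []
    for t in types:
        if t.status == DRAFT:
            t.status = PUBLISHED
            changed.append(t.name)
    return changed


def batch_unpublish(types: Iterable[SensitiveFieldType]) -> List[str]:
    """Return published types to draft unless a rule still references them."""
    selected = [t for t in types if t.status == PUBLISHED]
    blocked = [t.name for t in selected if t.referenced_by]
    if blocked:
        # Stand-in for the manual check described in this topic.
        raise ValueError(f"still referenced by other rules: {blocked}")
    for t in selected:
        t.status = DRAFT
    return [t.name for t in selected]
```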