DataWorks helps you identify sensitive data in your workspace by using sensitive data identification rules configured for built-in and custom sensitive field types. This topic describes how to create a sensitive field type and configure a sensitive data identification rule for this type.
Background information

Go to the Data Recognition Rules tab
- Log on to the DataWorks console and go to the Data Security Guard page. For more information, see Overview.
- Click Try now. The Data Security Guard homepage appears.
- In the left-side navigation pane, choose . On the Data Recognition Rules tab, you can create a sensitive field type and configure a sensitive data identification rule for this type.
Create a category
- The first time that you use Data Security Guard, the default categories are displayed in the left-side section on the Data Recognition Rules tab. You can search for a category by entering its name in the search box. You can also click the icon to the right of a category name and select an option to create a category or a subcategory, rename the category, or delete the category.
- If you have used Data Security Guard before, you can create categories in the left-side section on the Data Recognition Rules tab based on your business requirements. Click the icon to the right of a default category to create another category.
- The name of the category must be unique. The name must be 1 to 30 characters in length, and can contain letters and digits.
- If you want to delete a category, make sure that all sensitive field types in this category are deleted. Otherwise, the category cannot be deleted. For more information, see Unpublish multiple sensitive field types at a time.
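The naming constraints above can be checked before you submit a name in the console. The following is a minimal sketch, not part of DataWorks: the function name is hypothetical, and it assumes that "letters and digits" means ASCII letters and digits only, which this topic does not state.

```python
import re
from typing import AbstractSet

# Assumption: only ASCII letters and digits are allowed; length is 1 to 30.
_CATEGORY_NAME = re.compile(r"[A-Za-z0-9]{1,30}")


def is_valid_category_name(name: str, existing: AbstractSet[str]) -> bool:
    """Return True if the name is 1 to 30 letters or digits and is not already in use."""
    return bool(_CATEGORY_NAME.fullmatch(name)) and name not in existing


if __name__ == "__main__":
    print(is_valid_category_name("Finance01", {"HR"}))
    print(is_valid_category_name("HR", {"HR"}))
```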

Create a sensitive field type
- Specify a category for the sensitive field type that you want to create.
In the left-side section, click the category in which you want to create a sensitive field type.
- Create a sensitive field type and configure a sensitive data identification rule for this type.
Click Sensitive field type in the upper-right corner.
- In the Set Basic Info step, configure the parameters and click Next.
Parameter: Description
Sensitive field type: The name of the sensitive field type, such as name, ID number, or phone number.
Note: The name must be unique. If the specified name already exists, the Duplicate sensitive field type message appears below the Sensitive field type field.
Category: The category in which you want to create the sensitive field type. The default value is the category that you clicked in Step 1. To change the category, select a different category from the drop-down list.
Level: The sensitivity level of the sensitive field type. If the existing levels do not meet your business requirements, go to the Data classification and level page and configure sensitivity levels. For more information, see Manage data sensitivity levels.
Description: The description of the sensitive field type. The description must be 0 to 100 characters in length and cannot contain special characters.
- In the Rule Change step, configure the sensitive data identification rule and use sample data to test the accuracy of the rule. After the sensitive data identification rule is configured and the sensitive field type is published, you can use the rule to identify sensitive data in an identification task.
Note: If you modify the rule, the identification results obtained based on the original sensitive data identification rule are cleared.
Parameter: Description
Identify the rule matching conditions: The mode in which the sensitive data identification rule is hit. Select a mode from the drop-down list. Valid values:
- Any one of the following conditions is met to hit the rule: The sensitive data identification rule is hit when the condition for sensitive content identification or the condition for sensitive field identification is met.
- The following conditions are met at the same time to hit the rule: The sensitive data identification rule is hit when the conditions for both sensitive content identification and sensitive field identification are met.
Note: The mode takes effect only for sensitive content identification and sensitive field identification.
Data content identification: The identification method used to match sensitive text.
Note: Sensitive content identification identifies the values of a sensitive field. For example, if the values of the name field contain the specified sensitive text, such as Tom and Bob, Tom and Bob are identified as sensitive data.
Valid values:
- Regular expression: Enter a regular expression and sample data. Then, click Test accuracy to test the accuracy of the sensitive data identification rule.
- Built-in recognition rules: Select a built-in identification method from the drop-down list and enter sample data. Then, click Test accuracy to test the accuracy of the sensitive data identification rule.
Note: The Built-in recognition rules option is available only in DataWorks Enterprise Edition or a more advanced edition.
- Sample library: Select a sample library and enter sample data. Then, click Test accuracy to test the accuracy of the sensitive data identification rule. For more information about how to create and manage sample libraries, see Create and manage sample libraries.
- Self-generating model: Select a custom data identification model and enter sample data. Then, click Test accuracy to test the accuracy of the sensitive data identification rule. For more information about how to generate a custom data identification model, see Generate a custom data identification model.
Note: You can select Self-generating model only if your workspace is associated with a MaxCompute compute engine instance. The Self-generating model option is available only in DataWorks Enterprise Edition or a more advanced edition.
Note: The sensitive content identification condition is available only in DataWorks Professional Edition or a more advanced edition. If you use DataWorks Basic Edition or DataWorks Standard Edition, you must upgrade to the Professional Edition or a more advanced edition before you can use the sensitive content identification condition. For more information about DataWorks editions, see Billing of DataWorks advanced editions.
Field name recognition: The fields used to match sensitive data. You can specify multiple fields. The logical relationship between fields is OR. If data contains one of the specified fields, the data is identified as sensitive data. The value must be in the project.table.column format. You can use an asterisk (*) to represent any number of characters in each section of the value. Examples:
- The field value abcd.efg.* indicates that all data in the efg table in the abcd workspace is identified as sensitive data.
- The field value ab*.*.salary indicates that data in the column named salary in all tables in the workspaces whose names are prefixed with ab is identified as sensitive data.
- The field value *cd.ef*.sa*ry indicates that data in the columns whose names are prefixed with sa and suffixed with ry is identified as sensitive data. These columns are contained in the tables whose names are prefixed with ef in the workspaces whose names are suffixed with cd.
Note: Sensitive field identification identifies sensitive field names. For example, if the data contains a sensitive field name, the data is identified as sensitive data.
Field comment recognition: The field comments used to match sensitive data. For example, for the phone number sensitive field type, you can specify the comments phone number and contact method. A field whose comment contains phone number or contact method is then identified as a sensitive field of the phone number type. You can specify up to 10 field comments. Each field comment can be 0 to 100 characters in length. The types of characters in comments are not limited.
Field to rule out: The fields that you want to ignore. The fields are not used to match sensitive data and do not hit the configured sensitive data identification rule. The value must be in the project.table.column format. You can use an asterisk (*) to represent any number of characters in each section of the value. Examples:
- The field value abcd.efg.* indicates that all data in the efg table in the abcd workspace is ignored by sensitive field identification.
- The field value ab*.*.salary indicates that data in the column named salary in all tables in the workspaces whose names are prefixed with ab is ignored by sensitive field identification.
- The field value *cd.ef*.sa*ry indicates that data in the columns whose names are prefixed with sa and suffixed with ry is ignored by sensitive field identification. These columns are contained in the tables whose names are prefixed with ef in the workspaces whose names are suffixed with cd.
Hit ratio configuration: The hit ratio threshold of sensitive content identification for a column. If the ratio of identified sensitive values to non-empty values of a column exceeds the threshold, the sensitive data identification rule is hit. By default, the hit ratio threshold is set to 50%. You can also customize the hit ratio threshold based on your business requirements. The following formula is used to calculate the hit ratio of sensitive content identification for a column:
Hit ratio = 100% × Number of data records that match the condition of sensitive content identification / Total number of data records in the column
Note: The hit ratio threshold takes effect only for sensitive content identification.
- After you check the configurations of the sensitive data identification rule, click Save drafts to save a draft. You can also click Release to use to publish the sensitive field type. After you publish a sensitive field type, the status of the sensitive field type becomes The published and an identification task is generated for the sensitive field type.
Note: Data in a column can hit the conditions of multiple sensitive data identification rules that are configured for different sensitive field types. If the rules specify the same number of conditions, sensitive data in the column is identified based on the conditions in the following order of priority: sensitive field identification, sensitive content identification, and sensitive comment identification. If the rules specify the same number and types of conditions, sensitive data in the column is identified based on the rule of the sensitive field type that has the highest sensitivity level.
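The wildcard behavior described for Field name recognition and Field to rule out can be illustrated offline. The following is a minimal sketch, not the DataWorks implementation: the function names are hypothetical, and it assumes that an asterisk never matches across a period and that a field matched by Field to rule out is ignored even if it also matches Field name recognition.

```python
import re
from typing import Iterable


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a project.table.column pattern into a regular expression."""
    sections = pattern.split(".")
    if len(sections) != 3:
        raise ValueError("pattern must be in the project.table.column format")
    # Escape literal text. Assumption: '*' matches any number of characters
    # within one section, so it is translated to '[^.]*'.
    parts = [
        "[^.]*".join(re.escape(piece) for piece in section.split("*"))
        for section in sections
    ]
    return re.compile(r"\.".join(parts))


def field_matches(field: str, include: Iterable[str], exclude: Iterable[str] = ()) -> bool:
    """Return True if the field matches any include pattern and no exclude pattern."""
    if any(pattern_to_regex(p).fullmatch(field) for p in exclude):
        return False  # Assumption: ruled-out fields take precedence.
    return any(pattern_to_regex(p).fullmatch(field) for p in include)


if __name__ == "__main__":
    print(field_matches("abcd.efg.salary", ["*cd.ef*.sa*ry"]))
    print(field_matches("abcd.efg.salary", ["abcd.efg.*"], exclude=["*.*.salary"]))
```

Multiple include patterns are combined with OR, which mirrors the logical relationship described for Field name recognition.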
You can search for the published sensitive field types on the Data Recognition, Data Activities, and Data Risks pages by name and sensitivity level.
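To reason about how the hit ratio threshold and the rule matching mode in the Rule Change step interact, the following minimal sketch may help. It is not the DataWorks implementation, and all names are hypothetical. It assumes that a value counts as sensitive only when the whole value matches the regular expression, that the denominator is the number of non-empty values (the formula in this topic refers to the total number of data records), and that "exceeds" means strictly greater than the threshold.

```python
import re
from typing import Iterable, Optional


def hit_ratio(values: Iterable[Optional[str]], pattern: str) -> float:
    """Return the percentage of non-empty values that fully match the pattern."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if re.fullmatch(pattern, v))
    return 100.0 * hits / len(non_empty)


def content_condition_met(values: Iterable[Optional[str]], pattern: str,
                          threshold: float = 50.0) -> bool:
    """The content condition is met when the hit ratio exceeds the threshold."""
    return hit_ratio(values, pattern) > threshold


def rule_hit(content_met: bool, field_met: bool, require_all: bool) -> bool:
    """Combine the content and field conditions according to the matching mode."""
    if require_all:
        return content_met and field_met
    return content_met or field_met


if __name__ == "__main__":
    column = ["13800000000", "13900000001", "n/a", "", None]
    met = content_condition_met(column, r"1\d{10}")
    print(met, rule_hit(met, field_met=False, require_all=True))
```

In this sample, two of the three non-empty values match, so the content condition is met at the default 50% threshold, but the rule is not hit when both conditions are required.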
Manage sensitive field types
- Copy a sensitive field type
Find the sensitive field type that you want to copy and click the icon in the Actions column. A new sensitive field type with the same settings is created. The name of the generated sensitive field type is suffixed with -Replica. By default, it is saved as a draft.
- Modify a sensitive field type
Find the sensitive field type that you want to modify and click the icon in the Actions column. In the dialog box that appears, you can modify the sensitive data identification rule for the sensitive field type in the Rule Change step. You can modify the configurations of a custom sensitive field type. However, you cannot change the name, category, or sensitivity level of a built-in sensitive field type.
- Delete a sensitive field type
Find the sensitive field type that you want to delete and click the icon in the Actions column to delete the sensitive field type. In the message that appears, click OK.
Important: Deleting a sensitive field type has the following effects. Take note of them before you delete a sensitive field type.
- The identification results generated based on the sensitive data identification rule for this type are deleted. For more information, see Manually correct sensitive data identification results.
- The statistics on this sensitive field type are no longer displayed on the Data Recognition page. For more information, see Identify sensitive data.
- If this sensitive field type is referenced by a risk identification rule, it is removed from the configurations of the risk identification rule. For more information, see Risk identification rule management (old version).
Publish multiple sensitive field types at a time
After you publish sensitive field types, the system identifies sensitive data based on the sensitive data identification rules for these types. For more information about the identification results, see Identify sensitive data.
- Click Batch release and select the sensitive field types that you want to publish.
Note: Sensitive field types whose status is The published cannot be selected.
- Click release. Then, the status of each selected sensitive field type is changed to The published.
- You can also click Cancel to cancel the operation.

Sensitive data identification tasks
- In the upper-left corner of the Sensitive data identification page, click Open task.
- In the Open sensitive data identification task panel, set the Scanning range parameter to Total quantity or Custom range.
Parameter value: Description
Total quantity: The system scans all available data of RAM users to which the current tenant grants permissions.
Custom range:
- The system scans data in all workspaces with which compute engine instances are associated. You can select only ODPS from the compute engine drop-down list on the left. You can select a workspace from the drop-down list on the right. The workspace drop-down list displays all workspaces whose metadata is obtained for the selected compute engine.
- You can enter a table name in the Table name field. The name of the table must be 0 to 100 characters in length. The types of characters are not limited. If you leave this field empty, all tables are scanned. A table name supports the wildcard (.*). For example, .*name indicates that tables whose names are suffixed with name are scanned, and private.* indicates that tables whose names are prefixed with private are scanned. If you specify multiple table names, separate them with commas (,).
- After you configure the Scanning range parameter, click Open to enable the task. A progress bar is displayed to the right of Task status. The following formula is used to calculate the task progress: 100% × Number of identified tables in the task / Total number of tables that you want to identify in the task. To terminate the task, click Terminate task to the right of the progress bar. In the dialog box that appears, click Confirm.
Note: After you modify a sensitive data identification rule, the new rule takes effect when the task is automatically run next time. If you want the new rule to take effect in real time, you must manually trigger the task to run.
- Click View log. You can view the latest 50 execution log entries.
- After the scanning is complete, No task is displayed to the right of Task status.
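Before you open a task, you may want to check which tables a Table name value would select and how the progress value is derived. The following is a minimal sketch, not the DataWorks implementation: the function names are hypothetical, and it assumes that .* is the only wildcard, that all other characters are matched literally, and that a name must match in full.

```python
import re
from typing import Callable


def compile_table_filter(spec: str) -> Callable[[str], bool]:
    """Build a predicate from a comma-separated Table name value."""
    names = [n.strip() for n in spec.split(",") if n.strip()]
    if not names:
        # An empty value means that all tables are scanned.
        return lambda table: True
    # Assumption: only '.*' is a wildcard; everything else is literal text.
    regexes = [
        re.compile(".*".join(re.escape(part) for part in name.split(".*")))
        for name in names
    ]
    return lambda table: any(r.fullmatch(table) for r in regexes)


def task_progress(identified_tables: int, total_tables: int) -> float:
    """Return the task progress as a percentage, per the formula in this section."""
    if total_tables <= 0:
        return 0.0  # Assumption: no tables to identify is reported as 0%.
    return 100.0 * identified_tables / total_tables


if __name__ == "__main__":
    selected = compile_table_filter(".*name, private.*")
    tables = ["user_name", "private_orders", "orders"]
    print([t for t in tables if selected(t)])
    print(task_progress(25, 200))
```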
Unpublish multiple sensitive field types at a time
After you unpublish sensitive field types, the system stops identifying sensitive data for these types. Records of the unpublished sensitive field types on the Data Recognition page and the Manual Check tab are also deleted. Before you unpublish a sensitive field type, check whether it is referenced by data masking rules and risk identification rules. If so, change the status of the data masking rule to Inactive and remove the sensitive field type from the configurations of the risk identification rule. For more information, see Create a data masking rule and Risk identification rule management (old version).
- Click The batch from the shelves and select the sensitive field types that you want to unpublish.
- Click The shelves. Then, the status of each selected sensitive field type is changed to The draft.
- You can also click Cancel to cancel the operation.

Task execution records
The Task execution record tab displays execution records of completed tasks in the last week. The execution records of tasks that are being executed are not displayed on this tab. Each record contains the following information: Start time, End time, Time consuming, Task type, Person liable, and Data range.