Generate a custom data identification model

DataWorks allows you to use sample fields to train models. DataWorks extracts the features of these fields and generates a rule model. You can use this rule model to identify the data that has similar features in your data assets. This topic describes how to generate a custom data identification model.

Limits

Each sample field used for model training must contain at least 10 data entries, and each entry must be 4 to 40 characters in length.
Sample fields used for model training cannot contain Chinese characters or Chinese punctuation marks.
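
The limits above can be checked locally before you select a field for training. The following is a minimal sketch, not part of DataWorks: the function name check_sample_field is hypothetical, the length limit is assumed to apply to each data entry, and the Unicode ranges are only an approximation of "Chinese characters and punctuation marks".

```python
import re

# Assumed approximation: CJK ideographs plus CJK and full-width punctuation.
_CHINESE = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")

MIN_ENTRIES = 10          # at least 10 data entries per sample field
MIN_LEN, MAX_LEN = 4, 40  # assumed to apply to each entry


def check_sample_field(entries):
    """Return a list of problems; an empty list means the field passes."""
    problems = []
    if len(entries) < MIN_ENTRIES:
        problems.append(f"only {len(entries)} entries, need at least {MIN_ENTRIES}")
    for i, value in enumerate(entries):
        if not MIN_LEN <= len(value) <= MAX_LEN:
            problems.append(
                f"entry {i} has length {len(value)}, expected {MIN_LEN}-{MAX_LEN}"
            )
        if _CHINESE.search(value):
            problems.append(f"entry {i} contains Chinese characters or punctuation")
    return problems
```

For example, check_sample_field(values) can be run on a column exported from a table, and any reported entries can be cleaned or the field skipped before the model is created.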

Create a model

Go to the Data Security Guard page.
1. Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, choose Data Modeling and Development > DataStudio in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.
2. Click the icon in the upper-left corner, choose All Products > Data Governance > Data Security Guard, and then click Try now to go to the Data Security Guard page.
  Note
  - If your Alibaba Cloud account is granted the required permissions, you can directly access the homepage of Data Security Guard.
  - If your Alibaba Cloud account is not granted the required permissions, you are redirected to the authorization page of Data Security Guard. You can use the features of Data Security Guard only after your Alibaba Cloud account is granted the required permissions.
In the left-side navigation pane, choose Rule Configuration > Sensitive Data Identification. The Sensitive Data Identification page appears.
Create and train a model.
1. On the Self-generated Data Identification Models tab, click Create Model.
2. In the Create Model dialog box, configure the Model Name parameter and select the sample fields used for model training.
  - Sample Fields: You can select sample fields used for model training from a specific workspace. DataWorks extracts the features of these fields and generates a rule model. Then, you can use this rule model to identify the data that has similar features in your data assets.
    Note
    Each sample field used for model training must contain at least 10 data entries, and each entry must be 4 to 40 characters in length.
    Sample fields used for model training cannot contain Chinese characters or Chinese punctuation marks.
  - Exclude Fields: If specific fields are at risk of being misidentified as sample fields, you can exclude these fields from the rule model. The excluded fields are then not matched when you use the rule model to identify data. The excluded fields are used as negative samples during training to improve identification accuracy.
3. Click Next.
4. Select I agree to authorize Data Security Guard to sample data for model training and click Start Training to start training the model.
  Fewer than 100 data entries are randomly selected from each sample field that you specified to train the model. The training duration varies based on the number of sample fields that you specified.
  Note
  Wait until the training is complete. If you want to use other features while the model is being trained, you can close the Create Model dialog box. DataWorks continues to train the model in the background.
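
The per-field sampling described in the training step can be pictured with a short sketch. This only illustrates random sampling without replacement under a cap. The name sample_entries and the cap of 99 are assumptions, and the way DataWorks samples internally is not documented here.

```python
import random

SAMPLE_CAP = 99  # assumed cap, matching "fewer than 100" entries per field


def sample_entries(entries, cap=SAMPLE_CAP, seed=None):
    """Randomly pick up to `cap` entries from one sample field, without replacement."""
    rng = random.Random(seed)
    k = min(cap, len(entries))
    return rng.sample(list(entries), k)
```

Because each field contributes at most a fixed number of entries, the training input grows with the number of selected fields rather than with table size, which is consistent with the training duration depending on the number of sample fields.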
View the training results.
On the Self-generated Data Identification Models tab, you can view the training status and results of the model. You can determine whether the model is qualified for data identification in an online environment based on the training results.
- View the training status.
  - Surplus hh:mm:ss: The model is being trained.
  - Training Completed: The model training is complete. You can determine whether the model can be used for data identification based on the training results.
  - Draft: The model is created but not trained. The model cannot be used for data identification.
- View the training results.
  To view how accurately the extracted sample features identify the sample data, click the icon in the Actions column of the trained model. We recommend that you deploy the model to an online environment only when the accuracy reaches 100%.
  Note
  If you deploy a model whose identification accuracy on sample data is less than 100% to an online environment, the identified results may differ significantly from the actual results. In this case, we recommend that you increase the amount of sample data and retrain the model until its identification accuracy on sample data reaches 100%. Then, deploy the model to an online environment.
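
The deployment recommendation above amounts to a simple gate on sample accuracy. The sketch below is illustrative only: toy_model is a hypothetical stand-in for a trained rule model, and counting excluded fields as negative samples in the accuracy figure is an assumption, not the documented DataWorks calculation.

```python
import re


def toy_model(value):
    # Hypothetical rule model: matches values such as "ID-0001".
    return re.fullmatch(r"ID-\d{4}", value) is not None


def sample_accuracy(model, positives, negatives=()):
    """Fraction of sample entries that the model classifies correctly."""
    total = len(positives) + len(negatives)
    if total == 0:
        raise ValueError("no sample data to evaluate")
    correct = sum(1 for v in positives if model(v))
    correct += sum(1 for v in negatives if not model(v))
    return correct / total


def ready_to_deploy(accuracy):
    # Deploy only when accuracy on sample data is 100%.
    return accuracy == 1.0
```

If ready_to_deploy returns False, the recommended next step is to add sample data and retrain rather than to deploy the model as is.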
Click Create to create the rule model.

What to do next

After you create the rule model, you can use this rule model for data identification on the Data Identification Rules tab. For more information, see Configure sensitive data identification rules.