DataWorks: Generate a custom data identification model

Last Updated: Aug 16, 2023

DataWorks allows you to use sample fields to train models. DataWorks extracts the features of these fields and generates a rule model. You can then use this rule model to identify data with similar features in your data assets. This topic describes how to generate a custom data identification model.

Limits

  • Each sample field used for model training in DataWorks must contain at least 10 data entries, and each entry must be 4 to 40 characters in length.

  • The sample fields used for model training in DataWorks cannot contain Chinese characters, including Chinese punctuation marks.
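
Before you select a field as a sample, it can help to check its values against these limits locally. The following is a minimal sketch, not part of DataWorks: the function name `check_sample_field` is hypothetical, the length limit is assumed to apply to each entry, and the Unicode ranges are only an approximation of "Chinese characters, including Chinese punctuation marks".

```python
import re

# Values taken from the Limits section.
MIN_ENTRIES = 10
MIN_LEN, MAX_LEN = 4, 40

# Assumption: Chinese characters and punctuation are approximated by the CJK
# ideograph blocks, CJK symbols and punctuation, and fullwidth forms. The
# exact set of characters that DataWorks rejects may differ.
_CHINESE = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\u3000-\u303f\uff00-\uffef]")


def check_sample_field(entries):
    """Return a list of problems found; an empty list means the field passes."""
    problems = []
    if len(entries) < MIN_ENTRIES:
        problems.append(f"only {len(entries)} entries, need at least {MIN_ENTRIES}")
    for i, value in enumerate(entries):
        if not MIN_LEN <= len(value) <= MAX_LEN:
            problems.append(
                f"entry {i} has length {len(value)}, expected {MIN_LEN}-{MAX_LEN}"
            )
        if _CHINESE.search(value):
            problems.append(f"entry {i} contains Chinese characters or punctuation")
    return problems


# Hypothetical sample values used only for illustration.
good_field = [f"EMP-{n:05d}" for n in range(12)]
short_field = ["abc"] * 3
print(check_sample_field(good_field))
print(check_sample_field(short_field))
```

A field that returns an empty list satisfies the limits as interpreted here. The check runs on your own copy of the data and does not call any DataWorks API.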

Create a model

  1. Go to the Data Security Guard page.

    1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

    2. Click the More icon in the upper-left corner and choose All Products > Data Governance > Data Security Guard.

    3. Click Try now to go to the Data Security Guard page.

  2. In the left-side navigation pane, choose Rule Change > Sensitive data identification.

  3. Click the Self Generated Data Recognition Model tab.

  4. Create and train a model.

    1. On the Self Generated Data Recognition Model tab, click Add Model.

    2. In the dialog box that appears, set the Model Name parameter and select sample fields.

      • Select sample fields: You can select sample fields from the current workspace. DataWorks extracts the features of these fields and generates a rule model. Then, you can use this rule model to identify the data that has similar features in your data assets.

        Note
        • Each sample field used for model training in DataWorks must contain at least 10 data entries, and each entry must be 4 to 40 characters in length.

        • The sample fields used for model training in DataWorks cannot contain Chinese characters, including Chinese punctuation marks.

      • Filter fields: If specific fields are likely to be confused with the sample fields, you can specify them as filter fields to exclude them from the rule model. The model is trained with the excluded fields as negative samples, which improves identification accuracy, and the resulting rule model does not match these fields when it identifies data.

    3. Click Next.

    4. Click Start Training to start training the model.

      Fewer than 100 data entries are randomly selected from each sample field that you specified and used to train the model. The training duration depends on the number of sample fields that you specified.

      Note

      Wait until the training is complete. If you want to use other features while the model is being trained, you can close the Add Model dialog box. DataWorks continues to train the model in the background.

  5. View the training results.

    On the Self Generated Data Recognition Model tab, you can view the training status and results of the model. You can determine whether the model is qualified for data identification in an online environment based on the training results.

    • View the training status.

      • Surplus hh:mm:ss: The model is being trained.

      • Training Completed: The model is trained. You can determine whether the model can be used for data identification based on the training results.

      • Draft: The model is created but not trained. The model cannot be used for data identification.

    • View the training results.

      To view how accurately the features extracted by this model identify the sample data, click the Edit the model icon in the Actions column of the trained model. We recommend that you deploy this model to an online environment only when the accuracy reaches 100%.

      Note

      If you deploy a model whose identification accuracy on sample data is less than 100% to an online environment, the identified results may differ significantly from the actual results. In this case, we recommend that you add more sample data and retrain the model until its identification accuracy on sample data reaches 100%. Then, deploy the model to an online environment.

  6. Click Create to create the rule model.
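
The sampling and accuracy check described in the training steps above can be illustrated with a small sketch. This is not how DataWorks implements training, which is not documented in this topic. The regular expression stands in for the generated rule model, and the names `draw_training_entries`, `accuracy`, and `ready_for_online` are hypothetical. The sketch only shows the idea of drawing fewer than 100 random entries per sample field, treating filter fields as negative samples, and deploying only at 100% accuracy.

```python
import random
import re

MAX_SAMPLED = 99  # "fewer than 100 data entries" per sample field


def draw_training_entries(field_values, rng):
    """Randomly pick fewer than 100 entries from one sample field."""
    k = min(len(field_values), MAX_SAMPLED)
    return rng.sample(field_values, k)


def accuracy(rule, positives, negatives):
    """Share of entries that the rule classifies correctly.

    Entries from sample fields (positives) should match the rule.
    Entries from filter fields (negatives) should not match it.
    """
    correct = sum(1 for v in positives if rule.fullmatch(v))
    correct += sum(1 for v in negatives if not rule.fullmatch(v))
    total = len(positives) + len(negatives)
    return correct / total if total else 0.0


def ready_for_online(acc):
    """Apply the recommendation: deploy only when accuracy is 100%."""
    return acc == 1.0


# Hypothetical data: one sample field and one easily confused filter field.
rng = random.Random(0)
sample_field = [f"EMP-{n:05d}" for n in range(250)]
filter_field = [f"ORD-{n:05d}" for n in range(50)]
rule = re.compile(r"EMP-\d{5}")  # assumed stand-in for the trained rule model

train = draw_training_entries(sample_field, rng)
acc = accuracy(rule, train, filter_field)
print(len(train), acc, ready_for_online(acc))
```

If the check returns False for your own sample data, that corresponds to the case in the Note above: add more sample data and retrain before you deploy the model.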

What to do next

After you create the rule model, you can use this rule model for data identification on the Data Recognition Rules tab. For more information, see Identify sensitive data.