DataWorks: Detect data using custom models

Last Updated: Dec 16, 2025

DataWorks lets you train models using sample fields to identify content features and generate rule models. You can use this feature to find data in your data assets that has similar content features. This topic describes how to create custom data detection models.

Limits

  • Each sample field used for training must contain at least 10 entries, and each entry must be 4 to 40 characters long. The sample size must be between 10 and 10,000 entries. If the selected fields contain more than 10,000 entries in total, the system randomly selects 10,000 of them for training. Otherwise, the system uses all available entries.

  • DataWorks supports model training only for data that contains numbers, English letters, and special characters. Model training is not supported for sample fields that contain Chinese characters or Chinese punctuation.
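
The limits above can be checked before you select a field for training. The following Python snippet is a minimal sketch only, not part of DataWorks: the function names `check_field` and `select_training_entries` are hypothetical, and the regular expression is an approximation of "Chinese characters or Chinese punctuation" (CJK ideographs, CJK punctuation, and full-width forms). How DataWorks itself handles an invalid entry is not described in this topic, so the sketch only reports problems.

```python
import random
import re

MIN_ENTRIES = 10        # minimum number of entries in a sample field
MAX_ENTRIES = 10_000    # maximum number of entries used for training
MIN_LEN, MAX_LEN = 4, 40

# Assumption: approximate "Chinese characters or punctuation" with these ranges.
CHINESE = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")


def check_field(entries):
    """Return a list of problems; an empty list means the field passes."""
    problems = []
    if len(entries) < MIN_ENTRIES:
        problems.append(f"only {len(entries)} entries; at least {MIN_ENTRIES} required")
    for value in entries:
        if not MIN_LEN <= len(value) <= MAX_LEN:
            problems.append(f"{value!r}: length {len(value)} outside {MIN_LEN}-{MAX_LEN}")
        if CHINESE.search(value):
            problems.append(f"{value!r}: contains Chinese characters or punctuation")
    return problems


def select_training_entries(entries, seed=None):
    """Use all entries up to the cap; above it, draw MAX_ENTRIES at random."""
    if len(entries) <= MAX_ENTRIES:
        return list(entries)
    return random.Random(seed).sample(list(entries), MAX_ENTRIES)


# Hypothetical sample field: twelve order IDs, each 10 characters long.
field = [f"ORD-{i:06d}" for i in range(12)]
print(check_field(field))
print(len(select_training_entries(field)))
```

Running a check like this on an export of the field before training can save a failed training attempt, but the console remains the authority on whether a field is accepted.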

Create a model

  1. Go to Data Security Guard.

    1. Go to the DataStudio page.

      Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

    2. Click the icon in the upper-left corner. Then, choose All Products > Data Governance > Data Security Guard. On the page that appears, click Try Now to go to the Data Security Guard page.

      Note
      • If your Alibaba Cloud account is granted the required permissions, you can directly access the homepage of Data Security Guard.

      • If your Alibaba Cloud account is not granted the required permissions, you are redirected to the authorization page of Data Security Guard. You can use the features of Data Security Guard only after your Alibaba Cloud account is granted the required permissions.

  2. In the navigation pane on the left, choose Rule Configuration > Sensitive Data Detection to go to the Sensitive Data Detection page.

  3. Create and train a model.

    1. On the Self-generated Data Detection Model tab, click Create Model.

    2. In the Create Model dialog box, configure Model Name and select training samples.

      • Positive Sample Field: Select sample fields for training from a specified workspace. DataWorks identifies the content features of these fields and generates a rule model. You can then use this rule model to find data in your data assets that has similar content features.

        Note

        Each sample field used for training must contain at least 10 entries, and each entry must be 4 to 40 characters long. The sample size must be between 10 and 10,000 entries. If the selected fields contain more than 10,000 entries in total, the system randomly selects 10,000 of them for training. Otherwise, the system uses all available entries.

        DataWorks supports model training only for data that contains numbers, English letters, and special characters. Model training is not supported for sample fields that contain Chinese characters or Chinese punctuation.

      • Negative Sample Field: To improve model accuracy, you can select negative sample fields. The system uses the data from these fields as negative samples for training. If you do not select negative samples, the system automatically generates them based on the features and number of your positive samples.

    3. Click Next.

    4. Select I accept that Data Security Guard will use samples for model training and click Start Training.

      For this training, the system randomly extracts up to 100 data entries from each selected sample field. The estimated time required for training depends on the number of sample fields.

      Note

      Model training can take a long time. You can close the training dialog box and perform other operations while the model is trained in the background.

  4. View the model training results.

    On the Self-generated Data Detection Model page, you can view the training status and results of the model. Based on the results, you can decide whether the model is ready to be published and used for data detection.

    • View the training status.

      • Remaining hh:mm:ss: The model is being trained.

      • Training Completed: The model training is complete. You can evaluate the training results to decide whether the model can be used for data detection.

      • Draft: The model is created but not yet trained. It cannot be used for data detection.

    • View the training results.

      Click the Edit model icon in the Actions column of the trained model to view the model's accuracy in identifying the sample data. We recommend that you deploy this model to an online environment only when its accuracy reaches 100%.

      Note

      If the model's accuracy is below 100%, the detection results may contain significant errors. In that case, add more sample data and retrain the model, and publish the model only after its accuracy reaches 100%.

  5. Click Create to create the rule model.
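
The accuracy check in step 4 can be illustrated with a short Python sketch. This topic does not define how DataWorks computes accuracy, so the snippet assumes the common definition: the share of positive samples that the rule matches plus negative samples that it rejects. The names `accuracy`, `ready_to_publish`, and the `ORD-` order ID patterns are hypothetical and stand in for a trained rule model.

```python
import re


def accuracy(matches, positives, negatives):
    """Fraction of samples classified correctly by the predicate `matches`."""
    positives, negatives = list(positives), list(negatives)
    total = len(positives) + len(negatives)
    if total == 0:
        raise ValueError("at least one sample is required")
    correct = sum(1 for s in positives if matches(s))
    correct += sum(1 for s in negatives if not matches(s))
    return correct / total


def ready_to_publish(acc):
    """Mirror the recommendation: publish only at 100% accuracy."""
    return acc == 1.0


# Hypothetical rules standing in for a trained model.
strict_rule = re.compile(r"ORD-\d{6}")   # exactly six digits
loose_rule = re.compile(r"ORD-\d+")      # any number of digits

positives = ["ORD-000001", "ORD-123456", "ORD-999999"]
negatives = ["ORD-12", "user@example.com", "2025-12-16"]

strict_acc = accuracy(lambda s: strict_rule.fullmatch(s) is not None, positives, negatives)
loose_acc = accuracy(lambda s: loose_rule.fullmatch(s) is not None, positives, negatives)

print(strict_acc, ready_to_publish(strict_acc))
print(loose_acc, ready_to_publish(loose_acc))
```

In this sketch the looser rule wrongly accepts the negative sample `ORD-12`, which is the kind of error that adding more sample data and retraining is meant to remove.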

What to do next

After you create a rule model, go to the Data Detection Rules page to publish the model and use it to detect data. For more information about using custom models on the Data Detection Rules page, see Configure data detection rules and run detection tasks.