DataWorks lets you train models using sample fields to identify content features and generate rule models. You can use this feature to find data in your data assets that has similar content features. This topic describes how to create custom data detection models.
Limits
Each sample field must contain at least 10 entries, and each entry must be 4 to 40 characters long. A model is trained on 10 to 10,000 entries. If the selected fields contain more than 10,000 entries in total, the system randomly selects 10,000 of them for training. Otherwise, the system uses all available entries.
DataWorks supports model training only for data that contains numbers, English letters, and special characters. Model training is not supported for sample fields that contain Chinese characters or Chinese punctuation.
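The limits above can be checked locally before you select a field in the console. The following is a minimal sketch, not part of DataWorks: the function names are hypothetical, the Unicode ranges only approximate "Chinese characters or Chinese punctuation", and how DataWorks itself reports or handles invalid entries is not described here.

```python
import random

MIN_LEN, MAX_LEN = 4, 40              # allowed length of each entry
MIN_SAMPLES, MAX_SAMPLES = 10, 10_000  # allowed sample size


def has_chinese(text):
    """Approximate check for Chinese characters or punctuation (assumed ranges)."""
    for ch in text:
        code = ord(ch)
        if (0x4E00 <= code <= 0x9FFF        # CJK unified ideographs
                or 0x3400 <= code <= 0x4DBF  # CJK extension A
                or 0x3000 <= code <= 0x303F  # CJK symbols and punctuation
                or 0xFF00 <= code <= 0xFFEF):  # full-width forms
            return True
    return False


def check_sample_field(entries):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    if len(entries) < MIN_SAMPLES:
        problems.append(
            f"only {len(entries)} entries; at least {MIN_SAMPLES} are required")
    for i, entry in enumerate(entries):
        if not MIN_LEN <= len(entry) <= MAX_LEN:
            problems.append(
                f"entry {i}: length {len(entry)} is outside {MIN_LEN}-{MAX_LEN}")
        if has_chinese(entry):
            problems.append(
                f"entry {i}: contains Chinese characters or punctuation")
    return problems


def select_training_entries(entries, seed=None):
    """Mimic the documented cap: use everything up to 10,000 entries, else sample 10,000."""
    if len(entries) <= MAX_SAMPLES:
        return list(entries)
    return random.Random(seed).sample(entries, MAX_SAMPLES)
```

For example, `check_sample_field(["ab", "EMP-000123"])` reports both the short entry and the insufficient sample size, which tells you that the field would not qualify as a training sample.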
Create a model
Go to Data Security Guard.
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Click the icon in the upper-left corner. Then, choose . On the page that appears, click Try Now to go to the Data Security Guard page.
Note: If your Alibaba Cloud account is granted the required permissions, you can directly access the homepage of Data Security Guard.
If your Alibaba Cloud account is not granted the required permissions, you are redirected to the authorization page of Data Security Guard. You can use the features of Data Security Guard only after your Alibaba Cloud account is granted the required permissions.
In the navigation pane on the left, choose to go to the Sensitive Data Detection page.
Create and train a model.
On the Self-generated Data Detection Model tab, click Create Model.
In the Create Model dialog box, configure Model Name and select training samples.
Positive Sample Field: Select sample fields for training from a specified workspace. DataWorks identifies the content features of these fields and generates a rule model. You can then use this rule model to find data in your data assets that has similar content features.
Note: Each sample field must contain at least 10 entries, and each entry must be 4 to 40 characters long. A model is trained on 10 to 10,000 entries. If the selected fields contain more than 10,000 entries in total, the system randomly selects 10,000 of them for training. Otherwise, the system uses all available entries.
DataWorks supports model training only for data that contains numbers, English letters, and special characters. Model training is not supported for sample fields that contain Chinese characters or Chinese punctuation.
Negative Sample Field: To improve model accuracy, you can select negative sample fields. The system uses the data from these fields as negative samples for training. If you do not select negative samples, the system automatically generates them based on the features and number of your positive samples.
Click Next.
Select I accept that Data Security Guard will use samples for model training and click Start Training.
For this training, the system randomly extracts up to 100 data entries from each selected sample field. The estimated time required for training depends on the number of sample fields.
Note: Model training can take a long time. You can close the training dialog box and perform other operations while the model is trained in the background.
View the model training results.
On the Self-generated Data Detection Model page, you can view the training status and results of the model. Based on the results, you can decide whether the model is ready to be published and used for data detection.

View the training status.
Remaining hh:mm:ss: The model is being trained.
Training Completed: The model training is complete. You can evaluate the training results to decide whether the model can be used for data detection.
Draft: The model is created but not yet trained. It cannot be used for data detection.
View the training results.
Click the icon in the Actions column of the trained model to view the model's accuracy in identifying the sample data. We recommend that you deploy this model to an online environment only when its accuracy reaches 100%.
Note: If the model's accuracy is less than 100%, the detection results may contain significant errors. If this occurs, add more sample data and retrain the model. Publish the model only after its accuracy reaches 100%.

Click Create to create the rule model.
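To make the 100% accuracy recommendation concrete, the sketch below shows one way to score a rule against positive and negative samples. It is an illustration under assumptions: the internal form of a DataWorks rule model is not documented here, so the rule is represented as a regular expression, and `sample_accuracy` and `ready_to_publish` are hypothetical helper names.

```python
import re


def sample_accuracy(pattern, positives, negatives):
    """Fraction of samples the rule classifies correctly.

    A positive sample counts as correct if the whole string matches the rule;
    a negative sample counts as correct if it does not match.
    """
    total = len(positives) + len(negatives)
    if total == 0:
        raise ValueError("no samples to evaluate")
    rule = re.compile(pattern)
    correct = sum(1 for s in positives if rule.fullmatch(s))
    correct += sum(1 for s in negatives if not rule.fullmatch(s))
    return correct / total


def ready_to_publish(accuracy):
    """Apply the recommendation: publish only at 100% accuracy."""
    return accuracy == 1.0


# Hypothetical rule and samples, for illustration only.
rule = r"EMP-\d{6}"
positives = ["EMP-000123", "EMP-987654"]
negatives = ["ORD-000123", "EMP-12"]
accuracy = sample_accuracy(rule, positives, negatives)
```

With these samples every entry is classified correctly, so `ready_to_publish(accuracy)` is true. If a positive sample such as `EMP-12A456` did not match, the accuracy would fall below 100%, which is the case where the topic advises adding more samples and retraining.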
What to do next
After you create a rule model, go to the Data Detection Rules page to publish the model and use it to detect data. For more information about using custom models on the Data Detection Rules page, see Configure data detection rules and run detection tasks.