Machine Learning Platform for AI (PAI) provides a text content moderation solution to help you identify high-risk content during content production for online business. This topic describes how to use AI algorithms to develop a content moderation model based on your business requirements. You can use the model to identify and block high-risk content.
Background information
User-generated content in online business, such as comments, blogs, and product introductions, is unrestricted in scope. Therefore, you must identify high-risk content and block it at the earliest opportunity. PAI provides the following AI algorithm-based solution to help you quickly identify high-risk content.
Solution
Label and manage the sample text by using iTAG and the dataset management feature of PAI.
In Machine Learning Designer, fine-tune the pre-trained Bidirectional Encoder Representations from Transformers (BERT) transfer learning model that is provided by PAI to develop a natural language processing (NLP) content moderation model that meets the requirements of your content moderation scenarios.
Deploy the model in Elastic Algorithm Service (EAS) as an end-to-end service to identify high-risk content.
Prerequisites
Pay-as-you-go resources of Machine Learning Designer, Data Science Workshop (DSW), and EAS are activated. For more information, see Purchase.
MaxCompute is activated to store the prediction data. For more information, see Activate MaxCompute and DataWorks.
An AI workspace is created and MaxCompute computing resources are added to the workspace. For more information, see Manage workspaces.
An Object Storage Service (OSS) bucket is created. The OSS bucket is used to store raw data, labels, and trained models. For more information, see Create buckets.
A dedicated resource group in EAS is created. In this example, the trained model is deployed in the dedicated resource group. For more information, see Create a resource group.
Procedure
To create a text content moderation solution based on PAI, perform the following steps:
Prepare data
Prepare a training dataset and a test dataset for model training and upload them to MaxCompute by running Tunnel commands on the MaxCompute client.
Create a text classification model
Develop a text classification model for text content moderation scenarios based on the NLP transfer learning model that is pre-trained by using large amounts of data in Machine Learning Designer.
Deploy and call the model service
Deploy the text content moderation model as an online service by using EAS and call the model service in production environments for inference.
Prepare data
Prepare a training dataset and a test dataset for model training and upload them to MaxCompute by running Tunnel commands on the MaxCompute client. For information about how to install and configure the MaxCompute client, see MaxCompute client (odpscmd). For more information about Tunnel commands, see Tunnel commands. In this example, the following commands are used:
# The command used to create tables.
CREATE TABLE nlp_risk_train(content STRING, qince_result STRING);
# The command used to upload data.
tunnel upload /Users/tongxin/xxx/nlp_risk_train.csv nlp_risk_train;
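The CSV file that you upload must match the two-column schema (content, qince_result) of the table that you created. The following Python sketch illustrates the expected file format; the file name is taken from the command above, but the rows and labels are hypothetical examples, and the label strings must match the values that you later configure in LabelEnumerateValues.
#!/usr/bin/env python
# encoding=utf-8
# A minimal sketch that writes a CSV file matching the nlp_risk_train table schema.
# The two rows below are hypothetical; replace them with your labeled samples.
import csv

rows = [
    ("This product works exactly as described. Very satisfied.", "Normal"),
    ("Example of a comment that should be flagged during moderation.", "Other Risks"),
]

with open("nlp_risk_train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    # Write data rows only; by default, the tunnel upload command treats every line as data.
    writer.writerows(rows)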
Create a text classification model
Go to the Machine Learning Designer page.
Log on to the Machine Learning Platform for AI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose Visualized Modeling (Designer) to go to the Machine Learning Designer page.
Select the pipeline template that is used for text classification.
On the Visualized Modeling (Designer) page, click the Preset Templates tab.
Click the NLP tab.
In the Text classification based on BERT model section, click Create.
Configure the required parameters to create a text classification pipeline.
In the Create Pipeline dialog box, configure the Pipeline Name, Description, Visibility, and Pipeline Data Path parameters, and then click OK to create the pipeline.
Go to the pipeline details page and configure the component parameters.
On the Visualized Modeling (Designer) page, click the Pipelines tab.
Select the pipeline that you created and click Open.
View the components of the pipeline on the canvas as shown in the following figure. The system automatically creates the pipeline based on the preset template.
Component
Description
①
Configure the training dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for model training. Example: nlp_risk_train.
②
Configure the evaluation dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for model evaluation. Example: nlp_risk_dev.
③
Configure the parameters for text classification training. For more information about how to configure the Text Classification Training component, see the Configure the text classification component section in this topic.
Note: This component supports only data of the BOOLEAN, BIGINT, DOUBLE, STRING, or DATETIME type.
④
Configure the prediction dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for prediction. Example: nlp_risk_dev.
⑤
Use the trained text classification model to make predictions based on the prediction dataset. For information about how to configure the Text Classification Prediction (MaxCompute) component, see the Configure the text classification prediction component section of this topic.
Table 1. Configure the text classification component
Tab
Parameter
Description
Example
Fields Setting
TextColumn
The column that stores the data for text classification in the input table.
content
LabelColumn
The label column in the input table.
qince_result
LabelEnumerateValues
The values of labels. Separate multiple values with commas (,).
Normal, Vulgar, Adult, Porn, Other Risks
Sample weight Column Name
Optional. The sample weight column that is used for data enhancement of specific samples.
N/A
ModelSavePath
The OSS path of the trained model.
oss://tongxin-lly.oss-cn-hangzhou-internal.aliyuncs.com/pai/text_spam_rb/
Parameters Setting
OptimizerType
The type of the optimizer. Valid values:
adam
adagrad
lamb
adam
batchSize
The number of samples that are processed at the same time. If the model is trained on multiple servers that use multiple GPUs, this parameter specifies the number of samples that are processed by each GPU at the same time. For example, if WorkerCount is 2, NumGPU is 2, and batchSize is 32, each training step processes a total of 128 samples.
32
sequenceLength
The maximum sequence length. Valid values: 1 to 512.
64
numEpochs
The number of epochs for model training.
1
LearningRate
The learning rate during model training.
1e-5
Model selection
The language model that is used for pre-training.
pai-bert-base-zh
AdditionalParameters
The user-defined parameters. You can specify the pre-trained model by setting pretrain_model_name_or_path. Example: pretrain_model_name_or_path=base-roberta. Valid values:
base-roberta
base-bert
tiny-roberta
tiny-bert
Sorted by model precision: base-roberta > base-bert > tiny-roberta > tiny-bert.
Sorted by model speed: base-roberta = base-bert < tiny-roberta = tiny-bert.
N/A
Tuning
WorkerCount
The number of distributed servers. Default value: 1.
1
NumCPU
The number of CPUs for each worker.
1
NumGPU
The number of GPUs for each worker.
1
DistributionStrategy
The distribution policy. Valid values:
MirroredStrategy (single-worker-multi-GPUs): one server with multiple CPUs or GPUs.
ExascaleStrategy (multi-worker-multi-GPUs): multiple servers with multiple CPUs or GPUs.
MirroredStrategy
Table 2. Configure the text classification prediction component
Tab
Parameter
Description
Example
Parameters Setting
First Text Column
The column that stores the data for text classification in the input table.
content
OutputSchema
The output column that stores the prediction result. Separate multiple columns with commas (,).
predictions,probabilities,logits
Prediction Threshold
The probability threshold for making predictions.
N/A
Append Columns
The column in the input table that you want to add to the output table. Separate multiple columns with commas (,).
content, qince_result
batchSize
The number of samples to be processed at the same time during the prediction process.
32
Use User-defined Model
If no upstream component exists, you can use a trained model that is stored in an OSS directory to make predictions. In this example, the model is trained by the upstream component. You do not need to configure this parameter.
No
Tuning
WorkerCount
The number of distributed servers. Default value: 1.
1
NumCPU
The number of CPUs for each worker.
1
NumGPU
The number of GPUs for each worker.
1
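After the pipeline runs, the Text Classification Prediction (MaxCompute) component writes the prediction columns that you configure in OutputSchema, together with the appended columns, to an output table. The following Python sketch shows one way to spot-check that output. It assumes the PyODPS SDK, a hypothetical output table named nlp_risk_pred, and placeholder credentials; adjust them to match your pipeline and MaxCompute project.
#!/usr/bin/env python
# encoding=utf-8
# A minimal sketch for inspecting prediction results with PyODPS.
# Assumption: the pipeline writes its predictions to a table named nlp_risk_pred.
from odps import ODPS

# Placeholder credentials and project information; replace with your own.
o = ODPS("<access-key-id>", "<access-key-secret>", project="<project-name>", endpoint="<maxcompute-endpoint>")

sql = "SELECT content, qince_result, predictions FROM nlp_risk_pred LIMIT 20;"
with o.execute_sql(sql).open_reader() as reader:
    for record in reader:
        print(record["content"], record["qince_result"], record["predictions"])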
Deploy and call the model service
You can deploy the trained content moderation model as an online service by using EAS and call the service in the production environment for inference.
Go to the EAS-Online Model Services page.
Log on to the Machine Learning Platform for AI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.
Deploy the model service.
On the EAS-Online Model Services page, click Deploy Service.
On the Deploy Service page, configure the parameters and click Deploy. The following table describes only the key parameters. For information about other parameters, see the "Upload and deploy models in the console" section in Model service deployment by using the PAI console.
Parameter
Description
Service Name
The name of the model service. To differentiate model services, we recommend that you specify a name based on your business requirements.
Deployment Method
Select Deploy Service by Using Model and Processor.
Processor Type
Select EasyTransfer(CPU).
Model Type
Select Text Classification.
Model Files
Select Mount OSS Path. In this example, the trained model is stored in OSS.
Select the model folder in the deployment folder under the path that is specified by the ModelSavePath parameter of the Text Classification Training component. The model folder contains the following files and folders: variables, config.json, saved_model.pb, vocab.txt, and label_mapping.json. In this example, the model directory shown in the following figure is used. You must select the path of the /deployment/ folder as the OSS path.
Resource Deployment Information
Specify the resource information based on the service that you purchased and the processor type that you selected.
Debug the model service.
On the EAS-Online Model Services page, find the service that you want to debug and click Online Debugging in the Actions column.
In the Request Parameter Online Tuning section of the debugging page, enter the following content in the text editor below Request Body.
{"id": "113","first_sequence": "拥抱编辑3.0时代! 内容升级为产品海内外媒体规划下个十年。","sequence_length": 128}
Click Send Request and view the prediction result in the Debugging Info section, as shown in the following figure.
View the public endpoint and token that are used to access a specific model service.
On the EAS-Online Model Services page, find the model service that you want to access and click Invocation Method in the Service Type column.
In the Invocation Method dialog box, click the Public Endpoint tab to view the public endpoint and Token that are used to access the model service.
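Before you set up a recurring test script, you can verify the endpoint and token with a single HTTP call. The following Python sketch is a minimal check that assumes the requests library; the URL and token are placeholders for the values shown on the Public Endpoint tab, and the request body reuses the sample from the debugging step.
#!/usr/bin/env python
# encoding=utf-8
# A minimal connectivity check against the deployed service.
# Assumption: the requests library is installed; URL and token are copied from the console.
import requests

url = "<public-endpoint-of-the-service>"    # Placeholder: value from the Public Endpoint tab.
token = "<token-of-the-service>"            # Placeholder: value from the Public Endpoint tab.

body = '[{"id": "113","first_sequence": "拥抱编辑3.0时代! 内容升级为产品海内外媒体规划下个十年。","sequence_length": 128}]'

# EAS expects the service token in the Authorization header.
response = requests.post(url, headers={"Authorization": token}, data=body.encode("utf-8"))
print(response.status_code)
print(response.text)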
Use a script to periodically call the model service.
Create a Python script named eas_nlp_risk.py to call the model service.
#!/usr/bin/env python
# encoding=utf-8
from eas_prediction import PredictClient
from eas_prediction import StringRequest

if __name__ == '__main__':
    # Specify the public endpoint that is used to call the model service.
    client = PredictClient('http://1664xxxxxxxxxxx.cn-hangzhou.pai-eas.aliyuncs.com', 'nlp_risk_cls002')
    # Specify the token that you want to use.
    client.set_token('MTgxNjE1NGVmMDdjNDRkY2Q5NWE4xxxxxxxxxxxxxxxxxxxxxxx')
    client.init()
    # Construct the request based on the model that you want to use. In this example, the input and output are of the STRING type.
    request = StringRequest('[{"id": "110","first_sequence": "想赢勇士想到发疯? 格林新发现吓呆众人","sequence_length": 128},{"id": "112","first_sequence": "骗人的,千万别买,谁买谁后悔?商家就是欺诈。 垃圾商家。 买了之后想退货门都没有,以各种手段不退货。 买者慎重。","sequence_length": 128},{"id": "113","first_sequence": "很不错的,正品,很给力,男性同胞的福音,改善的效果特别的好,效果真的是不错的。 是能增大2cm","sequence_length": 128}]')
    for x in range(0, 50000):
        resp = client.predict(request)
        # print(str(resp.response_data, 'utf8'))
    print("test ending")
Upload the eas_nlp_risk.py Python script to your client and run the following command to call the model service in the directory where the script is stored:
python3 <eas_nlp_risk.py>
Replace <eas_nlp_risk.py> with the name of the Python script that you want to use.
View service metrics.
After you call the model service, you can view the service metrics, such as the queries per second (QPS), response time (RT), CPU utilization, GPU utilization, and memory usage.
On the EAS-Online Model Services page, click the monitoring icon in the Service Monitoring column of the service that you called.
On the Service Monitoring tab, you can view the metrics of the service. The metrics of your service may vary based on the actual business scenario.