Machine Learning Platform for AI (PAI) provides a text content moderation solution to help you identify high-risk content during content production for online business. This topic describes how to use AI algorithms to develop a content moderation model based on your business requirements. You can use the model to identify and block high-risk content.
Background information
User-generated content in online business, such as comments, blogs, and product introductions, is unrestricted in scope. Therefore, you must identify high-risk content and block it at the earliest opportunity. PAI provides the following AI algorithm-based solution to help you quickly identify high-risk content.
Solution
Label and manage the sample text by using iTAG and the dataset management feature of PAI.
In Machine Learning Designer, fine-tune the pre-trained Bidirectional Encoder Representations from Transformers (BERT) transfer learning model that is provided by PAI to develop a natural language processing (NLP) content moderation model that meets the requirements of your content moderation scenarios.
Deploy the model in Elastic Algorithm Service (EAS) as an end-to-end service to identify high-risk content.
Prerequisites
Pay-as-you-go resources of Machine Learning Designer, Data Science Workshop (DSW), and EAS are activated. For more information, see Purchase.
MaxCompute is activated to store the prediction data. For more information, see Activate MaxCompute and DataWorks.
An AI workspace is created and MaxCompute computing resources are added to the workspace. For more information, see Manage workspaces.
An Object Storage Service (OSS) bucket is created. The OSS bucket is used to store raw data, labels, and trained models. For more information, see Create buckets.
A dedicated resource group in EAS is created. In this example, the trained model is deployed in the dedicated resource group. For more information, see Create a resource group.
Procedure
To create a text content moderation solution based on PAI, perform the following steps:
Prepare data
Prepare a training dataset and a test dataset for model training and upload them to MaxCompute by running Tunnel commands on the MaxCompute client.
Create a text classification model
Develop a text classification model for text content moderation scenarios based on the NLP transfer learning model that is pre-trained by using large amounts of data in Machine Learning Designer.
Deploy and call the model service
Deploy the text content moderation model as an online service by using EAS and call the model service in production environments for inference.
Prepare data
Prepare a training dataset and a test dataset for model training and upload them to MaxCompute by running Tunnel commands on the MaxCompute client. For information about how to install and configure the MaxCompute client, see MaxCompute client (odpscmd). For more information about Tunnel commands, see Tunnel commands. In this example, the following commands are used:
# The command used to create tables.
CREATE TABLE nlp_risk_train(content STRING, qince_result STRING);
# The command used to upload data.
tunnel upload /Users/tongxin/xxx/nlp_risk_train.csv nlp_risk_train;
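The CSV file that you upload must match the two-column schema (content, qince_result) of the table that you created. The following Python sketch illustrates the expected file format; the file name is taken from the command above, but the rows and labels are hypothetical examples, and the label strings must match the values that you later configure in LabelEnumerateValues.
#!/usr/bin/env python
# encoding=utf-8
# A minimal sketch that writes a CSV file matching the nlp_risk_train table schema.
# The two rows below are hypothetical; replace them with your labeled samples.
import csv

rows = [
    ("This product works exactly as described. Very satisfied.", "Normal"),
    ("Example of a comment that should be flagged during moderation.", "Other Risks"),
]

with open("nlp_risk_train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    # Write data rows only; by default, the tunnel upload command treats every line as data.
    writer.writerows(rows)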
Create a text classification model
Go to the Machine Learning Designer page.
Log on to the Machine Learning Platform for AI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose Visualized Modeling (Designer) to go to the Machine Learning Designer page.
Select the pipeline template that is used for text classification.
On the Visualized Modeling (Designer) page, click the Preset Templates tab.
Click the NLP tab.
In the Text classification based on BERT model section, click Create.
Configure the required parameters to create a text classification pipeline.
In the Create Pipeline dialog box, configure the Pipeline Name, Description, Visibility, and Pipeline Data Path parameters, and then click OK to create the pipeline.
Go to the pipeline details page and configure the component parameters.
On the Visualized Modeling (Designer) page, click the Pipelines tab.
Select the pipeline that you created and click Open.
View the components of the pipeline on the canvas as shown in the following figure. The system automatically creates the pipeline based on the preset template.
Component
Description
①
Configure the training dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for model training. Example: nlp_risk_train.
②
Configure the evaluation dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for model evaluation. Example: nlp_risk_dev.
③
Configure the parameters for text classification training. For more information about how to configure the Text Classification Training component, see the Configure the text classification component section in this topic.
Note: This component supports only data of the BOOLEAN, BIGINT, DOUBLE, STRING, or DATETIME type.
④
Configure the prediction dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for prediction. Example: nlp_risk_dev.
⑤
Use the trained text classification model to make predictions based on the prediction dataset. For information about how to configure the Text Classification Prediction (MaxCompute) component, see the Configure the text classification prediction component section of this topic.
Table 1. Configure the text classification component
Tab
Parameter
Description
Example
Fields Setting
TextColumn
The column that stores the data for text classification in the input table.
content
LabelColumn
The label column in the input table.
qince_result
LabelEnumerateValues
The values of labels. Separate multiple values with commas (,).
Normal, Vulgar, Adult, Porn, Other Risks
Sample weight Column Name
Optional. The sample weight column that is used for data enhancement of specific samples.
N/A
ModelSavePath
The OSS path of the trained model.
oss://tongxin-lly.oss-cn-hangzhou-internal.aliyuncs.com/pai/text_spam_rb/
Parameters Setting
OptimizerType
The type of the optimizer. Valid values:
adam
adagrad
lamb
adam
batchSize
The number of samples that are processed at the same time. If the model is trained on multiple servers that use multiple GPUs, this parameter specifies the number of samples that are processed by each GPU at the same time. For example, if WorkerCount is 2, NumGPU is 2, and batchSize is 32, each training step processes a total of 128 samples.
32
sequenceLength
The maximum sequence length. Valid values: 1 to 512.
64
numEpochs
The number of epochs for model training.
1
LearningRate
The learning rate during model training.
1e-5
Model selection
The language model that is used for pre-training.
pai-bert-base-zh
AdditionalParameters
The user-defined parameters. You can specify the pre-trained model by setting pretrain_model_name_or_path. Example: pretrain_model_name_or_path=base-roberta. Valid values:
base-roberta
base-bert
tiny-roberta
tiny-bert
Sorted by model precision: base-roberta > base-bert > tiny-roberta > tiny-bert.
Sorted by model speed: base-roberta = base-bert < tiny-roberta = tiny-bert.
N/A
Tuning
WorkerCount
The number of distributed servers. Default value: 1.
1
NumCPU
The number of CPUs for each worker.
1
NumGPU
The number of GPUs for each worker.
1
DistributionStrategy
The distribution policy. Valid values:
MirroredStrategy (single-worker-multi-GPUs): one server with multiple CPUs or GPUs.
ExascaleStrategy (multi-worker-multi-GPUs): multiple servers with multiple CPUs or GPUs.
MirroredStrategy
Table 2. Configure the text classification prediction component
Tab
Parameter
Description
Example
Parameters Setting
First Text Column
The column that stores the data for text classification in the input table.
content
OutputSchema
The output column that stores the prediction result. Separate multiple columns with commas (,).
predictions,probabilities,logits
Prediction Threshold
The probability threshold for making predictions.
N/A
Append Columns
The column in the input table that you want to add to the output table. Separate multiple columns with commas (,).
content, qince_result
batchSize
The number of samples to be processed at the same time during the prediction process.
32
Use User-defined Model
If no upstream component exists, you can use a trained model that is stored in an OSS directory to make predictions. In this example, the model is trained by the upstream component. You do not need to configure this parameter.
No
Tuning
WorkerCount
The number of distributed servers. Default value: 1.
1
NumCPU
The number of CPUs for each worker.
1
NumGPU
The number of GPUs for each worker.
1
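After the pipeline runs, the Text Classification Prediction (MaxCompute) component writes the prediction columns that you configure in OutputSchema, together with the appended columns, to an output table. The following Python sketch shows one way to spot-check that output. It assumes the PyODPS SDK, a hypothetical output table named nlp_risk_pred, and placeholder credentials; adjust them to match your pipeline and MaxCompute project.
#!/usr/bin/env python
# encoding=utf-8
# A minimal sketch for inspecting prediction results with PyODPS.
# Assumption: the pipeline writes its predictions to a table named nlp_risk_pred.
from odps import ODPS

# Placeholder credentials and project information; replace with your own.
o = ODPS("<access-key-id>", "<access-key-secret>", project="<project-name>", endpoint="<maxcompute-endpoint>")

sql = "SELECT content, qince_result, predictions FROM nlp_risk_pred LIMIT 20;"
with o.execute_sql(sql).open_reader() as reader:
    for record in reader:
        print(record["content"], record["qince_result"], record["predictions"])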
Deploy and call the model service
You can deploy the trained content moderation model as an online service by using EAS and call the service in the production environment for inference.
Go to the EAS-Online Model Services page.
Log on to the Machine Learning Platform for AI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.
Deploy the model service.
On the EAS-Online Model Services page, click Deploy Service.
On the Deploy Service page, configure the parameters and click Deploy. The following table describes only the key parameters. For information about other parameters, see the "Upload and deploy models in the console" section in Model service deployment by using the PAI console.
Parameter
Description
Service Name
The name of the model service. To differentiate model services, we recommend that you specify a name based on your business requirements.
Deployment Method
Select Deploy Service by Using Model and Processor.
Processor Type
Select EasyTransfer(CPU).
Model Type
Select Text Classification.
Model Files
Select Mount OSS Path. In this example, the trained model is stored in OSS.
Select the model folder in the deployment folder under the path that is specified by the ModelSavePath parameter of the Text Classification Training component. The model folder contains the following files and folders: variables, config.json, saved_model.pb, vocab.txt, and label_mapping.json. In this example, the model directory shown in the following figure is used. You must select the path of the /deployment/ folder as the OSS path.
Resource Deployment Information
Specify the resource information based on the service that you purchased and the processor type that you selected.
Debug the model service.
On the EAS-Online Model Services page, find the service that you want to debug and click Online Debugging in the Actions column.
In the Request Parameter Online Tuning section of the debugging page, enter the following content in the text editor below Request Body.
{"id": "113","first_sequence": "拥抱编辑3.0时代! 内容升级为产品海内外媒体规划下个十年。","sequence_length": 128}
Click Send Request and view the prediction result in the Debugging Info section, as shown in the following figure.
View the public endpoint and token that are used to access a specific model service.
On the EAS-Online Model Services page, find the model service that you want to access and click Invocation Method in the Service Type column.
In the Invocation Method dialog box, click the Public Endpoint tab to view the public endpoint and Token that are used to access the model service.
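Before you set up a recurring test script, you can verify the endpoint and token with a single HTTP call. The following Python sketch is a minimal check that assumes the requests library; the URL and token are placeholders for the values shown on the Public Endpoint tab, and the request body reuses the sample from the debugging step.
#!/usr/bin/env python
# encoding=utf-8
# A minimal connectivity check against the deployed service.
# Assumption: the requests library is installed; URL and token are copied from the console.
import requests

url = "<public-endpoint-of-the-service>"    # Placeholder: value from the Public Endpoint tab.
token = "<token-of-the-service>"            # Placeholder: value from the Public Endpoint tab.

body = '[{"id": "113","first_sequence": "拥抱编辑3.0时代! 内容升级为产品海内外媒体规划下个十年。","sequence_length": 128}]'

# EAS expects the service token in the Authorization header.
response = requests.post(url, headers={"Authorization": token}, data=body.encode("utf-8"))
print(response.status_code)
print(response.text)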
Use a script to periodically call the model service.
Create a Python script named eas_nlp_risk.py to call the model service.
#!/usr/bin/env python
# encoding=utf-8
from eas_prediction import PredictClient
from eas_prediction import StringRequest

if __name__ == '__main__':
    # Specify the public endpoint that is used to call the model service.
    client = PredictClient('http://1664xxxxxxxxxxx.cn-hangzhou.pai-eas.aliyuncs.com', 'nlp_risk_cls002')
    # Specify the token that you want to use.
    client.set_token('MTgxNjE1NGVmMDdjNDRkY2Q5NWE4xxxxxxxxxxxxxxxxxxxxxxx')
    client.init()
    # Construct the request based on the model that you want to use. In this example, the input and output are of the STRING type.
    request = StringRequest('[{"id": "110","first_sequence": "想赢勇士想到发疯? 格林新发现吓呆众人","sequence_length": 128},{"id": "112","first_sequence": "骗人的,千万别买,谁买谁后悔?商家就是欺诈。 垃圾商家。 买了之后想退货门都没有,以各种手段不退货。 买者慎重。","sequence_length": 128},{"id": "113","first_sequence": "很不错的,正品,很给力,男性同胞的福音,改善的效果特别的好,效果真的是不错的。 是能增大2cm","sequence_length": 128}]')
    for x in range(0, 50000):
        resp = client.predict(request)
        # print(str(resp.response_data, 'utf8'))
    print("test ending")
Upload the eas_nlp_risk.py Python script to your client and run the following command to call the model service in the directory where the script is stored:
python3 <eas_nlp_risk.py>
Replace <eas_nlp_risk.py> with the name of the Python script that you want to use.
View service metrics.
After you call the model service, you can view the service metrics, such as the queries per second (QPS), response time (RT), CPU utilization, GPU utilization, and memory usage.
On the EAS-Online Model Services page, click the monitoring icon in the Service Monitoring column of the service that you called.
On the Service Monitoring tab, you can view the metrics of the service. The metrics of your service may vary based on the actual business scenario.