
Platform for AI: Text content moderation solution

Last Updated: Feb 26, 2024

Platform for AI (PAI) provides a text content moderation solution to help you identify high-risk content during content production for online business. This topic describes how to use AI algorithms to develop a content moderation model based on your business requirements. You can use the model to identify and block high-risk content.

Background information

In content creation scenarios, such as writing comments, blogs, and product introductions, no restrictions are imposed on the scope of the produced content. Therefore, you need to detect and filter high-risk content at the earliest opportunity. PAI provides the following solution that leverages AI algorithms to help you identify high-risk content.

  • Solution

    1. Label the raw text and export the results to datasets by using iTAG, and manage the datasets on PAI.

    2. Use the prepared datasets to train the Bidirectional Encoder Representations from Transformers (BERT) model that is provided by PAI, and fine-tune the pre-trained model in Machine Learning Designer to meet your content moderation requirements.

    3. Deploy the fine-tuned model in Elastic Algorithm Service (EAS) as an end-to-end service to automatically identify high-risk content.

  • Architecture

    The following figure shows the architecture of the text content moderation solution.

    (Figure: solution architecture)

Prerequisites

  • The pay-as-you-go resources of Machine Learning Designer and EAS are activated. For more information, see Purchase.

  • MaxCompute is activated to store the prediction data. For more information, see Activate MaxCompute and DataWorks.

  • An AI workspace is created and MaxCompute computing resources are added to the workspace. For more information, see Manage workspaces.

  • An Object Storage Service (OSS) bucket is created to store raw data, labels, and trained models. For more information, see Create buckets.

Procedure

To create a text content moderation solution based on PAI, perform the following steps:

  1. Prepare data

    Prepare a training dataset and a test dataset for model training and upload the datasets to MaxCompute by running Tunnel commands on the MaxCompute client.

  2. Build a text classification model

    Build a text classification model for your text content moderation scenario by using Machine Learning Designer. The model is obtained by applying transfer learning to fine-tune a pre-trained BERT model.

  3. Deploy and call the model service

    Deploy your text content moderation model as an online service by using EAS. After deployment, you can call the model service in production environments for inference tasks.

Prepare data

Prepare a training dataset and a test dataset for model training, and upload the datasets to MaxCompute by running Tunnel commands on the MaxCompute client. For information about how to install and configure the MaxCompute client, see MaxCompute client (odpscmd). For information about Tunnel commands, see Tunnel commands. In this example, the following commands are used:

# Create the table for the training data. 
CREATE TABLE nlp_risk_train(content STRING, qince_result STRING);
# Upload the training data. 
tunnel upload /Users/xxx/xxx/nlp_risk_train.csv nlp_risk_train;
# Create the table for the evaluation data. 
CREATE TABLE nlp_risk_dev(content STRING, qince_result STRING);
# Upload the evaluation data. 
tunnel upload /Users/xxx/xxx/nlp_risk_dev.csv nlp_risk_dev;
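
Before you upload the files, you can optionally check that each CSV file matches the two-column schema of the tables that you created. The following sketch is a hypothetical check that assumes Python with pandas is installed and that the CSV file has no header row (the Tunnel default); the local path is a placeholder:

# Sanity-check a local CSV file against the (content, qince_result) schema
# before running the tunnel upload command. The path is a placeholder.
import pandas as pd

df = pd.read_csv("/Users/xxx/xxx/nlp_risk_train.csv",
                 names=["content", "qince_result"], header=None)

# Every row needs text content and a label.
assert df["content"].notna().all()
assert df["qince_result"].notna().all()
# Print the label distribution, for example: Normal, Vulgar, Adult, ...
print(df["qince_result"].value_counts())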

Build a text classification model

  1. Go to the Machine Learning Designer page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer) to go to the Machine Learning Designer page.

  2. Select the pipeline template that is used for text classification.

    1. On the Visualized Modeling (Designer) page, click the Preset Templates tab.

    2. Click the NLP tab.

    3. In the Text classification based on BERT model section, click Create.

  3. Configure the required parameters to create a text classification pipeline.

    In the Create Pipeline dialog box, configure the Pipeline Name, Description, Visibility, and Pipeline Data Path parameters, and then click OK to create the pipeline.

  4. Go to the pipeline details page and configure the component parameters.

    1. On the Visualized Modeling (Designer) page, click the Pipelines tab.

    2. Select the pipeline that you created and click Open.

    3. View the components of the pipeline on the canvas. The system automatically creates the pipeline based on the preset template. The following list describes the numbered components:

      • Component 1: Configure the training dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for model training. Example: nlp_risk_train.

      • Component 2: Configure the evaluation dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for model evaluation. Example: nlp_risk_dev.

      • Component 3: Configure the parameters for text classification training. For information about how to configure the Text Classification Training component, see Table 1. Configure the text classification component in this topic.

        Note: This component supports only data of the BOOLEAN, BIGINT, DOUBLE, STRING, and DATETIME types.

      • Component 4: Configure the prediction dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for prediction. Example: nlp_risk_dev.

      • Component 5: Apply the trained text classification model to the prediction dataset. For information about how to configure the Text Classification Prediction (MaxCompute) component, see the Configure the text classification prediction component section of this topic.

      Table 1. Configure the text classification component

      Field Setting tab:

      • TextColumn: The column that stores the data for text classification in the input table. Example: content.

      • LabelColumn: The label column in the input table. Example: qince_result.

      • LabelEnumerateValues: The values of the labels. Separate multiple values with commas (,). Example: Normal, Vulgar, Adult, Porn, Other Risks.

      • Sample weight Column Name: Optional. The column that is used to enhance the data of specific samples. Example: N/A.

      • ModelSavePath: The OSS path in which the trained model is stored. Example: oss://tongxin-lly.oss-cn-hangzhou-internal.aliyuncs.com/pai/text_spam_rb/

      Parameters Setting tab:

      • OptimizerType: The type of the optimizer. Valid values: adam, adagrad, and lamb. Example: adam.

      • batchSize: The number of samples that are processed at the same time in a model training task. If the model is trained on multiple servers with multiple GPUs, this parameter specifies the number of samples that each GPU processes at the same time. Example: 32. A worked example of the resulting global batch size follows this table.

      • sequenceLength: The maximum length of a sequence. Valid values: 1 to 512. Example: 64.

      • numEpochs: The number of epochs for model training. Example: 1.

      • LearningRate: The learning rate during model training. Example: 1e-5.

      • Model selection: The language model that is used for pre-training. Example: pai-bert-base-zh.

      • AdditionalParameters: The user-defined parameters. You can specify pretrain_model_name_or_path to select a pre-trained model. Valid values: base-roberta, base-bert, tiny-roberta, and tiny-bert. Sorted by model precision: base-roberta > base-bert > tiny-roberta > tiny-bert. Sorted by model speed: base-roberta = base-bert < tiny-roberta = tiny-bert. Example: N/A.

      Tuning tab:

      • WorkerCount: The number of distributed servers. Default value: 1. Example: 1.

      • NumCPU: The number of CPUs for each worker node. Example: 1.

      • NumGPU: The number of GPUs for each worker node. Example: 1.

      • DistributionStrategy: The distribution policy. Valid values: MirroredStrategy (one server with multiple GPUs) and ExascaleStrategy (multiple servers with multiple GPUs). Example: MirroredStrategy.
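
      To make the batchSize semantics concrete, the following back-of-the-envelope sketch (a hypothetical Python helper; the sample count of 100,000 is an assumption, not from this topic) computes the number of training steps per epoch from the per-GPU batch size and the Tuning-tab values:

      import math

      # Under the per-GPU interpretation of batchSize described in Table 1,
      # the effective global batch is batchSize * NumGPU * WorkerCount.
      def steps_per_epoch(num_samples, batch_size, num_gpu, worker_count):
          global_batch = batch_size * num_gpu * worker_count
          return math.ceil(num_samples / global_batch)

      # Example values from Table 1: batchSize=32, NumGPU=1, WorkerCount=1.
      print(steps_per_epoch(100000, 32, 1, 1))  # 3125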

      Table 2. Configure the text classification prediction component

      Parameters Setting tab:

      • First Text Column: The column that stores the data for text classification in the input table. Example: content.

      • OutputSchema: The output columns that store the prediction results. Separate multiple columns with commas (,). Example: predictions,probabilities,logits. A sketch of how these columns can be post-processed follows this table.

      • Prediction Threshold: The probability threshold for making predictions. Example: N/A.

      • Append Columns: The columns in the input table that you want to add to the output table. Separate multiple columns with commas (,). Example: content, qince_result.

      • batchSize: The number of samples that are processed at the same time in a model prediction task. Example: 32.

      • Use User-defined Model: If no upstream component exists, you can use a trained model that is stored in an OSS directory to make predictions. In this example, the model is trained by the upstream component, so you do not need to configure this parameter. Example: No.

      Tuning tab:

      • WorkerCount: The number of distributed servers. Default value: 1. Example: 1.

      • NumCPU: The number of CPUs for each worker node. Example: 1.

      • NumGPU: The number of GPUs for each worker node. Example: 1.
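
      The OutputSchema columns (predictions, probabilities, logits) can be post-processed downstream of the pipeline. The following sketch is a hypothetical example that assumes probabilities arrives as a comma-separated string aligned with the label order in LabelEnumerateValues; verify the actual format against your output table:

      # Flag a prediction row as high-risk if any non-Normal label crosses
      # a threshold. Label order and probability format are assumptions.
      LABELS = ["Normal", "Vulgar", "Adult", "Porn", "Other Risks"]
      THRESHOLD = 0.5  # hypothetical threshold

      def flag_high_risk(probabilities):
          probs = [float(p) for p in probabilities.split(",")]
          return [label for label, p in zip(LABELS, probs)
                  if label != "Normal" and p >= THRESHOLD]

      print(flag_high_risk("0.10,0.70,0.10,0.05,0.05"))  # ['Vulgar']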

Deploy and call the model service

You can deploy the trained content moderation model as an online service by using EAS, and call the service in production environments to perform inference tasks.

  1. Go to the EAS-Online Model Services page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.

  2. Deploy the model service.

    1. On the EAS-Online Model Services page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.

    2. On the Deploy Service page, configure the parameters and click Deploy. The following table describes the key parameters. For information about other parameters, see the "Upload and deploy models in the console" section in the Model service deployment by using the PAI console topic.

      • Service Name: The name of the model service. We recommend that you specify a name that helps you identify the service.

      • Deployment Method: Select Deploy Service by Using Model and Processor.

      • Processor Type: Select EasyTransfer (CPU).

      • Model Type: Select Text Classification.

      • Model Files: Select Mount OSS Path. In this example, the trained model is stored in OSS. Set the OSS path to the /deployment/ folder under the path that is specified by the ModelSavePath parameter of the Text Classification Training component. The folder contains the following files and folders: variables, config.json, saved_model.pb, vocab.txt, and label_mapping.json.

      • Resource Deployment Information: Specify the resource information based on the service that you purchased and the processor type that you selected.

  3. Debug the model service.

    1. On the EAS-Online Model Services page, find the service that you want to debug and click Online Debugging in the Actions column.

    2. In the Request Parameter Online Tuning section of the debugging page, enter the following content in the text editor below Request Body.

      {"id": "113","first_sequence": "Embrace the Era of Editing 3.0! Elevating Product Content for the Next Decade's Media Strategy. ","sequence_length": 128} 
    3. Click Send Request and view the prediction result in the Debugging Info section.
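
      You can also reproduce this debugging call outside the console. The following sketch assumes the common EAS convention of sending an HTTP POST request with the service token in the Authorization header; the endpoint, service name, and token below are placeholders that you can copy from the Invocation Method dialog box described in the next step:

      import requests

      # Placeholder endpoint, service name, and token.
      url = "http://1664xxxxxxxxxxx.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/nlp_risk_cls002"
      headers = {"Authorization": "MTgxNjE1NGVmMDdjNDRkY2Q5NWE4xxxxxxxxxxxxxxxxxxxxxxx"}
      body = '{"id": "113","first_sequence": "<your text>","sequence_length": 128}'

      # Send the same request body that is used in the console debugging step.
      resp = requests.post(url, headers=headers, data=body.encode("utf-8"))
      print(resp.status_code, resp.text)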

  4. View the public endpoint and token that are used to access a specific model service.

    1. On the EAS-Online Model Services page, find the model service that you want to access and click Invocation Method in the Service Type column.

    2. In the Invocation Method dialog box, click the Public Endpoint tab to view the public endpoint and token that are used to access the model service.

  5. Use a script to periodically call the model service.

    1. Create a Python script named eas_nlp_risk.py to call the model service.

      #!/usr/bin/env python
      # encoding=utf-8
      from eas_prediction import PredictClient
      from eas_prediction import StringRequest

      if __name__ == '__main__':
          # Specify the public endpoint that is used to call the model service. 
          client = PredictClient('http://1664xxxxxxxxxxx.cn-hangzhou.pai-eas.aliyuncs.com', 'nlp_risk_cls002')
          # Specify the token that is used to access the model service. 
          client.set_token('MTgxNjE1NGVmMDdjNDRkY2Q5NWE4xxxxxxxxxxxxxxxxxxxxxxx')
          client.init()
          # Construct the request based on the model that you want to use. In this example, the input and output are of the STRING type. 
          request = StringRequest('[{"id": "110","first_sequence": "Desperate to Beat the Warriors? Green\'s Startling Discovery Leaves Everyone Speechless","sequence_length": 128},{"id": "112","first_sequence": "Do not fall for this, absolutely do not buy or you\'ll regret it! The seller is a complete scam. The seller is committing fraud. Terrible seller. Once you\'ve bought it, forget about returning it-they\'ll use every trick in the book to avoid giving refunds. Buyers, beware.","sequence_length": 128}]')
          # Call the service repeatedly to simulate production traffic. Uncomment the print statement to inspect each response.
          for x in range(0, 50000):
              resp = client.predict(request)
              # print(str(resp.response_data, 'utf8'))
          print("test ending")
      
    2. Upload the eas_nlp_risk.py Python script to your client and run the following command to call the model service in the directory where the script is stored:

      python3 <eas_nlp_risk.py>

      Replace <eas_nlp_risk.py> with the name of the Python script that you want to use.
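
      The script above discards each response. To inspect predictions, you can decode the response bytes with a small helper such as the following. This is a sketch that assumes the service returns a JSON array whose items carry the OutputSchema fields, such as predictions; confirm the exact schema by using the online debugging step:

      import json

      def summarize_response(response_data):
          # Decode an EAS response and print the id and predicted label of each
          # item. The field names are assumptions based on OutputSchema.
          for item in json.loads(response_data.decode('utf-8')):
              print(item.get('id'), item.get('predictions'))

      # Usage inside the loop above: summarize_response(resp.response_data)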

  6. View service metrics.

    After you call the model service, you can view the service metrics, such as the queries per second (QPS), response time (RT), CPU utilization, GPU utilization, and memory usage.

    1. On the EAS-Online Model Services page, click the Service Monitoring icon in the Service Monitoring column of the service that you called.

    2. On the Service Monitoring tab, you can view the metrics of the service. The metrics of your service may vary based on the actual business scenario.


References

  • Machine Learning Designer provides preset templates from which you can create pipelines. After you create a pipeline from a template, you can modify specific components or component settings of a pipeline to build models. For more information, see Create a pipeline from a preset template.

  • After you deploy a model as an online service in EAS, you can use multiple methods to perform model inference. For more information, see Methods for calling services.

  • For more information about EAS, see EAS overview.

  • For more information about Machine Learning Designer, see Overview of Machine Learning Designer.