Machine Learning Platform for AI: Text content moderation solution

Last Updated: Sep 11, 2023

Machine Learning Platform for AI (PAI) provides a text content moderation solution to help you identify high-risk content during content production for online business. This topic describes how to use AI algorithms to develop a content moderation model based on your business requirements. You can use the model to identify and block high-risk content.

Background information

Generated content, such as comments, blogs, and product introductions, is open-ended, and no limits are imposed on its scope. Therefore, you must identify high-risk content and block it at the earliest opportunity. PAI provides the following AI-based solution to help you quickly identify high-risk content.

  • Solution

    1. Label the sample text and manage the sample text by using iTAG and the dataset management feature of PAI.

    2. In Machine Learning Designer, fine-tune the pre-trained Bidirectional Encoder Representations from Transformers (BERT) transfer learning model that is provided by PAI to meet the requirements of content moderation scenarios, and develop a natural language processing (NLP) content moderation model for your specific scenario.

    3. Deploy the model in Elastic Algorithm Service (EAS) as an end-to-end service to identify high-risk content.

Prerequisites

  • Pay-as-you-go resources of Machine Learning Designer, Data Science Workshop (DSW), and EAS are activated. For more information, see Purchase.

  • MaxCompute is activated to store the prediction data. For more information, see Activate MaxCompute and DataWorks.

  • An AI workspace is created and MaxCompute computing resources are added to the workspace. For more information, see Manage workspaces.

  • An Object Storage Service (OSS) bucket is created. The OSS bucket is used to store raw data, labels, and trained models. For more information, see Create buckets.

  • A dedicated resource group in EAS is created. In this example, the trained model is deployed in the dedicated resource group. For more information, see Create a resource group.

Procedure

To create a text content moderation solution based on PAI, perform the following steps:

  1. Prepare data

    Prepare a training dataset and a test dataset for model training and upload the training dataset and test dataset to MaxCompute by running Tunnel commands on the MaxCompute client.

  2. Create a text classification model

    In Machine Learning Designer, develop a text classification model for content moderation scenarios based on an NLP transfer learning model that is pre-trained on large amounts of data.

  3. Deploy and call the model service

    Deploy the text content moderation model as an online service by using EAS and call the model service in production environments for inference.

Prepare data

Prepare a training dataset and a test dataset for model training and upload the training dataset and test dataset to MaxCompute by running Tunnel commands on the MaxCompute client. For information about how to install and configure the MaxCompute client, see MaxCompute client (odpscmd). For more information about Tunnel commands, see Tunnel commands. In this example, the following commands are used:

# The command used to create tables. 
CREATE TABLE nlp_risk_train(content STRING, qince_result STRING);
# The command used to upload data. 
tunnel upload /Users/tongxin/xxx/nlp_risk_train.csv nlp_risk_train;
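
Before you run the upload command, you can optionally verify that the CSV file matches the schema of the destination table. The following Python sketch is a minimal example, assuming a headerless two-column CSV whose columns map to content and qince_result; the file name and the label set are taken from this example and may differ in your environment.

import csv

# Labels that are configured for the LabelEnumerateValues parameter in this example.
VALID_LABELS = {"Normal", "Vulgar", "Adult", "Porn", "Other Risks"}

def check_csv(path):
    # Check that each row has two fields and a label from the expected set.
    with open(path, encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f), start=1):
            if len(row) != 2:
                print(f"row {i}: expected 2 columns, got {len(row)}")
            elif row[1] not in VALID_LABELS:
                print(f"row {i}: unexpected label {row[1]!r}")

# Hypothetical local file name. Replace it with the path that you pass to tunnel upload.
check_csv("nlp_risk_train.csv")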

Create a text classification model

  1. Go to the Machine Learning Designer page.

    1. Log on to the Machine Learning Platform for AI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer) to go to the Machine Learning Designer page.

  2. Select the pipeline template that is used for text classification.

    1. On the Visualized Modeling (Designer) page, click the Preset Templates tab.

    2. Click the NLP tab.

    3. In the Text classification based on BERT model section, click Create.

  3. Configure the required parameters to create a text classification pipeline.

    In the Create Pipeline dialog box, configure the Pipeline Name, Description, Visibility, and Pipeline Data Path parameters, and then click OK to create the pipeline.

  4. Go to the pipeline details page and configure the component parameters.

    1. On the Visualized Modeling (Designer) page, click the Pipelines tab.

    2. Select the pipeline that you created and click Open.

    3. View the components of the pipeline on the canvas. The system automatically creates the pipeline based on the preset template. The following list describes how to configure each component:

      • Read Table (training data): Configures the training dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for model training. Example: nlp_risk_train.

      • Read Table (evaluation data): Configures the evaluation dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for model evaluation. Example: nlp_risk_dev.

      • Text Classification Training: Configures the parameters for text classification training. For more information about how to configure the Text Classification Training component, see Table 1 in this topic.

        Note: This component supports only data of the BOOLEAN, BIGINT, DOUBLE, STRING, or DATETIME type.

      • Read Table (prediction data): Configures the prediction dataset for the pipeline. Set the Table Name parameter of the Read Table component to the name of the MaxCompute table for prediction. Example: nlp_risk_dev.

      • Text Classification Prediction (MaxCompute): Uses the trained text classification model to make predictions on the prediction dataset. For more information about how to configure the Text Classification Prediction (MaxCompute) component, see Table 2 in this topic.

      Table 1. Configure the text classification component

      Fields Setting tab:

      • TextColumn: The column that stores the data for text classification in the input table. Example: content.

      • LabelColumn: The label column in the input table. Example: qince_result.

      • LabelEnumerateValues: The values of labels. Separate multiple values with commas (,). Example: Normal, Vulgar, Adult, Porn, Other Risks.

      • Sample weight Column Name: Optional. The column whose values are used to weight samples for data enhancement. Example: N/A.

      • ModelSavePath: The OSS path of the trained model. Example: oss://tongxin-lly.oss-cn-hangzhou-internal.aliyuncs.com/pai/text_spam_rb/.

      Parameters Setting tab:

      • OptimizerType: The type of the optimizer. Valid values: adam, adagrad, and lamb. Example: adam.

      • batchSize: The number of samples that are processed at the same time. If the model is trained on multiple servers that use multiple GPUs, this parameter specifies the number of samples that are processed by each GPU at the same time. Example: 32.

      • sequenceLength: The maximum sequence length. Valid values: 1 to 512. Example: 64.

      • numEpochs: The number of epochs for model training. Example: 1.

      • LearningRate: The learning rate during model training. Example: 1e-5.

      • Model selection: The language model that is used for pre-training. Example: pai-bert-base-zh.

      • AdditionalParameters: A user-defined parameter. You can configure a pre-trained model by specifying pretrain_model_name_or_path. Valid values: base-roberta, base-bert, tiny-roberta, and tiny-bert. Sorted by model precision: base-roberta > base-bert > tiny-roberta > tiny-bert. Sorted by model speed: base-roberta = base-bert < tiny-roberta = tiny-bert. Example: N/A.

      Tuning tab:

      • WorkerCount: The number of distributed servers. Default value: 1. Example: 1.

      • NumCPU: The number of CPUs for each worker. Example: 1.

      • NumGPU: The number of GPUs for each worker. Example: 1.

      • DistributionStrategy: The distribution policy. Valid values: MirroredStrategy (single worker with multiple GPUs), which uses one server with multiple CPUs or GPUs, and ExascaleStrategy (multiple workers with multiple GPUs), which uses multiple servers with multiple CPUs or GPUs. Example: MirroredStrategy.

      Table 2. Configure the text classification prediction component

      Parameters Setting tab:

      • First Text Column: The column that stores the data for text classification in the input table. Example: content.

      • OutputSchema: The output columns that store the prediction results. Separate multiple columns with commas (,). Example: predictions,probabilities,logits.

      • Prediction Threshold: The probability threshold for making predictions. Example: N/A.

      • Append Columns: The columns in the input table that you want to add to the output table. Separate multiple columns with commas (,). Example: content, qince_result.

      • batchSize: The number of samples that are processed at the same time during prediction. Example: 32.

      • Use User-defined Model: If no upstream component exists, you can use a trained model that is stored in an OSS directory to make predictions. In this example, the model is trained by the upstream component, so you do not need to configure this parameter. Example: No.

      Tuning tab:

      • WorkerCount: The number of distributed servers. Default value: 1. Example: 1.

      • NumCPU: The number of CPUs for each worker. Example: 1.

      • NumGPU: The number of GPUs for each worker. Example: 1.

Deploy and call the model service

You can deploy the trained content moderation model as an online service by using EAS and call the service in the production environment for inference.

  1. Go to the EAS-Online Model Services page.

    1. Log on to the Machine Learning Platform for AI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the EAS-Online Model Services page.

  2. Deploy the model service.

    1. On the EAS-Online Model Services page, click Deploy Service.

    2. On the Deploy Service page, configure the parameters and click Deploy. The following table describes only the key parameters. For information about other parameters, see the "Upload and deploy models in the console" section in Model service deployment by using the PAI console.

      Parameter

      Description

      Service Name

      The name of the model service. To differentiate model services, we recommend that you specify a name based on your business requirements.

      Deployment Method

      Deploy Service by Using Model and Processor

      Processor Type

      Select EasyTransfer(CPU).

      Model Type

      Select Text Classification.

      Model Files

      Select Mount OSS Path. In this example, the trained model is stored in OSS.

      Select the deployment folder in the path that is specified by the ModelSavePath parameter of the Text Classification Training component. The folder contains the following files and folders: variables, config.json, saved_model.pb, vocab.txt, and label_mapping.json. You must select the path of the /deployment/ folder as the OSS path.

      Resource Deployment Information

      Specify the resource information based on the service that you purchased and the processor type that you selected.

  3. Debug the model service.

    1. On the EAS-Online Model Services page, find the service that you want to debug and click Online Debugging in the Actions column.

    2. In the Request Parameter Online Tuning section of the debugging page, enter the following content in the text editor below Request Body.

      {"id": "113","first_sequence": "拥抱编辑3.0时代! 内容升级为产品海内外媒体规划下个十年。","sequence_length": 128}
    3. Click Send Request and view the prediction result in the Debugging Info section.

  4. View the public endpoint and token that are used to access a specific model service.

    1. On the EAS-Online Model Services page, find the model service that you want to access and click Invocation Method in the Service Type column.

    2. In the Invocation Method dialog box, click the Public Endpoint tab to view the public endpoint and Token that are used to access the model service.
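
    After you obtain the endpoint and token, you can also send a test request over plain HTTP from your own environment, which reproduces the call that the debugging page sends. The following Python sketch is a minimal example; the endpoint and token values are hypothetical placeholders, and the sketch assumes that the public endpoint already includes the service path and that the token is passed in the Authorization header.

      import requests

      # Hypothetical values. Replace them with the public endpoint and token that are
      # displayed in the Invocation Method dialog box.
      endpoint = "http://1664xxxxxxxxxxx.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/nlp_risk_cls002"
      token = "MTgxNjE1NGVmMDdjNDRkY2Q5NWE4xxxxxxxxxxxxxxxxxxxxxxx"

      # The same request body that is used on the debugging page.
      body = '{"id": "113","first_sequence": "拥抱编辑3.0时代! 内容升级为产品海内外媒体规划下个十年。","sequence_length": 128}'

      # Send the request and print the raw response.
      response = requests.post(endpoint, headers={"Authorization": token}, data=body.encode("utf-8"))
      print(response.status_code)
      print(response.text)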

  5. Use a script to periodically call the model service.

    1. Create a Python script named eas_nlp_risk.py to call the model service.

      #!/usr/bin/env python
      # encoding=utf-8
      from eas_prediction import PredictClient
      from eas_prediction import StringRequest
      if __name__ == '__main__':
          # Specify the public endpoint that is used to call the model service. 
          client = PredictClient('http://1664xxxxxxxxxxx.cn-hangzhou.pai-eas.aliyuncs.com', 'nlp_risk_cls002')
          # Specify the token that you want to use. 
          client.set_token('MTgxNjE1NGVmMDdjNDRkY2Q5NWE4xxxxxxxxxxxxxxxxxxxxxxx')
          client.init()
          # Construct the request based on the model that you want to use. In this example, the input and output are of the STRING type. 
          request = StringRequest('[{"id": "110","first_sequence": "想赢勇士想到发疯? 格林新发现吓呆众人","sequence_length": 128},{"id": "112","first_sequence": "骗人的,千万别买,谁买谁后悔?商家就是欺诈。 垃圾商家。 买了之后想退货门都没有,以各种手段不退货。 买者慎重。","sequence_length": 128},{"id": "113","first_sequence": "很不错的,正品,很给力,男性同胞的福音,改善的效果特别的好,效果真的是不错的。 是能增大2cm","sequence_length": 128}]')
          # Call the service repeatedly to simulate periodic invocation.
          for x in range(0, 50000):
              resp = client.predict(request)
              # print(str(resp.response_data, 'utf8'))
      print("test ended")
      
    2. Upload the eas_nlp_risk.py Python script to your client and run the following command to call the model service in the directory where the script is stored:

      python3 <eas_nlp_risk.py>

      Replace <eas_nlp_risk.py> with the name of the Python script that you want to use.
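
    The commented-out print call in the script shows that resp.response_data contains the raw UTF-8 bytes of the service response. The exact response schema depends on the processor, so the following sketch only decodes the bytes and pretty-prints them as JSON when possible; it is a minimal helper that you can call inside the loop of the script above.

      import json

      def print_response(resp):
          # resp is the object that client.predict(request) returns.
          payload = str(resp.response_data, 'utf8')
          try:
              # Pretty-print the response if it is valid JSON.
              print(json.dumps(json.loads(payload), ensure_ascii=False, indent=2))
          except ValueError:
              # Fall back to the raw text if the response is not JSON.
              print(payload)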

  6. View service metrics.

    After you call the model service, you can view the service metrics, such as the queries per second (QPS), response time (RT), CPU utilization, GPU utilization, and memory usage.

    1. On the EAS-Online Model Services page, click the monitoring icon in the Service Monitoring column of the service that you called.

    2. On the Service Monitoring tab, you can view the metrics of the service. The metrics of your service may vary based on the actual business scenario.
