This topic describes how to use the text analysis components that Machine Learning Platform for AI (PAI) provides to build a news classification model.

Background information

News classification is a common text mining scenario. Many media companies and content producers classify news by labeling it manually, which is labor-intensive. You can use the text mining algorithms that PAI provides to automate news classification. The tasks involved include word segmentation, part-of-speech conversion, stop word filtering, topic modeling, and clustering. The experiment described in this topic uses the Partially Labeled Dirichlet Allocation (PLDA) algorithm to perform topic modeling. The experiment then clusters the news based on the topic weights to classify the news.
Note: The data used in the experiment is for experimental use only.

Dataset

The following table describes the fields in the dataset that the experiment uses.
Field | Data type | Description
category | STRING | The type of the news. Common news types are sports, women, society, science and technology, and military.
title | STRING | The title of the news.
content | STRING | The content of the news.

Procedure

  1. Go to the Machine Learning Studio console.
    1. Log on to the PAI console.
    2. In the left-side navigation pane, choose Model Training > Studio-Modeling Visualization.
    3. On the PAI Visualization Modeling page, find the project in which you want to create an experiment and click Machine Learning in the Operation column.
  2. Create an experiment.
    1. In the left-side navigation pane, click Home.
    2. In the Templates section, click Create below [Text Analysis] News Classification.
    3. In the New Experiment dialog box, set the experiment parameters. You can use the default values of the parameters.
      Parameter | Description
      Name | The name of the experiment. Default value: [Text Analysis] News Classification. The name must be 1 to 32 characters in length. Enter a name that meets this requirement, for example, News Classification.
      Project | The project in which you want to create the experiment. You cannot change the value of this parameter.
      Description | The description of the experiment. Default value: Use topic models to achieve text classification.
      Save To | The directory for storing the experiment. Default value: My Experiments.
    4. Click OK.
    5. Optional: Wait about 10 seconds. Then, click Experiments in the left-side navigation pane.
    6. Optional: Click News Classification_XX under My Experiments.
      My Experiments is the directory for storing the experiment that you created and News Classification_XX is the name of the experiment. In the experiment name, _XX is the ID that the system automatically creates for the experiment.
    7. View the components of the experiment on the canvas, as shown in the following figure. The system automatically creates the experiment based on the preset template.
      Figure: Experiment on news classification
      Area No. | Description
      1 | The Add ID Column-1 component adds an ID column to the data that is read from the dataset. Each data record in the dataset is a single piece of news, so an ID column is required to uniquely identify each record. This facilitates computation for the subsequent algorithms.
      2 | The components in this area divide the content of the news into words and count the number of occurrences of each word. The Word Splitting-1 component divides the content of the news, which is the value of the content field, into words. The Word Frequency Statistics-1 component counts the number of occurrences of each word in the text from which stop words have been filtered out.
      3 | The Deprecated Word Filter-1 component filters out stop words from the content of the news. Stop words include punctuation marks and grammatical particles that do not contribute to the meaning of the news.
      4 | The components in this area perform topic modeling.
      1. The Convert Row, Column, and Value to KV Pair-1 component converts word frequency data to the format that is supported by the PLDA-1 component. In this format, words are represented by numbers.
        The result data contains the following fields:
        • append_id: the unique ID of the news.
        • key_value: the key-value pairs that indicate the word frequency. The number before the colon (:) is the numeric ID of a word, and the number after the colon (:) is the number of occurrences of the word.
      2. The PLDA-1 component trains the topic model.
        The PLDA algorithm is a topic modeling algorithm. It can find words that indicate the topic of each piece of news. A total of 50 topics are configured in this experiment. The fifth output port of the PLDA-1 component generates data that shows the probability that each piece of news belongs to each of the 50 topics.
      5 | The components in this area analyze and evaluate the classification result. After the preceding steps are performed, each piece of news is represented as a vector of topic probabilities. You can perform clustering based on the distances among these vectors to classify news.
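The key_value format described for area 4 can be illustrated with a small sketch. This is not the implementation of the Convert Row, Column, and Value to KV Pair-1 component. The function names, the sample words, the way numeric IDs are assigned (order of first appearance), and the space used to separate pairs are all assumptions made for illustration. Only the "word ID, colon, count" shape of each pair comes from the description above.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct word a numeric ID in order of first appearance.

    The ID assignment scheme is an assumption for this sketch.
    """
    vocab = {}
    for words in tokenized_docs.values():
        for word in words:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_key_value(words, vocab):
    """Encode one document's word counts as 'word_id:count' pairs.

    Pairs are joined with a space, which is an assumed separator.
    """
    counts = Counter(vocab[word] for word in words)
    return " ".join(f"{word_id}:{n}" for word_id, n in sorted(counts.items()))


# Hypothetical, already segmented and stop-word-filtered news, keyed by append_id.
docs = {
    0: ["team", "wins", "match", "team"],
    1: ["market", "falls", "market"],
}

vocab = build_vocabulary(docs)
rows = {append_id: to_key_value(words, vocab) for append_id, words in docs.items()}

for append_id, key_value in rows.items():
    print(append_id, key_value)
```

In this sketch, the word "team" receives ID 0 and appears twice in the first document, so its pair is 0:2.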
  3. Run the experiment and view the result.
    1. In the top toolbar of the canvas, click Run.
    2. After the experiment is run, right-click K-means Clustering-1 on the canvas and choose View Data > View Output Port 1 to view the classification result.
      The classification result contains the following fields:
      • cluster_index: the index of the category (cluster) to which the news is assigned.
      • append_id: the ID of the news. For example, the news items that are identified by 115, 292, 248, and 166 belong to Category 0.
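The cluster_index values come from clustering the per-news topic vectors by distance. The following is a minimal sketch of that idea, assuming plain k-means with Euclidean distance. It is not the implementation of the K-means Clustering-1 component. The topic vectors, the IDs, the use of 3 topics instead of 50, and the choice to seed the centroids with the first k vectors are made up for illustration.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) with Euclidean distance.

    The first k vectors are used as initial centroids so that the result
    is deterministic. This seeding is an assumption for the sketch.
    """
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Hypothetical topic probabilities per news item, keyed by append_id.
topic_vectors = {
    1: [0.8, 0.1, 0.1],
    2: [0.1, 0.8, 0.1],
    3: [0.7, 0.2, 0.1],
    4: [0.2, 0.7, 0.1],
}

ids = list(topic_vectors)
labels = kmeans([topic_vectors[i] for i in ids], k=2)

for append_id, cluster_index in zip(ids, labels):
    print(append_id, cluster_index)
```

News items whose topic vectors are close to each other, such as items 1 and 3 in this sample data, receive the same cluster index.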
    3. Right-click Filtering and Mapping-1 on the canvas and select View Data to view the news items that are identified by 115, 292, 248, and 166.
      The classification result of this experiment is not satisfactory. For example, two pieces of sports news, one piece of financial news, and one piece of science and technology news are classified into the same category. You can improve the accuracy of the news classification result by using the following methods:
      • Use a larger dataset for the experiment.
      • Perform feature engineering or parameter tuning on the dataset.
      The Filter Criteria parameter of the Filtering and Mapping-1 component is preset in this experiment to display the news items that are identified by 115, 292, 248, and 166. To view other news, set the Filter Criteria parameter based on the format of the following example:
      append_id=292 or append_id=115 or append_id=248 or append_id=166;
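If you need to view many news items, you can generate an expression in the same format instead of typing it by hand. The following is a minimal sketch. The helper name build_filter and the sample IDs are hypothetical. The example above ends with a semicolon, but the text does not say whether it is required, so the helper leaves it out.

```python
def build_filter(ids, column="append_id"):
    """Join 'column=id' terms with 'or', following the format of the example above."""
    # int() rejects values that are not whole numbers before they reach the expression.
    return " or ".join(f"{column}={int(news_id)}" for news_id in ids)


# Hypothetical IDs of the news items to display.
criteria = build_filter([7, 42, 108])
print(criteria)
```

You can then paste the printed expression into the Filter Criteria parameter of the Filtering and Mapping-1 component.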