Platform For AI: Classify news based on text analysis

Last Updated: Oct 13, 2023

This topic describes how to use the text analysis components that are provided by Machine Learning Platform for AI (PAI) to build a news classification model.

Background information

News classification is a common scenario in text mining. Many media organizations and content producers classify news by labeling it manually, which is labor-intensive. You can use the intelligent text mining algorithms that are provided by PAI to automate news classification tasks. The tasks include word segmentation, part-of-speech conversion, stop word filtering, topic modeling, and clustering. The pipeline described in this topic uses the Partially Labeled Dirichlet Allocation (PLDA) algorithm to perform topic modeling and then clusters the news based on topic weights to classify it.

Note

The dataset that is used in this topic is only for experimental use.

Prerequisites

Classify news based on text analysis

  1. Go to the Machine Learning Designer page.

    1. Log on to the Machine Learning Platform for AI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer) to go to the Machine Learning Designer page.

  2. Create a pipeline.

    1. On the Visualized Modeling (Designer) page, click the Preset Templates tab.

    2. In the Text Analysis-News Classification section of the Preset Templates tab, click Create.

    3. In the Create Pipeline dialog box, configure the parameters. You can use their default values.

      The Pipeline Data Path parameter specifies the Object Storage Service (OSS) bucket path that stores the temporary data and models generated while the pipeline runs.

    4. Click OK.

      The pipeline is created in about 10 seconds.

    5. In the pipeline list, double-click the Text Analysis-News Classification pipeline to open the pipeline.

    6. View the components of the pipeline on the canvas as shown in the following figure. The system automatically creates the pipeline based on the preset template.

      [Figure: news classification experiment]

      Component

      Description

      The Append Id component adds an ID column to the data that is read from the dataset. Each data record in the dataset is a single piece of news. You must add an ID column to uniquely identify each data record. This facilitates computation for subsequent algorithms.

      The components that are displayed in this section divide the content of the news into words and count the number of occurrences of each word. The Split Word component divides the content of the news, which is the value of the content field, into words. The Doc Word Stat component counts the number of occurrences of each word in the text from which stop words are filtered out.

      The Filter Noise component filters out stop words from the content of the news. Stop words include punctuation marks and grammatical particles that do not contribute to the meaning of the news.
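The word splitting, stop word filtering, and word counting described above can be illustrated with a minimal Python sketch. It is not the implementation of the Split Word, Filter Noise, or Doc Word Stat components: the whitespace tokenizer, the `STOP_WORDS` set, and all function names are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical stop word list; the Filter Noise component uses its own list.
STOP_WORDS = {"the", "a", "of", "and", ",", "."}


def split_words(content):
    """Naive whitespace segmentation (a stand-in for real word segmentation)."""
    return content.lower().split()


def filter_noise(words, stop_words=STOP_WORDS):
    """Drop stop words such as punctuation marks and grammatical particles."""
    return [w for w in words if w not in stop_words]


def doc_word_stat(docs):
    """Return (append_id, word, count) triples for each piece of news."""
    triples = []
    for append_id, content in docs:
        counts = Counter(filter_noise(split_words(content)))
        for word, count in sorted(counts.items()):
            triples.append((append_id, word, count))
    return triples


# Each record is (append_id, content), mirroring the ID column added earlier.
docs = [
    (0, "The team beat the other team ."),
    (1, "The market and the bank"),
]
triples = doc_word_stat(docs)
for row in triples:
    print(row)
```

Each output triple pairs a news ID with a word and its number of occurrences, which is the kind of word frequency data that the later steps consume.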

      The components that are displayed in this section perform topic modeling.

      1. The Triple to KV component converts word frequency data to the format that is supported by the PLDA component. In this format, each word is represented by a numeric ID instead of its text.

        Parameters:

        • append_id: the unique ID of the news.

        • key_value: the key-value pairs that indicate the word frequency. The number before the colon (:) is the numeric ID of a word. The number after the colon (:) is the number of occurrences of the word.

      2. The PLDA component trains the topic model.

        The PLDA algorithm is a topic modeling algorithm. The algorithm can find words that indicate the topic of each piece of news. Fifty topics are configured in this pipeline. The fifth output port of the PLDA component generates data that indicates the probability that each piece of news belongs to each of the 50 topics.
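To make the key_value format that is fed to the PLDA component concrete, the following Python sketch converts (append_id, word, count) triples into one key-value string per piece of news. It is an illustration, not the Triple to KV component itself: the function name, the order in which word IDs are assigned, and the space used to separate pairs are assumptions. Only the `id:count` shape of each pair comes from the description above.

```python
def triples_to_kv(triples, pair_sep=" "):
    """Convert (append_id, word, count) triples to (append_id, key_value) rows.

    Word IDs are assigned in order of first appearance (an assumption).
    pair_sep is also an assumption; the source only describes "id:count".
    """
    word_ids = {}  # word -> numeric ID
    rows = {}      # append_id -> list of "id:count" strings
    for append_id, word, count in triples:
        word_id = word_ids.setdefault(word, len(word_ids))
        rows.setdefault(append_id, []).append(f"{word_id}:{count}")
    kv_rows = [(append_id, pair_sep.join(pairs)) for append_id, pairs in rows.items()]
    return kv_rows, word_ids


# Sample word frequency data: (append_id, word, count).
sample_triples = [
    (0, "beat", 1),
    (0, "team", 2),
    (1, "bank", 1),
    (1, "team", 1),
]
kv_rows, word_ids = triples_to_kv(sample_triples)
for append_id, key_value in kv_rows:
    print(append_id, key_value)
```

In this sketch, the word "team" receives the same numeric ID in both records, so the topic model sees it as the same term across pieces of news.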

      The components that are displayed in this section analyze and evaluate the classification results. After the preceding steps are performed, the topics of the news are converted to vectors. You can perform clustering based on the distances between vectors to classify news.
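The idea of clustering news by the distance between topic vectors can be sketched with a small k-means loop in plain Python. This is a simplified illustration under stated assumptions, not the KMeans component: it uses three topics instead of 50, Euclidean distance, a fixed number of iterations, and a deterministic initialization that takes the first k vectors as centers. The sample vectors are made up for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two topic probability vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20):
    """Assign each vector to one of k clusters and return the cluster indexes."""
    centers = [list(v) for v in vectors[:k]]  # assumption: first k vectors as initial centers
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the nearest center for every vector.
        assignments = [
            min(range(k), key=lambda c: euclidean(v, centers[c])) for v in vectors
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Made-up topic vectors: each row holds the probabilities of three topics.
topic_vectors = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
]
cluster_index = kmeans(topic_vectors, k=2)
print(cluster_index)
```

News items whose topic distributions are close to each other end up with the same cluster index, which is the basis for treating a cluster as a category.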

  3. Run the pipeline and view the results.

    1. In the upper-left corner of the canvas, click the Run icon.

    2. After you run the pipeline, right-click KMeans on the canvas and choose View Data > Output Clustering Table to view the classification result.

      [Figure: classification results]
      • cluster_index: the index of the cluster (category) to which the news is assigned.

      • append_id: the unique ID of the news.

    3. Right-click Sql Mapping on the canvas and choose View Data > Output Port to view the news items whose append_id values are 115, 292, 248, and 166.

      The classification results of this pipeline are not satisfactory. For example, two pieces of sports news, one piece of financial news, and one piece of science and technology news are classified into the same category. You can use the following methods to improve the results:

      • Use a larger dataset for the pipeline.

      • Perform feature engineering on the dataset or tune the parameters of the components.

      In the pipeline, the Filter Criteria parameter of the Sql Mapping component is preset to display the news items whose append_id values are 115, 292, 248, and 166. To view other news, you can configure the Filter Criteria parameter based on the format of the following example:

      append_id=292 or append_id=115 or append_id=248 or append_id=166;
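If you want to inspect many news items, a filter string in the same format can be generated instead of typed by hand. The following Python helper is a hypothetical convenience, not part of PAI: the function name is an assumption, and the trailing semicolon only mirrors the example above.

```python
def build_filter(append_ids):
    """Build a Filter Criteria string such as 'append_id=1 or append_id=2;'."""
    # int() rejects values that are not valid IDs before they reach the filter.
    conditions = [f"append_id={int(i)}" for i in append_ids]
    return " or ".join(conditions) + ";"


filter_criteria = build_filter([292, 115, 248, 166])
print(filter_criteria)
```

The resulting string can be pasted into the Filter Criteria parameter of the Sql Mapping component.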