By Garvin Li
News classification is a common scenario in text mining. Many media outlets and content producers still classify news text by manual tagging, which consumes considerable human resources. This article classifies news text with text mining algorithms, so the classification is done entirely by machine, without any manual tagging.
In this article, automatic news classification is implemented by running the PLDA algorithm and then clustering the resulting topic weights. The process includes word splitting, word type conversion, disabled-word (stop-word) filtering, topic mining, and clustering. The experiment is run on the Alibaba Cloud Machine Learning Platform.
Note: The data in this article is fictitious and is only used for experimental purposes.
The data screenshot is shown below.
The detailed fields are as follows:
The experiment flow chart is as follows.
The experiment is roughly divided into the following 5 steps:
Each row of the data source in this experiment is a single news unit. An ID column must be added as a unique identifier for each news unit, so that the algorithms in the following steps can refer to individual articles.
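As a minimal sketch of this step outside the platform, the snippet below prepends a sequential ID to each news unit. The column name `append_id` is taken from the later steps of this article; the sample rows, the `content` field name, and the choice to start numbering at 0 are assumptions made for illustration.

```python
# Hypothetical sample rows; real data would come from the news table.
news = [
    {"content": "The central bank raised interest rates today."},
    {"content": "The home team won the championship final."},
]


def add_id_column(rows, id_field="append_id"):
    """Return new rows with a sequential integer ID as the first field.

    Starting at 0 is an assumption; any unique value per row would do.
    """
    return [{id_field: i, **row} for i, row in enumerate(rows)]


rows_with_id = add_id_column(news)
for row in rows_with_id:
    print(row["append_id"], row["content"])
```

The only requirement is that the ID is unique per news unit, so that the results of later steps can be joined back to the original text.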
These two steps are the most common practices in the field of text mining.
The word splitting component is first used to split the content field (news content) into words. After the filtered words are removed (filtered words are generally punctuation and auxiliary words), the word frequency is counted. The results are shown in the following figure.
The disabled-word filter component removes the words listed in the input disabled-word (stop-word) lexicon. These are generally punctuation marks and auxiliary words that have little influence on the meaning of the article.
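The two steps above can be sketched in plain Python. This is an illustration only, not the platform's components: the regular-expression tokenizer is a stand-in that works for English text (the word splitting component handles languages without spaces between words), and the `STOP_WORDS` set is a hypothetical, very small lexicon.

```python
import re
from collections import Counter

# Hypothetical disabled-word lexicon; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def split_words(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Punctuation is dropped as a side effect of the pattern.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Remove tokens that appear in the disabled-word lexicon."""
    return [t for t in tokens if t not in stop_words]


def word_frequency(text):
    """Count how often each remaining word appears in the text."""
    return Counter(filter_stop_words(split_words(text)))


freq = word_frequency("The market rose, and the market closed higher.")
print(freq.most_common())
```

In this example "market" is counted twice, while "the" and "and" are removed before counting.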
Before the PLDA text mining component can be used, the text must be converted to triple form (text to numerals), as shown in the following figure.
append_id is the unique identifier for each news unit.
The number in front of the colon in the key_value field indicates the numeral identifier that the word is abstracted into, and the colon is followed by the frequency at which the corresponding word appears.
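The conversion described above can be sketched as follows. Each word is mapped to an integer identifier, and each article becomes a string of `identifier:frequency` pairs. The colon format follows the description in this article; the comma between pairs, the order in which identifiers are assigned, and the sample documents are assumptions.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign each distinct word an integer ID in order of first appearance."""
    vocab = {}
    for tokens in docs.values():
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def to_key_value(tokens, vocab):
    """Encode one article as 'word_id:count' pairs joined by commas."""
    counts = Counter(vocab[t] for t in tokens)
    return ",".join(f"{wid}:{n}" for wid, n in sorted(counts.items()))


# Hypothetical tokenized articles keyed by append_id.
docs = {
    0: ["market", "rose", "market"],
    1: ["team", "won", "market"],
}

vocab = build_vocabulary(docs)
rows = [(doc_id, to_key_value(tokens, vocab)) for doc_id, tokens in docs.items()]
for append_id, key_value in rows:
    print(append_id, key_value)
```

Here "market" is assigned identifier 0, so the first article is encoded as `0:2,1:1`: word 0 appears twice and word 1 appears once.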
Run the PLDA algorithm on the data.
PLDA is a topic model algorithm that can find the words representing the topic of each article. This experiment sets the number of topics to 50. The PLDA component has 6 output ports, and the 5th output port outputs the probability of each topic for each article, as shown in the following figure.
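To make the per-article topic probabilities more concrete, here is a minimal single-machine sketch of a topic model from the same family: Latent Dirichlet Allocation trained with collapsed Gibbs sampling. It is not the platform's PLDA implementation. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions, and 2 topics are used instead of 50 to keep the example small.

```python
import random


def lda_gibbs(docs, vocab_size, num_topics, iterations=200,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of articles, each a list of integer word IDs.
    Returns one topic probability vector per article.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # topic counts per article
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # total words per topic
    assignments = []

    # Start with a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic probabilities for each article.
    result = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        result.append([(doc_topic[d][t] + alpha) / denom
                       for t in range(num_topics)])
    return result


# Hypothetical articles as word IDs: two use words 0-2, two use words 3-5.
toy_docs = [[0, 1, 0, 2], [0, 1, 1], [3, 4, 3, 5], [4, 5, 5]]
topic_probs = lda_gibbs(toy_docs, vocab_size=6, num_topics=2)
for append_id, probs in enumerate(topic_probs):
    print(append_id, [round(p, 3) for p in probs])
```

Each output row plays the role of the per-article topic probabilities described above: one probability per topic, summing to 1.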
The above steps represent each article as a vector whose dimensions are the topics.
Articles can then be classified by clustering these vectors according to the distances between them. The classification results of the K-means clustering component are shown in the figure below.
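The clustering step can be illustrated with a small K-means sketch in plain Python. This is not the platform's K-means component: Euclidean distance, initialization from the first `k` vectors, the fixed iteration count, and the toy topic vectors are all assumptions made to keep the example short and deterministic.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Cluster vectors with basic K-means using Euclidean distance.

    Centroids start at the first k vectors, which keeps the example
    deterministic; real implementations usually initialize more carefully.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical topic probability vectors for four articles (2 topics).
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centroids = kmeans(topic_vectors, k=2)
print(labels)
```

Articles whose topic vectors are close together receive the same cluster label, which is how the topic weights are turned into news categories.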
The 4 articles 115, 292, 248, and 166 are queried through the filtering and mapping component. The results are shown in the following figure.
The experimental result is not perfect. Most of the articles are classified correctly, but as the above figure shows, one financial news unit, one technology news unit, and two sports news units were grouped into the same cluster.
The main reasons are as follows:
To learn more about Alibaba Cloud Machine Learning Platform for Artificial Intelligence (PAI), visit www.alibabacloud.com/product/machine-learning