The data in this topic is fictitious and is only used for experimental purposes.
This experiment is intended to introduce text components. To improve the final results, please contact us. We will provide you with complete solutions and business cooperation.
News classification is a common text mining scenario. Many media companies and content producers still classify news text by manual tagging, which consumes considerable human resources. This topic classifies news text with text mining algorithms, so the classification is performed entirely by machine without any manual tagging.
In this document, automatic news classification is implemented by using the PLDA algorithm and clustering on topic weights. The process includes word segmentation (word breaking), word type conversion, stop word (deprecated word) filtering, topic mining, and clustering.
The data screenshot is shown as follows.
The following table describes the fields:
| Field | Meaning | Type | Description |
|---|---|---|---|
| category | News type | string | Sports, women, society, military, and technology. |
The following figure shows the experiment process.
The experiment is roughly divided into the following five steps:
- 1: Add an ID column
- 2: Perform word segmentation and word frequency analysis
- 3: Filter stop words (deprecated words)
- 4: Mine text topics
- 5: Analyze and evaluate results
Each row of the data source in this experiment is a single news unit. An ID column is added as a unique identifier for each news unit, so that the subsequent algorithms can refer to individual news units.
This step is a common practice in the field of text mining.
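As a minimal sketch of this step outside the platform, the snippet below adds a sequential ID to each news unit. The sample rows are hypothetical, and using a zero-based counter as the ID is an assumption; only the column name `append_id` comes from this experiment.

```python
# Hypothetical sample rows; the real experiment reads them from a table.
news = [
    {"category": "sports", "content": "The team won the final match."},
    {"category": "technology", "content": "A new phone chip was released."},
]


def add_id_column(rows, id_field="append_id"):
    """Return copies of the rows with a sequential integer ID added."""
    return [{id_field: i, **row} for i, row in enumerate(rows)]


news_with_id = add_id_column(news)
```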
Use the Split Word component to segment the content field (news content) into words. Filtered words include punctuation marks and auxiliary words. The following figure shows the result.
Use the Deprecated Word Filtering component to remove the words that are listed in the input deprecated-word (stop word) lexicon. These are typically punctuation marks and auxiliary words that have little impact on the meaning of the news content.
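The idea behind segmentation and word frequency analysis can be sketched in a few lines. This is not the Split Word component itself: the regular-expression tokenizer below is a stand-in that only suits space-delimited text, and the function names are hypothetical.

```python
import re
from collections import Counter


def split_words(text):
    """Lowercase the text and return its alphanumeric tokens (punctuation is dropped)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_frequencies(doc_id, text):
    """Return (doc_id, word, count) rows for one news unit, sorted by word."""
    counts = Counter(split_words(text))
    return [(doc_id, word, n) for word, n in sorted(counts.items())]


freq_rows = word_frequencies(0, "The team won, and the fans cheered the team.")
```

Each row pairs a news unit ID with one word and the number of times that word occurs in the unit.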
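A rough equivalent of this filtering step is a set lookup against the lexicon. The lexicon below is a hypothetical five-word example; the experiment supplies its own.

```python
# Hypothetical stop-word lexicon; the experiment reads its own lexicon as input.
STOP_WORDS = {"the", "and", "a", "of", "to"}


def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that do not appear in the lexicon, preserving order."""
    return [t for t in tokens if t not in stop_words]


tokens = ["the", "team", "won", "and", "the", "fans", "cheered"]
kept = filter_stop_words(tokens)
```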
- Before using the PLDA component, convert the text to triple form (text to numbers), as shown in the following figure.
- append_id is the unique identifier of each news unit.
- In the key_value field, the number before the colon is the numeric ID that a word is mapped to, and the number after the colon is the frequency with which that word appears.
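The conversion described above can be sketched as follows. The word IDs are assigned in first-seen order and the pairs are joined with a space; both choices are assumptions made for illustration, since only the `word_id:frequency` shape is described here.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign each distinct word an integer ID in first-seen order."""
    vocab = {}
    for tokens in docs.values():
        for word in tokens:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_key_value(tokens, vocab):
    """Encode tokens as 'word_id:count' pairs, sorted by word_id."""
    counts = Counter(vocab[w] for w in tokens)
    return " ".join(f"{wid}:{n}" for wid, n in sorted(counts.items()))


# Hypothetical, already filtered tokens keyed by append_id.
docs = {0: ["team", "won", "team"], 1: ["chip", "team"]}
vocab = build_vocabulary(docs)
rows = [(doc_id, to_key_value(tokens, vocab)) for doc_id, tokens in docs.items()]
```

Here `rows` holds one `(append_id, key_value)` pair per news unit.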
Apply the PLDA algorithm to the data.
PLDA is a topic model that can identify the words that represent the topic of each news unit. This experiment sets the number of topics to 50. The PLDA component has six output ports, and the fifth port outputs the probability of each topic for each news unit, as shown in the following figure.
The preceding steps represent each news unit as a vector of topic probabilities.
News units can then be classified by clustering these vectors by distance. The classification results of the K-means Clustering component are shown in the following figure.
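To illustrate what a per-unit topic probability table looks like, the sketch below runs a small single-machine LDA with collapsed Gibbs sampling. It is not the PLDA component: the sampler, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions chosen for readability.

```python
import random


def lda_gibbs(docs, n_topics, n_words, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic probabilities."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]              # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                            # total words per topic
    assign = []                                             # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic distribution for each document.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


# Toy documents given as lists of word IDs (hypothetical data).
docs = [[0, 1, 0, 1, 0], [1, 0, 1, 1], [2, 3, 2, 3, 3], [3, 2, 2, 3]]
theta = lda_gibbs(docs, n_topics=2, n_words=4)
```

Each row of `theta` is one news unit and each column is one topic, which is the shape of table that the clustering step consumes.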
- cluster_index indicates the class that each news unit is assigned to.
- Find class 0. It contains a total of 4 news units, with docid values 115, 292, 248, and 166.
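Clustering topic vectors by distance can be sketched with a plain k-means loop (Lloyd's algorithm). This is a simplified stand-in for the K-means Clustering component: the two-dimensional vectors are hypothetical, whereas the experiment uses 50 topics, and the random initialization and fixed iteration count are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=50, seed=0):
    """Lloyd's algorithm; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical topic vectors for four news units.
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
cluster_index = kmeans(topic_vectors, k=2)
```

News units whose topic vectors are close together receive the same value in `cluster_index`.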
Query the 4 news units 115, 292, 248, and 166 by using the Filtering and Mapping component. The following figure shows the result.
The experiment results are unsatisfactory. In the preceding figure, a financial news unit, a technology news unit, and two sports news units are grouped together.
The main reasons are as follows.
- No detailed parameter tuning was performed.
- No feature engineering was applied to the data.
- The data volume is too small.
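One simple way to put a number on how mixed class 0 is would be cluster purity, the share of members that carry the most common label. Purity is not part of the original experiment and is used here only as an illustration; the label list reflects the composition reported above (one finance, one technology, and two sports news units).

```python
from collections import Counter


def cluster_purity(labels):
    """Fraction of members that share the cluster's most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Composition of class 0 as reported in this experiment.
cluster_0 = ["finance", "technology", "sports", "sports"]
purity = cluster_purity(cluster_0)  # 2 of the 4 members are sports
```

A purity of 1.0 would mean that every news unit in the class has the same category, so a value of 0.5 is consistent with the unsatisfactory result.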