News websites often push headlines to readers. Because a news website publishes a large amount of news in real time, it must mine potential headlines from the latest news, and how well it does so determines the quality of its news recommendations.
Machine learning algorithms are required to identify potential headlines among massive amounts of news. A conventional approach is to collect historical news data every day, train a model offline on the collected data, and push the resulting headline discovery model online for use on the next day. However, an offline-trained model lacks timeliness, because it relies on historical data to predict headlines for news that is reported in real time.
To solve this problem, Machine Learning Platform for AI (PAI) introduces the Online Learning solution, which combines streaming and offline algorithms. The solution processes massive data through offline training and then updates the model in real time by using streaming machine learning algorithms, so that batch and stream computation run together in one experiment. The following experiment shows how to use the Online Learning solution of PAI to mine headline news.
Currently, the Online Learning solution of PAI is in public preview. To use the solution, fill out the questionnaire:
After activating the Online Learning solution, click Try New Version to start the trial.
Note: The offline computation components of PAI are marked in blue, and the stream computation components are marked in green. The stream components are interconnected to form a computing group and must all be in the running or stopped state.
This experiment uses 30,000 news items from UCI open datasets.
The data includes the URL and publication time of each news item, 58 features, and 1 target value. The target value “share” is the number of times a news item was shared. During the modeling process, use the SQL Script component to turn the “share” field into a binary label that classifies news into headlines (more than 10,000 shares) and non-headlines (fewer than 10,000 shares).
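The labeling rule can be sketched in a few lines of Python. This is only an illustration of the threshold logic, not the SQL Script used in the experiment. The sample rows are hypothetical, and treating exactly 10,000 shares as a non-headline is an assumption, since the text does not cover that boundary case.

```python
HEADLINE_SHARE_THRESHOLD = 10_000  # threshold stated in the text


def label_headline(shares: int) -> int:
    """Return 1 for a headline (more than 10,000 shares), otherwise 0.

    Assumption: exactly 10,000 shares is labeled as a non-headline.
    """
    return 1 if shares > HEADLINE_SHARE_THRESHOLD else 0


# Hypothetical rows; only the "share" field name comes from the dataset description.
rows = [{"share": 593}, {"share": 12_400}, {"share": 10_000}]
labels = [label_headline(row["share"]) for row in rows]
print(labels)  # [0, 1, 0]
```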
The following figure shows the feature composition.
Use the logistic regression algorithm to train a binary classification model, which predicts whether a piece of news will become a headline.
Note: Currently, the Online Learning solution of PAI only supports the logistic regression algorithm.
Use the Model Conversion component to convert the offline logistic regression model to a streaming model that can be read by streaming algorithms.
Step 3 and subsequent steps involve streaming algorithms. PAI provides multiple streaming data sources. This experiment uses Datahub as an example.
Datahub URL: https://datahub.console.aliyun.com/datahub
Datahub is a type of streaming data queue that supports multiple languages such as Java and Python. You can use Datahub to link user-created real-time data and the training service of PAI. Note: The data streams imported by Datahub must be in the same format as the fields of the data streams used for offline training so that offline models can be updated in real time.
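Before data is written to the stream, it can help to check that its fields match the offline training table. The following is a minimal sketch of such a check in plain Python. It does not call the Datahub SDK, and the field names and type strings (`f1`, `f2`, `double`, `bigint`) are hypothetical placeholders rather than the real schema.

```python
from typing import Dict, List


def schema_mismatches(offline: Dict[str, str], stream: Dict[str, str]) -> List[str]:
    """Compare two {field_name: type_name} mappings and describe every difference."""
    problems = []
    for name, field_type in offline.items():
        if name not in stream:
            problems.append(f"missing field: {name}")
        elif stream[name] != field_type:
            problems.append(
                f"type mismatch for {name}: {stream[name]} != {field_type}"
            )
    for name in stream:
        if name not in offline:
            problems.append(f"unexpected field: {name}")
    return problems


# Hypothetical schemas for illustration only.
offline_schema = {"f1": "double", "f2": "double", "share": "bigint"}
stream_schema = {"f1": "double", "f2": "string", "extra": "double"}

for problem in schema_mismatches(offline_schema, stream_schema):
    print(problem)
```

An empty result means the stream records can be consumed with the same field layout as the offline training data.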
The Follow the Regularized Leader (FTRL) algorithm serves here as the streaming counterpart of the logistic regression algorithm: it updates the model incrementally as new samples arrive. Set its parameters as you would for the logistic regression algorithm. Pay attention to the Model Save Time Interval parameter, which determines how often the real-time computation produces a new model.
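To show what a streaming update looks like, here is a minimal sketch of the widely published per-coordinate FTRL-Proximal update for logistic loss. It is an assumption that the PAI component behaves similarly; its internals are not described here. The parameter names (`alpha`, `beta`, `l1`, `l2`) and the toy samples are illustrative and are not PAI component parameters.

```python
import math
from typing import Dict


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse dict features."""

    def __init__(self, alpha: float = 0.1, beta: float = 1.0,
                 l1: float = 0.0, l2: float = 0.0) -> None:
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z: Dict[str, float] = {}  # accumulated adjusted gradients
        self.n: Dict[str, float] = {}  # accumulated squared gradients

    def _weight(self, key: str) -> float:
        """Derive the current weight lazily from z and n."""
        z = self.z.get(key, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 regularization keeps this coordinate at zero
        sign = 1.0 if z > 0 else -1.0
        n = self.n.get(key, 0.0)
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x: Dict[str, float]) -> float:
        """Return the predicted probability of the positive class."""
        score = sum(self._weight(k) * v for k, v in x.items())
        score = max(min(score, 35.0), -35.0)  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x: Dict[str, float], y: int) -> float:
        """Consume one labeled sample (y is 0 or 1) and return the prior prediction."""
        p = self.predict(x)
        for k, v in x.items():
            g = (p - y) * v  # gradient of the log loss for this coordinate
            n = self.n.get(k, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[k] = self.z.get(k, 0.0) + g - sigma * self._weight(k)
            self.n[k] = n + g * g
        return p


# Toy stream: the hypothetical feature "f" is positive for headlines.
model = FTRLProximal()
for _ in range(200):
    model.update({"bias": 1.0, "f": 1.0}, 1)
    model.update({"bias": 1.0, "f": -1.0}, 0)
print(model.predict({"bias": 1.0, "f": 1.0}) > 0.5)
```

Because each call to `update` touches only the coordinates present in the sample, the model can be refreshed one record at a time, which is what makes this style of algorithm suitable for streams.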
Export the classification model in the PMML format and write the model to Object Storage Service (OSS). The write interval is the same as the model creation interval. Model write example:
If streaming evaluation data is available, the system can store real-time model evaluation metrics together with the model in OSS.
After the headline prediction model is created and stored in OSS, you can deploy the model through Elastic Algorithm Service (EAS) of PAI or download the model for use with a local prediction engine. Perform feature engineering on incoming news data based on the instructions in “Step 1: Train an offline model”. Enter the feature engineering result into Headline Mining Service to see whether the news is a potential headline.
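For local prediction, scoring a news item with a logistic regression model comes down to a weighted sum followed by a sigmoid. The sketch below assumes the coefficients have already been read from the exported model. The weights, the intercept, the feature names, and the 0.5 decision threshold are all hypothetical, and the code does not parse PMML or call EAS.

```python
import math
from typing import Dict


def headline_probability(features: Dict[str, float],
                         weights: Dict[str, float],
                         intercept: float) -> float:
    """Apply logistic regression coefficients to one feature vector."""
    score = intercept + sum(weights.get(name, 0.0) * value
                            for name, value in features.items())
    score = max(min(score, 35.0), -35.0)  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-score))


def is_potential_headline(features: Dict[str, float],
                          weights: Dict[str, float],
                          intercept: float,
                          threshold: float = 0.5) -> bool:
    """Flag the item when its probability reaches the (assumed) threshold."""
    return headline_probability(features, weights, intercept) >= threshold


# Hypothetical coefficients and features for illustration only.
weights = {"f1": 2.0, "f2": -1.0}
intercept = -0.5
news_features = {"f1": 1.0, "f2": 0.5}
print(is_potential_headline(news_features, weights, intercept))  # True
```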