Active Learning in Practice: Making the Model "Active"
1. What is active learning?
If you search the Internet, most of the answers are about "how to get children/students to learn actively". Sorry, that is not what this article is about.
Active Learning (AL) here refers to a machine learning technique: the algorithm actively selects the most valuable training samples to add to the training set. If a selected sample is unlabeled, it automatically requests manual annotation before the sample is used for training. In short, the goal is to train the best-performing model with as few training samples as possible.
2. What problems does it solve?
• Data annotation is expensive, especially in specialized domains.
• Data volumes are huge, making full training difficult, or training machines/time are limited.
3. What is its value?
Active learning reduces the amount of sample labeling, saving both annotation costs and training-resource costs; under the same data volume, it can also improve model performance.
For example, suppose the unlabeled pool contains 10 million samples. Using active learning to select 2 million samples for labeling and training can yield a model whose performance matches training on all 10 million. It is generally believed that active learning can cut the annotation volume by more than half. Sometimes the subset selected by active learning even trains a model that beats full training, and that is exactly the case in our business scenario.
4. What business problems do we face?
• Improving model performance: the CRO content security team's risk-control models must improve in performance every month.
• Utilizing reflow data: new violation patterns constantly appear across business scenarios, and we must prevent and control these new violation samples. However, the volume of online reflow data is huge; directly labeling and training on all of it is inefficient, and the labeling, machine, and training-time costs are enormous.
• Handling dirty data: although the image-labeling standard is uniquely defined, each annotator understands the standard's definitions differently, so mislabeling is common and the dataset contains a certain proportion of dirty data.
II. Introduction to active learning algorithm
Among current active learning algorithms, the most widely used family is information-based methods, which select the samples that are most uncertain to the current model as training data. Another widely studied family is query-by-committee methods, which select the samples that different models are most uncertain about. We now analyze these two families.
1. Uncertainty-based methods
The information-based strategy is widely used because of its practicality.
The greater the uncertainty, the more information a sample carries, and the more valuable it is for training.
A subset of labeled data is used to train the model, and the model then predicts the remaining unlabeled samples. Based on these predictions, an uncertainty measure identifies the most uncertain samples, which are sent to annotators for labeling, added to the training set, and used to retrain the model. The updated model then selects new data, and the process iterates.
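The loop above can be sketched as follows. This is a minimal pool-based sketch; `train`, `score_uncertainty`, and `request_labels` are hypothetical stand-ins for the model-training, uncertainty-scoring, and manual-annotation steps, not functions from the original article:

```python
import numpy as np

def active_learning_loop(labeled_x, labeled_y, pool_x, train, score_uncertainty,
                         request_labels, batch_size, rounds):
    """Pool-based active learning: repeatedly train, score the unlabeled pool,
    send the most uncertain samples to annotators, and retrain."""
    for _ in range(rounds):
        model = train(labeled_x, labeled_y)          # train on current labels
        scores = score_uncertainty(model, pool_x)    # higher = more uncertain
        picked = np.argsort(-scores)[:batch_size]    # most uncertain batch
        new_y = request_labels(pool_x[picked])       # manual annotation step
        labeled_x = np.concatenate([labeled_x, pool_x[picked]])
        labeled_y = np.concatenate([labeled_y, new_y])
        pool_x = np.delete(pool_x, picked, axis=0)   # remove mined samples
    return train(labeled_x, labeled_y), labeled_x, labeled_y
```

The stopping condition (`rounds`) could equally be a labeling budget or a target metric.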
There are three representative uncertainty measures, each giving rise to an active learning strategy. Note that comparing information content can be replaced by comparing probability values.
Least Confidence (LC) focuses on samples whose highest predicted confidence is still low, i.e. predictions whose "reliability" is weak. Its drawback is that it ignores easily confused samples. The selection rule is x*_LC = argmax_x ( 1 − P_θ(ŷ | x) ), where P_θ(y_i | x) is the probability that sample x is classified as class y_i (usually the score output by the model), ŷ is the class with the highest predicted probability, and i is the class index.
Margin sampling focuses on samples where the margin between the two highest confidences is smallest, i.e. samples that are easily confused: x*_M = argmin_x ( P_θ(ŷ₁ | x) − P_θ(ŷ₂ | x) ), where ŷ₁ and ŷ₂ are the most and second-most likely classes. This scheme improves on LC's shortcoming.
Entropy focuses on the samples with the largest overall information content: x*_E = argmax_x ( − Σ_i P_θ(y_i | x) log P_θ(y_i | x) ).
Taking a three-class problem as an example, the kinds of samples each of the three methods focuses on are compared below:
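The three measures can be sketched on predicted class probabilities as follows (names follow the standard literature; all three return "higher = more valuable"):

```python
import numpy as np

def least_confidence(probs):
    # LC: 1 minus the top predicted probability; high when the model is unsure
    return 1.0 - probs.max(axis=1)

def margin_score(probs):
    # Margin: gap between the top two probabilities. A SMALL gap means the
    # sample is easily confused, so negate it to get "higher = more valuable".
    part = np.sort(probs, axis=1)
    return -(part[:, -1] - part[:, -2])

def entropy_score(probs):
    # Entropy: overall information content of the predicted distribution
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)
```

For a confident prediction like (0.9, 0.05, 0.05), all three scores are low; for an ambiguous one like (0.4, 0.35, 0.25), all three are high, but the measures diverge on which ambiguous samples they rank first.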
2. Query By Committee (QBC)
Optimizing an ML model can be viewed as searching a version space; QBC finds the best model by shrinking the search scope of the version space.
The same training set trains several models of the same structure; the models vote, and the samples they disagree on are selected. These disputed samples are labeled, the models are retrained, and the process iterates.
A committee of 2 to 3 models can already achieve good performance; diversity is the key to the ensemble's performance:
• H.S. Seung first proposed the QBC method in 1992: multiple models vote on the classification result, and the samples with the most inconsistent votes are selected.
• DAS, published on arXiv in 2019, is a deep-neural-network version of QBC: two VGG-16 networks with identical structure are trained on the same dataset, and samples on which their predictions disagree are picked out.
• The ActiveDecorate method mines new data via QBC; after labeling, the new data is added to the training set to train a new classifier, which is directly added to the existing committee, and new data is then selected with the enlarged committee, iterating repeatedly.
The QBC method computes a sample's training value from the degree of disagreement among the committee's votes.
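A common disagreement measure for QBC is vote entropy over the committee's predicted labels. This is a hedged sketch of that standard measure; the exact formula used in the original article may differ:

```python
import numpy as np
from collections import Counter

def vote_entropy(committee_preds):
    """Vote entropy for Query By Committee.
    committee_preds: int array of shape (n_models, n_samples), each row one
    model's predicted class labels. Higher entropy = more disagreement =
    more training value."""
    n_models, n_samples = committee_preds.shape
    scores = np.zeros(n_samples)
    for j in range(n_samples):
        votes = Counter(committee_preds[:, j])            # vote counts per class
        p = np.array(list(votes.values())) / n_models      # vote fractions
        scores[j] = -(p * np.log(p)).sum()                 # entropy of the votes
    return scores
```

A sample on which all committee members agree scores 0; a sample that splits the committee scores highest.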
Of these two families, one measures uncertainty from the current state of a single model; the other measures it across several different models. Both families are relatively general and efficient in active learning, and both are valuable for our business.
III. Method based on business scenario design
Although we have accumulated a large amount of labeled data in business scenarios, simply enlarging the training set does not improve model performance. Improving the model also faces harsh constraints. For example, when the model is large, the training set is large, and training is time-consuming, the mining efficiency of the active learning algorithm becomes demanding and the training procedure must be adapted. If millions of training samples must be mined, the way a sample's training value is evaluated must also change, otherwise the model easily overfits the selected subset. Below we detail our active learning algorithm designed for these business scenarios.
1. Mining efficiency
Existing active learning methods mine either a single sample or a single batch per cycle. Let's estimate how many train-and-select cycles it would take to work through 5 million samples this way.
Forget it, young man. Resources are limited; we could never afford that many cycles.
Business scenarios require our active learning algorithm to mine tens of thousands or hundreds of thousands of samples at a time, which differs from active learning in mainstream papers. We also already have a large amount of labeled data, so the business scenario needs an algorithm that can:
• Make full use of the information in existing labeled data.
• Mine tens of thousands or hundreds of thousands of samples at a time and add them to the training set, minimizing the number of training iteration cycles.
Therefore, methods like QUIRE that do not support batch sample selection can be ruled out directly. For the methods that do support batch selection, we adjusted the data volume per batch to improve mining efficiency.
2. Data balance
So can we improve performance simply by mining valuable samples in batches and adding them to the training set?
Too young, too simple! (= =)|||
The class balance of the training samples has a large impact on model performance. Many active learning methods, such as QBC and the entropy method, ignore data balance and simply pick the samples they consider most valuable. Our experience: first control class balance, then assess sample value with the active learning method, and mine valuable training samples from each class in a reasonable proportion.
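This per-class quota idea can be sketched as follows (a minimal illustration; the per-class quotas and the use of predicted labels to bucket the unlabeled pool are assumptions, not the team's exact procedure):

```python
import numpy as np

def balanced_select(scores, pred_labels, quota_per_class):
    """Select the highest-value samples per predicted class so the mined
    batch keeps a controlled class ratio.
    scores: training-value score per pool sample (higher = more valuable).
    pred_labels: model-predicted class per pool sample.
    quota_per_class: dict {class: number of samples to mine}."""
    picked = []
    for cls, quota in quota_per_class.items():
        idx = np.flatnonzero(pred_labels == cls)       # samples of this class
        order = idx[np.argsort(-scores[idx])]          # most valuable first
        picked.extend(order[:quota].tolist())          # take up to the quota
    return np.array(picked)
```

Without the quota, a plain top-k over `scores` could flood the batch with one class and skew the retrained model.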
3. Dirty-data elimination
Then can we improve performance by batch-mining valuable samples, adding them to the training set, and controlling data balance?
Too young, too simple, sometimes naive! orz
Ambiguous samples are more valuable for training and are more likely to be mined. However, in the problems we face, this batch of data is often mislabeled.
We tried different dirty-data elimination methods, including open-source algorithms and algorithms we designed for the task. Our summarized experience:
• Dirty-data elimination is a must; it has a large impact on model performance.
• Dirty data cannot be completely eliminated, because complete elimination is too expensive.
• Even without noise-learning techniques, a well-performing model can still be obtained by using data selection to keep the dirty-data proportion below a certain threshold.
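One simple filter in this spirit, sketched under assumptions (the confidence threshold and the "model-vs-label conflict" criterion are illustrative choices, not the team's exact method): discard mined samples whose human label conflicts with a confident model prediction, which caps the dirty-data proportion without aiming for complete elimination:

```python
import numpy as np

def filter_suspect_labels(probs, given_labels, conf_threshold=0.95):
    """Flag likely-mislabeled (dirty) samples.
    probs: predicted class probabilities, shape (n_samples, n_classes).
    given_labels: human-annotated class per sample.
    A sample is suspect when the model confidently predicts a DIFFERENT
    class than the human label. Returns a boolean keep-mask."""
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    suspect = (pred != given_labels) & (conf >= conf_threshold)
    return ~suspect
```

Raising `conf_threshold` keeps more data but tolerates more noise; lowering it removes more dirty data at the cost of also dropping genuinely hard samples.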
4. Sample difficulty
Then can we improve performance by batch-mining valuable samples, controlling data balance, and eliminating dirty data?
Maybe. May God bless you!
If we mine twice the training data and directly double the training set, the resulting models are almost always biased (at least in our scenarios). Deciding a large batch of training data in one step, based on a single state of the model, easily introduces bias, and the decision boundary found by the trained model is not optimal.
To solve this, we work on the composition of the training set: the selected data should not consist entirely of hard samples near the current model's decision boundary, but should include a certain proportion of easy samples. We achieve this by using a weaker prediction model to select data in the active learning algorithm.
5. The HW active learning method
Our business scenarios require an algorithm with the following characteristics:
• Mine data in bulk.
• Control data balance.
• Be able to eliminate dirty data.
• Keep sample difficulty appropriate by including some samples away from the decision boundary.
• Make use of the information in existing labeled data.
Based on the least-confidence method, we designed the HW active learning algorithm to fit our risk-control business. For each sample x in the candidate set, its training value is computed as follows:
1. Small-dataset validation experiment
We built a training set of 300,000 samples from business data to quickly validate active learning algorithms, select the most suitable method, and then apply it on a larger dataset to improve the business model's performance.
We use each active learning method to select 100,000 samples in one pass, add them to the training set, and train a model. The performance comparison is below; we use the TPR at FPR = 1% as the metric:
The ROC curves are as follows:
The HW and QBC methods perform well across business scenarios. Since QBC requires training multiple models, HW is more efficient to run, especially as the data volume grows.
2. Business practice
As mentioned above, we applied the proposed HW active learning method to the porn-detection (content moderation) service model and obtained the following results. We use the TPR at FPR = 1% as the metric:
Active learning improves model performance by selecting the most valuable training samples to add to the training set. Starting from this idea, we designed the HW active learning method, which can mine hundreds of thousands of samples in one pass and has improved the performance of our business models.
VI. Problems and Prospects
There are still many open issues in this practice, for example:
• Are our method and the "dirty-data proportion" experience universal? Can applying noise-learning methods further improve the service model's performance?
• Besides mining enough training data in one step, can we trade some selection efficiency for further performance gains through multiple rounds of mining and training?
Knowledge Base Team