Reborn Generation Model
Rearrangement in commodity sorting
The purpose of product sorting is to make efficient products have better display opportunities and match the needs of users. A mainstream idea is that a product can be good or bad for a user's request. From the point of view of the display location, the more advanced the products are, the better the exposure opportunities will be. Therefore, the idea of grading goods through the model naturally emerges, and it seems that as long as the model scores the goods accurately and the AUC is high enough, the goods can be sorted from high to low according to this score! In this classic framework, there is a serious defect that most rearrangements are expected to solve: the context of the product will have an impact on the results. Considering the model before the rearrangement, it is very difficult to feed the context information into the model because the set of candidate products is too large. Therefore, it is a common idea to use rearrangement to solve the context impact. We can see that almost all rearrangement models will consider the context when grading goods. However, in fact, it is one task to score the products to determine whether they will be transformed, and another task to rank the candidate products to maximize the income. Although the two tasks are related, they are quite different. If we just sort by score, the original order will be disrupted, and we can't restore the new order label from the original order label, so we can't determine how the new order behaves. Unless we believe that the transformation behavior of goods will not change because of the context, this is obviously a problem.
Regardless of the context, is the model accurate enough to grade the goods?
We designed three models for experiments to show the impact of the degree of context information description of the model on the scoring results. The first model is the most commonly used pointwise scoring model at present. The input of the model is the product characteristics and user information. We call it Simple DNN for short. The second is the online rearrangement model AE rerank, which performs well in the rearrangement stage. It adds the global statistical information of the context goods on the basis of Simple DNN, such as the variance, mean and extreme value of some characteristic points. The third model is designed to further describe the context information. On the basis of AE rerank, CNN and RNN modules are added to capture the information of commodity sequence. The model structure is shown in the following figure: (Simple DNN does not include blue CNN, RNN and green global statistical feature input, and AE rerank does not include blue CNN and RNN).
We used 54 million daily data as the training set, and trained 5 epochs of the above three models respectively (all converged). Then we calculated their performance results in the test set (about 2.5 million data), as shown in the following table. The AEFS here is the alias name of the total score of fine rehearsal generated in the fine rehearsal stage. The data in the table answers our question. The pointwise model, Simple DNN, scores the goods slightly less than the AE rerank with context information and the new model. From this, we can conclude that the product scoring should be affected by the context in which it is located. Only by integrating the context information can we estimate more accurately.
Regardless of the change of order, is it reliable to rank according to the original order?
This formula is applicable to the new model that fully considers the context. However, for the first two models, Simple DNN and AE rerank, even if the products are randomly mixed, their scoring or estimated probability for each product is constant, so it must be inaccurate to add them directly as the scoring of the product series. In general, we can naturally avoid the above situation by taking the estimated probability of each product multiplied by the sum of the location discount as the ranking result in a way similar to DCG, as shown below:
We have tested the effectiveness of this method to judge whether PV is good or bad for Simple DNN and AE rerank. The experiment is divided into two parts: click conversion and transaction conversion. Under ideal circumstances, a good model should be able to more accurately determine which of the two rankings is more likely to generate click behavior, or which is more likely to generate transaction behavior. After excluding the sort pairs with the same label, the results are shown in the following table:
Performance of each evaluation model (overall judgment ability of sequence)
From the above test data, we can see that all models can roughly distinguish the good from the bad. From the definition of, it is really the best sequence under evaluation to sort directly according to the score of goods from high to low. However, is the ranking result really effective? Is the conclusion consistent from the perspective of the new model with the strongest evaluation ability? Therefore, we send these newly generated "best" sequences and the original sequence of data to the new model for evaluation, as shown in the following table. Each value indicates how much of the newly generated sequence is better than the original sequence.
According to the new model, the strategy of sorting directly according to the high and low score can only be better than the original sequence by about 66% at the highest, and the strategy of sorting directly according to the total score of fine sorting AEFS is much worse than the original sequence (this is because most of the original sequences in offline data are generated by the AE rerank model). Based on this result, we believe that the direct ranking of goods according to the grade of goods is probably not the best method.
So, how can we find the best sequence if we can't directly sort by scoring? The most direct way is to enumerate all possible sequence results, let the new model score them one by one, and then select the best sorting as the final result. However, because the online computing power is far from enough for us to enumerate by force, we need to optimize the set of enumerations. Heuristic greedy search beam search can be used as a solution to cut out most useless states during the exploration process. This method needs a tradeoff between accuracy and speed. Generally, it is difficult to achieve satisfactory speed under the condition of ensuring certain accuracy. In our work, we hope to generate sequences in a more straightforward way.
Product sequence generator
The commodity sequence generator is a model for generating sequences. Its input is a candidate commodity and its output is a product with sequence. It is a natural idea to use the point network. It needs to step by step. In each step, select a product according to the current status, and finally get the order of products. We designed the network structure based on the point network, and the final design model is as follows:
The model designed in this way has two advantages. The first is that it does not need to specify the value of sum. The parameters of the model are independent of the value of sum. When the model is trained or predicted, the value of sum can be set as needed. The second point is that after our transformation, the original pointnetwork structure with two LSTMs has become the current one, and the rest are replaced by DNN, which is not only simple to implement, but also reduces the prediction time by half (the online CPU environment runs LSTM slowly).
With the generation model, we hope that it can generate a sort that can make the new model score very well, so that it can do better than the traditional method with a large probability. The traditional method can supervise the learning through the original order label, but because changing the order may change the label, it is not perfect to complete this task through the supervised learning method. An alternative method is that we can sample some candidate sequences in a random way, and then select the best sequence as the target of supervised learning by scoring the new model; You can also take the best sequence as the result after generating some alternative sequences. In our work, we use reinforcement learning to link the two links of generation and evaluation.
Strengthen the way of learning
Because the evaluator is predicting the expected number of transactions in the sequence, if we regard this result as a reward in reinforcement learning, it seems that the generator and evaluator are naturally connected! In the process of generator training, each time you select a product, you can get the reward signal sent by the evaluator. Finally, you only need to maximize the total reward of the whole track.
However, there is still a problem here. It is calculated by using the context information. However, we do not know how to sort the following products when selecting products in each step. Therefore, we need to simplify and modify it so that it only depends on the previously selected products and has nothing to do with the sorting results. In other words, it is assumed that the user browses the goods from top to bottom, and his purchase probability of the current goods only depends on the previously browsed product sequence and has nothing to do with the arrangement of the following products. The specific modification of the evaluator is to remove the CNN module.
After the evaluator model is simplified, several key elements for reinforcement learning are already available. The commodity generator is an agent to be trained. It will make the actions it thinks are good according to the current state, and then receive the reward from the evaluator. Then it will move to the next state and continue to do the actions to get the reward. In this way, N triples can be obtained. The specific definition is as follows:
Product sequence and remaining candidate product set that have been arranged before the status of each step
The remaining product in the action candidate set of each step is the discrete action of which product to select
On the premise of considering the above, the length of the probability track predicted by the evaluator is, the attenuation coefficient
We mainly consider using the method of policybased class to optimize the above generator model. The reason for not using the valuebased method is that we found in the experiment that it may be difficult to accurately estimate the V value or Q value under this problem.
Please note that reinforcement learning should maximize the total return, and also maximize the total return of each step. The total return of each step is the sum of rewards from the current product to the Nth product. It is called Q value, and is defined as follows:
So let's start with the most basic algorithm, the reinforce algorithm, and the loss function is:
The results of this relatively simple model experiment are not ideal. It is easy to find one of them: if some highquality candidates have high total returns no matter how they act, and there are also some inferior candidate sets with low total returns no matter how they act, it is difficult for the model to find useful information for optimization. Therefore, we can consider modifying the reward, such as subtracting the mean value of the candidate set, or even subtracting the change amount of the mean value divided by the mean value (change rate), to achieve the purpose of judging whether the action is good or bad.
One way to calculate the mean value is to directly use the score of the original sequence in the evaluator. Here we will explain our training data. For example, in the experiment, we mainly focus on 30 rows of 17, so we will retrieve the first 30 items that are actually displayed on each query online as candidates for our training data, and the generator will select 17 items from these 30 items. We can know that the first 17 of the 30 candidate products are the real online display sequences. Therefore, the original sequence can get the score i in the evaluator, which is the same as the above dimensions and represents the total return or Q value of each step. Therefore, the loss function of our above reinforce algorithm can be transformed into:
Readers who know more about reinforcement learning may have an idea that the mean value can be estimated by the actorcritic method. This is the second method to estimate the mean. In this paper, we use a powerful PPO algorithm, so that we can use critical to estimate the value and calculate the value of the advantage function to update the strategy. However, there is a big problem hidden here: how to design the network to estimate the value?
The value is the average number of total rewards that can be obtained by using the current strategy from the current state. The current status is the currently scheduled goods and the remaining candidate goods. We need to predict the accurate value while maintaining the excellent properties of the previous model (i.e. the number of candidate goods set is variable). Therefore, we adopt a design similar to the previous model structure, that is, score the remaining candidate products, and then add up the scores. The model structure is as follows:
The effect of the algorithm and the information_ 3. The algorithm is not much different. We believe that the poor performance of the ppo algorithm is due to the difficulty in estimating the value, so can we keep the actor update method in the ppo and change the critical module? Here we will talk about the third method of estimating the mean, Monte Carlo sampling. In order to achieve convenience and efficiency, we also made some improvements to the original Monte Carlo sampling.
For each piece of data or group of candidate commodity sets, we let the model generate multiple permutations based on its current parameters, and then the average value of the total rewards obtained by these permutations in the evaluator can be approximately considered as the current strategy
So far, we have mentioned five reinforcement learning algorithms, from reinforcement algorithm to PPO algorithm and its improved version. Their training effects are shown in the following figure. In order to increase the exploration ability of the algorithm, the five algorithms have added the same entropy reward to the loss. The figure on the left shows the entropy change of the output action probability of the model. The lower it is, the more convergent it is. The figure on the right shows how much of the permutation generated by the model in the test set is better than the total score of the original permutation in the evaluator. We call this percentage the superior percentage or the better percentage, or BP for short, and the rearrangement in Lazada is also called the replacement rate. In our view, when strengthening learning and training, we do not look at loss, but at the entropy and total return.
Performance comparison of reinforcement learning algorithm (left: entropy; right: closing BP)
As can be seen from the comparison diagram of entropy on the left, our PPO_ MC algorithm is the fastest and best convergence; As can be seen from the BP comparison chart on the right, it is also our PPO_ MC model can achieve the best results. The original PPO model is not only difficult to converge on the entropy, but also the generated sequence is only better than the original sequence, rather than the information_ The algorithm converges well.
We can also look at PPO more popularly_ MC algorithm will obtain multiple sequences with different performance through sampling before training, and then update the model to be more inclined to the sequences with good performance. This algorithm actually has room for improvement. From the above training chart, we can see that the convergence of the prophase entropy is very fast. We can also imagine that the model in the prophase is very random, and the sampling sequence is diverse, and it is easy to find useful gradient update directions. However, the model has become very small or basically convergent in the later stage, and the sequence obtained by sampling is basically the same, and it is difficult to carry out useful updates. It can be said that the sampling at this time is a huge waste of computational power, and has done a lot of useless work. So must this sampling be based on its current model parameters? Not really. To illustrate, the purpose of sampling here is to calculate the mean value, which has nothing to do with whether the algorithm is onpolicy or offpolicy. The update model still follows the original algorithm logic. In principle, we need a sampling strategy with rich diversity and slightly better performance than the current strategy. Random sampling is definitely not good, because the sequence it collects may be very poor. How to get a better sampling strategy? This is also a very popular research area in the intensive learning circle, and it is left to the judges to give full play to their intelligence.
Comparison of two groups of implementation details
Data organization Due to the retention mechanism of offline training data, if the user has only browsed the first page, then we can only get 20 items from the first page. Only when the user browsed the second page, can we get 40 items from the first two pages. The distribution of these two kinds of data (only the first page and the first two pages) is significantly different. The former is that users only click or buy on the first page. It may be that the sequence has been relatively good. The latter is that users have been browsing to the second page. It may be caused by the poor sequence of top 20. So, how should we balance the quality and quantity of data?
We have made three test models. One is to train the model with data containing 17 candidate products; The other is training with data of 30 commodities; The last is to train with these two data at the same time (both of them are about 1.6 million). Finally, because our model needs to take time into account when it goes online, only 17 items on the first page are listed, so we let the three test models be tested in the four environments of 17 rows 17, 30 rows 17, 50 rows 17100 rows 17 respectively. Compare with the original sequence, and observe whether the new sequence generated by them is better from our evaluator. As shown in the following table.
It can be seen from the data in the above table that the training effect of the data set on the first page is slightly better when the size of the two training data sets is the same and the training time is the same. Of course, the difference is not very big. From the perspective of saving computing and storage resources, it is more convenient to use the data on the first page directly. And through the above table, we can find that the larger the set of candidate products, the better, as long as the time is acceptable. Considering that our model takes twice as long as the ordinary scoring and sorting, and can quickly predict small situations, but it may be unacceptable for hundreds of thousands of goods.
In the above content, our optimization goal is to maximize the expected number of purchases, and we can achieve the effect of optimizing other goals by modifying the definition of reward. For example, if a model wants to improve the overall GMV, the methods in the past can be roughly divided into the following categories:
1) The target of the model is unchanged, and the conversion rate is multiplied by the price factor, which is close to the expected GMV. This kind of method usually makes a big change to the sorting result, and has obvious price increase effect, but the online effect is relatively unstable.
2) The model uses price to weight or change the loss of samples during training. This kind of method has little impact on the sorting results. In some cases, it may increase the unit price, but it is usually not significant.
3) Take some noncontracted samples (such as additional purchase) as transaction samples for training. Take the additional purchase sample as an example. From the statistical data of AE, the average price of these commodities is several times higher than the transaction sample. This is a data enhancement method that naturally adds a certain degree of confidence and high price samples to the transaction samples. We can see that the above methods are indirectly improving GMV, and one of the strengths of the reinforcement learning model is that we can almost directly model GMV. We can modify reward to:
Where, represents the price of commodity. In the actual training, we do not use the original price of the commodity as the original price, but use it in the price quantile of the commodity candidate set. By changing the price, we can prevent the adverse reactions caused by the excessive price difference among commodities, but also lose some of the original information of commodity prices. How to better incorporate price factors into the model is also a problem that needs to be considered more carefully in the future. We call the model considering the expected number of transactions as the PAY version, the model considering the expected GMV as the GMV version, and one is the weighted combination of the two_ In the GMV version, the optimization goal becomes:
Image. png in the above formula is an adjustable coefficient. In the case of offline evaluation, the training effects of the three models are as follows
The models in the figure are all trained with the data of 30 commodity candidate sets, of which PAY_ GMV version. As expected, the PAY version has an advantage in transaction conversion, and the GMV version has an advantage in GMV_ The GMV version will get a balanced result.
Online deployment and real effect
After all, there are deviations in the offline evaluation. In order to verify that the generator model can really improve the online effect, we will train the model every day and reupload the model through the scheduled scheduling strategy. During model training, offline data of all bts experimental buckets in the last two weeks will be collected, and they will be used to train the evaluator for 5 hours, then the generator will be trained, and the trained new generator model will be released after 6 hours. The evaluator uses the incremental training mode, and the generator model will reinitialize the training every time to avoid losing the exploration ability.
In terms of online experiment, we first achieved online positive effect on National Day. At that time, the enhanced learning rearrangement model of GMV version was launched, which achieved better unit price improvement compared with the benchmark barrel. In the online experiment, we found that the GMV version of the model would bring about a loss of conversion rate. Therefore, in order to balance the uv conversion rate and uv value, PAY was launched after the National Day_ The enhanced learning rearrangement model of GMV version has improved the uv conversion rate and uv value. It is worth mentioning that this version of the model has brought about certain unit price and conversion rate replacement, lost part of the unit price, increased a certain conversion rate, and benefited the overall GMV.
Online daily performance
Before the promotion of the 11th National Congress of the People's Republic of China, users' desire to buy highpriced goods decreased sharply, and they were more inclined to wait and buy more. Therefore, we train the evaluator with the additional purchase samples and transaction samples as transaction samples, and then use this evaluator to train the generator model. Here we hope that the more products we buy, the better. And because users prefer lowprice products, we use the above PAY generator model. Five days before the big promotion, compared with the benchmark bucket, the collection and purchase rate of our model increased by about 12.4%, while in the experimental bucket without reinforcement learning, the highest purchase rate was only about 11.5%, with about 1% gap. On the first day of November 11, the bucket with reinforcement learning rearrangement model was the most prominent, with the transaction amount increased by 6.81%. On the second day of the 11th, the performance of the reinforcement learning rearrangement model decreased slightly, and the transaction amount increased by 5.65%. Compared with the realtime rearrangement bucket (only the rearrangement is different, the realtime rearrangement is the realtime version of AE rerank), the overall GMV has also increased by more than 1%.
Stage summary and future outlook
In the past five months, we have successfully built the reinforcement learning rearrangement model from scratch, and have passed the "Double 11" test, demonstrating its flexibility, feasibility and great potential. In this section, we will look forward to the future of rearrangement.
GAIL reinforcement learning rearrangement
At present, the training of our rearrangement model depends entirely on the evaluator, whose accuracy determines the quality of the generator, while our evaluator is only trained by a small amount of data (compared with the whole space). Therefore, we guess that the evaluator is relatively reliable in the distribution near the training data, but may not be very reliable in the distribution with a large difference. So, can we make the sequence generated by the generator not only good but also fall into the distribution of training data as much as possible? So, with the help of GAIL's idea, the generator should not only receive the reward from the previously trained evaluator, but also receive the reward from the discriminator trained at the same time. Here, the goal of the discriminator is to score the generated sequence as low as possible and the original sequence as high as possible. The results of training are shown in the figure below.
GAIL reinforcement learning rearrangement performance (left: entropy; middle: transaction BP; right: discriminator AUC)
The blue line is the reinforcement learning rearrangement model of the previous PAY version. It can be seen in the second figure that it is the highest among the indicators of transaction advantage. Nearly 90% of the generated sequence is rated higher from the evaluator, and the discriminator can also distinguish it from the original sequence with very high accuracy in the right figure. The orange line is the process of training the generator using only the discriminator, which is the original GAIL. The goal is to make the generated sequence similar to the original sequence. It is not difficult to find that in the second figure, the sequence generated by it is not highly rated in the evaluator. The green line is the version that takes into account both of the above, and the generated sequence should be scored higher in the evaluator as well as in the discriminator.
From the first picture above, we can see that the convergence of the two models with discriminators is very low, which is because we have added the strategy of random exploration. Specifically, in each previous step, a candidate product is selected according to the probability predicted by the strategy, and in these two models, there is a probability of 0.2 to randomly select a candidate product. The experiment found that this simple exploration strategy is conducive to the training of GAIL.
According to the online results of the three days from November 25 to November 27, GAIL rearrangement can increase the average order volume by 3.22% and the total transaction volume by 3.81% compared with the original rearrangement, which is a good improvement.
The purpose of product sorting is to make efficient products have better display opportunities and match the needs of users. A mainstream idea is that a product can be good or bad for a user's request. From the point of view of the display location, the more advanced the products are, the better the exposure opportunities will be. Therefore, the idea of grading goods through the model naturally emerges, and it seems that as long as the model scores the goods accurately and the AUC is high enough, the goods can be sorted from high to low according to this score! In this classic framework, there is a serious defect that most rearrangements are expected to solve: the context of the product will have an impact on the results. Considering the model before the rearrangement, it is very difficult to feed the context information into the model because the set of candidate products is too large. Therefore, it is a common idea to use rearrangement to solve the context impact. We can see that almost all rearrangement models will consider the context when grading goods. However, in fact, it is one task to score the products to determine whether they will be transformed, and another task to rank the candidate products to maximize the income. Although the two tasks are related, they are quite different. If we just sort by score, the original order will be disrupted, and we can't restore the new order label from the original order label, so we can't determine how the new order behaves. Unless we believe that the transformation behavior of goods will not change because of the context, this is obviously a problem.
Regardless of the context, is the model accurate enough to grade the goods?
We designed three models for experiments to show the impact of the degree of context information description of the model on the scoring results. The first model is the most commonly used pointwise scoring model at present. The input of the model is the product characteristics and user information. We call it Simple DNN for short. The second is the online rearrangement model AE rerank, which performs well in the rearrangement stage. It adds the global statistical information of the context goods on the basis of Simple DNN, such as the variance, mean and extreme value of some characteristic points. The third model is designed to further describe the context information. On the basis of AE rerank, CNN and RNN modules are added to capture the information of commodity sequence. The model structure is shown in the following figure: (Simple DNN does not include blue CNN, RNN and green global statistical feature input, and AE rerank does not include blue CNN and RNN).
We used 54 million daily data as the training set, and trained 5 epochs of the above three models respectively (all converged). Then we calculated their performance results in the test set (about 2.5 million data), as shown in the following table. The AEFS here is the alias name of the total score of fine rehearsal generated in the fine rehearsal stage. The data in the table answers our question. The pointwise model, Simple DNN, scores the goods slightly less than the AE rerank with context information and the new model. From this, we can conclude that the product scoring should be affected by the context in which it is located. Only by integrating the context information can we estimate more accurately.
Regardless of the change of order, is it reliable to rank according to the original order?
This formula is applicable to the new model that fully considers the context. However, for the first two models, Simple DNN and AE rerank, even if the products are randomly mixed, their scoring or estimated probability for each product is constant, so it must be inaccurate to add them directly as the scoring of the product series. In general, we can naturally avoid the above situation by taking the estimated probability of each product multiplied by the sum of the location discount as the ranking result in a way similar to DCG, as shown below:
We have tested the effectiveness of this method to judge whether PV is good or bad for Simple DNN and AE rerank. The experiment is divided into two parts: click conversion and transaction conversion. Under ideal circumstances, a good model should be able to more accurately determine which of the two rankings is more likely to generate click behavior, or which is more likely to generate transaction behavior. After excluding the sort pairs with the same label, the results are shown in the following table:
Performance of each evaluation model (overall judgment ability of sequence)
From the above test data, we can see that all models can roughly distinguish the good from the bad. From the definition of, it is really the best sequence under evaluation to sort directly according to the score of goods from high to low. However, is the ranking result really effective? Is the conclusion consistent from the perspective of the new model with the strongest evaluation ability? Therefore, we send these newly generated "best" sequences and the original sequence of data to the new model for evaluation, as shown in the following table. Each value indicates how much of the newly generated sequence is better than the original sequence.
According to the new model, the strategy of sorting directly according to the high and low score can only be better than the original sequence by about 66% at the highest, and the strategy of sorting directly according to the total score of fine sorting AEFS is much worse than the original sequence (this is because most of the original sequences in offline data are generated by the AE rerank model). Based on this result, we believe that the direct ranking of goods according to the grade of goods is probably not the best method.
So, how can we find the best sequence if we can't directly sort by scoring? The most direct way is to enumerate all possible sequence results, let the new model score them one by one, and then select the best sorting as the final result. However, because the online computing power is far from enough for us to enumerate by force, we need to optimize the set of enumerations. Heuristic greedy search beam search can be used as a solution to cut out most useless states during the exploration process. This method needs a tradeoff between accuracy and speed. Generally, it is difficult to achieve satisfactory speed under the condition of ensuring certain accuracy. In our work, we hope to generate sequences in a more straightforward way.
Product sequence generator
The commodity sequence generator is a model for generating sequences. Its input is a candidate commodity and its output is a product with sequence. It is a natural idea to use the point network. It needs to step by step. In each step, select a product according to the current status, and finally get the order of products. We designed the network structure based on the point network, and the final design model is as follows:
The model designed in this way has two advantages. The first is that it does not need to specify the value of sum. The parameters of the model are independent of the value of sum. When the model is trained or predicted, the value of sum can be set as needed. The second point is that after our transformation, the original pointnetwork structure with two LSTMs has become the current one, and the rest are replaced by DNN, which is not only simple to implement, but also reduces the prediction time by half (the online CPU environment runs LSTM slowly).
With the generation model, we hope that it can generate a sort that can make the new model score very well, so that it can do better than the traditional method with a large probability. The traditional method can supervise the learning through the original order label, but because changing the order may change the label, it is not perfect to complete this task through the supervised learning method. An alternative method is that we can sample some candidate sequences in a random way, and then select the best sequence as the target of supervised learning by scoring the new model; You can also take the best sequence as the result after generating some alternative sequences. In our work, we use reinforcement learning to link the two links of generation and evaluation.
Strengthen the way of learning
Because the evaluator is predicting the expected number of transactions in the sequence, if we regard this result as a reward in reinforcement learning, it seems that the generator and evaluator are naturally connected! In the process of generator training, each time you select a product, you can get the reward signal sent by the evaluator. Finally, you only need to maximize the total reward of the whole track.
However, there is still a problem here. It is calculated by using the context information. However, we do not know how to sort the following products when selecting products in each step. Therefore, we need to simplify and modify it so that it only depends on the previously selected products and has nothing to do with the sorting results. In other words, it is assumed that the user browses the goods from top to bottom, and his purchase probability of the current goods only depends on the previously browsed product sequence and has nothing to do with the arrangement of the following products. The specific modification of the evaluator is to remove the CNN module.
After the evaluator model is simplified, several key elements for reinforcement learning are already available. The commodity generator is an agent to be trained. It will make the actions it thinks are good according to the current state, and then receive the reward from the evaluator. Then it will move to the next state and continue to do the actions to get the reward. In this way, N triples can be obtained. The specific definition is as follows:
Product sequence and remaining candidate product set that have been arranged before the status of each step
The remaining product in the action candidate set of each step is the discrete action of which product to select
On the premise of considering the above, the length of the probability track predicted by the evaluator is, the attenuation coefficient
We mainly consider using the method of policybased class to optimize the above generator model. The reason for not using the valuebased method is that we found in the experiment that it may be difficult to accurately estimate the V value or Q value under this problem.
Please note that reinforcement learning should maximize the total return, and also maximize the total return of each step. The total return of each step is the sum of rewards from the current product to the Nth product. It is called Q value, and is defined as follows:
So let's start with the most basic algorithm, the reinforce algorithm, and the loss function is:
The results of this relatively simple model experiment are not ideal. It is easy to find one of them: if some highquality candidates have high total returns no matter how they act, and there are also some inferior candidate sets with low total returns no matter how they act, it is difficult for the model to find useful information for optimization. Therefore, we can consider modifying the reward, such as subtracting the mean value of the candidate set, or even subtracting the change amount of the mean value divided by the mean value (change rate), to achieve the purpose of judging whether the action is good or bad.
One way to calculate the mean value is to directly use the score of the original sequence in the evaluator. Here we will explain our training data. For example, in the experiment, we mainly focus on 30 rows of 17, so we will retrieve the first 30 items that are actually displayed on each query online as candidates for our training data, and the generator will select 17 items from these 30 items. We can know that the first 17 of the 30 candidate products are the real online display sequences. Therefore, the original sequence can get the score i in the evaluator, which is the same as the above dimensions and represents the total return or Q value of each step. Therefore, the loss function of our above reinforce algorithm can be transformed into:
Readers who know more about reinforcement learning may have an idea that the mean value can be estimated by the actorcritic method. This is the second method to estimate the mean. In this paper, we use a powerful PPO algorithm, so that we can use critical to estimate the value and calculate the value of the advantage function to update the strategy. However, there is a big problem hidden here: how to design the network to estimate the value?
The value is the average number of total rewards that can be obtained by using the current strategy from the current state. The current status is the currently scheduled goods and the remaining candidate goods. We need to predict the accurate value while maintaining the excellent properties of the previous model (i.e. the number of candidate goods set is variable). Therefore, we adopt a design similar to the previous model structure, that is, score the remaining candidate products, and then add up the scores. The model structure is as follows:
The effect of the algorithm and the information_ 3. The algorithm is not much different. We believe that the poor performance of the ppo algorithm is due to the difficulty in estimating the value, so can we keep the actor update method in the ppo and change the critical module? Here we will talk about the third method of estimating the mean, Monte Carlo sampling. In order to achieve convenience and efficiency, we also made some improvements to the original Monte Carlo sampling.
For each piece of data or group of candidate commodity sets, we let the model generate multiple permutations based on its current parameters, and then the average value of the total rewards obtained by these permutations in the evaluator can be approximately considered as the current strategy
So far, we have mentioned five reinforcement learning algorithms, from reinforcement algorithm to PPO algorithm and its improved version. Their training effects are shown in the following figure. In order to increase the exploration ability of the algorithm, the five algorithms have added the same entropy reward to the loss. The figure on the left shows the entropy change of the output action probability of the model. The lower it is, the more convergent it is. The figure on the right shows how much of the permutation generated by the model in the test set is better than the total score of the original permutation in the evaluator. We call this percentage the superior percentage or the better percentage, or BP for short, and the rearrangement in Lazada is also called the replacement rate. In our view, when strengthening learning and training, we do not look at loss, but at the entropy and total return.
Performance comparison of reinforcement learning algorithm (left: entropy; right: closing BP)
As can be seen from the comparison diagram of entropy on the left, our PPO_ MC algorithm is the fastest and best convergence; As can be seen from the BP comparison chart on the right, it is also our PPO_ MC model can achieve the best results. The original PPO model is not only difficult to converge on the entropy, but also the generated sequence is only better than the original sequence, rather than the information_ The algorithm converges well.
We can also look at PPO more popularly_ MC algorithm will obtain multiple sequences with different performance through sampling before training, and then update the model to be more inclined to the sequences with good performance. This algorithm actually has room for improvement. From the above training chart, we can see that the convergence of the prophase entropy is very fast. We can also imagine that the model in the prophase is very random, and the sampling sequence is diverse, and it is easy to find useful gradient update directions. However, the model has become very small or basically convergent in the later stage, and the sequence obtained by sampling is basically the same, and it is difficult to carry out useful updates. It can be said that the sampling at this time is a huge waste of computational power, and has done a lot of useless work. So must this sampling be based on its current model parameters? Not really. To illustrate, the purpose of sampling here is to calculate the mean value, which has nothing to do with whether the algorithm is onpolicy or offpolicy. The update model still follows the original algorithm logic. In principle, we need a sampling strategy with rich diversity and slightly better performance than the current strategy. Random sampling is definitely not good, because the sequence it collects may be very poor. How to get a better sampling strategy? This is also a very popular research area in the intensive learning circle, and it is left to the judges to give full play to their intelligence.
Comparison of two groups of implementation details
Data organization Due to the retention mechanism of offline training data, if the user has only browsed the first page, then we can only get 20 items from the first page. Only when the user browsed the second page, can we get 40 items from the first two pages. The distribution of these two kinds of data (only the first page and the first two pages) is significantly different. The former is that users only click or buy on the first page. It may be that the sequence has been relatively good. The latter is that users have been browsing to the second page. It may be caused by the poor sequence of top 20. So, how should we balance the quality and quantity of data?
We have made three test models. One is to train the model with data containing 17 candidate products; The other is training with data of 30 commodities; The last is to train with these two data at the same time (both of them are about 1.6 million). Finally, because our model needs to take time into account when it goes online, only 17 items on the first page are listed, so we let the three test models be tested in the four environments of 17 rows 17, 30 rows 17, 50 rows 17100 rows 17 respectively. Compare with the original sequence, and observe whether the new sequence generated by them is better from our evaluator. As shown in the following table.
It can be seen from the data in the above table that the training effect of the data set on the first page is slightly better when the size of the two training data sets is the same and the training time is the same. Of course, the difference is not very big. From the perspective of saving computing and storage resources, it is more convenient to use the data on the first page directly. And through the above table, we can find that the larger the set of candidate products, the better, as long as the time is acceptable. Considering that our model takes twice as long as the ordinary scoring and sorting, and can quickly predict small situations, but it may be unacceptable for hundreds of thousands of goods.
In the above content, our optimization goal is to maximize the expected number of purchases, and we can achieve the effect of optimizing other goals by modifying the definition of reward. For example, if a model wants to improve the overall GMV, the methods in the past can be roughly divided into the following categories:
1) The target of the model is unchanged, and the conversion rate is multiplied by the price factor, which is close to the expected GMV. This kind of method usually makes a big change to the sorting result, and has obvious price increase effect, but the online effect is relatively unstable.
2) The model uses price to weight or change the loss of samples during training. This kind of method has little impact on the sorting results. In some cases, it may increase the unit price, but it is usually not significant.
3) Take some noncontracted samples (such as additional purchase) as transaction samples for training. Take the additional purchase sample as an example. From the statistical data of AE, the average price of these commodities is several times higher than the transaction sample. This is a data enhancement method that naturally adds a certain degree of confidence and high price samples to the transaction samples. We can see that the above methods are indirectly improving GMV, and one of the strengths of the reinforcement learning model is that we can almost directly model GMV. We can modify reward to:
Where, represents the price of commodity. In the actual training, we do not use the original price of the commodity as the original price, but use it in the price quantile of the commodity candidate set. By changing the price, we can prevent the adverse reactions caused by the excessive price difference among commodities, but also lose some of the original information of commodity prices. How to better incorporate price factors into the model is also a problem that needs to be considered more carefully in the future. We call the model considering the expected number of transactions as the PAY version, the model considering the expected GMV as the GMV version, and one is the weighted combination of the two_ In the GMV version, the optimization goal becomes:
Image. png in the above formula is an adjustable coefficient. In the case of offline evaluation, the training effects of the three models are as follows
The models in the figure are all trained with the data of 30 commodity candidate sets, of which PAY_ GMV version. As expected, the PAY version has an advantage in transaction conversion, and the GMV version has an advantage in GMV_ The GMV version will get a balanced result.
Online deployment and real effect
After all, there are deviations in the offline evaluation. In order to verify that the generator model can really improve the online effect, we will train the model every day and reupload the model through the scheduled scheduling strategy. During model training, offline data of all bts experimental buckets in the last two weeks will be collected, and they will be used to train the evaluator for 5 hours, then the generator will be trained, and the trained new generator model will be released after 6 hours. The evaluator uses the incremental training mode, and the generator model will reinitialize the training every time to avoid losing the exploration ability.
In terms of online experiment, we first achieved online positive effect on National Day. At that time, the enhanced learning rearrangement model of GMV version was launched, which achieved better unit price improvement compared with the benchmark barrel. In the online experiment, we found that the GMV version of the model would bring about a loss of conversion rate. Therefore, in order to balance the uv conversion rate and uv value, PAY was launched after the National Day_ The enhanced learning rearrangement model of GMV version has improved the uv conversion rate and uv value. It is worth mentioning that this version of the model has brought about certain unit price and conversion rate replacement, lost part of the unit price, increased a certain conversion rate, and benefited the overall GMV.
Online daily performance
Before the promotion of the 11th National Congress of the People's Republic of China, users' desire to buy highpriced goods decreased sharply, and they were more inclined to wait and buy more. Therefore, we train the evaluator with the additional purchase samples and transaction samples as transaction samples, and then use this evaluator to train the generator model. Here we hope that the more products we buy, the better. And because users prefer lowprice products, we use the above PAY generator model. Five days before the big promotion, compared with the benchmark bucket, the collection and purchase rate of our model increased by about 12.4%, while in the experimental bucket without reinforcement learning, the highest purchase rate was only about 11.5%, with about 1% gap. On the first day of November 11, the bucket with reinforcement learning rearrangement model was the most prominent, with the transaction amount increased by 6.81%. On the second day of the 11th, the performance of the reinforcement learning rearrangement model decreased slightly, and the transaction amount increased by 5.65%. Compared with the realtime rearrangement bucket (only the rearrangement is different, the realtime rearrangement is the realtime version of AE rerank), the overall GMV has also increased by more than 1%.
Stage summary and future outlook
In the past five months, we have successfully built the reinforcement learning rearrangement model from scratch, and have passed the "Double 11" test, demonstrating its flexibility, feasibility and great potential. In this section, we will look forward to the future of rearrangement.
GAIL reinforcement learning rearrangement
At present, the training of our rearrangement model depends entirely on the evaluator, whose accuracy determines the quality of the generator, while our evaluator is only trained by a small amount of data (compared with the whole space). Therefore, we guess that the evaluator is relatively reliable in the distribution near the training data, but may not be very reliable in the distribution with a large difference. So, can we make the sequence generated by the generator not only good but also fall into the distribution of training data as much as possible? So, with the help of GAIL's idea, the generator should not only receive the reward from the previously trained evaluator, but also receive the reward from the discriminator trained at the same time. Here, the goal of the discriminator is to score the generated sequence as low as possible and the original sequence as high as possible. The results of training are shown in the figure below.
GAIL reinforcement learning rearrangement performance (left: entropy; middle: transaction BP; right: discriminator AUC)
The blue line is the reinforcement learning rearrangement model of the previous PAY version. It can be seen in the second figure that it is the highest among the indicators of transaction advantage. Nearly 90% of the generated sequence is rated higher from the evaluator, and the discriminator can also distinguish it from the original sequence with very high accuracy in the right figure. The orange line is the process of training the generator using only the discriminator, which is the original GAIL. The goal is to make the generated sequence similar to the original sequence. It is not difficult to find that in the second figure, the sequence generated by it is not highly rated in the evaluator. The green line is the version that takes into account both of the above, and the generated sequence should be scored higher in the evaluator as well as in the discriminator.
From the first picture above, we can see that the convergence of the two models with discriminators is very low, which is because we have added the strategy of random exploration. Specifically, in each previous step, a candidate product is selected according to the probability predicted by the strategy, and in these two models, there is a probability of 0.2 to randomly select a candidate product. The experiment found that this simple exploration strategy is conducive to the training of GAIL.
According to the online results of the three days from November 25 to November 27, GAIL rearrangement can increase the average order volume by 3.22% and the total transaction volume by 3.81% compared with the original rearrangement, which is a good improvement.
Related Articles

A detailed explanation of Hadoop core architecture HDFS
Knowledge Base Team

What Does IOT Mean
Knowledge Base Team

6 Optional Technologies for Data Storage
Knowledge Base Team

What Is Blockchain Technology
Knowledge Base Team
Explore More Special Offers

Short Message Service(SMS) & Mail Service
50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00