How Taobao embraces the era of short video

1 Introduction

1.1 The rapid development of short video business

Short videos generally refer to video content less than 5 minutes long. Driven by the fragmentation of users' time, the rapid spread of the mobile Internet and smartphones, and a low production threshold, short videos have attracted widespread attention in recent years. From their inception in 2011, through the sudden emergence of Kuaishou and Meipai in 2015, to the rapid growth of Douyin and Volcano Video in 2016, short videos have been capturing more and more user attention and traffic and reshaping users' information consumption habits.

So far, short video has reached about 400 million monthly active users, and average daily viewing time per user has exceeded 60 minutes. Short video covers a wide range of user types and is highly sticky, and it is visibly squeezing the time users spend on social networking, audio-visual, gaming, news, and other applications.

1.2 Current Situation of Taobao’s Short Videos

At present, Taobao hosts more than 260 million videos. A large number of Taobao/Tmall product header images have changed from static images to short videos that show product usage and details from every angle; many buyer shows now take the form of video, giving users a more reliable and intuitive purchase reference; top anchors in live video have a striking ability to attract fans and sell goods; and expert videos are exquisitely produced and varied in form, giving users a more pleasant browsing experience. The rapid growth in video volume and user demand poses greater challenges for video recommendation algorithms.

This article briefly describes some of our video recommendation practices over the past six months in two scenarios: Wow Video and the homepage Guess You Like video feed.

Wow Video is a video product aimed at fashionable young users with upper-middle purchasing power, covering clothing, beauty, food, cute babies, cute pets, digital, fitness, and other fields. The content pool consists of hundreds of thousands of talent videos and hundreds of thousands of product videos, all of which have passed quality screening and manual review, with varied styles and exquisite production.

Guess You Like video is an important part of the cloud theme. In the Guess You Like waterfall feed, videos are delivered at a certain frequency according to user preferences. The content pool consists mainly of millions of product videos, which are also screened by quality rules, and its main function is to guide users toward placing orders.

In the waterfall feed of both scenarios, clicking a video jumps to the full-screen takeover page, where the user can browse related products, enter the talent's page, like, comment, forward, and scroll down to continue browsing.

1.3 Video Recommendation Algorithm Framework

The general framework of the video recommendation algorithm is basically similar to that of product recommendation, and consists of the following parts:

Recall: obtain a candidate set of thousands of videos based on the user's recent behavior.
Ranking: generally divided into coarse ranking and fine ranking. Coarse ranking uses a lightweight model to give the candidate set an initial score and truncate it to a few hundred. Fine ranking uses a more complex model, possibly a fusion of several models with different objectives, to score the candidates more precisely and truncate the set to about 100.
Business strategy: de-duplication and dispersal strategies that protect the user experience.

This article first introduces the video feature system and then walks through video recommendation following this framework, focusing on where video recommendation differs from product recommendation.
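The three-stage framework can be illustrated with a minimal Python sketch. Everything here is hypothetical and only for illustration: the `recommend` function, the `Video` record shape, the truncation sizes (300 and 100 stand in for "several hundred" and "about 100"), and the business strategy, which is reduced to de-duplication by video id.

```python
from typing import Callable, Dict, List

Video = Dict[str, object]  # hypothetical record, e.g. {"id": 7, "pop": 0.3}


def recommend(
    candidates: List[Video],
    coarse_score: Callable[[Video], float],
    fine_score: Callable[[Video], float],
    coarse_keep: int = 300,
    fine_keep: int = 100,
) -> List[Video]:
    """Recall output -> coarse ranking -> fine ranking -> business strategy."""
    # Coarse ranking: cheap model, truncate thousands to a few hundred.
    coarse = sorted(candidates, key=coarse_score, reverse=True)[:coarse_keep]
    # Fine ranking: more expensive model, truncate to about 100.
    fine = sorted(coarse, key=fine_score, reverse=True)[:fine_keep]
    # Business strategy: in this sketch, only de-duplication by video id.
    seen = set()
    result = []
    for video in fine:
        if video["id"] not in seen:
            seen.add(video["id"])
            result.append(video)
    return result
```

In a real system the two scoring callables would be the coarse and fine ranking models described later, and the last step would include the dispersal and filtering strategies of section 5.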

2. Video feature system
The video feature system we built mainly consists of ID features, product generalization features, video statistical features, video content features, and tag features.

ID features: video ID, author ID, etc.
Product generalization features: derived mainly from the attributes of the products attached to the video, for example product ID, category ID, virtual category ID, store ID, brand ID, product gender, product purchasing power, and product tags.
Video statistical features: the playback rate, average playback duration, and effective playback rate of videos, computed over different dimensions such as scene, category, and author.
Video content features: key-frame image features and audio features, which describe video content and style at a finer granularity.
Video tag features: a scalable, mesh-structured tag architecture built on multi-class classification models. It covers tags for the somatosensory categories a video belongs to and tags along pan-content dimensions, and the combination of these tags gives a condensed description of the video content. So far the tag system has mainly been produced for the clothing domain and is being continuously improved.
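As an illustration of the video statistical features, the sketch below aggregates play logs along one dimension (for example author or category) into an average playback duration and an effective playback rate. The log fields, the function name `playback_stats`, and the 5-second threshold for an "effective" play are assumptions; the article does not define them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def playback_stats(
    plays: List[Dict[str, object]],
    dimension: str,
    effective_s: float = 5.0,  # assumed threshold for an "effective" play
) -> Dict[str, Tuple[float, float]]:
    """Return {dimension value: (avg play seconds, effective play rate)}."""
    totals = defaultdict(lambda: [0, 0.0, 0])  # plays, seconds, effective plays
    for play in plays:
        bucket = totals[play[dimension]]
        bucket[0] += 1
        bucket[1] += play["seconds"]
        bucket[2] += 1 if play["seconds"] >= effective_s else 0
    return {
        key: (seconds / count, effective / count)
        for key, (count, seconds, effective) in totals.items()
    }
```

Calling the same function with `dimension="author"`, `dimension="category"`, or a scene field would give the per-dimension statistics mentioned above.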

3. Video-based recall

3.1 Rank I2V Recall

A characteristic of Taobao video is that most videos are attached to related products. The original recall for video recommendation therefore reused the i2i recall from product recommendation and extended it to i2i2v recall: take the set of products the user has recently clicked, collected, or purchased as the trigger item set, find the candidate item set similar to it, then find the candidate video set through the video-product relationship, recalling the more popular videos first. Similarly, c2i can be extended to c2i2v, seller2i to seller2i2v, and so on. The advantage of this version is that it makes full use of the existing framework and data and can go online quickly. The disadvantages are also obvious: the same product may be associated with many videos, and the value of each of these videos to a specific user cannot be evaluated; interest in a product does not imply interest in its related videos, and the gap between the two is large; and user behavior on videos, such as viewing time, likes, and follows, is not used at all, even though this information is the most valuable.
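The i2i2v expansion can be sketched as two dictionary lookups. This is a minimal illustration, not the production recall: the function name, the in-memory `similar_items` and `item_videos` tables, and the use of a single popularity number per video are all assumptions.

```python
from typing import Dict, List, Tuple


def i2i2v_recall(
    trigger_items: List[str],
    similar_items: Dict[str, List[str]],             # i2i table
    item_videos: Dict[str, List[Tuple[str, float]]],  # item -> (video, popularity)
    limit: int = 10,
) -> List[str]:
    """Expand trigger items to similar items, then to their attached videos."""
    best_popularity: Dict[str, float] = {}
    for trigger in trigger_items:
        for item in similar_items.get(trigger, []):
            for video_id, popularity in item_videos.get(item, []):
                previous = best_popularity.get(video_id, float("-inf"))
                best_popularity[video_id] = max(previous, popularity)
    # Recall the more popular videos first.
    ranked = sorted(best_popularity, key=lambda v: best_popularity[v], reverse=True)
    return ranked[:limit]
```

Note that the ordering depends only on video popularity, not on the user, which is exactly the weakness described above.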

Therefore, we implemented the RankI2V recall method, which recalls videos directly from items. The main idea is to construct samples from the full-screen playback log, use dwell time as the label, combine the features of the trigger item, the video's related products, and the video itself, and use a GBDT model to score the item-video relationship directly.

1) Construct samples:

On top of the aforementioned i2i2v recall, the trigger item, the video watched, and the viewing time can be obtained from the log each time a user watches a video.
After cleaning out abnormal data, treat plays with a short viewing time as negative samples (equivalent to not watching) and plays with a long viewing time as positive samples, with the actual playing time as the positive sample weight.
Resample to ensure that, for the same user and the same trigger item, the sample set contains both positive and negative samples.

2) Construct the feature set, which has three main parts:

Features of the trigger item: natural attributes such as category, price, popularity, and dynamic rating, as well as its exposure and click metrics over different time slices across the whole site and the exposure and click metrics of its seller, category, and other dimensions.
Features of the video: statistics computed separately for the whole site and the Wow Video scene, by category, publisher, and different time dimensions, such as exposures, clicks, effective playback rate, completion rate, and average playback time per play and per user.
Similarity between the item and the video: for example, whether they belong to the same category, whether they belong to the same seller, and the similarity score between the trigger item and the video's related items.

3) Model:

Train a GBDT model with a pairwise loss.
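The sample construction in step 1 can be sketched as follows. The thresholds (`short_s`, `long_s`, `max_s`), the record types, and the choice to drop plays of intermediate length are assumptions made for illustration; the article does not give concrete values. The "resampling" step is approximated here by keeping only the (user, trigger item) groups that contain both labels.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class PlayLog(NamedTuple):
    user: str
    trigger_item: str
    video: str
    seconds: float  # viewing time of this play


class Sample(NamedTuple):
    user: str
    trigger_item: str
    video: str
    label: int
    weight: float


def build_samples(
    logs: List[PlayLog],
    short_s: float = 3.0,   # assumed: at or below this counts as "not watched"
    long_s: float = 10.0,   # assumed: at or above this counts as positive
    max_s: float = 600.0,   # assumed: longer plays are treated as abnormal
) -> List[Sample]:
    groups: Dict[Tuple[str, str], List[Sample]] = defaultdict(list)
    for log in logs:
        if log.seconds <= 0 or log.seconds > max_s:
            continue  # clean out abnormal data
        if log.seconds <= short_s:
            sample = Sample(log.user, log.trigger_item, log.video, 0, 1.0)
        elif log.seconds >= long_s:
            # Positive sample, weighted by the actual playing time.
            sample = Sample(log.user, log.trigger_item, log.video, 1, log.seconds)
        else:
            continue  # intermediate plays are dropped in this sketch
        groups[(log.user, log.trigger_item)].append(sample)
    # Keep only groups that contain both positive and negative samples.
    kept: List[Sample] = []
    for group in groups.values():
        if {s.label for s in group} == {0, 1}:
            kept.extend(group)
    return kept
```

Grouping by (user, trigger item) also gives the natural query groups that a pairwise ranking objective needs, since pairs are only meaningful within one group.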
In online experiments in both scenarios, RankI2V performed significantly better than the first-version product-based recall methods (i2i2v, etc.). The experimental results are as follows:

After launch on the Wow Video waterfall page, average playing time per play increased by nearly 10%, per capita playing time increased by more than 5%, pctr increased by more than 5%, and uctr increased significantly; the positive effect was clear.

After launch on the full-screen takeover page, both average playback time per play and per capita playback time increased significantly (the full-screen page scene focuses only on playback time, not on CTR).

After launch in Guess You Like video, CTR rose by nearly 10%, a clear positive effect.

3.2 Rank V2V Recall

The advantage of recalling videos from products is that users have rich and diverse behaviors on products, so a better trigger set can be obtained. However, a more natural approach is to recall videos from videos, since users' preferences for products are not necessarily consistent with their preferences for videos. Following the idea of RankI2V, we also implemented the RankV2V recall algorithm.

First, to obtain the relationship between trigger videos and recalled videos, we launched a CF v2v recall model, computed by collaborative filtering over site-wide video playback logs with effective viewing time as the target. After it had been online for a while and enough playback behavior had accumulated, we began training the RankV2V model. The main idea is to construct samples from the playback logs of the full-screen takeover page, use dwell time as the label, combine the features of the trigger video and the recalled video, and use a GBDT model to score the video-video relationship.
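A CF v2v table of the kind described above can be sketched with item-based collaborative filtering over effective plays. This is a generic textbook formulation, not necessarily the one used in production: the 5-second effective-play threshold, the function name, and the cosine-style normalization are assumptions.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def cf_v2v(
    plays: List[Tuple[str, str, float]],  # (user, video, seconds watched)
    effective_s: float = 5.0,             # assumed effective-play threshold
) -> Dict[str, List[Tuple[str, float]]]:
    """Item-based CF over effective plays with cosine-style normalization."""
    watched: Dict[str, Set[str]] = defaultdict(set)
    for user, video, seconds in plays:
        if seconds >= effective_s:
            watched[user].add(video)
    count: Dict[str, int] = defaultdict(int)
    co_count: Dict[Tuple[str, str], int] = defaultdict(int)
    for videos in watched.values():
        for video in videos:
            count[video] += 1
        for a, b in combinations(sorted(videos), 2):
            co_count[(a, b)] += 1
    similar: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for (a, b), n in co_count.items():
        score = n / math.sqrt(count[a] * count[b])
        similar[a].append((b, score))
        similar[b].append((a, score))
    # Most similar first; ties broken by video id for determinism.
    return {
        video: sorted(neighbors, key=lambda pair: (-pair[1], pair[0]))
        for video, neighbors in similar.items()
    }
```

The resulting table plays the same role for RankV2V that i2i2v plays for RankI2V: it produces the (trigger video, watched video) pairs from which training samples are later built.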
1) Construct samples:

On top of the aforementioned CF v2v recall, the trigger video, the watched video, and the viewing time can be obtained from the log each time a user watches a video.
After cleaning out abnormal data, treat plays with a short viewing time as negative samples (equivalent to not watching) and plays with a long viewing time as positive samples, with the actual playing time as the positive sample weight.
Resample to ensure that, for the same user and the same trigger video, the sample set contains both positive and negative samples.

2) Construct the feature set, which has three main parts:

Features of the trigger video: statistics computed separately for the whole site and the Wow Video scene, by category, publisher, and different time dimensions, such as exposures, clicks, effective playback rate, completion rate, and average playback time per play and per user.
Features of the recalled video: similar to those of the trigger video.
Similarity between the trigger video and the recalled video: for example, whether they belong to the same category and whether they belong to the same seller.

3) Model:

Train a GBDT model with a pairwise loss.
In the online experiment, RankV2V returned fewer recall results than RankI2V (users have far fewer behaviors on videos than on products, so the trigger set is much smaller), but its metrics were clearly better than RankI2V's. As more videos are surfaced across the site and users' video behaviors grow richer, RankV2V's performance can be expected to improve accordingly.

After launch in Guess You Like video, CTR rose by nearly 15%. The click-through rate of the RankV2V channel alone was nearly 50% higher than that of the benchmark bucket, a significant positive effect.

3.3 Realtime-based Interest

Compared with product recommendation, real-time responsiveness matters more for video recommendation. This is especially true in a scene like the full-screen takeover page, where the user is immersed in one video at a time: once two or three videos in a row fail to capture the user's interest, the user is easily lost. To address this, we introduced the following four real-time triggers on the full-screen takeover page:

In the waterfall feed, the user enters the full-screen page by clicking a video, which strongly reflects the user's browsing intent at that moment, so this video is added to the trigger set.
The video clicked in the waterfall feed also has attached product information. This product is very likely one the user is interested in at that moment, so the item is added to the trigger set.
When the user scrolls down on the full-screen page to load the next page, they have already played, liked, commented on, or followed videos recommended on the previous page, which reflects their current browsing interest and their expectations for the next page. Videos on which the user showed positive behavior are added to the trigger set. This is the method with the strongest real-time signal.
Users' site-wide video playback behavior is collected from real-time logs (with a delay ranging from tens of seconds to a few minutes), and videos with positive behavior are added to the trigger set.

After experiments, the final recall priority was set as follows: the four Realtime-based Interest recall methods have equal priority, which is higher than that of the other recall methods.
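One way to express this priority rule when merging the results of several recall channels is sketched below. The round-robin interleaving used to treat the four real-time channels as equals is an assumption, as are the function name and the `limit` parameter; the article states only the priority order.

```python
from typing import Iterable, List


def merge_recall(
    realtime_channels: Iterable[List[str]],
    other_channels: Iterable[List[str]],
    limit: int = 20,
) -> List[str]:
    """Real-time channels share the top tier; all other channels come after."""
    merged: List[str] = []
    seen = set()

    def take_round_robin(channels: Iterable[List[str]]) -> None:
        queues = [list(channel) for channel in channels]
        position = 0
        while any(position < len(queue) for queue in queues):
            for queue in queues:
                if position < len(queue) and queue[position] not in seen:
                    seen.add(queue[position])
                    merged.append(queue[position])
            position += 1

    take_round_robin(realtime_channels)  # equal priority among themselves
    take_round_robin(other_channels)     # lower priority, fills what is left
    return merged[:limit]
```

With this layout, a video recalled by any real-time channel always precedes videos that come only from RankI2V, RankV2V, or other non-real-time channels.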

After launch on the full-screen takeover page, average viewing time per view increased by more than 10% and per capita viewing time by nearly 15%, a significant positive effect.

After launch on the Wow Video waterfall feed, pctr increased significantly, uctr remained flat, and both average dwell time per visit and per capita dwell time increased significantly, a significant positive effect.

From the experimental results, the gain on the immersive full-screen page is clearly larger than on the waterfall page. The waterfall shows more videos at a time, so users have more room to choose, while the full-screen page shows one video at a time, which places a higher demand on matching the user's interest at every moment. In addition, among the four real-time recall methods, the third one above performs better than the others, indicating that the stronger the timeliness, the better the recall efficiency, which is consistent with our initial hypothesis.

4. Video sorting

4.1 Coarse ranking model

After multi-channel recall, the system has a set of videos the user is potentially interested in. To further improve accuracy, the videos from all recall channels need to be ranked together so that the recalled set can be trimmed. This step runs after BE multi-channel recall, using a model with a reduced feature set.

The video coarse ranking model is trained and used for prediction with either CTR or dwell time as the target, using GBDT in a pointwise or pairwise manner.

The GBDT model with CTR as the main optimization target uses whether the user clicks the video as the label and is trained pointwise. The features mainly cover context features, user features, video attribute features, and video feedback statistical features. Among the video feedback statistical features, click-through-rate features computed with time information carry higher weight.
The coarse ranking model targeting dwell time takes discretized dwell time as the training target, is trained pairwise, and adds more dwell-time statistical features.
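The discretized dwell-time target and the pairwise training data can be sketched as follows. The bucket boundaries and the function names are assumptions; the article does not say how dwell time is discretized or how pairs are formed, so this shows only one common construction.

```python
from bisect import bisect_right
from itertools import combinations
from typing import List, Sequence, Tuple

# Assumed bucket boundaries in seconds; the article does not give them.
BOUNDARIES: Sequence[float] = (3.0, 10.0, 30.0, 60.0)


def dwell_label(seconds: float) -> int:
    """Map raw dwell time to a discrete grade from 0 to len(BOUNDARIES)."""
    return bisect_right(BOUNDARIES, seconds)


def preference_pairs(session: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Return (preferred, other) video pairs within one session."""
    pairs: List[Tuple[str, str]] = []
    for (video_a, sec_a), (video_b, sec_b) in combinations(session, 2):
        grade_a, grade_b = dwell_label(sec_a), dwell_label(sec_b)
        if grade_a > grade_b:
            pairs.append((video_a, video_b))
        elif grade_b > grade_a:
            pairs.append((video_b, video_a))
        # Equal grades carry no preference and produce no pair.
    return pairs
```

Discretizing first means that small differences in dwell time (45 versus 50 seconds, say) do not create a training pair, which keeps noisy comparisons out of the pairwise loss.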
Experimental results

Guess You Like video: replacing the LR CTR model with the GBDT CTR model increased CTR by more than 10%.
Guess You Like immersive page: after the dwell-time GBDT model was connected, playback duration per PV increased by nearly 30% and playback duration per UV by more than 30%.
Full-screen video page: after the dwell-time GBDT model was connected, average playing time per play increased by nearly 10%, and average playing time per user improved significantly.

4.2 Fine ranking model

Fine ranking scores videos with models trained on different objectives. Compared with coarse ranking, fine ranking uses richer and more complete features, and it combines the same model type trained on different targets, or different model types trained on the same target, to compute the score jointly.

The video fine ranking stage mainly includes three models: XFTRL and GBDT targeting CTR, and GBDT targeting dwell time. The final score combines the predictions of the three models.
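The article does not say how the three predictions are combined. As one simple possibility, the sketch below uses a weighted average; the function name, the weight keys, and the assumption that each prediction has already been scaled to the range 0 to 1 are all hypothetical.

```python
from typing import Dict


def fuse_scores(
    xftrl_ctr: float,
    gbdt_ctr: float,
    gbdt_dwell: float,
    weights: Dict[str, float],
) -> float:
    """Weighted average of three model outputs, each assumed scaled to [0, 1]."""
    total = sum(weights.values())
    return (
        weights["xftrl_ctr"] * xftrl_ctr
        + weights["gbdt_ctr"] * gbdt_ctr
        + weights["gbdt_dwell"] * gbdt_dwell
    ) / total
```

Shifting weight toward the dwell-time model would favor videos that hold attention, while shifting it toward the CTR models would favor videos that get clicked; the right balance depends on whether the scene cares about CTR or playback time.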

For the XFTRL model targeting CTR, the features mainly include ID features and discrete attribute features from the context, user, product, and video dimensions, plus cross features that combine these dimensions. Among them, the user ID, the video ID, and the cross feature formed by combining the user's top 3 favorite category IDs over the past x days with the category ID of the product attached to the video work well in the model.
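The ID and cross features named above can be sketched as sparse string features of the kind usually fed to an FTRL-style linear model. The feature naming scheme and the function name are made up for illustration.

```python
from typing import List


def cross_features(
    user_id: str,
    video_id: str,
    user_top_categories: List[str],  # favorite categories, most preferred first
    video_item_category: str,        # category of the product attached to the video
) -> List[str]:
    """Build ID features plus user-top3-category x video-category crosses."""
    features = [f"user={user_id}", f"video={video_id}"]
    for rank, category in enumerate(user_top_categories[:3], start=1):
        features.append(
            f"user_top{rank}_cat={category}&video_cat={video_item_category}"
        )
    return features
```

Each resulting string would typically be hashed into a sparse index, so the model can learn a separate weight for, say, "the user's second-favorite category matches the video's category".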

In the GBDT model targeting CTR, the features extend those of coarse ranking (context features, user features, video attribute features, and video feedback statistical features) with statistics of videos over the last x days across all scenes, broken down by author and category, and with additional video feedback statistics. Within the feature set, the video's in-scene CTR features carry relatively high weight, which shows that the model tends to recommend videos that are popular in the scene.

Compared with the CTR model, the GBDT model targeting dwell time adds dwell-time statistics of the video over the last x days, as well as author- and category-level statistics. The label is the discretized dwell time. Within the feature set, the video's playback popularity in the scene and the average playback popularity of its author carry higher weight in the model.
Experimental results

Guess You Like video: after the XFTRL CTR model was optimized, CTR improved significantly; after RankI2V was connected, the increase was close to 10%, for a total increase of close to 15%.
Guess You Like video: after the CTR GBDT model was optimized, CTR improved significantly.
Guess You Like video: after the dwell-time GBDT model was connected, dwell time increased by more than 10%.

5. Business Strategy
While the recommendation algorithms optimize click-through rate and dwell time, they also need to ensure a good user experience and controllable traffic.

5.1 Experience Optimization

To achieve a good user experience, video recommendation has been optimized in many ways. Three dimensions, concentration, similarity, and video discoverability, were added as observation metrics for experience optimization. Concentration and similarity mainly measure the in-page experience, and video discoverability measures the inter-page experience.

In-page reordering: each request reorders the page by virtual category, leaf category, and video tag so that videos under the same virtual category, leaf category, or tag do not appear consecutively on the page. After in-page reordering was connected, PV dropped slightly, click-through rate increased, the video concentration and similarity metrics improved significantly, and the diversity perceived by users improved significantly.
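A greedy version of this in-page rule can be sketched as follows. The field names and the greedy strategy (always take the highest-ranked remaining video that does not share a key with the previous one) are assumptions; the article states only the constraint.

```python
from typing import Dict, List, Sequence


def reorder_page(
    ranked: List[Dict[str, str]],
    keys: Sequence[str] = ("virtual_category", "leaf_category", "tag"),
) -> List[Dict[str, str]]:
    """Greedily reorder so neighbors differ on every key when possible."""
    remaining = list(ranked)
    page: List[Dict[str, str]] = []
    while remaining:
        pick = 0  # fall back to the best remaining video if all conflict
        if page:
            last = page[-1]
            for index, video in enumerate(remaining):
                if all(video[key] != last[key] for key in keys):
                    pick = index
                    break
        page.append(remaining.pop(pick))
    return page
```

Because a lower-ranked video can be promoted ahead of a higher-ranked one, a small cost in ranking quality is traded for diversity, which matches the trade-off in the experiment above.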

Inter-page suppression: real-time video exposures and clicks are collected, and virtual categories, leaf categories, and tags that have not been clicked in the last x days are filtered to an appropriate degree. After inter-page suppression was connected, PV and CTR stayed basically unchanged, exposure discoverability improved, and repeated exposure of similar videos over multiple days was reduced.

Real-time blacklist filtering: for videos that draw poor public feedback or that we found poor in our own experience, the authors and titles are added to a blacklist and filtered online in real time. After this was connected, the experience improved and related negative feedback decreased.

Purchase filtering: filter out videos in the same category as products the user has already purchased. After purchase filtering was connected, PV and CTR dropped slightly, and the diversity of the user experience improved further.

Exposure filtering: because the collection of real exposures may be delayed by a few minutes, a real-time database also records pseudo-exposure ("fake exposure") logs, and videos are filtered using both real and pseudo exposures. After this was connected, CTR dropped slightly, and consecutive repeated exposure was clearly eliminated.
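The combination of real and pseudo exposures can be sketched with two per-user sets. This is an in-memory stand-in for the real-time database, and it assumes that a pseudo exposure is written as soon as a page is returned to the client; the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Set


class ExposureFilter:
    """Per-user filter over real and pseudo exposures (in-memory stand-in)."""

    def __init__(self) -> None:
        self.real: DefaultDict[str, Set[str]] = defaultdict(set)
        self.pseudo: DefaultDict[str, Set[str]] = defaultdict(set)

    def record_pseudo(self, user: str, video_ids: Iterable[str]) -> None:
        # Assumed: written immediately when a page is returned to the client.
        self.pseudo[user].update(video_ids)

    def record_real(self, user: str, video_ids: Iterable[str]) -> None:
        # Real exposures arrive later from the delayed exposure log.
        self.real[user].update(video_ids)

    def filter(self, user: str, candidates: List[str]) -> List[str]:
        seen = self.real[user] | self.pseudo[user]
        return [video for video in candidates if video not in seen]
```

The pseudo set covers the window of a few minutes before real exposures are collected, so a video served on one page is not served again on the next request.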

6. Conclusion

Compared with product recommendation, video recommendation algorithms still have a lot of room to develop, for example the multi-tag classification currently being tried, multimodal video embeddings built from image and audio features, and the modeling of RealTime Exploration Interest. Interested readers are welcome to get in touch with us and offer corrections.
