Application of Object Detection Algorithms in Video


Business background

Advertisements on traditional video websites are mainly patch advertisements (pre-roll, mid-roll, post-roll, etc.). At present, on most video websites, users other than members cannot skip these ads. As user expectations for the viewing experience rise, the intrusive experience of patch ads may drive users away. At the same time, more and more users choose to become paid members of the website to enjoy a higher-quality viewing experience, and the growth in membership inevitably reduces the exposure capacity of traditional patch advertising.

Content advertisements do not take up the user's viewing time and are visible to members as well. In terms of advertising quality, combining the advertisement with the video content and scene gives users a different advertising experience. Therefore, content advertising will become the mainstream form of video advertising in the future. The picture below shows a scene advertisement for "Coca-Cola"; the scenes targeted by the advertiser are "family reunion" and "dinner".

Content Identification Status

At present, there are two main sources of scene advertising points: subtitle keywords and scene recognition. Subtitle keywords are obtained by understanding the subtitles through OCR block recognition plus NLP; advertisers purchase relevant keywords according to the advertisement content, such as "mobile phone", "calling", "eating", etc. Scene recognition identifies the video scene from the content of the video clip (such as dinners, weddings, etc.).

During the delivery of content advertisements, we found that subtitle-based points had the following two problems:

1. The advertisement content does not match the meaning of the subtitle. For example, the keyword "Coke" purchased by Coca-Cola may appear in dialogue that has nothing to do with the beverage, so the advertisement and the content are irrelevant to each other.

2. The segment where the subtitle keyword appears may contain no relevant visual content. For example, the word "mobile phone" in the subtitles does not guarantee that a mobile phone actually appears on screen, and a mobile phone advertisement placed there feels abrupt.

Another source of content points is recognition of the video scene. Although scene recognition can solve the two problems of subtitle keywords above, scene-recognition modeling is more complicated: the definition of a scene is subjective, the cost of manual labeling is high, and collecting training samples is difficult. Moreover, because it is hard to predict which industries future advertisers will come from and which scenes they will need, new scenes cannot be supported quickly. Therefore, the scalability of scene recognition is poor, and it struggles to meet the needs of flexible, fast-changing advertising formats.

Problems Solved by Video Object Detection

In content granularity, video objects sit between subtitles and scenes. In some video scenes, the associated subtitle information may never appear, so as a content descriptor, objects are stronger than subtitles. At the same time, objects are the basic elements of a scene, so video objects can describe the video scene to a certain extent. In general, video object detection addresses the following problems:

It solves the subtitle problems of image-text mismatch and poor content continuity. Object detection is based on what actually appears in the video, which guarantees content accuracy. At the same time, because object content is more coherent than subtitles, the probability of abrupt content changes between frames is small, which avoids the problem of a subtitle keyword flashing by. This makes it well suited to content advertising.

It addresses the limitations of scene recognition. Compared with scene recognition, object detection modeling is simpler and the cost of labeling training samples is low, so object categories can be expanded quickly. At the same time, object content is not limited to one scene. For example, "wine glass" can be associated with dining and drinking scenes to place advertisements for alcohol and beverages, and the same object can be associated with multiple scenes; "pet dog" can carry not only dog food advertisements but also advertisements for allergy medicine, which scene recognition cannot do. The picture below shows an advertisement for "Fushuliang" medicine: the footage has no well-defined scene, and the point can only be produced by relying on object detection of pets such as "cats" and "dogs".

It enables rapid expansion of content points. Technically, current detection algorithms are relatively mature, model iteration is efficient, thousands of object categories are supported, and video content points can be produced in batches in a short time.

Introduction to Detection Technology

Object detection has long been a classic problem in computer vision. Before the rise of deep learning, the representative method was Ross Girshick's DPM [1], which pushed traditional features to a certain level in the detection field. Since 2012, deep convolutional networks have become popular, and deep learning techniques have been applied to object detection as well. Over the following 6 years, deep convolutional detection models emerged in an endless stream, mainly built on two ideas: the two-stage detection framework represented by Fast/Faster R-CNN, and the one-stage detection framework represented by YOLO and SSD. The two frameworks have their own strengths: two-stage methods excel in accuracy, one-stage methods in speed. Later came innovations and optimizations such as top-down architectures, the feature pyramid network, and RetinaNet. The remaining challenges are small objects, occluded objects, and building networks that are simultaneously fast, accurate, and small. The development history of object detection technology is as follows:

Algorithm selection

Based on the above business background and the state of the art in object detection, we needed to choose a suitable model from the many available algorithms. Considering business needs and machine resources, we paid most attention to two points: (a) higher accuracy; (b) faster speed. Comparing the various detection methods and trading accuracy off against speed, we finally selected YOLOv3 [2]. Its published benchmark data are as follows:

Model overview and optimization


The YOLO series has 3 versions so far, and the overall object detection idea has remained basically unchanged; the main upgrades are in the backbone, the detection head, and the loss function. The figure below shows the YOLOv3 architecture with input size 416 and 274 predicted categories:

The backbone is Darknet-53, composed of an initial ordinary convolution followed by 23 residual units.

The detection head is composed of 3 branches that predict on feature maps of different resolutions; the highest-resolution map is responsible for small objects, the lowest-resolution map for large objects, and the middle map for medium objects. It is worth noting that high-level features are upsampled and fused back into the lower-level feature maps to strengthen their semantic information.

Each branch's output feature map has dimension [3×(4+1+274)]×N×N, and the classification scores are not shared across the three anchors.
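As a sanity check, the three output shapes for input size 416 can be computed directly (a sketch; the strides 32/16/8 are the standard YOLOv3 values, and `yolo_head_shapes` is an illustrative helper, not code from this system):

```python
def yolo_head_shapes(input_size=416, num_classes=274, num_anchors=3):
    """Per-branch output shape: [num_anchors*(4+1+num_classes)] x N x N."""
    channels = num_anchors * (4 + 1 + num_classes)
    # The three detection branches predict at strides 32, 16, and 8,
    # so N = input_size / stride for each branch.
    return [(channels, input_size // s, input_size // s) for s in (32, 16, 8)]

print(yolo_head_shapes())  # [(837, 13, 13), (837, 26, 26), (837, 52, 52)]
```

With 274 classes, each branch therefore emits 3 × 279 = 837 channels per grid cell.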

The yolov3 loss function is as follows:
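The equation image is not reproduced in this copy. As a reference, the commonly cited YOLO-style loss with YOLOv3's per-class logistic (BCE) classification term can be written as follows; this is a reconstruction matching the three parts described in the text, not the exact notation of the original figure:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}
  \left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\,\mathrm{BCE}\!\left(C_i,\hat{C}_i\right)
 +\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\,\mathrm{BCE}\!\left(C_i,\hat{C}_i\right) \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\,\in\,\mathrm{classes}}\mathrm{BCE}\!\left(p_i(c),\hat{p}_i(c)\right)
\end{aligned}
```

Here $S^2$ is the number of grid cells, $B$ the anchors per cell, $\hat{C}_i$ the predicted objectness, and $\hat{p}_i(c)$ the per-class logistic output.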

The first part is the prediction loss for the bounding box coordinates and width/height; the second part is the prediction loss for objectness (whether an object is present); the third part is the classification loss when the bounding box contains an object. It is worth noting that the third part classifies each bounding box with independent logistic classifiers, which supports multi-label classification.

Algorithm optimization

The training set of our initial model comes from Open Images, with 600 original categories. On this basis, we added part of the ImageNet dataset. For practical application, we selected 274 object categories with commercial value and retrained the model. In the new model, we also made the following optimizations:

■ Data optimization
The main problem in the data is class imbalance. The figure below shows the data distribution over the 274 categories and the top 30 categories; the x-axis is the category and the y-axis is the number of bounding boxes.

To address the data imbalance, we took the following two measures:

Resampling: some categories in the training set have only 1/10,000 as many samples as the common categories. We resampled these rare categories and applied data augmentations such as rotation, Gaussian noise, and image stitching.

We also tried focal loss [3] to optimize the loss function, but the results were poor: training converged fully, yet test performance did not improve. This may be related to the logistic classifiers used in the YOLOv3 loss, or to insufficient tuning of the focal loss hyperparameters. The loss functions used in the experiments are as follows:
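The loss images are not reproduced here; as a reference, a minimal sketch of the binary focal loss from [3], FL(p_t) = −α_t (1 − p_t)^γ log(p_t), with the paper's default hyperparameters (the function name is illustrative):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted probability of the positive class; y: label in {0, 1}.
    With gamma = 0 and alpha = 0.5 this reduces to half the cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# The (1 - p_t)**gamma factor sharply down-weights easy, well-classified
# examples relative to hard ones:
easy = focal_loss(0.9, 1)   # confident correct prediction -> tiny loss
hard = focal_loss(0.1, 1)   # confident wrong prediction -> large loss
```

The intent is to keep the huge number of easy background boxes from dominating the gradient, which is the imbalance problem described above.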

In addition, we optimized several data and training details:

Multi-scale training: during training, an input_size is randomly selected from [320, 416, 608], which effectively improves the robustness of the model.

K-means clustering of candidate boxes: we clustered the box distribution of the commercially valuable objects screened from the training set to obtain 9 initial anchor boxes, which aids model convergence and speeds up training. The distance formula is as follows:
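The distance-formula image is missing from this copy; YOLO's anchor clustering conventionally uses d(box, centroid) = 1 − IoU(box, centroid) so that cluster quality is independent of box size. A minimal sketch (helper names are illustrative):

```python
def iou_wh(box, centroid):
    """IoU of two boxes given as (w, h) pairs, both aligned at the origin."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def anchor_distance(box, centroid):
    # YOLO's k-means distance: d(box, centroid) = 1 - IoU(box, centroid)
    return 1.0 - iou_wh(box, centroid)

assert anchor_distance((10, 10), (10, 10)) == 0.0   # identical boxes
assert anchor_distance((10, 10), (20, 20)) == 0.75  # IoU = 100/400
```

Standard k-means is then run with this distance over all labeled (w, h) pairs, and the 9 resulting centroids become the initial anchors.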

Label smoothing: label smoothing was proposed in Inception-v2 [4], mainly to prevent the over-fitting caused by over-confident training targets. Its expression is as follows:
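The expression image is missing here; with a uniform prior over K classes, the standard formulation is y′ = (1 − ε)·y + ε/K. A minimal sketch (the function name is illustrative):

```python
def smooth_labels(one_hot, eps=0.1):
    """Label smoothing with a uniform prior: y' = (1 - eps) * y + eps / K."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]

smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0])
# The true class keeps most of the mass (0.925 here) while every other
# class receives eps / K = 0.025, and the target remains a distribution.
```

The softened targets stop the model from pushing logits toward infinity to match a hard 0/1 target.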

■ Model optimization
Model optimization generally targets the backbone and the loss function, depending on the optimization goal. In our experiments we tried the following methods:

Considering that non-rigid bodies such as people deform frequently in video, we used the deformable convolution [5, 6] proposed by Jifeng Dai et al. to enhance the modeling capacity of ordinary convolution. As shown in the figure below, deformable convolution adds a horizontal and a vertical offset to each sampling position of an ordinary convolution to represent the deformation.

Since the business wants to detect as many objects as possible and has a slightly lower requirement for localization accuracy, we adjusted the weights of the localization and classification terms in the loss function so that the model pays more attention to the accuracy of object categories.

■ Post-processing optimization
By measuring the overlap of same-category labeled boxes in the test set, we found many box pairs with IoU > 0.5. We therefore use soft-NMS [7] for post-processing, which suppresses overlapping boxes in a smoother way using either a linear or a Gaussian kernel. The formula is as follows:
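The formula image is missing here; a minimal sketch of soft-NMS with both decay kernels from [7] (the thresholds and sigma are illustrative defaults, not values from this system):

```python
import math

def soft_nms(boxes, scores, iou_thresh=0.5, sigma=0.5, method="gaussian",
             score_thresh=0.001):
    """Soft-NMS sketch: decay overlapping scores instead of discarding boxes.

    boxes: list of (x1, y1, x2, y2); scores: parallel list of floats.
    Returns (kept_boxes, kept_scores) ordered by decayed score.
    """
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    items = sorted(zip(boxes, scores), key=lambda t: -t[1])
    kept = []
    while items:
        best, s = items.pop(0)
        kept.append((best, s))
        decayed = []
        for b, sb in items:
            o = iou(best, b)
            if method == "linear":
                sb = sb * (1.0 - o) if o > iou_thresh else sb
            else:  # Gaussian kernel: s <- s * exp(-iou^2 / sigma)
                sb = sb * math.exp(-(o * o) / sigma)
            if sb > score_thresh:       # drop only near-zero scores
                decayed.append((b, sb))
        items = sorted(decayed, key=lambda t: -t[1])
    return [b for b, _ in kept], [s for _, s in kept]
```

Unlike hard NMS, heavily overlapping boxes survive with reduced scores, which preserves genuinely overlapping same-category objects such as the IoU > 0.5 pairs observed in our test set.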

We wrap the trained model behind an interface, package it with Docker, and publish it to the online scheduling platform. At present, our video content recognition platform can process about 600 feature-length videos per day on a single machine, and the cluster can handle 10,000+ per day. The detection results are exported to an offline database and, after screening and classification, imported into the online point-management platform; once approved there, the points enter the delivery system. The whole process is shown in the figure below:

Algorithm optimization results
We compared the mAP-50 metric before and after optimization on the 274-class and 93-class object models:

The figure below shows our optimization path on the 93-class model, which finally reached an mAP of 0.596.

Advertising business landing
In terms of video categories, we ran detection on feature content in the three main channels of movies, TV dramas, and variety shows, filtering out genres unsuitable for content advertising such as costume dramas, thrillers, fairy tales, palace dramas, and animation. So far, our model has processed 50,000+ videos and produced 90 million+ frame images. On the business side, we classified the 274 selected commercially valuable object categories; in practice, we matched each category to the scenes it can be associated with, according to brand content attributes and advertising ideas. The distribution of detection results per category is as follows:

We compared the delivery effects of subtitle points and object detection points on mobile-phone advertising orders. The actual display effect is as follows:

To compare delivery data, we measured the click-through-rate difference between object detection points and subtitle keyword points on videos where the "mobile phone" point had been detected. Even with the exposure of object detection points at 3.5 times that of subtitle keywords, the click-through rate still increased by 20%, a large improvement in both quality and quantity.
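Since clicks = exposure × CTR, the two lifts compound. A back-of-envelope computation (assuming the reported ratios hold independently):

```python
exposure_ratio = 3.5  # object-detection points vs. subtitle keyword points
ctr_ratio = 1.20      # click-through rate lifted by 20%

# Total clicks scale with the product of exposure and click-through rate.
clicks_ratio = exposure_ratio * ctr_ratio
print(clicks_ratio)   # 4.2 -> roughly 4.2x the clicks of subtitle points
```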

Summary and Outlook


Over the past 6 years, with the rapid development of deep learning, object detection technology has become very mature. This work applies object detection to video content advertising and has achieved good results. On the algorithm side, we chose YOLOv3 and, combined with our business background, modeled objects of high commercial value in a targeted manner, greatly improving both detection speed and accuracy.

In actual business scenarios, the object detection results have added a large number of placement points to the existing scene advertising business. While ensuring the relevance of the advertisement to the video content, they also bring huge improvements in exposure and click-through rate.


At present, we use static image detection. Next, we will optimize the frame-sampling strategy and use LSTM and optical flow to fuse video context information [8, 9, 10], which should both speed up detection and mitigate the difficulties that motion blur and small object areas cause in single-frame detection, improving accuracy and speed. In the future, object detection will also be applied to other scenarios, such as detecting same-style clothing in videos.
