How to Cut Arbitrary Objects Out of a Video

Video Object Segmentation (VOS for short), as the name implies, segments the complete region of an object of interest from every frame of a video. To make this concrete, here is one of our own video object segmentation results:

Example of video object segmentation, the segmented human body regions are highlighted in red

The result of video object segmentation is important raw material for secondary content creation. For example, the currently popular "naked-eye 3D video" uses changes in skin occlusion to produce 3D effects based on the distance between the main object in the video and the audience. The core step is segmenting the foreground object from the video, which takes more than 99% of the creator's time.

Therefore, for video websites such as Youku, video object segmentation is a very valuable algorithm that can empower content producers and improve content production efficiency. In particular, an interactive video object segmentation algorithm can use a small amount of user interaction to progressively improve segmentation accuracy and the user's experience, which is beyond the reach of any unsupervised video object segmentation algorithm.

At present, computer vision research in academia on video object segmentation falls into three main directions:

Semi-supervised video object segmentation
Interactive video object segmentation
Unsupervised video object segmentation

These three research directions correspond to the three tracks of the DAVIS Challenge on Video Object Segmentation 2019 [1]. Of the three, the academic community leans toward semi-supervised video object segmentation, because it is the most basic form of the problem and a relatively pure research topic. Below, I first introduce the three research directions and then share the latest applications in the video field based on the work of Ali Entertainment Moku Lab.

1. Semi-supervised Video Object Segmentation

Semi-supervised video object segmentation is also known as one-shot video object segmentation (OSVOS for short). Given the segmented region of the user's object of interest on the first frame of the video, the algorithm estimates the object's region on all subsequent frames. There can be one or more objects. Over the course of a video, the object and background move, illumination changes, the object rotates, and occlusion occurs. The research focus of semi-supervised video object segmentation is therefore how the algorithm can adapt to the object's changing appearance. An example is shown in the image below:

In Figure 1, the first row shows the RGB images of the sequence, and the second row shows the region of the object of interest. (a) is the first frame of the video, where the camel region is the given ground truth of the object. (b), (c), and (d) are the 20th, 40th, and 60th frames. For these subsequent frames only the RGB images are available, and the algorithm must estimate the object's region. The difficulties in this example are:

The foreground and background colors are very similar;
As the target camel moves, a second camel appears in the background, and the two camel regions must be kept separate.

At present, semi-supervised video object segmentation algorithms fall into two categories: those with online learning and those without.

Algorithms based on online learning use a one-shot learning strategy to fine-tune the segmentation model on the ground truth of the object in the first frame. Classic online learning algorithms include Lucid Data Dreaming [2], OSVOS [3], and PReMVOS [4]. Because the model is trained individually for each object, these algorithms can reach high segmentation accuracy. However, online learning means fine-tuning a deep learning model, which takes a lot of computing time. Before 2019, online learning algorithms were the mainstream. This year, many algorithms without online learning have appeared. Their models are pre-trained and do not need to be fine-tuned per sample, so they run faster. Examples include FEELVOS [5] from CVPR 2019 and the Space-Time Memory Network [6].

The most important evaluation metrics for semi-supervised video object segmentation are the mean Jaccard index (J) and the F-measure (F). The mean Jaccard index is the average, over all objects and all frames, of the Jaccard index (intersection over union) between the predicted region and the ground-truth region. The F-measure evaluates the accuracy of the boundary of the segmented region. Semi-supervised video object segmentation cannot be applied directly in practice because it requires the ground truth of the object region in the first frame, but it is a core component of interactive and unsupervised video object segmentation algorithms.
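To make the two metrics concrete, here is a minimal NumPy sketch. The function names are illustrative, not taken from any benchmark toolkit. The boundary F-measure shown is a deliberate simplification: it requires boundary pixels to coincide exactly, whereas benchmark implementations typically allow a small spatial tolerance.

```python
import numpy as np


def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; treating this as a perfect match is an assumption.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def mean_jaccard(pred_masks, gt_masks):
    """Average J over every (object, frame) pair.

    Both inputs have shape (num_objects, num_frames, H, W).
    """
    scores = [
        jaccard(p, g)
        for pred_obj, gt_obj in zip(pred_masks, gt_masks)
        for p, g in zip(pred_obj, gt_obj)
    ]
    return float(np.mean(scores))


def boundary(mask):
    """Mask pixels that touch the background in the 4-neighbourhood."""
    m = np.pad(np.asarray(mask, dtype=bool), 1)
    center = m[1:-1, 1:-1]
    interior = center & m[:-2, 1:-1] & m[2:, 1:-1] & m[1:-1, :-2] & m[1:-1, 2:]
    return center & ~interior


def boundary_f_measure(pred, gt):
    """Simplified F: harmonic mean of boundary precision and recall (zero tolerance)."""
    pb, gb = boundary(pred), boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    hits = np.logical_and(pb, gb).sum()
    if hits == 0:
        return 0.0
    precision = hits / pb.sum()
    recall = hits / gb.sum()
    return float(2 * precision * recall / (precision + recall))


# Toy example: a 2x2 ground-truth square and a prediction one column too wide.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True

j = jaccard(pred, gt)                                   # 4 / 6
f = boundary_f_measure(pred, gt)                        # precision 4/6, recall 1
mean_j = mean_jaccard([[pred, gt]], [[gt, gt]])         # one object, two frames
```

In this sketch J penalizes the extra column through the union, while F penalizes it through boundary precision only, which is why the two metrics are usually reported together.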

2. Interactive Video Object Segmentation

Interactive video object segmentation is a more practical approach that has emerged since last year. Here the input is not the ground truth of the object in the first frame, but user interaction information about the object on any frame of the video. The interaction can be a bounding box around the object, a scribble over the object's region, extreme points on its outer edge, and so on.

Interactive video object segmentation usually consists of the following five steps:

1) The user enters interaction information that marks the object of interest, such as a bounding box, scribbles, or edge points;
2) Based on that input, an interactive image object segmentation algorithm segments the object's region on that frame;
3) Starting from that frame's object region, a semi-supervised video object segmentation algorithm propagates the segmentation frame by frame to the other frames of the video, producing the object's region on every frame. The user then checks the results and gives new interaction information on poorly segmented frames;
4) The algorithm revises the segmentation result on that frame according to the new interaction information;
5) Steps 3 and 4 are repeated until the video object segmentation results satisfy the user.
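
The steps above can be sketched as a simple control loop. This is only an outline under stated assumptions: `get_interaction`, `segment_frame`, and `propagate` are hypothetical placeholders for the user interface, the interactive image segmentation model, and the semi-supervised propagation model, and the toy stand-ins below exist only so the loop runs end to end. The default cap of 8 rounds mirrors the interaction limit used in the DAVIS interactive track.

```python
from typing import Callable, Dict, List, Optional, Tuple

Mask = List[List[int]]            # toy stand-in for a binary mask
Interaction = Tuple[int, str]     # (frame index, hint such as "box" or "scribble")


def interactive_vos(
    num_frames: int,
    get_interaction: Callable[[Dict[int, Mask]], Optional[Interaction]],
    segment_frame: Callable[[Interaction, Optional[Mask]], Mask],
    propagate: Callable[[int, Mask, int], Dict[int, Mask]],
    max_rounds: int = 8,
) -> Dict[int, Mask]:
    """Run the interact / segment / propagate / review loop.

    All three callables are hypothetical placeholders for real components.
    """
    masks: Dict[int, Mask] = {}
    for _ in range(max_rounds):
        # Steps 1 and 3: the user marks a frame, or returns None when satisfied.
        interaction = get_interaction(masks)
        if interaction is None:
            break
        frame_idx, _hint = interaction
        # Steps 2 and 4: segment (or correct) the object on the marked frame.
        anchor = segment_frame(interaction, masks.get(frame_idx))
        # Step 3: propagate from the marked frame to the other frames.
        masks.update(propagate(frame_idx, anchor, num_frames))
    return masks


# Toy stand-ins so the sketch runs end to end.
scripted_input = iter([(0, "box"), (2, "scribble")])


def toy_get_interaction(current_masks):
    # Replays two scripted interactions, then signals that the user is satisfied.
    return next(scripted_input, None)


def toy_segment_frame(interaction, previous_mask):
    # A "scribble" correction yields a different toy mask than the initial "box".
    return [[1, 1]] if interaction[1] == "scribble" else [[1, 0]]


def toy_propagate(frame_idx, anchor_mask, num_frames):
    # Copies the anchor mask to every frame; a real model would track the object.
    return {i: anchor_mask for i in range(num_frames)}


result = interactive_vos(4, toy_get_interaction, toy_segment_frame, toy_propagate)
```

Passing the components in as callables keeps the loop independent of any particular model, which matches the point that interactive segmentation is a combination of several algorithms rather than a single one.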

Interactive video object segmentation is not a single algorithm but a solution that combines several algorithms, including interactive image object segmentation, semi-supervised video object segmentation, and interactive propagation of object regions across frames. Its main evaluation metrics are Jaccard & F-measure at 60 seconds (J&F@60s for short) and Area Under Curve (AUC for short), both proposed in the DAVIS Challenge on Video Object Segmentation. The DAVIS challenge limits each session to 8 user interactions and plots accuracy against time. The area under this curve is the AUC, and the value of the curve interpolated at t=60s is J&F@60s. The figure below is a graph of J&F versus time.
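
As a rough illustration of how these two numbers relate to the curve, the sketch below integrates a J&F-versus-time curve with the trapezoidal rule and linearly interpolates it at 60 seconds. The sample points are made up for the example, and dividing the area by the time span (so that the AUC lies between 0 and 1) is an assumption of this sketch rather than something stated above.

```python
import numpy as np


def auc_and_jf_at(times_s, jf_scores, t_eval=60.0):
    """Return (normalised AUC, J&F interpolated at t_eval) for one curve.

    times_s must be increasing; jf_scores holds the J&F value reached at each time.
    """
    t = np.asarray(times_s, dtype=float)
    y = np.asarray(jf_scores, dtype=float)
    # Trapezoidal rule: sum of segment width times mean segment height.
    area = float(np.sum((t[1:] - t[:-1]) * (y[1:] + y[:-1]) / 2.0))
    auc = area / float(t[-1] - t[0])          # normalisation is an assumption
    jf_at_t = float(np.interp(t_eval, t, y))  # linear interpolation on the curve
    return auc, jf_at_t


# Hypothetical measurements: J&F after each interaction and the elapsed time.
times = [0.0, 30.0, 90.0]
scores = [0.4, 0.6, 0.8]
auc, jf60 = auc_and_jf_at(times, scores)
```

Both numbers reward methods that reach high accuracy quickly: a slow algorithm shifts the curve to the right, which lowers the area and the value read off at 60 seconds.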

These metrics show that interactive video object segmentation emphasizes the speed of the segmentation algorithm: users cannot be kept waiting for long. Therefore, semi-supervised video object segmentation algorithms based on online learning are generally not used in interactive video object segmentation. There is currently no open source code for interactive video object segmentation. Even so, interactive video object segmentation is of great significance to industry for the following reasons:

1) Semi-supervised video object segmentation requires the ground truth of the object in the first frame, which is troublesome to obtain in practice. Interactive video object segmentation requires only simple user interaction, which is very easy to provide;

2) Interactive video object segmentation can reach very high segmentation accuracy through multiple rounds of interaction. High-precision segmentation results provide a better user experience and are what users actually need.

3. Unsupervised Video Object Segmentation

Unsupervised video object segmentation segments video objects fully automatically, with no input other than the RGB video. Its purpose is to segment the salient object regions in the video. Of the three directions above, unsupervised video object segmentation is the newest.

This year, for the first time, the DAVIS and YouTube-VOS competitions have an unsupervised track. At the algorithm level, unsupervised video object segmentation needs an additional salient object detection module, while the other core algorithms remain unchanged.

In semi-supervised and interactive video object segmentation, the objects are specified in advance without any ambiguity. In unsupervised video object segmentation, object saliency is a subjective concept, and different people may disagree about it. Therefore, in DAVIS VOS, participants are required to provide video segmentation results for a total of N objects (in DAVIS Unsupervised VOS 2019, N=20), which are then matched against the L salient object sequences annotated in the ground-truth dataset. Both matched objects and missing objects (annotated objects with no matching result) are included in the mean J&F. Excess objects among the N results are not penalized.
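
The scoring rule described above can be illustrated with a small sketch. It assumes a precomputed matrix of J&F values between each annotated object and each submitted result, picks the one-to-one matching with the highest total, counts unmatched annotated objects as 0, and ignores surplus results. The brute-force search over permutations is used only to keep the example dependency-free; it is not practical for N=20, where an efficient bipartite matching algorithm would be needed. The function name and the example numbers are hypothetical.

```python
from itertools import permutations

import numpy as np


def unsupervised_mean_jf(score_matrix):
    """Mean J&F over the L annotated objects under the best one-to-one matching.

    score_matrix[i, j] is the J&F between annotated object i (L rows, L >= 1)
    and submitted result j (N columns).
    """
    s = np.asarray(score_matrix, dtype=float)
    num_gt, num_pred = s.shape
    if num_pred >= num_gt:
        # Every annotated object gets a distinct result; surplus results are ignored.
        totals = (
            sum(s[i, p[i]] for i in range(num_gt))
            for p in permutations(range(num_pred), num_gt)
        )
    else:
        # Too few results: the annotated objects left unmatched contribute 0.
        totals = (
            sum(s[g[j], j] for j in range(num_pred))
            for g in permutations(range(num_gt), num_pred)
        )
    return float(max(totals)) / num_gt


# Two annotated objects, three submitted results: the weak middle result is unused.
with_excess = unsupervised_mean_jf([[0.9, 0.1, 0.3],
                                    [0.8, 0.2, 0.7]])

# Two annotated objects, one submitted result: the missing object scores 0.
with_missing = unsupervised_mean_jf([[0.9],
                                     [0.8]])
```

In the first case the best matching pairs the objects with the first and third results, so the unused result does not lower the score; in the second case the mean is halved because one annotated object has no result at all.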

4. Research Status of Ali Entertainment Moku Lab

At present, many semi-supervised video object segmentation algorithms are academically innovative but perform poorly in practice. We surveyed the papers at this year's CVPR: on the DAVIS 2017 validation set, no regular paper reports J&F above 0.76. Algorithms such as FEELVOS [5] and SiamMask [7] are good in theory but have many problems in practice. In addition, there is no open source code for interactive video object segmentation.

Therefore, Ali Entertainment Moku Lab has been engaged in the research of semi-supervised and interactive video object segmentation algorithms since the end of March 2019.

In May 2019, we completed a first version of our basic semi-supervised video object segmentation algorithm and interactive video object segmentation solution, entered the DAVIS Challenge on Video Object Segmentation 2019, and won fourth place in the interactive video object segmentation track.

The VOS with robust tracking strategy [8] that we proposed greatly improves the robustness of the basic algorithm. On the DAVIS 2017 validation set, the J&F@60s of our interactive video object segmentation algorithm improved from 0.353 at the end of March to 0.761 at the beginning of May. Our semi-supervised video object segmentation algorithm now also achieves J&F=0.763. Our results on this dataset are therefore close to the best level in the industry. Some examples of segmentation results are as follows:

Example of our interactive video object segmentation results

5. Follow-up Plan of Ali Entertainment Moku Lab

Currently, we are continuing to explore how the algorithms perform in complex scenarios, including small objects, highly similar foreground and background, fast-moving objects or rapid changes in appearance, and severe object occlusion. In the future, we plan to focus on strategies such as online learning, space-time networks, and region proposal and verification to improve segmentation accuracy in complex scenes.

In addition, image object segmentation algorithms and multi-target object tracking algorithms are also important foundations of video object segmentation algorithms, and we will continue to improve accuracy in these areas.
