Random talk on visual object tracking: from principle to application

I. What is visual object tracking

Definition of visual object tracking

In the field of computer vision, there is no single definition of visual object tracking (hereafter simply "tracking"). Generally speaking, the target being tracked is an arbitrary region or object in a video frame or image, with no semantic information (category, etc.) attached; this idea is vividly described as "tracking everything". There are also special cases in which objects of known types are tracked in specific scenes, such as tracking particular products (e.g., parts) on a monitored factory assembly line.

Scholars have offered different interpretations of tracking, including "tracking is the process of identifying regions of interest in video sequences" [1] and "tracking is estimating, in subsequent frames, the state (position, size, etc.) of a target given in one frame of a video" [2]. These definitions may look quite different, but they share a great deal. Extracting their common points, we define the tracking problem as:

Tracking is the process of finding, in the subsequent frames of a video, the object of interest defined in the current frame.

This definition highlights three aspects of tracking: "find", "object of interest", and "subsequent frames". Note that the current frame here can be any frame in the video; generally, tracking starts from the second frame, and the first frame is used to mark the initial position of the target (the ground truth). Next, we use the example of Bolt running the men's 100-meter dash to explain these three aspects.

"Find": how to find Bolt?

Suppose we have found Bolt's location in the previous frame of the video; what we need to do is find his location again in the current frame. As mentioned earlier, vision is the defining constraint of the problem (visual object tracking), and it brings properties we can exploit. Here we can use a de facto rule: within the same video, the size and spatial position of the same object do not change dramatically between two consecutive frames [4]. For example, we can judge that Bolt's spatial position in the current frame will most likely still be on the track, and is almost certainly not on the adjacent lawn. In other words, to locate Bolt in the current frame, we only need to generate some candidate positions on the track and search among them. This leads to an important sub-problem in tracking, candidate generation, which is usually formulated as candidate box generation.

"Objects of interest": how to shape Bolt?

Bolt is the tallest person in the image, and he is wearing a yellow-and-green uniform. However, this overlooks the fact that our "definition" of Bolt actually contains a lot of highly abstract information, such as "tallest" and "yellow-and-green uniform". In computer vision, such highly abstract information is usually called features. For a computer, without features, Bolt is no different from the lawn, the track, or any other object in the image that is meaningful to humans. Therefore, if we want the computer to track Bolt, feature representation/extraction is a crucial link, and it is the second important sub-problem in tracking.

"Subsequent frames": how to distinguish Bolt (from others)?

Here, we define the problem of "subsequent frames" as how to use information from the previous frame to distinguish the target in the current frame. We not only need to track the target in every "subsequent frame", but also emphasize how the context between consecutive frames matters for tracking. Intuitively, the answer is simple: find the object in the current frame that is most similar to the tracking result of the previous frame. This leads to the third important sub-problem in tracking: decision making. Decision making is the most important sub-problem in tracking, and the one that attracts the most attention from researchers. Generally speaking, the decision mainly solves a matching problem: match the objects in the current frame that might be the target against the tracking result of the previous frame, and select the object with the greatest similarity as the tracking result of the current frame.

Connections among the three sub-problems

In the above three sections, we introduced the three sub-problems underlying the basic principle of tracking: candidate box generation, feature representation/extraction, and decision making. Note that these sub-problems are not independent of each other. Sometimes a solution to the decision problem includes more accurate candidate box generation and/or more abstract feature extraction, solving the tracking problem end to end to improve the performance of the tracking system. This is very common among the deep-learning-based tracking algorithms popular in recent years [1].

Applications of visual object tracking

In a sense, before answering "what are the applications of visual object tracking?", we should first address the "why" question of academic research methodology, that is, "why do we do visual object tracking?".

Classic application fields of tracking in computer vision include security (vehicle tracking, license plate recognition, etc.), surveillance (face recognition, gait recognition, etc.), patrol (UAV tracking, robot navigation, etc.), emerging smart-life scenarios (human-computer interaction, VR/AR, etc.), smart city (traffic monitoring, etc.), and smart industry (telemedicine, etc.). The main uses of tracking can be summarized as follows:

Tracking is mainly used to obtain the spatial position, shape, and size of an arbitrary object across a video or a semantically associated sequence of images.

As a complement to detection algorithms, tracking can provide the spatial position of the target in a video or a semantically associated image sequence while reducing the complexity of the overall system. For example, detection may be applied only to identify the target in the first frame and to re-localize it in a few later frames, while tracking determines the target position in all remaining frames.

II. How to track visual objects

System architecture of visual object tracking

Candidate box generation, feature representation/extraction, and decision making constitute a complete logical chain. Specifically, for each frame in the video (usually excluding the first), the tracking system's flow can be represented by the architecture in Figure 3:

As shown in the figure, the tracking system takes the previous frame (including its tracking result) and the current frame as input, passes them through the motion model, the feature model, and the observation model in turn, and finally outputs the predicted target position in the current frame. Candidate box generation, feature representation/extraction, and decision making are solved in these three models respectively; the correspondence between their inputs and outputs is shown in Table 1.

Note that the tracking system architecture in Figure 3 applies a hypothesis testing model, a common method in statistical inference. Its basic principle is to make an assumption about the characteristics of a system, and then decide whether to accept or reject that assumption by studying the statistical distribution of samples. This model fits the tracking problem well: assume that a candidate box in the current frame is the predicted target, and then judge, through feature representation/extraction and decision making, whether that candidate box is a reasonable prediction of the target position in the current frame.

Motion model - where?

1) Target representation

The main problem the motion model solves is the approximate position of the target in the current frame, that is, candidate box generation ("where"). Before discussing how to generate candidate boxes, we need to be clear about what a candidate box is. A candidate box is a hypothesis about the target's bounding box. The representation here is different from the feature representation in the feature model: it focuses on how to "depict" the target in a video frame or image. Common representations are shown in Figure 4.

As shown in the figure, the target can be represented by a rectangle (4c), a skeleton (4f), or a contour (4h), among others. The rectangular box (i.e., bounding box) in 4(c) is the most widely used in computer vision research. Its advantages include easy generation (e.g., the minimum enclosing rectangle of the target), easy representation (e.g., top-left plus bottom-right corner coordinates, or center coordinates plus width and height), and easy evaluation and comparison (e.g., via IoU, intersection over union). See [5] for details.
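
As a small illustration of the two box encodings just mentioned, here is a minimal sketch converting between corner and center formats (the function names are our own, not from any library):

```python
def corners_to_center(box):
    """Convert (x1, y1, x2, y2) corner format to (cx, cy, w, h) center format."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)

def center_to_corners(box):
    """Convert (cx, cy, w, h) center format back to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```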

2) De facto rules: small size change, slow position movement

After settling on the representation of the target (the candidate box), we need to focus on how to generate candidate boxes. In many papers, the generation of positive and negative samples during deep-learning training is also called candidate box generation; that is a different concept from the candidate box generation discussed in this section. Below we explain what each of the two is and how to tell them apart to avoid confusion.

Inference process: the system flow in Figure 3 is used to predict the target position in the current frame, which every tracking algorithm requires. In this process, the motion model generates candidate boxes, features are then represented/extracted by the feature model, and the candidate boxes with their features are input to the observation model for decision making (prediction of the target position). As stated in the definition section, the de facto rule is that within the same video, the size and spatial position of the same object do not change dramatically between consecutive frames. Based on this, we can greatly reduce the number and variety of candidate boxes: we only need to generate candidate boxes of similar size near the target position predicted in the previous frame, thereby improving the efficiency of the whole tracking system.

Training process: usually required by tracking algorithms based on the discriminative method; it is the process in which the tracking system learns how to distinguish target from non-target, and it is detailed in the section on the algorithm classification of visual object tracking. In this process, the so-called candidate box generation should really be called "positive and negative sample generation". A positive sample can be roughly understood as the target, and a negative sample as a non-target distractor, such as background or another object that looks like the target but is not. To improve such a system's ability to distinguish positive from negative samples, negative samples are usually drawn from the whole image, not only from the vicinity of the target position predicted in the previous frame.

To sum up, candidate box generation is used in the inference process to generate potential target positions in the current frame, while positive and negative sample generation is used in the training process of discriminative tracking algorithms to produce the samples that teach the system to distinguish targets from non-targets.

3) Architecture and classification of the motion model

Figure 5 shows the system architecture of the motion model and a classification of methods for obtaining candidate boxes. As shown in the figure, the position of the predicted target in the previous frame (the nth frame) is input into the model, and the candidate boxes of the current frame (the (n+1)th frame) are output. These candidate boxes may differ in position, scale, and rotation, as shown by the green and orange dotted boxes in the figure.

In the motion model, the main candidate box generation methods are as follows:

a) Probabilistic sampling

Here, the sampled parameters include the candidate box's translation, scale change, rotation, aspect-ratio change, and other information; an example based on affine transformations is shown in Figure 5. "Probabilistic" is reflected in the fact that these parameters are random variables following some probability distribution (usually Gaussian), while "sampling" is reflected in drawing some number of candidate boxes.
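
A minimal sketch of this idea, restricted to Gaussian perturbations of translation and scale only (the function name, parameter names, and values are illustrative assumptions, not taken from the original figure):

```python
import numpy as np

def sample_candidates(prev_box, n=100, sigma_xy=10.0, sigma_s=0.05, rng=None):
    """Sample candidate boxes around the previous prediction.

    prev_box: (cx, cy, w, h). Translation and log-scale perturbations are
    drawn from Gaussian distributions, following the "small change" rule.
    """
    rng = np.random.default_rng() if rng is None else rng
    cx, cy, w, h = prev_box
    dx = rng.normal(0.0, sigma_xy, n)          # horizontal translation
    dy = rng.normal(0.0, sigma_xy, n)          # vertical translation
    ds = np.exp(rng.normal(0.0, sigma_s, n))   # multiplicative scale change
    return np.stack([cx + dx, cy + dy, w * ds, h * ds], axis=1)
```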

b) Sliding window

As shown in Figure 6, a structuring element of a certain shape and size (figuratively called a window) is moved across the current frame at a fixed spatial stride, and the image pixels covered after each move form a generated candidate box. Generally speaking, candidate boxes generated this way differ from the previous frame's rectangle only in position; other changes (such as rotation) require additional processing.
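
A minimal sketch of this scan, with the stride and window size as illustrative parameters:

```python
import numpy as np

def sliding_window_candidates(frame_shape, win_w, win_h, stride=8):
    """Enumerate (x1, y1, x2, y2) windows over the frame at a fixed stride.

    Only the position varies; scale or rotation would need extra handling,
    e.g. repeating the scan over a resized image pyramid.
    """
    H, W = frame_shape[:2]
    boxes = []
    for y in range(0, H - win_h + 1, stride):
        for x in range(0, W - win_w + 1, stride):
            boxes.append((x, y, x + win_w, y + win_h))
    return np.array(boxes)
```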

c) Circular shift

As shown in Figure 7, if we arrange the pixels inside the rectangular box of the target position predicted in the previous frame into the "base sample" in the figure, then each circular shift by one pixel produces a correspondingly rearranged candidate; applying the inverse rearrangement yields a candidate box. Generally speaking, candidate boxes generated this way differ from the previous frame's rectangle only in position; other changes (such as rotation) require additional processing. It is worth emphasizing that circular shifting is a special case of the sliding window, but its combination with the fast Fourier transform in correlation-filter-based tracking algorithms greatly improves efficiency, removing the need for traditional sliding-window operations to generate candidate boxes; for this reason it is listed separately here.
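
A minimal sketch that materializes the shifted samples explicitly, purely for illustration; as just noted, correlation-filter trackers never actually build them, relying on the FFT instead:

```python
import numpy as np

def circular_shift_samples(base, max_shift=4):
    """Generate shifted versions of a base sample by circular shifts.

    base: 2-D array of pixels inside the previous target box. The set of
    all circular shifts of `base` forms a circulant structure, which is
    what lets the FFT evaluate every shift at once.
    """
    samples = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            samples.append(np.roll(base, shift=(dy, dx), axis=(0, 1)))
    return np.stack(samples)
```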

Feature model - how does it look?

1) What are image features

For humans, image features are intuitive impressions of an image. For a computer, an image feature is what differentiates some region of an image, or the whole image, from other regions or other images. Common image features include color features, shape features, spatial features, texture features, and the deep features obtained through convolutional neural networks in deep learning. Bolt's yellow-and-green uniform is a color feature, while his height combines spatial and texture features. Generally speaking, the deeper the feature (abstract, non-intuitive features such as deep features), the better it identifies the target; conversely, the shallower the feature (concrete, intuitive features such as color), the better it preserves the target's spatial location information. Feature representation/extraction therefore usually requires a trade-off between the two to achieve good tracking.

2) What is image feature representation

Having seen what image features are, the problem feature representation/extraction must solve is how to describe those features, that is, how to express one or more dimensions of their mathematical characteristics in a language a computer can process. Common feature representation/extraction methods include naive methods (e.g., raw pixel values), statistical methods (e.g., histograms), and transformations (e.g., gradients of pixel values).

Features and feature representations are collectively referred to as the feature model. The feature model analyzes the candidate boxes obtained from the motion model to produce the corresponding feature representation of each candidate box, as shown in Figure 8.

3) Classification of feature models

Figure 9 classifies the methods for obtaining feature representations/extractions. Before convolutional neural networks (CNNs) were used to obtain deep features, hand-crafted feature representation/extraction was the mainstream approach to image feature processing in tracking, covering the features and representations mentioned above. Among them, color features and gradient histograms are the most widely used. Color features are relatively easy to understand: they match human intuition about images and are also the simplest computer representation of an image, namely pixel values. A gradient histogram is a histogram of gradients, where a gradient is the change of pixel values along a specific spatial direction, such as the difference between horizontally adjacent pixels; a histogram is a common representation of a data distribution that intuitively shows how often values occur within their range. See [7] for more about image features. At present, deep-learning-based methods have gradually become the mainstream in tracking research: deep features obtained through CNNs greatly improve a tracking algorithm's ability to identify the target, and their performance surpasses that of tracking algorithms using hand-crafted features.
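
To make the gradient histogram concrete, here is a minimal HOG-like sketch; the bin count and magnitude weighting are illustrative choices of ours, not a specific scheme from [7]:

```python
import numpy as np

def gradient_histogram(gray, n_bins=9):
    """Compute a simple histogram of gradient orientations.

    gray: 2-D array of pixel intensities. Gradients are the pixel-value
    differences along y and x; each pixel votes for an orientation bin,
    weighted by its gradient magnitude.
    """
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)   # unsigned, in [0, pi)
    bins = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-12)                # normalize to sum to 1
```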

Observation model - which?

1) How to make decisions

The main problem the observation model solves is how to select, among the many candidate boxes, the one that predicts the target position in the current frame, that is, decision making ("which"). Intuitively, we only need to find the candidate box in the current frame that is most "like" the predicted target of the previous frame, but "most alike" has more than one definition.

In general, solving this "most alike" problem in computer vision can be cast as a matching problem: find the best match to the previous frame's target among the candidate boxes. The matching problem is the core of the whole tracking problem and the main problem solved by most tracking algorithms; how well it is solved directly affects the performance of the whole tracker. Sometimes, even if candidate box generation or feature representation/extraction is imperfect, for instance the candidate box's shape and size deviate from reality, or the extracted features are not very discriminative, an excellent matching algorithm can compensate for the deficiencies of the first two models to some extent and maintain the overall performance of the tracking algorithm.

2) How to match

The most "like" or matching problem mentioned above is essentially a similarity measurement problem. When solving the similarity problem, we need a measurement mechanism to calculate the similarity of two compared individuals. In the tracking problem, the individual being compared is usually the prediction result of the candidate box and the previous frame (or the ground truth), and the measurement mechanism can be abstracted as distance. The distance here is not only the spatial distance, that is, how many pixels are separated between frames in the image, but also the distance between two probability distributions.

Since spatial distance is relatively easy to understand, we only explain the probability distribution distance here. The tracking result of each frame is a prediction, i.e., the probability that each candidate box is the target; taken together, all candidate boxes form a probability distribution. Viewed from this perspective, the tracking problem becomes finding, in the current frame, the candidate box distribution "closest" to that of the previous frame, where "closest" is measured by a probability distribution distance. Commonly used spatial distances include the Minkowski distance (with the Manhattan and Euclidean distances as special cases); commonly used probability distribution distances include the Kullback-Leibler (KL) divergence, the Bhattacharyya distance, cross entropy, and the Wasserstein distance. See [8].
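
As a concrete example of one such measure, the sketch below computes the Bhattacharyya distance between two normalized histograms; it is a minimal illustration, not tied to any specific tracker:

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two histograms p and q."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / p.sum()                                 # make them distributions
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))                     # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))                  # distance; 0 = identical
```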

3) Architecture and classification of the observation model

Figure 10 shows the system architecture of the observation model. As shown in the figure, the predicted target position in the previous frame (the nth frame), the candidate boxes of the current frame (the (n+1)th frame), and the candidate boxes' features are input into the model, and the prediction result (target position) of the current frame is output. These candidate boxes may differ in position, scale, and rotation, as shown by the green and orange dotted boxes in the figure.

Figure 11 shows the module decomposition and classification of the observation model. As shown in the figure, the core module of the observation model is matching. For the classification of matching methods, the mainstream view is: generative methods and discriminative methods [1, 2, 4, 9]. The main difference between the two is whether background information is introduced. Specifically, generative methods use mathematical tools to fit the target's features in the image domain, and search the current frame for the candidate box that fits best (usually the one with the smallest reconstruction error). Discriminative methods take a different approach: they regard the target as foreground and everything without the target as background, thereby transforming the matching problem into one of separating the target from the background.

By contrast, discriminative methods have stronger discriminative ability, i.e., the ability to distinguish the target from other distractors, which is where the name of this family of matching methods comes from. Supporting this view, the performance of trackers using discriminative methods has greatly surpassed that of trackers using generative methods, and they have become the mainstream of academic research [9]. To sum up, generative methods model tracking as a fitting or multi-class classification problem, while discriminative methods define it as a binary classification problem.

In addition, in Figure 11 we notice two modules drawn with dotted boxes, representing feature representation/extraction and updating, respectively. The dotted lines indicate that these two steps are optional. In some algorithms, the features obtained from the feature model are further abstracted into deeper feature information about the target before being fed to the matching module. Likewise, the updating step is not strictly necessary; its purpose is to obtain more accurate predictions.

Specifically, the matching algorithm produces a set of parameters used to predict the target position in the current frame. If these parameters were applied unchanged to all subsequent frames, the predictions could drift and eventually cause tracking failure. Possible causes include accumulated error, external factors (such as occlusion and illumination changes), and internal factors (such as appearance changes of the object and rapid motion). Introducing an update module that refreshes the matching algorithm's parameters every few frames based on recent predictions reduces the error and improves tracking accuracy.
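
One common concrete form of such an update, used for instance by many correlation-filter trackers, is a linear interpolation between the old appearance model and the newest prediction; a minimal sketch, with the learning rate as an illustrative value:

```python
def update_template(template, new_patch, lr=0.02):
    """Running-average template update.

    A small learning rate lets the model adapt to gradual appearance
    changes while damping drift from occasional bad predictions.
    """
    return (1.0 - lr) * template + lr * new_patch
```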

Algorithm classification of visual object tracking

Tracking algorithms are divided into two categories according to their observation models: generative methods and discriminative methods. It is worth noting that we classify by the observation model precisely to decouple the different models in the overall tracking system architecture. For example, even if two algorithms use the generative and discriminative methods respectively for similarity matching, they may apply the same features, such as color histograms. If the features applied in a tracking algorithm were the basis for classification, those two algorithms would fall into the same category. That is a valid alternative perspective, but it can group together algorithms that differ substantially.

We do not deny the reasonableness of classifying by features, but we focus on the essential difference between algorithms, namely their observation models. However, most survey articles directly divide tracking algorithms into generative and discriminative without emphasizing that this refers only to the observation model, which leaves readers wondering why algorithms with the same features end up in different categories. This ambiguity is unfriendly to newcomers to the tracking field.

Having clarified the premise of our classification, Figure 12 shows our taxonomy of tracking algorithms and some classical algorithms under each category. Note that we refine the classification only to the second level, i.e., one subdivision each of the generative and discriminative classes. Depending on the specifics of different algorithms, the classification could be deepened further, but that differs from the purpose of this article, which is a systematic overview of the tracking problem.

The core idea of the generative method is to measure the similarity between the previous frame's predicted target and the current frame's candidate boxes, and then select the most similar candidate box as the tracking result of the current frame (i.e., the position of the predicted target in the current frame). Generative methods are further divided into the following three categories:

1) Spatial distance

That is, solutions that measure similarity by spatial distance, usually using optimization theory to transform tracking into a spatial-distance minimization problem. Classical algorithms of this kind include IVT (Incremental learning for Visual Tracking) [10] and ASLA (Adaptive Structural Local Sparse Appearance model tracking) [11]. Their core idea is to compute the Euclidean distance between the pixel gray values of each candidate box in the current frame and those of the predicted target in the previous frame, and take the candidate box with the smallest distance as the predicted target in the current frame. In feature extraction, singular value decomposition (SVD) is used to reduce the computational complexity.
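
A minimal sketch of this decision rule (without the SVD-based subspace machinery that IVT and ASLA actually use):

```python
import numpy as np

def best_candidate_by_euclidean(candidates, template):
    """Pick the candidate patch closest to the template in pixel space.

    candidates: array of shape (n, h, w) of grayscale patches resampled
    to the template size; template: array of shape (h, w).
    """
    diffs = candidates.reshape(len(candidates), -1) - template.ravel()
    dists = np.linalg.norm(diffs, axis=1)    # Euclidean distance per patch
    return int(np.argmin(dists))             # index of the best match
```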

2) Probability distribution distance

That is, solutions that measure similarity by a probability distribution distance, usually using optimization theory to transform tracking into a distribution-distance minimization problem. Classical algorithms of this kind include CBP (Color-Based Probabilistic tracking) [12] and FRAG (robust FRAGments-based tracking) [13]. Their core idea is to compute the Bhattacharyya distance between the color histogram of each candidate box in the current frame and that of the predicted target in the previous frame, and take the candidate box with the smallest distance as the predicted target in the current frame.

3) Comprehensive

These solutions, represented by the MeanShift [14] and CamShift algorithms, blur the notion of a distance measure for similarity matching and do not even explicitly generate candidate boxes. Instead, they borrow the idea of the mean-shift clustering algorithm from machine learning: using the previous frame's predicted color histogram distribution of the target, compute the color histogram distribution of the pixels at the corresponding positions in the current frame, then obtain the mean of that distribution by clustering; the corresponding pixel position is the center of the predicted target in the frame, and adding the width and height of the candidate box yields the predicted target's spatial position. In MeanShift, the width and height are fixed, so the algorithm cannot cope with scale changes or rotation of the target; CamShift obtains scale and rotation information by introducing image moments into the similarity matching [7], further improving performance.
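
For reference, OpenCV ships MeanShift and CamShift implementations; the sketch below follows the standard OpenCV usage pattern, with the video path and the initial box coordinates as assumed placeholders:

```python
import cv2

cap = cv2.VideoCapture("video.mp4")       # hypothetical input video
ok, frame0 = cap.read()

x, y, w, h = 300, 200, 100, 50            # hypothetical initial target box
roi = frame0[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])  # hue histogram
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

def track_frame(frame, window):
    """One MeanShift step: back-project the color model, then shift the window."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    _, window = cv2.meanShift(back_proj, window, term_crit)
    return window   # swap in cv2.CamShift to also recover scale and rotation
```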

As mentioned earlier, the discriminative method views the target as foreground and separates it from everything else, which is treated as background. To some extent, it applies the ideas of classification algorithms, turning tracking into a binary classification problem. Algorithms based on classical machine learning (i.e., machine learning without deep learning) and on deep learning perform excellently on classification problems, so it is natural that their ideas were introduced into tracking. In addition, the essence of the discriminative method is still to solve the matching problem, and a very effective tool for matching is correlation: correlate a template with the input, and judge the similarity between the two from the response (output). Algorithms based on correlation operations were therefore also introduced into tracking. Discriminative methods are further divided into the following three categories:

1) Classical machine learning

These apply the ideas of classical machine learning algorithms to extract the target, as foreground, from the background. Classical algorithms of this kind include STRUCK (STRUCtured output tracking with Kernels) [15] and TLD (Tracking-Learning-Detection) [16]. STRUCK and TLD perform classification with support vector machines and ensemble learning, respectively, both classical machine learning algorithms, and adopt a series of optimizations to improve performance.

2) Correlation filter

These compute the degree of matching between the candidate boxes and the predicted target using correlation operations.
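
To make the idea of correlation concrete, here is a minimal sketch (not any specific published tracker) that correlates a template with a search patch via the FFT, exploiting the circular-shift structure described in the motion model section:

```python
import numpy as np

def correlation_response(search_patch, template):
    """Correlate a template with a search patch via the FFT.

    Circular correlation over all shifts is computed at once; the peak
    of the response map gives the predicted displacement of the target.
    """
    F = np.fft.fft2(search_patch)
    H = np.fft.fft2(template, s=search_patch.shape)   # zero-pad to same size
    response = np.real(np.fft.ifft2(F * np.conj(H)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return response, (dy, dx)
```

Real correlation-filter trackers additionally learn the filter from training samples and regularize it; the sketch shows only the core matching step.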

3) Deep learning

These apply the ideas of deep learning algorithms to extract the target, as foreground, from the background.

See [1, 2, 4, 5, 9, 23] for more excellent tracking algorithms. We summarize the algorithms mentioned above in Table 2, including the motion models, feature models, and observation models they apply. Table 2 embodies the decoupling of the different models in the overall tracking system architecture: through it we can see clearly which methods each algorithm uses in each model, which helps us classify algorithms from different perspectives, extract the common points of similar algorithms, and effectively distinguish and compare different kinds of algorithms.

The figure below shows the summary of tracking algorithms given in [17]:

III. How to evaluate visual object tracking performance

Evaluation indicators

In the previous two chapters we have already, in passing, used some indicators to evaluate tracking performance, such as accuracy and speed. In computer vision, the most commonly used indicators are precision, recall, F-score, FPS, etc. Here we briefly introduce the first two, which come from statistics and concern the classification of positive and negative samples. In short, precision is the proportion of true positives among all samples predicted as positive, while recall is the proportion of all true positives that are predicted as positive.

In the tracking problem there are analogous definitions. One reason bounding boxes are so widely used in tracking systems and algorithms is that they are easy to evaluate, and the core of their evaluation is the intersection over union (IoU), defined as:

$$\mathrm{IoU}(B_p, B_g) = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}$$

where $B_p$ and $B_g$ are the two rectangular boxes being compared and $|\cdot|$ denotes area. A larger IoU value indicates, to some extent, that the two rectangular boxes fit each other more closely. If we compute the IoU between the predicted target's rectangular box and the ground-truth rectangular box, we can gauge the tracking algorithm's effectiveness: the larger the IoU, the better the tracking.

The VOT (Visual Object Tracking) challenge has been held since 2013 and has become a mainstream standard for evaluating tracking algorithms. The two most important indicators in VOT are accuracy and robustness. In fact, the primary index used in the competition is EAO (expected average overlap), which is a weighted combination of accuracy values; its essence is still accuracy. Accuracy is defined as follows:

$$A = \frac{1}{N} \sum_{t=1}^{N} \mathrm{IoU}_t$$

That is, compute the IoU between the predicted target's rectangular box and the ground-truth rectangular box for each frame, then sum over all frames and average. $N$ is the total number of frames, which may be the frame count of one video, of multiple videos, or of repeated tests over multiple videos. Robustness is defined as follows:

$$R = \frac{N_{\text{fail}}}{N}$$

That is, the ratio of the number of frames where tracking failed ($N_{\text{fail}}$) to the total number of frames, where tracking failure is defined as the IoU between the current frame's predicted box and the ground-truth box being 0.
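
A minimal sketch of these three quantities as just defined, with boxes in corner format and function names of our own:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def accuracy_and_robustness(pred_boxes, gt_boxes):
    """Accuracy (mean IoU over frames) and robustness (failure ratio)."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return ious.mean(), np.mean(ious == 0.0)
```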

These two indicators reflect tracking performance from two angles: accuracy reflects how precisely the algorithm locates the target when tracking succeeds, focusing on the precision of the algorithm, while robustness reflects how often the algorithm loses the target, focusing on its stability. With these two indicators, different algorithms can be compared under the same set of measurement criteria.

Evaluation Dataset

Besides evaluation indicators, the other important part of tracking evaluation is the dataset. The VOT challenge not only provides evaluation indicators; its dataset is also an authoritative benchmark for evaluating tracking algorithms. Other common evaluation datasets include OTB [2], UAV123 [18], and GOT-10k [19]. Different datasets define their evaluation indicators differently, but the ideas share common ground with the accuracy and robustness described in the previous section: both the accuracy and the robustness of the algorithm are fully considered.

Beyond their differing indicators, evaluation datasets usually share certain qualities, which also serve as the standard for judging whether a dataset is suitable for evaluation: a sufficient number of videos, rich target categories, and accurate annotations. Without enough videos and categories, algorithm performance easily overfits, performing well on a few videos and/or categories but poorly on others, so the true capability of the algorithm cannot be measured accurately. The importance of annotation accuracy is self-evident, since it directly affects the correctness of the evaluation. Datasets of other kinds, such as those used for training, including ImageNet [20] and COCO [21], can also be extended into evaluation datasets if they meet the above conditions.

Evaluation Example

We take the latest VOT challenge, VOT2019, as an example of tracking algorithm evaluation; Table 3 shows the results of the competition [9]. The general process is: the organizing committee opens registration, competitors submit algorithm code through that channel, the committee collects the code, tests it on the evaluation dataset, and finally publishes the results as a report or white paper.

Table 3 lists each algorithm's results on that year's evaluation dataset, including EAO, accuracy, robustness, and other indicators. The results are generally ranked by EAO, but the top three for each individual indicator are also marked, such as the circled numbers in Table 3.

IV. Epilogue

At present, deep-learning-based algorithms have gradually become the mainstream of visual object tracking research. Beyond the algorithms introduced in this article, unsupervised learning, meta learning, and other cutting-edge techniques in artificial intelligence have also been introduced into academic research on tracking. Moreover, deep-learning-based tracking algorithms are gradually being applied in industry: optimization methods such as model compression effectively reduce algorithm complexity, making these trackers suitable for practical computer vision applications while matching or exceeding the performance of current tracking algorithms.
