Online behavior prediction based on time scale selection

Online behavior prediction means that before an action has finished executing, the algorithm uses the fragment observed so far to predict the category of the action. There are three key points in this problem. First, it is "online", which means the algorithm must be fast enough for online applications. Second, the algorithm needs to predict the class early in the action (for example, when only 10% of it has been completed). Third, the algorithm deals with unsegmented video, which means that a video may contain multiple action instances. For example, the video sequence in the figure below contains multiple actions.

For online behavior recognition, we can use a sliding window along the time dimension. Traditional sliding window methods use either a fixed window scale or multiple passes over the sequence at multiple scales. In online behavior prediction, multiple passes hurt the running efficiency of the algorithm, but with a single fixed scale it is hard to choose an appropriate time window scale.

This is because in behavior prediction tasks, the length of the observed portion of the ongoing action varies from one time point to the next. In the early stage of the action, we need a relatively small time window scale, because a window that is too large will contain many frames from previous actions, and this noise will interfere with recognizing the current action category. In the later stage of the action, we can use a large window scale to cover as much as possible of the segment that has already been executed and so achieve better prediction accuracy. A fixed window size is therefore inappropriate across the different stages of an action.
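The effect of window size on contamination can be illustrated with a small sketch. The function name `window_purity` and the frame numbers are hypothetical, chosen only for illustration; the sketch measures what fraction of a window ending at the current frame actually belongs to the ongoing action.

```python
def window_purity(t, action_start, window):
    """Fraction of frames in the window ending at frame t (inclusive)
    that belong to the current action, which began at action_start.
    Assumes t >= action_start."""
    window_start = max(0, t - window + 1)
    frames_in_window = t - window_start + 1
    frames_from_current = t - max(window_start, action_start) + 1
    return frames_from_current / frames_in_window


# Hypothetical example: the current action started at frame 100.
early, late = 104, 163  # 5 and 64 frames into the action
for window in (4, 16, 64):
    print(window, window_purity(early, 100, window), window_purity(late, 100, window))
```

Early in the action, only the small window is free of frames from earlier actions; late in the action, even the 64-frame window is fully covered by the current action, so the larger scale becomes usable.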

In this paper, a "scale selection network" is proposed to dynamically select the most appropriate window scale at different points in time. The basic structure of the network is shown in the figure below.

The scale selection network applies one-dimensional convolution along the time dimension to model the motion dynamics between frames. To obtain a series of different time scales, the network uses dilated convolutions. Because the dilated convolutional layers are stacked hierarchically, nodes in different convolutional layers have different perception window ranges. For example, the perception range of the first convolutional layer is 2 frames, that of the second layer is 4, that of the third layer is 8, and so on.
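The doubling perception ranges can be reproduced with a minimal NumPy sketch. It assumes a kernel size of 2, a dilation that doubles at each layer, and causal (past-only) convolution; these settings are assumptions that happen to yield the ranges 2, 4, 8 mentioned above, not details confirmed by the text. The function names are hypothetical.

```python
import numpy as np


def receptive_fields(num_layers, kernel_size=2):
    """Perception range (in frames) of each layer when layer l
    (0-based) uses dilation 2**l."""
    fields, rf = [], 1
    for layer in range(num_layers):
        dilation = 2 ** layer
        rf += (kernel_size - 1) * dilation
        fields.append(rf)
    return fields


def causal_dilated_conv(x, weights, dilation):
    """Kernel-size-2 causal convolution over a 1-D signal:
    y[t] = w[0] * x[t - dilation] + w[1] * x[t], with zeros before frame 0."""
    padded = np.concatenate([np.zeros(dilation), x])
    return weights[0] * padded[:-dilation] + weights[1] * padded[dilation:]


print(receptive_fields(3))  # [2, 4, 8]

# Check the third layer's range empirically with a unit impulse at frame 0.
signal = np.zeros(16)
signal[0] = 1.0
for layer in range(3):
    signal = causal_dilated_conv(signal, [1.0, 1.0], 2 ** layer)
print(int(np.count_nonzero(signal)))  # 8 frames are influenced by the impulse
```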

The network architecture above provides a range of perceptual scales. Because the appropriate scale changes from one time point to the next in online behavior recognition, we need to dynamically select a suitable time window scale at each time point. The paper designs a scale regression sub-network to predict the scale required at each time point. The sub-network is shown in the figure below.

The scale regression sub-network estimates the distance (s) between the current frame and the start frame of the ongoing action by aggregating the information of all convolutional layers in the network and feeding the aggregated information into a fully connected network. The resulting s represents the part of the current action that has already been executed, so it can serve as a suitable time window scale for predicting the current action category.
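To make the quantity s concrete, the sketch below derives a per-frame value of s from annotated action segments. It assumes that segments are given as inclusive (start, end) frame indices and that s is counted as `t - start`; both conventions, and the function name `scale_targets`, are assumptions for illustration rather than details taken from the paper.

```python
def scale_targets(num_frames, segments):
    """Per-frame distance s = t - start to the start of the ongoing action.

    segments: list of (start, end) frame indices, inclusive and non-overlapping.
    Frames outside every segment get None.
    """
    targets = [None] * num_frames
    for start, end in segments:
        for t in range(start, min(end, num_frames - 1) + 1):
            targets[t] = t - start
    return targets


# Hypothetical 10-frame video with two actions.
print(scale_targets(10, [(1, 3), (5, 8)]))
```

Values like these could act as regression targets during training, or as the ground-truth scale in an oracle setting.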

After obtaining a suitable window scale (s), we can find the convolutional layer corresponding to this scale. As mentioned earlier, different layers of the scale selection network correspond to different perceptual scales, so we find the best matching layer and use its information to predict the action category (c). The paper designs a category prediction sub-network, in which information from the appropriate convolutional layers is fed into a fully connected network for behavior prediction. As shown in the figure below, if the third convolutional layer best matches the window scale s, the information from the first to third layers is aggregated. Note that the paper uses not only the information of the third layer but also that of the layers below it. This skip connection design makes the network converge faster, and the multi-scale information fusion also improves the accuracy of behavior prediction.
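The layer selection and aggregation step can be sketched as follows. Two choices here are assumptions, not details from the text: "best matching" is taken to mean the layer whose perception range is numerically closest to s, and aggregation is done by element-wise summation. The function names and feature values are hypothetical.

```python
import numpy as np


def best_matching_layer(s, fields):
    """0-based index of the layer whose perception range is closest to s.
    Ties go to the shallower layer."""
    return min(range(len(fields)), key=lambda i: abs(fields[i] - s))


def aggregate_features(layer_features, layer_index):
    """Element-wise sum of the features of layers 0..layer_index (inclusive).

    layer_features: array of shape (num_layers, feature_dim) holding each
    layer's activation at the current time step.
    """
    return np.sum(layer_features[: layer_index + 1], axis=0)


fields = [2, 4, 8, 16]                       # perception range of each layer
features = np.arange(8.0).reshape(4, 2)      # toy activations, one row per layer

index = best_matching_layer(7, fields)       # s = 7 is closest to a range of 8
print(index)                                 # 2, i.e. the third layer
print(aggregate_features(features, index))   # sum of rows 0..2
```

The aggregated vector would then be passed to the fully connected layers of the category prediction sub-network.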

Because the network regresses and adopts the most appropriate time window scale at each time point of the video sequence, this method can obtain reliable prediction accuracy. It is worth mentioning that although the scale selection network has multiple sub-networks (the one-dimensional convolution sub-network for time series modeling, the scale regression sub-network, and the behavior prediction sub-network), all of them are integrated in the same network architecture, so the entire network can be trained end-to-end.

Two public datasets were used to test the scale selection network, and good experimental results were achieved on both. The experimental results are shown in the figure below, where SSNet is the scale selection network proposed in the paper; SSNet-GT uses the ground-truth scale for behavior prediction; and FS-Net (S) uses the same fixed scale (S) at all time points for behavior prediction. ST-LSTM is "Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates", previously published in T-PAMI. Attention Net is "Global Context-Aware Attention LSTM Networks for 3D Action Recognition", published at CVPR17. JCR-RNN is "Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks", published by MSRA and PKU at ECCV16. The results show that SSNet outperforms the other methods, and its accuracy is close to the result obtained with the ground-truth scale.
