How to Digitize the Movement of People and Goods

Business Analysis

To answer the question "when, where, and who is interested in which product?", we must accurately determine when a customer's behavior occurs and at which shelf in the store it happens; accurately identify the customer, including attributes such as age, gender, and whether they are a new or returning customer; and accurately determine which SKU was flipped or picked up. Unlike scenarios such as automatic vending cabinets and offline large screens, the surveillance camera is too far from the goods and the images are not clear enough; moreover, goods in shopping malls and supermarkets are densely arranged, so products are often partially occluded in the image. The camera can therefore generally only judge whether a customer performed a flipping action; the product information in the "people-goods-field" triad cannot be obtained directly, as shown in the figure.

To solve this problem, we use passive radio-frequency identification (RFID) tags, which are widely used in new-retail scenarios, to detect the goods flipped by the user. Passive RFID tags are cheap and can be attached to clothing and other goods at scale. The algorithm currently provides the data of "when, where, who interacted with which product", and the overall implementation process is shown in the figure.

The physical store is equipped with surveillance cameras and RFID receivers, which respectively record real-time video and the time series of signals reflected by the excited RFID tags. A customer-detection algorithm runs on a server deployed in the store; only frames in which pedestrians are detected are sent back to the backend, while the RFID time-series signal is small enough that all of it is sent back. When a tag flip is detected, because the shop assistant has already associated the EPC number of each RFID tag with the product's SKU, the SKU of the flipped product can be retrieved from the flipped tag's EPC number. Meanwhile, the returned customer images are fed to a MobileNet classifier to identify customers suspected of flipping a product, and the customers' image coordinates are transformed to obtain their real physical coordinates. Finally, products suspected of being flipped are associated with customers suspected of flipping them, based on time and on the suspiciousness of the actions, to obtain the best match between products and pedestrians, thereby detecting "when, where, who interacted with which product".

This article discusses three key technical points:

Image-based customer action detection algorithm;

Product flip detection algorithm based on radio-frequency signals;

People-goods association algorithm based on bipartite graph matching.

1. Image-based customer action detection algorithm
Driven by business and deployment needs, the image-based customer action detection algorithm evolved from video-based detection to single-frame image detection.

1.1 Video-based action detection

1.1.1 Problem Analysis

Unlike pedestrian or face detection, the interaction between customers and products is a temporal process: actions such as "picking up", "flipping", and "trying on" all unfold over a certain period of time, so understanding user behavior is a typical video action classification problem. Video-level action understanding mainly studies model learning and prediction over an entire video. With the wide application of deep neural networks, video action understanding has developed rapidly in recent years; well-known examples include the CNN model of 2014 [1], the LRCN model of 2015 [2], and the I3D model of 2017 [3]. LRCN uses 2D convolutions to extract features from single frames and an RNN to model the temporal relationship between frames, achieving good results but taking a long time to train; the I3D model uses 3D convolutions to extract features from the whole video directly, and trains much faster.

However, open-source datasets such as UCF101, HMDB51, and Kinetics only cover a limited number of common actions such as sports and playing musical instruments, and contain no supermarket data. We therefore collected action videos in supermarket scenes ourselves and sent them out for labeling. Customer-product interactions in a supermarket are extremely brief: across a whole day of video, customer actions occur only about 0.4% of the time, so positive examples are particularly sparse. To surface as many positive samples as possible and improve labeling efficiency, we preprocessed the video with the Lucas–Kanade optical flow algorithm, screening out clips with no moving objects; this raised the proportion of positives to about 5%. To improve sample confidence, we also re-checked the positive examples to ensure their accuracy.
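The screening idea can be illustrated with a minimal sketch. The production pipeline uses Lucas–Kanade optical flow; this sketch substitutes simple frame differencing, and all function names and thresholds are illustrative assumptions:

```python
import numpy as np

def has_motion(prev_frame, frame, pixel_thresh=25, area_ratio=0.01):
    """Flag a frame as 'active' if enough pixels changed since the last frame.

    A crude stand-in for the optical-flow screening described in the text:
    frames whose changed area is below `area_ratio` are assumed static and
    can be skipped before sending clips out for labeling.
    """
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    changed = (diff > pixel_thresh).mean()
    return changed >= area_ratio

def screen_video(frames, **kwargs):
    """Return indices of frames that likely contain moving objects."""
    return [i for i in range(1, len(frames))
            if has_motion(frames[i - 1], frames[i], **kwargs)]
```

Only the clips around the returned indices would then be forwarded for labeling, which is what raises the positive ratio from roughly 0.4% to about 5%.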

1.1.2 Action detection model and its optimization

To detect actions occurring in a video, this paper:

automatically crops a real-time short video of each hand of each customer, based on a human-body keypoint detection algorithm (Pose) and a tracking algorithm (Track);

trains a hand-motion classifier based on the Inception-V1 C3D (I3D) model, and crops and optimizes the I3D model.

Cropping a short video of each of the customer's hands helps pinpoint the location of the action. Since a customer can only flip a product with their hands, other parts of the body have no direct effect on the detection result; focusing on the wrist region also reduces the model's computation and speeds up convergence. The Pose model used in the algorithm is based on OpenPose [4], and the Track algorithm is based on Deep SORT [5]; mature implementations of both can be used directly.

This paper mainly discusses applying the Inception-V1 C3D model to understanding hand-motion videos. Image-based learning tasks usually slide and pool 2D convolution kernels over an image to extract features, while video-based learning tasks extend the traditional 2D convolution to 3D, sliding and pooling over the video. As shown in the figure, 3D convolution has one more dimension, time, than 2D convolution.
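The extra time dimension can be made concrete with a naive NumPy sketch of "valid" 3D convolution (illustrative only; real models use optimized framework kernels, and the function name here is an assumption):

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution of a (T, H, W) clip with a (t, h, w) kernel.

    Compared with 2D convolution, the kernel also slides along the time
    axis T, so the output shrinks in all three dimensions.
    """
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out
```

A 4-frame 5×5 clip convolved with a 2×3×3 kernel yields a 3×3×3 output: the time axis shrinks just like the spatial axes, which is exactly the extra dimension I3D exploits.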

This paper uses the recently proposed Inception-V1 C3D (I3D) model, which extends the traditional Inception-V1 model to 3D and has achieved the best results to date on the Kinetics, HMDB51, and UCF101 datasets. The block diagram of the I3D model is shown in the figure; all convolution and pooling operations in the diagram are 3D. The Mix module is the Inception submodule of the Inception-V1 model, consisting of 4 branches, as shown on the right.

When actually applied to understanding customer behavior in supermarkets, we found that 3D convolution introduces too many model parameters, which makes overfitting very likely; an overly deep network also leads to problems such as poor generalization and a weak dropout effect. We therefore simplified the Inception-V1 model, removing unnecessary Inception modules (Mix_4c, Mix_4d, Mix_4e, Mix_4f, Mix_5c) and halving the number of convolution kernels (depth) in each remaining Inception module. The simplified model is shown in the figure:

This simplification raised the classification accuracy from 80.5% to 87.0%; after further hyperparameter tuning, the final classification accuracy reached 92%.

1.1.3 Conclusion and Analysis

After converting the samples to TFRecords, we uploaded them to the group's data storage center, trained the model in a distributed environment (30 GPU cards) with PAI, and applied the trained model to understanding customer actions. The actual prediction effect achieved by the algorithm is shown in the video action detection result.mp4. A server can currently run real-time action detection on a single video channel, but compared with traditional image classification, video action classification mainly has the following problems:

High model complexity: three-dimensional convolution (C3D, Convolutional 3D) must be performed in time and space simultaneously;

Huge data scale and difficult training: compared with images, video data is hundreds to tens of thousands of times larger;

Challenging deployment: in practice, the video action model faces great challenges in prediction latency, organization of time-series samples, and rolling-window prediction.

1.2 Single-frame image action detection

1.2.1 Problem Analysis

The video-convolution-based customer action detection algorithm has high accuracy (92%) and can give a precise action position (down to the wrist), but because the 3D video convolution model and the Pose model have so many parameters, one server can only run a single detection stream, which puts a heavy burden on computing resources. The 3D convolution algorithm also requires accurately associating continuous hand videos over a period of time, which places great demands on the tracking algorithm. To make action detection adaptable to multiple stores with many camera feeds, we further developed a suspicious-action detection algorithm that classifies single images directly. Due to the lack of temporal information, single-image detection is less accurate than the video-based algorithm, but this shortfall can be compensated for by fusing and correcting the results with RFID detection.

1.2.2 Model and its optimization

Suspicious-action detection on a single image is a classic binary classification problem. To reduce the demand for computing power, we use MobileNet [6] as the classification model. MobileNet is an efficient, lightweight network proposed by Google; it replaces traditional convolutional layers with depth-wise and point-wise convolutions. Compared with a traditional convolutional network, MobileNet reduces the computation and parameter count to about 10%-20% while keeping accuracy almost unchanged. This article uses the MobileNet_V1 model with the channel-number factor (depth multiplier) set to 0.5 to further shrink the model. We annotated the image action dataset through outsourcing and balanced the samples so that the numbers of positive and negative samples were consistent during training. Since in this business scenario we want to capture the moment a user acts as often as possible, we tuned the threshold on the model's output logits to raise the positive recall to 90%, ensuring that positive examples are recalled as much as possible, as shown in the figure below.
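The parameter savings of depth-wise separable convolution can be checked with a small back-of-the-envelope calculation (a sketch; the function names are illustrative, and `depth_multiplier` here models MobileNet's channel-width factor):

```python
def conv_params(k, c_in, c_out):
    """Parameter count of a standard k x k convolution layer (bias ignored)."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out, depth_multiplier=1.0):
    """Depthwise (k x k per input channel) + pointwise (1 x 1) convolution,
    with MobileNet's channel multiplier applied to the channel counts."""
    c_in = int(c_in * depth_multiplier)
    c_out = int(c_out * depth_multiplier)
    return k * k * c_in + c_in * c_out
```

For a 3×3 layer with 256 input and output channels, the separable form needs 9·256 + 256·256 = 67,840 parameters versus 9·256·256 = 589,824 for the standard form, about 11.5%, consistent with the 10%-20% figure above; a depth multiplier of 0.5 roughly quarters the count again.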

1.2.3 Conclusion and extension

The image-based action detection algorithm classifies actions with an accuracy of 89% (slightly lower than the video-based algorithm, and without wrist-level localization) and a recall of 90%. The figure below shows the model's predictions of whether a pedestrian picked up clothes: gray-and-white stripes at the bottom of the image indicate frames the model classified as positive, and black stripes indicate frames classified as negative.

Because the image-based action detection algorithm uses the lightweight MobileNet model, its computational cost is greatly reduced: predicting on the tens of thousands of suspicious images a store produces in a whole day takes only about ten minutes. The algorithm also needs to process only the images sent back to the backend, greatly reducing the burden on field devices and servers. However, the image-based algorithm only localizes to the pedestrian, not to the precise position of the action; and images within several seconds before and after the same action may all be detected as positives, which also affects the accuracy of the subsequent time association.

2. Product flip detection algorithm based on radio-frequency signals

2.1 Problem Analysis

When a customer flips a garment, the RFID tag attached to it vibrates slightly; the RFID receiver records the changes in the characteristics of the signal reflected by the tag, such as RSSI and Phase, and sends them back to the backend. The algorithm analyzes the signal values returned by each antenna to determine whether the product was flipped. Judging product flipping from RFID signals faces many problems, including signal noise, environmental multipath effects, occasional electromagnetic interference, and signal occlusion by shelving. Moreover, the strength of the RFID reflected signal has a nonlinear relationship with the distance between the receiver and the tag [7].

In the formula, d is the distance between the RFID tag and the receiver, l = d mod λ, γ is affected by multipath and the current environment, and μ is the offset introduced by various static equipment errors. The formula shows that the receiver's mounting position and the store environment greatly affect the RFID signal, so finding a unified flip-detection algorithm that applies across different stores and receiver placements is a major challenge.

2.2 Flip detection algorithm and its optimization

We first tried building a supervised model to detect whether a product was flipped. We collected the RSSI and Phase time-series signals of tags from two different RFID antennas actually deployed in a store, and manually combined the following features:

Here Ant1 and Ant2 denote the signals collected by two different antennas for the same tag, Diff denotes a differencing operation on the signal, and Avg denotes averaging. At 50 frames per second, each sample collects 8 seconds of continuous signal, forming a 400×10 two-dimensional feature map. We trained the following model on a self-constructed dataset; the final classification accuracy in practice was 91.9%.
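Since the original feature table is not reproduced here, the following NumPy sketch shows one plausible way to assemble a 400×10 feature map from the two antennas' RSSI/Phase series using the Diff and Avg operations described above; the exact feature combination is an assumption:

```python
import numpy as np

FS = 50          # sample rate (Hz), from the text
WINDOW_S = 8     # seconds per sample -> 400 time steps

def build_features(rssi1, phase1, rssi2, phase2):
    """Stack a 400 x 10 feature map from two antennas' RSSI/Phase series.

    The 10-channel combination below is illustrative, chosen to be
    consistent with the Diff (differencing) and Avg (averaging)
    operations the article describes.
    """
    assert len(rssi1) == FS * WINDOW_S
    rssi_avg = (rssi1 + rssi2) / 2
    phase_avg = (phase1 + phase2) / 2
    feats = [
        rssi1, phase1, rssi2, phase2,     # raw signals, both antennas
        rssi1 - rssi2, phase1 - phase2,   # Diff across antennas
        rssi_avg, phase_avg,              # Avg across antennas
        np.gradient(rssi_avg),            # temporal change of the averages
        np.gradient(phase_avg),
    ]
    return np.stack(feats, axis=1)        # shape (400, 10)
```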

A supervised detection algorithm achieves high flip-detection accuracy, but in actual deployment its generalization proved very poor: when a shelf or antenna is moved, the signal changes drastically, and a model trained on the original dataset is hard to generalize to the new scenario. We therefore also tried an unsupervised algorithm to improve the detector's generalization. In particular, we noticed that the phase information is independent of spatial position but related to relative displacement (see the frequency distribution diagram of the phase information), while the frequency information is independent of spatial position but related to the speed of movement (see the frequency distribution diagram of the frequency-domain information).

Strictly speaking, the amplitude in the frequency information is related to spatial position, but when we only consider the frequency distribution (the proportions of different frequency components), the frequency information can also be treated as a feature independent of spatial position. Obtaining the frequency information requires a discrete Fourier transform of the RSSI and Phase signals:

We then compute the distributions of the frequency signal and the phase signal, and for each distribution calculate the JS divergence between the current window and the previous window (unlike the KL divergence, the JS divergence is symmetric, so it can be used to measure the relative distance between distributions).

Flipping is detected from the change in JS divergence between adjacent windows; the final JS divergence threshold can be tuned per scene to obtain action-detection accuracy similar to the supervised model.
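The windowed detection described above can be sketched as follows; this is a minimal illustration, and the 0.05 threshold and function names are assumptions, not the deployed values:

```python
import numpy as np

def freq_distribution(signal):
    """Normalized spectral-energy distribution of one signal window."""
    mag = np.abs(np.fft.rfft(signal - np.mean(signal)))
    mag = mag + 1e-12                 # avoid log(0) in the divergence
    return mag / mag.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, unlike KL."""
    m = (p + q) / 2
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def flipped(window_prev, window_cur, threshold=0.05):
    """Flag a flip when adjacent windows' frequency distributions diverge."""
    return js_divergence(freq_distribution(window_prev),
                         freq_distribution(window_cur)) > threshold
```

A static tag yields nearly identical distributions in adjacent windows (JS divergence near 0), while a disturbed tag shifts spectral energy to new bins, pushing the divergence past the threshold.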

2.3 Conclusion and Analysis

Both the supervised and the unsupervised model detect product flipping, and the algorithm identifies the exact flipped product (SKU). The supervised model reaches 91.9% detection accuracy, and the unsupervised model reaches 94% (higher than the supervised model). Because the JS-divergence measure improves generalization, the detection algorithm can be applied to different scenarios with only slight threshold adjustments.

3. People-goods association algorithm based on bipartite graph matching

3.1 Problem Analysis

Clearly, image-based customer behavior detection and RFID-based product flip detection are two separate processes:

RFID-based detection tells us that a product was flipped, but not which customer flipped it;

Image-based action detection identifies customers suspected of flipping products, but cannot tell which products were flipped.

In actual scenarios, only a few actions usually occur at the same place and time, so the flip time detected by RFID can be matched against the flip time detected from images, associating the customer who flipped a product with the product that was flipped. The problems are:

The cumulative error between the action time detected from images and the action time detected by RFID can reach 5-15 seconds. The difference in transmission time of the RFID signal and the surveillance video back to the server, clock skew among on-site devices, and inaccurate action-time estimates by the algorithms all make the two detection times inconsistent.

Multiple suspicious customers and multiple suspiciously flipped products may appear at nearby locations at nearby times.

3.2 Human-goods association algorithm and its optimization

We associate actions detected by the RFID devices near a shelf with actions detected from images near the same shelf, matching on both time consistency and the suspiciousness of the action, so that a customer who actually picked up goods at a shelf is associated with the goods actually picked up there at that time. When multiple customers act suspiciously in the same area, or multiple products are suspiciously flipped there, we further use a bipartite graph matching algorithm to find the best match between the products and the customers. In this paper, the matching degree (edge weight) between a customer and a product is defined as:

Before running the matching algorithm, we therefore split the bipartite graph into multiple disconnected subgraphs and match each subgraph separately, reducing computational and storage complexity. When edge weights are stored in an adjacency matrix, subgraph matching greatly improves efficiency, reducing the time to match the hundreds to thousands of customers and products in a whole day from hours to minutes.
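The subgraph-splitting idea can be sketched in pure Python: group the bipartite graph into connected components with union-find, then find the maximum-weight matching within each small component (here by brute force, purely as an illustration; the deployed system's weights and matching algorithm may differ):

```python
from itertools import permutations

def connected_components(edges):
    """Split {(customer, product): weight} edges into disconnected subgraphs."""
    parent = {}
    def find(x):
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for (c, p) in edges:
        union(('c', c), ('p', p))           # tag nodes to avoid name clashes
    comps = {}
    for (c, p), w in edges.items():
        comps.setdefault(find(('c', c)), {})[(c, p)] = w
    return list(comps.values())

def best_match(edges):
    """Max-weight one-to-one matching inside one small subgraph (brute force)."""
    customers = sorted({c for c, _ in edges})
    products = sorted({p for _, p in edges})
    k = max(len(customers), len(products))
    custs = customers + [None] * (k - len(customers))   # pad the shorter side
    prods = products + [None] * (k - len(products))
    best, best_w = [], float('-inf')
    for perm in permutations(prods, len(custs)):
        pairs = [(c, p) for c, p in zip(custs, perm)
                 if c is not None and p is not None and (c, p) in edges]
        w = sum(edges[cp] for cp in pairs)
        if w > best_w:
            best, best_w = pairs, w
    return best

def match_all(edges):
    """Match each disconnected subgraph separately, as the text describes."""
    result = []
    for comp in connected_components(edges):
        result.extend(best_match(comp))
    return sorted(result)
```

Because components are matched independently, the factorial cost of matching applies only to each small cluster of co-occurring customers and products, not to the whole day's graph, which is the efficiency gain described above.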

3.3 Conclusion and Analysis

The bipartite-graph-based people-goods association algorithm obtains the best match between products and customers over the whole day, and the improved algorithm greatly raises computational efficiency, reducing the whole-day matching time from several hours to within minutes. Because the people-goods association is the final link of the overall pipeline, its accuracy is bounded by the accuracy of every upstream model: customer detection, single-frame image action detection (89%), RFID flip detection (94%), and the accuracy of the matching-degree definition itself.

4. Conclusion and Analysis

We integrated the above algorithms:

Image-based customer action detection;
Product flip detection based on radio-frequency signals;
People-goods association based on bipartite graph matching.
and ran actual predictions in a store. For ease of understanding, here are some example results detected by the algorithm: "when (moment), where (image coordinates), who (pedestrian ID), and which product (product SKU) interacted".

Here, the pedestrian ID is the number assigned by the customer-detection algorithm to each user entering the store; the coordinates are image coordinates, which a downstream process can convert to a physical location in the store; and the moment is the time the product was flipped.

Although single-frame action detection has lower accuracy, it significantly improves the accuracy of the customer-product association. Comparing the people-goods associations obtained with and without single-frame action detection, we found that the single-frame detection algorithm raised the final matching accuracy from 40.6% to 85.8%.

Since the people-goods association algorithm is the final link of the pipeline, its accuracy is affected by every upstream model. By matching simultaneously on the two dimensions of time consistency and action suspiciousness, the final accuracy of matching pedestrians to flipped goods reaches 85.8%. This shows that:

Associating product SKUs with pedestrian actions at second-level granularity in the "people-goods-field" scenario is practical and feasible;

The unsupervised RFID detection algorithm reduces the difficulty of deploying product flip detection at scale;

The single-frame pedestrian action detection algorithm is effective and can be deployed directly in the backend server environment at scale.

At the same time, the algorithm is still in its infancy and needs further work in the following directions:

1. The single-frame pedestrian action detection algorithm still has much room for optimization: growing the dataset from thousands to tens of thousands of positive examples may bring a significant performance improvement, and since the server can bear more computing load, stronger classification models such as VGG, ResNet, or Inception could be tried.

2. The current single-frame action detection algorithm only localizes to the pedestrian; a detection model could be tried to localize the action precisely.

3. At present, an RFID receiver can only detect the signals of dozens to hundreds of items within about 1-3 meters, and the receiver's sampling frequency and antenna polling mechanism greatly limit the capacity and range of RFID detection. Optimizing RFID detection at the hardware level could greatly reduce cost and improve detection capacity, range, and accuracy.

4. The unsupervised RFID flip detection algorithm relies on threshold tuning; there is still much room to find an unsupervised algorithm that generalizes across stores.

5. People-goods association is currently based on time and action suspiciousness; going further, associating based on rough product location and pedestrian location would greatly improve association accuracy.
