Gesture Recognition: The Right Way to AI Interaction

Baitao, Yueyi, Jingzhou, Hongcheng, Yingzhi, and Kante introduce research about the interactive ability of gesture recognition by the Tmall Genie M laboratory.

By Baitao, Yueyi, Jingzhou, Hongcheng, Yingzhi, and Kante

This article introduces the Tmall Genie M laboratory's research on gesture recognition as an interaction capability. The research covers both business exploration and algorithms.

1. Overview

"Gestures are the most natural form of human communication. Hardware is the only limitation that prevents us from controlling our devices well." Here, the hardware limitation refers to the additional depth sensors that traditional gesture recognition algorithms require. Thanks to the continuous development of adaptive artificial intelligence (AI) and edge computing over the past decade, gesture recognition without such extra hardware has gradually become feasible.

We can expect to see more functions operated by gestures on smartphones, tablets, desktop computers, laptops, smartwatches, smart televisions, and Internet-of-Things (IoT) devices.

This year, we have already seen this trend. Technology giants have launched their own gesture recognition products. Google provided gesture interactions on its mobile phones and smart speakers, and Apple has filed a patent on the application of gestures to smart speakers. Indeed, gestures are the most natural method of human-computer interaction. Now, imagine the following scenarios:

While watching TV, you want to change the channel or adjust the volume. If you cannot find the remote control, you can directly operate the TV with gestures.
While driving, you want to cut off an unpleasant song as soon as possible. Interaction with the touch screen takes your sight off the road, which is potentially dangerous. Gesture recognition in your car makes driving safer.
While you are watching a TV series on an iPad, a call from your boss or spouse suddenly comes in. You can mute the iPad with a gesture.
In a smart home, it is conceivable to use gestures to operate your electric lights, air conditioners, and range hoods.

In short, you are the only interface you need.

2. Current Business Scenarios

We belong to the Tmall Genie M laboratory, which is mainly responsible for the visual algorithms of Tmall Genie. Our main research direction is visual algorithms for human-computer interaction, including gesture recognition, body recognition, and multi-modal visual speech interaction.

Last year, we launched an ultra-lightweight gesture recognition algorithm based on Tmall Genie smart speakers. This year, we have gone further in technologies, business scenarios, and algorithms:

We launched gesture control on large-screen products, like Tmall Genie CC, CCH, and CCL.
We applied our gesture recognition to the Youku client for iPad, together with partners from Youku.
We incorporated tappable reading materials with finger gestures for children's education. Children can tap on the area they do not understand to get explanations.
We are working with TV manufacturers and other IoT ecosystem manufacturers to complete the first phase of large-screen gesture interaction. In the future, it will not be a dream to operate a TV without a remote control.

3. Ubiquitous Single-Point (Static) Gestures

3.1 Gesture Recognition on Tmall Genie and the Youku Client for iPad

Last year, we launched an ultra-lightweight gesture recognition algorithm based on Tmall Genie smart speakers. This year, we cooperated with partners from Youku to bring single-point gestures to the Youku client for iPad.

3.1.1 Application of Single-Point Gestures: the Youku Client for iPad, a Magical Tool to Watch a TV Series While Eating

A quote from users: "a magical tool to watch a TV series while eating"

Users spontaneously uploaded an introduction video after the feature went live on Youku. The video matches the scenarios and pain points we expected:

When you are watching a TV show, you often need to skip, fast-forward, or rewind. Sometimes it is inconvenient to do this by hand, for example, when you are eating or your hands are full.
When using an iPad: 1) Due to the size and weight of the device, the iPad is rarely held in the user's hands. 2) Due to the larger screen, users usually watch from a certain distance. Therefore, gesture recognition can improve the viewing experience.

3.2 Go Further: Distant Gesture Interactions for Large Screens

3.2.1 Interaction Scenarios for Large Screens

In recent years, more smart TVs (smart screens) have entered our homes. According to the forecast by the Ministry of Industry and Information Technology, the market penetration rate of smart TVs is expected to exceed 90% by 2020. In addition to popularity, a powerful interaction capability is a requisite feature of a smart home interface. Smart TVs have become another important interface for the smart home IoT. Because their screens are large, gesture interaction is a more user-friendly way to operate them.

3.2.2 Challenges

The further we go, the greater the challenges we meet. Compared with devices that are used up close, like Tmall Genie CC or the iPad, the development of gesture algorithms on smart TVs faces the following challenges:

Greater distance. A smart TV has a large screen, which is usually comfortable to watch at a distance of three to five meters. At this distance, hands account for a very small proportion of the picture.
More viewers. There may be many people watching TV at the same time, so we need to identify and respond to the interaction of each person promptly.
More complex backgrounds. The placement of TVs varies from home to home, and our algorithm needs to recognize the same gestures against many different backgrounds.
Limited computing power. Although smart TVs are becoming more popular, their hardware has very limited computing power.

3.2.3 The Large Screen Solution

To address the preceding challenges, after exploring and researching the algorithms, we proposed Contextual-attention-guided Fast Tiny Hand Detection and Classification.

The large screen solution: Contextual-attention-guided Fast Tiny Hand Detection and Classification

1) Lightweight hourglass-like backbones

The lightweight hourglass module downsamples the input to obtain feature maps with high-level semantic information while keeping as much detail as possible. This helps with the detection of tiny hands.
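
As a rough illustration of the hourglass idea, the following plain-Python sketch downsamples a feature map, upsamples it again, and adds a skip connection so that fine detail survives. It is not the laboratory's backbone: the function names, the 2x average pooling, and the nearest-neighbour upsampling are all assumptions chosen to keep the example small.

```python
def avg_pool2x(fmap):
    """Downsample a 2D map by averaging each 2x2 block (height and width must be even)."""
    h, w = len(fmap), len(fmap[0])
    return [[(fmap[y][x] + fmap[y][x + 1] + fmap[y + 1][x] + fmap[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def upsample2x(fmap):
    """Nearest-neighbour upsampling: repeat every value over a 2x2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def hourglass(fmap):
    """One hourglass level: a coarse path plus a skip path that keeps detail."""
    coarse = upsample2x(avg_pool2x(fmap))
    return [[fmap[y][x] + coarse[y][x] for x in range(len(fmap[0]))]
            for y in range(len(fmap))]


# A 4x4 map with one bright pixel standing in for a tiny hand (hypothetical data).
feature = [[0.0] * 4 for _ in range(4)]
feature[1][2] = 4.0
merged = hourglass(feature)
```

The coarse path alone would smear the bright pixel over its 2x2 block. The skip path keeps its exact position, which matters when a hand covers only a few pixels.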

2) Contextual attention

From three to five meters away, human hands occupy an extremely small proportion of the pixels in the picture. Although very small, a hand is close to certain parts of the body, such as the wrist, arm, and face, and it may be similar in color to them. The body and these body parts are larger than the hand, so they provide additional clues that make a small hand easier to detect. Based on this, we use similarity context and semantics context as contextual attention to guide the network to draw on semantic information outside the hand area and enhance detection performance.
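
The idea of letting similar-looking context support a weak hand candidate can be sketched with a generic attention step. This is only an illustration under stated assumptions: it is not the similarity-context or semantics-context module of the actual network, and the feature vectors, `attend`, and the cosine-plus-softmax weighting are hypothetical choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hand_feat, context_feats):
    """Add a similarity-weighted mix of context features to the hand feature."""
    weights = softmax([cosine(hand_feat, c) for c in context_feats])
    mixed = [sum(w * c[i] for w, c in zip(weights, context_feats))
             for i in range(len(hand_feat))]
    enhanced = [h + m for h, m in zip(hand_feat, mixed)]
    return enhanced, weights


# Hypothetical 3-dimensional features: a faint hand candidate, a similar-looking
# forearm region, and an unrelated background region.
hand = [0.2, 0.1, 0.0]
forearm = [0.8, 0.4, 0.0]
background = [0.0, 0.0, 1.0]
enhanced, weights = attend(hand, [forearm, background])
```

In this toy setup the forearm receives the larger attention weight because its feature points in the same direction as the hand feature, so the hand response is strengthened by the larger, easier-to-see region next to it.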

Haier TV gestures

4. Implementation and Optimization Closed Loops

We believe that anyone who has deployed AI algorithms in real products has run into various practical problems.

We went through the same process after we launched gesture recognition on Tmall Genie and Youku. To improve our algorithm, we took the following actions:

We proposed a new detection algorithm and applied the cutting-edge overflow-aware quantization solution to provide an excellent experience.
We integrated AutoML to facilitate the rapid implementation of AI applications and optimized our algorithms in a dynamic closed loop.

4.1 A Faster and More Powerful Detection Algorithm on the Client and the Overflow-Aware Quantization Application

4.1.1 A More Powerful Detection Algorithm on the Client

Starting from the anchor-free solution, we adopted a more efficient algorithm framework in which a heatmap assists the anchor solution.

Computing power varies across devices such as Tmall Genie speakers and IoT visual modules, so gesture recognition on the client places high demands on the algorithm. To meet them, we proposed centernet-lite, a detection algorithm based on the popular anchor-free CenterNet algorithm. However, while implementing it, we found that the popular anchor-free solution has some inherent disadvantages in small networks:

Because the anchor-free solution is based on a heatmap, the final accuracy depends heavily on the heatmap, which works against miniaturization.
Also because of the heatmap, this solution is weak at detecting overlapping objects of the same class.
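
The second disadvantage can be seen in a small sketch of heatmap peak decoding in the style of the public CenterNet approach, where a cell counts as an object centre only if it is the maximum of its 3x3 neighbourhood. This is not the centernet-lite code; `decode_peaks`, `blank`, the threshold, and the toy heatmaps are assumptions for illustration.

```python
def decode_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) for cells that are 3x3 local maxima above threshold."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            score = heatmap[y][x]
            if score < threshold:
                continue
            neighbours = [heatmap[ny][nx]
                          for ny in range(max(0, y - 1), min(h, y + 2))
                          for nx in range(max(0, x - 1), min(w, x + 2))]
            if score >= max(neighbours):
                peaks.append((y, x, score))
    return peaks


def blank(size):
    """Create a size x size heatmap filled with zeros."""
    return [[0.0] * size for _ in range(size)]


# Two hand centres far apart: both survive as separate peaks.
separated = blank(5)
separated[1][1] = 0.9
separated[3][3] = 0.8

# Two hand centres in adjacent cells: the weaker one is suppressed.
overlapping = blank(5)
overlapping[2][2] = 0.9
overlapping[2][3] = 0.8

print(len(decode_peaks(separated)))  # 2
print(decode_peaks(overlapping))     # [(2, 2, 0.9)]
```

When two objects of the same class have centres in neighbouring heatmap cells, only the stronger response is kept, so one of the two objects is lost.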

4.1.2 The Overflow-Aware Low-Bit Quantization Algorithm

Quantization on the client

The popular solution for accelerating recognition in the industry is Google's 8-bit quantization algorithm, but we use a better low-bit quantization algorithm. This algorithm learns the minimum and maximum range of each layer and dynamically adjusts the quantization scheme of each layer. Currently, the acceleration ratio on the inference engine side is 70%.
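
To make the role of the per-layer range concrete, here is a minimal sketch of generic min/max (affine) quantization with a configurable bit width. It does not reproduce the overflow-aware algorithm, whose details are not given here. The range is passed in by hand rather than learned, and `quantize_layer`, `dequantize_layer`, and the sample weights are hypothetical.

```python
def quantize_layer(values, lo, hi, bits=8):
    """Map floats in [lo, hi] to unsigned integer codes with the given bit width."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    # Values outside [lo, hi] are clamped to the nearest representable code.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale


def dequantize_layer(codes, lo, scale):
    """Recover approximate float values from integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical weights of one layer, quantized to 4 bits over the range [-1, 1].
weights = [-0.9, -0.3, 0.05, 0.4, 0.95]
codes, scale = quantize_layer(weights, lo=-1.0, hi=1.0, bits=4)
restored = dequantize_layer(codes, lo=-1.0, scale=scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

For values inside the range, the reconstruction error is at most half a quantization step, so a tighter range per layer gives a finer step for the same number of bits. This is why choosing the range layer by layer matters at low bit widths.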

Finally, we use the heatmap to assist the anchor solution and integrate the overflow-aware solution. This achieves a good balance between accuracy and performance on Tmall Genie hardware.

Hardware	Before optimization	centernet-lite	Overflow-aware
MTK8167S	200ms	110ms	80ms
iPad Mini4	45ms	23ms	18ms
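
The table can be summarized with a little arithmetic. The sketch below assumes that the figures are inference times and that the "Overflow-aware" column is the final result with both optimizations applied; the helper names are hypothetical.

```python
# Figures copied from the table above, in milliseconds.
latency_ms = {
    "MTK8167S": {"before": 200, "centernet-lite": 110, "overflow-aware": 80},
    "iPad Mini4": {"before": 45, "centernet-lite": 23, "overflow-aware": 18},
}


def speedup(before, after):
    """How many times faster the optimized version is."""
    return before / after


def reduction_pct(before, after):
    """Percentage of the original time that was removed."""
    return 100.0 * (before - after) / before


summary = {
    device: (speedup(t["before"], t["overflow-aware"]),
             reduction_pct(t["before"], t["overflow-aware"]))
    for device, t in latency_ms.items()
}
print(summary)  # {'MTK8167S': (2.5, 60.0), 'iPad Mini4': (2.5, 60.0)}
```

Under these assumptions, both devices end up 2.5 times faster than before optimization, which is a 60% reduction in time.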

4.2 Optimization Closed Loops: Online Optimization Framework of AUTOAI for Gesture Recognition

We adopted the idea of knowledge distillation from deep learning and use the output of a pre-trained complex model (teacher model) as a supervisory signal to train the online network (student model). In this way, we can continuously optimize algorithms without directly using business data.
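
The teacher-student signal can be sketched as a standard soft-target loss. This is a generic knowledge distillation loss, not the laboratory's training code; the temperature value, the logits, and the function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


# Hypothetical logits for three gesture classes on one unlabeled frame.
teacher_logits = [4.0, 1.0, 0.5]
agreeing_student = [4.0, 1.0, 0.5]
disagreeing_student = [0.5, 1.0, 4.0]

loss_agree = distillation_loss(agreeing_student, teacher_logits)
loss_disagree = distillation_loss(disagreeing_student, teacher_logits)
```

The loss needs only the teacher's output for a frame, not a human label, and it is lowest when the student reproduces the teacher's distribution.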

Here are the optimization results with the fence model:

5. Product-Level Sequence (Dynamic) Gestures

5.1 Why is Dynamic Gesture Recognition Needed?

We have tried and applied many single-point gestures. However, dynamic gestures are a more natural and comfortable method of interaction, and they are a direction we have been studying continuously.

From a product planning perspective, dynamic gestures provide a stronger sense of interaction and participation, and their application scenarios are also different. For example, single-point gestures suit scenarios where IoT devices are controlled. By comparison, dynamic gestures, with their unique sense of participation, suit scenarios such as education, entertainment, and offline operations. This is also why we keep trying to make breakthroughs in dynamic gesture scenarios.

5.2 A Dynamic Gesture Recognition Algorithm Based on Skeleton

Last year, we developed a skeleton-based dynamic gesture recognition algorithm. Related work was submitted to ISMAR 2019 and has been published.

However, in the productization process, we found that a purely skeleton-based solution may be impractical for general dynamic gesture recognition, due to the following reasons:

Computing power: A skeleton-based pipeline, which includes gesture detection, fingertip regression, and running a time-series network, requires high computing power. IoT devices such as Tmall Genie lack this computing power, so these operations cannot be completed in time.
Motion blurring: Most dynamic gestures suffer from motion blurring because the hand moves fast, which is very unfriendly to key-point detection algorithms.

Therefore, we shifted our attention to a time-series inference solution that is based on action recognition and assisted by fingertip regression.

5.3 A Dynamic Gesture Recognition Algorithm Based on Video Understanding

Temporal Reasoning

Principle: Some behaviors cannot be recognized from a single frame. Temporal reasoning over image relations uses two or more frames that assist one another, so the behavior is interpreted from the combination of multiple frames. Because it does not depend on any single sharp frame, this solution works around motion blurring and is more controllable in computing power consumption.
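
A toy example shows why more than one frame is needed. It is far simpler than a temporal reasoning network and is not part of the described solution: `classify_swipe`, the normalized hand-centre positions, and the `min_shift` threshold are hypothetical.

```python
def classify_swipe(centres, min_shift=0.1):
    """Classify a horizontal swipe from hand-centre x positions (0..1) over time."""
    if len(centres) < 2:
        return "unknown"  # a single frame carries no motion information
    shift = centres[-1] - centres[0]
    if shift > min_shift:
        return "swipe_right"
    if shift < -min_shift:
        return "swipe_left"
    return "static"


print(classify_swipe([0.5]))            # unknown
print(classify_swipe([0.3, 0.5, 0.7]))  # swipe_right
print(classify_swipe([0.7, 0.5, 0.3]))  # swipe_left
```

The middle frame of both swipes is identical, so only the relation between frames tells the two gestures apart.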

Our Temporal Generation Network

To solve motion blurring, we adopted a video recognition solution that uses the RGB frame sequence as its main framework. The solution extracts features from continuously sampled frames and merges the time-series features with an improved 3D convolution network that is efficient and non-degenerate.

At the same time, to recognize specific gestures, we proposed an auxiliary branch based on finger key points. Heatmaps are used for multi-task learning and regression of the finger key points, which gives the motion trajectory of the finger. Then, the finger key point branch merges its features with the RGB branch to assist dynamic gesture recognition. The algorithm combines the advantages of the RGB-based and key-point-based solutions, achieving a balance between speed and accuracy.
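
The key point branch can be pictured with a small sketch that reads a fingertip position from each frame's heatmap and appends the resulting trajectory to the RGB features. How the real network merges the two branches is not specified here, so the concatenation in `fuse`, the function names, and the toy heatmaps are assumptions.

```python
def argmax2d(heatmap):
    """Return (row, col) of the highest-scoring cell."""
    best = (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > heatmap[best[0]][best[1]]:
                best = (y, x)
    return best


def fingertip_trajectory(heatmaps):
    """One fingertip position per frame, taken from that frame's heatmap."""
    return [argmax2d(h) for h in heatmaps]


def fuse(rgb_features, trajectory):
    """Late fusion by concatenation: RGB clip features, then flattened coordinates."""
    flat = [float(c) for point in trajectory for c in point]
    return list(rgb_features) + flat


# Hypothetical 3x3 fingertip heatmaps for three frames; the peak moves left to right.
frames = [
    [[0.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.1, 0.9, 0.1], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 0.0]],
]
trajectory = fingertip_trajectory(frames)
fused = fuse([0.3, 0.7], trajectory)
print(trajectory)  # [(1, 0), (1, 1), (1, 2)]
```

A classifier that receives the fused vector sees both the appearance features of the clip and the path of the fingertip, which is the kind of complementary information the two branches are meant to provide.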

6. Future Prospects

We have already explored and tried many algorithms in various businesses with single-point and dynamic gesture recognition. Based on this experience, we see the following prospects for the algorithm direction and business focus of gesture recognition.

6.1 Rise of 3D Hand Posture Estimation

3D hand posture estimation is the process of modeling human hands based on input RGB or RGB-D images and finding the positions of key components, such as knuckles. We live in a 3D world, and 3D hand posture interaction will inevitably bring a more natural and comfortable interactive experience. We are also actively exploring 3D hand posture interaction. In the future, we will launch more interactive products to provide more humanized interactive experiences and services, such as interactive displays of e-commerce products, virtual reality (VR) and augmented reality (AR), sign language recognition, and online education.

3D hand posture manipulation launched by Oculus Quest this year

6.2 The Application of Gestures in IoT Scenarios

Can gesture control surpass voice control as the most natural method of control for smart home devices? For example, in the IoT scenario, you can use gestures to control TVs, light bulbs, and air conditioners. Currently, some startup companies have begun to explore this aspect.

Take Bixi as an example. Bixi is a small remote control that senses your gestures. It can control your favorite smartphone applications, LifX or Hue lightbulbs, Internet speakers, GoPros, and many other IoT devices.

Another example is the Bearbot universal remote control shown in the following figure. With its cute appearance, Bearbot supports custom gestures to control all home appliances, so you can get rid of the shackles of traditional remote controls.

Bearbot gesture remote control.

6.3 The Application of Gestures in Education Scenarios

In addition to finger-gesture reading, gestures have more applications in education. Gestures can increase the sense of interaction in a virtual classroom. The interesting and novel experience of controlling things through gestures and vision also helps children stay focused in class. For example, gesture recognition can guide children to raise their hands before answering questions. As another example, ordinary in-class exercises may be boring, but dynamic gesture recognition lets children complete them interactively, such as by drawing a tick or a cross on the screen.
