Detecting the nuances of nonverbal communication between individuals would allow robots to serve in social spaces, perceiving what the people around them are doing, what moods they are in, and whether they can be interrupted. A self-driving car could get an early warning that a pedestrian is about to step into the street by monitoring body language. Machines that understand human behavior could also open new approaches to behavioral diagnosis and rehabilitation for conditions such as autism, dyslexia, and depression.
Researchers have enabled a computer to understand the body poses and movements of multiple people from video in real time, including, for the first time, the pose of each individual's hands and fingers. The new method works with a single camera and a laptop computer.
Methods for tracking 2D human form and motion open up new ways for people and machines to interact with each other, and for people to use machines to better understand the world around them. The ability to recognize hand poses, for instance, will make it possible for people to interact with computers in new and more natural ways, such as communicating with computers simply by pointing at things. In sports analytics, real-time pose detection will make it possible for computers to track not only the position of each player on the field of play, but also what players are doing with their arms, legs, and heads at each point in time. The methods can be used for live events or applied to existing videos.
Tracking multiple people in real time, particularly in social situations where they may be in contact with each other, presents a number of challenges. Simply applying a single-person pose tracker to each individual in a group works poorly, particularly as the group grows. The researchers instead took a “bottom-up” approach that first localizes all the body parts in a scene — arms, legs, faces, etc. — and then associates those parts with particular individuals.
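As a toy illustration of this bottom-up idea, the sketch below takes every part detection in a scene first, and only then links each hand to a head. The real system learns pairwise association scores from data; the plain Euclidean-distance scoring and the `associate_parts` function here are hypothetical simplifications, not the researchers' method.

```python
import numpy as np

def associate_parts(heads, hands, max_dist=100.0):
    """Greedy bottom-up association: link detected hands to detected heads.

    heads, hands: lists of (x, y) detections found anywhere in the frame.
    Returns (head_index, hand_index) pairs, at most two hands per head.
    """
    # Consider candidate links in order of increasing distance, mimicking
    # "strongest association first" (the real method scores candidate
    # links with learned association measures, not raw pixel distance).
    candidates = sorted(
        (np.hypot(hx - px, hy - py), i, j)
        for i, (px, py) in enumerate(heads)
        for j, (hx, hy) in enumerate(hands)
    )
    pairs, hand_taken = [], set()
    for dist, i, j in candidates:
        if dist > max_dist or j in hand_taken:
            continue
        if sum(1 for hi, _ in pairs if hi == i) >= 2:
            continue  # this head already has two hands assigned
        pairs.append((i, j))
        hand_taken.add(j)
    return pairs

heads = [(100, 50), (400, 60)]               # all head detections in the scene
hands = [(90, 120), (130, 125), (410, 130)]  # all hand detections
print(associate_parts(heads, hands))         # → [(0, 0), (1, 2), (0, 1)]
```

Because all parts are detected in one pass before any grouping, the cost of the detection step does not multiply with the number of people in the scene, which is what makes the bottom-up ordering attractive for crowded groups.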
The challenges for hand detection are greater. Because people use their hands to hold objects and make gestures, a camera is unlikely to see all parts of a hand at once. And unlike faces and bodies, hands have no large image datasets annotated with part labels and positions. The key observation: for every image that shows only part of a hand, another camera at a different angle often captures a full or complementary view. The researchers therefore used thirty-one high-definition cameras in a multiview setup to build an annotated dataset of the hand.
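The multiview idea can be sketched as follows: a hand keypoint seen clearly by two cameras is triangulated to a 3D point, then reprojected into a third camera where it is occluded, yielding an annotation in that view automatically. The camera matrices and the `triangulate`/`project` helpers below are made-up stand-ins for illustration, not the actual rig's calibration or the authors' code.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 2D correspondence to a 3D point."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]   # null-space vector of A
    return X / X[3]               # homogeneous 3D point

def project(P, X):
    """Project a homogeneous 3D point through a 3x4 camera matrix."""
    x = P @ X
    return x[:2] / x[2]

# Made-up rig: three identical cameras translated along the x axis.
P1 = np.hstack([np.eye(3), [[0.0], [0.0], [0.0]]])
P2 = np.hstack([np.eye(3), [[-1.0], [0.0], [0.0]]])
P3 = np.hstack([np.eye(3), [[-2.0], [0.0], [0.0]]])

X_true = np.array([0.5, 0.2, 4.0, 1.0])        # a fingertip in 3D
x1, x2 = project(P1, X_true), project(P2, X_true)
X_est = triangulate(P1, P2, x1, x2)            # recover the 3D point
label_in_view3 = project(P3, X_est)            # auto-label the occluded view
print(label_in_view3)                          # ≈ [-0.375, 0.05]
```

Iterating this process — detect where the hand is visible, triangulate, reproject into occluded views — is how a multi-camera rig can bootstrap part annotations that no single camera could supply on its own.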