Researchers have invented an earphone that can continuously track full facial expressions by observing the contour of the cheeks — and can then translate expressions into emojis or silent speech commands. With the ear-mounted device (called C-Face), users could express emotions to online collaborators without holding cameras in front of their faces — an especially useful communication tool as much of the world engages in remote work or learning.
The device is simpler, less obtrusive, and more capable than existing ear-mounted wearable technologies for tracking facial expressions. Previous wearable approaches to facial-expression recognition mostly required sensors attached directly to the face, and even with that much instrumentation they could recognize only a limited set of discrete expressions.
With C-Face, avatars in virtual reality environments could express how their users are actually feeling and instructors could get valuable information about student engagement during online lessons. It could also be used to direct a computer system, such as a music player, using only facial cues. Because it works by detecting muscle movement, C-Face can capture facial expressions even when users are wearing masks.
The device consists of two miniature RGB cameras — digital cameras that capture red, green, and blue bands of light — positioned below each ear on a pair of headphones or earphones. The cameras record changes in facial contour caused by muscle movement: when a person makes a facial expression, the facial muscles stretch and contract, pushing and pulling the skin and changing the tension of nearby muscles. As a result, the outline of the cheeks, viewed from the ear, changes shape.
Once the images are captured, they are reconstructed using computer vision and a deep learning model. Since the raw data is in 2D, a convolutional neural network — a kind of artificial intelligence model that is good at classifying, detecting, and retrieving images — helps reconstruct the contours into expressions. The model translates the images of cheeks to 42 facial feature points, or landmarks, representing the shapes and positions of the mouth, eyes, and eyebrows, since those features are the most affected by changes in expression.
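As a rough illustration (not the authors' actual network), the final regression step of such a model can be thought of as mapping an image feature vector to 42 (x, y) landmark coordinates. A minimal pure-Python sketch, with hypothetical feature vectors and placeholder weights standing in for the trained CNN:

```python
# Sketch: the model's final regression layer maps an image feature
# vector to 42 (x, y) facial landmarks. The weights below are
# illustrative placeholders, not a trained model.
N_LANDMARKS = 42

def predict_landmarks(features, weights, bias):
    """Linear map from features to a flat vector of 84 coordinates,
    then reshape into 42 (x, y) landmark pairs."""
    flat = [
        sum(f * w for f, w in zip(features, row)) + b
        for row, b in zip(weights, bias)
    ]
    return [(flat[2 * i], flat[2 * i + 1]) for i in range(N_LANDMARKS)]

# Toy example: a 4-dimensional feature vector and arbitrary weights.
features = [0.5, -0.2, 0.1, 0.9]
weights = [[0.01 * (r + c) for c in range(4)] for r in range(2 * N_LANDMARKS)]
bias = [0.0] * (2 * N_LANDMARKS)
landmarks = predict_landmarks(features, weights, bias)
print(len(landmarks))  # 42 (x, y) pairs
```

A real CNN would produce the feature vector from stacked convolution layers; this sketch only shows the shape of the output the article describes.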
Because of restrictions caused by the COVID-19 pandemic, the researchers could test the device on only nine participants. They compared its performance with a state-of-the-art computer vision library, which extracts facial landmarks from the image of a full face captured by frontal cameras. The average error of the reconstructed landmarks was less than 0.8 mm.
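The sub-0.8 mm figure corresponds to an average Euclidean distance between the reconstructed landmarks and the reference landmarks from the frontal-camera library. A small sketch of that metric, using made-up values:

```python
import math

def mean_landmark_error(pred, ref):
    """Average Euclidean distance (e.g., in mm) over paired landmarks."""
    dists = [math.dist(p, r) for p, r in zip(pred, ref)]
    return sum(dists) / len(dists)

# Toy data: 42 predicted points each offset from the reference by
# 0.3 mm in x and 0.4 mm in y, so every pairwise distance is 0.5 mm.
ref = [(float(i), float(i)) for i in range(42)]
pred = [(x + 0.3, y + 0.4) for x, y in ref]
print(mean_landmark_error(pred, ref))  # ≈ 0.5
```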
The reconstructed expressions, represented by the 42 feature points, can also be translated into eight emojis, including “natural” and “angry,” as well as eight silent speech commands designed to control a music device, such as “play,” “next song,” and “volume up.”
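One simple way to turn a 42-point landmark set into a discrete label — a stand-in for however the researchers implemented this mapping — is nearest-template matching. The class names and template values below are hypothetical:

```python
import math

def classify_expression(landmarks, templates):
    """Return the label of the template whose landmarks are closest
    (by summed Euclidean distance) to the observed ones."""
    def dist(a, b):
        return sum(math.dist(p, q) for p, q in zip(a, b))
    return min(templates, key=lambda label: dist(landmarks, templates[label]))

# Hypothetical templates for two of the eight emoji classes.
templates = {
    "natural": [(0.0, 0.0)] * 42,
    "angry":   [(1.0, 1.0)] * 42,
}
observed = [(0.1, 0.1)] * 42  # close to the "natural" template
print(classify_expression(observed, templates))  # natural
```

The same scheme extends to the silent speech commands by adding one template (or a small classifier) per command.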
The ability to direct devices using facial expressions could be useful for working in shared workspaces, for example, where people might not want to disturb others by speaking out loud. Translating expressions into emojis could help those in virtual reality collaborations communicate more seamlessly.
One limitation to C-Face is the earphones’ limited battery capacity. As its next step, the team plans to work on a sensing technology that uses less power.
For more information, contact Jeff Tyson at