New research from Duke University details a system dubbed SonicSense that allows robots to interact with their surroundings in ways previously limited to humans.

“Robots today mostly rely on vision to interpret the world,” explained lead author Jiaxun Liu, a first-year Ph.D. student in the laboratory of Boyuan Chen, Professor of Mechanical Engineering and Materials Science at Duke. “We wanted to create a solution that could work with the complex and diverse objects found on a daily basis, giving robots a much richer ability to ‘feel’ and understand the world.”

SonicSense features a robotic hand with four fingers, each equipped with a contact microphone embedded in the fingertip. These sensors detect and record vibrations generated when the robot taps, grasps, or shakes an object. And because the microphones are in contact with the object, they let the robot tune out ambient noise.

Based on the interactions and detected signals, SonicSense extracts frequency features and uses its prior knowledge, paired with recent advancements in AI, to determine the object's material and 3D shape. If it's an object the system has never seen before, it might take 20 different interactions for the system to come to a conclusion. But if it's an object already in its database, it can correctly identify it in as few as four.
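The article does not spell out the exact feature pipeline, but the general idea can be sketched in a few lines of Python. The snippet below is a minimal illustration rather than the authors' code: it extracts a frequency-domain signature from a single tap recording and matches it against stored signatures of known objects by nearest neighbor. The file names, sample rate, and matching rule are assumptions for the sake of the example.

# Minimal sketch (not the authors' code): summarize the frequency content
# of a fingertip tap recording and compare it to stored object signatures.
# File names, sample rate, and nearest-neighbor matching are illustrative
# assumptions.
import numpy as np
import librosa

SAMPLE_RATE = 48_000  # assumed contact-microphone sample rate

def tap_features(wav_path: str) -> np.ndarray:
    """Log-mel spectrogram of one tap, averaged over time into a feature vector."""
    signal, _ = librosa.load(wav_path, sr=SAMPLE_RATE, mono=True)
    mel = librosa.feature.melspectrogram(y=signal, sr=SAMPLE_RATE, n_mels=64)
    log_mel = librosa.power_to_db(mel)
    return log_mel.mean(axis=1)  # 64-dim summary of the frequency content

def identify(query: np.ndarray, database: dict[str, np.ndarray]) -> str:
    """Return the known object whose stored signature is closest to the query."""
    return min(database, key=lambda name: np.linalg.norm(database[name] - query))

# Usage with hypothetical recordings: build signatures from past interactions,
# then match a new tap against them.
db = {"ceramic_mug": tap_features("mug_tap.wav"),
      "plastic_bottle": tap_features("bottle_tap.wav")}
print(identify(tap_features("unknown_tap.wav"), db))

In practice, repeated interactions (the "20 taps" for a new object) would each contribute a feature vector, and the system would aggregate them before deciding.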

Here is an exclusive Tech Briefs interview with Chen, edited for length and clarity.

Tech Briefs: What was the biggest technical challenge you faced while developing SonicSense?

Chen: I think the first is that there really haven't been extensive studies on using acoustic vibrations for robot perception. Most of the previous work has been with a single finger or has been very preliminary. But, putting this on a real robotic hand and being able to interact with a variety of possible objects is not an easy task.

Tech Briefs: How did this project come about? What was the catalyst for your work?

Chen: This is a very interesting story. An earlier piece of my work was called Boombox; this was during COVID. I was thinking I wanted to do work on robots and vision. So, I was already interested a few years ago in bringing acoustic vibrations into sensing, because we use acoustic vibrations a lot.

In neuroscience, human skin has neurons that sense vibration. So, I read about these things and I was thinking about how we can bring this to robots. But, during COVID, I didn't have access to robots. I did my Ph.D. at Columbia, so I lived in a small dorm in New York City, but I really wanted to do this research. I had a random idea one day: ‘What can I do without robots to show that this is helpful?’

I had a toy bin in my room. By randomly throwing objects in there, I realized, ‘Hey, I have to go and retrieve this object, but I don't know where it is. What object did I throw in?’ That was a perfect research question.

I started with three different wooden boxes with different shapes, and I threw them into the bin. I trained an AI system that predicted the shape of the object I threw in and where the object ended up after I threw it because I could not see it. So this was the project.

The idea was basically that you have four contact microphones attached around the wall of the bin. You only record acoustic vibrations from the four channels of microphones. I used the kind of microphones that pick up sound from a guitar. I stuck them onto the bin, and I trained a system going from sound to the prediction of this 3D reconstruction. And that was the beginning of that project.

Then, of course, I wanted to do this for robots. That was pretty much the birthplace of SonicSense.

Tech Briefs: Can you explain in simple terms how it works?

Chen: It's an integrated hardware and software system. The hardware part has a robotic gripper with four fingers, and each of the fingertips has an embedded contact microphone. This contact microphone doesn't sense what we’re saying, but it senses the vibrations of physical contact.

The software side basically enables the robot to autonomously explore environments through simple motions like tapping or grasping an object, or shaking a container. The software collects the signals from the four contact microphones together with the motor signals.

We train an artificial intelligence network to predict things like ‘How many dice do you have in a container? How many edges does this die have? How much liquid do you have in this water bottle? How much are you pouring into another container?’
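As a rough illustration of how such a predictor might be wired up, here is a hypothetical sketch, not the published architecture: the four microphone channels are encoded as a multi-channel spectrogram, the motor readings are encoded separately, and the two are fused before a prediction head. All layer sizes, input shapes, and the PyTorch framing are assumptions.

# Toy sketch (not the published architecture): fuse the four fingertip
# microphone channels (as log-mel spectrograms) with the motor/joint signals
# recorded during a shake, and predict an inventory count such as the number
# of dice in a container. All shapes and layer sizes are illustrative.
import torch
import torch.nn as nn

class InventoryNet(nn.Module):
    def __init__(self, n_mics=4, n_mels=64, n_motor=6, n_classes=6):
        super().__init__()
        # Treat the four microphone spectrograms as a 4-channel "image".
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(n_mics, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (B, 32)
        )
        self.motor_encoder = nn.Sequential(nn.Linear(n_motor, 16), nn.ReLU())
        self.head = nn.Linear(32 + 16, n_classes)           # e.g. 0..5 dice

    def forward(self, spectrograms, motor_state):
        a = self.audio_encoder(spectrograms)   # (B, 4, n_mels, time) -> (B, 32)
        m = self.motor_encoder(motor_state)    # (B, n_motor) -> (B, 16)
        return self.head(torch.cat([a, m], dim=1))

# Usage with random stand-in data: a batch of 2 shakes, 64 mel bins x 100 frames.
net = InventoryNet()
logits = net(torch.randn(2, 4, 64, 100), torch.randn(2, 6))
print(logits.shape)  # torch.Size([2, 6])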

Tech Briefs: What are your next steps?

Chen: We're looking into a couple of new ideas. First of all, what other sensing modalities are needed for us to achieve human-level dexterity in manipulation? And even more broadly for robotics, not just for manipulation but also for locomotion, navigation, and everything else: are there other sensing modalities that we need? So, exploring novel modalities that can empower robots with capabilities that even humans or animals don’t have is one direction.

Another direction we're looking at is the other modalities we already have in robots, such as vision, and how we fuse all the modalities together to get a coherent understanding of the world instead of just one perspective. So, it's about making lots of sensing modalities come together to learn a unified understanding.

A third direction we're looking at is bringing this to true human levels of manipulation capability; the current design is very much a prototype. We want to scale up both the morphology and the sensing capability of the hand. That means putting lots of sensors on a much more human-like hand and really showcasing dexterous manipulation capability. Right now we're just doing object perception, but we want to be able to manipulate objects with much more advanced capabilities.



Transcript

00:00:00 We introduce SonicSense, an integrated hardware and software framework to enable acoustic vibration sensing for rich robot object perception. Recent work has leveraged acoustic vibration sensing for object material and category classification, position prediction, estimating the amount and flow of granular material, and collectively performing object spatial reasoning for

00:00:23 visual reconstruction. However, previous work focused on a small number of primitive objects with homogeneous material composition, constrained settings for data collection, and single-finger testing. Therefore, it is not clear whether acoustic vibration sensing can be helpful for object perception under noisy and less controlled conditions. We

00:00:44 present SonicSense, a holistic design on both hardware and algorithm advancements for object perception through in-hand acoustic vibration sensing. Our robot hand has four fingers. A piezoelectric contact microphone is embedded inside each fingertip, and a round counterweight is mounted on the outer shell surface to increase the momentum of the finger motion. Our intuitive mechanical design

00:01:07 enables a range of interactive motion primitives for object perception, including tapping, grasping, and shaking motions. The embedded contact microphone is able to collect high-frequency acoustic vibrations created by object-object contact or object-hand interactions. Our robot can infer the geometry and inventory status of various objects inside a container from

00:01:31 their unique acoustic vibration signatures during interactions. We derive 12 interpretable features based on traditional acoustic signal processing methods to help distinguish these different acoustic vibration signatures. We performed an unsupervised nonlinear dimensionality reduction with t-SNE on this 12-dimensional feature vector. By shaking the container, our robot can

00:01:54 successfully distinguish different numbers of dice, or dice with different shapes, inside the container. When pouring water into a bottle held by our robot, we can detect the subtle differences in acoustic signatures based on the existing amount of water inside the bottle. Our robot can also detect different amounts of water inside the bottle when shaking it. In more

00:02:15 challenging object perception tasks, we developed a dataset with 83 diverse real-world objects. Our objects cover nine material categories and a variety of geometries, from simple primitives to complex shapes. Unlike previous work that uses humans to manually hold the robot's hand to interact with objects, or designs fixed interaction poses and forces for replay, we derive a simple but effective

00:02:40 heuristic-based interaction policy to autonomously collect the acoustic vibration response of objects. Our policy works well for all our real-world objects, covering variable sizes and geometries. We trained a material classification model that takes in the mel spectrogram of our collected acoustic vibration signal from the impact sound and learns to predict the

00:03:02 material label. The network takes the form of three convolutional neural network layers followed by two MLP layers. The initial result of our method leads to a 0.523 F1 score. However, we observed that object materials are relatively uniform and smooth around local regions. Based on this assumption, we can iteratively refine our prediction. Our final average F1

00:03:25 score reaches 0.763. Our shape reconstruction model takes the sparse and noisy contact points to generate a dense and complete 3D shape of the object. We stack two PointNet layers to encode the input and then feed the global feature vector into a decoder network with fully connected layers to produce the final point cloud. Our results obtained an average Chamfer distance score of

00:03:50 0.00876 m. The prediction on objects with primitive shapes generally has near-perfect performance. Additionally, our method exhibits the capability to reconstruct objects with complex shapes only through sparse and noisy contact point estimations. When an object has been interacted with by the robot, with its acoustic vibration responses, we aim to

00:04:13 have our robot re-identify the object through a set of 15 new tapping interactions. We input both the collection of mel spectrograms and their associated contact points from the 15 interactions to the network to predict the label of this object. Among 82 objects in our dataset, our robot can re-identify the same object with more than 92% accuracy. Our robot has a strong resistance against ambient

00:04:37 noise and only focuses on vibration signals through physical contact. This ensures high-quality and reliable sensing data under challenging environmental conditions. Our entire robot hand costs $215 with commercially available components and 3D printing. Our experimental results demonstrate the versatility and efficacy of our design on a variety of object perception

00:05:01 tasks, including solid and liquid object inventory status estimation within containers, material classification, 3D shape reconstruction, and object re-identification. Overall, our method presents unique contributions to tactile perception with acoustic vibrations and opens up new opportunities for future robot designs to build a more robust, complete,

00:05:23 versatile, and holistic perceptual model of the world.
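For readers who want a concrete picture of the material classifier described above (three convolutional layers followed by two MLP layers over a mel spectrogram, predicting one of the nine material categories), here is a minimal PyTorch sketch. The channel widths, kernel sizes, and pooling choices are assumptions rather than the authors' exact hyperparameters.

# Minimal sketch of a classifier with the shape described in the video:
# three convolutional layers followed by two MLP layers, taking a mel
# spectrogram of an impact sound and predicting one of nine material
# categories. Layer widths and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class MaterialClassifier(nn.Module):
    def __init__(self, n_materials=9):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mlp = nn.Sequential(
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_materials),
        )

    def forward(self, mel_spectrogram):          # (B, 1, n_mels, time)
        return self.mlp(self.conv(mel_spectrogram))

# Usage with a random stand-in spectrogram (1 channel, 64 mel bins, 128 frames):
model = MaterialClassifier()
print(model(torch.randn(1, 1, 64, 128)).shape)   # torch.Size([1, 9])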