New research from Duke University details a system dubbed SonicSense that allows robots to interact with their surroundings in ways previously limited to humans.
“Robots today mostly rely on vision to interpret the world,” explained lead author Jiaxun Liu, a first-year Ph.D. student in the laboratory of Boyuan Chen, Professor of Mechanical Engineering and Materials Science at Duke. “We wanted to create a solution that could work with complex and diverse objects found on a daily basis, giving robots a much richer ability to ‘feel’ and understand the world.”
SonicSense features a robotic hand with four fingers, each equipped with a contact microphone embedded in the fingertip. These sensors detect and record vibrations generated when the robot taps, grasps, or shakes an object. And because the microphones are in contact with the object, they allow the robot to tune out ambient noise.
Based on the interactions and detected signals, SonicSense extracts frequency features and uses its previous knowledge, paired with recent advancements in AI, to determine what material the object is made of and its 3D shape. If it’s an object the system has never seen before, it might take 20 different interactions for the system to reach a conclusion. But if it’s an object already in its database, it can correctly identify it in as few as four.
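The article does not spell out the exact pipeline, but the general idea of reducing tap-and-shake recordings to frequency features and matching them against a database of known objects can be sketched roughly as follows. Everything here is an illustrative assumption rather than the authors' implementation: the sample rate, the feature binning, and the nearest-neighbor matching (SonicSense itself uses learned networks).

```python
# Illustrative sketch only: shows the general idea of frequency features
# accumulated over several interactions and matched against known objects.
import numpy as np

SAMPLE_RATE = 48_000  # assumed contact-microphone sample rate


def frequency_features(vibration: np.ndarray, n_bins: int = 64) -> np.ndarray:
    """Reduce one tap/shake recording to a coarse log-magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(vibration))
    # Pool the spectrum into a fixed number of bins so every recording,
    # regardless of length, yields a feature vector of the same size.
    pooled = np.array([chunk.mean() for chunk in np.array_split(spectrum, n_bins)])
    return np.log1p(pooled)


def identify(interactions: list[np.ndarray], database: dict[str, np.ndarray]) -> str:
    """Average features over several interactions, then return the closest known object."""
    feats = np.mean([frequency_features(x) for x in interactions], axis=0)
    return min(database, key=lambda name: np.linalg.norm(database[name] - feats))
```

In this toy version, an unfamiliar object would simply need more recordings before the averaged features settle near one database entry, which loosely mirrors the 20-versus-4 interaction counts described above.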
Here is an exclusive Tech Briefs interview, edited for length and clarity, with Chen.
Tech Briefs: What was the biggest technical challenge you faced while developing SonicSense?
Chen: I think the first is that there really haven't been extensive studies on using acoustic vibrations for robot perception. Most of the previous work has been with a single finger or has been very preliminary. But, putting this on a real robotic hand and being able to interact with a variety of possible objects is not an easy task.
Tech Briefs: How did this project come about? What was the catalyst for your work?
Chen: This is a very interesting story. Part one of my work was a project called Boombox; this was during COVID. I was thinking that I wanted to do work on robots and vision. So, I was already interested, a few years ago, in bringing acoustic vibrations into sensing, because we use acoustic vibrations a lot.
In neuroscience, human skin has vibration neurons. So, I read about these things and I was thinking about how we can bring this to robots. But, during COVID, I didn't have access to robots. I did my Ph.D. at Columbia, so I lived in a small dorm in New York City, but I really wanted to do this research. I had a random idea one day, ‘What can I do without robots to show that this is helpful?’
I had a toy bin in my room. By randomly throwing objects in there, I realized, ‘Hey, I have to go and retrieve this object, but I don't know where it is. What object did I throw in?’ That was a perfect research question.
I started with three different wooden boxes with different shapes, and I threw them into the bin. I trained an AI system that predicted the shape of the object I threw in and where the object ended up after I threw it because I could not see it. So this was the project.
The idea was basically that you have four contact microphones. You attach them around the wall of the bin. You only record acoustic vibrations from the four channels of microphones. I used microphones that are used to pick up sound from a guitar. I stuck them onto the bin, and I trained a system going from sound to that 3D prediction. And that was the beginning of that project.
Then, of course, I wanted to do this for robots. That was pretty much the birthplace of SonicSense.
Tech Briefs: Can you explain in simple terms how it works?
Chen: It's an integrated hardware and software system. The hardware part has a robotic gripper with four fingers, and each of the fingertips has an embedded contact microphone. This contact microphone doesn't sense what we’re saying, but it senses the vibrations of physical contact.
The software side basically enables the robot to autonomously explore environments through simple interactions such as tapping an object, grasping it, or shaking a container. The software will collect the signals from the four contact microphones and the motor signals together.
We train an artificial intelligence network to predict things like ‘How many dice do you have in a container? How many edges does this die have? How much liquid do you have in this water bottle? How much are you pouring into another container?’
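Chen does not describe the network architecture in the interview. As a rough, hypothetical illustration of the kind of model he mentions, a predictor that fuses the four microphone channels with the motor signals might look like the following PyTorch sketch; the layer sizes, feature dimensions, and single regression head are assumptions made here for clarity, not SonicSense's actual design.

```python
# Rough illustration only: a small network that fuses features from four
# contact-microphone channels with motor signals to predict one quantity
# (e.g., how many dice are in a container or how much liquid is in a bottle).
import torch
import torch.nn as nn


class AudioMotorNet(nn.Module):
    def __init__(self, n_mics: int = 4, n_audio_feats: int = 64, n_motor_feats: int = 8):
        super().__init__()
        # One shared encoder applied to each microphone's feature vector.
        self.audio_encoder = nn.Sequential(nn.Linear(n_audio_feats, 32), nn.ReLU())
        self.motor_encoder = nn.Sequential(nn.Linear(n_motor_feats, 16), nn.ReLU())
        self.head = nn.Linear(n_mics * 32 + 16, 1)  # single predicted quantity

    def forward(self, audio: torch.Tensor, motor: torch.Tensor) -> torch.Tensor:
        # audio: (batch, n_mics, n_audio_feats); motor: (batch, n_motor_feats)
        per_mic = self.audio_encoder(audio)                # (batch, n_mics, 32)
        fused = torch.cat([per_mic.flatten(1), self.motor_encoder(motor)], dim=1)
        return self.head(fused)                            # (batch, 1)
```

The same fused representation could, in principle, feed different heads for the different questions Chen lists, such as a count of dice, a liquid-volume estimate, or a material class.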
Tech Briefs: What are your next steps?
Chen: We're looking into a couple of new ideas. First of all, what other sensing modalities are needed for us to achieve human-level dexterity in manipulation? And even more broadly for robotics, not just for manipulation but also for locomotion, navigation, and everything: are there other sensing modalities that we need? So, exploring novel modalities that can empower robots to have capabilities that even humans or animals don't have is one direction.
Another direction we're looking at is the modalities we already have in robots, for example vision, and how we fuse all of them together to have a coherent understanding of the world instead of just one perspective. So, making lots of sensing modalities come together and learning a unified understanding.
A third direction we're looking at is to bring this to true human-level manipulation capability; the current design is very much a prototype. We want to do this by scaling up both the morphology and the sensing capability of the hand. This means putting lots of sensors on a much more human-like hand to really showcase dexterous manipulation. Right now, we're just doing object perception, but we want to be able to manipulate objects with much more advanced capabilities.