Vision systems have allowed robots to tell the difference between, say, the handle of a screwdriver and the head of a wrench.

Carnegie Mellon researcher Lerrel Pinto wants to throw the tools around a bit and prove that sound can be just as valuable an identification asset for robots.

A team at Carnegie Mellon is creating an audio dataset of clashes, crashes, and clangs — a collection of common objects sliding around a tray and hitting its sides. A falling metal wrench, it turns out, sounds different than a falling metal screwdriver.

Because each object has a unique weight and shape, it makes a characteristic noise, one that researchers at CMU's Robotics Institute found robots can learn to recognize. The Carnegie Mellon team hopes to eventually advance this audio-recognition capability to the point where robots can be instrumented with canes that tap on objects to identify them.

"A lot of preliminary work in other fields indicated that sound could be useful, but it wasn't clear how useful it would be in robotics," said Lerrel Pinto, who recently earned his Ph.D. in robotics at CMU and will join the faculty of New York University this fall.

Pinto and his colleagues found that the robots that used sound successfully classified objects 76 percent of the time.

To perform their study, the CMU team simultaneously recorded video and audio of 60 common objects, including toy blocks, wrenches, screwdrivers, shoes, apples, and tennis balls. The objects rolled around in an experimental apparatus called "Tilt-Bot," a square tray attached to the arm of a Sawyer robot from Rethink Robotics.

Using Tilt-Bot, the CMU team collected 15,000 interactions on 60 different objects by tilting them in a tray. When sufficiently tilted, the object slides across the tray and hits the walls of the tray. This generates sound, which is captured by four contact microphones mounted on each side of the tray. An overhead camera records visual (RGB+Depth) information, while the robotic arm applies the tilting actions through end-effector rotations. (Image Credit: CMU)

To collect the audio data, Pinto placed an object in the Tilt-Bot and let the robot spend a few hours moving the tray in random directions, with varying levels of tilt, as cameras and microphones recorded each action.

The resulting dataset of the many objects catalogued 15,000 interactions.
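In simplified form, the collection procedure above amounts to a loop of random tilt actions with synchronized recording. The sketch below illustrates that loop only; the helper functions, parameter names, and the 5–30 degree tilt range are hypothetical stand-ins, not the actual Tilt-Bot API.

```python
import random

# Hypothetical stand-ins for the real robot and sensor interfaces.
def apply_tilt(direction, magnitude):
    """Pretend to rotate the tray via the arm's end effector."""
    return {"direction": direction, "magnitude": magnitude}

def record_clip():
    """Pretend to capture a synchronized audio + RGB-D recording."""
    return {"audio": [], "video": []}

def collect_interactions(n_interactions, objects):
    """Sketch of the Tilt-Bot loop: one logged interaction per random tilt."""
    dataset = []
    for _ in range(n_interactions):
        obj = random.choice(objects)                      # object currently in the tray
        action = apply_tilt(direction=random.uniform(0, 360),
                            magnitude=random.uniform(5, 30))  # degrees (assumed range)
        dataset.append({"object": obj,
                        "action": action,
                        "recording": record_clip()})
    return dataset

data = collect_interactions(10, ["wrench", "screwdriver", "tennis ball"])
print(len(data))  # prints 10
```

In the real setup, each logged interaction pairs the tilt action with the sound of the object striking the tray walls, which is what makes the dataset useful for learning.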


The team discovered that object representations derived from the audio embeddings were indicative of implicit physical properties. Or, put more simply: you can identify an object by its sound.
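The paper learns audio embeddings with a neural network; as a loose illustration of the underlying idea only (not the authors' method), the toy sketch below classifies synthetic "impact" sounds using plain-numpy spectral features and a nearest-centroid rule. All function names, the two object classes, and the signal parameters are invented for illustration.

```python
import numpy as np

def spectrogram_features(signal, frame=256, hop=128):
    """Average log-magnitude spectrum over windowed frames: a crude audio embedding."""
    frames = [signal[i:i + frame] * np.hanning(frame)
              for i in range(0, len(signal) - frame, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(mags).mean(axis=0)  # one feature vector per clip

def synth_impact(freq, decay, n=4096, rate=16000, rng=None):
    """Toy 'impact' sound: a decaying tone plus noise, standing in for a real recording."""
    if rng is None:
        rng = np.random.default_rng(0)
    t = np.arange(n) / rate
    return np.exp(-decay * t) * np.sin(2 * np.pi * freq * t) \
        + 0.05 * rng.standard_normal(n)

# Two made-up object classes: a "wrench" rings lower than a "screwdriver".
rng = np.random.default_rng(42)
train = {
    "wrench": [synth_impact(300 + 10 * rng.standard_normal(), 8, rng=rng)
               for _ in range(5)],
    "screwdriver": [synth_impact(900 + 10 * rng.standard_normal(), 20, rng=rng)
                    for _ in range(5)],
}
centroids = {label: np.mean([spectrogram_features(s) for s in clips], axis=0)
             for label, clips in train.items()}

def classify(signal):
    """Assign the label whose feature centroid is nearest to the clip's features."""
    f = spectrogram_features(signal)
    return min(centroids, key=lambda lbl: np.linalg.norm(f - centroids[lbl]))

print(classify(synth_impact(320, 8, rng=rng)))  # prints "wrench"
```

The real system replaces the hand-built spectral features and centroid rule with learned embeddings, which is what pushes accuracy to the roughly 76 percent reported on 60 object classes.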

The researchers presented their findings last month at the virtual Robotics: Science and Systems conference. Other team members included Abhinav Gupta, associate professor of robotics, and Dhiraj Gandhi, a former master's student who is now a research scientist at Facebook Artificial Intelligence Research's Pittsburgh lab.

In a Q&A below, Pinto tells Tech Briefs about the kinds of applications that are possible when a robot can identify an object by giving it a tap.

Tech Briefs: What applications are especially valuable when a robot can detect a particular sound?

Lerrel Pinto: Sound can help identify objects when visual information is insufficient: scenarios with low light, occlusions, or environments into which cameras are hard to integrate. Sound is also useful in settings where vision simply cannot capture the relevant information.

Tech Briefs: Can you bring us through a theoretical application that comes to mind?

Lerrel Pinto: Say you see a soda can. Vision alone cannot tell you how much soda is left in the can, but the sound the can makes when you interact with it contains that information.

Tech Briefs: A recent CMU press release mentioned that you were "surprised" at how useful sound proved to be in the support of robotics. How so?

Lerrel Pinto: So to clarify, what I meant was that it wasn't surprising that sound gives us useful information. What was surprising was how useful this information could be. In the task of object classification, for instance, we achieve nearly 80% accuracy on 60 objects, where random prediction would be around 1.6%.

Tech Briefs: What inspired you to work with sound?

Lerrel Pinto: Our past work has looked at visual information and how it can be used by robots. However, as I've described earlier, there are several settings in which vision alone might be insufficient to solve a robotic task. Audio sensors, or microphones, are usually quite cheap and provide a robust way to obtain this missing information. Moreover, from a biological perspective, we as humans don't just have eyes; we have ears as well.

Tech Briefs: What sounds are difficult for a robot to detect?

Lerrel Pinto: One challenge with sound is that there is always a lot of ambient sound. In our setting, we have an additional source of sound: the robot moving. So if the object makes sounds similar to the robot, our models may not be able to distinguish them.

Tech Briefs: What will you be working on next?

Lerrel Pinto: In the current work, we show how sound provides useful information that is hard to obtain otherwise. In the future, we want to work on using this additional information to solve or improve real-world robotic tasks like, say, manipulating objects.
