A machine learning system from the Massachusetts Institute of Technology (MIT) recognizes sounds by watching video. The neural network interprets natural sounds in terms of image categories, without hand-annotated training data.

After being fed 26 terabytes of video data from the photo-sharing site Flickr, for example, the technology determines that birdsong sounds tend to be associated with forest scenes and pictures of trees, birds, birdhouses, and bird feeders.

The sound-recognition software, developed by a team at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), could support home-security and elder-care applications: the technology responds to potentially alarming deviations from ordinary sound patterns, such as the sound of breaking glass.

To train the system, existing computer vision systems first recognized the scenes and objects in the video frames, categorizing the images. The new system then learned correlations between those visual categories and the accompanying natural sounds.
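The two-step idea above can be sketched as a small teacher-student exercise. In this minimal, hypothetical illustration (the names, dimensions, and synthetic data are assumptions, not details of the MIT system), a vision "teacher" has already assigned scene-category probabilities to each clip, and a simple linear softmax "student" is fit on the clips' audio features so that it reproduces those probabilities from sound alone:

```python
import numpy as np

# Hypothetical sketch of cross-modal training (illustrative only, not
# the MIT system): a vision "teacher" has labeled each video clip with
# scene-category probabilities; we fit a linear softmax "student" on
# the accompanying audio features to match those soft labels.

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Synthetic stand-ins for real data.
n_clips, audio_dim, n_classes = 200, 16, 3  # e.g. forest / street / indoors
audio_feats = rng.normal(size=(n_clips, audio_dim))
true_W = rng.normal(size=(audio_dim, n_classes))
teacher_probs = softmax(audio_feats @ true_W)  # the vision network's labels

# Train the audio student with cross-entropy against the teacher's soft
# labels; the gradient of softmax cross-entropy w.r.t. the logits is
# simply (predicted probs - target probs).
W = np.zeros((audio_dim, n_classes))
lr = 0.5
for _ in range(300):
    probs = softmax(audio_feats @ W)
    grad = audio_feats.T @ (probs - teacher_probs) / n_clips
    W -= lr * grad

# The student now predicts the teacher's dominant category for most
# clips using sound features alone.
agreement = np.mean(probs.argmax(axis=1) == teacher_probs.argmax(axis=1))
print(f"teacher/student agreement: {agreement:.2f}")
```

The key property, mirrored in the article, is that no hand-annotated audio labels appear anywhere: the only supervision comes from the vision side.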

Because audio data is far less voluminous to collect and process than visual data, the researchers envision that a sound-recognition system could be used to improve the context sensitivity of mobile devices.

When coupled with GPS data, for instance, a sound-recognition system could determine that a cellphone user is in a movie theater and that the movie has started; the phone could automatically route calls to a prerecorded outgoing message. Similarly, sound recognition could improve the situational awareness of autonomous robots.

“For instance, think of a self-driving car,” said researcher Yusuf Aytar. “There’s an ambulance coming, and the car doesn’t see it. If it hears it, it can make future predictions for the ambulance — which path it’s going to take — just purely based on sound.”

