To charge, the microphones automatically return to their charging station. (Image: April Hong/University of Washington)

With the help of the team’s deep-learning algorithms, the system lets users mute certain areas or separate simultaneous conversations, even if two adjacent people have similar voices.

Like a fleet of Roombas, the microphones, each about an inch in diameter, automatically deploy from, and then return to, a charging station. This allows the system to be moved between environments and set up automatically. In a conference room meeting, for instance, such a system might be deployed instead of a central microphone, allowing better control of in-room audio.

“If I close my eyes and there are 10 people talking in a room, I have no idea who’s saying what and where they are in the room exactly. That’s extremely hard for the human brain to process. Until now, it’s also been difficult for technology,” said Co-Lead Author Malek Itani. “For the first time, using what we’re calling a robotic ‘acoustic swarm,’ we’re able to track the positions of multiple people talking in a room and separate their speech.”

The team’s prototype consists of seven small robots that spread themselves across tables of various sizes. As the robots move from their charger, each one emits a high-frequency sound, like a bat navigating, using this signal and other sensors to avoid obstacles and move around without falling off the table. The automatic deployment allows the robots to place themselves for maximum accuracy, permitting greater sound control than if a person placed them. The robots disperse as far from each other as possible, since greater distances make it easier to differentiate and locate the people speaking. Today’s consumer smart speakers have multiple microphones, but because they are clustered on the same device, they’re too close to allow for this system’s mute and active zones.

The tiny individual microphones are able to navigate around clutter and position themselves using only sound. (Image: April Hong/University of Washington)

“If I have one microphone a foot away from me, and another microphone two feet away, my voice will arrive at the microphone that’s a foot away first. If someone else is closer to the microphone that’s two feet away, their voice will arrive there first,” said Co-Lead Author Tuochao Chen. “We developed neural networks that use these time-delayed signals to separate what each person is saying and track their positions in a space. So, you can have four people having two conversations and isolate any of the four voices and locate each of the voices in a room.”
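The team’s system relies on trained neural networks for this, but the timing cue Chen describes is easy to see in code. Purely as an illustration, here is a minimal Python sketch with hypothetical table-top coordinates (not from the paper), assuming sound travels at about 343 m/s, that computes how much earlier each voice reaches the nearer microphone.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second at room temperature

def arrival_time(speaker_pos, mic_pos):
    """Time in seconds for sound from speaker_pos to reach mic_pos."""
    return math.dist(speaker_pos, mic_pos) / SPEED_OF_SOUND

# Hypothetical 2-D positions in meters (illustrative only, not from the paper).
mic_1 = (0.0, 0.0)
mic_2 = (0.0, 0.6)
speakers = {"A": (0.30, 0.05),   # roughly a foot from mic_1
            "B": (0.15, 0.75)}   # closer to mic_2

for name, pos in speakers.items():
    delay_ms = (arrival_time(pos, mic_2) - arrival_time(pos, mic_1)) * 1000
    # A positive delay means the voice reached mic_1 first; negative means mic_2.
    print(f"Speaker {name}: mic_2 lags mic_1 by {delay_ms:+.2f} ms")
```

A model trained on many such arrival-time patterns, across all seven microphones, can in principle attribute each snippet of speech to a position on the table, which is the cue the team’s networks build on.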

The team tested the robots in offices, living rooms and kitchens with groups of three to five people speaking. Across all these environments, the system could discern different voices within 1.6 feet (50 centimeters) of each other 90 percent of the time, without prior information about the number of speakers. The system was able to process three seconds of audio in 1.82 seconds on average — fast enough for live streaming, though a bit too long for real-time communications such as video calls.

The researchers plan to eventually make microphone robots that can move around rooms, instead of being limited to tables. The team is also investigating whether the speakers can emit sounds that allow for real-world mute and active zones, so people in different parts of a room can hear different audio.

The team acknowledges the potential for misuse and has built in safeguards against it: The microphones navigate with sound, not an onboard camera like other similar systems. The robots are easily visible, and their lights blink when they’re active. Instead of processing the audio in the cloud, as most smart speakers do, the acoustic swarms process all the audio locally, as a privacy constraint. And even though some people’s first thoughts may be about surveillance, the system can be used for the opposite, the team says.

“It has the potential to actually benefit privacy, beyond what current smart speakers allow,” Itani said. “I can say, ‘Don’t record anything around my desk,’ and our system will create a bubble 3 feet around me. Nothing in this bubble would be recorded. Or if two groups are speaking beside each other and one group is having a private conversation, while the other group is recording, one conversation can be in a mute zone, and it will remain private.”

Here is an exclusive Tech Briefs interview, edited for length and clarity, with Itani.

Tech Briefs: I’m sure there were too many to count, but what was the biggest technical challenge you faced while developing this smart speaker?

Itani: While there were plenty of challenges along the way, from designing the circuits in such a small form factor to coordinating the robot motion, the biggest challenge we faced was generalizing our design to work in real-world environments. It’s one thing for the system to work in a neatly controlled simulated environment, but it’s a completely different thing to design the system so that it works in the ordinary rooms where it will ultimately be used.

Once you move to the real world, a whole host of problems suddenly appears. For example, objects on the tables cause reflections that interfere with the swarm’s ability both to localize itself and to localize the speakers in a room. Additionally, noise from various sources, such as ventilation systems, artifacts from audio compression while streaming, or even unavoidable random fluctuations in the microphone recordings, had to be accounted for when collecting data to train our algorithms. We had to address all these practical issues so we could design a system that actually works in the real world.

Tech Briefs: Can you explain in simple terms how it works?

Itani: We can look at the system as two parts: the self-distributing smart speaker and the speaker separation and localization processing.

When the robots move, they emit sounds to find their distance to the other robots. When one robot emits a sound, the other robots measure the time it takes for that sound to reach them and use the speed of sound to calculate the distance to the emitting robot. The robots measure the distances between one another and calculate the positions of all the robots. They can then coordinate their movements based on these positions, without requiring any cameras, just sound.
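The full self-localization routine in the paper is more involved, but the core arithmetic Itani describes, measured time of flight multiplied by the speed of sound, with positions recovered from the resulting distances, can be sketched briefly. The example below is a simplified 2-D illustration with hypothetical timings (not the team’s code): it converts two time-of-flight readings into distances and places a third robot relative to two robots whose positions are already known.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second

def tof_to_distance(time_of_flight_s):
    """Convert a measured acoustic time of flight into a distance in meters."""
    return SPEED_OF_SOUND * time_of_flight_s

def locate_from_two_anchors(p1, p2, d1, d2):
    """
    Place a robot in 2-D given its distances d1 and d2 to two robots at known
    positions p1 and p2. Two mirror-image solutions exist; both are returned.
    """
    base = math.dist(p1, p2)
    a = (d1**2 - d2**2 + base**2) / (2 * base)   # projection onto the p1->p2 axis
    h_sq = d1**2 - a**2
    if h_sq < 0:
        raise ValueError("Distances are inconsistent with the anchor spacing.")
    h = math.sqrt(h_sq)
    ux, uy = (p2[0] - p1[0]) / base, (p2[1] - p1[1]) / base   # unit vector p1 -> p2
    fx, fy = p1[0] + a * ux, p1[1] + a * uy                   # foot of the perpendicular
    return (fx - h * uy, fy + h * ux), (fx + h * uy, fy - h * ux)

# Hypothetical example: two robots 0.5 m apart, time-of-flight readings in seconds.
p1, p2 = (0.0, 0.0), (0.5, 0.0)
d1 = tof_to_distance(0.00117)   # about 0.40 m
d2 = tof_to_distance(0.00088)   # about 0.30 m
print(locate_from_two_anchors(p1, p2, d1, d2))
```

With seven robots, many such pairwise distances are available, so in practice the geometry is far better constrained than in this two-anchor toy case.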

As for the speech processing, the key observation is that when multiple robots are spread across a table, a sound from a specific location arrives at the different robots at different times. We use this observation to build deep-learning algorithms that find the location of every speaker in a noisy room and separate out what each one is saying.
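The published system does this separation with deep-learning models, but a much simpler classical stand-in, delay-and-sum beamforming, uses the same observation and may help make it concrete. The sketch below (synthetic signals and hypothetical geometry, not the team’s method) time-aligns each robot’s recording for a chosen talker position before averaging, so sound from that spot adds up coherently while sound from elsewhere does not.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # meters per second
FS = 16_000              # sample rate in Hz

def delays_in_samples(source, mics):
    """Per-microphone propagation delay, in samples, for a given source position."""
    dists = np.linalg.norm(np.asarray(mics, float) - np.asarray(source, float), axis=1)
    return np.round(dists / SPEED_OF_SOUND * FS).astype(int)

def delay_and_sum(recordings, delays):
    """Advance each channel by its delay and average, reinforcing one location."""
    n = min(len(r) - d for r, d in zip(recordings, delays))
    aligned = [r[d:d + n] for r, d in zip(recordings, delays)]
    return np.mean(aligned, axis=0)

# Hypothetical geometry in meters: four robots spread across a table, one talker.
mics = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
talker = (0.3, 0.8)

# Synthesize "recordings": the same 300 Hz tone, arriving later at farther robots.
t = np.arange(FS) / FS
clean = np.sin(2 * np.pi * 300 * t)
talker_delays = delays_in_samples(talker, mics)
recordings = [np.concatenate([np.zeros(d), clean]) for d in talker_delays]

# Steering toward the talker's position lines the copies up before summing.
enhanced = delay_and_sum(recordings, talker_delays)
print(enhanced.shape, float(np.max(enhanced)))
```

The team’s neural networks go well beyond this simple steering, separating overlapping voices and tracking their positions rather than enhancing one known spot.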

Tech Briefs: What’s your next step? Do you have any plans for further research?

Itani: We are currently deploying such speech technologies on hearable devices, such as headphones and earphones. We have a new suite of projects in the works that we are extremely excited to share in the next few months.

Tech Briefs: Do you have any advice for engineers/researchers aiming to bring their ideas to fruition?

Itani: Many ideas that we explored and presented in the paper seemed completely impossible to us just a while back. They just seemed so wild and out of reach. And to be fair, we did face various setbacks along the way and had to learn new techniques and develop novel methods to address them. The project was a huge learning experience for all of us, and it has really helped us ideate and develop some of our current ongoing research. The bottom line is, although many ideas can seem large, intimidating, or even impossible, they can guide us in directions to acquire very valuable skills and perspectives. So, it’s really important to give yourself the opportunity to explore these ideas — you never know where they’ll take you.