Automatic speech recognition is on the verge of becoming the chief means of interacting with computing devices. In anticipation of that shift, MIT researchers have built a low-power chip specialized for automatic speech recognition. Whereas a cellphone running speech recognition software might require about 1 watt of power, the new chip requires between 0.2 and 10 milliwatts, depending on the number of words it has to recognize. That probably translates to a power savings of 90 to 99 percent, which could make voice control practical for relatively simple electronic devices, including power-constrained devices that harvest energy from their environments or go months between battery charges.
“Speech input will become a natural interface for many wearable applications and intelligent devices,” said Anantha Chandrakasan, the Vannevar Bush Professor of Electrical Engineering and Computer Science at MIT, whose group developed the chip. “The miniaturization of these devices will require a different interface than touch or keyboard. It will be critical to embed the speech functionality locally to save system energy consumption compared to performing this operation in the cloud.”
Today, the speech recognizers that perform best are based on neural networks, much like other state-of-the-art artificial intelligence systems. These virtual networks of simple information processors are roughly modeled on the human brain. Much of the new chip's circuitry is concerned with implementing speech recognition networks as efficiently as possible.
But even the most power-efficient speech recognition system would quickly drain a device's battery if it ran without interruption. So the chip also includes a simpler “voice activity detection” circuit that monitors ambient noise to determine whether it might be speech. If the answer is yes, the chip fires up the larger, more complex speech recognition circuit.
The researchers tested three different voice activity detection circuits, with different degrees of complexity and, consequently, different power demands. Which circuit is most power-efficient depends on context, but in tests simulating a wide range of conditions, the most complex of the three circuits led to the greatest power savings for the system. Even though it consumed almost three times as much power as the simplest circuit, it generated far fewer false positives; the simpler circuits often lost their energy savings by spuriously activating the rest of the chip.
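The trade-off between detector power and false positives can be illustrated with a rough model. The sketch below is not the researchers' evaluation method: the function `average_power_mw`, the detector power figures, the false-positive rates, and the fraction of time containing speech are all hypothetical values chosen for illustration. Only the roughly threefold power ratio between detectors and the recognizer's milliwatt range come from the article, and the model ignores missed speech.

```python
def average_power_mw(detector_mw, recognizer_mw, speech_fraction, false_positive_rate):
    """Average system power when an always-on detector gates the recognizer.

    The recognizer runs whenever speech is present, plus whenever the
    detector mistakes non-speech noise for speech (a false positive).
    """
    active_fraction = speech_fraction + (1.0 - speech_fraction) * false_positive_rate
    return detector_mw + active_fraction * recognizer_mw


# Hypothetical operating point: speech is present 5% of the time and the
# recognizer draws 5 mW when awake (inside the article's 0.2-10 mW range).
RECOGNIZER_MW = 5.0
SPEECH_FRACTION = 0.05

# Hypothetical detectors: the complex one draws three times the power
# but raises far fewer false alarms.
simple_total = average_power_mw(0.01, RECOGNIZER_MW, SPEECH_FRACTION, false_positive_rate=0.30)
complex_total = average_power_mw(0.03, RECOGNIZER_MW, SPEECH_FRACTION, false_positive_rate=0.02)

print(f"simple detector:  {simple_total:.3f} mW average")
print(f"complex detector: {complex_total:.3f} mW average")
```

Under these assumed numbers the recognizer's wake-ups dominate the total, so the more power-hungry detector still gives the lower system average. With a much quieter environment or a cheaper recognizer, the ranking could reverse, which is consistent with the article's point that the best circuit depends on context.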
Neural networks consist of thousands of processing nodes capable of only simple computations, but densely connected to each other. In the type of network commonly used for voice recognition, the nodes are arranged into layers. Voice data are fed into the bottom layer of the network, whose nodes process and pass them to the nodes of the next layer, whose nodes process and pass them to the next layer, and so on. The output of the top layer indicates the probability that the voice data represent a particular speech sound. A voice recognition network is too big to fit in a chip's onboard memory, which is a problem because going off-chip for data is much more energy-intensive than retrieving it from local stores. The new design concentrates on minimizing the amount of data the chip has to retrieve from off-chip memory.
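The layer-by-layer flow described above can be sketched in a few lines. This is a toy illustration, not the chip's network: the function names, the tiny layer sizes, the weights, and the choice of a ReLU nonlinearity with a softmax at the top layer are all assumptions, since the article does not say which functions the nodes compute.

```python
import math


def dense(inputs, weights, biases):
    """Weighted sum for each receiving node (one row of weights per node)."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Assumed node nonlinearity: pass positive values, zero out the rest."""
    return [max(0.0, v) for v in values]


def softmax(scores):
    """Turn top-layer scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sound_probabilities(frame_features, layers):
    """Feed one frame of voice data upward through the layers."""
    activations = frame_features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    top_weights, top_biases = layers[-1]
    return softmax(dense(activations, top_weights, top_biases))


# Hypothetical toy network: 3 input features -> 2 hidden nodes -> 2 speech sounds.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [-0.5, 0.7]], [0.0, 0.0]),
]
probs = sound_probabilities([0.2, 0.4, 0.6], toy_layers)
print(probs)
```

Each entry of `probs` plays the role of the top layer's output: an estimated probability that the frame corresponds to one particular speech sound.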
A node in the middle of a neural network might receive data from a dozen other nodes and transmit data to another dozen. Each of those two dozen connections has an associated “weight” — a number that indicates how prominently data sent across it should factor into the receiving node's computations. The first step in minimizing the new chip's memory bandwidth is to compress the weights associated with each node. The data are decompressed only after they're brought on-chip.
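The article does not say how the weights are compressed. As a stand-in, the sketch below uses simple 8-bit linear quantization, a common technique that is assumed here purely for illustration; `compress_weights` and `decompress_weights` are hypothetical names. It shows the general pattern of storing a compact form off-chip and expanding it only once it has been brought on-chip.

```python
def compress_weights(weights):
    """Quantize floating-point weights to one byte each (assumed scheme)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    # Each weight becomes an integer code in 0..255.
    codes = bytes(min(255, round((w - lo) / scale)) for w in weights)
    return codes, lo, scale


def decompress_weights(codes, lo, scale):
    """Recover approximate weights after the compact form is fetched."""
    return [lo + c * scale for c in codes]


# Hypothetical weights for one node's connections.
node_weights = [-0.5, 0.0, 0.25, 1.0]
codes, lo, scale = compress_weights(node_weights)
restored = decompress_weights(codes, lo, scale)

print(len(codes), "bytes fetched instead of", 4 * len(node_weights), "for 32-bit floats")
print(restored)
```

The restored values differ from the originals by at most half a quantization step, which is the usual price of this kind of lossy compression.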
With speech recognition, waves of data must pass through the network. The incoming audio signal is split up into 10-millisecond increments, each of which must be evaluated separately. The new chip brings in a single node of the neural network at a time, but it passes the data from 32 consecutive 10-millisecond increments through the node. If a node has a dozen outputs, then the 32 passes result in 384 output values, which the chip stores locally. Each of those must be coupled with 11 other values when fed to the next layer of nodes, and so on. The chip ends up requiring a sizable onboard memory circuit for its intermediate computations. But it fetches only one compressed node from off-chip memory at a time, keeping its power requirements low.
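The arithmetic behind this batching scheme can be checked with a short sketch. The frame length, the batch of 32 frames, and the dozen outputs per node come from the article; the function names and the 1,000-node network size are hypothetical, chosen only to show how batching reduces off-chip fetches.

```python
FRAME_MS = 10          # length of one audio increment (from the article)
FRAMES_PER_BATCH = 32  # frames passed through a node per fetch (from the article)


def stored_outputs(outputs_per_node, frames=FRAMES_PER_BATCH):
    """Output values the chip must hold locally for one fetched node."""
    return outputs_per_node * frames


def node_fetches(num_nodes, num_frames, frames_per_fetch):
    """Off-chip node fetches needed to process num_frames of audio."""
    batches = -(-num_frames // frames_per_fetch)  # ceiling division
    return num_nodes * batches


# A dozen outputs over 32 frames gives the 384 values cited in the article.
print(stored_outputs(12))

# One second of audio is 100 frames. NUM_NODES is a hypothetical network size.
NUM_NODES = 1000
frames_in_one_second = 1000 // FRAME_MS
per_frame = node_fetches(NUM_NODES, frames_in_one_second, frames_per_fetch=1)
batched = node_fetches(NUM_NODES, frames_in_one_second, frames_per_fetch=FRAMES_PER_BATCH)
print(per_frame, "fetches without batching versus", batched, "with batching")
```

The sketch makes the trade explicit: fewer fetches from energy-hungry off-chip memory in exchange for more on-chip storage for intermediate values.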