2009

The speaker need not make any sound.

A recently invented speech-recognition method applies to words that are articulated by means of the tongue and throat muscles but are otherwise not voiced or, at most, are spoken sotto voce. This method could satisfy a need for speech recognition under circumstances in which normal audible speech is difficult, poses a hazard, is disturbing to listeners, or compromises privacy. The method could also be used to augment traditional speech recognition by providing an additional source of information about articulator activity. The method can be characterized as intermediate between (1) conventional speech recognition through processing of voice sounds and (2) a method, not yet developed, of processing electroencephalographic signals to extract unspoken words directly from thoughts.

Surface Electrodes Under and Near the Chin acquire signals indicative of tongue and throat muscle activity during articulation of words, even when the words are unvoiced. The signals are then processed to recognize the words. The six signal samples shown here correspond to the noted words.

This method involves computational processing of digitized electromyographic (EMG) signals from muscle innervation acquired by surface electrodes under a subject’s chin near the tongue and on the side of the subject’s throat near the larynx (see figure). After preprocessing, digitization, and feature extraction, the EMG signals are processed by a neural-network pattern classifier, implemented in software, that performs the bulk of the recognition task as described below.

Before processing signals representing words that one seeks to recognize, the neural network must be trained. During training, EMG signals representing known words and/or phrases are first sampled over specified time intervals (each typically about 2 seconds long). The portions of the signals recorded during each time interval are denoted sub-audible muscle patterns (SAMPs). Sequences of samples of SAMPs for overlapping time intervals are processed by a suitable signal-processing transform (SPT), which could be, for example, a Fourier, Hartley, or wavelet transform. The SPT outputs are entered into a matrix of coefficients, which is then decomposed into contiguous, non-overlapping two-dimensional cells of entries, each cell corresponding to a feature. Neural-network analysis is performed to estimate reference sets of weight coefficients for weighted sums of the SAMP features that correspond to known words and/or phrases.
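The feature-extraction steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window width and step, the cell size, and the choice to average each cell into a single feature value are all assumptions made for illustration. A naive discrete Fourier transform stands in for the SPT.

```python
import cmath
import math


def overlapping_windows(samples, width, step):
    """Split a sampled signal into overlapping windows of `width` samples."""
    return [samples[i:i + width]
            for i in range(0, len(samples) - width + 1, step)]


def dft_magnitudes(window):
    """Magnitudes of the first n/2 DFT bins (naive O(n^2) transform)."""
    n = len(window)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(window)))
            for k in range(n // 2)]


def cell_features(matrix, cell_rows, cell_cols):
    """Reduce each contiguous, non-overlapping cell to one feature.

    Averaging the entries of a cell is an assumption; the article does
    not say how a cell is turned into a feature value.
    """
    features = []
    for r in range(0, len(matrix) - cell_rows + 1, cell_rows):
        for c in range(0, len(matrix[0]) - cell_cols + 1, cell_cols):
            cell = [matrix[i][j]
                    for i in range(r, r + cell_rows)
                    for j in range(c, c + cell_cols)]
            features.append(sum(cell) / len(cell))
    return features


def samp_features(samples, width=16, step=8, cell_rows=2, cell_cols=4):
    """Windows -> transform -> coefficient matrix -> cell features."""
    windows = overlapping_windows(samples, width, step)
    matrix = [dft_magnitudes(w) for w in windows]  # one row per window
    return cell_features(matrix, cell_rows, cell_cols)


# Synthetic stand-in for one digitized SAMP (not real EMG data).
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
features = samp_features(signal)
```

With these hypothetical settings, a 64-sample signal yields seven overlapping windows, a 7-by-8 coefficient matrix, and six cell features. Rows or columns that do not fill a complete cell are dropped.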

Once training has been done, a SAMP that includes an unknown word is sampled and processed by the SPT, the SPT outputs are used to construct a matrix, the matrix is decomposed into cells, and neural-network analysis is performed, all in the same manner as in training. The weight coefficients computed during training are used to determine whether there is a sufficiently close match between an unknown word in the SAMP and a known word in the training database. If such a match is found, the word is deemed to be recognized.
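The matching step can be sketched in the same hedged spirit. The example below assumes a single-layer classifier that forms one weighted sum of features per known word, converts the sums to probabilities with a softmax, and accepts the best word only if its probability reaches a threshold. The words, weights, biases, and threshold are hypothetical; the article does not specify the network architecture or the match criterion.

```python
import math


def word_scores(features, weights, biases):
    """Weighted sum of the feature values for each known word."""
    return {word: sum(w * f for w, f in zip(weights[word], features))
                  + biases[word]
            for word in weights}


def softmax(scores):
    """Convert raw scores to probabilities (shifted for numerical stability)."""
    peak = max(scores.values())
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


def recognize(features, weights, biases, threshold=0.8):
    """Return the best-matching known word, or None if no match is close enough."""
    probs = softmax(word_scores(features, weights, biases))
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None


# Hypothetical trained weights for two words over three features.
WEIGHTS = {"stop": [2.0, -1.0, 0.0], "go": [-1.0, 2.0, 0.0]}
BIASES = {"stop": 0.0, "go": 0.0}

print(recognize([3.0, 0.0, 1.0], WEIGHTS, BIASES))  # clear match
print(recognize([1.0, 1.0, 0.0], WEIGHTS, BIASES))  # ambiguous input
```

The first call returns `stop`, because that word's weighted sum dominates. The second returns `None`, because both words score equally and neither reaches the threshold, which corresponds to the case in which no sufficiently close match is found.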

This work was done by C. C. Jorgensen and D. D. Lee of Ames Research Center.

This invention is owned by NASA and a patent application has been filed. Inquiries concerning rights for the commercial use of this invention should be addressed to the Ames Technology Partnerships Division at (650) 604-2954. Refer to ARC-15040-1.
