A coder/decoder system (codec) is being developed for use in a digital voice communication system in which speech is encoded as a series of frames of data and transmitted over a noisy or lossy channel. Called a “loss-tolerant speech codec” (LTSC), the system is designed to maintain the quality of reconstructed speech at the receiver (decoder) when frames are occasionally lost or corrupted by noise during propagation through the channel. Digital satellite and digital cellular telephone links are typical examples of channels in which frames can be lost or corrupted by noise. In addition, variable frame delays in packet-switching digital communication channels can occasionally grow long enough that the affected frames are, in effect, lost to the speech decoder.

The Decoder in the Loss-Tolerant Speech Codec estimates speech represented by missing or erroneous frames of received speech data. The estimate is an extrapolation based on the data from preceding frames and on the dynamics of speech as learned by a back-propagation neural network.

The principal innovative aspect of this LTSC is a subsystem called an “intelligent speech filter” (ISF), which combines the latest neural-network technology with state-of-the-art speech-processing techniques. The ISF functions as a speech predictor or extrapolator; it reconstructs an approximate version of the missing speech frames on the basis of previous speech frames and of its knowledge of the dynamics of the vocal tract and pitch. The quality of speech reconstructed after transmission through a noisy channel is improved over that of other codecs because, in a dynamical sense, the estimated speech inserted in the missing or erroneous frames sounds like the immediately preceding speech. In contrast, other codecs put out such distracting, discontinuous sounds as clicks, abrupt silences, or garbled speech in response to missing or erroneous frames.

The design of this LTSC will eventually call for high-speed microprocessors, a parallel-processing architecture, and efficient algorithm coding to implement its functions, which are shown in simplified form in the figure. The analog-to-digital converter in the transmitter linearly quantizes the input speech signal into a binary representation of 12 to 16 bits at a sampling rate of fs. From the digital input speech signal, the encoder estimates the spectral envelope that represents the vocal tract, along with an appropriate driving function. Within the encoder, the digital input speech signal is first classified as either voiced or unvoiced, because the mathematical models for processing voiced and unvoiced signals are quite different; the model for voiced signals involves pitch-synchronous spectral analysis, while that for unvoiced signals involves linear-prediction spectral analysis.
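The brief does not specify how the encoder makes the voiced/unvoiced decision. The following Python sketch shows one conventional way to make such a classification, using short-time energy and zero-crossing rate; the thresholds, frame length, and 8-kHz sampling rate are illustrative assumptions rather than LTSC design values.

```python
import numpy as np

def classify_frame(frame, energy_thresh=1e-4, zcr_thresh=0.25):
    """Label one frame of digitized speech as voiced or unvoiced.

    Voiced speech tends to show high short-time energy and a low
    zero-crossing rate; unvoiced (fricative) speech shows the
    opposite. The thresholds are illustrative placeholders.
    """
    x = frame - np.mean(frame)                     # remove DC offset
    energy = np.mean(x ** 2)                       # short-time energy
    signs = np.signbit(x).astype(int)
    zcr = np.mean(np.abs(np.diff(signs)))          # fraction of sign changes
    return "voiced" if (energy > energy_thresh and zcr < zcr_thresh) else "unvoiced"

# Example: a 20-ms frame at an assumed fs = 8 kHz (160 samples)
fs = 8000
t = np.arange(160) / fs
voiced_like = 0.5 * np.sin(2 * np.pi * 120 * t)    # 120-Hz tone mimics voiced speech
print(classify_frame(voiced_like))                 # -> "voiced"
```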

The decoder in the receiver includes the ISF plus signal-processing circuitry that determines whether or not each received frame of data is erroneous. If a frame is received without error, the data are decoded into a spectral envelope, frame energy, and pitch period for voiced speech. The decoded data are used to generate a speech time series, which is fed to a digital-to-analog converter for conversion to an analog output speech signal.
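For illustration, the sketch below shows a generic linear-predictive way of regenerating a speech time series from a decoded spectral envelope, frame energy, and pitch period. It is not the codec's actual reconstruction routine: the excitation model, the filter order, and the function and parameter names are assumptions made here for concreteness.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc, pitch_period, energy, frame_len, voiced=True):
    """Regenerate one frame of speech from decoded parameters.

    A pulse train (voiced) or white noise (unvoiced) excites an
    all-pole filter 1/A(z), where A(z) = 1 + lpc[0]*z^-1 + ...
    encodes the spectral envelope; the result is scaled so the
    frame carries the transmitted energy.
    """
    if voiced:
        excitation = np.zeros(frame_len)
        excitation[::pitch_period] = 1.0           # impulse every pitch period
    else:
        excitation = np.random.randn(frame_len)
    speech = lfilter([1.0], np.concatenate(([1.0], lpc)), excitation)
    speech *= np.sqrt(energy / np.sum(speech ** 2))  # match frame energy
    return speech

# Example: a mildly resonant 2nd-order envelope, 80-sample pitch period
frame = synthesize_frame(lpc=np.array([-1.2, 0.8]), pitch_period=80,
                         energy=1.0, frame_len=160)
```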

The decoded data are also sent to an input buffer in the ISF. When a frame of data is found to be lost or erroneous, the ISF uses the decoded data stored in the buffer during preceding frames to extrapolate the spectral envelope, energy, and pitch period into the missing or erroneous frame. The energy and pitch are both predicted by an all-pole infinite-impulse-response digital filter. The neural network is used to solve the most difficult part of the extrapolation problem, which is prediction of the spectral envelope.
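The brief does not give the order or coefficients of the all-pole filter, nor how they are obtained. The sketch below shows one plausible reading: treat the buffered per-frame energies (or pitch periods) as a short time series, fit a low-order autoregressive predictor to it by least squares, and extrapolate one frame ahead. The least-squares fit and the filter order are assumptions.

```python
import numpy as np

def extrapolate_track(history, order=2):
    """Extend a per-frame parameter track (energy or pitch period) by one
    frame with an all-pole (autoregressive) predictor, echoing the IIR
    prediction the ISF applies to these tracks.
    """
    h = np.asarray(history, dtype=float)
    # Build the system h[n] ~ a1*h[n-1] + ... + ap*h[n-p] over the buffer
    rows = [h[i - order:i][::-1] for i in range(order, len(h))]
    A, y = np.array(rows), h[order:]
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(coeffs @ h[-order:][::-1])        # predicted next value

# Example: a slowly decaying frame-energy contour from the input buffer
energies = [1.00, 0.95, 0.91, 0.86, 0.82, 0.78]
print(extrapolate_track(energies))                 # -> extrapolated next-frame energy
```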

The neural network is of a back-propagation-learning type. It can be trained in the dynamics of speech by use of the time-varying spectral envelopes of representative samples of speech recorded from a variety of speakers. The network in the prototype ISF was trained on a single sentence from a single speaker, but the training set will eventually be expanded to include many sentences from many speakers, containing all phonemes and phoneme transitions.
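As a concrete illustration of this training scheme, the sketch below implements a one-hidden-layer network trained by back-propagation to map the spectral envelopes of the two preceding frames to the next frame's envelope. The network size, activations, learning rate, context length, and the synthetic envelope data are all assumptions; the brief describes only the use of back-propagation learning on time-varying spectral envelopes.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_envelope_predictor(envelopes, context=2, hidden=16,
                             lr=0.05, epochs=500):
    """Train a one-hidden-layer network by back-propagation to map the
    spectral envelopes of the previous `context` frames to the next
    frame's envelope. Layer sizes and tanh/linear activations are
    illustrative choices, not details from the prototype ISF.
    """
    E = np.asarray(envelopes, dtype=float)          # shape (frames, bands)
    X = np.hstack([E[i:len(E) - context + i] for i in range(context)])
    Y = E[context:]
    W1 = rng.normal(0, 0.1, (X.shape[1], hidden))
    W2 = rng.normal(0, 0.1, (hidden, Y.shape[1]))
    for _ in range(epochs):
        H = np.tanh(X @ W1)                          # hidden activations
        P = H @ W2                                   # linear output layer
        err = P - Y                                  # prediction error
        W2 -= lr * H.T @ err / len(X)                # backprop: output layer
        W1 -= lr * X.T @ ((err @ W2.T) * (1 - H**2)) / len(X)  # hidden layer
    return W1, W2

def predict_next(W1, W2, recent):
    """Extrapolate the envelope of a missing frame from buffered frames."""
    x = np.concatenate(recent)                       # oldest frame first
    return np.tanh(x @ W1) @ W2

# Example: 40 frames of a smoothly drifting, synthetic 10-band envelope
frames = np.array([np.exp(-0.5 * (np.arange(10) - 3 - 0.05 * n) ** 2)
                   for n in range(40)])
W1, W2 = train_envelope_predictor(frames)
estimate = predict_next(W1, W2, [frames[-2], frames[-1]])
```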

This work was done by Jaime L. Prieto of LinCom Corp. for Johnson Space Center.

In accordance with Public Law 96-517, the contractor has elected to retain title to this invention. Inquiries concerning rights for its commercial use should be addressed to

Jaime L. Prieto
LinCom Corp.
1020 Bay Area Blvd., Suite 200
Houston, TX 77058

Refer to MSC-22426, volume and number of this NASA Tech Briefs issue, and the page number.