Tech Briefs

Research on Spoken Dialogue Systems

Human verbal interaction with complex information sources.

Research in the field of spoken dialogue systems has been performed with the goal of making such systems more robust and easier to use in demanding situations. The term “spoken dialogue systems” signifies unified software systems containing speech-recognition, speech-synthesis, dialogue-management, and ancillary components that enable human users to communicate, using natural spoken language or nearly natural prescribed spoken language, with other software systems that provide information and/or services. The research is proceeding on several fronts: recognition of speech signals, syntactic and semantic parsing, language modeling, discourse analysis, and context modeling.

Many of the advances made thus far in this research have been incorporated into a voice-enabled procedure browser and reader, called Clarissa, that has been tested aboard the International Space Station. [A procedure browser and reader is essentially a software version of an instruction manual that may describe one or more possibly complex procedures.] Major problems that have been addressed in developing Clarissa include creating voice-navigable versions of formal procedure documents; grammar-based speech recognition; methods, based on grammar filtering or support vector machines, for accurately detecting when a user’s speech is directed toward a listener other than Clarissa; and robust, side-effect-free dialogue management for enabling undoing, correction, and/or confirmation of steps of a procedure.
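The grammar-filtering idea mentioned above can be illustrated with a minimal sketch (assumed for illustration, not Clarissa’s actual code): utterances that parse under the prescribed command grammar are treated as directed at the system, while out-of-grammar speech, such as crew members talking to one another, is rejected as cross-talk.

```python
# Illustrative grammar-filtering sketch: the toy grammar below is a stand-in
# for a real command grammar; it is an assumption, not Clarissa's grammar.
COMMAND_GRAMMAR = {
    "next step",
    "previous step",
    "read step",
    "set timer",
}

def is_directed_at_system(utterance: str) -> bool:
    """Accept only utterances that the command grammar can parse;
    everything else is treated as speech not addressed to the system."""
    return utterance.lower().strip() in COMMAND_GRAMMAR
```

A support-vector-machine classifier, trained on acoustic and lexical features of accepted and rejected utterances, serves the same accept/reject purpose where a hard grammar test is too brittle.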

Clarissa enables the user to navigate a complex procedure using only spoken input and output, making it unnecessary for the user to shift visual attention from the task at hand to a paper instruction manual or to an equivalent document displayed on a computer screen. Clarissa also provides a graphical user interface (GUI) for optional visual display of information. Clarissa has a vocabulary of about 260 words and supports about 75 different commands, including commands for reading steps of the procedure, scrolling forward or backward in the procedure, moving to an arbitrary new step, reviewing non-current steps, adding and removing voice notes, displaying pictures, setting and canceling alarms and timers, requiring challenges to verify critical commands, and querying the system as to status of the procedure.

Clarissa includes the following main software modules:

  • Speech Processor — Performs low-level speech-recognition (input) and speech-synthesis (output) functions.
  • Semantic Analyzer — Converts output from the speech processor into an abstract dialogue move. 
  • Response Filter — Decides whether to accept or reject the spoken input from the user.
  • Dialogue Manager — Converts abstract dialogue moves into abstract dialogue actions, and maintains knowledge of both the context of the discourse and the progress through the procedure.
  • Output Manager — Accepts abstract dialogue actions from the Dialogue Manager and converts them into lists of procedure calls that result in concrete system responses, which can include spoken output, requests for display of visual output on the GUI, or sending dialogue moves back to the Dialogue Manager.
  • GUI Module — Mediates conventional keyboard and screen-based interaction with the user and accepts display requests from the Output Manager. This module can also convert keyboard input from the user into dialogue moves, which are sent to the Dialogue Manager.
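The data flow among the modules listed above can be sketched as follows. This is a hedged, minimal illustration of the described architecture, not Clarissa’s actual code: the class and field names are assumptions, and a real dialogue manager would track far richer context.

```python
# Sketch of the module pipeline: a dialogue move (abstract user input) goes
# to the dialogue manager, which returns abstract dialogue actions; the
# output manager converts those into concrete responses (speech or GUI).
from dataclasses import dataclass

@dataclass
class DialogueMove:
    """Abstract representation of a user utterance, e.g. 'next_step'."""
    intent: str

@dataclass
class DialogueAction:
    """Abstract system response chosen by the dialogue manager."""
    kind: str      # e.g. "speak" or "display"
    content: str

@dataclass
class DialogueManager:
    """Tracks progress through a procedure; it is side-effect-free in the
    sense that nothing happens until the output manager executes the
    returned actions, so moves can be undone or confirmed first."""
    steps: list
    position: int = 0

    def handle(self, move: DialogueMove) -> list:
        if move.intent == "next_step" and self.position < len(self.steps) - 1:
            self.position += 1
        elif move.intent == "previous_step" and self.position > 0:
            self.position -= 1
        return [DialogueAction("speak", self.steps[self.position])]

def output_manager(actions: list) -> list:
    """Convert abstract actions into concrete responses (strings here,
    standing in for text-to-speech output or GUI display requests)."""
    return [f"TTS: {a.content}" if a.kind == "speak" else f"GUI: {a.content}"
            for a in actions]

dm = DialogueManager(steps=["Open panel A", "Check valve B", "Close panel A"])
responses = output_manager(dm.handle(DialogueMove("next_step")))
```

Separating abstract moves and actions from their concrete realizations is what lets the same dialogue logic drive both the spoken channel and the optional GUI.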

Another accomplishment of this research has been the development of a targeted-help module that is highly portable in that it can be added to a spoken dialogue system, with minimal application-specific modifications, to make the spoken dialogue system more robust. The targeted-help module is intended, more specifically, for incorporation into a spoken dialogue system in which, as in Clarissa, there is a prescribed spoken language containing a limited number of words. The purpose served by the targeted-help module is to assist an untrained user to learn the prescribed language by providing help messages in response to out-of-coverage users’ utterances (that is, users’ utterances outside the prescribed language). These messages can be much more informative than “Sorry, I didn’t understand” and variants thereof generated by older, less-capable spoken dialogue systems.

The targeted-help module includes two submodules that run simultaneously: a grammar-based recognizer and a statistical language model (SLM). When the grammar-based recognizer succeeds, the ordinarily-less-accurate hypothesis generated by the SLM recognizer is not used. When the grammar-based recognizer fails and the SLM recognizer produces a recognition hypothesis, the SLM output is processed to generate a message that tells the user what was recognized as having been uttered, a diagnosis of what was problematic about the recognized utterance, and a related in-coverage example. The in-coverage example is intended to encourage alignment between the user’s utterances and the prescribed language.
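The two-recognizer arrangement just described can be sketched as below. All names and the toy grammar are illustrative assumptions, not the module’s actual interfaces; the SLM recognizer here is a trivial stand-in that echoes its input.

```python
# Sketch of the targeted-help fallback: a strict grammar-based recognizer
# runs alongside a statistical language model (SLM) recognizer, and the
# SLM hypothesis is used only when the grammar-based recognizer fails.
IN_COVERAGE = {
    ("next", "step"),
    ("previous", "step"),
    ("go", "to", "step", "three"),
}

def grammar_recognize(words: list):
    """Return the utterance if it parses under the prescribed grammar, else None."""
    return words if tuple(words) in IN_COVERAGE else None

def slm_recognize(words: list):
    """Stand-in for an SLM recognizer, which returns a (less accurate)
    hypothesis even for out-of-coverage speech."""
    return words

def targeted_help(words: list) -> str:
    """Tell the user what was heard, that it was out of coverage, and give
    a related in-coverage example to encourage alignment with the grammar."""
    hypothesis = " ".join(slm_recognize(words))
    example = " ".join(sorted(IN_COVERAGE)[0])
    return (f"I heard: '{hypothesis}'. That is not in the command language. "
            f"Try, for example: '{example}'.")

def recognize(words: list):
    parsed = grammar_recognize(words)
    if parsed is not None:                   # grammar succeeded: prefer it
        return ("accepted", " ".join(parsed))
    return ("help", targeted_help(words))    # grammar failed: SLM + help message
```

A fuller implementation would pick the in-coverage example by similarity to the SLM hypothesis rather than arbitrarily, so the suggestion stays related to what the user was trying to say.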

This work was done by Gregory Aist and James Hieronymus of the Research Institute for Advanced Computer Science; John Dowding and Beth Ann Hockey of the University of California, Santa Cruz; Manny Rayner of the International Computer Science Institute; Nikos Chatzichrisafis of the University of Geneva; Kim Farrell of QSS; and Jean-Michel Renders of Xerox Research Center Europe for Ames Research Center. For more information, download the Technical Support Package (free white paper) under the Information Sciences category. ARC-14610-1