Embedded multimodal systems tell a good story
Roberto Pieraccini
8/18/2003 10:06 AM EDT
The convergence of several trends, including falling computer cost and size, rising performance, ubiquity, and the availability of pocket-size high-resolution displays, has generated growing interest in embedded multimodal systems that adopt speech, visual, and haptic interfaces as their primary modes of interaction. The target devices for these applications range from cellular telephones and PDAs to appliances and in-car computers. Possible applications range from simple voice dialing to more sophisticated conversational interfaces for managing non-critical automobile operations like climate control, entertainment, and navigation systems.
As computational devices have advanced, speech processing algorithms have improved in both vocabulary size and accuracy, while their need for memory and CPU cycles has decreased. Advanced technologies, such as finite state transducers (FST), allow speech recognition systems to be built that can handle thousands of words in real time with a memory footprint of a few megabytes. Moreover, a new technique known as distributed speech recognition (DSR), based on the Aurora standard recommended by the European Telecommunications Standards Institute (ETSI), enables the speech recognition front end to run on a networked client device, while the more expensive large-vocabulary search algorithm resides on a server. Speech synthesis has followed a similar trend, with concatenative TTS delivering near-human-quality speech in a few megabytes of memory.
Multimodal technology is thus ready for embedded environments. The issue now is how to enable developers to author sophisticated applications. A promising approach known as Speech Application Language Tags (SALT) extends HTML with a set of elements that control speech processing and call control resources. These elements support the development of multimodal web pages by binding graphics to speech input and output events. The World Wide Web Consortium (W3C) is working on a standard for multimodal authoring, and the SALT specification has been contributed as a possible candidate.
However, as with web applications, HTML is only one aspect of development, and the real complexity lies in the server programs that manage the different layers of logic. Similarly, for multimodal interaction systems, the real complexity is behind the presentation layer, in what is known as the dialog or interaction manager. While web and network speech application development can rely on a wide variety of technologies and products for server-side development, embedded systems still must rely on traditional programming. This situation can be alleviated by carefully identifying the different levels of logic involved in a multimodal interaction application, such as the input/output layer, the multimodal integration layer, the interaction manager, and the semantic component, and by providing a reusable, application-independent engine for each of them.
Anatomy of a multimodal dialog
The simplest multimodal dialog application is obtained by combining speech and GUI interfaces into a single system. To simplify even further, we will restrict our discussion to what is generally referred to as sequential multimodality, in which users provide one input at a time. For example, users can either speak an utterance or provide input to the GUI by clicking a button or filling in a text field at will, but cannot interact with both channels at the same time. Such a system can be decomposed into the blocks shown in Figure 1. The interaction manager is at the core of the system and represents the application logic. It receives data from the user input channels, such as the speech recognizer and GUI, and from other sources, such as a telephone, a web application, or a database. In response, it sends data to the output channels, such as the audio prompt player, the GUI, or other devices. The interaction manager has two main functions. The first is managing and updating the application state, which is generally embodied by a structured set of variables. The second is deciding which action to perform next given the current configuration of the application state; possible actions include playing an audio prompt, activating the speech recognizer with a particular grammar, or changing the GUI layout. The decision mechanism that selects the proper next action for a given configuration of the application state is generally referred to as the dialog strategy.
Many forms of dialog strategy can be adopted, most of them the result of years of research and targeted at complex conversational interactions. One of the simplest and most effective ways of representing a dialog strategy, often adopted in commercial applications, is the state machine controller. In its simplest form, a state machine controller is nothing more than an if-then-else or case-statement structure whose conditions are drawn over the variables of the application. However, an interaction manager based on hard-coded case statements is difficult to structure into modular, reusable elements of interaction that make the application easy to maintain and update.
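To make the limitation concrete, here is a minimal Python sketch of a hard-coded controller of this kind. It is only an illustration: the state variables (`temperature`, `confirmed`), the action names, and the prompt wording are hypothetical and do not come from any actual product.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AppState:
    """Application state: a structured set of variables (hypothetical fields)."""
    temperature: Optional[int] = None   # value requested by the user, if any
    confirmed: bool = False             # has the user confirmed the value?


def update_state(state: AppState, channel: str, value: str) -> None:
    """First function of the interaction manager: update the application state.

    `channel` is "speech" or "gui". With sequential multimodality only one
    input arrives at a time, so both channels are treated the same way here.
    """
    if state.temperature is None:
        if value.isdigit():
            state.temperature = int(value)
    elif not state.confirmed:
        if value.lower() in ("yes", "ok"):
            state.confirmed = True
        elif value.lower() in ("no", "cancel"):
            state.temperature = None    # discard the value and ask again


def next_action(state: AppState) -> Tuple[str, str]:
    """Second function: the dialog strategy, written as an if-then-else chain."""
    if state.temperature is None:
        return ("prompt", "What temperature would you like?")
    elif not state.confirmed:
        return ("prompt", f"Set the cabin to {state.temperature} degrees?")
    else:
        return ("execute", f"set_temperature({state.temperature})")
```

Even in this toy form, every new functional domain would add branches to both functions, which is the maintenance problem described above.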
For instance, a complex multimodal application such as the Ford Concept Car Model U is structured into several sub-applications such as navigation, entertainment, and climate control, each having several functional domains. For example, the climate control branch is structured into cabin temperature, seat temperature, fan speed and direction, etc., while the navigation branch is structured into destination entry, points of interest, map display, etc. Each functional domain is composed of further sub-branches, such as requesting information entry, input confirmation, retries for missing and errant input, etc. The complexity of the application can be represented by a graph structure, such as the one in Figure 2. At any point in time the application is in a state represented by one of the nodes of that graph. However, at any time the user may want to move to any other node without following the predetermined hierarchical structure. For instance, after adjusting the cabin temperature the user may want to place a telephone call without going back through the climate and main menu nodes, by simply saying "call Susan." This behavior, typical of conversational systems, is referred to as mixed initiative.
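A small Python sketch can illustrate the idea of jumping across such a graph. The node names follow the examples in the text, but the exact shape of the graph, the placement of a `telephone` branch, and the keyword table `TRIGGERS` are assumptions made for illustration; a real system would rely on the speech recognizer's grammar rather than on the first word of the utterance.

```python
from typing import Dict, List, Optional

# Hypothetical application graph expressed as nested dictionaries.
APP_GRAPH: Dict[str, dict] = {
    "main": {
        "climate": {"cabin_temperature": {}, "seat_temperature": {}, "fan": {}},
        "navigation": {"destination_entry": {}, "points_of_interest": {}, "map_display": {}},
        "entertainment": {},
        "telephone": {"call": {}},
    }
}

# Hypothetical mapping from a leading keyword to the node it activates.
TRIGGERS = {"call": "call", "navigate": "destination_entry"}


def find_path(graph: Dict[str, dict], target: str) -> Optional[List[str]]:
    """Depth-first search returning the path from the root to `target`."""
    for name, children in graph.items():
        if name == target:
            return [name]
        sub = find_path(children, target)
        if sub is not None:
            return [name] + sub
    return None


def handle_utterance(current_path: List[str], utterance: str) -> List[str]:
    """Mixed initiative: jump straight to the node an utterance asks for.

    If the first word is not a known trigger, stay at the current node.
    """
    words = utterance.lower().split()
    target = TRIGGERS.get(words[0]) if words else None
    if target is None:
        return current_path
    return find_path(APP_GRAPH, target) or current_path
```

With this sketch, a user located at `["main", "climate", "cabin_temperature"]` who says "call Susan" is moved directly to `["main", "telephone", "call"]`, without visiting the intermediate menu nodes.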
With such dialogs in sophisticated applications, it becomes obvious why the simple case-statement approach would fail to produce a reusable, easy-to-maintain code base for a multimodal interaction manager. A solution is to delegate the task of managing the interaction to a general-purpose finite state machine engine. Building the application then amounts to defining the characteristics of the state machines that describe it. From there, one can build a more sophisticated reusable state machine engine that includes, for instance, logic for handling the stack of state machines when moving across different portions of the application graph, and for performing typical user interface operations, such as backup/undo, cancel, and repeat.
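The following Python sketch shows one way such an engine might look, under simplifying assumptions. The application is described purely as data (`MACHINES`), while the `Engine` class handles the stack and the generic commands. The machine definitions, the `@name` convention for entering a sub-machine, the `END` marker, and the command words `repeat`, `back`, and `cancel` are all hypothetical choices made for this illustration.

```python
from typing import Dict, List, Tuple

# A machine maps state name -> (prompt, {user input -> next state}).
# Targets starting with "@" push another machine; "END" pops the current one.
Machine = Dict[str, Tuple[str, Dict[str, str]]]

# Hypothetical application definition: only data, no control logic.
MACHINES: Dict[str, Machine] = {
    "main": {
        "start": ("Main menu.", {"climate": "@climate"}),
    },
    "climate": {
        "start": ("Climate. Temperature or fan?",
                  {"temperature": "ask_temp", "fan": "ask_fan"}),
        "ask_temp": ("What temperature?", {"done": "END"}),
        "ask_fan": ("What fan speed?", {"done": "END"}),
    },
}


class Engine:
    """Application-independent engine that runs machines defined as data."""

    def __init__(self, machines: Dict[str, Machine], root: str) -> None:
        self.machines = machines
        self.stack: List[Tuple[str, str]] = [(root, "start")]  # (machine, state)
        self.history: List[List[Tuple[str, str]]] = []         # snapshots for "back"

    def prompt(self) -> str:
        machine, state = self.stack[-1]
        return self.machines[machine][state][0]

    def handle(self, user_input: str) -> str:
        """Process one input and return the prompt to play next."""
        # Generic commands are implemented once here, for every machine.
        if user_input == "repeat":
            return self.prompt()
        if user_input == "back":
            if self.history:
                self.stack = self.history.pop()     # undo the last move
            return self.prompt()
        if user_input == "cancel":
            if len(self.stack) > 1:
                self.history.append(list(self.stack))
                self.stack.pop()                    # abandon the sub-machine
            return self.prompt()

        machine, state = self.stack[-1]
        target = self.machines[machine][state][1].get(user_input)
        if target is None:
            return self.prompt()                    # unrecognized input: re-prompt
        self.history.append(list(self.stack))
        if target.startswith("@"):
            self.stack.append((target[1:], "start"))    # enter a sub-machine
        elif target == "END":
            if len(self.stack) > 1:
                self.stack.pop()                        # return to the caller
        else:
            self.stack[-1] = (machine, target)          # ordinary transition
        return self.prompt()
```

Adding a new functional domain in this design means adding an entry to `MACHINES`; the engine code, including the handling of `repeat`, `back`, and `cancel`, stays unchanged.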
The granularity and distribution of the state information between the interaction manager and the other components of Figure 1 are a design choice. The peripheral components of the simple multimodal application of Figure 1 can be made completely stateless. For instance, the speech recognizer can receive configuration information (e.g. grammar, confidence thresholds, etc.) from the interaction manager at a certain interaction turn, recognize the input speech, send the recognition results back to the interaction manager, and return to an idle state, ready to serve another request. Similarly, the GUI can send user input information to the interaction manager as soon as it is available. In this case the interaction manager also has to manage the integration of the inputs coming from different channels. For instance, if a button is pressed on the GUI while an audio prompt is being played and the speech recognizer is active, the interaction manager can stop the prompt (a reaction known as barge-in) and deactivate the speech recognizer, while moving to a different application state. In other words, events raised at the peripheral components, such as the pushing of a button, need to propagate to the interaction manager layer and activate the proper event handlers.
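The event propagation just described can be sketched in Python as follows. This is a simplified illustration, not an actual implementation: the event names, the command strings written to `log` (which stand in for calls to real prompt player and recognizer components), and the handler names are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    source: str     # "gui" or "speech" (hypothetical channel names)
    name: str       # e.g. "button_pressed", "recognized"
    data: str = ""  # payload, here simply the name of the next application state


class InteractionManager:
    """Holds all the state; peripherals only raise events and obey commands."""

    def __init__(self) -> None:
        self.prompt_playing = False
        self.recognizer_active = False
        self.app_state = "idle"
        self.log: List[str] = []   # commands sent to the peripherals
        self.handlers: Dict[str, Callable[[Event], None]] = {
            "button_pressed": self.on_button,
            "recognized": self.on_speech,
        }

    def start_turn(self, prompt: str, grammar: str) -> None:
        """Play a prompt and arm the recognizer with per-turn configuration."""
        self.prompt_playing = True
        self.recognizer_active = True
        self.log.append(f"play:{prompt}")
        self.log.append(f"listen:{grammar}")

    def dispatch(self, event: Event) -> None:
        """Route an event from a peripheral to its handler, if one exists."""
        handler = self.handlers.get(event.name)
        if handler is not None:
            handler(event)

    def _interrupt_output(self) -> None:
        if self.prompt_playing:
            self.log.append("stop_prompt")   # barge-in
            self.prompt_playing = False

    def on_button(self, event: Event) -> None:
        self._interrupt_output()
        if self.recognizer_active:
            self.log.append("stop_listening")
            self.recognizer_active = False
        self.app_state = event.data

    def on_speech(self, event: Event) -> None:
        self._interrupt_output()
        self.recognizer_active = False       # recognizer has returned to idle
        self.app_state = event.data
```

For example, after `start_turn("Which function?", "main_menu")`, dispatching `Event("gui", "button_pressed", "climate")` stops the prompt, deactivates the recognizer, and moves the application to the `climate` state.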
Sophisticated applications spanning multiple HTML pages benefit from a well-designed interaction manager that dynamically generates pages with SALT rather than directly invoking the speech processing and GUI methods.
Roberto Pieraccini, Director of the Natural Dialog Group, SpeechWorks International, Boston, MA


See related chart
