
system is a semiautomated “coach” that presents to participants episodes of the meeting during which dysfunctional behavior is detected (e.g., dominant or aggressive behavior), with the goal of improving meeting participation behavior over time by allowing participants to reflect upon their own actions.
Other work looks into the role of speech amplitude, lexical content, and gaze
to automatically detect who the intended addressee is in interactions involv-
ing groups of people and computational assistants (Jovanovic, op den Akker, &
Nijholt, 2006; Katzenmaier, Stiefelhagen, & Schultz, 2004; Lunsford, Oviatt, &
Arthur, 2006; Lunsford, Oviatt, & Coulston, 2005; van Turnhout et al., 2005).
Speech amplitude is found to be a strong indicator of who participants intend to
address in situations in which a computational assistant is available (Lunsford
et al., 2005, 2006). In studies of users engaged in an educational problem-solving task, directives intended to be handled by the computer were delivered at significantly higher amplitude than speech directed to human peers.
Leveraging this cue, a system can automatically determine whether a given spoken utterance should be interpreted as a command requiring a response from the system or left for human peers to answer, separating the two within a single open audio stream. This open-microphone engagement problem is one of the more challenging but fundamental issues remaining to be solved by new multimodal collaborative systems.
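As a rough illustration of how such amplitude-based engagement detection might work, the minimal Python sketch below flags an utterance as system-directed when its level exceeds a per-speaker baseline by a fixed margin. This is not the classifier used in the cited studies; the class name, the RMS measure, and the 6 dB margin are illustrative assumptions.

```python
import numpy as np


def rms_amplitude(samples: np.ndarray) -> float:
    """Root-mean-square amplitude of a mono audio segment (float samples in [-1, 1])."""
    return float(np.sqrt(np.mean(np.square(samples))))


class AmplitudeAddresseeClassifier:
    """Decide whether an utterance is addressed to the system or to human peers,
    using only its amplitude relative to a per-speaker baseline (illustrative sketch)."""

    def __init__(self, margin_db: float = 6.0):
        # margin_db is a hypothetical tuning parameter: how many decibels above the
        # speaker's peer-directed baseline an utterance must be to count as system-directed.
        self.margin_db = margin_db
        self.baseline_rms = None

    def calibrate(self, peer_directed_segments: list) -> None:
        """Estimate the speaker's typical amplitude when talking to human peers."""
        self.baseline_rms = float(np.mean([rms_amplitude(s) for s in peer_directed_segments]))

    def is_system_directed(self, utterance: np.ndarray) -> bool:
        """True if the utterance is loud enough, relative to the baseline, to be
        treated as a command to the system rather than peer conversation."""
        if self.baseline_rms is None:
            raise RuntimeError("calibrate() must be called first")
        gain_db = 20.0 * np.log10(rms_amplitude(utterance) / self.baseline_rms)
        return gain_db >= self.margin_db
```

A deployed system would of course combine this cue with lexical content and gaze, as the studies cited above suggest, rather than relying on amplitude alone.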
12.2.2 Concepts and Mechanisms
A primary technical concern when designing a multimodal system is the definition
of the mechanism used to combine—or fuse—input related to multiple modalities
so that a coherent combined interpretation can be achieved. Systems such as
Bolt’s (1980) “Put that there” and other early systems mainly processed speech,
and used gestures just to resolve the x, y coordinates of pointing events. Systems that
handle modalities such as speech and pen, each of which is able to provide seman-
tically rich information, or speech and lip movements, which are tightly correlated,
require considerably more elaborate fusion mechanisms. These mechanisms
include representation formalisms, fusion algorithms, and entirely new software
architectures.
Multimodal fusion emerges from the need to deal with multiple modalities not
only as independent input alternatives, but also as contributors of parts or elements
of expressions that only make sense when interpreted synergistically (Nigay &
Coutaz, 1993). When a user traces a line with a pen while saying “Evacuation route,” a multimodal system must compose the spatial attributes captured via the pen with the meaning assigned to that line via speech.
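One simple way to represent such a composition, sketched below under assumed data structures, is to treat the pen trace as the geometry of a typed map object and the spoken phrase as its type, fusing the two only when they fall within a temporal window. The class names, the fuse function, and the 4-second window are illustrative assumptions rather than the fusion formalism of any particular system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PenInput:
    """Spatial contribution: the digitized trace of a pen gesture."""
    points: List[Tuple[float, float]]   # (x, y) map coordinates
    timestamp: float                    # seconds, used for temporal alignment


@dataclass
class SpeechInput:
    """Semantic contribution: the recognized spoken phrase."""
    transcript: str
    timestamp: float


@dataclass
class FusedInterpretation:
    """Combined meaning: a typed map object carrying both semantics and geometry."""
    object_type: str
    geometry: List[Tuple[float, float]]


def fuse(pen: PenInput, speech: SpeechInput,
         max_lag_s: float = 4.0) -> Optional[FusedInterpretation]:
    """Fuse a pen trace with a spoken label when the two arrive close enough in time.

    The pen supplies the geometry and speech supplies the type; if the inputs are
    too far apart in time, each must be interpreted on its own instead.
    """
    if abs(pen.timestamp - speech.timestamp) > max_lag_s:
        return None  # no temporal overlap: treat the inputs as independent
    return FusedInterpretation(object_type=speech.transcript.lower(),
                               geometry=pen.points)


# Example: a line drawn on the map while the user says "Evacuation route"
pen = PenInput(points=[(10.0, 20.0), (14.5, 23.0), (19.0, 27.5)], timestamp=3.1)
speech = SpeechInput(transcript="Evacuation route", timestamp=3.6)
print(fuse(pen, speech))
```

Real fusion architectures go well beyond this, using representation formalisms such as typed feature structures and probabilistic ranking of candidate interpretations, but the sketch conveys the basic synergy: neither input alone carries the complete meaning.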
A well-designed multimodal system offers, to the extent possible, the capability for commands to be expressed through a single modality. Users should be
able to specify meanings using only the pen or using just speech. It must be noted,