
Collaborative Multimodal Systems
Recently, considerable attention has been given to systems that interpret the
communication taking place among humans in multiparty collaborative scenarios,
such as meetings. Such communication is naturally multimodal—people employ
speech, gestures, and facial expressions, take notes, and sketch ideas in the course
of group discussions.
A new breed of research systems such as Rasa (McGee & Cohen, 2001), Neem
(Barthelmess & Ellis, 2005; Ellis & Barthelmess, 2003), and others (Falcon et al.,
2005; Pianesi et al., 2006; Rienks, Nijholt, & Barthelmess, 2006; Zancanaro, Lepri,
& Pianesi, 2006) have been exploring multimodal collaborative scenarios, aiming
to provide assistance in ways that leverage group communication without
adversely impacting group performance.
Processing unconstrained communication among human actors introduces
a variety of technical challenges. Conversational speech over an open microphone
is considerably harder to recognize than more constrained speech directed to a
computer (Oviatt, Cohen, & Wang, 1994). The interpretation of other modalities
is similarly more complex. Communicating partners also rely heavily on shared
context that may not be directly accessible to a system (Barthelmess, McGee, &
Cohen, 2006; McGee, Pavel, & Cohen, 2001).
Devising ways to extract high-value items from complex group communication
streams constitutes a primary challenge for collaborative multimodal systems.
Whereas a single-user multimodal interface can exert a high degree of control,
either directly or indirectly, over the language employed, systems dealing with
group communication must extract the information they require from natural
group discourse, a much harder proposition.
Collaborative systems are furthermore characterized by their vulnerability to
changes in work practices, which often result from the introduction of technology
(Grudin, 1988). As a consequence, interruptions by a system looking for explicit
confirmation of potentially erroneous interpretations may prove too disruptive.
This in turn requires the development of new approaches to system support that
are robust to misrecognitions and do not interfere with the natural flow of group
interactions (Kaiser & Barthelmess, 2006).
Automatic extraction of meeting information to generate rich transcripts
has been one of the focus areas in multimodal meeting analysis research. These rich
transcripts may combine video, audio, and notes with the output of analysis
components that transcribe the speech and provide some degree of semantic
analysis of the interaction. This analysis may detect who spoke when (which is
sometimes called “speaker diarization”) (Van Leeuwen & Huijbregts, 2006), what topics
were discussed (Purver et al., 2006), the structure of the argumentation (Verbree,
Rienks, & Heylen, 2006b), roles played by the participants (Banerjee & Rudnicky,
2004), action items that were established (Purver, Ehlen, & Niekrasz, 2006), structure
of the dialog (Verbree, Rienks, & Heylen, 2006a), and high-level turns of a meeting.
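
As a concrete illustration, the following Python sketch shows one possible way such a rich transcript could be represented as a data structure. It is a minimal sketch only: the class and field names are hypothetical and are not drawn from any of the systems cited above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical layout for a "rich transcript"; names are illustrative only.

@dataclass
class Segment:
    """One speaker-attributed stretch of talk (diarization label plus recognized speech)."""
    start: float            # seconds from meeting start
    end: float
    speaker: str            # diarization label, e.g. "spk_01"
    text: str               # recognized speech for this segment

@dataclass
class ActionItem:
    """An action item detected in the discussion."""
    description: str
    owner: str              # participant responsible
    segment_ids: List[int]  # indices of the segments that support this item

@dataclass
class RichTranscript:
    """Recognized speech plus higher-level annotation layers."""
    segments: List[Segment] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)        # topics discussed
    roles: Dict[str, str] = field(default_factory=dict)    # participant -> role
    action_items: List[ActionItem] = field(default_factory=list)

# A toy meeting with two segments, one topic, and one action item.
transcript = RichTranscript(
    segments=[
        Segment(0.0, 4.2, "spk_01", "Let's review the release schedule."),
        Segment(4.2, 9.8, "spk_02", "I can draft the plan by Friday."),
    ],
    topics=["release schedule"],
    roles={"spk_01": "project manager", "spk_02": "developer"},
    action_items=[ActionItem("Draft the release plan", "spk_02", [1])],
)

print(f"{len(transcript.segments)} segments, {len(transcript.action_items)} action item(s)")
```

In practice, each annotation layer (diarization labels, topics, roles, action items) would be populated by a separate analysis component operating over the recorded audio, video, and notes.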