simultaneously, and can co-constrain (Maybury, 1993). In the examples shown in
Maybury (1993), however, he used the same schema for linguistic identification
requests regardless of whether visual actions could be used. Judging from these
examples alone, he appears to treat visual identification requests such as pointing
actions as merely supplemental, with no effect on the linguistic requests. On the
other hand, WIP uses cross-modal references, references between modes that
are possible only when the system can utilize more than one mode (André and
Rist, 1994; Wahlster et al., 1991).
Several systems adopt different criteria for deciding which referring expression
should be preferred. For example, Neal and Shapiro (1991) claim that graphic/pictorial
presentation is always desirable, and that natural language can always be used
as a last resort. In Claassen (1992), contextual factors such as salience play an
important role, while whether the object is currently visible is taken into
account only towards the bottom of the decision tree. Although these criteria
obviously depend on the domains the systems are concerned with and need not
be identical, criteria grounded in empirical studies are needed.
The objective of our research is to empirically determine what kinds of information
are appropriate for referent identification requests in multi-modal dialogs,
and how that information should be communicated. The long-term goal of this
study is to provide useful suggestions for designing more sophisticated multi-modal
dialog systems. Cohen (1984) also examined referent identification requests and
compared the kinds of speech acts used to achieve them in two dialog situations:
keyboard dialog and telephone dialog. Our research not only extends the
situations considered to a multi-modal one in which conversants have audio and
visual channels, but also considers the kinds and amounts of information used
for referent identification and clarifies how they are influenced by the
communicative modes and contextual factors. Moreover, elaboration-related
phenomena and the roles of the addressee are also examined.
2 The Experiments and the Corpus
Experiments were conducted to obtain the corpus needed to design multi-modal
dialog systems. The task is the installation of a telephone with an answering-machine
feature. In this task, the telephone set is unpacked, and then eight settings,
such as checking the volume, adjusting the clock, and recording a response
message, are completed. Finally, some function buttons are explained.
In order to consider the effect of communicative modes and level of interac-
tivity, we recorded explanations in four situations: SD (Spoken-mode Dialog),
MD (Multi-modal Dialog), SM (Spoken-mode Monolog) and MM (Multi-modal
Monolog). In all situations, the experts were able to handle the telephones in
front of them freely. In the dialog situations, SD and MD, each expert conversed
with a remote apprentice to lead him/her through the installation. In the monolog
situations, SM and MM, the experts verbalized the instructions with no audience
present, on the assumption that an apprentice would later follow their instructions
by listening to an audiotape in SM or by watching a videotape in MM.