Kortum P. (ed.) HCI Beyond the GUI. Design for Haptic, Speech, Olfactory, and Other Nontraditional Interfaces


In Section 12.7, this is examined in the context of identifying causes of speech disfluencies, which introduce hard-to-process features into the spoken language.

Transparent Guidance of User Input

A strategy for dealing with hard-to-process features (e.g., disfluencies) present in

users’ input is to design interfaces that transparently guide users’ input to reduce

errors (Oviatt, 1995). The essence of this approach is to “get users to say [and do]

what computers can understand” (Zoltan-Ford, 1991).

The main goal of the design at this stage is to identify means to guide users’

input toward simpler and more predictable language. How this is achieved may

depend on the specifics of the domain. Determining domain-specific factors

requires detailed analysis of natural linguistic features manifested by users. Based

on this analysis, experimentation with alternative interface designs reveals meth-

ods to reduce complexity in ways that are not objectionable to users. This is illu-

strated in the online case study (see Section 12.7).

Techniques that have led to transparent guidance of user input include choos-

ing interface modalities that match the representational systems users require

while working on a task (Oviatt, Arthur, & Cohen, 2006; Oviatt & Kuhn, 1998),

structuring the interface to reduce planning load (Oviatt, 1995, 1996b, 1997),

and exploiting users’ linguistic convergence with a system’s output (Larson,

2003; Oviatt et al., 2005). In terms of design, the language used in a system’s pre-

sentation should be compatible with the language it can process. These aspects

are further discussed in Section 12.6.

Development of Formal Models

Linguistic and behavioral regularities detected during the analysis of a domain can

be represented via a variety of formal models that directly influence and drive the

interpretation process of a system. Given these models, a system is more likely to distinguish users’ actual intentions from a variety of alternative interpretations.

Formal models include primarily multimodal grammars used to drive interpre-

tation. These grammars provide rules that guide the input interpretation, which in

turn drives interface responses or activation of other applications.
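As a rough illustration of how grammar rules can drive interpretation, the sketch below pairs a spoken command with a pen gesture. It is a toy example and not one of the grammars used in the cited systems: the rule table, the gesture labels, and the two-second integration window (`MAX_LAG`) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechEvent:
    text: str     # recognized utterance
    start: float  # seconds
    end: float    # seconds


@dataclass
class PenEvent:
    kind: str     # hypothetical gesture label, e.g. "point" or "circle"
    target: str   # identifier of the object under the pen
    time: float   # seconds


# Hypothetical rules: spoken phrase -> (required gesture kind, resulting action).
RULES = {
    "delete this": ("point", "delete"),
    "zoom here": ("circle", "zoom"),
}

MAX_LAG = 2.0  # assumed integration window in seconds, not taken from the source


def interpret(speech: SpeechEvent, pen: PenEvent) -> Optional[dict]:
    """Return a joint interpretation if a rule matches both inputs, else None."""
    rule = RULES.get(speech.text.lower().strip())
    if rule is None:
        return None
    gesture_kind, action = rule
    if pen.kind != gesture_kind:
        return None
    # The gesture must fall within the window around the utterance.
    if not (speech.start - MAX_LAG <= pen.time <= speech.end + MAX_LAG):
        return None
    return {"action": action, "target": pen.target}
```

The returned dictionary stands in for whatever the interface would pass on to drive a response or activate another application.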

More recently, there has been a growing interest in the application of

machine learning techniques to model aspects of multimodal interactions. One

example is the user-adapted model developed by Huang and Oviatt (2006) to pre-

dict the patterns of users’ next multimodal interaction. Other machine learning

models have been used to interpret user actions during multimodal multiparty

interactions (e.g., McCowan et al., 2005; Pianesi et al., 2006; Zancanaro et al.,

2006). The advantage of these systems is that they are able to adapt to individual

users’ characteristics and are therefore more capable of processing input with

fewer errors.

12 Multimodal Interfaces: Combining Interfaces

422

12.5 TECHNIQUES FOR TESTING THE INTERFACE

Iterative testing throughout the development cycle is of fundamental importance

when designing a multimodal interface. Empirical evaluation and user modeling

are the proactive driving forces during system development.

In this section, the generic functionality required for successfully testing a

multimodal interface is described. An initial step in testing an interface is the col-

lection of user data for analysis. Given the need to examine a variety of modalities

when designing a multimodal interface, an appropriate data collection infrastruc-

ture is required (Section 12.5.1). A strategy that has proven very fruitful for proto-

typing new multimodal interfaces is to exploit high-fidelity simulation techniques,

which permit comparing trade-offs associated with alternative designs (Section

12.5.2). Analysis of multimodal data also requires synchronizing multiple data

streams and development of annotation tools (as described in Section 12.5.3).

12.5.1 Data Collection Infrastructure

The focus of data collection is on the multimodal language and interaction pat-

terns employed by users of a system. A data collection facility must therefore be

capable of collecting high-quality recordings of multiple streams of synchronized

data, such as speech, pen input, and video information conveying body motions,

gestures, and facial expressions.

Besides producing recordings for further analysis, the collection infrastruc-

ture also has to provide facilities for observation during the collection in order

to support simulation studies. Since views of each modality are necessary during

some simulations, capabilities for real-time data streaming, integration, and dis-

play are required. Building a capable infrastructure to prototype new multimodal

systems requires considerable effort (Arthur et al., 2006).

Of primary importance is that data be naturalistic and therefore representative

of task performance that is expected once a system is deployed. Thus, any data col-

lection devices or instrumentation need to be unobtrusive. This is challenging,

given the requirement for rich collection of a variety of data streams that each

can require a collection device (e.g., cameras, microphones, pens, physiological

sensors). This can be particularly problematic when the aim is to collect realistic

mobile usage information (Figure 12.17).

12.5.2 High-Fidelity Simulations

A technique that has been very valuable in designing multimodal interfaces is the

construction of high-fidelity simulations, or Wizard-of-Oz experiments (Oviatt et al.,

1992; Salber & Coutaz, 1993a, 1993b). These experiments consist of having subjects


interact with a system for which functionality is at least partially provided by a

human operator (i.e., “wizard”) who controls the interface from a remote location.

The main advantage of prototyping via simulations is that they are faster and

cheaper than developing a fully functional system and they provide advanced

information about interface design trade-offs. Hard-to-implement functionality

can be delegated to the wizard, which may shorten the development cycle consid-

erably via rapid prototyping of features. Simulations make it possible for interface

options to be tested before committing to actual system construction, and may

even support experiments that would not otherwise be possible given the state

of the art (one can, for instance, examine the repercussions of much enhanced

recognition levels on an interface).

One challenge when setting up a Wizard-of-Oz experiment is making it believ-

able (Oviatt, 1992; Arthur et al., 2006). In order to be effective, working prototypes

that users interact with must be credible; for example, making some errors adds to

credibility. The wizard needs to be able to react quickly and accurately to user

actions, which requires training and practice. Techniques for further facilitating

this type of experiment include simplifying the wizard’s interface through semi-

automation (e.g., by automatically initiating display actions) (Oviatt et al., 1992).

An automatic random error module can be used to introduce system misrecognitions; the error rate can be set at any level, and the errors contribute to the simulation’s credibility. Figure 12.18 shows an interface viewed by geometry students and the corresponding wizard interface.

FIGURE 12.17 Mobile data collection. The subject (on the left) carries a variety of devices, including a processing unit in a backpack. Source: From Oulasvirta et al. (2005); courtesy ACM.

To address the complexity of multimodal interactions, multiple wizards have

been used in the past in some studies, especially in cases involving collaboration

(Arthur et al., 2006; Oviatt et al., 2003; Salber & Coutaz, 1993a). Each wizard then

concentrates on providing simulation feedback for particular aspects of the interac-

tion, which assists in making the overall task of driving an interface manageable.
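The automatic random error module described in this section can be sketched in a few lines. This is a minimal illustration and not the module used in the cited studies: the class name, the `vocabulary` of substitute outputs, and the optional `seed` are assumptions introduced here.

```python
import random
from typing import List, Optional


class RandomErrorModule:
    """Replaces the wizard's intended output with a wrong one at a set rate."""

    def __init__(self, error_rate: float, vocabulary: List[str],
                 seed: Optional[int] = None):
        if not 0.0 <= error_rate <= 1.0:
            raise ValueError("error_rate must be between 0 and 1")
        self.error_rate = error_rate
        self.vocabulary = vocabulary       # hypothetical pool of substitute outputs
        self._rng = random.Random(seed)    # seeded so a session can be replayed

    def process(self, intended: str) -> str:
        """Return the intended output, or a simulated misrecognition."""
        if self._rng.random() < self.error_rate:
            wrong = [w for w in self.vocabulary if w != intended]
            if wrong:
                return self._rng.choice(wrong)
        return intended
```

Setting `error_rate` to 0.0 passes every wizard response through unchanged, while higher values let an experimenter compare how users react to different simulated recognition levels.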

12.5.3 Support for Data Analysis

The assessment of the effectiveness of an interface design or the comparison of alternative designs requires detailed examination of the data that have been collected. Analysis tools therefore need to provide means for playback and navigation of multiple data streams. High-fidelity synchronization is also required. Audio, video, and other input (e.g., pen) should be aligned well enough that differences are not perceptible to a human analyst.

FIGURE 12.18 Simulated multimodal interface. The interface as (a) seen by users and (b) used by the wizard to control the flow of interaction in a believable way. This interface was used during a collaborative geometry problem-solving interaction. Source: From Arthur et al. (2006); courtesy ACM.

The analysis process usually incorporates mark-ups (or annotations) of selected

parts of an interaction. These annotations might include speech transcripts and

semantic annotation of gestures, gaze, or prosodic characteristics of the speech.

The specifics of what is annotated depend on the purpose of the research and inter-

face being designed. Annotated data can be examined in terms of characteristics

of their language production, performance, and error characteristics under varying

circumstances.

A variety of different playback and annotation tools are available, such as

Nomos (Gruenstein, Niekrasz, & Purver, 2005), Anvil (Martin & Kipp, 2002),

and AmiGram (Lauer et al., 2005). Arthur et al. (2006) describe a tool for annota-

tion and playback of multiple high-definition video and audio streams appropriate

for the analysis of multiparty interactions.
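To make the idea of annotating synchronized streams concrete, the following sketch stores annotations from several modalities on one shared session clock and retrieves everything that overlaps a chosen interval. It does not reflect the data model of Nomos, Anvil, AmiGram, or any other tool named above; the `Annotation` fields, the per-stream offset, and the function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    stream: str   # e.g. "speech", "pen", "gaze"
    start: float  # seconds on the shared session clock
    end: float    # seconds on the shared session clock
    label: str    # transcript text or a gesture/gaze code


def to_session_time(local_time: float, stream_offset: float) -> float:
    """Map a device-local timestamp onto the shared clock (offset is assumed known)."""
    return local_time + stream_offset


def overlapping(annotations: List[Annotation], start: float, end: float) -> List[Annotation]:
    """Annotations from any stream that overlap the interval, in time order."""
    hits = [a for a in annotations if a.start < end and a.end > start]
    return sorted(hits, key=lambda a: (a.start, a.stream))
```

A query such as `overlapping(annotations, 2.0, 3.0)` would let an analyst see which gestures or gaze shifts co-occurred with a given utterance.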

12.6 DESIGN GUIDELINES

The guidelines presented in this section are grouped in two primary classes: those

that have to do with issues related to the uncertainty of recognition (Section 12.6.1),

and those concerned with the circumstances guiding the selection of modalities

(Section 12.6.2).

12.6.1 Dealing with Uncertainty

One essential aspect that has to be considered when building systems that rely

on ambiguous interpretation is how uncertain interpretations are dealt with

(Bourguet, 2006; Mankoff, Hudson, & Abowd, 2000). Shielding users from errors

and providing graceful ways for handling unavoidable misinterpretations are

essential usability concerns in this class of interfaces.

As discussed (Section 12.4.1), human language production involves a highly

automatized set of skills not under users’ full conscious control (Oviatt, 1996b).

The most effective strategy for error avoidance is to design interfaces that lever-

age users’ engrained cognitive and linguistic behavior in order to transparently

guide input that avoids errors. In fact, training and practice in an effort to change

engrained behavior patterns often prove useless (Oviatt et al., 2003, 2005). While

this strategy can be the most effective, it is also the most demanding in terms of user

modeling and implementation effort. In order to determine the root cause of errors

and design an effective strategy for avoiding and resolving them, a cycle of experi-

ments is required, as illustrated by the online case study (see Section 12.7). In this

section, the effective principles of multimodal interaction are distilled as a set of

guidelines.


Support the Range of Representational Systems Required by the Task

The structural complexity and linguistic variability of input generated by users

are important sources of processing difficulties. A primary technique to elicit

simpler, easier-to-process language is related to the choice of modalities that an

interface supports. Users will naturally choose the modalities that are most appro-

priate for conveying content. For example, users typically select pen input to pro-

vide location and spatially oriented information, as well as digits, symbols, and

graphic content (Oviatt, 1997; Oviatt & Olsen, 1994; Suhm, 1998). In contrast, they

will use speech for describing objects and events and for issuing commands for

actions (Cohen & Oviatt, 1995; Oviatt & Cohen, 1991).

A primary guideline is therefore to support modalities so that the representa-

tional systems required by users are available. The language that results when

adequate complementary modalities are available tends to be simplified linguisti-

cally, briefer, syntactically simpler, and less disfluent (Oviatt, 1997), and it con-

tains less linguistic indirection and fewer co-referring expressions (Oviatt &

Kuhn, 1998). One implication of this is that the fundamental language models

needed to design a multimodal system are not the same as those used in the past

for processing textual language.

Structure the Interface to Elicit Simpler Language

A key insight in designing multimodal interfaces that lead to simpler, more process-

able language is that the language employed by users can be shaped very strongly by

system presentation features. Adding structure, as opposed to having an uncon-

strained interface, has been demonstrated to be highly effective in simplifying the

language produced by users, resulting in more processable language and fewer

errors. A forms-based interface that guides users through the steps required to

complete a task can reduce the length of spoken utterances and eliminate up to

80 percent of hard-to-process speech disfluencies (Oviatt, 1995). Similar benefits

have been identified in map-based domains. A map with more detailed information

displaying the full network of roads, buildings, and labels can reduce disfluencies

compared to a minimalist map containing one-third of the roads (Oviatt, 1997).

Other techniques that may lead users toward expected language are guided

dialogs and context-sensitive cues (Bourguet, 2006). These provide additional

information that helps users determine what their input options are at each point

of an interaction, leading to more targeted production of terms that are expected

by the interface at a given state. This is usually implemented by having a prompt

that explicitly lists the options the user can choose from.
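A prompt that lists the options available in the current state can be generated directly from the set of inputs the system accepts in that state, which keeps the prompt and the recognizer vocabulary in agreement. The sketch below is illustrative only: the state names, the option lists, and the prompt wording are hypothetical, and each state is assumed to have at least two options.

```python
from typing import Dict, List

# Hypothetical dialog states and the inputs accepted in each one.
OPTIONS_BY_STATE: Dict[str, List[str]] = {
    "choose_action": ["add", "move", "delete"],
    "choose_object": ["road", "building", "label"],
}


def build_prompt(state: str) -> str:
    """Build a prompt that names exactly the inputs accepted in this state."""
    options = OPTIONS_BY_STATE[state]
    listed = ", ".join(options[:-1]) + ", or " + options[-1]
    return "Please say " + listed + "."


def accepts(state: str, user_input: str) -> bool:
    """Check whether the input is one of the options listed for this state."""
    return user_input.strip().lower() in OPTIONS_BY_STATE[state]
```

Because the prompt and the acceptance check read from the same table, the language presented to users stays compatible with the language the system can process.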

Exploit Natural Adaptation

A powerful mechanism for transparently shaping user input relies on the ten-

dency that users have of adapting to the linguistic style of their conversational


partners (Oviatt, Darves et al., 2005). A study in which 24 children conversed with

a computer-generated animated character confirmed that children’s speech signal

features, amplitude, durational features, and dialog response latencies sponta-

neously adapt to the basic acoustic-prosodic features of a system’s text-to-speech

output, with the largest adaptations involving utterance pause structure and amplitude. Adaptations occurred rapidly, bidirectionally, and consistently, with 70 to 95 percent of children’s speech converging with that of their computer partners.

Similar convergence has been observed in users’ responses to discrete system

prompts. People will respond to system prompts using the same wording and syn-

tactic style (Larson, 2003). This suggests that system prompts should be matched

to the language that the system ideally would receive from users—usually present-

ing a simple structure, restricted vocabulary, and recognizable signal features.

Offer Alternative Modalities Users Can Switch to When Correcting Errors

There is evidence that users switch the modality they are using to correct misrecognitions after repeated failures (Oviatt & VanGent, 1996). Users correcting a spoken misrecognition will attempt to repeat the misrecognized word via speech a

few times, but will then switch to another modality such as handwriting when

they realize that the system is unable to accept the correction. This behavior

appears to be more pronounced for experienced users compared to novices. The

latter tend to continue to use the same modality despite the failures (Halverson

et al., 1999). Therefore, a well-designed system should offer alternative modalities

for correcting misrecognitions. The absence of such a feature may lead to “error

spirals” (Halverson et al., 1999; Oviatt & VanGent, 1996)—situations in which the

user repeatedly attempts to correct an error, but due to increased hyperarticulation,

the likelihood of correct system recognition actually degrades.
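One way a system might act on this guideline is to count failed correction attempts in the same modality and suggest an alternative before an error spiral develops. The sketch below is a hedged illustration: the threshold of two failures, the `FALLBACK` pairing of speech with handwriting, and the class itself are assumptions, not values or designs reported in the cited studies.

```python
from typing import Optional

# Assumed pairing of each modality with an alternative to suggest.
FALLBACK = {"speech": "handwriting", "handwriting": "speech"}


class CorrectionTracker:
    """Suggests another modality after repeated failed corrections."""

    def __init__(self, max_failures: int = 2):
        self.max_failures = max_failures  # assumed threshold
        self._modality: Optional[str] = None
        self._failures = 0

    def record(self, modality: str, recognized_correctly: bool) -> Optional[str]:
        """Record one correction attempt; return a modality to suggest, or None."""
        if recognized_correctly:
            self._modality, self._failures = None, 0
            return None
        if modality != self._modality:
            # The user switched modality on their own, so restart the count.
            self._modality, self._failures = modality, 0
        self._failures += 1
        if self._failures >= self.max_failures:
            return FALLBACK.get(modality)
        return None
```

Since the text reports that novices tend to persist with the same modality, an explicit suggestion of this kind may matter most for them.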

Make Interpretations Transparent But Not Disruptive

One disconcerting effect of systems that rely on interpretation of users’ ambiguous input arises when users cannot clearly connect the actions performed by the system with the input they just provided. This is particularly troubling when

the system display disagrees with the expectations that users had when providing

the input, as would happen when a user commands a system to paint a table green

and sees the floor turning blue as a result (Kaiser & Barthelmess, 2006). One tech-

nique to make system operation more transparent is to give users the opportunity

to examine the interpretation and potentially correct it, such as via a graphical

display of the interpretation.

One popular way of making users aware of a system’s interpretation is to make available a list of alternate recognitions. These lists, sometimes called “n-best” lists, represent a limited number of the most likely (“best”) interpretations identified by a


system. This strategy needs to consider very carefully how easy it is to access and

dismiss the list, and also how accurate the list is. In cases in which the accuracy

is such that most of the time the correct interpretation is not present (for very

ambiguous input), this strategy can become counterproductive (Suhm, Myers, &

Waibel, 1999). When well designed, making alternative recognitions available can

be effective, to the point that users may try the list first and just attempt to repeat

misrecognized terms as a secondary option (Larson & Mowatt, 2003). One example

of a simple interface that makes an alternative interpretation available is Google.

A hyperlink lists a potential alternative spelling of query terms and may be

activated by a single click.
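The list of alternate recognitions discussed above can be produced by ranking the recognizer's scored hypotheses and keeping the top few. The sketch below assumes each hypothesis arrives as a (text, score) pair with higher scores being more likely; that representation, the default of three entries, and the function names are assumptions for illustration.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (interpretation, recognizer score)


def n_best(hypotheses: List[Hypothesis], n: int = 3) -> List[Hypothesis]:
    """Return the n highest-scoring hypotheses, best first."""
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)[:n]


def alternatives(hypotheses: List[Hypothesis], n: int = 3) -> List[str]:
    """Interpretations to offer for correction, excluding the top-ranked one."""
    return [text for text, _ in n_best(hypotheses, n)[1:]]
```

The top-ranked entry is the interpretation the system would act on, and `alternatives` gives the entries a user could pick with a single selection.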

Displaying the state of recognition can sometimes prove disruptive, such as

during multiparty interactions. Displaying misrecognitions or requiring the users

to choose among alternative recognitions in the course of a meeting has been

shown to disrupt the interaction, as users turn their attention to correcting the

system. A fruitful approach (explored, for example, in the Distributed Charter

System) is to present the state of recognition in a less distracting or forceful

way, such as via subtle color coding of display elements. Users are then free

to choose the moment that is most appropriate to deal with potential issues

(Kaiser & Barthelmess, 2006).
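Subtle color coding of recognition state can be as simple as mapping a confidence value to one of a few muted colors. The sketch below is not taken from the Distributed Charter System: the thresholds and hex colors are arbitrary assumptions, chosen only to show the mapping.

```python
# Assumed thresholds and colors; a real design would tune both with users.
HIGH_CONFIDENCE = 0.8
MEDIUM_CONFIDENCE = 0.5

COLOR_CONFIDENT = "#222222"  # normal text color, no visual emphasis
COLOR_UNCERTAIN = "#8a6d00"  # muted amber
COLOR_DOUBTFUL = "#a33a3a"   # muted red


def confidence_color(confidence: float) -> str:
    """Map a recognition confidence in [0, 1] to a display color."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= HIGH_CONFIDENCE:
        return COLOR_CONFIDENT
    if confidence >= MEDIUM_CONFIDENCE:
        return COLOR_UNCERTAIN
    return COLOR_DOUBTFUL
```

Because nothing is shown modally, users can decide when to deal with a doubtful element.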

12.6.2 Choosing Modalities to Support

The choice of which modalities to support is naturally an important one when a

multimodal interface is being designed. The appropriate modalities and character-

istics of the language support within each modality are influenced by the tasks

that users are to face, conditions of use, and user characteristics.

Context of Use

Conditions of use determine, for instance, whether a system is required to be

mobile or whether it is to be used within an office environment or a meeting

room. That in turn determines the nature and capability of the devices that can

be employed. Mobile users will certainly not accept carrying around heavy loads and will therefore not be able to take advantage of modalities that require processing power or sensors that cannot fit into a cell phone or PDA (personal digital assistant), as is the case for most vision-based techniques. Meeting rooms, on the

other hand, may make use of a much larger set of modalities, even in cases in

which several computers are required to run a system.

Most current mobile systems provide support for pen input and are powerful

enough to execute at least a certain level of speech recognition. Speech interfaces

are ideal for many mobile conditions because of the hands-and-eyes-free use that

speech affords. Pen input is used in many small devices as an alternative to keyboard input.


Other considerations associated with usage context that affect choice of mod-

alities are privacy and noise. Speech is less appropriate when privacy is a concern,

such as when the interface is to be used in a public setting. Speech is also not indi-

cated when noisy conditions are to be expected (e.g., in interfaces for construction

sites or noisy factories).

A well-designed interface takes advantage of multiple modalities to provide

support for a variety of usage contexts, allowing users to choose the modality that

is most appropriate given a particular situation.

User Characteristics

A well-designed interface leverages the availability of multiple modalities to

accommodate users’ individual differences by supporting input over alternative

modalities.

Motor and cognitive impairments, age, native language (e.g., accent), and

other individual characteristics influence individual choice of input modalities.

For example, pen input and gesturing are hard to use when there is diminished

motor acuity. Conversely, spoken input may prove problematic to users who have

speech or hearing impairments or who speak with an accent.

A particular kind of temporary impairment occurs when users are mobile or

are required to keep a high level of situational awareness, such as on a battlefield

or in an emergency response situation, or even while operating a vehicle (Oviatt,

2007). Supporting spoken interaction then becomes an attractive option.

12.7 CASE STUDIES

Case studies for these kinds of multimodal interfaces can be found at www.beyondthegui.com.

12.8 FUTURE TRENDS

Multimodal interfaces depend ultimately on the quality of the underlying recogni-

zers for multiple modalities that are used. Thus, these interfaces benefit from the

advances of natural language processing techniques and advances in hardware

capabilities that make it possible for more challenging recognitions to be

successfully achieved.

12.8.1 Mobile Computing

Mobile computing is one of the areas in which multimodal interfaces are expected

to play an important role in the future. Multimodal interfaces suit mobility particularly well because of the small form factor requirements that are usually imposed by on-the-move operation. Multimodal interfaces provide expressive

means of interaction to users without requiring bulky, hard-to-carry equipment

such as keyboards or mice.

The flexibility provided by interfaces that support multiple modes also fits

the demands introduced by mobility via the adaptation to shifting contexts of

use. A multimodal interface can, for instance, provide spoken operation for users

whose hands and eyes are busy, such as while driving, or maintaining situational

awareness in dangerous environments, such as disaster areas. These interfaces

can also adapt to noisy environments either by leveraging mutual disambiguation

to enhance recognition or by providing means for commands to be given via

nonspoken means such as a pen.

Speech and pen input are already supported by a variety of devices (smart

phones, PDAs), and successful commercial implementations are available. Further

computational power should lead to an era in which a majority of the mobile

devices might offer multimodal capabilities, including increasing levels of video-

based modalities.

12.8.2 Collaboration Support

Collaborative human interaction is intrinsically multimodal (Tang, 1991). Groups

of people collaborating make ample use of each other’s speech, gaze, pointing, and

body motion to focus attention and establish required shared contexts. Collabora-

tive multimodal interfaces are therefore a natural fit for groupware.

Interesting challenges are introduced by the shift from single-user to multi-

user interfaces. The language used by humans to communicate among them-

selves can be considerably more complex than that employed when addressing

computers (Oviatt, 1995). Systems that are based on observation of human–

human interaction will need to employ novel techniques for extracting useful

information.

The introduction of technology into collaborative settings has not been without problems. Collaborative interactions are known to be brittle in the face of technology (Grudin, 1988) and of the disruptions to subtle social processes that technology may introduce. Considerable care is therefore required to examine how systems that operate via natural language can best be integrated.

Pioneering systems have mostly employed a passive approach, in which obser-

vations performed by a system are collected with minimal direct interaction between

the system and a group of users. System results are delivered after the interactions

have concluded, in the form of browsable information (Ehlen, Purver, & Niekrasz,

2007) or MS-Project charts (Kaiser et al., 2004) or via a semiautomated “coach” that

presents episodes of the meeting to participants during which dysfunctional behavior

is detected (Zancanaro et al., 2006).
