Nof S.Y. Springer Handbook of Automation


Part C Automation Design: Theory, Elements, and Methods

Automating Errors and Conflicts Prognostics and Prevention

30.1 Definitions

Table 30.2 Examples of errors and conﬂicts in service automation

Error Conﬂict

• The engine of an airplane shuts down unexpectedly during the flight
• A patient's electronic medical records are accidentally deleted during system recovery
• A pacemaker stops working
• Traffic lights go off due to lightning
• A vending machine does not deliver drinks or snacks after the payment
• Automatic doors do not open
• An elevator stops between two floors
• A cellphone automatically initiates phone calls due to a software glitch
• The time between two flights in an itinerary generated by an online booking system is too short for transition from one flight to the other
• A ticket machine sells more tickets than the number of available seats
• An ATM machine dispenses $250 when a customer withdraws $260
• A translation software incorrectly interprets text
• Two surgeries are scheduled in the same room due to a glitch in a sensor that determines if the room is empty

An error is defined as

$$\exists E[u_{r,i}(t)], \quad \text{if } \vartheta_i(t) \xrightarrow{\text{Dissatisfy}} \mathrm{con}_r(t). \tag{30.1}$$

Here $E[u_{r,i}(t)]$ is an error, $u_i(t)$ is unit $i$ in a system at time $t$, $\vartheta_i(t)$ is unit $i$'s state at time $t$ that describes what has occurred with unit $i$ by time $t$, $\mathrm{con}_r(t)$ denotes constraint $r$ in the system at time $t$, and $\xrightarrow{\text{Dissatisfy}}$ denotes that a constraint is not satisfied. Similarly, a conflict is defined as

$$\exists C[n_r(t)], \quad \text{if } \theta_i(t) \xrightarrow{\text{Dissatisfy}} \mathrm{con}_r(t). \tag{30.2}$$

Here $C[n_r(t)]$ is a conflict and $n_r(t)$ is a network of units that need to satisfy $\mathrm{con}_r(t)$ at time $t$. The use of constraints helps define errors and conflicts unambiguously.

A constraint is the system speciﬁcation, expectation,

comparison objective or acceptable difference between

different units’ goals, plans, tasks or other activities.

Tables 30.1 and 30.2 illustrate errors and conﬂicts in

automation with some typical examples. There are also

human errors and conﬂicts that exist in automation

systems. Figure 30.1 describes the difference between

errors and conﬂicts in pin insertion.
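The constraint-based definitions (30.1) and (30.2) can be illustrated with a minimal Python sketch. It assumes that a unit's state is a dictionary of numbers and that a constraint is a predicate over one unit's state (error) or over the states of a network of units (conflict). The class names, the helper functions and the two scenarios (a vending machine and two surgeries, borrowed from Table 30.2) are hypothetical and serve illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]  # hypothetical representation of a unit's state at time t


@dataclass(frozen=True)
class UnitConstraint:
    """Constraint con_r(t) that a single unit u_i(t) must satisfy."""
    name: str
    unit: str
    satisfied: Callable[[State], bool]


@dataclass(frozen=True)
class NetworkConstraint:
    """Constraint con_r(t) that a network of units n_r(t) must satisfy."""
    name: str
    units: Tuple[str, ...]
    satisfied: Callable[[Dict[str, State]], bool]


def detect_errors(states: Dict[str, State],
                  constraints: List[UnitConstraint]) -> List[Tuple[str, str]]:
    """An error exists when a unit's state dissatisfies a constraint, cf. (30.1)."""
    return [(c.unit, c.name) for c in constraints
            if not c.satisfied(states[c.unit])]


def detect_conflicts(states: Dict[str, State],
                     constraints: List[NetworkConstraint]
                     ) -> List[Tuple[Tuple[str, ...], str]]:
    """A conflict exists when a network of units dissatisfies a constraint, cf. (30.2)."""
    return [(c.units, c.name) for c in constraints
            if not c.satisfied({u: states[u] for u in c.units})]


def no_room_overlap(s: Dict[str, State]) -> bool:
    a, b = s["surgery_a"], s["surgery_b"]
    return a["room"] != b["room"] or a["end"] <= b["start"] or b["end"] <= a["start"]


states = {
    "vending_machine": {"paid": 1.0, "delivered": 0.0},
    "surgery_a": {"room": 3.0, "start": 9.0, "end": 11.0},
    "surgery_b": {"room": 3.0, "start": 10.0, "end": 12.0},
}
unit_constraints = [
    UnitConstraint("delivers_after_payment", "vending_machine",
                   lambda s: s["paid"] == 0.0 or s["delivered"] == 1.0),
]
network_constraints = [
    NetworkConstraint("no_room_overlap", ("surgery_a", "surgery_b"), no_room_overlap),
]

errors = detect_errors(states, unit_constraints)
conflicts = detect_conflicts(states, network_constraints)
```

In this sketch the only difference between an error and a conflict is the scope of the constraint: one unit versus a network of units.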

This chapter provides a theoretical background and illustrates applications of how to prevent errors and conflicts automatically in production and service. Different terms have been used to describe the concept of errors and conflicts, for instance, failure (e.g., [30.2–5]), fault (e.g., [30.4, 6]), exception (e.g., [30.7]), and flaw (e.g., [30.8]). Error and conflict are the most popular terms appearing in the literature (e.g., [30.3, 4, 6, 9–15]). The related terms listed here are also useful descriptions of errors and conflicts. Depending on the context, some of these terms are interchangeable with error; some are interchangeable with conflict; and the rest refer to both error and conflict.

Eight key functions have been identiﬁed as useful

to prevent errors and conﬂicts automatically as de-

scribed below [30.16–19]. Functions 5–8 prevent errors

and conﬂicts with the support of functions 1–4. Func-

tions 6–8 prevent errors and conﬂicts by managing

those that have already occurred. Function 5, prognos-

tics, is the only function that actively determines which

errors and conﬂicts will occur, and prevents them. All

other seven functions are designed to manage errors

and conﬂicts that have already occurred, although as

a result they can prevent future errors and conﬂicts

directly or indirectly. Figure 30.2 describes error and

conflict propagation and their relationship with the eight

functions:

1. Detection is a procedure to determine if an error or a conflict has occurred.

2. Identification is a procedure to identify the observation variables most relevant to diagnosing an error or conflict; it answers the question: Which of them has already occurred?

3. Isolation is a procedure to determine the exact location of an error or conflict. Isolation provides more information than the identification function, in which only the observation variables associated with the error or conflict are determined. Isolation does not provide as much information as the diagnostics function, however, in which the type, magnitude, and time of the error or conflict are determined. Isolation answers the question: Where has an error or conflict occurred?

4. Diagnostics is a procedure to determine which error or conflict has occurred, what their specific characteristics are, or the cause of the observed out-of-control status.

5. Prognostics is a procedure to prevent errors and conflicts through analysis and prediction of error and conflict propagation.

6. Error recovery is a procedure to remove or mitigate the effect of an error.

7. Conflict resolution is a procedure to resolve a conflict.

8. Exception handling is a procedure to manage exceptions. Exceptions are deviations from an ideal process that uses the available resources to achieve the task requirement (goal) in an optimal way.

Fig. 30.2 Error and conflict propagation and eight functions to prevent errors and conflicts

There has been extensive research on the eight func-

tions, except prognostics. Various models, methods,

tools, and algorithms have been developed to automate

the management of errors and conﬂicts in production

and service. Their main limitation is that most of them

are designed for a speciﬁc application area, or even

a speciﬁc error or conﬂict. The main challenge of au-

tomating the management of errors and conﬂicts is how

to prevent them through prognostics, which is supported

by the other seven functions and requires substantial

research and developments.

30.2 Error Prognostics and Prevention Applications

30.2.1 Error Detection in Assembly and Inspection

As the ﬁrst step to prevent errors, error detection

has attracted much attention, especially in assembly

and inspection; for instance, researchers [30.3] have

studied an integrated sensor-based control system for

a ﬂexible assembly cell which includes error detection

function. An error knowledge base has been devel-

oped to store information about previous errors that had

occurred in assembly operations, and corresponding re-

covery programs which had been used to correct them.

The knowledge base provides support for both error

detection and recovery. In addition, a similar machine-

learning approach to error detection and recovery in

assembly has been discussed. To realize error recovery,

failure diagnostics has been emphasized as a necessary

step after the detection and before the recovery. It is

noted that, in assembly, error detection and recovery are

often integrated.

Automatic inspection has been applied in various manufacturing processes to detect, identify, and isolate errors or defects with computer vision. It is mostly used to detect defects on printed circuit boards [30.20–22] and dirt in paper pulps [30.23, 24]. The use of robots has enabled automatic inspection of hazardous materials (e.g., [30.25]) and in environments that human operators cannot access, e.g., pipelines [30.26]. Automatic inspection has also been adopted to detect errors in many other products such as fuel pellets [30.27], the printed contents of soft drink cans [30.28], oranges [30.29], aircraft components [30.30], and microdrills [30.31]. The key technologies involved in automatic inspection include but are not limited to computer or machine vision, feature extraction, and pattern recognition [30.32–34].

30.2.2 Process Monitoring and Error Management

Process monitoring, or fault detection and diagnostics

in industrial systems, has become a new subdiscipline

within the broad subject of control and signal pro-

cessing [30.35]. Three approaches to manage faults for

process monitoring are summarized in Fig. 30.3. The


analytical approach generates features using detailed

mathematical models. Faults can be detected and di-

agnosed by comparing the observed features with the

features associated with normal operating conditions

directly or after some transformation [30.19]. The data-

driven approach applies statistical tools to large amounts

of data obtained from complex systems. Many qual-

ity control methods are examples of the data-driven

approach. The knowledge-based approach uses qualita-

tive models to detect and analyze faults. It is especially

suited for systems in which detailed mathematical mod-

els are not available. Among these three approaches, the

data-driven approach is considered most promising be-

cause of its solid theoretical foundation compared with

the knowledge-based approach and its ability to deal

with large amounts of data compared with the analytical

approach. The knowledge-based approach, however,

has gained much attention recently. Many errors and

conﬂicts can be detected and diagnosed only by experts

who have extensive knowledge and experience, which

need to be modeled and captured to automate error and

conﬂict prognostics and prevention.
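As an illustration of the data-driven approach, the following Python sketch implements an exponentially weighted moving average (EWMA) control chart, one of the univariate monitoring techniques named in Fig. 30.3. It assumes that the in-control mean `mu0` and standard deviation `sigma` are known; the smoothing weight 0.2, the limit width 3 and the sample data are conventional textbook choices rather than values taken from this chapter.

```python
import math
from typing import Optional, Sequence


def ewma_first_alarm(samples: Sequence[float], mu0: float, sigma: float,
                     lam: float = 0.2, width: float = 3.0) -> Optional[int]:
    """Return the 1-based index of the first out-of-control sample, or None.

    z_t = lam * x_t + (1 - lam) * z_{t-1}, with z_0 = mu0. An alarm is raised
    when z_t leaves mu0 +/- width * std(z_t), where std(z_t) is the exact
    time-varying standard deviation of the EWMA statistic.
    """
    z = mu0
    for t, x in enumerate(samples, start=1):
        z = lam * x + (1.0 - lam) * z
        std_z = sigma * math.sqrt(lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t)))
        if abs(z - mu0) > width * std_z:
            return t
    return None


# Hypothetical process data: a mean shift of two standard deviations after sample 5.
in_control = [10.0] * 15
shifted = [10.0] * 5 + [12.0] * 10

no_alarm = ewma_first_alarm(in_control, mu0=10.0, sigma=1.0)
alarm = ewma_first_alarm(shifted, mu0=10.0, sigma=1.0)
```

The chart only detects that a fault has occurred; identifying, isolating and diagnosing it requires the additional functions described in Sect. 30.1.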

30.2.3 Hardware Testing Algorithms

The three fault management approaches discussed in

Sect. 30.2.2 can also be classified according to the way

that a system is modeled. In the analytical approach,

quantitative models are used which require the complete

speciﬁcation of system components, state variables, ob-

served variables, and functional relationships among

them for the purpose of fault management. The data-

driven approach can be considered as the effort to

develop qualitative models in which previous and cur-

rent data obtained from a system are used. Qualitative

models usually require less information about a sys-

tem than do quantitative models. The knowledge-based

approach uses qualitative models and other types of

models; for instance, pattern recognition techniques

use multivariate statistical tools and employ qualitative

models, whereas the signed directed graph is a typical

dependence model which represents the cause–effect

relationships in the form of a directed graph [30.36].

Similar to algorithms used in quantitative and

qualitative models, optimal and near-optimal test se-

quences have been developed to diagnose faults in

hardware [30.36–45]. The goal of the test sequencing

problem is to design a test algorithm that is able to

unambiguously identify the occurrence of any system

state (faulty or fault-free state) using the test in the test

set and minimizes the expected testing cost [30.37].

Fig. 30.3 Techniques of fault management in process monitoring. Analytical approach: parameter estimation, observers, parity relations. Data-driven approach: univariate statistical monitoring (Shewhart charts, cumulative sum (CUSUM) charts, exponentially weighted moving average (EWMA) charts) and multivariate statistical techniques (principal component analysis (PCA), Fisher discriminant analysis (FDA), partial least squares (PLS), canonical variate analysis (CVA)). Knowledge-based approach: causal analysis techniques (signed directed graph (SDG), symptom tree model (STM)), expert systems, and pattern recognition techniques (artificial neural networks (ANN), self-organizing map (SOM))

Fig. 30.4 Single-fault test strategy (legend: OR node, AND node, p = test passes, f = test fails)

The test sequencing problem belongs to the general

class of binary identiﬁcation problems. The problem to

diagnose single fault is a perfectly observed Markov

decision problem (MDP). The solution to the MDP is

a deterministic AND/OR binary decision tree with OR

nodes labeled by the suspect set of system states and

AND nodes denoting tests (decisions) (Fig. 30.4). It is well known that the construction of the optimal decision tree is an NP-complete problem [30.37].

Fig. 30.5 Digraph model of an example system (components 1–15; legend: component with test, component without test)

To subdue the computational explosion of the optimal test sequencing problem, algorithms that integrate concepts from information theory and heuristic search have been developed and were first used to diagnose faults in electronic and electromechanical systems with a single fault [30.37]. An X-Windows-based software tool, the testability engineering and maintenance system (TEAMS), has been developed for testability analysis of large systems containing as many as 50000 faults and 45000 test points [30.36]. TEAMS can be used to model individual systems and generate near-optimal diagnostic procedures. Research on test sequencing then expanded to diagnose multiple faults [30.41–45] in various real-world systems including the Space Shuttle's main propulsion system. Test sequencing algorithms with unreliable tests [30.40] and multivalued tests [30.45] have also been studied.

Table 30.3 D-matrix of the example system derived from Fig. 30.5

State/test  T(5)  T(6)  T(8)  T(11)  T(12)  T(13)  T(14)  T(15)
(1)         0     1     0     1      1      0      0      0
(2)         0     0     1     1      0      1      1      0
(3)         0     0     0     0      0      0      0      1
(4)         0     0     1     0      1      0      1      0
(5)         1     0     0     0      0      0      1      0
(6)         0     1     0     0      1      0      0      0
(7)         0     0     0     1      0      0      0      0
(8)         0     0     1     0      0      0      1      0
(9)         0     0     0     0      1      0      0      0
(10)        0     0     0     0      0      0      0      1
(11)        0     0     0     1      0      0      0      0
(12)        0     0     0     0      1      0      0      0
(13)        0     0     0     0      0      1      0      0
(14)        0     0     0     0      0      0      1      0
(15)        0     0     0     0      0      0      0      1

To diagnose a single fault in a system, the relationship between the faulty states and tests can be modeled by a directed graph (digraph model) (Fig. 30.5). Once a system is described in a digraph model, the full-order dependences among failure states and tests can be captured by a binary test matrix, also called a dependency matrix (D-matrix, Table 30.3). Other researchers have used the digraph model to diagnose faults in hypercube microprocessors [30.46]. The directed graph is a powerful tool to describe dependences among system components and tests.
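How a D-matrix can drive test sequencing is sketched below in Python, using the entries of Table 30.3. The sketch is a greedy one-step information-gain heuristic, not the AND/OR search algorithms of [30.37]. It assumes exactly one faulty state, equally likely faults, equal test costs and perfectly reliable tests. The function names and the test labels are chosen for illustration.

```python
import math
from typing import Dict, List, Tuple

TESTS = ["T(5)", "T(6)", "T(8)", "T(11)", "T(12)", "T(13)", "T(14)", "T(15)"]

# D[state][j] == 1 means that test j fails when the state (component) is faulty.
D: Dict[int, Tuple[int, ...]] = {
    1: (0, 1, 0, 1, 1, 0, 0, 0),  2: (0, 0, 1, 1, 0, 1, 1, 0),
    3: (0, 0, 0, 0, 0, 0, 0, 1),  4: (0, 0, 1, 0, 1, 0, 1, 0),
    5: (1, 0, 0, 0, 0, 0, 1, 0),  6: (0, 1, 0, 0, 1, 0, 0, 0),
    7: (0, 0, 0, 1, 0, 0, 0, 0),  8: (0, 0, 1, 0, 0, 0, 1, 0),
    9: (0, 0, 0, 0, 1, 0, 0, 0), 10: (0, 0, 0, 0, 0, 0, 0, 1),
    11: (0, 0, 0, 1, 0, 0, 0, 0), 12: (0, 0, 0, 0, 1, 0, 0, 0),
    13: (0, 0, 0, 0, 0, 1, 0, 0), 14: (0, 0, 0, 0, 0, 0, 1, 0),
    15: (0, 0, 0, 0, 0, 0, 0, 1),
}


def information_gain(suspects: List[int], j: int) -> float:
    """Binary entropy of the pass/fail split of the suspect set (uniform prior)."""
    p = sum(D[s][j] for s in suspects) / len(suspects)
    if p in (0.0, 1.0):
        return 0.0  # the test does not separate any suspects
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def diagnose(true_fault: int) -> Tuple[List[int], List[str]]:
    """Apply the most informative test repeatedly until no test splits the suspects."""
    suspects = list(D)
    remaining = list(range(len(TESTS)))
    applied: List[str] = []
    while remaining:
        best = max(remaining, key=lambda j: information_gain(suspects, j))
        if information_gain(suspects, best) == 0.0:
            break
        outcome = D[true_fault][best]  # simulated test result
        suspects = [s for s in suspects if D[s][best] == outcome]
        applied.append(TESTS[best])
        remaining.remove(best)
    return suspects, applied


suspects_for_5, tests_for_5 = diagnose(5)
suspects_for_3, tests_for_3 = diagnose(3)
```

Under these assumptions, states with identical rows in the D-matrix (for example states 3, 10 and 15) cannot be separated by any test sequence and remain together in the final suspect set.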

Three important issues have been brought to light

by extensive research on test sequencing problem and

should be considered when diagnosing faults for hard-

ware:

1. The order of dependences. The ﬁrst-order cause–

effect dependence between two nodes, i. e., how

a faulty node affects another node directly, is

the simplest dependence relationship between two

nodes. Earlier research did not consider the de-

pendences among nodes [30.37, 38], whereas in


most recent research different algorithms and test

strategies have been developed with the considera-

tion of not only the ﬁrst-order, but also high-order

dependences among nodes [30.43–45]. The high-

order dependences describe relationships between

nodes that are related to each other through other

nodes.

2. Types of faults. Faults can be classiﬁed into two

categories: functional faults and general faults.

A component or unit in a complex system may

have more than one function. Each function may

become faulty. A component may therefore have

one or more functional faults, each of which in-

volves only one function of the component. General

faults are those faults that cause faults in all func-

tions of a component. If a component has a general

fault, all its functions are faulty. Models that de-

scribe only general faults are often called worst-case

models [30.36] because of their poor diagnostic

ability.

3. Fault propagation time. Systems can be classiﬁed

into two categories: zero-time and nonzero-time

systems [30.45]. Fault propagation in zero-time sys-

tems is instantaneous to an observer, whereas in

nonzero-time systems it is several orders of magni-

tude slower than the response time of the observer.

Zero-time systems can be abstracted by taking the

propagation times to be zero.

Another interesting aspect of the test sequencing

problem is the list of assumptions that have been dis-

cussed in several articles, which are useful guidelines

for the development of algorithms for hardware testing:

1. At most one faulty state (component or unit) in

a system at any time [30.37]. This may be achieved

if the system is tested frequently enough [30.42].

2. All faults are permanent faults [30.37].

3. Tests can identify system states unambiguously

[30.37]. In other words, a faulty state is either iden-

tiﬁed or not identiﬁed. There is not a situation such

as: There is a 60% probability that a faulty state has

occurred.

4. Tests are 100% reliable [30.40,45]. Both false posi-

tive and false negative rates are zero.

5. Tests do not have common setup operations [30.42].

This assumption has been proposed to simplify the

cost comparison among tests.

6. Faults are independent [30.42].

7. Failure states that are replaced/repaired are 100%

functional [30.42].

8. Systems are zero-time systems [30.45].

Note the critical difference between assumptions 3

and 4. Assumption 3 is related to diagnostics ability.

When an unambiguous test detects a fault, the con-

clusion is that the fault has deﬁnitely occurred with

100% probability. Nevertheless, this conclusion could

be wrong if the false positive rate is not zero. This is the

test (diagnostics) reliability described in assumption 4.

When an unambiguous test does not detect a fault, the

conclusion is that the fault has not occurred with 100%

probability. Similarly, this conclusion could be wrong

if the false negative rate is not zero. Unambiguous tests

have better diagnostics ability than ambiguous tests. If

a fault has occurred, ambiguous tests conclude that the

fault has occurred with a probability less than one. Sim-

ilarly, if the fault has not occurred, ambiguous tests

conclude that the fault has not occurred with a proba-

bility less than one. In summary, if assumption 3 is true,

a test gives only two results: a fault has occurred or has

not occurred, always with probability 1. If both assumptions

3 and 4 are true, (1) a fault must have occurred if

the test concludes that it has occurred, and (2) a fault

must have not occurred if the test concludes that it has

not occurred.
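The distinction between assumptions 3 and 4 can be made concrete with Bayes' rule. The following Python sketch assumes an unambiguous test (it reports only fault or no fault) and computes the probability that the fault has really occurred given the verdict, for given false positive and false negative rates. The prior fault probability and the rates are hypothetical inputs.

```python
def posterior_fault(prior: float, test_says_fault: bool,
                    false_pos: float = 0.0, false_neg: float = 0.0) -> float:
    """Probability that the fault is present given the verdict of an unambiguous test.

    false_pos = P(test says fault | no fault), false_neg = P(test says no fault | fault).
    """
    if test_says_fault:
        numerator = (1.0 - false_neg) * prior
        denominator = numerator + false_pos * (1.0 - prior)
    else:
        numerator = false_neg * prior
        denominator = numerator + (1.0 - false_pos) * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("the verdict has zero probability under the given rates")
    return numerator / denominator


# Assumptions 3 and 4 both hold: the verdict is always right.
reliable_positive = posterior_fault(0.1, True)
reliable_negative = posterior_fault(0.1, False)

# Assumption 3 holds but assumption 4 does not: the test still reports a definite
# verdict, yet the verdict may be wrong.
unreliable_positive = posterior_fault(0.1, True, false_pos=0.05, false_neg=0.02)
```

With a 5% false positive rate and a 10% prior, a definite "fault" verdict is correct in only about two out of three cases, which is why test reliability has to be treated separately from test ambiguity.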

30.2.4 Error Detection in Software Design

The most prevalent method to detect errors in soft-

ware is model checking. As Clarke et al. [30.47] state,

model checking is a method to verify algorithmically

if the model of software or hardware design satisﬁes

given requirements and speciﬁcations through exhaus-

tive enumeration of all the states reachable by the

system and the behaviors that traverse them. Model

checking has been successfully applied to identify in-

correct hardware and protocol designs, and recently

there has been a surge in work on applying it to reason

about a wide variety of software artifacts; for exam-

ple, model checking frameworks have been applied to

reason about software process models, (e.g., [30.48]),

different families of software requirements models

(e.g., [30.49]), architectural frameworks (e.g., [30.50]),

design models (e.g., [30.51]), and system implemen-

tations (e.g., [30.52–55]). The potential of model

checking technology for (1) detecting coding errors

that are hard to detect using existing quality assurance

methods, e.g., bugs that arise from unanticipated in-

terleavings in concurrent programs, and (2) verifying

that system models and implementations satisfy crucial

temporal properties and other lightweight speciﬁcations

has led a number of international corporations and

government research laboratories such as Microsoft,


IBM, Lucent, NEC, the National Aeronautics and Space

Administration (NASA), and the Jet Propulsion Labo-

ratory (JPL) to fund their own software model checking

projects.
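The idea of exhaustively enumerating reachable states can be sketched with a toy explicit-state checker in Python. The example is hypothetical and unrelated to any particular tool: two processes each increment a shared counter with a non-atomic read followed by a write, and the checker searches all interleavings for a state that violates the property "when both processes are done, the counter equals 2".

```python
from collections import deque
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# State: (pc0, pc1, tmp0, tmp1, counter); pc 0 = before read, 1 = after read, 2 = done.
State = Tuple[int, int, int, int, int]


def successors(state: State) -> Iterator[Tuple[str, State]]:
    """Yield every state reachable by letting one process take one step."""
    pcs, tmps, counter = list(state[0:2]), list(state[2:4]), state[4]
    for p in (0, 1):
        new_pcs, new_tmps, new_counter = pcs.copy(), tmps.copy(), counter
        if pcs[p] == 0:            # read the shared counter into a local copy
            new_tmps[p] = counter
            new_pcs[p] = 1
            action = f"P{p} reads counter"
        elif pcs[p] == 1:          # write back the local copy plus one
            new_counter = tmps[p] + 1
            new_pcs[p] = 2
            action = f"P{p} writes counter"
        else:
            continue               # the process has finished
        yield action, (*new_pcs, *new_tmps, new_counter)


def check(initial: State, invariant: Callable[[State], bool]) -> Optional[List[str]]:
    """Breadth-first search of the state space; return a shortest counterexample."""
    parent: Dict[State, Optional[Tuple[State, str]]] = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace: List[str] = []
            while parent[state] is not None:
                state, action = parent[state]
                trace.append(action)
            return list(reversed(trace))
        for action, nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = (state, action)
                queue.append(nxt)
    return None


def counter_is_two_when_done(s: State) -> bool:
    return not (s[0] == 2 and s[1] == 2) or s[4] == 2


counterexample = check((0, 0, 0, 0, 0), counter_is_two_when_done)
```

The returned trace is an interleaving in which both processes read the counter before either writes it, the kind of unanticipated interleaving that testing may easily miss. The `parent` dictionary stores every visited state, which is also where the state-explosion problem shows up in larger models.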

A drawback of model checking is the state-

explosion problem. Software tends to be less structured

than hardware and is considered as a concurrent but

asynchronous system. In other words, two independent

processes in software executing concurrently in either

order result in the same global state [30.47]. Failing to

execute checking because of too many states is a partic-

ularly serious problem for software. Several methods,

including symbolic representation, partial order reduc-

tion, compositional reasoning, abstraction, symmetry,

and induction, have been developed either to decrease

the number of states in the model or to accommodate

more states, although none of them has been able to

solve the problem by allowing a general number of

states in the system.

Based on the observation that software model

checking has been particularly successful when it can

be optimized by taking into account properties of a spe-

ciﬁc application domain, Hatcliff and colleagues have

developed Bogor [30.56], which is a highly modular

model-checking framework that can be tailored to spe-

ciﬁc domains. Bogor’s extensible modeling language

allows new modeling primitives that correspond to do-

main properties to be incorporated into the modeling

language as ﬁrst-class citizens. Bogor’s modular archi-

tecture enables its core model-checking algorithms to

be replaced by optimized domain-speciﬁc algorithms.

Bogor has been incorporated into Cadena and tailored

to checking avionics designs in the common object re-

quest broker architecture (CORBA) component model

(CCM), yielding orders of magnitude reduction in

veriﬁcation costs. Speciﬁcally, Bogor’s modeling lan-

guage has been extended with primitives to capture

CCM interfaces and a real-time CORBA (RT-CORBA)

event channel interface, and Bogor’s scheduling and

state-space exploration algorithms were replaced with

a scheduling algorithm that captures the particular

scheduling strategy of the RT-CORBA event chan-

nel and a customized state-space storage strategy that

takes advantage of the periodic computation of avionics

software.

Despite this successful customizable strategy, there

are additional issues that need to be addressed when

incorporating model checking into an overall de-

sign/development methodology. A basic problem con-

cerns incorrect or incomplete speciﬁcations: before

veriﬁcation, speciﬁcations in some logical formalism

(usually temporal logic) need to be extracted from

design requirements (properties). Model checking can

verify if a model of the design satisﬁes a given speci-

ﬁcation. It is impossible, however, to determine if the

derived speciﬁcations are consistent with or cover all

design properties that the system should satisfy. That

is, it is unknown if the design satisﬁes any unspeci-

ﬁed properties, which are often assumed by designers.

Even if all necessary properties are veriﬁed through

model checking, code generated to implement the de-

sign is not guaranteed to meet design speciﬁcations, or

more importantly, design properties. Model-based soft-

ware testing is being studied to connect the two ends in

software design: requirements and code.

The detection of design errors in software engineer-

ing has received much attention. In addition to model

checking and software testing, for instance, Miceli

et al. [30.8] have proposed a metric-based technique for

design ﬂaw detection and correction. In parallel com-

puting, synchronization errors are major problems and

a nonintrusive detection method for synchronization er-

rors using execution replay has been developed [30.14].

In addition, concurrent error detection (CED) is well known for detecting errors in distributed computing systems; it relies on duplication [30.9, 57], which is sometimes considered a drawback.

30.2.5 Error Detection and Diagnostics in Discrete-Event Systems

Recently, Petri nets have been applied in fault detection

and diagnostics [30.58–60] and fault analysis [30.61–

63]. Petri nets are a formal modeling and analysis tool

for discrete-event or asynchronous systems. For hybrid

systems that have both event-driven and time-driven

(synchronous) elements, Petri nets can be extended to

global Petri nets to model both discrete-time and event

elements. To detect and diagnose faults in discrete-

event systems (DES), Petri nets can be used together

with ﬁnite-state machines (FSM) [30.64, 65]. The no-

tion of diagnosability and a construction procedure for

the diagnoser have been developed to detect faults in

diagnosable systems [30.64]. A summary of the use of

Petri nets in error detection and recovery before the

1990s can be found in the work of Zhou and DiCe-

sare [30.66].

To detect and diagnose faults with Petri nets, some

of the places in a Petri net are assumed observable and

others are not. All transitions in the Petri net are also

unobservable. Unobservable places, i.e., faults, indi-

cate that the number of tokens in those places is not


observable, whereas unobservable transitions indicate

that their occurrences cannot be observed [30.58, 60].

The objective of the detection and diagnostics is to

identify the occurrence and type of a fault based on

observable places within ﬁnite steps of observation

after the occurrence of the fault. It is clear that to

detect and diagnose faults with Petri nets, system mod-

eling is complex and time consuming because faulty

transitions and places must be included in a model.

Research on this subject has mainly involved the exten-

sion of previous work using FSM and has made limited

progress.

Faults in discrete-event systems can be diagnosed

with the decentralized approach [30.67]. Distributed

diagnostics can be performed by either diagnosers

communicating with each other directly or through a coordinator.

Alternatively, diagnostics decisions can be

made completely locally without combining the infor-

mation gathered [30.67]. The decentralized approach is

a viable direction for error detection and diagnostics in

large and complex systems.

30.2.6 Error Detection in Service and Healthcare Industries

Errors tend to occur frequently in certain service in-

dustries that involve intensive human operations. As

the use of computers and other automation devices,

e.g., handwriting recognition and sorting machines

in postal service, becomes increasingly popular, er-

rors can be effectively and automatically prevented

and reduced to a minimum in many service industries

including delivery, transportation, e-Business, and

e-Commerce. In some other service industries, espe-

cially in healthcare systems, error detection is critical

and limited research has been conducted to help de-

velop systems that can automatically detect human

errors and other types of errors [30.68–72]. Several

systems and modeling tools have been studied and

applied to detect errors in health industries with the

help of automation devices (e.g., [30.73–76]). Much

more research needs to be conducted to advance the

development of automated error detection in service in-

dustries.

30.2.7 Error Detection and Prevention Algorithms for Production and Service Automation

The fundamental work system has evolved from

manual power, human–machine system, computer-

aided and computer-integrated systems, and then to

e-Work [30.77], which enables distributed and decen-

tralized operations where errors and conﬂicts propagate

and affect not only the local workstation, but the

entire production/service network. Agent-based algo-

rithms, e.g., (30.3), have been developed to detect

and prevent errors in the process of providing a sin-

gle product/service in a sequential production/service

line [30.78, 79]. $Q_i$ is the performance of unit $i$. $U_m$ and $L_m$ are the upper limit and lower limit, respectively, of the acceptable performance of unit $m$. $U'_m$ and $L'_m$ are the upper limit and lower limit, respectively, of the acceptable level of the quality of a product/service after the operation of unit $m$. Units 1 through $m-1$ complete their operation on a product/service before unit $m$ starts its operation on the same product/service. An agent deployed at unit $m$ executes (30.3) to prevent errors:

$$\exists E(u_m), \quad \text{if } \left( U'_m - L_m < \sum_{i=1}^{m-1} Q_i \right) \cup \left( L'_m - U_m > \sum_{i=1}^{m-1} Q_i \right) \tag{30.3}$$
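A minimal Python sketch of the check that an agent at unit m might run is given below. It rests on one reading of (30.3), which should be treated as an assumption: the quality of a product/service is the sum of the unit performances, so an error at unit m is unavoidable if the contributions of units 1 through m−1 are already so large that even the lowest acceptable performance of unit m overshoots the upper quality limit, or so small that even its highest acceptable performance stays below the lower quality limit. The function and parameter names are hypothetical.

```python
from typing import Sequence


def error_predicted_at_unit(q_upstream: Sequence[float],
                            lower_perf_m: float, upper_perf_m: float,
                            lower_quality_m: float, upper_quality_m: float) -> bool:
    """Return True if an error at unit m can be predicted before unit m operates.

    q_upstream holds Q_1 .. Q_{m-1}; the performance limits bound what unit m
    can contribute; the quality limits bound the acceptable result after unit m.
    """
    accumulated = sum(q_upstream)
    overshoot = upper_quality_m - lower_perf_m < accumulated
    undershoot = lower_quality_m - upper_perf_m > accumulated
    return overshoot or undershoot


# Hypothetical numbers: unit m can contribute between 0.5 and 1.0,
# and the result after unit m must lie between 2.5 and 3.5.
ok = error_predicted_at_unit([1.0, 1.2], 0.5, 1.0, 2.5, 3.5)
too_high = error_predicted_at_unit([2.0, 1.4], 0.5, 1.0, 2.5, 3.5)
too_low = error_predicted_at_unit([0.5, 0.5], 0.5, 1.0, 2.5, 3.5)
```

Because the check needs only the accumulated upstream performance and the limits of unit m, it can run locally at unit m before that unit starts its operation.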

In the process of providing multiple products/services, traditionally, the centralized algorithm (30.4) is used to predict errors in a sequential production/service line. $I_i(0)$ is the quantity of available raw materials for unit $i$ at time 0. $\eta_i$ is the probability that a product/service is within specifications after being operated by unit $i$, assuming the product/service is within specifications before being operated by unit $i$. $\varphi_m(t)$ is the needed number of qualified products/services after the operation of unit $m$ at time $t$. Equation (30.4) predicts at time 0 the potential errors that may occur at unit $m$ at time $t$. Equation (30.4) is executed by a central control unit that is aware of $I_i(0)$ and $\eta_i$ of all units. Equation (30.4) often has low reliability, i.e., high false positive rates (errors are predicted but do not occur), or low preventability, i.e., high false negative rates (errors occur but are not predicted), because it is difficult to obtain accurate $\eta_i$ when there are many units in the system.

$$\exists E[u_m(t)], \quad \text{if } \min_{1 \le i \le m} \left( I_i(0) \times \prod_{j=i}^{m} \eta_j \right) < \varphi_m(t) \tag{30.4}$$
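The centralized prediction can be sketched in Python as follows. The sketch assumes that (30.4) compares the demand at unit m with the smallest expected number of qualified products that any unit's raw materials can yield once they have passed through all remaining units, i.e., the raw material of unit i multiplied by the yields of units i through m. This reading, the function name and the numbers are assumptions for illustration.

```python
from math import prod
from typing import Sequence


def centralized_error_predicted(raw_materials: Sequence[float],
                                yields: Sequence[float],
                                demand_m: float) -> bool:
    """Predict at time 0 whether unit m (the last unit listed) will miss its demand.

    raw_materials[i] is the raw material available to unit i+1 at time 0 and
    yields[i] is the probability that unit i+1 keeps a product within specifications.
    """
    if len(raw_materials) != len(yields) or not raw_materials:
        raise ValueError("one raw-material quantity and one yield per unit are required")
    m = len(raw_materials)
    expected_output = min(raw_materials[i] * prod(yields[i:]) for i in range(m))
    return expected_output < demand_m


# Hypothetical three-unit line.
raw = [100.0, 120.0, 90.0]
eta = [0.90, 0.95, 0.98]
error_at_80 = centralized_error_predicted(raw, eta, demand_m=80.0)
error_at_85 = centralized_error_predicted(raw, eta, demand_m=85.0)
```

The product of yields makes the prediction sensitive to inaccuracies in every single yield estimate, which is consistent with the low reliability and preventability noted above when the line has many units.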

To improve reliability and preventability, agent-based error prevention algorithms, e.g., (30.5), have been developed to prevent errors in the process of providing multiple products/services [30.80]. $C_m(t')$ is the number of cumulative conformities produced by unit $m$ by time $t'$. $N_m(t')$ is the number of cumulative nonconformities produced by unit $m$ by time $t'$. An agent deployed at unit $m$ executes (30.5) by using information about unit $m-1$, i.e., $I_{m-1}(t')$, $\eta_{m-1}$, and $C_{m-1}(t')$, to prevent errors that may occur at time $t$, $t' < t$. Multiple agents deployed at different units can execute (30.5) simultaneously to prevent errors. Each agent can have its own attitude, i.e., optimistic or pessimistic, toward the possible occurrence of errors. Additional details about agent-based error prevention algorithms can be found in the work by Chen and Nof [30.80]:

$$\exists E[u_m(t)], \quad \text{if } \min\left\{ I_m(t'),\; I_{m-1}(t') \times \eta_{m-1} + C_{m-1}(t') - N_m(t') - C_m(t') \right\} \times \eta_m + C_m(t') < \varphi_m(t), \quad t' < t. \tag{30.5}$$

Fig. 30.6 Incident mapping (elements: circumstance, error, initial response, known causes, unknown causes, contingent action, root causes after error analysis, preventive action)

30.2.8 Error-Prevention Culture (EPC)

To prevent errors effectively, an organization is ex-

pected to cultivate an enduring error-prevention culture

(EPC) [30.81], i.e., the organization knows what to do

to prevent errors when no one is telling it what to do.

The EPC model has ﬁve components [30.81]:

1. Performance management: the human performance system helps manage valuable assets and involves five key areas: (a) an environment to minimize errors, (b) human resources that are capable of performing tasks, (c) task monitoring to audit work, (d) feedback provided by individuals or teams through collaboration, and (e) consequences provided to encourage or discourage people for their behaviors.

2. System alignment: an organization's operating systems must be aligned to get work done with discipline, routines, and best practices.

3. Technical excellence: an organization must promote shared technical and operational understanding of how a process, system or asset should technically perform.

4. Standardization: standardization supports error prevention with a balanced combination of good manufacturing practices.

5. Problem-resolution skills: an organization needs people with effective statistical diagnostics and issue-resolution skills to address operational process challenges.

Not all errors can be prevented manually and/or by automation systems. When an error does occur, incident mapping (Fig. 30.6) [30.81], one of the exception-handling tools, can be used to analyze the error and proactively prevent future errors.

30.3 Conflict Prognostics and Prevention

Conflicts can be categorized into three classes [30.82]: goal conflicts, plan conflicts, and belief conflicts. Goals of an agent are modeled with an intended goal structure (IGS; e.g., Fig. 30.7), which is extended from a goal structure tree [30.83]. Plans of an agent are modeled with the extended program evaluation and review technique (E-PERT) diagram (e.g., Fig. 30.8). An agent has (1) a set of goals, which are represented by circles (Fig. 30.7) or circles containing a number (Fig. 30.8), (2) activities such as Act 1 and Act 2 to achieve the goals, (3) the time needed to complete an activity, e.g., T1, and (4) resources, e.g., R1 and R2 (Fig. 30.8). Goal conflicts are detected by comparing the goals of agents. Each agent has a PERT diagram, and plan conflicts are detected if agents fail to merge their PERT diagrams or the merged PERT diagram violates certain rules [30.82].
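The merge-and-check idea can be illustrated with a minimal sketch. The rules of [30.82] are not given here, so the two checks below are assumptions made only for illustration: a merge is treated as failed when the combined precedence relations are cyclic, and a rule violation is reported when two activities that share a resource overlap in time under an earliest-start schedule. The function name `find_plan_conflicts` and the tuple layout of a plan are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from itertools import combinations


def find_plan_conflicts(plans):
    """Merge per-agent plans and report conflicts (illustrative rules only).

    Each plan maps an activity name to (duration, predecessors, resources).
    Activity names are assumed to be unique across agents.
    """
    merged = {}
    for plan in plans:
        merged.update(plan)

    # Assumed rule 1: the merge fails if the precedence relations are cyclic.
    graph = {act: set(preds) for act, (_, preds, _) in merged.items()}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        return [("merge_failed", tuple(exc.args[1]))]

    def duration(act):
        # Predecessors that no agent defines are treated as zero-length.
        return merged[act][0] if act in merged else 0

    # Forward pass: earliest start is the latest finish among predecessors.
    start = {}
    for act in order:
        preds = merged[act][1] if act in merged else set()
        start[act] = max((start[p] + duration(p) for p in preds), default=0)

    # Assumed rule 2: activities sharing a resource must not overlap in time.
    conflicts = []
    for a, b in combinations(sorted(merged), 2):
        overlap = (start[a] < start[b] + duration(b)
                   and start[b] < start[a] + duration(a))
        if overlap:
            for res in sorted(merged[a][2] & merged[b][2]):
                conflicts.append(("resource", a, b, res))
    return conflicts


agent1 = {"Act1": (3, set(), {"R1"})}
agent2 = {"Act2": (2, set(), {"R1"})}
print(find_plan_conflicts([agent1, agent2]))
```

In this sketch the two activities both start at time 0 and both need R1, so one resource conflict is reported; making Act2 depend on Act1 serializes them and removes the conflict.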

The three classes of conflicts can also be modeled by Petri nets with the help of four basic modules [30.84]: sequence, parallel, decision, and decision-free, to detect conflicts in a multiagent system. Each agent's goal and plan are modeled by separate Petri nets [30.85], and many Petri nets are integrated using a bottom-up approach [30.66, 84] with three types of operations [30.85]: AND, OR, and precedence. The synthesized Petri net is analyzed to detect conflicts. Only normal transitions and places are modeled in Petri nets for conflict detection. The Petri-net-based approach for conflict detection developed so far has been rather limited. It has emphasized the modeling of a system and its agents more than the analysis process through which conflicts are detected.
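As a rough illustration of what analyzing a synthesized net can involve, the sketch below applies the textbook notion of conflict between enabled transitions: two transitions are both enabled, but a shared input place does not hold enough tokens for both to fire. This is not the analysis procedure of [30.84, 85]; the net, the place and transition names, and the function `find_conflicts` are all hypothetical.

```python
from itertools import combinations


def is_enabled(marking, inputs):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(place, 0) >= need for place, need in inputs.items())


def find_conflicts(marking, transitions):
    """Return (t1, t2, place) triples where two enabled transitions compete.

    `transitions` maps a transition name to its input arcs {place: weight}.
    """
    enabled = sorted(t for t, inputs in transitions.items()
                     if is_enabled(marking, inputs))
    found = []
    for a, b in combinations(enabled, 2):
        shared = set(transitions[a]) & set(transitions[b])
        for place in sorted(shared):
            needed = transitions[a][place] + transitions[b][place]
            if marking.get(place, 0) < needed:
                found.append((a, b, place))
    return found


# Two agents, each ready to act, competing for a single unit of resource R1.
marking = {"goal1_ready": 1, "goal2_ready": 1, "R1": 1}
transitions = {
    "agent1_act": {"goal1_ready": 1, "R1": 1},
    "agent2_act": {"goal2_ready": 1, "R1": 1},
}
print(find_conflicts(marking, transitions))
```

Adding a second token to R1 in this hypothetical marking lets both transitions fire, and the function then reports no conflict.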

The three common characteristics of available conflict detection approaches are: (1) they use the agent concept because a conflict involves at least two units in a system; (2) an agent is modeled multiple times because each agent has at least two distinct attributes: goal and plan; and (3) they not only detect, but mainly prevent conflicts because goals and plans are determined before agents start any activities to achieve them. The main difference between the IGS and PERT approach and the Petri net approach is that agents communicate with each other to detect conflicts in the former, whereas a centralized control unit analyzes the integrated Petri net to detect conflicts in the latter [30.85]. The Petri net approach does not detect conflicts using agents, although systems are modeled with agent technology. Conflict detection has been mostly applied in collaborative design [30.86–88]. The ability to detect conflicts in distributed design activities is vital to their success because multiple designers tend to pursue individual (local) goals prior to considering common (global) goals.

Fig. 30.7 Development of agent A's intended goal structure (IGS) over time (diagram labels: Time; Agent A's IGS)

Fig. 30.8 Merged program evaluation and review technique (PERT) diagram (diagram labels: Agent 1; Agent 2; Agent 3; Act1, T1; Act2, T2; Act3, T3; Act4, T4; Act5, T5; Act6, T6; Act7, T7; Act8, T8; R3, R4; Dummy)

30.4 Integrated Error and Conflict Prognostics and Prevention

30.4.1 Active Middleware

Middleware was originally defined as software that connects two separate applications or separate products and serves as the glue between two applications; for example, in Fig. 30.9, middleware can link several different database systems to several different web servers. The middleware allows users to request data from any database system that is connected to the middleware using the form displayed on the web browser of one of the web servers.
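The arrangement described above can be sketched in a few lines. This is only a toy model under stated assumptions: databases are in-memory dictionaries, and the class and method names (`Middleware`, `WebServer`, `register_database`, `handle_form`) are hypothetical, not part of any system described in the chapter.

```python
class Middleware:
    """Routes a request from any web server to any registered database."""

    def __init__(self):
        self._databases = {}

    def register_database(self, name, records):
        # `records` stands in for a real database connection.
        self._databases[name] = records

    def request(self, server, database, key):
        if database not in self._databases:
            raise LookupError(f"{server}: unknown database {database!r}")
        return self._databases[database].get(key)


class WebServer:
    """A server that forwards form submissions to the shared middleware."""

    def __init__(self, name, middleware):
        self.name = name
        self.middleware = middleware

    def handle_form(self, database, key):
        return self.middleware.request(self.name, database, key)


mw = Middleware()
mw.register_database("Database 1", {"part-17": "in stock"})
mw.register_database("Database 2", {"order-42": "shipped"})
server_1 = WebServer("Server 1", mw)
server_2 = WebServer("Server 2", mw)

# Either server can reach either database through the same middleware.
print(server_1.handle_form("Database 2", "order-42"))
print(server_2.handle_form("Database 1", "part-17"))
```

The servers hold no database-specific code in this sketch; adding a database only requires registering it with the middleware.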

Fig. 30.9 Middleware in a database server system (diagram labels: Database 1, Database 2, Database 3, Database n; Middleware; Server 1, Server 2, Server 3, Server m)

Fig. 30.10 Active middleware architecture (after [30.89]) (diagram labels: Distributed enterprises; Enterprises I; Enterprises II; HAD information systems: Engineering systems, planning decision systems; MAS; User: Human/machine; Workflows; DSS; Cooperative work protocols; Task/activity database; Modeling tool; Middleware; Distributed databases)

Fig. 30.11 Conflict and error detection model (CEDM) (diagram labels: Error and conflict announcement; CEDP; Receive; Send; CEDA; Error knowledge base; Error detection; Detection policy generation; Conflict evaluation)

Active middleware is one of the four circles of the "e-" in e-Work as defined by the PRISM Center (Production, Robotics, and Integration Software for Manufacturing & Management) at Purdue University [30.77]. Six major components in active middleware have been identified [30.89, 90]: modeling tool, workflows, task/activity database, decision support system (DSS), multiagent system (MAS), and collaborative work protocols. Active middleware has been developed to optimize the performance of interactions in heterogeneous, autonomous, and distributed (HAD) environments. It provides an e-Work platform and enables a universal model for error and conflict prognostics and prevention in a distributed environment. Figure 30.10 shows the structure of the active middleware; each component is described below:

1. Modeling tool: The goal of a modeling tool is to create a representation model for a multiagent system. The model can be transformed to next-level models, which will be the base of the system implementation.

2. Workflows: Workflows describe the sequence and relations of tasks in a system. Workflows store the answers to two questions: (1) Which agent will benefit from the task when it is completed by one or more given agents? (2) Which task must be finished before other tasks can begin? The workflows are specific to the given system, and can be managed by a workflow management system (WFMS).

3. Task/activity database: This database is used to record and help allocate tasks. There are many tasks in a large system such as those applied in automotive industries. Certain tasks are performed by several agents, and others are performed by one agent. The database records all task information and the progress of tasks (activity) and helps allocate and reallocate tasks if required.

4. Decision support system (DSS): DSS for the active middleware is like the operating system for a computer. In addition, DSS already has programs running for monitoring, analysis, and optimization. It can allocate/delete/create tasks, add or remove agents, and change workflows.

5. Multiagent system (MAS): MAS includes all agents in a system. It stores information about each agent, for example, capacity and number of agents, functions of an agent, working time, and effective date and expiry date of the agent.

6. Cooperative work protocols: Cooperative work protocols define communication and interaction protocols between components of active middleware. It is noted that communication between agents also includes communication between components because active middleware includes all agents in a system.
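The roles of the workflows and the task/activity database in the list above can be made concrete with a small sketch. It assumes a simple in-memory representation; the names `Task`, `TaskDatabase`, `ready`, and `reallocate`, as well as the example tasks and agents, are hypothetical and are not taken from the active middleware described in [30.89, 90].

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    performers: set   # agents that carry out the task
    beneficiary: str  # workflow question (1): who benefits on completion
    predecessors: set = field(default_factory=set)  # workflow question (2)
    done: bool = False


class TaskDatabase:
    """Records tasks and their progress, and supports (re)allocation."""

    def __init__(self, tasks):
        self.tasks = {task.name: task for task in tasks}

    def complete(self, name):
        self.tasks[name].done = True

    def ready(self):
        """Unfinished tasks whose predecessors have all been completed."""
        return sorted(
            task.name
            for task in self.tasks.values()
            if not task.done
            and all(self.tasks[p].done for p in task.predecessors)
        )

    def reallocate(self, name, performers):
        self.tasks[name].performers = set(performers)


db = TaskDatabase([
    Task("stamp", {"agent1"}, beneficiary="agent2"),
    Task("weld", {"agent2", "agent3"}, beneficiary="agent4",
         predecessors={"stamp"}),
    Task("paint", {"agent4"}, beneficiary="agent5", predecessors={"weld"}),
])
print(db.ready())
```

In this sketch a DSS-like component could call `ready()` to see which tasks may start, and `reallocate()` to move a task to different agents when required.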

30.4.2 Conflict and Error Detection Model

A conflict and error detection model (CEDM; Fig. 30.11) that is supported by the conflict and error detection protocol (CEDP, part of collaborative