
temporal ordering of nodes (equivalently, reversing the arc directions relating those
nodes) must be at least as dense as the true model. In any case, a little reflection will
reveal that acceptance of Reichenbach’s Principle of the Common Cause implies that
non-causal distributions are strictly derivative from and explained by some causal
model. Of course, it is possible to reject Reichenbach’s principle (as, for example,
Williamson does [299]), but it is not clear that we can reject the principle while still
making sense of scientific method.
As we shall see below, many causal discovery algorithms evade, or presuppose
some solution to, the problem of identifying the correct variable order. In principle,
these algorithms can be made complete by iterating through all possible orderings
to find the sparsest model that the algorithm can identify for each given ordering.
Unfortunately, all possible orderings for $n$ variables number $n!$, so this completion
is exponential. Possibly, the problem can be resolved by sampling orderings and
imposing constraints when subsets of nodes have been ordered. So far as we know,
such an approach to causal discovery has yet to be attempted.
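To make the combinatorial point concrete, the following back-of-the-envelope sketch (ours, purely illustrative and not part of any discovery program) shows how quickly the number of orderings grows:

    from math import factorial

    # The number of distinct total orderings of n variables is n!.
    for n in (5, 10, 20, 40):
        print(f"{n} variables -> {factorial(n):,} orderings")

Even 20 variables yield roughly 2.4 x 10^18 orderings, so re-running a discovery algorithm once per ordering is hopeless beyond toy problems.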
6.3.1.2 Markov equivalence summary
In summary, Markov equivalence classes of causal models, or patterns, are related to
each other graphically by Verma and Pearl’s Theorem 6.1: they share skeletons and
v-structures. They are related statistically by having identical maximum likelihoods,
and so, by orthodox statistical criteria, they are not distinguishable. Despite that
limitation, learning the patterns from observational data is an important, and large,
first step in causal learning. We do not yet know, however, how close we can get in
practice towards that goal, since the CI algorithm is itself a highly idealized one. So:
in reality, how good can we get at learning causal patterns?
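Before turning to that question, Verma and Pearl's graphical criterion from Theorem 6.1 can itself be sketched directly in code. The sketch below is our own illustration; the dictionary-of-parents graph representation and the function names are simply conveniences, not drawn from any particular package.

    from itertools import combinations

    def skeleton(dag):
        """Undirected edge set of a DAG given as {node: set of parents}."""
        return {frozenset((child, parent))
                for child, parents in dag.items() for parent in parents}

    def v_structures(dag):
        """Colliders X -> Z <- Y in which X and Y are not themselves adjacent."""
        skel = skeleton(dag)
        return {(frozenset((x, y)), z)
                for z, parents in dag.items()
                for x, y in combinations(parents, 2)
                if frozenset((x, y)) not in skel}

    def markov_equivalent(dag1, dag2):
        """Verma and Pearl: same skeleton and same v-structures."""
        return (skeleton(dag1) == skeleton(dag2)
                and v_structures(dag1) == v_structures(dag2))

    # A -> B -> C and A <- B <- C share a skeleton and have no colliders,
    # so they fall in the same pattern; A -> B <- C does not.
    chain     = {'A': set(), 'B': {'A'}, 'C': {'B'}}
    reversed_ = {'A': {'B'}, 'B': {'C'}, 'C': set()}
    collider  = {'A': set(), 'B': {'A', 'C'}, 'C': set()}
    print(markov_equivalent(chain, reversed_))   # True
    print(markov_equivalent(chain, collider))    # False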
6.3.2 PC algorithm
Verma and Pearl’s CI algorithm appears to depend upon a number of unrealistic fea-
tures. First, it depends upon knowledge of the actual conditional independencies
between variables. How is such knowledge to be gained? Of course, if one has ac-
cess to the true causal structure through an actual oracle, then the independencies
and dependencies can be read off that structure using the d-separation criterion. But
lacking such an oracle, one must somehow infer conditional independencies from
observational data. The second difficulty with the algorithm is that it depends upon
examining independencies between all pairs of variables given every subset of vari-
ables not containing the pair in question. But the number of such alternative subsets
is exponential in the number of variables in the problem, making any direct imple-
mentation of this algorithm unworkable for large problems.
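A rough count makes the second difficulty vivid; the sketch below (again ours, purely illustrative) tallies the conditional independence tests a naive implementation would face, since each of the C(n, 2) pairs has 2^(n-2) candidate conditioning subsets drawn from the remaining variables:

    from math import comb

    def naive_ci_tests(n):
        # C(n, 2) pairs, each tested against every subset of the other n - 2 variables.
        return comb(n, 2) * 2 ** (n - 2)

    for n in (5, 10, 20, 30):
        print(f"{n} variables -> {naive_ci_tests(n):,} tests")

Thirty variables already call for over a hundred billion tests, which is why practical algorithms such as PC must prune the conditioning sets they examine.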
The causal discovery program TETRAD II copes with the first problem by ap-
plying a statistical significance test for conditional independence. For linear
models, conditional independence $X \perp Y \mid S$ is represented by the zero partial
correlation $\rho_{XY \cdot S} = 0$ (also described as a vanishing partial correlation), that is,
the correlation remaining between $X$ and $Y$ when the set $S$ is held constant. The
standard