
with n. With two variables, there are three dags; with three variables
there are 25 dags; with five variables there are about 29,000 dags; and with ten
variables there are about 4.2 × 10^18 possible models. As can be seen in (6.6),
this grows exponentially in n.
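Equation (6.6) is presumably Robinson's recursion for counting dags on n labelled nodes. As a sketch (the function name num_dags is ours, not the book's), the counts quoted above can be reproduced directly:

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Robinson's recursion: count the dags on n labelled nodes.

    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n,k) * 2^(k(n-k)) * a(n-k), a(0) = 1,
    where k counts the nodes with no incoming arc.
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
        for k in range(1, n + 1)
    )


for n in (2, 3, 5, 10):
    print(n, num_dags(n))
```

Running this gives 3 dags for two variables, 25 for three, 29,281 for five, and roughly 4.2 × 10^18 for ten, matching the figures in the text. The exponential factor 2^(k(n-k)) in each term is what drives the explosive growth of the model space.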
This problem of exponential growth of the model space does not go away nicely,
and we will return to it in Chapter 8 when we address metric learners of causal
structure. In the meantime, we will introduce a heuristic search method for learning
causal structure which is effective and useful.
6.3 Conditional independence learners
We can imagine a variety of different heuristic devices that might be brought to bear
upon the search problem, and in particular that might be used to reduce the size of
the space. Thus, if we had partial prior knowledge of some of the causal relations
between variables, or prior knowledge of temporal relations between variables, that
could rule out a great many possible models. We will consider the introduction of
specific prior information later, in the section on adaptation (Section 9.4).
But there must also be methods of learning causal structure which do not depend
on any special background knowledge: humans (and other animals), after all, learn
about causality from an early age, and in the first instance without much background.
Evolution may have built some understanding into us from the start, but it is also
clear that our individual learning ability is highly flexible, allowing us severally and
communally to adapt ourselves to a very wide range of environments. We should like
to endow our machine learning systems with such abilities, for we should like our
systems to be capable of supporting autonomous agency, as we argued in Chapter 1.
One approach to learning causal structure directly is to employ experimentation
in addition to observation: whereas observing a correlation between X and Y
guarantees that there is some causal relation between them (via the Common Cause
Principle), any of a large variety of causal relations will suffice to explain it. If,
however, we intervene, changing the state of X, and subsequently see a correlated
change in the state of Y, then we can rule out both Y being a cause of X and some
common cause being the sole explanation. So, experimental learning is clearly a
more powerful instrument for learning causal structure.
Our augmented model for causal reasoning of Chapter 3 suggests that learning
from experimental data is a special variety of learning from observational data; it is,
namely, learning from observational samples taken over the augmented model. Note
that adding the intervention variable I_X to the common causal structure of Figure
6.1 (b) yields an augmented structure in which I_X is a new parent of X. Observations
of I_X, without observations of X itself, can be interpreted as causal manipulations
of X. And, clearly, in such cases experimental data interpreted as an observation
in the augmented model will find no dependency between the causal intervention
and Y, since the intervening v-structure blocks the path between them.
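This can be checked with a small simulation. The sketch below assumes a linear Gaussian common-cause model with our own variable names (Z the common cause of X and Y, I_X the intervention variable); it is illustrative, not the book's example. Observationally X and Y are correlated, but the intervention I_X is independent of Y, because the v-structure I_X -> X <- Z blocks the only path between them when X is not conditioned on:

```python
import random


def corr(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)
N = 10_000

# Common cause Z of both X and Y (linear Gaussian, purely illustrative).
z = [random.gauss(0, 1) for _ in range(N)]
y = [zi + random.gauss(0, 1) for zi in z]

# Observational regime: X is generated from Z as usual.
x_obs = [zi + random.gauss(0, 1) for zi in z]

# Experimental regime: the intervention I_X sets X outright,
# severing the Z -> X arc in the manipulated model.
i_x = [random.gauss(0, 1) for _ in range(N)]
x_int = list(i_x)

print(corr(x_obs, y))  # substantial: the common cause Z induces correlation
print(corr(i_x, y))    # near zero: the v-structure I_X -> X <- Z blocks the path
```

The first correlation is large (about 0.5 in theory for these coefficients), while the second hovers near zero, which is exactly the signature that lets experimental data rule out the common-cause-only explanation.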
© 2004 by Chapman & Hall/CRC Press LLC