
These three chapters describe how to apply machine learning to the task of learn-
ing causal models (Bayesian networks) from statistical data. This has become a hot
topic in the data mining community. Modern databases are often so large they are im-
possible to make sense of without some kind of automated assistance. Data mining
aims to render that assistance by discovering patterns in these very large databases
and so making them intelligible to human decision makers and planners. Much of
the activity in data mining concerns rapid learning of simple association rules which
may assist us in predicting a target variable’s value from some set of observations.
But many of these associations are best understood as deriving from causal relations,
hence the interest in automated causal learning, or causal discovery.
The machine learning algorithms we will examine here work best when they have
large samples to learn from. This is just what large databases are: each row in
a relational database of N columns is a joint observation across N variables. The
number of rows is the sample size.
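For instance (a purely illustrative Python sketch; the variable names are invented for the example), a three-column table corresponds to joint observations over three variables:

    # Each row of an N-column table is one joint observation over N variables;
    # the number of rows is the sample size. Variable names here are hypothetical.
    columns = ("Smoker", "Bronchitis", "Cough")          # N = 3 variables
    rows = [
        (True,  True,  True),
        (False, False, False),
        (True,  False, True),
        (False, True,  True),
    ]
    sample_size = len(rows)                              # 4 joint observations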
In machine learning, samples are typically divided into two sets from the begin-
ning: a training set and a test set. The training set is given to the machine learning
algorithm so that it will learn whatever representation is most appropriate for the
problem; here that means either learning the causal structure (the dag of a Bayesian
network) or learning the parameters for such a structure (e.g., CPTs). Once such a
representation has been learned, it can be used to predict the values of target vari-
ables. This might be done to test how good a representation it is for the domain.
But if we test the model using the very same data employed to learn it in the first
place, we will reward models which happen to fit noise in the original data. This is
called overfitting. Since overfitting almost always leads to inaccurate modeling and
prediction when dealing with new cases, it is to be avoided. Hence, the test set is
isolated from the learning process and is used strictly for testing after learning has
completed.
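To make the division concrete, here is a minimal Python sketch of a random split; it assumes nothing about the particular learner, and the function name is ours:

    import random

    def train_test_split(rows, test_fraction=0.3, seed=0):
        """Randomly partition joint observations into training and test sets.

        The test set is withheld from learning entirely and used only to
        evaluate the learned model, so the model is not rewarded for
        fitting noise in the training data (overfitting).
        """
        rows = list(rows)
        random.Random(seed).shuffle(rows)
        n_test = int(len(rows) * test_fraction)
        return rows[n_test:], rows[:n_test]

    # e.g., 1000 joint observations over three binary variables
    data = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(1000)]
    training_set, test_set = train_test_split(data)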
Almost all work in causal discovery has focused on learning from observational
data — that is, simultaneous observations of the values of the variables in the
network. There has also been work on how to deal with joint observations where
some values are missing, which we discuss in Chapter 7. And there has been some
work on how to infer latent structure, meaning causal structure involving variables
that have not been observed (also called hidden variables). That topic is beyond the
scope of this text (see [200] for a discussion). Another relatively unexplored topic is
how to learn from experimental data. Experimental data report observations under
some set of causal interventions; equivalently, they report joint observations over the
augmented models of §3.8, where the additional nodes report causal interventions.
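As a sketch of the idea (the dict-of-parents representation and the names are ours, not a standard API), augmenting a model adds one intervention node as an extra root parent of each original variable:

    # A causal dag as a dict mapping each node to its list of parents.
    original_dag = {
        "Smoking": [],
        "Cancer": ["Smoking"],
    }

    def augment_with_interventions(dag):
        """Add an intervention node I_X as an extra root parent of each X.

        Experimental data can then be read as joint observations over the
        augmented model, with I_X recording whether X was set by intervention.
        """
        augmented = {}
        for node, parents in dag.items():
            intervention = "I_" + node
            augmented[intervention] = []                   # intervention nodes are roots
            augmented[node] = list(parents) + [intervention]
        return augmented

    print(augment_with_interventions(original_dag))
    # {'I_Smoking': [], 'Smoking': ['I_Smoking'],
    #  'I_Cancer': [], 'Cancer': ['Smoking', 'I_Cancer']}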
The primary focus of Part II of this book will be the presentation of methods that are
relatively well understood, namely causal discovery with observational data. That is
already a difficult problem involving two parts: searching through the causal model
space, looking for individual (causal) Bayesian networks to evaluate; and evaluating
each such network relative to the data, perhaps using some score or metric, as in
Chapter 8. Both parts are hard. The model space, in particular, is exponential in the
number of variables (§6.2.3).
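To give a sense of how quickly the space grows, the following short Python sketch (the function is ours, for illustration only) counts the distinct labelled dags over n variables using Robinson's recursion:

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def num_dags(n):
        """Number of distinct labelled dags over n variables (Robinson's recursion)."""
        if n == 0:
            return 1
        return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
                   for k in range(1, n + 1))

    for n in range(1, 6):
        print(n, num_dags(n))    # 1, 3, 25, 543, 29281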
We present these causal discovery methods in the following way. We start in Chap-