
8.7.1 Qualitative evaluation
The TETRAD II publications (e.g., [265, 244]) have used stochastic sampling from
known causal models to generate artificial data and then reported percentages of
errors of four types:
Arc omission, when the learned model lacks an arc that is in the true model
Arc commission, when the learned model has an arc not in the true model
Direction omission, when the learned model leaves undirected an arc whose direction is required by the pattern of the true model
Direction commission, when the learned model orients an arc incorrectly according to the pattern of the true model
This is an attempt to quantify qualitative errors in causal discovery. It is, however,
quite crude. For example, some arcs will be far more important determinants of the
values of variables of interest than others, but these metrics assume all arcs, and all
arc directions within a pattern, are of equal importance.
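As a minimal sketch of how the four error types might be counted, assume each pattern is given as a set of directed arcs (parent, child) plus a set of undirected edges. The function name pattern_errors, this representation, and the reading of direction commission (a learned orientation on a shared adjacency that the true pattern does not contain) are illustrative assumptions, not the TETRAD II procedure. The function returns raw counts rather than the percentages reported in those publications.

```python
def adjacencies(directed, undirected):
    """Return the skeleton: every linked pair, ignoring orientation."""
    return {frozenset(arc) for arc in directed} | {frozenset(e) for e in undirected}


def pattern_errors(true_dir, true_undir, learned_dir, learned_undir):
    """Count the four error types between a true and a learned pattern.

    Directed arcs are (parent, child) tuples; undirected edges are pairs.
    """
    true_dir, learned_dir = set(true_dir), set(learned_dir)
    learned_undir = {frozenset(e) for e in learned_undir}
    true_adj = adjacencies(true_dir, true_undir)
    learned_adj = adjacencies(learned_dir, learned_undir)
    shared = true_adj & learned_adj
    return {
        # Adjacent in the true model, not adjacent in the learned model.
        "arc_omission": len(true_adj - learned_adj),
        # Adjacent in the learned model, not adjacent in the true model.
        "arc_commission": len(learned_adj - true_adj),
        # Directed in the true pattern, present but undirected when learned.
        "direction_omission": sum(
            1 for arc in true_dir if frozenset(arc) in learned_undir
        ),
        # Oriented when learned, but that orientation is not in the true
        # pattern (assumption: covers both reversed and unrequired directions).
        "direction_commission": sum(
            1 for arc in learned_dir
            if frozenset(arc) in shared and arc not in true_dir
        ),
    }


# Hypothetical five-variable example, one error of each type.
errors = pattern_errors(
    true_dir={("A", "C"), ("B", "C")},
    true_undir={("C", "D"), ("D", "E")},
    learned_dir={("A", "C"), ("D", "C")},
    learned_undir={("B", "C"), ("A", "B")},
)
print(errors)
```

Note that every error contributes exactly one unit to its count, whatever the strength of the arc involved, which is the crudeness discussed above.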
Regardless, this kind of metric is by far the most common in the published
literature. Indeed, the most common evaluative report consists of using the ALARM
network (Figure 5.2) to generate an artificial sample, applying the causal discovery
algorithm of interest, and counting the number of errors of omission and
commission. Every algorithm reported in this chapter is capable of recovering the ALARM
network to within a few arcs and arc directions, so this “test” does little to
differentiate between them. Cooper and Herskovits’s K2, for example, recovered
the network with one arc missing and one spurious arc added, from a sample size of
10,000 [54]. TETRAD II also recovers the ALARM network to within a few arcs
[265], although this is more impressive than the K2 result, since it needed no prior
temporal ordering of the variables. Again, Suzuki’s MDL algorithm recovered the
original network to within 6 arcs on a sample size of only 1000 [273].
Perhaps slightly more interesting is our own empirical study comparing TETRAD
II and CaMML on linear models, systematically varying arc strengths and sample
sizes [65]. CaMML proved nearly uniformly superior: it recovered the original
network to within its pattern faster than TETRAD II (i.e., on smaller samples).
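The data-generation side of such an experiment can be sketched as follows. This is only an illustration of stochastic sampling from a known linear model over a grid of arc strengths and sample sizes. The function names, the unit-variance Gaussian noise, the three-variable v-structure, and the grid values are all assumptions made here, not the setup used in [65].

```python
import random


def sample_linear_model(order, arcs, n, seed=0):
    """Draw n joint samples from a linear model with unit-variance noise.

    order: variables listed parents-before-children (a topological order).
    arcs:  dict mapping (parent, child) -> arc strength (linear coefficient).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for v in order:
            # Each variable is a weighted sum of its parents plus noise.
            mean = sum(w * row[p] for (p, c), w in arcs.items() if c == v)
            row[v] = mean + rng.gauss(0.0, 1.0)
        rows.append(row)
    return rows


def correlation(rows, x, y):
    """Sample Pearson correlation between variables x and y."""
    n = len(rows)
    mx = sum(r[x] for r in rows) / n
    my = sum(r[y] for r in rows) / n
    cov = sum((r[x] - mx) * (r[y] - my) for r in rows)
    vx = sum((r[x] - mx) ** 2 for r in rows)
    vy = sum((r[y] - my) ** 2 for r in rows)
    return cov / (vx * vy) ** 0.5


# Hypothetical grid: the v-structure A -> C <- B at several arc strengths
# and sample sizes. A discovery algorithm would be run on each data set.
for strength in (0.1, 0.5, 1.0):
    for n in (100, 1000):
        data = sample_linear_model(
            ["A", "B", "C"], {("A", "C"): strength, ("B", "C"): strength}, n
        )
        print(strength, n, round(correlation(data, "A", "C"), 3))
```

Weak arcs produce dependencies that are hard to distinguish from noise at small sample sizes, which is why recovery is naturally compared as a function of both quantities.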
8.7.2 Quantitative evaluation
Because of the maximum-likelihood equivalence of dags within a single pattern, it is
clear that two algorithms selecting identical models, or Markov equivalent models,
will be scored alike on ordinary evaluative metrics. But it should be equally clear that
non-equivalent models may well deserve equal scores as well, which the qualitative
scores above do not reflect. Thus, if a link reflects a nearly vanishing magnitude of
causal impact, a model containing it and another lacking it but otherwise the same
may properly receive (nearly) the same score. Again, the parameters for an association
between parents may lead to a simpler v-structure representing the very same
probability distribution as a fully connected network of three variables (see [293] for
© 2004 by Chapman & Hall/CRC Press LLC