for models with approximate fit index values very close to their suggested thresholds.
Yuan (2005) studied properties of approximate fit indexes based on model test statistics
when distributional assumptions were violated. Under these less than ideal but probably
more realistic conditions, (1) expected values of approximate fit indexes had little rela-
tion to their threshold values; and (2) shapes of their distributions varied as functions
of sample size, model size, and the degree of misspecification. Yuan (2005) also noted
that we generally do not know the exact distributions of approximate fit indexes even
for correctly specified models. Beauducel and Wittmann (2005) studied the behavior of
approximate fit indexes for a relatively small range of measurement models of a kind
fairly typical in personality research. They found that the accuracy of thresholds was
affected by the relative sizes of factor loadings and whether unidimensional or multidi-
mensional measurement was specified. There were also relatively low intercorrelations
among different approximate fit indexes calculated for the same model and data. That is,
different indexes did not generally agree with each other.
Given results of the kind just summarized, Barrett (2007) suggested an outright ban
on approximate fit indexes. Hayduk et al. (2007) argued that thresholds for such indexes
are so untrustworthy and of such dubious utility that only model test statistics (and
their degrees of freedom and p values) should be reported and interpreted. These
arguments have theoretical and empirical bases and cannot be blithely dismissed. Oth-
ers argue that there is a place for such indexes in model testing (e.g., Mulaik, 2007, 2009),
but there is general agreement that treating thresholds for approximate fit indexes as
“golden rules” is no longer defensible. Barrett (2007) also suggested that research-
ers pay more attention to the accuracy of predictions generated by the model as a crucial
way of assessing its scientific value. True prediction studies in SEM are rare. A kind of
proxy prediction analysis concerns the reporting of R²-type statistics or effect decompositions for outcome variables. For pure measurement models of the kind estimated and analyzed in CFA, however, there are no external criteria predicted by the model, so reporting R²s for the indicators is about the only way to address this issue.
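Concretely, the R² for a single indicator in a CFA model is just the proportion of that indicator's variance reproduced by its factor. The sketch below is my own illustration with hypothetical numbers (the function name, loading, and variances are not from the text); for a unidimensional indicator the result equals the squared standardized loading:

```python
# A minimal sketch (made-up numbers) of R-squared for a single CFA indicator:
# the proportion of its variance explained by its factor.
# lam = unstandardized factor loading, phi = factor variance,
# theta = error (unique) variance of the indicator.
def indicator_r_squared(lam: float, phi: float, theta: float) -> float:
    explained = lam ** 2 * phi   # indicator variance due to the factor
    total = explained + theta    # model-implied total indicator variance
    return explained / total

# Example: loading of .80 on a factor with unit variance, error variance .36
print(indicator_r_squared(0.80, 1.0, 0.36))  # 0.64, the squared standardized loading
```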
My own view is that (1) model test statistics provide invaluable information about model–data discrepancies, taking sampling error into account, especially when the sample size is not large; and (2) there are no grounds for ignoring evidence against the model as indicated by a statistically significant result. This is because model test statistics can provide the first detectable sign of possible severe misspecification (Hayduk et al., 2007), and
approximate fit indexes can do no such thing. If the model fails a statistical test, then
this result should be taken seriously. This means that the researcher should report more
specific diagnostic information about the apparent sources of model–data discrepancy.
In a very large sample, the magnitudes of these discrepancies could be slight but never-
theless large enough to trigger a statistically significant test result. If so, then (1) failing
the statistical test may have been due more to the very large sample size than to absolute
magnitudes of model–data discrepancies, and (2) it may be possible to retain the model
despite a significant model test statistic. Otherwise, the model should be respecified in
a theoretically meaningful way. If no such respecifications exist, then no model should
be retained.
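To make the sample-size point concrete, the sketch below (my own illustration, with a hypothetical minimized fit function value and degrees of freedom) holds the amount of model–data discrepancy constant and shows how the model chi-square, which under maximum likelihood estimation is approximately N − 1 (or N, in some programs) times the minimized fit function value, crosses the significance threshold as N grows:

```python
# A minimal sketch of why the same absolute model-data discrepancy can pass
# the chi-square test in a moderate sample but fail it in a very large one.
from scipy.stats import chi2

F_hat = 0.02   # hypothetical minimized ML fit function value (slight misfit)
df = 24        # hypothetical model degrees of freedom

for N in (200, 1000, 5000):
    chisq = (N - 1) * F_hat   # approximate model test statistic
    p = chi2.sf(chisq, df)    # upper-tail p value
    print(f"N = {N:5d}  chi-square = {chisq:7.2f}  p = {p:.4f}")
```

With these made-up numbers, the identical discrepancy is nonsignificant at N = 200 but highly significant at N = 5,000, which is exactly the situation in which the researcher must judge whether misfit is trivial in magnitude or substantive.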
I also argue that diagnostic information about fit is needed even if a model “passes”