Table 6.5. One will find that the three most influential faculty members in the dataset
have one element in common: They are all “eminent scholars.” These are professors
with particularly luminous reputations in their fields who had been recently hired
under a statewide program to enhance the quality of scholarship at Ohio universities.
Their salaries were, as a consequence, considerably higher than those of other faculty
with comparable rank and experience at BGSU. In this case, however, the phenomenon
is easy to accommodate by including in the model a dummy variable that identifies
these scholars (see, e.g., Balzer et al., 1996). Once the dummy is included,
these cases no longer exert such influence, because they are better fitted by the model.
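The effect of such a dummy can be sketched with a small numerical illustration. The data below are made up for the purpose (they are not the BGSU salaries), the three "eminent" cases are placed exactly on a line parallel to the others so that the dummy captures them cleanly, and Cook's distance is assumed here as the measure of influence; the helper name cooks_distance is hypothetical.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each case in an OLS fit of y on X.

    X is assumed to already contain the intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverage values: diagonal of the hat matrix, via a thin QR decomposition.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 * h / (p * s2 * (1.0 - h) ** 2)

# Made-up data: 20 regular faculty plus 3 "eminent scholars" whose
# salaries sit 40 units above the trend followed by everyone else.
years = np.concatenate([np.arange(1, 21), [18, 20, 22]]).astype(float)
eminent = np.concatenate([np.zeros(20), np.ones(3)])
noise = np.concatenate([(((7 * np.arange(20)) % 5) - 2) * 0.5, np.zeros(3)])
salary = 40.0 + 1.5 * years + 40.0 * eminent + noise

ones = np.ones_like(years)
X_without = np.column_stack([ones, years])         # no dummy
X_with = np.column_stack([ones, years, eminent])   # dummy included

d_without = cooks_distance(X_without, salary)
d_with = cooks_distance(X_with, salary)

is_eminent = eminent == 1
max_without = d_without[is_eminent].max()
max_with = d_with[is_eminent].max()
top3_without = set(np.argsort(d_without)[-3:].tolist())

print("Three most influential cases without the dummy:", sorted(top3_without))
print("Largest Cook's D among eminent cases, no dummy:  ", round(max_without, 4))
print("Largest Cook's D among eminent cases, with dummy:", round(max_with, 4))
```

In this constructed example the three eminent cases are the most influential when the dummy is omitted, and their influence nearly vanishes once it is included. With real data the reduction would depend on how closely the flagged cases follow a common trend among themselves.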
If influential observations represent neither “bad” data nor flaws in model
specification, nothing further can be done about them. There is no rationale for deleting
legitimate data from the analysis, regardless of their influence. It may nevertheless
be fruitful to know that the results are largely due to perhaps only a few “interesting”
cases in the data. The analyst may then wish to be more cautious in attempting to
generalize beyond the current sample until the findings can be replicated in other datasets.
At any rate, let’s consider some tools to use in the exploration of such cases.

Building Blocks of Influence: Outliers and Leverage

The degree to which a case has the ability to influence the regression analysis
depends on two characteristics: the extent to which it is an outlier, and the extent to
which it has leverage. An outlier is a case that is far from the regression trend exhibited
by the other data points. That is, if the regression of Y on the X’s is represented
by a “swarm” of points in p-dimensional space, an outlier is a point that is at a
noticeable distance from this swarm, in the Y direction. An outlier is typically
identified by having a comparatively large residual, indicating that its actual Y value
is nowhere near where it is “supposed to be” [i.e., on the regression line (or plane) that
runs through the swarm]. An example of an outlier can be seen in Figure 2.1, discussed
earlier, showing the regression of the first exam score on the math diagnostic
for 213 students. It is the lowest point in this swarm of points and occurs in the middle
of the plot, the point with an X score of 37 and a Y score of 17. It is also the point
having the largest residual in the data (in absolute value), or the largest standardized
residual (see also Figures 2.6 and 2.8).
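As a minimal sketch of how such a case is flagged, the code below fits a simple regression to synthetic scores (invented for illustration, not the data behind Figure 2.1) that include one case at X = 37, Y = 17, and then locates the largest residual. The standardized residual is computed here as the residual divided by s times the square root of 1 minus the hat value, which is one common definition and is an assumption about the definition intended in the text.

```python
import numpy as np

# Synthetic scores: 27 cases following a linear trend with small
# deterministic deviations, plus one low case at X = 37, Y = 17.
x_trend = np.arange(28, 55).astype(float)
deviations = np.resize(np.array([-3.0, 1.0, 2.0, -1.0, 1.0]), x_trend.size)
x = np.concatenate([x_trend, [37.0]])
y = np.concatenate([10.0 + 0.8 * x_trend + deviations, [17.0]])

# OLS fit of Y on X with an intercept.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

n, p = X.shape
s = np.sqrt(resid @ resid / (n - p))   # residual standard error

# Hat values (diagonal of the hat matrix) via a thin QR decomposition.
Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)

# Standardized residuals under the assumed definition.
std_resid = resid / (s * np.sqrt(1.0 - h))

worst = int(np.argmax(np.abs(std_resid)))
print("Case with largest |standardized residual|:", worst)
print("Its X and Y scores:", x[worst], y[worst])
print("Its standardized residual:", round(float(std_resid[worst]), 2))
```

The same case has both the largest raw residual and the largest standardized residual, which is how an outlier in the Y direction typically shows up.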
However, being an outlier by itself does not give an observation the power to
affect the regression. The location of this particular outlier in the middle of the data
does not allow it to exert much “pull” on the regression line. We say that it lacks
leverage to affect the regression. An observation has leverage to the extent that its
covariate pattern is atypical. That is, its combination of X scores is far from the
centroid, or vector of means, of the X’s. For the data in Figure 2.1, in which there
is only one X, the centroid is 40.925, the mean on the math diagnostic. Since a
score of 37 is relatively close to this, the outlier has little leverage. The two observations
with math diagnostic scores of 28 (at the far left in the figure), on the other
hand, have considerable leverage. However, their location in the Y direction is consistent
with the trend in the swarm of points: their Y scores are 37 and 45, respectively.
Hence one can imagine that the regression line would not change much