MODELS WITH EXCLUSIVELY CATEGORICAL PREDICTORS
Dummy Coding
Categorical predictors cannot simply be entered as is into a regression equation.
One obvious reason is that the values may not convey any real quantitative informa-
tion, as in the case of a nominal variable. Even with a quantitative variable, however,
its relationship with Y may not be linear. What is needed is a system of coding
that is invariant to both the qualitative nature of a covariate’s values and to the func-
tional form of its relationship with Y. One such system is called dummy coding. The
name comes from the fact that the codes—ones and zeros—only represent whether
or not a case is in a given category of the variable, and otherwise convey no quanti-
tative meaning. As an example, regard Table 4.1, which presents average academic-
year salaries for 725 faculty members at Bowling Green State University (BGSU)
according to college and to whether they are on graduate faculty. Suppose that we
wish to regress academic year salary on whether or not someone is on graduate
faculty (a status that depends on research productivity and, when conferred, allows
one to teach graduate classes). We create a variable, GRAD, coded 1 if the person
is on graduate faculty and 0 otherwise. This is called a dummy variable. Letting
Y = academic-year salary, the model is E(Y) = β₀ + δ(GRAD) (I like to use deltas to
denote the coefficients of dummy variables). How is this interpreted? Well, for those
who are not on graduate faculty, the mean salary is E(Y) = β₀ + δ(0) = β₀. Thus, the
intercept is the mean of Y for those in the group coded 0, which is called the
contrast, reference, or omitted group. The mean salary for those on graduate faculty
is E(Y) = β₀ + δ(1) = β₀ + δ. I refer to this group as the interest category. The
difference in means between these two groups is E(Y | on graduate faculty) − E(Y | not
on graduate faculty) = β₀ + δ − β₀ = δ. A test of whether or not this mean difference
is significant is a test of H₀: δ = 0. This is just the usual test for the significance of
a regression coefficient, consisting of the parameter estimate, d, divided by its
estimated standard error. Least squares estimates of the parameters are obtained in
the usual fashion, by minimizing SSE with respect to the parameters. The least
squares estimate of β₀ is the sample mean for the omitted group, while the least
squares estimate of δ is the difference in sample means for the interest and omitted
groups. The estimated regression equation in this case is ŷ = 39582 + 11393(GRAD).
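These facts about the least squares estimates are easy to verify numerically. The sketch below uses invented salary data (not the BGSU figures) and checks that the fitted intercept equals the omitted-group sample mean and the fitted slope equals the difference in sample means:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data for illustration: two groups of academic-year salaries
y0 = rng.normal(40000, 8000, size=60)   # not on graduate faculty (GRAD = 0)
y1 = rng.normal(51000, 8000, size=90)   # on graduate faculty (GRAD = 1)

y = np.concatenate([y0, y1])
grad = np.concatenate([np.zeros(60), np.ones(90)])

# Design matrix: a column of ones for the intercept plus the dummy
X = np.column_stack([np.ones_like(grad), grad])
b0, d = np.linalg.lstsq(X, y, rcond=None)[0]

# Intercept = mean of the omitted group; slope = difference in group means
assert np.isclose(b0, y0.mean())
assert np.isclose(d, y1.mean() - y0.mean())
```

With a single dummy regressor, least squares simply reproduces the two group means, so the fitted values for every case are the sample mean of that case's group.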
From Table 4.1 it is evident that the intercept here is just the mean salary for those
not on graduate faculty, and the slope is the difference in mean salaries for the two
groups: 50975.061 − 39581.895 = 11393.166. The test statistic for the slope
(not shown) is a t value of 10.552, which is highly significant (p < .0001). Recall
that the regression model assumes equal error variance, implying equal Y variance,
at each covariate pattern. There are only two covariate patterns here, GRAD = 1 and GRAD = 0.
The assumption, therefore, is equal Y variance in each group—those on graduate
faculty and those not on graduate faculty—in the population. In other words, in
this case, regression accomplishes a test for the difference between group means
under the assumption of equal Y variance and is therefore equivalent to the two-
sample t test.
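This equivalence with the two-sample t test can also be checked directly. The sketch below (again with invented data) computes the slope's t statistic from the regression output and compares it with the pooled-variance two-sample t statistic; the two agree to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
y0 = rng.normal(40000, 8000, size=50)   # GRAD = 0
y1 = rng.normal(51000, 8000, size=70)   # GRAD = 1

y = np.concatenate([y0, y1])
x = np.concatenate([np.zeros(50), np.ones(70)])
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - k)              # MSE on n - 2 degrees of freedom
cov = s2 * np.linalg.inv(X.T @ X)
t_reg = beta[1] / np.sqrt(cov[1, 1])      # t statistic for the dummy's slope

# Pooled two-sample t statistic under the equal-variance assumption
n0, n1 = len(y0), len(y1)
sp2 = ((n0 - 1) * y0.var(ddof=1) + (n1 - 1) * y1.var(ddof=1)) / (n0 + n1 - 2)
t_pool = (y1.mean() - y0.mean()) / np.sqrt(sp2 * (1 / n0 + 1 / n1))

assert np.isclose(t_reg, t_pool)          # the two tests coincide
```

The agreement is exact, not approximate: with a single dummy regressor, the residual mean square equals the pooled variance estimate, and the (1, 1) element of (X'X)⁻¹ is 1/n₀ + 1/n₁, so the regression t statistic is algebraically identical to the pooled two-sample t.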