DUMMY VARIABLES
regression. The question is whether the difference in the fit is significant, and we test this with an F test. The numerator is the improvement in fit on splitting the sample, divided by the cost (having to estimate two sets of parameters instead of only one). In this case it is (8.9160 – 4.7045) × 10^11 divided by 2 (we have had to estimate two intercepts and two slope coefficients, instead of only one of each). The denominator is the joint RSS remaining after splitting the sample, divided by the joint number of degrees of freedom remaining. In this case it is 4.7045 × 10^11 divided by 70 (74 observations, less four degrees of freedom because two parameters were estimated in each equation). When we calculate the F statistic the 10^11 factors cancel out and we have
F(2, 70) = (4.2115 × 10^11 / 2) / (4.7045 × 10^11 / 70) = 31.3    (6.38)
The critical value of F(2,70) at the 0.1 percent significance level is a little below 7.77, the critical
value for F(2,60), so we come to the conclusion that there is a significant improvement in the fit on
splitting the sample and that we should not use the pooled regression.
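The arithmetic of the test can be verified with a short sketch. The RSS figures are the ones quoted above (in units of 10^11, which cancel in the ratio); the variable names are illustrative only:

```python
# Chow test arithmetic for the school cost example (figures from the text).
rss_pooled = 8.9160   # RSS of the pooled regression, x 10^11
rss_split = 4.7045    # joint RSS of the two subsample regressions, x 10^11

k = 2                 # parameters estimated per equation (intercept and slope)
n = 74                # number of observations
dof = n - 2 * k       # 70 degrees of freedom after splitting the sample

improvement = rss_pooled - rss_split          # 4.2115
F = (improvement / k) / (rss_split / dof)
print(round(F, 1))                            # 31.3
```

Since 31.3 comfortably exceeds the 0.1 percent critical value, the same conclusion follows: the pooled regression should not be used.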
Relationship between the Chow Test and the F Test of
the Explanatory Power of a Set of Dummy Variables
In this chapter we have used both dummy variables and a Chow test to investigate whether there are
significant differences in a regression model for different categories of a qualitative characteristic.
Could the two approaches have led to different conclusions? The answer is no, provided that a full set
of dummy variables for the qualitative characteristic has been included in the regression model, a full
set being defined as an intercept dummy, assuming that there is an intercept in the model, and a slope
dummy for each of the other variables. The Chow test is then equivalent to an F test of the
explanatory power of the dummy variables as a group.
To simplify the discussion, we will suppose that there are only two categories of the qualitative
characteristic, as in the example of the cost functions for regular and occupational schools. Suppose
that you start with the basic specification with no dummy variables. The regression equation will be
that of the pooled regression in the Chow test, with every coefficient a compromise for the two
categories of the qualitative variable. If you then add a full set of dummy variables, the intercept and
the slope coefficients can be different for the two categories. The basic coefficients will be chosen so
as to minimize the sum of the squares of the residuals relating to the reference category, and the
intercept dummy and slope dummy coefficients will be chosen so as to minimize the sum of the
squares of the residuals for the other category. Effectively, the outcome of the estimation of the
coefficients is the same as if you had run separate regressions for the two categories.
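This equivalence can be checked numerically. The sketch below simulates two-category data (all names and parameter values are invented for illustration), fits one regression with a full set of dummy variables, and confirms that the implied coefficients for each category match two separate subsample regressions exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n observations in each of two categories.
# d = 0 for the reference category, d = 1 for the other category.
n = 50
x = rng.uniform(0, 10, 2 * n)
d = np.repeat([0.0, 1.0], n)
y = 3 + 2 * x + d * (1.5 + 0.5 * x) + rng.normal(0, 1, 2 * n)

# Pooled regression with a full set of dummies:
# intercept, x, intercept dummy d, and slope dummy d*x.
X = np.column_stack([np.ones(2 * n), x, d, d * x])
b_dummy, *_ = np.linalg.lstsq(X, y, rcond=None)

# Separate regressions on the two subsamples.
b0, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x[:n]]), y[:n], rcond=None)
b1, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x[n:]]), y[n:], rcond=None)

# Reference category: intercept b_dummy[0], slope b_dummy[1].
# Other category: intercept b_dummy[0] + b_dummy[2], slope b_dummy[1] + b_dummy[3].
assert np.allclose(b0, b_dummy[:2])
assert np.allclose(b1, [b_dummy[0] + b_dummy[2], b_dummy[1] + b_dummy[3]])
```

Because the assertions hold for any such data set, the residual sums of squares of the two approaches are also identical, which is the fact exploited in the next paragraph.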
In the school cost function example, the implicit cost functions for regular and occupational
schools with a full set of dummy variables (in this case just an intercept dummy and a slope dummy
for N), shown in Figure 6.5, are identical to the cost functions for the subsample regressions in the
Chow test, shown in Figure 6.6. It follows that the improvement in the fit, as measured by the
reduction in the residual sum of squares, when one adds the dummy variables to the basic specification
is identical to the improvement in fit on splitting the sample and running subsample regressions. The
cost, in terms of degrees of freedom, is also the same. In the dummy variable approach you have to
add an intercept dummy and a slope dummy for each variable, so the cost is k if there are k – 1
variables in the model. In the Chow test, the cost is also k because you have to estimate 2k parameters