MULTIPLE REGRESSION ANALYSIS
important variable that you have omitted, and is therefore contributing to u, you will reduce the
population variance of the disturbance term if you add it to the regression equation.
By way of illustration, we will take the earnings function discussed in the previous section, where a
high correlation between ASVABC, the composite cognitive ability score, and ASVAB5, the score on a
numerical computation speed test, gave rise to a problem of multicollinearity. We now add three new
variables that are often found to be determinants of earnings: length of tenure with the current
employer, here measured in weeks; sex of respondent; and whether the respondent was living in an
urban or a rural area. The last two variables are qualitative variables and their treatment will be
explained in Chapter 6. All of these new variables have high t statistics and as a consequence the
estimate of $\sigma_u^2$ falls, from 59.17 to 54.50 (see the calculation of the residual sum of squares divided
by the number of degrees of freedom in the top right quarter of the regression output). However, the
joint contribution of the new variables to the explanatory power of the model is small, despite being
highly significant, and the reduction in the standard errors of the coefficients of S, ASVABC, and
ASVAB5 is negligible. The standard errors might even have increased. The new variables happen to have very low
correlations with S, ASVABC, and ASVAB5. If they had been linearly related to one or more of the
variables already in the equation, their inclusion could have made the problem of multicollinearity
worse. Note how unstable the coefficients are, another sign of multicollinearity.
. reg EARNINGS S ASVABC ASVAB5 TENURE MALE URBAN

  Source |          SS    df          MS               Number of obs =     570
---------+------------------------------               F(  6,   563) =   23.60
   Model |  7715.87322     6  1285.97887               Prob > F      =  0.0000
Residual |  30681.1638   563  54.4958505               R-squared     =  0.2009
---------+------------------------------               Adj R-squared =  0.1924
   Total |  38397.0371   569  67.4816117               Root MSE      =  7.3821

------------------------------------------------------------------------------
EARNINGS |      Coef.  Std. Err.          t   P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       S |   .8137184   .1563975      5.203   0.000       .5065245    1.120912
  ASVABC |   .0442801    .049716      0.891   0.373      -.0533714    .1419317
  ASVAB5 |   .1113769   .0458757      2.428   0.016       .0212685    .2014853
  TENURE |    .287038   .0676471      4.243   0.000       .1541665    .4199095
    MALE |   3.123929     .64685      4.829   0.000       1.853395    4.394463
   URBAN |   2.061867   .7274286      2.834   0.005       .6330618    3.490672
   _cons |  -10.60023   2.195757     -4.828   0.000      -14.91311   -6.287358
------------------------------------------------------------------------------
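As a check on the figures quoted above, the summary statistics in the regression output can be recomputed from the sums of squares. The following is a minimal Python sketch, not part of the original analysis. The variable names are illustrative, and the last step assumes, purely for illustration, that the sample variances of the regressors and their correlations with the other regressors are unchanged when the new variables are added, so that each standard error scales with the square root of the estimate of $\sigma_u^2$.

```python
import math

# Figures copied from the regression output above.
n = 570             # number of observations
k = 6               # explanatory variables, excluding the constant
ess = 7715.87322    # explained (Model) sum of squares
rss = 30681.1638    # residual sum of squares
tss = 38397.0371    # total sum of squares

df_resid = n - k - 1                 # degrees of freedom: 563
s_u2 = rss / df_resid                # estimate of the disturbance variance
root_mse = math.sqrt(s_u2)           # reported as Root MSE
r2 = ess / tss                       # R-squared
adj_r2 = 1 - s_u2 / (tss / (n - 1))  # adjusted R-squared
f_stat = (ess / k) / s_u2            # F statistic for the equation as a whole

# Estimate of the disturbance variance before the three variables were added,
# as quoted in the text.
s_u2_before = 59.17

# Assumption (illustration only): everything else in the standard error
# formula is unchanged, so each standard error scales by this factor.
se_scale = math.sqrt(s_u2 / s_u2_before)

print(f"s_u^2 = {s_u2:.4f}, Root MSE = {root_mse:.4f}")
print(f"R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}, F = {f_stat:.2f}")
print(f"standard errors scale by about {se_scale:.3f}")
```

Under that assumption, a fall in the estimate of $\sigma_u^2$ from 59.17 to 54.50 shrinks each standard error by only about 4 percent, which is consistent with the reduction being described as negligible.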
The next factor to consider is n, the number of observations. If you are working with
cross-section data (individuals, households, enterprises, etc.) and you are undertaking a survey, you could
increase the size of the sample by negotiating a bigger budget. Alternatively, you could make a fixed
budget go further by using a technique known as clustering. You divide the country geographically
into localities. For example, the National Longitudinal Survey of Youth, from which the EAEF data
are drawn, divides the country into counties, independent cities and standard metropolitan statistical
areas. You select a number of localities randomly, perhaps using stratified random sampling to make
sure that metropolitan, other urban and rural areas are properly represented. You then confine the
survey to the localities selected. This reduces the travel time of the fieldworkers, allowing them to
interview a greater number of respondents.
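The clustering procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the sampling design actually used by the National Longitudinal Survey of Youth: the function name `cluster_sample`, the three strata, the nine hypothetical localities, and the numbers selected at each stage are all invented for the example.

```python
import random
from collections import defaultdict


def cluster_sample(localities, per_stratum, per_locality, seed=0):
    """Two-stage sample: pick localities at random within each stratum,
    then pick respondents at random within the chosen localities only."""
    rng = random.Random(seed)

    # Group the localities by stratum so that every stratum is represented.
    by_stratum = defaultdict(list)
    for loc in localities:
        by_stratum[loc["stratum"]].append(loc)

    chosen, respondents = [], []
    for stratum in sorted(by_stratum):
        candidates = by_stratum[stratum]
        # Stage 1: random selection of localities within the stratum.
        for loc in rng.sample(candidates, min(per_stratum, len(candidates))):
            chosen.append(loc)
            # Stage 2: the survey is confined to the selected locality.
            k = min(per_locality, len(loc["residents"]))
            for person in rng.sample(loc["residents"], k):
                respondents.append((loc["name"], person))
    return chosen, respondents


# Hypothetical population: 3 strata, 3 localities each, 50 residents each.
strata = ["metropolitan", "other urban", "rural"]
localities = [
    {
        "name": f"{s}-{i}",
        "stratum": s,
        "residents": [f"{s}-{i}-person{j}" for j in range(50)],
    }
    for s in strata
    for i in range(3)
]

chosen, respondents = cluster_sample(localities, per_stratum=2, per_locality=10)
print(len(chosen), "localities,", len(respondents), "respondents")
```

With these hypothetical settings the fieldworkers visit six localities rather than nine, while each stratum still contributes two localities to the sample.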
If you are working with time series data, you may be able to increase the sample by working with
shorter time intervals for the data, for example quarterly or even monthly data instead of annual data.
This is such an obvious thing to do that most researchers working with time series almost