Dougherty C. Introduction to Econometrics, 3Ed


MULTIPLE REGRESSION ANALYSIS

females

$$\widehat{EARNINGS} = \underset{(2.6315)}{-5.9010} + \underset{(0.1910)}{0.8803}\,S + \underset{(0.0577)}{0.1088}\,ASVABC$$

Using equation (4.30), explain why the standard errors of the coefficients of S and ASVABC are greater for the male subsample than for the female subsample, and why the difference in the standard errors is relatively large for S.

Further data:

                    males    females
$s_u$                8.47       6.23
$n$                   325        245
$r_{S,ASVABC}$       0.61       0.55
Var(S)               5.88       6.26
Var(ASVABC)         96.65      68.70

4.9*

Demonstrate that $\bar{e}$ is equal to 0 in multiple regression analysis. (Note: The proof is a generalization of the proof for the simple regression model, given in Section 2.7.)

4.10

Investigate whether you can extend the determinants of weight model using your EAEF data set, taking WEIGHT94 as the dependent variable, and HEIGHT and other continuous variables in the data set as explanatory variables. Provide an interpretation of the coefficients and perform tests on them.

4.4 Multicollinearity

In the previous section, in the context of a model with two explanatory variables, it was seen that the

higher is the correlation between the explanatory variables, the larger are the population variances of

the distributions of their coefficients, and the greater is the risk of obtaining erratic estimates of the

coefficients. If the correlation causes the regression model to become unsatisfactory in this respect, it

is said to be suffering from multicollinearity.

A high correlation does not necessarily lead to poor estimates. If all the other factors determining

the variances of the regression coefficients are helpful, that is, if the number of observations and the

sample variances of the explanatory variables are large, and the variance of the disturbance term small,

you may well obtain good estimates after all. Multicollinearity therefore must be caused by a combination of a high correlation and one or more of the other factors being unhelpful. And it is a matter of degree, not kind. Any regression will suffer from it to some extent, unless all the explanatory variables are uncorrelated. You only start to talk about it when you think that it is affecting the regression results seriously.
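The way these factors trade off against one another can be made concrete with a short calculation. The Python sketch below assumes the standard expression for the standard error of a slope coefficient in a model with two explanatory variables, s.e.(b) = sqrt[ s_u^2 / (n Var(X)) x 1 / (1 - r^2) ], where r is the correlation between the two explanatory variables and Var(X) is the mean square deviation. The function name and all the numerical inputs are hypothetical and are chosen only for illustration.

```python
import math


def se_two_regressor(s_u, n, var_x, r):
    """Standard error of a slope coefficient in a two-regressor model.

    Assumes s.e.(b) = sqrt( s_u^2 / (n * Var(X)) * 1 / (1 - r^2) ).
    """
    return math.sqrt(s_u ** 2 / (n * var_x) / (1.0 - r ** 2))


# Hypothetical inputs, for illustration only.
base = se_two_regressor(s_u=8.0, n=500, var_x=6.0, r=0.0)    # no correlation
high_r = se_two_regressor(s_u=8.0, n=500, var_x=6.0, r=0.9)  # high correlation
# Same high correlation, but a smaller disturbance variance and a larger sample.
helped = se_two_regressor(s_u=4.0, n=2000, var_x=6.0, r=0.9)

print(round(base, 4), round(high_r, 4), round(helped, 4))
```

With these made-up numbers the high correlation more than doubles the standard error, yet the third case, with the same correlation but more helpful values of the other factors, has a smaller standard error than the uncorrelated baseline.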

It is an especially common problem in time series regressions, that is, where the data consists of a

series of observations on the variables over a number of time periods. If two or more of the

explanatory variables have a strong time trend, they will be highly correlated and this condition may

give rise to multicollinearity.

It should be noted that the presence of multicollinearity does not mean that the model is misspecified. Accordingly, the regression coefficients remain unbiased and the standard errors remain valid. The standard errors will be larger than they would have been in the absence of multicollinearity, warning you that the regression estimates are unreliable.

We will consider first the case of exact multicollinearity where the explanatory variables are perfectly correlated. Suppose that the true relationship is

$$Y = 2 + 3X_2 + X_3 + u. \tag{4.32}$$

Suppose that there is a linear relationship between $X_2$ and $X_3$:

$$X_3 = 2X_2 - 1, \tag{4.33}$$

and suppose that $X_2$ increases by one unit in each observation. $X_3$ will increase by 2 units, and $Y$ by approximately 5 units, for example as shown in Table 4.2.

TABLE 4.2

$X_2$   $X_3$   $Y$            Change in $X_2$   Change in $X_3$   Approximate change in $Y$
10      19      $51 + u_1$     1                 2                 5
11      21      $56 + u_2$     1                 2                 5
12      23      $61 + u_3$     1                 2                 5
13      25      $66 + u_4$     1                 2                 5
14      27      $71 + u_5$     1                 2                 5
15      29      $76 + u_6$     1                 2                 5

Looking at the data, you could come to any of the following conclusions:

1. the correct one, that $Y$ is determined by (4.32)

2. that $X_3$ is irrelevant and $Y$ is determined by the relationship

$$Y = 1 + 5X_2 + u \tag{4.34}$$

3. that $X_2$ is irrelevant and $Y$ is determined by the relationship

$$Y = 3.5 + 2.5X_3 + u \tag{4.35}$$

In fact these are not the only possibilities. Any relationship that is a weighted average of (4.34) and

(4.35) would also fit the data. [(4.32) may be regarded as such a weighted average, being (4.34)

multiplied by 0.6 plus (4.35) multiplied by 0.4.]
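The claim that all of these relationships fit the data equally well can be checked directly. The sketch below takes (4.32) to be Y = 2 + 3X2 + X3 + u and (4.33) to be X3 = 2X2 - 1, uses the six observations of Table 4.2, and ignores the disturbance term. The helper names are illustrative only, and exact fractions are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction

# Data from Table 4.2, ignoring the disturbance term for illustration.
x2 = [10, 11, 12, 13, 14, 15]
x3 = [2 * v - 1 for v in x2]                  # (4.33): X3 = 2*X2 - 1
y = [2 + 3 * a + b for a, b in zip(x2, x3)]   # (4.32) without u

fit_434 = [1 + 5 * a for a in x2]                             # (4.34)
fit_435 = [Fraction(7, 2) + Fraction(5, 2) * b for b in x3]   # (4.35)


def blend(w):
    """Fitted values with weight w on (4.34) and weight 1 - w on (4.35)."""
    return [w * p + (1 - w) * q for p, q in zip(fit_434, fit_435)]


def blended_coefficients(w):
    """Intercept and coefficients of X2 and X3 implied by the weighted average."""
    return (w * 1 + (1 - w) * Fraction(7, 2), w * 5, (1 - w) * Fraction(5, 2))


w = Fraction(3, 5)
print(y)                             # [51, 56, 61, 66, 71, 76]
print(fit_434 == y, fit_435 == y)    # True True
print(blend(w) == y)                 # True
print(blended_coefficients(w))       # intercept 2, coefficients 3 and 1
```

Any other weight between 0 and 1 passed to `blend` reproduces the same fitted values, which is why the data cannot discriminate between the alternatives.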

In such a situation it is impossible for regression analysis, or any other technique for that matter,

to distinguish between these possibilities. You would not even be able to calculate the regression

coefficients because both the numerator and the denominator of the regression coefficients would

collapse to 0. This will be demonstrated with the general two-variable case. Suppose

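$$Y = \beta_1 + \beta_2 X_2 + \beta_3 X_3 + u \tag{4.36}$$

and

$$X_3 = \lambda + \mu X_2. \tag{4.37}$$

Substituting for $X_3$ in (4.11), one obtains

$$
\begin{aligned}
b_2 &= \frac{\operatorname{Cov}(X_2,Y)\operatorname{Var}(X_3) - \operatorname{Cov}(X_3,Y)\operatorname{Cov}(X_2,X_3)}{\operatorname{Var}(X_2)\operatorname{Var}(X_3) - [\operatorname{Cov}(X_2,X_3)]^2} \\[4pt]
&= \frac{\operatorname{Cov}(X_2,Y)\operatorname{Var}(\lambda+\mu X_2) - \operatorname{Cov}([\lambda+\mu X_2],Y)\operatorname{Cov}(X_2,[\lambda+\mu X_2])}{\operatorname{Var}(X_2)\operatorname{Var}(\lambda+\mu X_2) - [\operatorname{Cov}(X_2,[\lambda+\mu X_2])]^2}
\end{aligned}
\tag{4.38}
$$

By virtue of Variance Rule 4, the additive $\lambda$ in the variances can be dropped. A similar rule could be developed for covariances, since an additive $\lambda$ does not affect them either. Hence

$$
\begin{aligned}
b_2 &= \frac{\operatorname{Cov}(X_2,Y)\operatorname{Var}(\mu X_2) - \operatorname{Cov}(\mu X_2,Y)\operatorname{Cov}(X_2,\mu X_2)}{\operatorname{Var}(X_2)\operatorname{Var}(\mu X_2) - [\operatorname{Cov}(X_2,\mu X_2)]^2} \\[4pt]
&= \frac{\mu^2\operatorname{Cov}(X_2,Y)\operatorname{Var}(X_2) - \mu^2\operatorname{Cov}(X_2,Y)\operatorname{Var}(X_2)}{\mu^2\operatorname{Var}(X_2)\operatorname{Var}(X_2) - [\mu\operatorname{Var}(X_2)]^2} = \frac{0}{0}
\end{aligned}
\tag{4.39}
$$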
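The collapse of the numerator and the denominator can also be seen numerically. The sketch below assumes the usual two-regressor expression for the coefficient of X2, with numerator Cov(X2,Y)Var(X3) - Cov(X3,Y)Cov(X2,X3) and denominator Var(X2)Var(X3) - [Cov(X2,X3)]^2. It uses the X values of Table 4.2 with X3 = 2X2 - 1; the Y values are hypothetical (the table values plus made-up disturbances), and exact fractions are used so that the zeros are exact rather than rounding artifacts.

```python
from fractions import Fraction


def mean(v):
    return Fraction(sum(v), len(v))


def cov(a, b):
    """Sample covariance defined as the mean product of deviations."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def var(a):
    return cov(a, a)


x2 = [10, 11, 12, 13, 14, 15]
x3 = [2 * v - 1 for v in x2]      # exact linear relationship between X2 and X3
y = [52, 55, 62, 65, 72, 75]      # hypothetical Y with made-up disturbances

numerator = cov(x2, y) * var(x3) - cov(x3, y) * cov(x2, x3)
denominator = var(x2) * var(x3) - cov(x2, x3) ** 2
print(numerator, denominator)     # 0 0
```

Changing the Y values leaves both quantities at zero, since the result depends only on the exact linear relationship between the explanatory variables.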

It is unusual for there to be an exact relationship among the explanatory variables in a regression.

When this occurs, it is typically because there is a logical error in the specification. An example is

provided by Exercise 4.13. However, it often happens that there is an approximate relationship. Here

is a regression of EARNINGS on S, ASVABC, and ASVAB5. ASVAB5 is the score on a speed test of the

ability to perform very simple arithmetical computations. Like ASVABC, the scores on this test were

scaled so that they had mean 50 and standard deviation 10.

. reg EARNINGS S ASVABC ASVAB5

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(  3,   566) =   27.66
   Model |  4909.11468     3  1636.37156               Prob > F      =  0.0000
Residual |  33487.9224   566  59.1659406               R-squared     =  0.1279
---------+------------------------------               Adj R-squared =  0.1232
   Total |  38397.0371   569  67.4816117               Root MSE      =  7.6919

------------------------------------------------------------------------------
EARNINGS |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       S |   .7115506   .1612235      4.413   0.000       .3948811     1.02822
  ASVABC |   .1104595   .0504223      2.191   0.029       .0114219    .2094972
  ASVAB5 |   .0770794   .0463868      1.662   0.097      -.0140319    .1681908
   _cons |  -5.944977   2.161409     -2.751   0.006      -10.19034   -1.699616
------------------------------------------------------------------------------

. cor ASVABC ASVAB5
(obs=570)

        |   ASVABC   ASVAB5
--------+------------------
  ASVABC|   1.0000
  ASVAB5|   0.6371   1.0000

The regression result indicates that an extra year of schooling increases hourly earnings by $0.71.

An extra point on ASVABC increases hourly earnings by $0.11. An individual with a score one

standard deviation above the mean would therefore tend to earn an extra $1.10 per hour, compared

with an individual at the mean. An extra point on the numerical computation speed test increases

hourly earnings by $0.08.

Does ASVAB5 belong in the earnings function? A t test reveals that its coefficient is just

significantly different from 0 at the 5 percent level, using a one-tailed test. (A one-tailed test is

justified by the fact that it is unlikely that a good score on this test would adversely affect earnings.)

In this regression, the coefficient of ASVABC is significant only at the 5 percent level. In the

regression without ASVAB5, reported in Section 4.1, its t statistic was 3.60, making it significantly

different from 0 at the 0.1 percent level. The reason for the reduction in its t ratio is that it is highly

correlated with ASVAB5. This makes it difficult to pinpoint the individual effects of ASVABC and

ASVAB5. As a consequence the regression estimates tend to be erratic. The high correlation causes

the standard errors to be larger than they would have been if ASVABC and ASVAB5 had been less

highly correlated, warning us that the point estimates are unreliable. In this regression,

multicollinearity is making it difficult to determine whether ASVAB5 is a determinant of earnings. It is

possible that it is not, and that its marginally-significant t statistic has occurred as a matter of chance.
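The t statistics discussed above follow directly from the reported coefficients and standard errors, and the effect of the correlation on the standard errors can be gauged roughly. The sketch below copies the numbers from the regression output; the critical values are the usual large-sample approximations, and the inflation factor 1/sqrt(1 - r^2) is the two-regressor expression, so applying it to a regression that also contains S is only a rough guide and not an exact calculation.

```python
import math

# Coefficients and standard errors copied from the regression output above.
coef = {"S": 0.7115506, "ASVABC": 0.1104595, "ASVAB5": 0.0770794}
se = {"S": 0.1612235, "ASVABC": 0.0504223, "ASVAB5": 0.0463868}

# t statistic = coefficient divided by its standard error.
t_stats = {name: coef[name] / se[name] for name in coef}

# Approximate large-sample critical values of t at the 5 percent level.
CRIT_ONE_TAILED = 1.645
CRIT_TWO_TAILED = 1.960

for name, t in t_stats.items():
    print(name, round(t, 3), t > CRIT_ONE_TAILED, t > CRIT_TWO_TAILED)

# Rough factor by which a correlation of 0.6371 inflates the standard errors,
# relative to uncorrelated regressors (two-regressor approximation).
r = 0.6371
inflation = 1.0 / math.sqrt(1.0 - r ** 2)
print(round(inflation, 2))   # 1.3
```

On this rough reckoning the standard errors of the coefficients of ASVABC and ASVAB5 are about 30 percent larger than they would have been with uncorrelated scores, other things being equal. The coefficient of ASVAB5 passes the one-tailed threshold but not the two-tailed one, which is what is meant by saying that it is only just significant.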

Multicollinearity in Models with More Than Two Explanatory Variables

The foregoing discussion of multicollinearity was restricted to the case where there are two

explanatory variables. In models with a greater number of explanatory variables, multicollinearity

may be caused by an approximate linear relationship among them. It may be difficult to discriminate

between the effects of one variable and those of a linear combination of the remainder. In the model

with two explanatory variables, an approximate linear relationship automatically means a high

correlation, but when there are three or more, this is not necessarily the case. A linear relationship

does not inevitably imply high pairwise correlations between any of the variables. The effects of

multicollinearity are the same as in the case with two explanatory variables, and, as in that case, the

problem may not be serious if the population variance of the disturbance term is small, the number of

observations large, and the variances of the explanatory variables large.
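A constructed example may help to show how a linear relationship can exist without any high pairwise correlation. The sketch below uses a purely hypothetical design: ten variables that each take the values -1 and +1 in every possible combination, so that they are mutually uncorrelated, and an eleventh variable equal to their sum. The eleventh variable is an exact linear combination of the others, yet no pairwise correlation among the eleven exceeds about 0.32.

```python
import itertools
import math

k = 10
# Full factorial design: every combination of -1 and +1 for k variables.
# The k columns have zero mean and are mutually uncorrelated by construction.
rows = list(itertools.product([-1, 1], repeat=k))
cols = [[row[j] for row in rows] for j in range(k)]
total = [sum(row) for row in rows]   # exact linear combination of the k columns


def corr(a, b):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    caa = sum((p - ma) ** 2 for p in a)
    cbb = sum((q - mb) ** 2 for q in b)
    return cab / math.sqrt(caa * cbb)


# Largest correlation between the sum and any single component.
r_max = max(abs(corr(total, c)) for c in cols)
# Largest correlation between any two of the components themselves.
max_pair = max(abs(corr(cols[i], cols[j]))
               for i in range(k) for j in range(i + 1, k))

print(round(r_max, 3))   # 0.316
print(max_pair)          # 0.0
```

In this design the correlation between the sum and each component is 1/sqrt(k), so it can be made as small as desired by increasing k, while the linear relationship remains exact.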

What Can You Do About Multicollinearity?

The various ways of trying to alleviate multicollinearity fall into two categories: direct attempts to

improve the four conditions responsible for the reliability of the regression estimates, and indirect

methods.

First, you may try to reduce $\sigma_u^2$. The disturbance term is the joint effect of all the variables influencing Y that you have not included explicitly in the regression equation. If you can think of an important variable that you have omitted, and that is therefore contributing to u, you will reduce the population variance of the disturbance term if you add it to the regression equation.

By way of illustration, we will take the earnings function discussed in the previous section, where a

high correlation between ASVABC, the composite cognitive ability score, and ASVAB5, the score on a

numerical computation speed test, gave rise to a problem of multicollinearity. We now add three new

variables that are often found to be determinants of earnings: length of tenure with the current

employer, here measured in weeks, sex of respondent, and whether the respondent was living in an

urban or a rural area. The last two variables are qualitative variables and their treatment will be

explained in Chapter 6. All of these new variables have high t statistics and as a consequence the

estimate of $\sigma_u^2$ falls, from 59.17 to 54.50 (see the calculation of the residual sum of squares divided by the number of degrees of freedom in the top left quarter of the regression output). However, the

joint contribution of the new variables to the explanatory power of the model is small, despite being

highly significant, and the reduction in the standard errors of the coefficients of S, ASVABC, and

ASVAB5 is negligible. They might even have increased. The new variables happen to have very low

correlations with S, ASVABC, and ASVAB5. If they had been linearly related to one or more of the

variables already in the equation, their inclusion could have made the problem of multicollinearity

worse. Note how unstable the coefficients are, another sign of multicollinearity.

. reg EARNINGS S ASVABC ASVAB5 TENURE MALE URBAN

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(  6,   563) =   23.60
   Model |  7715.87322     6  1285.97887               Prob > F      =  0.0000
Residual |  30681.1638   563  54.4958505               R-squared     =  0.2009
---------+------------------------------               Adj R-squared =  0.1924
   Total |  38397.0371   569  67.4816117               Root MSE      =  7.3821

------------------------------------------------------------------------------
EARNINGS |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       S |   .8137184   .1563975      5.203   0.000       .5065245    1.120912
  ASVABC |   .0442801    .049716      0.891   0.373      -.0533714    .1419317
  ASVAB5 |   .1113769   .0458757      2.428   0.016       .0212685    .2014853
  TENURE |    .287038   .0676471      4.243   0.000       .1541665    .4199095
    MALE |   3.123929     .64685      4.829   0.000       1.853395    4.394463
   URBAN |   2.061867   .7274286      2.834   0.005       .6330618    3.490672
   _cons |  -10.60023   2.195757     -4.828   0.000      -14.91311   -6.287358
------------------------------------------------------------------------------
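The fall in the estimate of the variance of the disturbance term can be reproduced from the two regression outputs as the residual sum of squares divided by the number of degrees of freedom. The sketch below copies the figures from the outputs; the variable names are illustrative only.

```python
import math

# Residual sums of squares and degrees of freedom from the two outputs.
rss_before, df_before = 33487.9224, 566   # EARNINGS on S, ASVABC, ASVAB5
rss_after, df_after = 30681.1638, 563     # adding TENURE, MALE, URBAN

s2_before = rss_before / df_before
s2_after = rss_after / df_after
print(round(s2_before, 2), round(s2_after, 2))   # 59.17 54.5

# The standard errors depend on the square root of this estimate, so the
# proportional effect on them is smaller than the fall in the variance.
ratio = math.sqrt(s2_after / s2_before)
print(round(ratio, 3))
```

The square root of the ratio is about 0.96, so on this account alone the standard errors would be expected to fall by only about 4 percent, which is in line with the small changes seen in the output.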

The next factor to consider is n, the number of observations. If you are working with cross-section data (individuals, households, enterprises, etc.) and you are undertaking a survey, you could

increase the size of the sample by negotiating a bigger budget. Alternatively, you could make a fixed

budget go further by using a technique known as clustering. You divide the country geographically

into localities. For example, the National Longitudinal Survey of Youth, from which the EAEF data

are drawn, divides the country into counties, independent cities and standard metropolitan statistical

areas. You select a number of localities randomly, perhaps using stratified random sampling to make

sure that metropolitan, other urban and rural areas are properly represented. You then confine the

survey to the localities selected. This reduces the travel time of the fieldworkers, allowing them to

interview a greater number of respondents.

If you are working with time series data, you may be able to increase the sample by working with

shorter time intervals for the data, for example quarterly or even monthly data instead of annual data.

This is such an obvious thing to do that most researchers working with time series almost


automatically use quarterly data, if they are available, instead of annual data, even if there does not

appear to be a problem of multicollinearity, simply to minimize the population variances of the

regression coefficients. There are, however, potential problems. You may introduce, or aggravate,

autocorrelation (see Chapter 13), but this can be neutralized. Also you may introduce, or aggravate,

measurement error bias (see Chapter 9) if the quarterly data are less accurately measured than the

corresponding annual data. This problem is not so easily overcome, but it may be a minor one.

. reg EARNINGS S ASVABC ASVAB5

  Source |       SS       df       MS                  Number of obs =    2868
---------+------------------------------               F(  3,  2864) =  183.45
   Model |  36689.8765     3  12229.9588               Prob > F      =  0.0000
Residual |  190928.139  2864   66.664853               R-squared     =  0.1612
---------+------------------------------               Adj R-squared =  0.1603
   Total |  227618.016  2867  79.3924017               Root MSE      =  8.1649

------------------------------------------------------------------------------
EARNINGS |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       S |   1.002693   .0787447     12.733   0.000       .8482905    1.157095
  ASVABC |   .1448345   .0241135      6.006   0.000        .097553    .1921161
  ASVAB5 |   .0483846   .0218352      2.216   0.027       .0055703     .091199
   _cons |  -9.654593   1.033311     -9.343   0.000       -11.6807   -7.628485
------------------------------------------------------------------------------

The output shows the result of running the regression with all 2,868 observations in the EAEF

data set. Comparing this result with that using Data Set 21, we see that the standard errors are much

smaller, as expected. As a consequence, the t statistics are higher. In the case of ASVABC, this is

partly due to the fact that the point estimate of the coefficient is higher. However, in the case of

ASVAB5, the t statistic is higher despite the fact that the coefficient is smaller.
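As a rough check on the size of the improvement, the standard errors can be expected to shrink approximately in proportion to the inverse of the square root of the number of observations, if the other factors are much the same in the two samples. The sketch below applies that rule of thumb to the standard errors copied from the two outputs; it is an approximation under that assumption, not an exact relationship.

```python
import math

n_small, n_large = 570, 2868

# Standard errors copied from the two regression outputs.
se_small = {"S": 0.1612235, "ASVABC": 0.0504223, "ASVAB5": 0.0463868}
se_large = {"S": 0.0787447, "ASVABC": 0.0241135, "ASVAB5": 0.0218352}

# Rule-of-thumb scaling factor: sqrt(n_small / n_large).
scale = math.sqrt(n_small / n_large)
predicted = {name: se_small[name] * scale for name in se_small}

for name in se_small:
    print(name, round(predicted[name], 4), se_large[name])
```

The actual standard errors in the larger sample are somewhat bigger than the rule of thumb predicts, which is consistent with the larger root mean square error in that regression (8.16 compared with 7.69).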

A third possible way of reducing the problem of multicollinearity might be to increase the

variance of the explanatory variables. This is possible only at the design stage of a survey. For

example, if you were planning a household survey with the aim of investigating how expenditure

patterns vary with income, you should make sure that the sample included relatively rich and relatively

poor households as well as middle-income households by stratifying the sample. (For a discussion of

sampling theory and techniques, see, for example, Moser and Kalton, 1985, or Fowler, 1993.)

The fourth direct method is the most direct of all. If you are still at the design stage of a survey,

you should do your best to obtain a sample where the explanatory variables are less related (more

easily said than done, of course).

Next, indirect methods. If the correlated variables are similar conceptually, it may be reasonable

to combine them into some overall index. That is precisely what has been done with the three

cognitive ASVAB variables. ASVABC has been calculated as a weighted average of ASVAB2

(arithmetic reasoning), ASVAB3 (word knowledge), and ASVAB4 (paragraph comprehension). Here is

a regression of EARNINGS on S and the three components of ASVABC. ASVAB2 has a highly

significant coefficient, but ASVAB3 does not and the coefficient of ASVAB4 has the wrong sign. This

is not surprising, given the high correlations between the ASVAB variables.


. reg EARNINGS S ASVAB2 ASVAB3 ASVAB4

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(  4,   565) =   25.68
   Model |  5906.47726     4  1476.61931               Prob > F      =  0.0000
Residual |  32490.5598   565  57.5054156               R-squared     =  0.1538
---------+------------------------------               Adj R-squared =  0.1478
   Total |  38397.0371   569  67.4816117               Root MSE      =  7.5832

------------------------------------------------------------------------------
EARNINGS |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       S |   .7362439   .1586812      4.640   0.000       .4245668    1.047921
  ASVAB2 |   .2472668   .0472249      5.236   0.000        .154509    .3400246
  ASVAB3 |   .0137422    .058716      0.234   0.815      -.1015861    .1290705
  ASVAB4 |  -.1051868   .0544682     -1.931   0.054      -.2121716     .001798
   _cons |  -4.734303    2.06706     -2.290   0.022      -8.794363   -.6742428
------------------------------------------------------------------------------

. cor ASVAB2 ASVAB3 ASVAB4
(obs=570)

        |   ASVAB2   ASVAB3   ASVAB4
--------+---------------------------
  ASVAB2|   1.0000
  ASVAB3|   0.6916   1.0000
  ASVAB4|   0.6536   0.7628   1.0000

Comparing this regression with the regression with ASVABC, it can be seen that the standard

errors of the coefficients of ASVAB2, ASVAB3, and ASVAB4 are larger than that of ASVABC, as you

would expect. The t statistic of ASVAB2 is larger than that of ASVABC, but that is because its

coefficient is larger.

Another possible solution to the problem of multicollinearity is to drop some of the correlated

variables, if they have insignificant coefficients. If we drop ASVAB3 and ASVAB4, we obtain the

output shown. As expected, the standard error of the coefficient of ASVAB2 is smaller than in the

regression including ASVAB3 and ASVAB4. However, this approach to alleviating the problem of

multicollinearity involves the risk that some of the variables dropped may truly belong in the model

and their omission may cause omitted variable bias (see Chapter 7).

. reg EARNINGS S ASVAB2

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(  2,   567) =   48.81
   Model |  5639.37111     2  2819.68556               Prob > F      =  0.0000
Residual |   32757.666   567  57.7736613               R-squared     =  0.1469
---------+------------------------------               Adj R-squared =  0.1439
   Total |  38397.0371   569  67.4816117               Root MSE      =  7.6009

------------------------------------------------------------------------------
EARNINGS |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
       S |   .6449415   .1519755      4.244   0.000       .3464378    .9434452
  ASVAB2 |   .2019724   .0376567      5.364   0.000       .1280086    .2759361
   _cons |  -5.796398   1.957987     -2.960   0.003      -9.642191   -1.950605
------------------------------------------------------------------------------

A further way of dealing with the problem of multicollinearity is to use extraneous information, if

available, concerning the coefficient of one of the variables.

$$Y = \beta_1 + \beta_2 X + \beta_3 P + u \tag{4.40}$$


For example, suppose that Y in equation (4.40) is the aggregate demand for a category of

consumer expenditure, X is aggregate disposable personal income, and P is a price index for the

category. To fit a model of this type you would use time series data. If X and P possess strong time

trends and are therefore highly correlated, which is often the case with time series variables,

multicollinearity is likely to be a problem. Suppose, however, that you also have cross-section data on

Y and X derived from a separate household survey. These variables will be denoted Y' and X' to

indicate that the data are household data, not aggregate data. Assuming that all the households in the

survey were paying roughly the same price for the commodity, one would fit the simple regression

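$$\hat{Y}' = b_1' + b_2' X'. \tag{4.41}$$

Now substitute $b_2'$ for $\beta_2$ in the time series model,

$$Y = \beta_1 + b_2' X + \beta_3 P + u, \tag{4.42}$$

subtract $b_2' X$ from both sides,

$$Y - b_2' X = \beta_1 + \beta_3 P + u \tag{4.43}$$

and regress $Z = Y - b_2' X$ on price. This is a simple regression, so multicollinearity has been eliminated.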
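The mechanics of the procedure can be sketched in a few lines. Everything in the example below is hypothetical: the time series for income and price are made-up trending series, the "true" parameters are chosen arbitrarily, the disturbance term is omitted so that the result is exact, and the cross-section income coefficient is simply assumed to equal the true value. The function `simple_regression` is an illustrative helper and not part of any package.

```python
def simple_regression(x, y):
    """Return (intercept, slope) from an OLS regression of y on one regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical time series: income X and price P both trend upward.
X = [100 + 5 * t + (t % 3) for t in range(20)]
P = [50 + 2 * t + (t % 2) for t in range(20)]

beta1, beta2, beta3 = 10.0, 0.5, -2.0   # assumed "true" parameters
Y = [beta1 + beta2 * x + beta3 * p for x, p in zip(X, P)]   # disturbance omitted

b2_cross = 0.5   # hypothetical income coefficient from the cross-section data

# Construct Z = Y - b2' X and regress it on price alone, as in (4.43).
Z = [y - b2_cross * x for y, x in zip(Y, X)]
intercept, slope = simple_regression(P, Z)
print(round(intercept, 6), round(slope, 6))   # 10.0 -2.0
```

Because the cross-section coefficient has been set equal to the true income coefficient, the simple regression recovers the intercept and the price coefficient exactly. Setting `b2_cross` to a different value would show how an inaccurate cross-section estimate carries over into the estimate of the price coefficient.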

There are, however, two possible problems with this technique. First, the estimate of $\beta_3$ in (4.43) depends on the accuracy of the estimate of $b_2'$, and this of course is subject to sampling error. Second,

you are assuming that the income coefficient has the same meaning in time series and cross-section

contexts, and this may not be the case. For many commodities the short-run and long-run effects of

changes in income may differ because expenditure patterns are subject to inertia. A change in income

can affect expenditure both directly, by altering the budget constraint, and indirectly, through causing

a change in lifestyle, and the indirect effect is much slower than the direct one. As a first

approximation, it is commonly argued that time series regressions, particularly those using short

sample periods, estimate short-run effects while cross-section regressions estimate long-run ones. For

a discussion of this and related issues, see Kuh and Meyer, 1957.

Last, but by no means least, is the use of a theoretical restriction, which is defined as a

hypothetical relationship among the parameters of a regression model. It will be explained using an

educational attainment model as an example. Suppose that we hypothesize that years of schooling, S,

depends on ASVABC, and the years of schooling of the respondent's mother and father, SM and SF,

respectively:

$$S = \beta_1 + \beta_2 ASVABC + \beta_3 SM + \beta_4 SF + u \tag{4.44}$$

Fitting the model using EAEF Data Set 21, we obtain the following output:


. reg S ASVABC SM SF

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(  3,   566) =  110.83
   Model |  1278.24153     3  426.080508               Prob > F      =  0.0000
Residual |  2176.00584   566  3.84453329               R-squared     =  0.3700
---------+------------------------------               Adj R-squared =  0.3667
   Total |  3454.24737   569  6.07073351               Root MSE      =  1.9607

------------------------------------------------------------------------------
       S |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  ASVABC |   .1295006   .0099544     13.009   0.000       .1099486    .1490527
      SM |    .069403   .0422974      1.641   0.101       -.013676     .152482
      SF |   .1102684   .0311948      3.535   0.000       .0489967    .1715401
   _cons |   4.914654   .5063527      9.706   0.000       3.920094    5.909214
------------------------------------------------------------------------------

The regression coefficients imply that S increases by 0.13 years for every one-point increase in

ASVABC, by 0.07 years for every extra year of schooling of the mother and by 0.11 years for every

extra year of schooling of the father. Mother's education is generally held to be at least as important as

father's education for educational attainment, so the relatively low coefficient of SM is unexpected. It

is also surprising that the coefficient is not significant, even at the 5 percent level, using a one-tailed

test. However, assortative mating leads to a high correlation between SM and SF and the regression

appears to be suffering from multicollinearity.

Suppose that we hypothesize that mother's and father's education are equally important. We can

then impose the restriction $\beta_3 = \beta_4$. This allows us to write the equation as

$$S = \beta_1 + \beta_2 ASVABC + \beta_3 (SM + SF) + u \tag{4.45}$$

Defining SP to be the sum of SM and SF, the equation may be rewritten with ASVABC and SP as

the explanatory variables:

$$S = \beta_1 + \beta_2 ASVABC + \beta_3 SP + u \tag{4.46}$$

. g SP=SM+SF

. reg S ASVABC SP

  Source |       SS       df       MS                  Number of obs =     570
---------+------------------------------               F(  2,   567) =  166.22
   Model |  1276.73764     2  638.368819               Prob > F      =  0.0000
Residual |  2177.50973   567  3.84040517               R-squared     =  0.3696
---------+------------------------------               Adj R-squared =  0.3674
   Total |  3454.24737   569  6.07073351               Root MSE      =  1.9597

------------------------------------------------------------------------------
       S |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  ASVABC |   .1295653   .0099485     13.024   0.000       .1100249    .1491057
      SP |    .093741   .0165688      5.658   0.000       .0611973    .1262847
   _cons |   4.823123   .4844829      9.955   0.000       3.871523    5.774724
------------------------------------------------------------------------------

The estimate of $\beta_3$ is now 0.094. Not surprisingly, this is a compromise between the coefficients

of SM and SF in the previous specification. The standard error of SP is much smaller than those of SM

and SF, indicating that the use of the restriction has led to a gain in efficiency, and as a consequence


the t statistic is very high. Thus the problem of multicollinearity has been eliminated. However, it is

possible that the restriction may not be valid. We should test it. We shall see how to do this in

Chapter 7.
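As a preview, one standard way of testing a single linear restriction is an F test that compares the residual sums of squares of the restricted and unrestricted regressions. The sketch below applies it to the two outputs above. It is offered as an illustration under that assumption and may differ in presentation from the treatment in the chapter referred to; the critical value quoted in the comment is approximate.

```python
# Residual sums of squares copied from the two regression outputs.
rss_unrestricted, df_unrestricted = 2176.00584, 566   # S on ASVABC, SM, SF
rss_restricted = 2177.50973                           # S on ASVABC, SP
n_restrictions = 1

# F = (increase in RSS per restriction) / (unrestricted RSS per degree of freedom)
f_stat = ((rss_restricted - rss_unrestricted) / n_restrictions) / (
    rss_unrestricted / df_unrestricted
)
print(round(f_stat, 2))   # 0.39

# Approximate 5 percent critical value of F with 1 and 566 degrees of freedom.
CRIT_F_5_PERCENT = 3.86
print(f_stat < CRIT_F_5_PERCENT)   # True
```

The statistic is well below the critical value, so on this test the restriction would not be rejected: imposing it increases the residual sum of squares only slightly.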

Exercises

4.11

Using your EAEF data set, regress S on SM, SF, ASVAB2, ASVAB3, and ASVAB4, the three

components of the ASVABC composite score. Compare the coefficients and their standard

errors with those of ASVABC in a regression of S on SM, SF, and ASVABC. Calculate

correlation coefficients for the three ASVAB components.

4.12

Investigate the determinants of family size by regressing SIBLINGS on SM and SF using your

EAEF data set. SM and SF are likely to be highly correlated (find the correlation in your data

set) and the regression may be subject to multicollinearity. Introduce the restriction that the

theoretical coefficients of SM and SF are equal and run the regression a second time replacing

SM and SF by their sum, SP. Evaluate the regression results.

4.13*

A researcher investigating the determinants of the demand for public transport in a certain city

has the following data for 100 residents for the previous calendar year: expenditure on public

transport, E, measured in dollars; number of days worked, W; and number of days not worked,

NW. By definition NW is equal to 365 – W. He attempts to fit the following model

$$E = \beta_1 + \beta_2 W + \beta_3 NW + u$$

Explain why he is unable to fit this equation. (Give both intuitive and technical explanations.)

How might he resolve the problem?

4.14

Years of work experience in the labor force is generally found to be an important determinant of

earnings. There is no direct measure of work experience in the EAEF data set, but potential

work experience, PWE, defined by

PWE = AGE – S – 5

may approximate it. This is the maximum number of years since the completion of full-time

education, assuming that an individual enters first grade at the age of 6. Using your EAEF data

set, first regress EARNINGS on S and PWE, and then run the regression a second time adding

AGE as well. Comment on the regression results.

4.5 Goodness of Fit: $R^2$

As in simple regression analysis, the coefficient of determination, $R^2$, measures the proportion of the variance of Y explained by the regression and is defined equivalently by $\operatorname{Var}(\hat{Y})/\operatorname{Var}(Y)$, by $1 - \operatorname{Var}(e)/\operatorname{Var}(Y)$, or by the square of the correlation coefficient for $Y$ and $\hat{Y}$. It can never decrease, and

generally will increase, if you add another variable to a regression equation, provided that you retain