As an example, consider testing for discrimination in loan approvals. If we can collect
data on, say, individual mortgage applications, then we can define the dummy dependent
variable approved as equal to one if a mortgage application was approved, and zero otherwise.
A systematic difference in approval rates across races is an indication of discrimination.
However, since approval depends on many other factors, including income,
wealth, credit ratings, and a general ability to pay back the loan, we must control for them
if there are systematic differences in these factors across race. A linear probability model
to test for discrimination might look like the following:
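$$\mathit{approved} = \beta_0 + \beta_1\,\mathit{nonwhite} + \beta_2\,\mathit{income} + \beta_3\,\mathit{wealth} + \beta_4\,\mathit{credrate} + \text{other factors}.$$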
Discrimination against minorities is indicated by a rejection of $H_0\colon \beta_1 = 0$ in favor of
$H_1\colon \beta_1 < 0$, because $\beta_1$ is the amount by which the probability of a nonwhite getting an
approval differs from the probability of a white getting an approval, given the same levels of
other variables in the equation. If income, wealth, and so on, are systematically different
across races, then it is important to control for these factors in a multiple regression analysis.
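
To illustrate why the controls matter, the following is a minimal sketch on simulated data, not an analysis of real mortgage records. The sample size, the coefficient values (including an assumed true $\beta_1 = -0.08$), and the helper `ols_robust` are all hypothetical choices made for illustration. Because the error in a linear probability model is heteroskedastic, the sketch uses heteroskedasticity-robust (HC0) standard errors for the one-sided t statistic; that is a choice made here, not something the equation above dictates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical simulated applicants (illustrative numbers, not real mortgage data).
nonwhite = rng.binomial(1, 0.3, size=n).astype(float)
income = 60 - 10 * nonwhite + rng.normal(0, 10, size=n)   # lower on average for nonwhite
wealth = 100 - 20 * nonwhite + rng.normal(0, 20, size=n)  # lower on average for nonwhite
credrate = rng.normal(0, 1, size=n)

# Assumed true linear probability model; beta_1 = -0.08 is an assumed effect.
beta1_true = -0.08
p = 0.1 + beta1_true * nonwhite + 0.005 * income + 0.002 * wealth + 0.05 * credrate
p = np.clip(p, 0.0, 1.0)  # guard against an invalid probability in an extreme draw
approved = rng.binomial(1, p).astype(float)


def ols_robust(y, regressors):
    """OLS with an intercept; returns coefficients and HC0 robust standard errors."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = (X * resid[:, None] ** 2).T @ X
    se = np.sqrt(np.diag(bread @ meat @ bread))
    return beta, se


# Raw gap in approval rates: approved regressed on nonwhite only.
b_raw, se_raw = ols_robust(approved, [nonwhite])
# Gap holding income, wealth, and credit rating fixed.
b_ctl, se_ctl = ols_robust(approved, [nonwhite, income, wealth, credrate])

t_ctl = b_ctl[1] / se_ctl[1]  # t statistic for H0: beta_1 = 0 against H1: beta_1 < 0
print(f"raw gap in approval probability:  {b_raw[1]:.3f}")
print(f"gap with controls (beta_1 hat):   {b_ctl[1]:.3f}  (t = {t_ctl:.2f})")
```

In this simulation the raw gap mixes the assumed race effect with the income and wealth differences, so it overstates the effect. The regression with controls should recover a value near the assumed $\beta_1$.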
Another problem that often arises in policy and program evaluation is that individuals
(or firms or cities) choose whether or not to participate in certain behaviors or programs.
For example, individuals choose to use illegal drugs or drink alcohol. If we want to examine
the effects of such behaviors on unemployment status, earnings, or criminal behavior,
we should be concerned that drug usage might be correlated with other factors that can
affect employment and criminal outcomes. Children eligible for programs such as Head
Start participate based on parental decisions. Since family background plays a role in Head
Start decisions and affects student outcomes, we should control for these factors when
examining the effects of Head Start (see, for example, Currie and Thomas [1995]). Individuals
selected by employers or government agencies to participate in job training programs
can participate or not, and this decision is unlikely to be random (see, for example,
Lynch [1992]). Cities and states choose whether to implement certain gun control laws,
and it is likely that this decision is systematically related to other factors that affect violent
crime (see, for example, Kleck and Patterson [1993]).
The previous paragraph gives examples of what are generally known as self-selection
problems in economics. Literally, the term comes from the fact that individuals self-select
into certain behaviors or programs: participation is not randomly determined. The term is
used generally when a binary indicator of participation might be systematically related to
unobserved factors. Thus, if we write the simple model
$$y = \beta_0 + \beta_1\,\mathit{partic} + u, \tag{7.34}$$
where y is an outcome variable and partic is a binary variable equal to unity if the individual,
firm, or city participates in a behavior or a program or has a certain kind of law, then
we are worried that the average value of u depends on participation:
$E(u \mid \mathit{partic} = 1) \neq E(u \mid \mathit{partic} = 0)$.
As we know, this causes the simple regression estimator of $\beta_1$ to be biased,
and so we will not uncover the true effect of participation. Thus, the self-selection problem
is another way that an explanatory variable (partic in this case) can be endogenous.
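
The bias can be seen in a small simulation, sketched here under stated assumptions: the true values $\beta_0 = 1$ and $\beta_1 = 2$, the unobserved factor `ability`, and the two participation rules are all hypothetical. With a binary regressor, the simple regression slope equals the difference in sample means of y between participants and nonparticipants, so its deviation from $\beta_1$ is exactly the difference in the sample means of u across the two groups.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
beta0, beta1 = 1.0, 2.0  # assumed true parameters in y = beta0 + beta1*partic + u

# Unobserved factor (hypothetical "ability") that is part of the error term u.
ability = rng.normal(size=n)
u = ability + rng.normal(size=n)

# Self-selection: units with higher ability are more likely to participate.
partic_self = (ability + rng.normal(size=n) > 0).astype(float)
# Benchmark: participation assigned at random, unrelated to u.
partic_rand = rng.binomial(1, 0.5, size=n).astype(float)


def diff_in_means(v, d):
    """Mean of v among participants minus mean among nonparticipants."""
    return v[d == 1].mean() - v[d == 0].mean()


def simple_slope(y, d):
    """Simple regression slope of y on the binary regressor d."""
    return np.cov(d, y)[0, 1] / np.var(d, ddof=1)


y_self = beta0 + beta1 * partic_self + u
y_rand = beta0 + beta1 * partic_rand + u

b_self = simple_slope(y_self, partic_self)
b_rand = simple_slope(y_rand, partic_rand)
u_gap_self = diff_in_means(u, partic_self)  # sample analog of E(u|partic=1) - E(u|partic=0)

print(f"true beta_1:                    {beta1:.3f}")
print(f"estimate, random participation: {b_rand:.3f}")
print(f"estimate, self-selection:       {b_self:.3f}")
print(f"gap in mean u, self-selection:  {u_gap_self:.3f}")
```

Under random assignment the estimate should land close to the assumed $\beta_1$. Under self-selection it is off by the gap in the mean of u between the two groups.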
By now, we know that multiple regression analysis can, to some degree, alleviate the
self-selection problem. Factors in the error term in (7.34) that are correlated with